feat: dedupe hash step #68

Open: wants to merge 8 commits into base: main

Conversation

mavaball
I tried my best with the dedupe step:

  • conforms to the step logic
  • adds a settings file
  • leaves the main guard (if __name__ == "__main__") at the bottom of steps.py

Collaborator

sam-hey left a comment

Thank you for getting things started! I've left some initial comments. If anything is unclear, don't hesitate to reach out. Let me know once you've addressed them, and I'll proceed with a second review.

load_dotenv()

class QdrantCompareSettings(Settings):
    QDRANT_URL: str = os.getenv("QDRANT_URL", "https://qdrant.intra.oneai.yo-digital.com")
Collaborator

No, these values need to be loaded from the environment. The settings class does this automatically, as described in the Pydantic documentation, so there is no need to call os.getenv. Default values should not be provided.




if __name__ == "__main__":
Collaborator

Please remove this main guard.

Collaborator

You can debug with:

from wurzel.steps.xxxx import xxxVStep
from wurzel.step_executor import BaseStepExecutor
from pathlib import Path

with BaseStepExecutor() as ex:
    ex(xxxVStep, [], Path("xxxVStep.json"))
    




class QdrantCompareStep(TypedStep[QdrantCompareSettings, None, dict]):
Collaborator

Please add some tests for this step

)

def run(self, inpt=None):
# 1. Daten laden
Collaborator

Please use English for all comments and names.

# 1. Daten laden

last_2_collections = self.list_top_collections(self.settings.QDRANT_URL, headers=self.headers, prefix= self.settings.prefix, top_n=2, verbose=True)
print(last_2_collections)
Collaborator

Please don't use print statements; use logging if needed.
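As a rough illustration of the suggestion, the print call could become a debug-level log message. The function name report_collections is hypothetical and only wraps the call so the sketch is self-contained:

```python
import logging

# Module-level logger, named after the module that emits the message.
log = logging.getLogger(__name__)


def report_collections(collections: list) -> None:
    # Lazy %-style formatting: the string is only built if DEBUG is enabled.
    log.debug("Latest collections: %s", collections)
```

Unlike print, the message can be silenced, redirected, or formatted by whoever configures logging for the pipeline.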

"gpt_analysis": result_text,
"contradiction_found": "contradiction" in result_text.lower()
}
except Exception as e:
Collaborator

Please don't catch all errors; handle only the specific ones you intend to address. This is a general comment that applies throughout the PR.
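A small sketch of the pattern, using JSON parsing as a stand-in for the real call (the function parse_analysis and its payload are hypothetical, not taken from the PR). Only the expected failure is caught; anything else propagates so that genuine bugs stay visible:

```python
import json
import logging
from typing import Optional

log = logging.getLogger(__name__)


def parse_analysis(raw: str) -> Optional[dict]:
    """Parse a JSON payload, handling only the failure we expect."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        # Malformed payloads are an anticipated case: log and move on.
        log.warning("Could not parse analysis payload: %s", err)
        return None
    # Other exceptions (for example TypeError for non-string input)
    # are deliberately not caught.
```

With a bare `except Exception`, a programming error such as passing None would be swallowed and reported as a data problem.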

log.info(f"All results have been saved to {excel_name}.")"""

# 10. Zusammenfassung (als Log-Ausgabe)
log.info(f"Comparison between '{name_small}' and '{name_large}'")
Collaborator

Log a single JSON object containing all the information.
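One possible shape for this, assuming the summary values are already available as plain Python values. The helper log_summary and the key names are hypothetical; only name_small and name_large come from the code under review:

```python
import json
import logging

log = logging.getLogger(__name__)


def log_summary(name_small: str, name_large: str, stats: dict) -> str:
    # Collect everything into one payload instead of many log lines.
    payload = {
        "collection_small": name_small,
        "collection_large": name_large,
        **stats,
    }
    message = json.dumps(payload, sort_keys=True)
    log.info(message)
    return message
```

A single structured record is easier to search and parse downstream than a sequence of free-text lines.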

Collaborator

# Define the extra fields as a dictionary
extra_fields = {'user': 'sdf'}

# Log a message with the extra fields
logger.info("This is an info message.", extra=extra_fields)
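One caveat with the extra approach: the standard library only attaches these keys to the log record, and they show up in the output only if the handler's formatter references them. A minimal sketch, where the logger name and format string are assumptions chosen for illustration:

```python
import io
import logging

# Capture output in memory so the result is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
# The formatter must name the extra key, otherwise it is not printed.
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s user=%(user)s"))

logger = logging.getLogger("extra_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("This is an info message.", extra={"user": "sdf"})
print(stream.getvalue().strip())
```

A JSON formatter that serializes the whole record would make every extra field visible without listing each one in the format string.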

sam-hey changed the title from "dedupe_hash_v0" to "feat: dedupe hash step" on May 29, 2025



def run(self, inpt=None):
Collaborator

Use the Qdrant step as the input to this step.

@@ -45,6 +45,8 @@ dependencies= [
"mdformat==0.7.17",
"spacy==3.7.5",
"tiktoken==0.7.0",
"openai==1.82.1",
Collaborator

Add new optional dependency group:

docs = [
    "mkdocstrings[python]"
]

settings.TLSH_MAX_DIFF = 10
step = QdrantCompareStep()
step.settings = settings
return step

def test_identical_tlsh_analysis():
step = make_step()
Collaborator

Use a pytest fixture here instead of calling make_step() in each test.
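A sketch of how the helper could become a fixture. StubSettings and StubStep are stand-ins invented for this example so it runs without the project installed; in the real test they would be QdrantCompareSettings and QdrantCompareStep. The build_step helper and the test name are hypothetical as well:

```python
from dataclasses import dataclass, field

import pytest


@dataclass
class StubSettings:
    # Stand-in for the real settings class; value taken from the reviewed test.
    TLSH_MAX_DIFF: int = 10


@dataclass
class StubStep:
    # Stand-in for the real step class.
    settings: StubSettings = field(default_factory=StubSettings)


def build_step() -> StubStep:
    # Plain helper so the construction logic stays callable outside pytest.
    return StubStep(settings=StubSettings(TLSH_MAX_DIFF=10))


@pytest.fixture
def step() -> StubStep:
    # pytest injects this into every test that declares a `step` argument.
    return build_step()


def test_step_uses_configured_threshold(step: StubStep) -> None:
    assert step.settings.TLSH_MAX_DIFF == 10
```

Each test then receives a fresh step by naming the fixture as a parameter, and shared setup lives in one place.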

filename="/Users/A1167082/Desktop/your_file.log", # <--- Enter the desired path/filename here
filemode="a", # 'a' for append, 'w' for overwrite on each start
format="%(asctime)s - %(levelname)s - %(message)s",
)
Collaborator

Use the log object here, not the logging module directly.



# step/settings.py
import os
Collaborator

Please add unit tests.


3 participants