-
Notifications
You must be signed in to change notification settings - Fork 5
feat: dedupe hash step #68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for getting things started! I've left some initial comments. If anything is unclear, don't hesitate to reach out. Let me know once you've addressed them, and I'll proceed with a second review.
wurzel/steps/dedupe_hash/qdrant_collections_compare_AICC-5663.py
Outdated
Show resolved
Hide resolved
wurzel/steps/dedupe_hash/settings.py
Outdated
load_dotenv() | ||
|
||
class QdrantCompareSettings(Settings): | ||
QDRANT_URL: str = os.getenv("QDRANT_URL", "https://qdrant.intra.oneai.yo-digital.com") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, they need to be loaded from the environment. This is done automatically by the class, as described in the Pydantic documentation. Default values should not be provided.
wurzel/steps/dedupe_hash/step.py
Outdated
|
||
|
||
|
||
if __name__ == "__main__": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can debug with:
from wurzel.steps.xxxx import xxxVStep
from wurzel.step_executor import BaseStepExecutor
from pathlib import Path
with BaseStepExecutor() as ex:
ex(xxxVStep, [], Path("xxxVStep.json"))
wurzel/steps/dedupe_hash/step.py
Outdated
|
||
|
||
|
||
class QdrantCompareStep(TypedStep[QdrantCompareSettings, None, dict]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some tests for this step
wurzel/steps/dedupe_hash/step.py
Outdated
) | ||
|
||
def run(self, inpt=None): | ||
# 1. Daten laden |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use English for all comments and naming
wurzel/steps/dedupe_hash/step.py
Outdated
# 1. Daten laden | ||
|
||
last_2_collections = self.list_top_collections(self.settings.QDRANT_URL, headers=self.headers, prefix= self.settings.prefix, top_n=2, verbose=True) | ||
print(last_2_collections) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pls dont use prints - use logging if needed
wurzel/steps/dedupe_hash/step.py
Outdated
"gpt_analysis": result_text, | ||
"contradiction_found": "contradiction" in result_text.lower() | ||
} | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't catch all errors; instead, handle only the specific ones you intend to address — this is a general comment
log.info(f"All results have been saved to {excel_name}.")""" | ||
|
||
# 10. Zusammenfassung (als Log-Ausgabe) | ||
log.info(f"Comparison between '{name_small}' and '{name_large}'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log one json obj with all the information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define the extra fields as a dictionary
extra_fields = {'user': 'sdf'}
Log a message with the extra fields
logger.info("This is an info message.", extra=extra_fields)
wurzel/steps/dedupe_hash/step.py
Outdated
|
||
|
||
|
||
def run(self, inpt=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Qdrant Step as input
@@ -45,6 +45,8 @@ dependencies= [ | |||
"mdformat==0.7.17", | |||
"spacy==3.7.5", | |||
"tiktoken==0.7.0", | |||
"openai==1.82.1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add new optional dependency group:
docs = [
"mkdocstrings[python]"
]
settings.TLSH_MAX_DIFF = 10 | ||
step = QdrantCompareStep() | ||
step.settings = settings | ||
return step |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
|
||
def test_identical_tlsh_analysis(): | ||
step = make_step() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use fixture here
filename="/Users/A1167082/Desktop/your_file.log", # <--- Enter the desired path/filename here | ||
filemode="a", # 'a' for append, 'w' for overwrite on each start | ||
format="%(asctime)s - %(levelname)s - %(message)s", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use log.
not logging
log.info(f"All results have been saved to {excel_name}.")""" | ||
|
||
# 10. Zusammenfassung (als Log-Ausgabe) | ||
log.info(f"Comparison between '{name_small}' and '{name_large}'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define the extra fields as a dictionary
extra_fields = {'user': 'sdf'}
Log a message with the extra fields
logger.info("This is an info message.", extra=extra_fields)
wurzel/steps/dedupe_hash/settings.py
Outdated
|
||
|
||
# step/settings.py | ||
import os |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unittests please
I tried my best with the dedupe step: