feat: dedupe hash step #68

mavaball · 2025-05-28T15:15:37Z

I tried my best with the dedupe step:

- conforms to the Step-Logic
- settings file
- left the main guard below in steps.py

sam-hey

Thank you for getting things started! I've left some initial comments. If anything is unclear, don't hesitate to reach out. Let me know once you've addressed them, and I'll proceed with a second review.

wurzel/steps/dedupe_hash/qdrant_collections_compare_AICC-5663.py

sam-hey · 2025-05-29T08:34:06Z

wurzel/steps/dedupe_hash/settings.py

+load_dotenv()
+
+class QdrantCompareSettings(Settings):
+    QDRANT_URL: str = os.getenv("QDRANT_URL", "https://qdrant.intra.oneai.yo-digital.com")


No, they need to be loaded from the environment. This is done automatically by the class, as described in the Pydantic documentation. Default values should not be provided.

sam-hey · 2025-05-29T08:35:16Z

wurzel/steps/dedupe_hash/step.py

+
+
+
+if __name__ == "__main__":


You can debug with:

from wurzel.steps.xxxx import xxxVStep
from wurzel.step_executor import BaseStepExecutor
from pathlib import Path

with BaseStepExecutor() as ex:
    ex(xxxVStep, [], Path("xxxVStep.json"))

sam-hey · 2025-05-29T08:36:40Z

wurzel/steps/dedupe_hash/step.py

+
+
+
+class QdrantCompareStep(TypedStep[QdrantCompareSettings, None, dict]):


Please add some tests for this step

sam-hey · 2025-05-29T08:39:59Z

wurzel/steps/dedupe_hash/step.py

+        )
+
+    def run(self, inpt=None):
+        # 1. Daten laden


Please use English for all comments and naming

sam-hey · 2025-05-29T08:41:39Z

wurzel/steps/dedupe_hash/step.py

+        # 1. Daten laden
+
+        last_2_collections = self.list_top_collections(self.settings.QDRANT_URL, headers=self.headers, prefix= self.settings.prefix, top_n=2, verbose=True)
+        print(last_2_collections)


Please don't use prints; use logging if needed.
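As a rough illustration of the suggestion, the `print` call could be swapped for a module-level logger. The function name `report_collections` is hypothetical and only stands in for the spot in `run` where the collections are printed.

```python
import logging

log = logging.getLogger(__name__)


def report_collections(collections: list[str]) -> None:
    """Log the selected collections instead of printing them."""
    # Lazy %-style arguments are only formatted if the record is emitted.
    log.info("Selected collections: %s", collections)
```

Unlike `print`, the message can be filtered by level and routed by whatever handlers the application configures.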

sam-hey · 2025-05-29T08:43:39Z

wurzel/steps/dedupe_hash/step.py

+                "gpt_analysis": result_text,
+                "contradiction_found": "contradiction" in result_text.lower()
+            }
+        except Exception as e:


Please don't catch all errors; instead, handle only the specific ones you intend to address. This is a general comment.
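One way to read this advice, sketched with the standard library only. The JSON payload shape and the `parse_analysis` helper are assumptions made for illustration; the PR's actual call that produces `result_text` is not shown in this thread.

```python
import json
import logging

log = logging.getLogger(__name__)


def parse_analysis(raw: str) -> dict:
    """Parse a (hypothetical) JSON analysis payload.

    Only the failures we know how to handle are caught; anything else
    propagates so that real bugs are not hidden.
    """
    try:
        payload = json.loads(raw)
        result_text = payload["analysis"]
    except json.JSONDecodeError:
        log.warning("Analysis payload is not valid JSON")
        return {"gpt_analysis": None, "contradiction_found": False}
    except KeyError:
        log.warning("Analysis payload has no 'analysis' field")
        return {"gpt_analysis": None, "contradiction_found": False}
    return {
        "gpt_analysis": result_text,
        "contradiction_found": "contradiction" in result_text.lower(),
    }
```

An unexpected failure, such as a payload that is a list rather than an object, still raises (here a `TypeError`) instead of being swallowed by a blanket `except Exception`.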

sam-hey · 2025-05-29T08:48:32Z

wurzel/steps/dedupe_hash/step.py

+        log.info(f"All results have been saved to {excel_name}.")"""
+
+        # 10. Zusammenfassung (als Log-Ausgabe)
+        log.info(f"Comparison between '{name_small}' and '{name_large}'")


Log one JSON object with all the information:

# Define the extra fields as a dictionary
extra_fields = {'user': 'sdf'}

# Log a message with the extra fields
logger.info("This is an info message.", extra=extra_fields)
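A possible shape for the single structured record, assuming a JSON formatter is attached to the handler. Whether wurzel's own logger already serialises `extra` fields is not stated in this thread, so `JsonFormatter`, `log_comparison`, and the field names are illustrative stand-ins.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object (illustrative formatter)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        # Keys passed via `extra=` become attributes on the record.
        payload.update(getattr(record, "summary", {}))
        return json.dumps(payload)


def log_comparison(logger: logging.Logger, name_small: str, name_large: str) -> None:
    """Emit a single structured record instead of several info lines."""
    summary = {"collection_small": name_small, "collection_large": name_large}
    logger.info("Collection comparison finished", extra={"summary": summary})
```

Nesting the fields under one `summary` key avoids clashes with the built-in `LogRecord` attribute names, which `extra` is not allowed to overwrite.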

sam-hey · 2025-05-30T13:27:17Z

wurzel/steps/dedupe_hash/step.py

+
+
+
+    def run(self, inpt=None):


Use Qdrant Step as input

sam-hey · 2025-06-02T13:57:50Z

pyproject.toml

@@ -45,6 +45,8 @@ dependencies= [
    "mdformat==0.7.17",
    "spacy==3.7.5",
    "tiktoken==0.7.0",
+    "openai==1.82.1",


Add new optional dependency group:

docs = [ "mkdocstrings[python]" ]

sam-hey · 2025-06-02T14:00:02Z

tests/steps/dedupe_hash/test_qdrant_compare.py

+    settings.TLSH_MAX_DIFF = 10
+    step = QdrantCompareStep()
+    step.settings = settings
+    return step


Use @pytest.fixture

https://docs.pytest.org/en/6.2.x/fixture.html#back-to-fixtures
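A small sketch of the fixture pattern being requested. `SimpleNamespace` objects stand in for `QdrantCompareStep` and its settings so the example runs without wurzel installed; `build_step` and the test name are hypothetical.

```python
from types import SimpleNamespace

import pytest


def build_step() -> SimpleNamespace:
    """Stand-in for constructing QdrantCompareStep with test settings."""
    settings = SimpleNamespace(TLSH_MAX_DIFF=10)
    return SimpleNamespace(settings=settings)


@pytest.fixture
def step() -> SimpleNamespace:
    """Provide a freshly configured step to each test that requests it."""
    return build_step()


def test_step_uses_configured_threshold(step):
    # pytest injects the fixture by matching the argument name.
    assert step.settings.TLSH_MAX_DIFF == 10
```

Each test that lists `step` as an argument receives its own instance, so the manual `make_step()` call inside every test body is no longer needed.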

sam-hey · 2025-06-02T14:00:19Z

tests/steps/dedupe_hash/test_qdrant_compare.py

+
+
+def test_identical_tlsh_analysis():
+    step = make_step()


Use fixture here

sam-hey · 2025-06-02T14:03:32Z

wurzel/steps/dedupe_hash/step.py

+    filename="/Users/A1167082/Desktop/your_file.log",  # <--- Enter the desired path/filename here
+    filemode="a",  # 'a' for append, 'w' for overwrite on each start
+    format="%(asctime)s - %(levelname)s - %(message)s",
+)


Use log here, not logging.

sam-hey · 2025-06-02T14:05:19Z

wurzel/steps/dedupe_hash/step.py

+        log.info(f"All results have been saved to {excel_name}.")"""
+
+        # 10. Zusammenfassung (als Log-Ausgabe)
+        log.info(f"Comparison between '{name_small}' and '{name_large}'")


# Define the extra fields as a dictionary
extra_fields = {'user': 'sdf'}

# Log a message with the extra fields
logger.info("This is an info message.", extra=extra_fields)

tweigel-dev · 2025-06-04T05:51:39Z

wurzel/steps/dedupe_hash/settings.py

+
+
+# step/settings.py
+import os


Unit tests, please.

mavaball added 3 commits May 27, 2025 16:52

code for dedupe step added

a57275b

code for dedupe step refined

6d89849

code for dedupe step refined

f8a1273

sam-hey requested changes May 29, 2025

sam-hey assigned mavaball May 29, 2025

sam-hey changed the title ~~dedupe_hash_v0~~ feat: dedupe hash step May 29, 2025

sam-hey reviewed May 30, 2025

mavaball added 2 commits May 30, 2025 15:18

code for dedupe step added

0eaa9d1

corrections

26e4214

sam-hey reviewed May 30, 2025

mavaball added 2 commits May 31, 2025 12:16

after fixing bugs and lintering

f4a11cc

Merge branch 'main' into main

4b05964

sam-hey requested changes Jun 2, 2025

Merge branch 'main' into main

1c13007

tweigel-dev reviewed Jun 4, 2025




