generated from telekom/reuse-template
feat: dedupe hash step #68
Open
mavaball
wants to merge
8
commits into
telekom:main
from
mavaball:main
+611
−0
a57275b
code for dedupe step added
mavaball 6d89849
code for dedupe step refined
mavaball f8a1273
code for dedupe step refined
mavaball 0eaa9d1
code for dedupe step added
mavaball 26e4214
corrections
mavaball f4a11cc
after fixing bugs and lintering
mavaball 4b05964
Merge branch 'main' into main
mavaball 1c13007
Merge branch 'main' into main
sam-hey
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# SPDX-FileCopyrightText: 2025 Deutsche Telekom AG ([email protected]) | ||
# | ||
# SPDX-License-Identifier: Apache-2.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
import pandas as pd | ||
|
||
import sys | ||
import os | ||
sys.path.append(os.path.abspath("/Users/A1167082/Desktop/wurzel")) | ||
|
||
|
||
from wurzel.steps.dedupe_hash.settings import QdrantCompareSettings | ||
from wurzel.steps.dedupe_hash.step import QdrantCompareStep | ||
|
||
#/Users/A1167082/Desktop/wurzel/wurzel/steps/dedupe_hash | ||
|
||
|
||
def make_step(): | ||
    settings = QdrantCompareSettings() | ||
    settings.QDRANT_URL = "http://localhost:6333" | ||
    settings.QDRANT_API_KEY = "dummy" | ||
    settings.OPAI_API_KEY = "dummy" | ||
    settings.AZURE_ENDPOINT = "https://dummy-endpoint" | ||
    settings.GPT_MODEL = "dummy" | ||
    settings.QDRANT_COLLECTION_PREFIX = "test_v" | ||
    settings.FUZZY_THRESHOLD = 85 | ||
    settings.TLSH_MAX_DIFF = 10 | ||
    step = QdrantCompareStep() | ||
    step.settings = settings | ||
    return step | ||
||
|
||
|
||
def test_identical_tlsh_analysis(): | ||
    step = make_step() | ||
Review comment: Use fixture here |
||
    df1 = pd.DataFrame([{"tlsh": "A" * 70}, {"tlsh": "B" * 70}]) | ||
    df2 = pd.DataFrame([{"tlsh": "A" * 70}, {"tlsh": "C" * 70}]) | ||
    identical, count = step._identical_tlsh_analysis(df1, df2, "tlsh") | ||
    assert count == 1 | ||
    assert "A" * 70 in identical | ||
|
||
|
||
def test_fuzzy_tlsh_matches(): | ||
    step = make_step() | ||
    df = pd.DataFrame([{"tlsh": "A" * 70}, {"tlsh": "A" * 70}, {"tlsh": "B" * 70}]) | ||
    matches = step._fuzzy_tlsh_matches(df, "tlsh", 100) | ||
    assert any(isinstance(m, tuple) and len(m) == 3 for m in matches) | ||
|
||
|
||
def test_diff_snippet(): | ||
    step = make_step() | ||
    diff = step._diff_snippet("Hallo Welt", "Hallo Erde") | ||
    assert "Hallo" in diff | ||
|
||
|
||
def test_suspicious_cases_analysis(): | ||
    step = make_step() | ||
    df = pd.DataFrame([{"text": "Hallo Welt", "tlsh": "A" * 70}, {"text": "Hallo Erde", "tlsh": "B" * 70}]) | ||
    matches = [(0, 1, 5)] | ||
    suspicious = step._suspicious_cases_analysis(df, matches, "text") | ||
    assert isinstance(suspicious, list) | ||
    assert suspicious[0]["fuzz_ratio"] < 100 | ||
|
||
|
||
def test_analyze_extra_docs_detail(): | ||
    step = make_step() | ||
    df_base = pd.DataFrame([{"text": "Hallo Welt"}]) | ||
    df_extra = pd.DataFrame([{"text": "Hallo Mars"}]) | ||
    result = step._analyze_extra_docs_detail(df_base, df_extra, "text", 80) | ||
    assert isinstance(result, list) | ||
    assert "is_truly_new" in result[0] | ||
|
||
|
||
def test_extract_gpt_shortform(): | ||
    step = make_step() | ||
    assert step._extract_gpt_shortform({"gpt_analysis": "Contradiction found."}) == "contradiction" | ||
    assert step._extract_gpt_shortform({"gpt_analysis": "Keep both"}) == "both" | ||
    assert step._extract_gpt_shortform({"gpt_analysis": "Remove document 1"}) == "a remove" | ||
    assert step._extract_gpt_shortform({"gpt_analysis": "Remove document 2"}) == "b remove" |
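
The implementation of `_extract_gpt_shortform` is not part of the diff shown above; only its tests are. As a minimal sketch of the contract those four assertions pin down, assuming plain case-insensitive keyword matching, the mapping could look like the following. The standalone function name `extract_shortform` and the `"unknown"` fallback label are hypothetical and do not come from the PR.

```python
def extract_shortform(result: dict) -> str:
    """Hypothetical stand-in for QdrantCompareStep._extract_gpt_shortform.

    Maps a free-text GPT verdict to one of the short labels used in the tests.
    """
    text = result.get("gpt_analysis", "").lower()
    # Order matters only if a verdict mentions several keywords at once;
    # here a contradiction is checked first (an assumption).
    if "contradiction" in text:
        return "contradiction"
    if "keep both" in text:
        return "both"
    if "remove document 1" in text:
        return "a remove"
    if "remove document 2" in text:
        return "b remove"
    return "unknown"  # assumed fallback; the PR's behavior here is not shown


if __name__ == "__main__":
    for verdict in ("Contradiction found.", "Keep both", "Remove document 1"):
        print(verdict, "->", extract_shortform({"gpt_analysis": verdict}))
```

Keeping the mapping in a pure function like this would also let the test run without constructing a step or its settings.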
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# SPDX-FileCopyrightText: 2025 Deutsche Telekom AG ([email protected]) | ||
# | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
|
||
# from .settings import QdrantCompareSettings # as QdrantCompareSettings | ||
# from .step import QdrantCompareStep # as QdrantCompareStep |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
# SPDX-FileCopyrightText: 2025 Deutsche Telekom AG ([email protected]) | ||
# | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
|
||
from dotenv import load_dotenv | ||
from pydantic import Field | ||
|
||
from wurzel.step.settings import Settings # if Settings is a Pydantic base class | ||
|
||
# Load the .env file automatically | ||
load_dotenv() | ||
|
||
|
||
class QdrantCompareSettings(Settings): | ||
    """Configuration settings for comparing two Qdrant collections. | ||
|
||
    This class defines all environment-configurable parameters required for | ||
    analyzing differences, redundancies, and contradictions between two Qdrant | ||
    collections. It supports integration with Azure and OpenAI for advanced | ||
    deduplication and fuzzy matching. All settings can be loaded from environment | ||
    variables or a .env file, making it suitable for flexible deployment and | ||
    secure configuration management. | ||
|
||
    Attributes: | ||
        QDRANT_URL (str): Base URL for Qdrant. | ||
        QDRANT_API_KEY (str): API key for Qdrant access. | ||
        AZURE_ENDPOINT (str): Endpoint for Azure access. | ||
        FUZZY_THRESHOLD (int): Fuzzy match threshold for Qdrant. | ||
        TLSH_MAX_DIFF (int): Maximum TLSH difference for deduplication. | ||
        OPAI_API_KEY (str): OpenAI API key for deduplication. | ||
        GPT_MODEL (str): OpenAI model to use for deduplication. | ||
        QDRANT_COLLECTION_PREFIX (str): Prefix for Qdrant collection names to extract versions. | ||
|
||
    """ | ||
|
||
    QDRANT_URL: str = Field( | ||
        "", | ||
        description="Base URL for Qdrant.", | ||
    ) | ||
    QDRANT_API_KEY: str = Field( | ||
        "", | ||
        description="API key for Qdrant access.", | ||
    ) | ||
|
||
    AZURE_ENDPOINT: str = Field("", description="Endpoint for Azure access.") | ||
|
||
    FUZZY_THRESHOLD: int = Field( | ||
        99, | ||
        description="Fuzzy match threshold for Qdrant.", | ||
    ) | ||
    TLSH_MAX_DIFF: int = Field( | ||
        1, | ||
        description="Maximum TLSH difference for deduplication.", | ||
    ) | ||
    OPAI_API_KEY: str = Field( | ||
        "", | ||
        description="OpenAI API key for deduplication.", | ||
    ) | ||
    GPT_MODEL: str = Field( | ||
        "GPT4-CH", | ||
        description="OpenAI model to use for deduplication.", | ||
    ) | ||
    QDRANT_COLLECTION_PREFIX: str = Field( | ||
        "", | ||
        description="Prefix for Qdrant collection names to extract versions.", | ||
    ) | ||
|
||
    class Config: | ||
        """Environment variable prefix and .env file used to load these settings.""" | ||
|
||
        env_prefix = "QDRANTCOMPARESTEP__" | ||
        env_file = ".env" |
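
The `Config` class above sets `env_prefix = "QDRANTCOMPARESTEP__"`. Assuming the prefix is simply prepended to each field name, which is how pydantic-settings treats `env_prefix`, the lookup can be mimicked with the standard library alone. The `CompareSettingsSketch` dataclass and the `load_from_env` helper below are hypothetical stand-ins, not part of wurzel, and cover only a subset of the fields.

```python
import os
from dataclasses import dataclass, fields

ENV_PREFIX = "QDRANTCOMPARESTEP__"  # taken from the Config class in the diff


@dataclass
class CompareSettingsSketch:
    """Hypothetical subset of QdrantCompareSettings, with the same defaults."""

    QDRANT_URL: str = ""
    FUZZY_THRESHOLD: int = 99
    TLSH_MAX_DIFF: int = 1
    GPT_MODEL: str = "GPT4-CH"


def load_from_env(environ=None) -> CompareSettingsSketch:
    """Build settings from variables named ENV_PREFIX + field name."""
    environ = os.environ if environ is None else environ
    values = {}
    for field in fields(CompareSettingsSketch):
        raw = environ.get(ENV_PREFIX + field.name)
        if raw is not None:
            # Coerce the string to the type of the field's default (str or int).
            values[field.name] = type(field.default)(raw)
    return CompareSettingsSketch(**values)


if __name__ == "__main__":
    env = {"QDRANTCOMPARESTEP__FUZZY_THRESHOLD": "85"}
    print(load_from_env(env))
```

Under this assumption, exporting `QDRANTCOMPARESTEP__FUZZY_THRESHOLD=85` would override the default of 99, while unset variables keep their defaults.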
Add new optional dependency group: