
Commit 9e4c563

Add preprocessing modules for scientific text analysis and dataset management
Signed-off-by: Darkstalker <[email protected]>
1 parent e4540dd commit 9e4c563

15 files changed: +2139 -465 lines changed

.idea/workspace.xml

Lines changed: 732 additions & 54 deletions
Some generated files are not rendered by default.

Tokenization/Main_2.py

Lines changed: 882 additions & 0 deletions
Large diffs are not rendered by default.

Tokenization/Pre-Processing scripts/CFD.py

Lines changed: 0 additions & 208 deletions
This file was deleted.

Tokenization/Pre-Processing scripts/CIF.py

Lines changed: 0 additions & 111 deletions
This file was deleted.

Tokenization/__init__.py

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+# Tokenization/__init__.py
+
+from .entropy_ranker import EntropyRanker
+from .label_tokens import DOMAIN_TAGS, TASK_TAGS, SECTION_TAGS, ROUTING_TAGS, build_tag_string
+from .preprocessing import clean_text, segment_paragraphs, preprocess_sample
+
+__all__ = [
+    "EntropyRanker",
+    "DOMAIN_TAGS",
+    "TASK_TAGS",
+    "SECTION_TAGS",
+    "ROUTING_TAGS",
+    "build_tag_string",
+    "clean_text",
+    "segment_paragraphs",
+    "preprocess_sample",
+]
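
Review note: the `__all__` list above has to stay in sync with the names the three imports actually bind, otherwise `from Tokenization import *` raises `AttributeError`. A minimal sketch of a check for that follows. The helper `missing_exports` and the stub module are hypothetical and are not part of this commit. The stub stands in for the package because the `entropy_ranker`, `label_tokens`, and `preprocessing` modules are not shown in this diff.

```python
import types


def missing_exports(module):
    """Return names listed in module.__all__ that the module does not define."""
    exported = getattr(module, "__all__", [])
    return [name for name in exported if not hasattr(module, name)]


# Hypothetical stand-in for the package (assumption: the real modules are not
# visible here, so placeholder objects are attached instead).
pkg = types.ModuleType("tokenization_stub")
pkg.EntropyRanker = type("EntropyRanker", (), {})
pkg.clean_text = lambda text: text
pkg.__all__ = ["EntropyRanker", "clean_text", "preprocess_sample"]

# "preprocess_sample" is listed in __all__ but never defined on the stub.
print(missing_exports(pkg))  # prints ['preprocess_sample']
```

Against the real package, the same helper could be called as `missing_exports(importlib.import_module("Tokenization"))` in a unit test, assuming the package is importable from the test environment.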
