Commit 17d8221

Add README file with dataset details and licensing information for HF
1 parent 0536d72 commit 17d8221

1 file changed: 34 additions, 0 deletions

README-hf.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# Wikipedia 22-12 DE DPR

For details about this dataset, please see
[telekom/wikipedia-22-12-de-dpr](https://github.com/telekom/wikipedia-22-12-de-dpr)
on GitHub.

## Creator

This dataset was compiled and open-sourced by [Philip May](https://may.la/)
of [Deutsche Telekom](https://www.telekom.de/).

## Licensing

### The Code and Documentation

Copyright (c) 2023-2024 [Philip May](https://may.la/), [Deutsche Telekom AG](https://www.telekom.de/)

Licensed under the **MIT License** (the "License"); you may not use this file except in compliance with the License.
You may obtain a copy of the License by reviewing the file
[LICENSE](https://github.com/telekom/mltb2/blob/main/LICENSE) in the repository.

### The Wikipedia Texts, Questions and Imperative Questions

The Wikipedia texts are licensed under [CC BY-SA 4.0 Deed](https://creativecommons.org/licenses/by-sa/4.0/deed)
by the corresponding authors of the [German Wikipedia](https://de.wikipedia.org/). The questions and
imperative questions are copyrighted by [Philip May](https://may.la/), [Deutsche Telekom AG](https://www.telekom.de/)
and licensed under [CC BY-SA 4.0 Deed](https://creativecommons.org/licenses/by-sa/4.0/deed).

Indication of changes:

- the data source is the [Cohere/wikipedia-22-12-de-embeddings](https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings) dataset on Hugging Face Hub
- we took `wiki_id`, `title` and `text`
- we did some normalization and filtering
- we merged the texts to an appropriate token count
- details can be found in the respective notebooks
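
The merging step listed above could look roughly like the following. This is a minimal sketch, not the code from the notebooks: the function names `count_tokens` and `merge_passages`, the whitespace-based token count, the grouping of passages by `wiki_id`, and the `max_tokens` budget are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def count_tokens(text):
    # Assumption: whitespace splitting stands in for the real tokenizer.
    return len(text.split())


def merge_passages(records, max_tokens):
    """Merge adjacent passages of the same article up to a token budget.

    `records` is an iterable of dicts with the keys `wiki_id`, `title`
    and `text`. Passages of one article are assumed to be adjacent.
    """
    merged = []
    for wiki_id, group in groupby(records, key=itemgetter("wiki_id")):
        title, buffer, size = None, [], 0
        for rec in group:
            title = rec["title"]
            n = count_tokens(rec["text"])
            # Flush the buffer before adding a passage that would exceed the budget.
            if buffer and size + n > max_tokens:
                merged.append(
                    {"wiki_id": wiki_id, "title": title, "text": " ".join(buffer)}
                )
                buffer, size = [], 0
            buffer.append(rec["text"])
            size += n
        # Emit whatever is left for this article.
        if buffer:
            merged.append(
                {"wiki_id": wiki_id, "title": title, "text": " ".join(buffer)}
            )
    return merged
```

A single passage longer than the budget is kept as its own entry rather than split; whether the actual processing splits such passages is not stated here.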
