-  --lang [LANG]         filter languages. defaults to no filtering. the lang
-                        codes must be comma separated.
-  --output [OUTPUT]     the name of the folder where the twitter data will be
-                        saved.
-  --no-rt               dont't save retweets.
-  --only-text           keep only the text
+  -h, --help            show this help message and exit
+  --lang [LANG]         filter languages. defaults to no filtering. the lang codes must be comma separated.
+  --storage [{text,db}]
+                        the type of storage.
+                        - Set to "text" for saving the tweets in text files.
+                        - Set to "db" for saving the tweets in a MongoDB database.
+  --omit-rt             omit retweets
+  --only-text           keep only the text.

 ```
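
The new option set shown in the help text above could be declared roughly as follows. This is a minimal sketch using Python's standard `argparse` module, not the project's actual `twsd/main.py`. The `build_parser` and `parse_langs` names are hypothetical, and the `text` default for `--storage` is an assumption that the README does not state.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options listed in the help text."""
    parser = argparse.ArgumentParser(prog="twsd")
    parser.add_argument(
        "--lang",
        nargs="?",
        default=None,
        help="filter languages. defaults to no filtering. "
             "the lang codes must be comma separated.",
    )
    parser.add_argument(
        "--storage",
        nargs="?",
        choices=["text", "db"],
        default="text",  # assumed default; not stated in the README
        help="the type of storage.",
    )
    parser.add_argument("--omit-rt", action="store_true", help="omit retweets")
    parser.add_argument("--only-text", action="store_true", help="keep only the text.")
    return parser


def parse_langs(value):
    """Split a comma separated list of language codes; None means no filtering."""
    if not value:
        return None
    return [code.strip() for code in value.split(",") if code.strip()]


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["--lang=el,en", "--storage", "db", "--omit-rt"])
langs = parse_langs(args.lang)
print(langs, args.storage, args.omit_rt, args.only_text)
```

Declaring `--storage` with `choices` lets `argparse` reject any value other than `text` or `db` before the download starts.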
-
+####Storage
+You have two options for saving the twitter data, (1) on disk in text files, or (2) in a database (MongoDB).
+```
+$ python twsd/main.py --storage db
+```

 ####Languages
 Select the language or languages of the tweets that you want to save, using the [corresponding language codes](https://dev.twitter.com/web/overview/languages).
@@ -53,43 +60,31 @@ The values have to be comma separated.

 For example, to save only Greek and English tweets:
 ```
->python twsd.py --lang=el,en
-```
-
-####Output folder
-You can set the name of the output folder:
-```
->python twsd.py --output=myfolder
+$ python twsd/main.py --lang=el,en
 ```


 ####Keep Retweets or not
 You can decide whether to save retweets or not. A reason to decide not to, is that a tweet may be retweeted many times and this will skew the statistics of the dataset.

-The default behavior is to save the retweets. To not save them just add the `--no-rt` parameter:
+The default behavior is to save the retweets. To not save them just add the `--omit-rt` parameter:
 ```
->python twsd.py --no-rt
+$ python twsd/main.py --omit-rt
 ```
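
To illustrate what the `--lang` and `--omit-rt` filters amount to, here is a small sketch. It assumes tweets arrive as dictionaries in the Twitter API v1.1 JSON shape, where the language code is stored in `lang` and a retweet carries a `retweeted_status` object. The `keep_tweet` helper is hypothetical and may differ from how the script actually filters.

```python
def keep_tweet(tweet, langs=None, omit_rt=False):
    """Return True if the tweet passes the language and retweet filters."""
    # In the v1.1 tweet object, retweets embed the original tweet here.
    if omit_rt and "retweeted_status" in tweet:
        return False
    # An empty or missing language list means no language filtering.
    if langs and tweet.get("lang") not in langs:
        return False
    return True


# Made-up sample tweets, reduced to the fields the filter looks at.
tweets = [
    {"id_str": "1", "lang": "en", "text": "hello"},
    {"id_str": "2", "lang": "en", "text": "RT @a: hello",
     "retweeted_status": {"id_str": "1"}},
    {"id_str": "3", "lang": "fr", "text": "bonjour"},
]

kept = [t["id_str"] for t in tweets if keep_tweet(t, langs=["el", "en"], omit_rt=True)]
print(kept)
```

Only the first sample tweet survives: the second is a retweet and the third is in a language outside the requested list.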

 ####Only Text
-If what you are interested in is only the message of the tweet,
-then you have the option to save just that with the `--only-text` parameter:
+If what you are interested in is only the message of the tweet, then you have the option to save just that with the `--only-text` parameter:
 ```
->python twsd.py --only-text
+$ python twsd/main.py --only-text
 ```
-This way you save space (the biggest part of the tweet object is metadata,
-and the text itself is only a small percentage of it), and the unnecessary json parsing (saving time
-during the processing of the dataset).
-
-In this case, each row in the file contains the `tweet_id` with the `text` (tab separated).
-The `tweet_id` will be useful for deduplication, if you want to merge datasets.
+This way you save space (the biggest part of the tweet object is metadata, and the text itself is only a small percentage of it), and the unnecessary json parsing (saving time during the processing of the dataset). Each row in the file contains the `tweet_id` with the `text` (tab separated). The `tweet_id` will be useful for deduplication, if you want to merge datasets.
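
The deduplication mentioned above can be sketched as follows. The example assumes one `tweet_id<TAB>text` row per line and that the text itself contains no raw newline, which the README does not confirm. The `merge_only_text` helper and the sample rows are hypothetical.

```python
import io


def merge_only_text(files):
    """Merge `tweet_id<TAB>text` rows, keeping the first row seen for each id."""
    seen = set()
    merged = []
    for handle in files:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the first tab only, so tabs inside the text are preserved.
            tweet_id, _, text = line.partition("\t")
            if tweet_id in seen:
                continue
            seen.add(tweet_id)
            merged.append((tweet_id, text))
    return merged


# In-memory stand-ins for two files saved with --only-text.
first = io.StringIO("1\thello\n2\tworld\n")
second = io.StringIO("2\tworld\n3\tkalimera\n")

merged = merge_only_text([first, second])
for tweet_id, text in merged:
    print(tweet_id, text, sep="\t")
```

Keeping the ids in a set makes each duplicate check constant time, at the cost of holding every id in memory.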

 ---
 ##Execution
-Run the service by executing the `twsd.py` script. The service prints some useful information about it's progress.
+Run the service by executing the `twsd/main.py` script. The service prints some useful information about it's progress.
 Here is an example where we start downloading only the text of English tweets without keeping the retweets.