Commit aeee44e

add database (MongoDB) storage support
1 parent 785e037 commit aeee44e

13 files changed (+336, -249 lines)

.flake8

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+[flake8]
+max-line-length = 120
+
+exclude =
+    .tox,
+    __pycache__,
+    build,
+    dist
+
+ignore =
+    # F401 imported but unused
+    # F401,
+    # E501 line too long
+    # E501,
+    # E303 too many blank lines
+    # E303,
+    # E731 do not assign a lambda expression, use a def
+    # E731,
+    # F812: list comprehension redefines ...
+    # F812,
+    # E402 module level import not at top of file
+    E402,
+    # W292 no newline at end of file
+    # W292,
+    # E999 SyntaxError: invalid syntax
+    # E999,
+    # F821 undefined name
+    # F821,

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+# Created by .ignore support plugin (hsz.mobi)
+/twsd/settings.txt

FileManager.py

Lines changed: 0 additions & 69 deletions
This file was deleted.

Listener.py

Lines changed: 0 additions & 58 deletions
This file was deleted.

README.md

Lines changed: 26 additions & 31 deletions
@@ -29,22 +29,29 @@ access_token_secret=ENTER_YOUR_ACCESS_TOKEN_SECRET
 ### Parameters - command-line arguments
 You can decide what data you want to save, by setting the following parameters:
 
-```
->python twsd.py -h
-usage: twsd.py [-h] [--lang [LANG]] [--output [OUTPUT]] [--no-rt]
+```shell
+$ python twsd/main.py -h
+usage: main.py [-h] [--lang [LANG]] [--storage [{text,db}]] [--omit-rt]
                [--only-text]
 
+twitter stream downloader
+
 optional arguments:
-  -h, --help         show this help message and exit
-  --lang [LANG]      filter languages. defaults to no filtering. the lang
-                     codes must be comma separated.
-  --output [OUTPUT]  the name of the folder where the twitter data will be
-                     saved.
-  --no-rt            dont't save retweets.
-  --only-text        keep only the text
+  -h, --help            show this help message and exit
+  --lang [LANG]         filter languages. defaults to no filtering. the lang codes must be comma separated.
+  --storage [{text,db}]
+                        the type of storage.
+                        - Set to "text" for saving the tweets in text files.
+                        - Set to "db" for saving the tweets in a MongoDB database.
+  --omit-rt             omit retweets
+  --only-text           keep only the text.
 
 ```
-
+####Storage
+You have two options for saving the twitter data, (1) on disk in text files, or (2) in a database (MongoDB).
+```
+$ python twsd/main.py --storage db
+```
 
 ####Languages
 Select the language or languages of the tweets that you want to save, using the [corresponding language codes](https://dev.twitter.com/web/overview/languages).
@@ -53,43 +60,31 @@ The values have to be comma separated.
 
 For example, to save only Greek and English tweets:
 ```
->python twsd.py --lang=el,en
-```
-
-####Output folder
-You can set the name of the output folder:
-```
->python twsd.py --output=myfolder
+$ python twsd/main.py --lang=el,en
 ```
 
 
 ####Keep Retweets or not
 You can decide whether to save retweets or not. A reason to decide not to, is that a tweet may be retweeted many times and this will skew the statistics of the dataset.
 
-The default behavior is to save the retweets. To not save them just add the `--no-rt` parameter:
+The default behavior is to save the retweets. To not save them just add the `--omit-rt` parameter:
 ```
->python twsd.py --no-rt
+$ python twsd/main.py --omit-rt
 ```
 
 ####Only Text
-If what you are interested in is only the message of the tweet,
-then you have the option to save just that with the `--only-text` parameter:
+If what you are interested in is only the message of the tweet, then you have the option to save just that with the `--only-text` parameter:
 ```
->python twsd.py --only-text
+$ python twsd/main.py --only-text
 ```
-This way you save space (the biggest part of the tweet object is metadata,
-and the text itself is only a small percentage of it), and the unnecessary json parsing (saving time
-during the processing of the dataset).
-
-In this case, each row in the file contains the `tweet_id` with the `text` (tab separated).
-The `tweet_id` will be useful for deduplication, if you want to merge datasets.
+This way you save space (the biggest part of the tweet object is metadata, and the text itself is only a small percentage of it), and the unnecessary json parsing (saving time during the processing of the dataset). Each row in the file contains the `tweet_id` with the `text` (tab separated). The `tweet_id` will be useful for deduplication, if you want to merge datasets.
 
 ---
 ##Execution
-Run the service by executing the `twsd.py` script. The service prints some useful information about it's progress.
+Run the service by executing the `twsd/main.py` script. The service prints some useful information about its progress.
 Here is an example where we start downloading only the text of English tweets without keeping the retweets.
 ```
->python twsd.py --lang=en --no-rt --only-text
+$ python twsd/main.py --lang=en --omit-rt --only-text
 Downloading...
 Total: 104 Rate: 19.54 tweets/sec time: 0:00:11
 ```
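
Note on the README hunk above: `twsd/main.py` itself is not shown in this view, so the following is only a minimal sketch of an `argparse` parser that would print help text like the one documented in the README. The function names `build_parser` and `parse_languages`, and the choice of `text` as the default storage, are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser whose options mirror the README help text (sketch)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="twitter stream downloader",
        # Keep the manual line breaks in the --storage help text.
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument(
        "--lang", nargs="?", default=None,
        help="filter languages. defaults to no filtering. "
             "the lang codes must be comma separated.")
    parser.add_argument(
        "--storage", nargs="?", choices=["text", "db"],
        default="text", const="text",  # assumption: text files are the default
        help='the type of storage.\n'
             '- Set to "text" for saving the tweets in text files.\n'
             '- Set to "db" for saving the tweets in a MongoDB database.')
    parser.add_argument("--omit-rt", action="store_true", help="omit retweets")
    parser.add_argument("--only-text", action="store_true",
                        help="keep only the text.")
    return parser


def parse_languages(value):
    """Split a comma-separated --lang value into a list of language codes."""
    if not value:
        return []  # no filtering
    return [code.strip() for code in value.split(",") if code.strip()]


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(parse_languages(args.lang), args.storage, args.omit_rt, args.only_text)
```

With this sketch, `--lang=el,en --omit-rt --only-text` parses to the language list `["el", "en"]` with both flags set, and `--storage db` selects the database backend.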

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
+pymongo==3.5.1
 tweepy==3.5.0
 ujson==1.35
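
Note on the new `pymongo` pin: the storage module added by this commit is not shown in this view. As a hedged sketch of what `--storage db` might do, the code below keys each document by the tweet id, so that saving the same tweet twice does not create a duplicate (the README mentions `tweet_id` as useful for deduplication). `MongoStorage`, `make_document`, `is_retweet`, `connect`, the URI, and the database and collection names are all hypothetical. `MongoClient` and `Collection.replace_one(..., upsert=True)` are real pymongo calls.

```python
def is_retweet(tweet):
    """Retweets in the Twitter v1.1 payload carry a 'retweeted_status' object."""
    return "retweeted_status" in tweet


def make_document(tweet, only_text=False):
    """Map a tweet dict to a MongoDB document keyed by the tweet id."""
    if only_text:
        return {"_id": tweet["id"], "text": tweet["text"]}
    document = dict(tweet)
    document["_id"] = tweet["id"]
    return document


class MongoStorage:
    """Hypothetical storage backend; not the class from this commit."""

    def __init__(self, collection, omit_rt=False, only_text=False):
        self.collection = collection
        self.omit_rt = omit_rt
        self.only_text = only_text

    def save(self, tweet):
        """Store one tweet; return False when it is skipped as a retweet."""
        if self.omit_rt and is_retweet(tweet):
            return False
        document = make_document(tweet, self.only_text)
        # Upsert on _id so a repeated tweet overwrites itself.
        self.collection.replace_one({"_id": document["_id"]}, document, upsert=True)
        return True


def connect(uri="mongodb://localhost:27017", db_name="twsd", collection_name="tweets"):
    """Open a collection; every default here is an assumption."""
    from pymongo import MongoClient  # imported lazily, needs a running server
    return MongoClient(uri)[db_name][collection_name]
```

A caller would pass `connect()` to `MongoStorage` and call `save` for each tweet received from the stream.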

settings.txt

Lines changed: 0 additions & 4 deletions
This file was deleted.

twsd.py

Lines changed: 0 additions & 85 deletions
This file was deleted.

EventLogger.py renamed to twsd/EventLogger.py

Lines changed: 5 additions & 2 deletions
@@ -53,5 +53,8 @@ def get_total_events(self):
         return self.human_format_number(self.total)
 
     def print_status(self):
-        return "Total: %5s \t Rate: %5.2f %s/sec \t time: %s" % (
-            self.total, self.get_rate(), self.event_name, self.get_total_time())
+        message = "Total: %5s \t Rate: %5.2f %s/sec \t time: %s"
+        return message % (self.total,
+                          self.get_rate(),
+                          self.event_name,
+                          self.get_total_time())
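
Note on the `print_status` change: the refactor only splits the format string from its arguments, so the output is unchanged. The standalone sketch below uses the same format string to show what a status line looks like. `format_status` is a hypothetical helper, and rendering the elapsed time with `datetime.timedelta` is an assumption, since `get_rate()` and `get_total_time()` are outside this hunk.

```python
from datetime import timedelta


def format_status(total, rate, event_name, elapsed_seconds):
    """Render a status line with the format string used by print_status."""
    message = "Total: %5s \t Rate: %5.2f %s/sec \t time: %s"
    # Assumption: elapsed time is shown as H:MM:SS, as in the README example.
    elapsed = timedelta(seconds=int(elapsed_seconds))
    return message % (total, rate, event_name, elapsed)


if __name__ == "__main__":
    print(format_status(104, 19.54, "tweets", 11))
```

The `%5s` and `%5.2f` fields pad the counter and the rate to a width of five characters, which keeps the columns aligned while the status line is refreshed.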
