Add README

brannondorsey · brannondorsey · commit 6161da902fd8 · 2017-12-19T19:34:19.000-06:00
diff --git a/.gitignore b/.gitignore
@@ -1,4 +1,5 @@
 experiments/
+venv/
 data/*.txt
 *.pyc
 *.txt
diff --git a/README.md b/README.md
@@ -2,3 +2,48 @@
 
 This repository contains code for the [_PassGAN: A Deep Learning Approach for Password Guessing_](https://arxiv.org/abs/1709.00440) paper. 
 
+The PassGAN model is taken from [_Improved Training of Wasserstein GANs_](https://arxiv.org/abs/1704.00028), and it is assumed that the authors of PassGAN used the [improved_wgan_training](https://github.com/igul222/improved_wgan_training) TensorFlow implementation in their work. For this reason, I have modified that reference implementation in this repository to make it easy to train (`train.py`) and sample from (`sample.py`). This repo contributes:
+
+- A command-line interface
+- A pretrained PassGAN model trained on the RockYou dataset
+
+## Getting Started
+
+```bash
+# requires CUDA to be pre-installed
+pip install -r requirements.txt
+```
+
+### Generating password samples
+
+Use the pretrained model to generate 1,000,000 passwords, saving them to `gen_passwords.txt`.
+
+```bash
+python sample.py \
+	--input-dir pretrained \
+	--checkpoint pretrained/checkpoints/195000.ckpt \
+	--output gen_passwords.txt \
+	--batch-size 1024 \
+	--num-samples 1000000
+```
+
+### Training your own models
+
+Training a model on a large dataset (100MB+) can take several hours on a GTX 1080.
+
+```bash
+# download the rockyou training data
+# contains 80% of the full rockyou passwords (with repeats)
+# that are 10 characters or less
+curl -L -o data/train.txt https://github.com/brannondorsey/PassGAN/releases/download/data/rockyou-train.txt
+
+# train for 200000 iterations, saving checkpoints every 5000
+# uses the default hyperparameters from the paper
+python train.py --output-dir output --training-data data/train.txt
+```
+
+You are encouraged to train using your own password leaks and datasets. Some great places to find those include:
+
+- [LinkedIn leak](https://hashes.org/download.php?hashlistId=68&type=hfound) (2.9GB, direct download)
+- [Exploit.in torrent](https://thepiratebay.org/torrent/16016494/exploit.in) (10GB+, 800 million accounts. Infamous!)
+- [Hashes.org](https://hashes.org/leaks.php): a shared password recovery site.
diff --git a/data/.gitkeep b/data/.gitkeep
diff --git a/sample.py b/sample.py
@@ -113,7 +113,7 @@ def save(samples):
             save(samples)
             samples = [] # flush
 
-            print('wrote {} samples to {} in {:.2f} seconds. {} total.'.format(1000 * args.batch_size, 'samples.txt', time.time() - then, i * args.batch_size))
+            print('wrote {} samples to {} in {:.2f} seconds. {} total.'.format(1000 * args.batch_size, args.output, time.time() - then, i * args.batch_size))
             then = time.time()
     
     save(samples)
diff --git a/train.py b/train.py
@@ -20,7 +20,7 @@ def parse_args():
     parser.add_argument('--training-data', '-i',
                         default='data/train.txt',
                         dest='training_data',
-                        help='Path to training data file (one password per line) (default: data/train.py)')
+                        help='Path to training data file (one password per line) (default: data/train.txt)')
 
     parser.add_argument('--output-dir', '-o',
                         required=True,
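
The README added in this commit describes the training file as one password per line, holding 80% of the RockYou passwords that are 10 characters or less. The preprocessing that produced `rockyou-train.txt` is not part of this diff, so the following is only a minimal sketch of how a comparable file might be prepared from your own leak. The function names, the random shuffle, and the fixed seed are assumptions for illustration, not the repository's actual method.

```python
import random


def read_passwords(path):
    """Read one password per line from a leak file (hypothetical helper)."""
    # Leak files often contain bytes that are not valid UTF-8;
    # replace them instead of raising an exception.
    with open(path, encoding='utf-8', errors='replace') as f:
        return [line.rstrip('\r\n') for line in f]


def split_passwords(passwords, max_length=10, train_fraction=0.8, seed=0):
    """Filter by length, then split into (train, held_out) lists.

    Repeats are kept, matching the "with repeats" note in the README.
    The shuffle-based split is an assumption, not the documented procedure.
    """
    # Drop empty lines and anything longer than max_length characters.
    kept = [p for p in passwords if 0 < len(p) <= max_length]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(kept)
    cut = int(len(kept) * train_fraction)
    return kept[:cut], kept[cut:]


def write_lines(path, lines):
    """Write one password per line, the format train.py expects per its help text."""
    with open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(line + '\n')
```

Assuming a hypothetical leak file `my_leak.txt`, calling `train, held_out = split_passwords(read_passwords('my_leak.txt'))` followed by `write_lines('data/train.txt', train)` would produce a file at the default `--training-data` path.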
