A Python tool for automatically scraping and extracting logo images from websites.
- Extracts logo images from websites using various detection methods
- Processes multiple websites concurrently to improve efficiency
- Uses intelligent heuristics to identify the most likely logo candidate
- Provides comprehensive logging and progress tracking
- Uses randomized delays and user agents to reduce the chance of being rate limited
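The concurrency and rate-limit features above could be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the names `USER_AGENTS`, `jittered_delay`, `process_site`, and `process_all` are hypothetical, the 0.5x to 1.5x jitter range is an assumption, and the `fetch` callable stands in for whatever HTTP client the script uses.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical pool of user agent strings; the real list is not given in this README.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]


def random_headers():
    """Pick a random user agent for a single request."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def jittered_delay(base_delay):
    """Return a delay between 0.5x and 1.5x the base delay (assumed range)."""
    return base_delay * random.uniform(0.5, 1.5)


def process_site(url, base_delay, fetch):
    """Wait a randomized interval, then call the supplied fetch function."""
    time.sleep(jittered_delay(base_delay))
    return url, fetch(url, random_headers())


def process_all(urls, fetch, workers=5, base_delay=1.0):
    """Process sites on a thread pool and collect the results keyed by URL."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_site, u, base_delay, fetch) for u in urls]
        for future in as_completed(futures):
            url, outcome = future.result()
            results[url] = outcome
    return results
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP library and makes it easy to try with a stub function.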
- Clone this repository
- Install the required dependencies:
pip install -r requirements.txt
The script requires an input CSV file with a column named website
containing the URLs to scrape.
python logo_scraper.py --input websites.csv --output logos.csv
- --input, -i: Input CSV file with websites (default: websites.csv)
- --output, -o: Output CSV file for logo URLs (default: logos.csv)
- --workers, -w: Number of worker threads (default: 5)
- --delay, -d: Delay between requests in seconds (default: 1.0)
- --timeout, -t: Request timeout in seconds (default: 10)
python logo_scraper.py --input companies.csv --output company_logos.csv --workers 10 --delay 2 --timeout 15
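For readers who want to see how these options might be declared, here is a minimal `argparse` sketch using the flags and defaults listed above. The function name `build_parser`, the help strings, and the argument types (integer workers and timeout, float delay) are assumptions; the actual script may define them differently.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Extract logo URLs from websites")
    parser.add_argument("--input", "-i", default="websites.csv",
                        help="Input CSV file with websites")
    parser.add_argument("--output", "-o", default="logos.csv",
                        help="Output CSV file for logo URLs")
    parser.add_argument("--workers", "-w", type=int, default=5,
                        help="Number of worker threads")
    parser.add_argument("--delay", "-d", type=float, default=1.0,
                        help="Delay between requests in seconds")
    parser.add_argument("--timeout", "-t", type=int, default=10,
                        help="Request timeout in seconds")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--input", "companies.csv", "--workers", "10", "--delay", "2"]
)
```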
The input CSV file should contain a column named website
with the URLs to scrape:
website
example.com
google.com
github.com
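Reading this input format could look something like the sketch below. The helper names `read_websites` and `normalize_url` are hypothetical, and prefixing bare domains with `https://` is an assumption about how entries such as `example.com` are turned into requestable URLs.

```python
import csv
import io


def read_websites(csv_file):
    """Yield non-empty values from the 'website' column of an open CSV file."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or "website" not in reader.fieldnames:
        raise ValueError("input CSV must have a 'website' column")
    for row in reader:
        site = (row["website"] or "").strip()
        if site:
            yield site


def normalize_url(site):
    """Prefix bare domains with https:// (an assumed default scheme)."""
    if site.startswith(("http://", "https://")):
        return site
    return "https://" + site


# In-memory stand-in for websites.csv so the sketch needs no files on disk.
sample = io.StringIO("website\nexample.com\ngoogle.com\ngithub.com\n")
urls = [normalize_url(site) for site in read_websites(sample)]
```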
The script outputs a CSV file with the following columns:
- website: The original website URL
- logo_url: The URL of the extracted logo (if found)
- status: The status of the extraction (success, no logo found, or an error message)
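As an illustration of the output format, the sketch below writes two result rows with `csv.DictWriter`. The function name `write_results` and the sample rows are hypothetical; only the three column names and the status values come from the description above.

```python
import csv
import io

FIELDNAMES = ["website", "logo_url", "status"]


def write_results(csv_file, results):
    """Write result dictionaries to an open CSV file, header first."""
    writer = csv.DictWriter(csv_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in results:
        writer.writerow(row)


# In-memory stand-in for logos.csv; the rows are made-up examples.
buffer = io.StringIO()
write_results(buffer, [
    {"website": "example.com",
     "logo_url": "https://example.com/logo.png",
     "status": "success"},
    {"website": "nologo.example", "logo_url": "", "status": "no logo found"},
])
output_text = buffer.getvalue()
```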
The logo scraper uses multiple methods to identify logo images:
- Looks for images with "logo" in their URL, class name, ID, or alt text
- Checks for images positioned in header elements or as home page links
- Examines SVG elements with "logo" in their class names
- Checks for meta tags with OpenGraph images
- Looks for favicon and apple-touch-icon links
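Several of the detection methods above can be approximated with the standard library's `html.parser`, as in the sketch below. This is not the script's implementation: the class `LogoCandidateParser`, the function `find_candidates`, and the source labels are hypothetical, the real tool may use a different HTML library, and inline SVG handling is left out because an inline SVG has no URL to report.

```python
from html.parser import HTMLParser


class LogoCandidateParser(HTMLParser):
    """Collect (source, url) pairs for elements that may be a logo."""

    def __init__(self):
        super().__init__()
        self.candidates = []
        self._header_depth = 0  # greater than 0 while inside <header>

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "header":
            self._header_depth += 1
        elif tag == "img" and attr.get("src"):
            # Look for "logo" in the URL, class name, ID, or alt text.
            text = " ".join(attr.get(k, "") for k in ("src", "class", "id", "alt"))
            if "logo" in text.lower():
                self.candidates.append(("img-logo", attr["src"]))
            elif self._header_depth > 0:
                self.candidates.append(("img-header", attr["src"]))
        elif tag == "meta" and attr.get("property") == "og:image":
            if attr.get("content"):
                self.candidates.append(("og-image", attr["content"]))
        elif tag == "link" and attr.get("href"):
            rel = attr.get("rel", "").lower().split()
            if "apple-touch-icon" in rel:
                self.candidates.append(("apple-touch-icon", attr["href"]))
            elif "icon" in rel:
                self.candidates.append(("favicon", attr["href"]))

    def handle_endtag(self, tag):
        if tag == "header" and self._header_depth > 0:
            self._header_depth -= 1


def find_candidates(html):
    """Parse an HTML string and return the candidates in document order."""
    parser = LogoCandidateParser()
    parser.feed(html)
    parser.close()
    return parser.candidates


SAMPLE_HTML = """<html><head>
<meta property="og:image" content="https://example.com/og.png">
<link rel="shortcut icon" href="/favicon.ico">
<link rel="apple-touch-icon" href="/touch.png">
</head><body>
<header><a href="/"><img src="/img/brand.png" alt="Home"></a></header>
<img class="site-logo" src="/img/mark.png">
<img src="/img/photo.jpg" alt="Team photo">
</body></html>"""

candidates = find_candidates(SAMPLE_HTML)
```

In this sample the team photo is skipped because it neither mentions "logo" nor sits inside a header element.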
For each website, the script:
- Sends an HTTP request with a random user agent
- Parses the HTML content
- Applies logo detection heuristics
- Scores potential logo candidates
- Returns the highest-scoring logo URL
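The last two steps, scoring candidates and returning the best one, might look like the sketch below. The weights in `SOURCE_WEIGHTS` are invented for illustration and the README does not state the actual scoring rules. The function `best_logo` and the `(source, url)` candidate format are likewise assumptions.

```python
from urllib.parse import urljoin

# Hypothetical weights: explicit "logo" matches rank above generic fallbacks.
SOURCE_WEIGHTS = {
    "img-logo": 10,
    "img-header": 6,
    "og-image": 4,
    "apple-touch-icon": 3,
    "favicon": 1,
}


def best_logo(page_url, candidates):
    """Return (logo_url, status) for a list of (source, url) candidates."""
    if not candidates:
        return "", "no logo found"
    # max() keeps the first candidate when several share the top score.
    source, url = max(candidates, key=lambda c: SOURCE_WEIGHTS.get(c[0], 0))
    # Resolve relative paths such as /img/mark.png against the page URL.
    return urljoin(page_url, url), "success"


result = best_logo(
    "https://example.com/about",
    [("favicon", "/favicon.ico"), ("img-logo", "/img/mark.png")],
)
```

Resolving the winner with `urljoin` means the returned value is an absolute URL even when the page references its logo with a relative path.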
MIT