Advanced semantic clustering solutions for organizing and analyzing large keyword datasets. These tools use machine learning and natural language processing to group semantically similar keywords, enabling more effective content strategy and SEO planning.
A comprehensive collection of keyword clustering tools powered by SentenceTransformers and various clustering algorithms.
- Multiple Clustering Methods: Standard clustering and HDBScan for large datasets
- Semantic Understanding: Uses sentence embeddings for true semantic similarity
- CLI & Notebook Versions: Choose your preferred interface
- Visualization: Generate treemaps and sunburst charts for cluster analysis
- Configurable Parameters: Adjust similarity thresholds, models, and output formats
- CLI Application: Command-line interface for production workflows
- HDBScan CLI: Optimized for very large keyword datasets
- Python Script: Direct integration into existing workflows
- Jupyter Notebooks: Interactive analysis and experimentation
- Standard semantic clustering with similarity thresholds (see the sketch after this list)
- HDBScan for density-based clustering of large datasets
- Configurable minimum similarity parameters
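Under the hood, the standard approach embeds each keyword and merges keywords whose embeddings are sufficiently similar. The snippet below is a minimal sketch of that idea, assuming SentenceTransformers for embeddings and scikit-learn's agglomerative clustering with a cosine-distance cutoff; it is illustrative only and does not reproduce the exact logic in `cluster.py` (the `cluster_keywords` helper and its defaults are invented for the example).

```python
# Minimal sketch of threshold-based semantic clustering (illustrative only).
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def cluster_keywords(keywords, model_name="all-MiniLM-L6-v2", min_similarity=0.80):
    model = SentenceTransformer(model_name)
    # Normalised embeddings make cosine distance equal to 1 - dot product.
    embeddings = model.encode(keywords, normalize_embeddings=True)

    # Merge keywords whose cosine distance stays below 1 - min_similarity.
    clustering = AgglomerativeClustering(
        n_clusters=None,
        metric="cosine",
        linkage="average",
        distance_threshold=1 - min_similarity,
    )
    labels = clustering.fit_predict(embeddings)
    return pd.DataFrame({"Keyword": keywords, "Cluster": labels})

print(cluster_keywords(["buy running shoes", "running shoes sale", "trail running tips"]))
```

Raising the similarity threshold produces tighter, smaller clusters; lowering it merges more loosely related keywords.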
cd semantic-clustering/semantic-clustering-cli-app/CLI
python cluster.py keywords.csv --column_name "Keyword" --output_path "output.csv"
python cluster.py keywords.csv \
--column_name "Keyword" \
--output_path "clustered_keywords.csv" \
--chart_type "sunburst" \
--device "cpu" \
--model_name "all-MiniLM-L6-v2" \
--min_similarity 0.80 \
--remove_dupes True \
--volume "Volume" \
--stem True
cd semantic-clustering/semantic-clustering-cli-app/CLI-HDBScan
python cluster-hdbscan.py large_keywords.csv
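For very large lists, the HDBScan version relies on density-based clustering rather than a fixed similarity threshold. A rough sketch of that approach with the `hdbscan` library is shown below; the column name, `min_cluster_size` value, and output path are assumptions for illustration, not necessarily what `cluster-hdbscan.py` does.

```python
# Illustrative sketch of density-based keyword clustering with HDBSCAN.
import pandas as pd
import hdbscan
from sentence_transformers import SentenceTransformer

df = pd.read_csv("large_keywords.csv")
keywords = df["Keyword"].astype(str).tolist()  # assumed column name

model = SentenceTransformer("all-MiniLM-L6-v2")
# Normalised embeddings let Euclidean distance stand in for cosine distance.
embeddings = model.encode(
    keywords, normalize_embeddings=True, show_progress_bar=True
).astype("float64")

# min_cluster_size controls how granular the topical groups are;
# a label of -1 marks noise keywords that fit no dense cluster.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, metric="euclidean")
df["Cluster"] = clusterer.fit_predict(embeddings)

df.to_csv("clustered_large_keywords.csv", index=False)
```

Unlike the threshold-based version, HDBScan infers the number of clusters from the data and leaves unclusterable keywords as noise, which scales better on very large datasets.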
- Group keywords into topical clusters for content planning
- Identify content gaps and opportunities
- Plan content pillars and supporting pages
- Analyze competitor keyword strategies
- Group keywords by search intent
- Optimize content for semantic keyword groups
- Create logical ad groups based on semantic similarity
- Improve Quality Scores through better keyword grouping
- Reduce keyword cannibalization
- Identify related keywords for existing content
- Optimize for semantic search and entities
- Improve topical authority
- CSV file with keyword data
- Keyword column containing the terms to cluster
- Optional volume column for weighted analysis
- At least 50 keywords for meaningful clustering
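If you are preparing the input file yourself, a minimal example could be built as follows; the `Keyword` and `Volume` column names are simply the ones used in the CLI examples above, not fixed requirements.

```python
# Hypothetical input file: one keyword column plus an optional volume column.
import pandas as pd

pd.DataFrame(
    {
        "Keyword": ["buy running shoes", "best running shoes for flat feet", "trail running shoes"],
        "Volume": [12000, 3600, 8100],
    }
).to_csv("keywords.csv", index=False)
```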
- CSV file with cluster assignments
- Excel file with pivot tables
- Interactive visualizations (treemap/sunburst charts)
- Cluster statistics and similarity scores
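The CLI generates these charts for you, but as a rough illustration, a treemap of the cluster assignments could be built with Plotly Express along these lines (the `Cluster`, `Keyword`, and `Volume` column names are assumptions about the output file):

```python
# Illustrative treemap of clustered keywords, sized by search volume.
import pandas as pd
import plotly.express as px

clustered = pd.read_csv("clustered_keywords.csv")

fig = px.treemap(
    clustered,
    path=["Cluster", "Keyword"],  # hierarchy: cluster -> keyword
    values="Volume",              # tile size reflects search volume
)
fig.write_html("cluster_treemap.html")  # open in a browser to explore
```

Swapping `px.treemap` for `px.sunburst` with the same arguments gives the sunburst view.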
- all-MiniLM-L6-v2 (default, balanced speed/accuracy)
- all-mpnet-base-v2 (higher accuracy, slower)
- distilbert-base-nli-stsb-mean-tokens
- Custom SentenceTransformer models
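Any model name that SentenceTransformers can load (including models pulled from the Hugging Face hub) can be passed via `--model_name`. A small sketch of loading the bundled options:

```python
# Sketch of swapping embedding models; larger models trade speed for accuracy.
from sentence_transformers import SentenceTransformer

fast_model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")       # 384-dim embeddings
accurate_model = SentenceTransformer("all-mpnet-base-v2", device="cpu")  # 768-dim embeddings

sentences = ["keyword clustering", "semantic seo"]
print(fast_model.encode(sentences).shape)      # (2, 384)
print(accurate_model.encode(sentences).shape)  # (2, 768)
# Pass device="cuda" instead of "cpu" to use a CUDA GPU if one is available.
```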
- CPU: 4+ cores recommended for large datasets
- RAM: 8GB minimum, 16GB+ for large datasets
- GPU: Optional CUDA support for faster processing
- Storage: Varies by dataset size
pip install sentence-transformers pandas numpy plotly scikit-learn hdbscan
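After installing, a quick optional sanity check (a hypothetical snippet, not part of the toolkit) confirms the core dependencies import and a model loads:

```python
# Optional post-install check: imports succeed and a small model encodes text.
import hdbscan
import plotly
import sklearn
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.encode(["hello world"]).shape)  # expect (1, 384)
```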
The repository includes several legacy implementations for reference and backward compatibility:
- Google Colab versions for cloud processing
- Search Engine Journal optimized versions
- Historical clustering approaches
- < 1,000 keywords: Standard CLI version
- 1,000 - 10,000 keywords: CLI with optimized settings
- 10,000+ keywords: HDBScan version
- 100,000+ keywords: Contact for enterprise solutions
- 1,000 keywords: ~2-5 minutes
- 10,000 keywords: ~10-30 minutes
- 50,000 keywords: ~1-3 hours (HDBScan)
For detailed implementation guides and advanced configurations, visit the tool-specific directories. Each version includes comprehensive documentation and example usage.
Lee Foot - SEO Consultant specializing in semantic search and content optimization.
Part of the Search Solved Public SEO toolkit - Advanced clustering for modern SEO workflows.