keyword-clustering

Add comprehensive documentation and professional repository structure

Jun 7, 2025

cf4b5dc · Jun 7, 2025

Keyword Clustering Tools

Advanced semantic clustering solutions for organizing and analyzing large keyword datasets. These tools use machine learning and natural language processing to group semantically similar keywords, enabling more effective content strategy and SEO planning.

Tools Overview

🧠 Semantic Clustering Suite

A comprehensive collection of keyword clustering tools powered by SentenceTransformers and various clustering algorithms.

Features

Multiple Clustering Methods: Standard similarity-threshold clustering, plus HDBSCAN for large datasets
Semantic Understanding: Uses sentence embeddings to group keywords by meaning rather than by shared words
CLI & Notebook Versions: Choose your preferred interface
Visualization: Generate treemaps and sunburst charts for cluster analysis
Configurable Parameters: Adjust similarity thresholds, models, and output formats

Available Versions

CLI Application: Command-line interface for production workflows
HDBSCAN CLI: Optimized for very large keyword datasets
Python Script: Direct integration into existing workflows
Jupyter Notebooks: Interactive analysis and experimentation

Clustering Algorithms Supported

Standard semantic clustering with similarity thresholds
HDBSCAN for density-based clustering of large datasets
Configurable minimum similarity threshold (--min_similarity)
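
To make the idea of a similarity threshold concrete, here is a minimal sketch of greedy threshold clustering. It is an illustration only and not necessarily how cluster.py groups keywords. The function names and the toy 2-D vectors are assumptions; in practice the vectors would be SentenceTransformer embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def threshold_clusters(keywords, vectors, min_similarity=0.80):
    """Greedy grouping: a keyword joins the first cluster whose seed vector
    is at least `min_similarity` similar, otherwise it starts a new cluster."""
    clusters = []  # list of (seed_vector, member_keywords)
    for keyword, vector in zip(keywords, vectors):
        for seed, members in clusters:
            if cosine(seed, vector) >= min_similarity:
                members.append(keyword)
                break
        else:
            clusters.append((vector, [keyword]))
    return [members for _, members in clusters]


# Toy vectors standing in for real sentence embeddings (hypothetical values).
keywords = [
    "running shoes",
    "shoes for running",
    "trail running shoes",
    "coffee grinder",
    "burr coffee grinder",
]
vectors = [(1.0, 0.1), (0.95, 0.15), (0.9, 0.3), (0.1, 1.0), (0.2, 0.95)]

groups = threshold_clusters(keywords, vectors, min_similarity=0.80)
print(groups)
```

Raising the threshold generally produces more, tighter clusters; lowering it merges loosely related keywords.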

Quick Start

Basic Usage

cd semantic-clustering/semantic-clustering-cli-app/CLI
python cluster.py keywords.csv --column_name "Keyword" --output_path "output.csv"

Advanced Configuration

python cluster.py keywords.csv \
  --column_name "Keyword" \
  --output_path "clustered_keywords.csv" \
  --chart_type "sunburst" \
  --device "cpu" \
  --model_name "all-MiniLM-L6-v2" \
  --min_similarity 0.80 \
  --remove_dupes True \
  --volume "Volume" \
  --stem True
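
When the CLI is driven from another Python script, the same command can be assembled programmatically. The sketch below only builds the argument list from the flags shown above; build_cluster_command is a hypothetical helper, not part of the repository.

```python
import shlex


def build_cluster_command(input_csv, **options):
    """Return the argv list for cluster.py.

    Each keyword argument becomes a `--name value` pair, mirroring the
    flags shown in the Advanced Configuration example.
    """
    argv = ["python", "cluster.py", input_csv]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


argv = build_cluster_command(
    "keywords.csv",
    column_name="Keyword",
    output_path="clustered_keywords.csv",
    chart_type="sunburst",
    device="cpu",
    model_name="all-MiniLM-L6-v2",
    min_similarity=0.80,
    remove_dupes=True,
    volume="Volume",
    stem=True,
)

# Print a shell-quoted version of the command for inspection.
print(shlex.join(argv))
```

To actually run it, pass the list to subprocess.run(argv, check=True) from inside the CLI directory.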

For Large Datasets (10k+ keywords)

cd semantic-clustering/semantic-clustering-cli-app/CLI-HDBScan
python cluster-hdbscan.py large_keywords.csv

Use Cases

🎯 Content Strategy

Group keywords into topical clusters for content planning
Identify content gaps and opportunities
Plan content pillars and supporting pages

📊 SEO Analysis

Analyze competitor keyword strategies
Group keywords by search intent
Optimize content for semantic keyword groups

🔍 PPC Campaign Organization

Create logical ad groups based on semantic similarity
Improve Quality Scores through better keyword grouping
Reduce keyword cannibalization

📈 Content Optimization

Identify related keywords for existing content
Optimize for semantic search and entities
Improve topical authority

Input Requirements

CSV file with keyword data
Keyword column containing the terms to cluster
Optional volume column for weighted analysis
At least 50 keywords for meaningful clustering
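
The requirements above can be checked before a long clustering run. The following is a minimal sketch using only the standard library; load_keywords and its defaults are assumptions for illustration, and the tools may perform their own validation.

```python
import csv
import io


def load_keywords(csv_file, column_name="Keyword", min_keywords=50):
    """Read the keyword column from an open CSV file and validate it."""
    reader = csv.DictReader(csv_file)
    if column_name not in (reader.fieldnames or []):
        raise ValueError(f"column {column_name!r} not found in {reader.fieldnames}")

    keywords = []
    for row in reader:
        value = (row.get(column_name) or "").strip()
        if value:  # skip blank cells
            keywords.append(value)

    if len(keywords) < min_keywords:
        raise ValueError(
            f"found {len(keywords)} keywords; at least {min_keywords} are recommended"
        )
    return keywords


# In-memory sample standing in for keywords.csv (hypothetical data).
sample = "Keyword,Volume\n" + "".join(f"keyword {i},{i * 10}\n" for i in range(60))
keywords = load_keywords(io.StringIO(sample))
print(len(keywords))
```

With a real file, pass open("keywords.csv", newline="", encoding="utf-8") instead of the in-memory sample.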

Output Formats

CSV file with cluster assignments
Excel file with pivot tables
Interactive visualizations (treemap/sunburst charts)
Cluster statistics and similarity scores
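
Once a clustered file exists, per-cluster totals are straightforward to compute. The sketch below assumes the output rows carry a "Cluster" column and a "Volume" column; those column names are hypothetical, so check the headers of your own output file first.

```python
from collections import defaultdict


def summarize_clusters(rows, cluster_key="Cluster", volume_key="Volume"):
    """Count keywords and sum search volume per cluster, largest volume first."""
    totals = defaultdict(lambda: {"keywords": 0, "volume": 0})
    for row in rows:
        entry = totals[row[cluster_key]]
        entry["keywords"] += 1
        entry["volume"] += int(row.get(volume_key) or 0)  # missing volume counts as 0
    return sorted(totals.items(), key=lambda item: item[1]["volume"], reverse=True)


# Hypothetical rows, shaped like the output of csv.DictReader.
rows = [
    {"Keyword": "running shoes", "Cluster": "running shoes", "Volume": "1000"},
    {"Keyword": "shoes for running", "Cluster": "running shoes", "Volume": "500"},
    {"Keyword": "coffee grinder", "Cluster": "coffee grinder", "Volume": "800"},
]

summary = summarize_clusters(rows)
for cluster, stats in summary:
    print(cluster, stats["keywords"], stats["volume"])
```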

Technical Specifications

Models Supported

all-MiniLM-L6-v2 (default, balanced speed/accuracy)
all-mpnet-base-v2 (higher accuracy, slower)
distilbert-base-nli-stsb-mean-tokens
Custom SentenceTransformer models

Hardware Requirements

CPU: 4+ cores recommended for large datasets
RAM: 8GB minimum, 16GB+ for large datasets
GPU: Optional CUDA support for faster processing
Storage: Varies by dataset size

Dependencies

pip install sentence-transformers pandas numpy plotly scikit-learn hdbscan

Legacy Versions

The repository includes several legacy implementations for reference and backward compatibility:

Google Colab versions for cloud processing
Search Engine Journal optimized versions
Historical clustering approaches

Performance Guidelines

Dataset Size Recommendations

Fewer than 1,000 keywords: Standard CLI version
1,000 to 10,000 keywords: CLI with optimized settings
10,000+ keywords: HDBSCAN version
100,000+ keywords: Contact for enterprise solutions
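
The recommendations above can be encoded as a small lookup. recommend_version is a hypothetical helper, and it assumes that a dataset of exactly 10,000 keywords falls into the HDBSCAN tier, which matches the "10k+" wording in Quick Start.

```python
def recommend_version(keyword_count):
    """Map a keyword count to the recommended tool version."""
    if keyword_count < 1_000:
        return "Standard CLI version"
    if keyword_count < 10_000:
        return "CLI with optimized settings"
    if keyword_count < 100_000:
        return "HDBSCAN version"
    return "Contact for enterprise solutions"


for count in (500, 5_000, 10_000, 250_000):
    print(count, "->", recommend_version(count))
```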

Processing Time Estimates

1,000 keywords: ~2-5 minutes
10,000 keywords: ~10-30 minutes
50,000 keywords: ~1-3 hours (HDBSCAN)

Support & Documentation

For detailed implementation guides and advanced configurations, visit the tool-specific directories. Each version includes comprehensive documentation and example usage.

Author

Lee Foot - SEO Consultant specializing in semantic search and content optimization.

🌐 Website
🐦 Twitter/X
✉️ Contact

Part of the Search Solved Public SEO toolkit - Advanced clustering for modern SEO workflows.
