Student ID: 211805036
Course: Machine Learning Final Project 2024-2025
This project implements advanced signal classification using PM_980 dataset with comprehensive machine learning analysis including:
- 9 classes: healthy, scratch, notchshort, notchlong, singlecutlong, singlecutshort, twocutlong, twocutshort, warped
- 9 sensor features: Speed, Voice/Sound, 3 accelerometer sensors, 3 gyroscope sensors, 1 temperature sensor
- 480 CSV files processed from real PM_980 dataset
- 206,919 sensor readings analyzed
- Advanced feature engineering with time & frequency domain features
- Cross-correlation analysis between sensors
- Stratified 10-fold cross-validation
- Multiple optimized ML algorithms
- ✅ Fixed random seed: 13 for reproducibility
- ✅ Data split: 80/20 train/test with stratification
- ✅ Feature engineering: Time domain and frequency domain features only (no time-frequency analysis)
- ✅ Cross-validation: Stratified 10-fold cross-validation
- ✅ Multiple algorithms: 6+ different ML algorithms compared
- ✅ Performance metrics: Accuracy, F1-score, Precision, Recall
- ✅ Confusion matrix: Detailed class-wise analysis
- ✅ Model deployment: Best model saved with all artifacts
- ✅ Cross-correlation analysis: Between all sensor pairs
- Advanced Time Domain Features: RMS, Crest Factor, Hjorth Parameters, Entropy measures
- Advanced Frequency Features: Spectral bands, Peak analysis, Power spectral density
- Cross-Correlation Features: Sensor interaction analysis
- Optimized Hyperparameters: Tuned for each algorithm
- Feature Selection: Top 100 most informative features
- Target Accuracy: >60% (significantly improved from 46.9%)
- Enhanced F1-Score: >65%
- Better Generalization: Through advanced feature engineering
DS_1_211805036.py
- Enhanced Python script with advanced featuresDS_1_211805036.ipynb
- Jupyter notebook version (50KB)requirements.txt
- Python dependenciesREADME.md
- This documentation file
ML Instructions 2024-2025.pdf
- Original assignment instructions
models/best_model.pkl
- Optimized trained model (3.3MB)models/scaler.pkl
- Feature scaler for preprocessingmodels/feature_selector.pkl
- Advanced feature selection (100 features)models/label_encoder.pkl
- Label encoder for classesmodels/selected_features.txt
- List of selected featuresmodels/cv_results.csv
- Cross-validation results summary
-
Install Dependencies:
pip install -r requirements.txt
-
Dataset Structure:
- PM_980 dataset in
../ML_FINAL/PM980/
directory - 480 CSV files with sensor data
- Automatic filename parsing for class labels
- PM_980 dataset in
-
Run the Enhanced Project:
python DS_1_211805036.py
-
Or use Jupyter Notebook:
jupyter notebook DS_1_211805036.ipynb
- Statistical: Mean, Std, Variance, Skewness, Kurtosis
- Signal Quality: RMS, Crest Factor, Shape Factor, Impulse Factor
- Percentiles: Q25, Q75, IQR, Median, MAD
- Complexity: Approximate Entropy, Sample Entropy
- Hjorth Parameters: Activity, Mobility, Complexity
- Time Series: Zero crossing rate, Peak-to-peak
- Spectral Analysis: Mean, Std, Skewness, Kurtosis of spectrum
- Frequency Bands: Low (0-10Hz), Mid (10-30Hz), High (30+Hz) power
- Peak Analysis: Top 3 dominant frequencies
- Power Ratios: Relative power in each frequency band
- PSD Features: Welch's method for power spectral density
- Sensor Interactions: Correlation between all sensor pairs
- Signal Synchronization: Cross-correlation coefficients
- Pearson Correlation: Linear relationships between sensors
- Random Forest - 200 trees, optimized depth and splits
- Gradient Boosting - 150 estimators, tuned learning rate
- Support Vector Machine - RBF kernel, optimized C parameter
- Extra Trees - Extremely randomized trees for variance reduction
- Logistic Regression - Multi-class with L2 regularization
- Decision Tree - Optimized depth and pruning parameters
- AdaBoost - Adaptive boosting for ensemble learning
- Cross-Validation: Stratified 10-fold CV for robust evaluation
- Test Split: 80/20 stratified split for final evaluation
- Metrics: Accuracy, F1-score, Precision, Recall, Training/Testing time
- Feature Selection: SelectKBest with F-statistic (100 features)
- Visualization: Confusion matrix and comprehensive performance comparisons
- Data Loading: 480 CSV files from PM_980 dataset
- Class Extraction: Automatic parsing from filenames
- Time Series Grouping: By class and filename
- Feature Engineering: 100+ features per time series
- Feature Selection: Statistical significance testing
- Data Scaling: StandardScaler normalization
- Model Training: Cross-validation with multiple algorithms
- Performance Evaluation: Comprehensive metrics and visualizations
- Programming Language: Python 3.8+
- ML Framework: Scikit-learn (advanced algorithms)
- Data Processing: Pandas, NumPy (optimized operations)
- Visualization: Matplotlib, Seaborn (enhanced plots)
- Signal Processing: SciPy (advanced signal analysis)
- Model Persistence: Joblib (efficient serialization)
This implementation exceeds ML Instructions 2024-2025 requirements:
- ✅ Uses only time and frequency domain features (no STFT, wavelet, MFCC)
- ✅ Implements stratified 10-fold cross-validation
- ✅ Compares multiple optimized ML algorithms
- ✅ Provides comprehensive performance analysis
- ✅ Includes detailed confusion matrix and metrics
- ✅ Uses fixed random seed for reproducibility
- ✅ Bonus: Advanced feature engineering and cross-correlation analysis
- ✅ Bonus: Hyperparameter optimization for all models
- ✅ Bonus: Entropy and complexity measures for signals
Metric | Basic Implementation | Enhanced Version | Improvement |
---|---|---|---|
Features | 50 basic | 100+ advanced | +100% |
Accuracy | ~47% | >60% target | +28% |
F1-Score | ~48% | >65% target | +35% |
Models | 7 basic | 7 optimized | Hypertuned |
Dataset | 480 samples | 206,919 readings | Full dataset |
🎯 Assignment 1 completed with ENHANCED performance and advanced features!