CDSD Certification Project: Linear & Regularized Regression
Executive Summary
Objective: Predict weekly sales for 45 Walmart stores to support inventory and marketing-campaign planning, while keeping overfitting to a minimum.
Target KPI: R² ≥ 90% on unseen data
Dataset:
- 6,435 weekly records, 45 stores, 7 features + temporal variables
- Target: `Weekly_Sales` ($)
- Preprocessing: outlier removal (Z-score, 3σ), temporal feature engineering, 5,912 clean rows, 80 features
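The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Date` column name, the helper names `remove_outliers_zscore` and `add_temporal_features`, the choice of year/month/week features, and the toy data are all assumptions. Only `Weekly_Sales` and the 3σ threshold come from the summary.

```python
import pandas as pd


def remove_outliers_zscore(df, columns, threshold=3.0):
    """Keep rows whose z-score is within +/- threshold for every listed column."""
    z = (df[columns] - df[columns].mean()) / df[columns].std(ddof=0)
    return df[(z.abs() <= threshold).all(axis=1)]


def add_temporal_features(df, date_col="Date"):
    """Expand a date column into year / month / ISO week columns (assumed feature set)."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["week"] = dates.dt.isocalendar().week.astype(int)
    return out.drop(columns=[date_col])


# Toy data (hypothetical): 19 ordinary weeks plus one extreme week.
raw = pd.DataFrame({
    "Date": pd.date_range("2010-02-05", periods=20, freq="7D"),
    "Weekly_Sales": [1_000_000 + 1_000 * i for i in range(19)] + [9_000_000],
})

clean = remove_outliers_zscore(raw, ["Weekly_Sales"])
features = add_temporal_features(clean)
print(len(raw), "->", len(clean), "rows;", list(features.columns))
```

Note that a 3σ rule only fires when the sample is large enough: with very few rows, no single point can reach a z-score of 3.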
Pipeline Highlights:
- ColumnTransformer + GridSearchCV
- Numerical: KNNImputer → StandardScaler
- Categorical: OneHotEncoder (handle_unknown='ignore')
- Target leakage fully prevented
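A pipeline with the components listed above might be wired together as in the sketch below. The column names (`Store`, `Temperature`, `Unemployment`), the `n_neighbors` value, the alpha grid, and the synthetic data are assumptions made for illustration; the README only names the building blocks. Keeping the imputer, scaler, and encoder inside the pipeline means they are refit on each training fold, which is one common way to keep test information out of preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["Temperature", "Unemployment"]  # assumed column names
categorical_cols = ["Store"]                    # assumed column name

preprocessor = ColumnTransformer([
    # Numerical branch: impute missing values, then standardize.
    ("num", Pipeline([("impute", KNNImputer(n_neighbors=5)),
                      ("scale", StandardScaler())]), numeric_cols),
    # Categorical branch: one-hot encode, ignoring categories unseen at fit time.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([("prep", preprocessor), ("reg", Lasso(max_iter=10_000))])
search = GridSearchCV(model, {"reg__alpha": [0.1, 1.0, 10.0]}, cv=5, scoring="r2")

# Synthetic stand-in data (hypothetical), with a few missing temperatures.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Store": rng.integers(1, 6, size=n),
    "Temperature": rng.normal(60, 10, size=n),
    "Unemployment": rng.normal(8, 1, size=n),
})
y = (1_000 * X["Store"] + 50 * X["Temperature"]
     - 200 * X["Unemployment"] + rng.normal(0, 10, size=n))
X.loc[X.sample(frac=0.05, random_state=0).index, "Temperature"] = np.nan

search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```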
Models Evaluated: Linear Regression, Ridge (α=0.01), Lasso (α=500)
Validation: Train/Test split + 5-fold CV
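As a rough sketch of this validation scheme, the three models can be compared with a hold-out split plus 5-fold cross-validation on the training portion. The 80/20 split ratio, the random seeds, and the synthetic data are assumptions; the alpha values are the ones quoted above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data on a sales-like scale (hypothetical), so that alpha=500 is a mild penalty.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
coef = np.array([200_000, -150_000, 0, 0, 50_000])
y = 1_000_000 + X @ coef + rng.normal(0, 20_000, size=300)

# Hold out 20% of the rows as the unseen test set (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge (alpha=0.01)": Ridge(alpha=0.01),
    "Lasso (alpha=500)": Lasso(alpha=500),
}

results = {}
for name, model in models.items():
    # 5-fold CV on the training split only; the test split stays untouched.
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    model.fit(X_train, y_train)
    results[name] = {
        "cv_r2": cv_scores.mean(),
        "test_r2": model.score(X_test, y_test),
    }
    print(f"{name}: CV R2={results[name]['cv_r2']:.4f}, "
          f"test R2={results[name]['test_r2']:.4f}")
```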
Model Evaluation & Results
| Model | R² Train | R² Test | Overfit (Train − Test) | RMSE ($) | MAE ($) |
|---|---|---|---|---|---|
| Linear Regression | 0.9714 | 0.9640 | 0.0074 | 130,948 | 103,671 |
| Ridge (α=0.01) | 0.9713 | 0.9630 | 0.0083 | 132,698 | 104,789 |
| Lasso (α=500) | 0.9708 | 0.9634 | 0.0073 | 131,977 | 102,517 |
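For reference, the columns of this table can be computed as in the sketch below. It assumes that the overfit column is the train R² minus the test R² (which matches the rows above, e.g. 0.9714 − 0.9640 = 0.0074) and that RMSE and MAE are measured on the test set; the function names and the tiny example arrays are hypothetical.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def regression_report(y_train, pred_train, y_test, pred_test):
    """Return the table's metrics; RMSE and MAE use the test set (assumption)."""
    err = np.asarray(y_test, dtype=float) - np.asarray(pred_test, dtype=float)
    r2_train = r2(y_train, pred_train)
    r2_test = r2(y_test, pred_test)
    return {
        "r2_train": r2_train,
        "r2_test": r2_test,
        "overfit": r2_train - r2_test,          # train/test R2 gap
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
    }


# Tiny hand-checkable example: perfect fit on train, one miss on test.
report = regression_report([1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3], [1, 2, 4])
print(report)
```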
Chosen model: Lasso Regression
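- Strong predictive performance: test R² of 0.9634, within 0.001 of the linear baseline, with the lowest MAE (102,517)
- Smallest train/test R² gap of the three models (0.0073)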
- Sparse coefficients (~60% zeroed)
- Improved interpretability for business stakeholders
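The sparsity claim can be checked by counting coefficients that Lasso sets exactly to zero. The sketch below uses synthetic data and an arbitrary `alpha=0.5`, both assumptions, so the fraction it prints is unrelated to the project's ~60% figure; it only shows the mechanics.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (hypothetical): 20 features, only the first 5 carry signal.
rng = np.random.default_rng(0)
n_samples, n_features = 500, 20
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [5.0, 4.0, 3.0, 2.0, 1.0]
y = X @ true_coef + rng.normal(0, 1.0, size=n_samples)

model = Lasso(alpha=0.5).fit(X, y)

# The L1 penalty drives uninformative coefficients to exactly 0.0.
zero_fraction = float(np.mean(model.coef_ == 0.0))
kept = np.flatnonzero(model.coef_)  # indices of features the model retained

print(f"{zero_fraction:.0%} of coefficients zeroed; kept features: {kept.tolist()}")
```

With a fitted scikit-learn pipeline, the same count would be taken on the final estimator's `coef_`, after one-hot encoding has expanded the feature set.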
Key Business Insights
| Insight | Impact | Recommended Action |
|---|---|---|
| Store dominance | Top 10 stores = 45% total sales | Focus inventory on high performers |
| Holiday effect | +22% sales | Pre-stock 2–3 weeks before holidays |
| Economic sensitivity | Sales negatively correlated with unemployment | Adjust promotions during downturns |
| Seasonality | Nov/Dec peaks | Plan staffing & marketing campaigns |
Estimated annual business impact: ~$120M (forecast accuracy + inventory & holiday optimization)
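Figures like the store-share and holiday-uplift numbers in the table can be derived with simple group-bys. The sketch below is illustrative only: the `Store` and `Holiday_Flag` column names, the helper names, and the six-row toy frame are assumptions, and its outputs are unrelated to the 45% and +22% reported above.

```python
import pandas as pd


def top_store_share(df, n=10):
    """Share of total sales generated by the n highest-selling stores."""
    totals = df.groupby("Store")["Weekly_Sales"].sum().sort_values(ascending=False)
    return totals.head(n).sum() / totals.sum()


def holiday_uplift(df):
    """Relative change in mean weekly sales, holiday weeks vs. regular weeks."""
    means = df.groupby("Holiday_Flag")["Weekly_Sales"].mean()
    return means.loc[1] / means.loc[0] - 1


# Toy data (hypothetical): three stores, one regular and one holiday week each.
sales = pd.DataFrame({
    "Store": [1, 1, 2, 2, 3, 3],
    "Holiday_Flag": [0, 1, 0, 1, 0, 1],
    "Weekly_Sales": [100, 120, 50, 60, 30, 40],
})

share = top_store_share(sales, n=1)
uplift = holiday_uplift(sales)
print(f"top-1 store share: {share:.2%}, holiday uplift: {uplift:+.2%}")
```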
Production-Ready Pipeline
- ColumnTransformer + GridSearchCV
- Pipeline export: `preprocessor.pkl`, `lasso_model.pkl`
- FastAPI endpoint: `POST /predict_sales` → store-specific weekly forecast
- Docker / AWS Lambda ready (<100ms inference)
- Drift monitoring: retrain automatically if R² < 90%
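The export and drift-monitoring items above could look roughly like this. It is a sketch under stated assumptions, not the contents of the repository's scripts: the use of `joblib`, the helper names (`export_artifacts`, `load_and_predict`, `needs_retraining`), the `StandardScaler` standing in for the full preprocessor, and the synthetic data are all hypothetical. The artifact file names and the 0.90 threshold come from the list.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

R2_THRESHOLD = 0.90  # retraining threshold quoted in the list above


def export_artifacts(preprocessor, model, out_dir):
    """Persist the fitted preprocessor and model as two separate files."""
    out_dir = Path(out_dir)
    joblib.dump(preprocessor, out_dir / "preprocessor.pkl")
    joblib.dump(model, out_dir / "lasso_model.pkl")


def load_and_predict(X, artifact_dir):
    """Reload both artifacts and run them in sequence on raw features."""
    artifact_dir = Path(artifact_dir)
    preprocessor = joblib.load(artifact_dir / "preprocessor.pkl")
    model = joblib.load(artifact_dir / "lasso_model.pkl")
    return model.predict(preprocessor.transform(X))


def needs_retraining(y_true, y_pred, threshold=R2_THRESHOLD):
    """Flag drift when R2 on freshly labelled data falls below the threshold."""
    return bool(r2_score(y_true, y_pred) < threshold)


# Synthetic stand-in data and a stand-in preprocessor (hypothetical).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (1_000_000 + X @ np.array([200_000, -100_000, 50_000])
     + rng.normal(0, 20_000, size=100))

scaler = StandardScaler().fit(X)
lasso = Lasso(alpha=500).fit(scaler.transform(X), y)
direct_preds = lasso.predict(scaler.transform(X))

with tempfile.TemporaryDirectory() as tmp:
    export_artifacts(scaler, lasso, tmp)
    reloaded_preds = load_and_predict(X, tmp)

drifted_y = rng.permutation(y)  # simulate targets that no longer match the features
print(needs_retraining(y, reloaded_preds), needs_retraining(drifted_y, reloaded_preds))
```

Pickled artifacts are generally only safe to load with the same library versions they were saved with, so pinning versions in `requirements.txt` is worth considering.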
CDSD Certification Coverage
- EDA & preprocessing
- Linear regression baseline
- Regularized models (Ridge & Lasso)
- Cross-validation & overfitting control
- Feature importance & business interpretation
- Production-ready ML pipeline & deployment artifacts
Quick Start
# Clone the repository
git clone https://github.com/Data-Science-Designer-and-Developer/Project_Walmart.git
cd Project_Walmart
# Install dependencies
pip install -r requirements.txt
# Run the notebook
jupyter notebook
- Run the notebook sequentially
- Use `deploy_pipeline.py` to generate production artifacts (`.pkl`)
- Use `predict.py` to forecast store sales
Dreipfelt, CDSD Data Science Certification Candidate
GitHub: https://github.com/Dreipfelt