Business Challenge
A fast-growing e-commerce company noticed a 20% increase in customer churn over six months. Its existing analytics system provided post-churn insights but could not flag at-risk customers early. The company needed a real-time predictive model to:
- Identify high-risk customers before churn
- Enable targeted retention campaigns (discounts, personalized offers)
- Reduce customer acquisition costs by improving retention
Solution: Automated ML Pipeline for Churn Prediction
We designed a scalable data pipeline that ingests transactional, behavioral, and engagement data to generate churn probability scores updated daily.
Architecture Overview:

[Figure: high-level architecture diagram]
Key Components
1. Data Ingestion
- PostgreSQL: Historical orders, returns, and customer metadata (updated hourly).
- CRM API: Real-time customer service interactions (complaints, refunds).
- S3 Buckets: User clickstreams (page views, cart abandonment) processed daily.
Tools:
- Python (Boto3, Psycopg2, Requests) for extraction
- Airflow to manage dependencies (e.g., “Wait for S3 data before feature engineering”)
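The dependency rule quoted above ("wait for S3 data before feature engineering") comes down to ordering tasks in a directed acyclic graph. The sketch below illustrates that idea with Python's standard-library `graphlib` rather than Airflow itself. The task names are hypothetical and are not taken from the production DAG.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each task maps to the set of tasks it must wait for.
DEPENDENCIES = {
    "extract_postgres": set(),
    "extract_crm_api": set(),
    "extract_s3_clickstream": set(),
    "feature_engineering": {
        "extract_postgres",
        "extract_crm_api",
        "extract_s3_clickstream",
    },
    "score_customers": {"feature_engineering"},
}


def execution_order(dependencies):
    """Return one valid run order in which every task follows its upstream tasks."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(order)
```

In Airflow the same relationships would be declared between operators in a DAG, and the scheduler would handle retries and timing on top of the ordering.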
2. Transformation & Feature Engineering
- Pandas: Handled null values and standardized formats (e.g., converted currency amounts to USD).
- PySpark: Computed aggregated features:
  - 30-day_purchase_frequency
  - avg_cart_abandonment_rate
  - customer_service_complaints_last_week
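To make the three aggregates concrete, here is a minimal single-customer sketch in plain Python. The production pipeline used PySpark. The input shapes and the exact window definitions below are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def aggregate_features(order_dates, cart_purchased_flags, complaint_dates, as_of):
    """Compute the three aggregates for one customer.

    order_dates: dates of completed orders
    cart_purchased_flags: one bool per cart session, True if it ended in a purchase
    complaint_dates: dates of customer service complaints
    as_of: the scoring date; windows look back from this date (assumed convention)
    """
    window_30 = as_of - timedelta(days=30)
    window_7 = as_of - timedelta(days=7)

    purchase_frequency = sum(1 for d in order_dates if window_30 < d <= as_of)
    abandoned = sum(1 for purchased in cart_purchased_flags if not purchased)
    abandonment_rate = (
        abandoned / len(cart_purchased_flags) if cart_purchased_flags else 0.0
    )
    complaints_last_week = sum(1 for d in complaint_dates if window_7 < d <= as_of)

    return {
        "30-day_purchase_frequency": purchase_frequency,
        "avg_cart_abandonment_rate": abandonment_rate,
        "customer_service_complaints_last_week": complaints_last_week,
    }


features = aggregate_features(
    order_dates=[date(2024, 3, 5), date(2024, 3, 20), date(2024, 1, 10)],
    cart_purchased_flags=[True, False, True, False],
    complaint_dates=[date(2024, 3, 28), date(2024, 3, 10)],
    as_of=date(2024, 3, 31),
)
print(features)
```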
3. Machine Learning Model
- Algorithm: XGBoost (via scikit-learn API) for handling imbalanced data.
- Optuna: Automated hyperparameter tuning (optimized for precision@top-10% to focus on highest-risk customers).
- Validation: Time-based split (train on 6 months, test on next 30 days).
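The validation split and the tuning metric can be sketched in a few lines. This is an illustrative version, not the project code: the row format, the function names, and the rounding rule used to size the top 10% are all assumptions.

```python
from datetime import date, timedelta


def time_based_split(rows, train_start, train_end, test_days=30):
    """Train on rows dated within [train_start, train_end]; test on the
    following test_days. Each row is a dict with a "date" key (assumed shape)."""
    test_end = train_end + timedelta(days=test_days)
    train = [r for r in rows if train_start <= r["date"] <= train_end]
    test = [r for r in rows if train_end < r["date"] <= test_end]
    return train, test


def precision_at_top_fraction(scores, labels, fraction=0.10):
    """Share of actual churners (label 1) among the highest-scored fraction."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


rows = [
    {"date": date(2023, 12, 1)},   # before the training window
    {"date": date(2024, 3, 1)},    # training
    {"date": date(2024, 7, 15)},   # test (within 30 days after train_end)
    {"date": date(2024, 8, 15)},   # after the test window
]
train, test = time_based_split(rows, date(2024, 1, 1), date(2024, 6, 30))

scores = [i / 20 for i in range(20)]   # 20 customers, so the top 10% is 2
labels = [0] * 18 + [1, 0]             # one of the two top-scored customers churned
print(precision_at_top_fraction(scores, labels))
```

Splitting by time rather than at random keeps future behavior out of the training set, which matters because churn labels are only known after the prediction date.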
Key Features:
- Recency/frequency metrics (RFM)
- Engagement decay rate (e.g., “Days since last login”)
- Sentiment score from customer support tickets
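As a rough illustration of the recency/frequency features and the "days since last login" signal, the sketch below derives them for one customer. The field names and input shapes are hypothetical. The sentiment score is left out because the source does not describe how it was computed.

```python
from datetime import date


def rfm_features(order_history, last_login, as_of):
    """order_history: list of (order_date, amount) tuples for one customer."""
    if order_history:
        last_order = max(order_date for order_date, _ in order_history)
        recency_days = (as_of - last_order).days
    else:
        recency_days = None  # no purchases yet

    return {
        "recency_days": recency_days,
        "frequency": len(order_history),
        "monetary": sum(amount for _, amount in order_history),
        "days_since_last_login": (as_of - last_login).days,
    }


rfm = rfm_features(
    order_history=[(date(2024, 3, 1), 40.0), (date(2024, 3, 21), 60.0)],
    last_login=date(2024, 3, 26),
    as_of=date(2024, 3, 31),
)
print(rfm)
```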
4. Deployment & Output
- AWS Lambda: Served predictions via API (cost-effective for sporadic, on-demand requests).
- Snowflake: Stored predictions with customer IDs for joinable analytics.
- Downstream: Marketing teams used Tableau to filter customers by churn risk and LTV.
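A prediction endpoint on Lambda could look roughly like the handler below. This is a sketch under several assumptions: the event is taken to be an API Gateway proxy-style request with a JSON `body`, the payload fields are hypothetical, and `score` is a placeholder logistic function with made-up weights standing in for the trained XGBoost model.

```python
import json
import math

# Placeholder weights standing in for the trained model. Illustrative values only.
_WEIGHTS = {"days_since_last_login": 0.08, "complaints_last_week": 0.6}
_BIAS = -3.0


def score(features):
    """Logistic stand-in for model inference; returns a probability in (0, 1)."""
    z = _BIAS + sum(w * features.get(name, 0.0) for name, w in _WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def lambda_handler(event, context):
    """Handler using the (event, context) signature that AWS Lambda invokes."""
    try:
        payload = json.loads(event.get("body") or "{}")
        customer_id = payload["customer_id"]
        features = payload["features"]
    except (KeyError, TypeError, json.JSONDecodeError):
        return {"statusCode": 400, "body": json.dumps({"error": "invalid request"})}

    body = {
        "customer_id": customer_id,
        "churn_probability": round(score(features), 4),
    }
    return {"statusCode": 200, "body": json.dumps(body)}


request = {
    "body": json.dumps(
        {
            "customer_id": "c-001",
            "features": {"days_since_last_login": 30, "complaints_last_week": 1},
        }
    )
}
response = lambda_handler(request, None)
print(response)
```

In a real deployment the model artifact would be loaded once outside the handler so that warm invocations can reuse it.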
Results
| Metric | Before | After |
|---|---|---|
| Churn Rate | 22% | 16% |
| Retention Campaign ROI | 1.5x | 3.8x |
| Model Performance (AUC-ROC) | N/A | 0.89 |
Business Impact:
- Saved $2.3M/year by reducing churn in high-LTV segments.
- Enabled dynamic email campaigns (e.g., “We miss you!” discounts for 50% predicted churn risk).
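Campaign targeting of this kind reduces to a filter on the stored scores. The sketch below assumes the 50% figure acts as a minimum threshold and that customers are prioritized by LTV. Both are interpretations, and the record fields are hypothetical.

```python
def select_for_campaign(customers, risk_threshold=0.5, min_ltv=0.0):
    """Return customers at or above the risk threshold, highest LTV first."""
    eligible = [
        c
        for c in customers
        if c["churn_probability"] >= risk_threshold and c["ltv"] >= min_ltv
    ]
    return sorted(eligible, key=lambda c: c["ltv"], reverse=True)


customers = [
    {"customer_id": "a", "churn_probability": 0.72, "ltv": 300.0},
    {"customer_id": "b", "churn_probability": 0.35, "ltv": 900.0},
    {"customer_id": "c", "churn_probability": 0.50, "ltv": 650.0},
]
targets = select_for_campaign(customers)
print([c["customer_id"] for c in targets])
```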
Lessons Learned
- Cold-start problem: Added synthetic data for new users.
- Lambda limitations: Switched to batch predictions for >10K users to avoid timeouts.
- Feature drift: Implemented Evidently.ai monitors to track data shifts.
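To show what a feature-drift check measures, the sketch below computes a Population Stability Index (PSI) for one numeric feature. This is a generic stand-in and not Evidently's API. The bin count and the floor applied to empty bins are assumptions.

```python
import math


def population_stability_index(reference, current, bins=10):
    """PSI between two samples of one numeric feature, using equal-width
    bins derived from the reference sample."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


reference = [i / 100 for i in range(100)]      # training-period values
shifted = [0.5 + i / 200 for i in range(100)]  # values concentrated in the upper half
print(population_stability_index(reference, reference))
print(population_stability_index(reference, shifted))
```

A commonly cited rule of thumb treats PSI values above roughly 0.2 as a shift worth investigating, though the right threshold depends on the feature and on how costly a stale model is.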