Project: Dalmia Bharat

Problem Statement: The GCP environment faces cost inefficiencies and performance bottlenecks due to a range of misconfigurations and operational oversights. Cost-related issues dominate: idle or forgotten resources, high logging/storage overheads, and unoptimized BigQuery usage can lead to high cloud spend. Additionally, architectural inefficiencies, such as ineffective data partitioning, poorly designed queries, and unnecessarily activated services, further …


ETL and Reporting Solution in the Retail (Supermarket) Industry

Problem Statement: A supermarket chain faced critical challenges that impeded their ability to make timely and informed business decisions. Proposed Solution: To address these challenges, the supermarket chain implemented a robust ETL pipeline using SQL Server Integration Services (SSIS), consolidated data into a SQL Server data warehouse, and restructured reporting with Power BI. Scenario Details …


ETL and Reporting Implementation for a Cosmetics Retail Company

Problem Statement: The cosmetics retail company faced the following challenges, limiting their ability to make data-driven decisions. Proposed Solution: To address these challenges, the company implemented a modern data architecture using Azure Data Factory (ADF) for ETL, Azure Data Lake Storage (ADLS) as a centralized data repository, and Power BI for unified reporting and insights. …


ELT Process in a Leading US Retail Pharmacy

Business Context: A leading US retail pharmacy chain handles vast amounts of transactional, inventory, and customer data daily. This data comes from diverse sources, including APIs (REST and GraphQL), SQL/NoSQL databases, and third-party platforms such as marketing and logistics systems. The company requires a scalable ELT pipeline to ensure high-quality data is transformed and analyzed …


Optimizing Data Pipelines and Engineering Solutions in GCP Using Apache Spark and PySpark

Background: A leading provider of advanced data solutions was experiencing significant performance bottlenecks in its data processing pipelines. The company relied on legacy systems that could not keep pace with rapidly growing data volumes, leading to delays in analytics and reporting. These issues not only hampered operational efficiency but also increased operational costs and customer …


Current Phase – Migration to GCP (Google Cloud Platform)

As the client plans to migrate to Google Cloud Platform (GCP), the architecture and processes must accommodate a seamless transition. Key Considerations for GCP Migration: Cloud-Native Services: Replace current on-premises and hybrid tools with GCP-native services to reduce operational overhead. Scalability and Performance: GCP’s distributed architecture ensures high availability and the ability to scale dynamically …


Implementing Informatica MDM for a Banking Institution (On-Prem)

Introduction: In the modern banking industry, maintaining a single, trusted view of customer data is critical for regulatory compliance, risk management, and personalized customer experiences. This case study outlines the implementation of an Informatica Master Data Management (MDM) on-premises solution for a large banking institution with multiple legacy systems. Problem Statement: The problem statement …


Customer Churn Prediction Pipeline for an E-Commerce Company

Business Challenge

A fast-growing e-commerce company noticed a 20% increase in customer churn over six months. Their existing analytics system provided post-churn insights but failed to predict at-risk customers early. They needed a real-time predictive model to:

  • Identify high-risk customers before churn
  • Enable targeted retention campaigns (discounts, personalized offers)
  • Reduce customer acquisition costs by improving retention

 

Solution: Automated ML Pipeline for Churn Prediction

We designed a scalable data pipeline that ingests transactional, behavioral, and engagement data to generate churn probability scores updated daily.

 

Architecture Overview:

(High-level architecture diagram)

Key Components

1. Data Ingestion
  • PostgreSQL: Historical orders, returns, and customer metadata (updated hourly).
  • CRM API: Real-time customer service interactions (complaints, refunds).
  • S3 Buckets: User clickstreams (page views, cart abandonment) processed daily.

      Tools:

  • Python (Boto3, Psycopg2, Requests) for extraction
  • Airflow to manage dependencies (e.g., “Wait for S3 data before feature engineering”)
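
A minimal Airflow 2.x sketch of how the ingestion layer above could be wired together is shown below; the hosts, credentials, bucket names, and table names are illustrative placeholders rather than the project’s actual configuration.

    # Hypothetical Airflow DAG sketching the ingestion layer described above.
    # Hosts, credentials, buckets, and table names are illustrative placeholders.
    from datetime import datetime

    import boto3
    import psycopg2
    import requests
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.sensors.python import PythonSensor


    def s3_clickstream_ready() -> bool:
        """Return True once the daily clickstream dump has landed in S3."""
        s3 = boto3.client("s3")
        resp = s3.list_objects_v2(Bucket="clickstream-bucket", Prefix="daily/")
        return resp.get("KeyCount", 0) > 0


    def extract_orders() -> None:
        """Pull recently updated orders, returns, and customer metadata from PostgreSQL."""
        conn = psycopg2.connect(host="pg-host", dbname="shop", user="etl", password="***")
        with conn, conn.cursor() as cur:
            cur.execute("SELECT * FROM orders WHERE updated_at > now() - interval '1 hour'")
            rows = cur.fetchall()  # in practice, written to staging storage rather than printed
        print(f"extracted {len(rows)} order rows")


    def extract_crm_events() -> None:
        """Fetch recent customer-service interactions from the CRM REST API."""
        resp = requests.get("https://crm.example.com/api/interactions", timeout=30)
        resp.raise_for_status()
        print(f"fetched {len(resp.json())} CRM events")


    with DAG(
        dag_id="churn_ingestion",
        start_date=datetime(2024, 1, 1),
        schedule="@hourly",
        catchup=False,
    ) as dag:
        wait_for_s3 = PythonSensor(task_id="wait_for_s3_data", python_callable=s3_clickstream_ready)
        orders = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
        crm = PythonOperator(task_id="extract_crm_events", python_callable=extract_crm_events)
        features = PythonOperator(task_id="feature_engineering", python_callable=lambda: None)

        # Feature engineering (next section) runs only after every source is available.
        [wait_for_s3, orders, crm] >> features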

2. Transformation & Feature Engineering
  • Pandas: Cleaned null values and standardized formats (e.g., normalizing monetary values to USD).
  • PySpark: Computed aggregated features:
    • 30-day_purchase_frequency
    • avg_cart_abandonment_rate
    • customer_service_complaints_last_week
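
A hedged sketch of this PySpark aggregation step follows; the staging tables and column names (customer_id, order_ts, abandoned_cart, and so on) are assumptions for illustration, not the project’s actual schema.

    # Illustrative PySpark feature engineering; table and column names are placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("churn_features").getOrCreate()

    orders = spark.table("staging.orders")
    clicks = spark.table("staging.clickstream")
    tickets = spark.table("staging.support_tickets")

    features = (
        orders
        .where(F.col("order_ts") >= F.date_sub(F.current_date(), 30))
        .groupBy("customer_id")
        .agg(F.count("*").alias("purchase_frequency_30d"))
        .join(
            clicks.groupBy("customer_id")
                  .agg(F.avg(F.col("abandoned_cart").cast("double"))
                        .alias("avg_cart_abandonment_rate")),
            "customer_id", "outer",
        )
        .join(
            tickets.where(F.col("created_ts") >= F.date_sub(F.current_date(), 7))
                   .groupBy("customer_id")
                   .agg(F.count("*").alias("customer_service_complaints_last_week")),
            "customer_id", "outer",
        )
        .fillna(0)  # customers with no activity in a window get zero, not null
    )

    features.write.mode("overwrite").saveAsTable("features.churn_daily")
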
3. Machine Learning Model
  • Algorithm: XGBoost (via scikit-learn API) for handling imbalanced data.
  • Optuna: Automated hyperparameter tuning (optimized for precision@top-10% to focus on highest-risk customers).
  • Validation: Time-based split (train on 6 months, test on next 30 days).

      Key Features:

  • Recency/frequency metrics (RFM)
  • Engagement decay rate (e.g., “Days since last login”)
  • Sentiment score from customer support tickets
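
The training and tuning step could look roughly like the sketch below, assuming a hypothetical feature snapshot file and label column (churned); the hyperparameter ranges are illustrative, not the values used in production.

    # Sketch of the training loop: XGBoost via the scikit-learn API, Optuna tuning
    # on precision@top-10%, and a time-based train/test split. Names are illustrative.
    import optuna
    import pandas as pd
    from xgboost import XGBClassifier

    df = pd.read_parquet("churn_features.parquet")          # hypothetical feature snapshot
    cutoff = df["snapshot_date"].max() - pd.Timedelta(days=30)
    train, test = df[df["snapshot_date"] <= cutoff], df[df["snapshot_date"] > cutoff]

    feature_cols = [c for c in df.columns if c not in ("customer_id", "snapshot_date", "churned")]
    X_tr, y_tr = train[feature_cols], train["churned"]
    X_te, y_te = test[feature_cols], test["churned"]


    def precision_at_top_decile(y_true: pd.Series, scores) -> float:
        """Precision among the 10% of customers with the highest predicted risk."""
        k = max(1, len(scores) // 10)
        top = pd.Series(scores, index=y_true.index).nlargest(k).index
        return float(y_true.loc[top].mean())


    def objective(trial: optuna.Trial) -> float:
        model = XGBClassifier(
            n_estimators=trial.suggest_int("n_estimators", 100, 600),
            max_depth=trial.suggest_int("max_depth", 3, 8),
            learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            scale_pos_weight=float((y_tr == 0).sum() / (y_tr == 1).sum()),  # class imbalance
            eval_metric="auc",
        )
        model.fit(X_tr, y_tr)
        return precision_at_top_decile(y_te, model.predict_proba(X_te)[:, 1])


    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)
    print("best precision@top-10%:", study.best_value)
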
4. Deployment & Output
  • AWS Lambda: Served predictions via API (cost-effective for sporadic retraining).
  • Snowflake: Stored predictions with customer IDs for joinable analytics.
  • Downstream: Marketing teams used Tableau to filter customers by churn risk and LTV.
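
On the serving side, a single-customer prediction endpoint on AWS Lambda might look like the sketch below; the model bucket, key, and payload shape are assumptions, and for more than 10K customers the same scoring logic would run as a batch job instead (see Lessons Learned).

    # Minimal sketch of a prediction endpoint as an AWS Lambda handler.
    # Bucket, key, and payload shape are placeholders; the real service may differ.
    import json

    import boto3
    import numpy as np
    import xgboost as xgb

    s3 = boto3.client("s3")
    _model = None


    def _load_model() -> xgb.Booster:
        """Lazily download the trained model from S3 on the first (cold) invocation."""
        global _model
        if _model is None:
            s3.download_file("churn-models", "latest/model.json", "/tmp/model.json")
            _model = xgb.Booster()
            _model.load_model("/tmp/model.json")
        return _model


    def handler(event, context):
        """Score one customer's feature vector passed in the request body."""
        features = json.loads(event["body"])["features"]        # e.g. [[0.2, 3, 1, ...]]
        dmatrix = xgb.DMatrix(np.asarray(features, dtype=float))
        score = float(_load_model().predict(dmatrix)[0])
        return {"statusCode": 200, "body": json.dumps({"churn_probability": score})}

Batch scores would then be written to Snowflake (for example via the snowflake-connector-python write_pandas helper) so the marketing team can join them to customer records in Tableau.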

 

Results

Metric                     Before    After
Churn Rate                 22%       16%
Retention Campaign ROI     1.5x      3.8x
Model Accuracy (AUC-ROC)   –         0.89

 

Business Impact:
  • Saved $2.3M/year by reducing churn in high-LTV segments.
  • Enabled dynamic email campaigns
    (e.g., “We miss you!” discounts for customers above a 50% predicted churn risk).

 

Lessons Learned

  • Cold-start problem:
    Added synthetic data for new users.

  • Lambda limitations:
    Switched to batch predictions for >10K users to avoid timeouts.

  • Feature drift:
    Implemented Evidently.ai monitors to track data shifts.
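
As an illustration of that drift monitoring, a daily Evidently check could look like the sketch below; it assumes the Report / DataDriftPreset API of Evidently around version 0.4 (class locations differ between releases) and hypothetical feature snapshot files.

    # Hedged sketch of a daily data-drift check with Evidently (~0.4 API).
    import pandas as pd
    from evidently.metric_preset import DataDriftPreset
    from evidently.report import Report

    reference = pd.read_parquet("features_training_window.parquet")  # data the model was trained on
    current = pd.read_parquet("features_latest_day.parquet")         # yesterday's scored features

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)

    result = report.as_dict()
    if result["metrics"][0]["result"]["dataset_drift"]:
        print("Feature drift detected - consider retraining the churn model.")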

Implementing Power BI in a U.S.-Based Health and Life Sciences Company

Background:

A mid-sized Health and Life Sciences company, “HealthFirst,” operates across multiple states in the U.S., offering a range of services from patient care to pharmaceutical research. The company faced challenges with fragmented data sources, delayed reporting, and compliance with healthcare regulations such as the Health Insurance Portability and Accountability Act (HIPAA).

 

Objectives:

  • Centralize Data Sources:
    Integrate disparate data systems to provide a unified view of operations.

  • Enhance Reporting Efficiency:
    Reduce the time required to generate actionable reports.

  • Ensure Regulatory Compliance:
    Implement solutions that comply with HIPAA and other relevant regulations.

  • Empower Decision-Making:
    Provide stakeholders with real-time insights to facilitate informed decisions.

 

Implementation Steps:

 Step 1: Assessment of Existing Infrastructure
    • Data Inventory:
      Conducted a comprehensive audit of existing data sources, including Electronic Health Records (EHR), financial systems, and patient feedback databases.

    • Stakeholder Interviews:
      Engaged with department heads to understand reporting needs and pain points.
 Step 2: Data Integration and Modeling:
    • ETL Processes:
      Employed Extract, Transform, Load (ETL) tools to consolidate data into a centralized repository.
       
    • Data Modeling:
      Designed a data model that reflects the company’s operations, ensuring scalability and flexibility.
 Step 3: Development of Interactive Dashboards:
    • Clinical Operations Dashboard:
      Monitored patient admissions, discharge rates, and treatment outcomes. 

    • Financial Performance Dashboard:
      Tracked revenue cycles, operational costs, and profitability metrics. 

    • Compliance Dashboard:
      Ensured adherence to HIPAA regulations by monitoring data access and usage.
 Step 4: User Training and Adoption:
    • Workshops:
      Conducted training sessions for staff to familiarize them with Power BI tools and dashboards. 

    • Support Resources:
      Provided user manuals and established a helpdesk for ongoing assistance.
 Step 5: Deployment and Continuous Improvement:
    • Pilot Testing:
      Launched dashboards with select user groups to gather feedback and make necessary adjustments.
       
    • Full Deployment:
      Rolled out the solution company-wide, with continuous monitoring and optimization.

 

Outcomes:

  • Improved Data Accessibility:
    Stakeholders gained real-time access to critical data, enhancing responsiveness. 

  • Reduced Reporting Time:
    Automated data processes decreased report generation time by 50%.
     
  • Enhanced Compliance Monitoring:
    The compliance dashboard provided proactive oversight of data practices, ensuring regulatory adherence.
     
  • Informed Decision-Making:
    Data-driven insights led to strategic initiatives that improved patient care and operational efficiency.

 

Clinical Operations Dashboard:

A Clinical Operations Dashboard in a healthcare setting serves as a centralized platform that provides real-time insights into various operational aspects of a medical facility. By aggregating and visualizing key performance indicators (KPIs), this dashboard enables healthcare administrators and clinicians to monitor, analyze, and enhance the efficiency and quality of patient care.

Key Components of a Clinical Operations Dashboard:

  1. Patient Flow Metrics:
     
    • Admission and Discharge Rates:
      Tracks the number of patients admitted and discharged over specific periods, helping to identify trends and manage hospital capacity effectively.
       
    • Emergency Department (ED) Statistics:
      Monitors metrics such as patient wait times, length of stay in the ED, and the rate of patients leaving without being seen, which are critical for assessing the responsiveness of emergency services.
       
  2. Resource Utilization:
     
    • Bed Occupancy Rates:
      Displays the current and historical utilization of hospital beds, assisting in resource planning and identifying potential bottlenecks in patient care.
       
    • Staffing Levels:
      Provides insights into the allocation and sufficiency of medical and support staff across different departments, ensuring that patient care is not compromised due to understaffing.

  3. Clinical Performance Indicators:
     
    • Surgery Cancellation Rates:
      Monitors the frequency and reasons for surgical cancellations, offering opportunities to improve scheduling and preoperative processes.
       
    • Readmission Rates:
      Tracks the percentage of patients who return to the hospital within a certain timeframe after discharge, serving as a measure of care quality and effectiveness.

  4. Financial Metrics:
     
    • Revenue Cycle Data:
      Analyzes billing cycles, outstanding payments, and insurance claim statuses to optimize financial operations.
       
    • Operational Costs:
      Monitors expenses related to staffing, equipment, and facilities maintenance, aiding in budget management and cost reduction strategies.

  5. Compliance and Quality Assurance:
     
    • Regulatory Compliance Tracking:
      Ensures that the facility adheres to healthcare regulations and standards, such as those set by HIPAA, by monitoring relevant activities and protocols.
    • Patient Satisfaction Scores:
      Gathers and displays data from patient feedback surveys to assess the quality of care and identify areas for improvement.
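
As a small illustration of how KPIs like these are typically derived before being visualized in Power BI, the sketch below computes a 30-day readmission rate and a bed occupancy rate in Python; the file and column names are assumptions, not HealthFirst’s actual schema.

    # Hedged sketch: computing two of the dashboard KPIs above with pandas.
    # Column names (patient_id, admit_ts, discharge_ts) are illustrative placeholders.
    import pandas as pd

    admissions = pd.read_csv("admissions.csv", parse_dates=["admit_ts", "discharge_ts"])

    # 30-day readmission rate: share of discharges followed by a new admission
    # for the same patient within 30 days.
    admissions = admissions.sort_values(["patient_id", "admit_ts"])
    next_admit = admissions.groupby("patient_id")["admit_ts"].shift(-1)
    readmitted = (next_admit - admissions["discharge_ts"]) <= pd.Timedelta(days=30)
    readmission_rate = readmitted.mean()

    # Bed occupancy on a given day: patients in house divided by staffed beds.
    day = pd.Timestamp("2024-06-01")
    in_house = ((admissions["admit_ts"] <= day) &
                (admissions["discharge_ts"].isna() | (admissions["discharge_ts"] > day))).sum()
    staffed_beds = 220                                  # illustrative capacity figure
    occupancy_rate = in_house / staffed_beds

    print(f"30-day readmission rate: {readmission_rate:.1%}")
    print(f"Bed occupancy on {day.date()}: {occupancy_rate:.1%}")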

Migrating AWS Redshift to GCP BigQuery

Let’s consider a real-world example: CP Plus

 

Background

CP Plus is a US-based, mid-sized e-commerce company that relies heavily on data analytics to drive personalized customer experiences. The company currently uses Amazon Redshift on AWS as its data warehouse to process and analyze transactional and historical data.

Problem Statement & Migration Approach

As part of our migration strategy with a partner, we identified Amazon Redshift as a major cost driver. To optimize expenses on GCP, we began with a preliminary analysis using small datasets to evaluate cost-saving opportunities. This led to a full-scale migration, with a key focus on ELT job transitions.

Agenthum, as a migration partner, played a pivotal role by:

  • Assessing the sample data to determine in-scope and out-of-scope objects.
  • Identifying the most efficient migration approach based on data insights.
  • Developing a structured implementation plan, dividing the migration process into six critical phases/milestones.
  • Executing cost optimization strategies both during and after migration.
  • Overcoming various challenges and hurdles to ensure a seamless transition.

This case study highlights the strategic ELT job migration process and the key learnings from optimizing workloads on GCP.

 

Pre-Migration Setup on AWS Redshift

The company’s Redshift data warehouse manages analytics for 500,000 monthly active users, processing approximately 1 TB of data monthly, with a total stored size of 50 TB of historical data.

AWS Redshift Architecture Components:

  • Data Warehouse:
    Amazon Redshift (2 dc2.large nodes, 1 TB monthly processed data, 50 TB total storage).

  • Storage Integration:
    Amazon S3 (50 TB of raw historical data, staged for Redshift loading).

  • ETL Process:
    AWS Glue to extract, transform, and load data from S3 into Redshift.

  • Networking:
    Redshift resides in an Amazon VPC, accessed by internal BI tools via JDBC/ODBC.

 

Existing AWS Redshift Architecture Diagram

 

Migration Goals

  • Reduce data warehousing costs by at least 15%.
  • Improve query performance for real-time analytics.
  • Simplify ETL processes and leverage serverless analytics.

Migration Strategy

The company adopted a lift-and-shift-with-optimization approach, moving Redshift data to BigQuery and optimizing for its columnar storage and serverless architecture. The migration was completed in one month, with minimal disruption to analytics workflows.

 

New GCP BigQuery Architecture Diagram

 

Sample Data Sizes:

  • Monthly Processed Data:
    1 TB (loaded into BigQuery for analytics).

  • Historical Data:
    50 TB (stored in cloud storage, queried on-demand by BigQuery).

 

Migration Process

Step 1: Assessment
Analyzed Redshift schema and data using Google’s BigQuery Migration Assessment tool.

Step 2: Data Export
Exported 50 TB from Redshift to S3 using the UNLOAD command (2 days).

Step 3: Data Transfer
Used Google Cloud Transfer Service to move 50 TB from S3 to Cloud Storage
(1 week, $0.09/GB AWS egress cost = $4,500 one-time fee).

Step 4: Schema Conversion
Converted Redshift SQL schema to BigQuery-compatible DDL using automated scripts, adjusting for columnar storage.

Step 5: Loading Data
Imported data from Cloud Storage into BigQuery using the bq load command (1 day).

Step 6: Validation
Ran parallel queries on Redshift and BigQuery for one week to ensure consistency.
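
For illustration, the export (Step 2) and load (Step 5) mechanics could look roughly like the sketch below; the cluster endpoint, IAM role, buckets, and dataset/table names are placeholders, not the client’s actual objects.

    # Illustrative sketch of Step 2 (Redshift UNLOAD to S3) and Step 5 (BigQuery load).
    import psycopg2
    from google.cloud import bigquery

    # Step 2: UNLOAD one Redshift table to S3 as compressed CSV (repeated per table).
    unload_sql = """
        UNLOAD ('SELECT * FROM analytics.orders')
        TO 's3://redshift-export-bucket/orders/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
        FORMAT AS CSV GZIP PARALLEL ON;
    """
    conn = psycopg2.connect(host="redshift-cluster.example.com", port=5439,
                            dbname="analytics", user="etl", password="***")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(unload_sql)
    conn.close()

    # Step 3 moves the exported files from S3 to Cloud Storage via Storage Transfer Service.

    # Step 5: load the transferred files from Cloud Storage into BigQuery.
    client = bigquery.Client(project="my-gcp-project")
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    load_job = client.load_table_from_uri(
        "gs://bq-import-bucket/orders/*.csv.gz",
        "my-gcp-project.analytics.orders",
        job_config=job_config,
    )
    load_job.result()  # block until the load job completes
    print(client.get_table("my-gcp-project.analytics.orders").num_rows, "rows loaded")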

 

Cost Optimization Strategies

  1. Storage Optimization

    • Replace Amazon S3 Standard with GCP Standard Storage (or SSD-backed Persistent Disk where appropriate) for roughly 20–50% lower latency, 5–15% higher throughput, and 10–20% better request rates; SSD Persistent Disk can yield a 90%+ latency reduction.

    • Enable automatic storage scaling to avoid over-provisioning.

  2. Network Cost Reduction

    • Replace AWS Global Accelerator with GCP’s Premium Tier Network for predictable pricing and low latency; thanks to Google’s extensive private fiber network, it can deliver 10–20% lower latency than AWS Global Accelerator, especially for global workloads.

    • Leverage AWS Direct Connect and GCP Cloud Interconnect to reduce data transfer costs between AWS and GCP.

  3. Serverless and Managed Services

    • Replace AWS Athena with GCP BigQuery for analytics instead of running expensive ETL pipelines on AWS, gaining 20–40% faster query performance, 30–50% cost savings on infrastructure, and 15–25% less development time.

    • Leverage Cloud SQL’s automated backups and maintenance to reduce operational overhead.

  4. Monitoring and Cost Management

    • Use GCP’s Cost Management Tools to monitor and optimize spending.

    • Set up budget alerts to avoid unexpected costs.

  5. Data Lifecycle Management

    • Shift archival of historical data from Amazon S3 Glacier to GCP Coldline Storage for cost-effective long-term storage, gaining 30–50% faster retrieval times and 10–20% lower storage costs for long-term archival data.

    • Use BigQuery partitioning to reduce query costs for large datasets.
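
As an illustration of the partitioning recommendation above, the sketch below creates a date-partitioned, clustered table with the BigQuery Python client; the project, dataset, table, and column names are placeholders.

    # Illustrative creation of a date-partitioned, clustered BigQuery table so that
    # queries filtered on the partition column scan only the relevant partitions.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-gcp-project")

    table = bigquery.Table(
        "my-gcp-project.analytics.orders_partitioned",
        schema=[
            bigquery.SchemaField("order_id", "STRING"),
            bigquery.SchemaField("customer_id", "STRING"),
            bigquery.SchemaField("order_date", "DATE"),
            bigquery.SchemaField("amount", "NUMERIC"),
        ],
    )
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="order_date",
    )
    table.clustering_fields = ["customer_id"]
    client.create_table(table, exists_ok=True)

    # Filtering on the partition column keeps the scanned bytes (and cost) proportional
    # to the date range rather than to the full 50 TB of history.
    query = """
        SELECT customer_id, SUM(amount) AS total_spend
        FROM `my-gcp-project.analytics.orders_partitioned`
        WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
        GROUP BY customer_id
    """
    rows = client.query(query).result()
    print("customers returned:", rows.total_rows)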

 

Additional Benefits

  • Performance:
    Query execution time dropped from 10 minutes (Redshift) to 1.5 minutes (BigQuery) due to serverless scaling and columnar optimization.

  • Scalability:
    BigQuery automatically scales to handle peak loads (e.g., holiday sales analytics) without manual node management.

  • Simplified ETL:
    BigQuery’s native integration with Cloud Storage eliminated the need for a separate ETL tool like AWS Glue.

 

Challenges and Resolutions

  • Egress Costs:
    AWS S3-to-GCP transfer incurred a one-time $4,500 fee, offset by monthly savings within 14 months.

  • SQL Differences:
    Minor syntax adjustments (e.g., Redshift’s DISTSTYLE vs. BigQuery partitioning) were resolved with Google’s migration guides.

  • Downtime:
    Limited to 4 hours during final cutover, mitigated by pre-loading data into BigQuery.

 

Conclusion

The company successfully migrated its AWS Redshift data warehouse to GCP BigQuery, achieving a 21.8% cost reduction. The migration improved query performance, simplified operations, and positioned the company to leverage BigQuery’s ML capabilities (e.g., BigQuery ML) for future personalization projects. This case study highlights the value of migrating to a serverless, cost-efficient data warehouse like BigQuery.