Project: Dalmia Bharat

November 21, 2025

Problem Statement:

The GCP environment faces cost inefficiencies and performance bottlenecks caused by a range of misconfigurations and operational oversights. Cost-related issues dominate: idle or forgotten resources, high logging and storage overheads, and unoptimized BigQuery usage drive up cloud spend. Architectural inefficiencies, such as ineffective data partitioning, poorly designed queries, and unnecessarily activated services, further inflate bills without adding value. On the performance side, null value propagation, complex merge queries, and overconsumption of slots hinder pipeline reliability and query execution. Together, these challenges call for a proactive governance model that combines automation, monitoring, and architectural best practices to ensure scalable and cost-efficient GCP operations.

Major Challenges shared by Dalmia Bharat post-migration from AWS to GCP:

High Cloud Cost:

  1. High Storage Cost
  2. High Cloud Logging Cost
  3. High Cloud Composer Cost / High Cloud Run Cost
  4. High Compute Engine Cost
  5. High Dataproc Cost

Other Challenges:

  1. Application / Pipeline Not Running Properly due to null values in source data
  2. Null Values Causing Extra Cost 

Detailed overview of the challenges and the approach to fix them:

1. High Storage Cost

Problem Statement:

  • High Cloud Storage usage, idle resources and their residuals, retention of old/unused files, and duplicate datasets inflate costs.
  • Developers or users often forget to free temporary resources like VM instances, Dataproc clusters, or Composer environments after their tasks complete.

Probable Causes:

  • No lifecycle rules or archival strategy.
  • Not using Nearline/Coldline for infrequently accessed data.
  • No residual resource management.
  • Storing raw data in uncompressed CSV/JSON formats, leading to large storage and query costs.
  • No proper data lifecycle rules for unused data.
  • Manual provisioning of compute services.
  • Lack of automation and alerts.
  • Using default export formats.
  • Lack of awareness of the benefits of compressed file formats.

Impact:

  • High GCP cost for unused compute, storage, and network resources.
  • Wasted resources sitting idle.
  • Higher Cloud Storage and BigQuery storage/query cost.

Approach:

Upon further detailed analysis and diagnosis, the following methods can be applied to reduce storage costs; a code sketch follows the list.

  • Use storage class tiers (Nearline, Coldline).
  • Apply lifecycle rules to archive old data after a certain period of time.
  • Compress data before storing.
  • Set up auto-shutdown scripts for VMs and Dataproc.
  • Use labels and budget alerts to track test/dev environments.
  • Enable idle cluster detection in Dataproc.
  • Enforce IAM policies to restrict provisioning permissions, if not already in place.
  • Store files in Parquet or Avro format (if not already doing so).
  • Use GZIP for compressing CSV/JSON when Parquet is not feasible.
  • Set lifecycle policies on Cloud Storage buckets to delete raw data after conversion.
  • Ingest compressed files directly to BigQuery using auto-detect schemas.
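
As an illustration of the lifecycle and storage-class items above, the minimal sketch below uses the google-cloud-storage Python client to attach tiering and deletion rules to a bucket. The bucket name and age thresholds are illustrative assumptions, not values from the actual environment.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("dalmia-raw-exports")  # hypothetical bucket name

# Move objects to cheaper storage classes as they age.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)   # after 30 days
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)   # after 90 days

# Delete raw files once they are older than a year.
bucket.add_lifecycle_delete_rule(age=365)

bucket.patch()  # push the updated lifecycle configuration to the bucket
```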

Recommendations:

  • Use Storage Insights for recommendations. 
  • Use the GCP monitoring tool (Google Cloud Monitoring).

2. High Cloud Logging Cost

Problem Statement:

Excessive log ingestion and long retention periods increase costs in Cloud Logging.

Probable Causes:

  • Default retention settings.
  • Logging at DEBUG level unnecessarily.

Impact:

  • Costs increase with GBs of logs stored.

Mitigation Approach:

Upon further detailed analysis and diagnosis, the following methods can be applied; a code sketch follows the list.

  • Set custom retention periods.
  • Use log exclusion filters.
  • Route logs to BigQuery with partitioning if needed.
  • Filter out noisy logs like health checks or verbose component logs.
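
As a sketch of the routing idea above, a Cloud Logging sink can forward only the entries worth keeping to BigQuery, while exclusion filters and shorter retention handle the rest. The sink name, project, and dataset (ops_logs) below are placeholder assumptions.

```python
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()

# Route only WARNING-and-above entries to BigQuery; everything else stays in
# Cloud Logging, where exclusion filters and shorter retention keep costs down.
sink = client.sink(
    "warnings-to-bq",  # hypothetical sink name
    filter_="severity>=WARNING",
    destination="bigquery.googleapis.com/projects/my-project/datasets/ops_logs",
)
if not sink.exists():
    sink.create()
```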

3. High Cloud Composer Cost / Cloud Run Cost

Problem Statement:

  • Cloud Composer has high base pricing.
  • Resources like VMs, Cloud Composer environments, and Cloud Run containers are running but not utilized.
  • Cloud Run is billed for memory, CPU, and execution time per request. Poor design causes long runtimes.

Probable Causes:

  • Composer environments always running.
  • Multiple environments for similar jobs.
  • Inefficient orchestration.
  • Services not scaled to zero (e.g., Cloud Run with minimum instances).
  • High memory/CPU allocations.
  • Minimum instances > 0 (always on).

Impact:

  • Costs accrue even when no DAGs are running.
  • Pay for compute/memory even when idle.
  • Increased runtime costs and idle instance costs.

Approach:

Upon further detailed analysis and diagnosis, the following methods can be applied; a code sketch follows the list.

  • Use Cloud Scheduler + Cloud Functions to pause/resume environments.
  • Consolidate DAGs into fewer environments.
  • Explore Workflows for simpler orchestration needs.
  • Configure Cloud Run to scale to zero.
  • Use Cloud Monitoring to identify underutilized resources.
  • Use Cloud Scheduler to shut down idle VMs or pause Composer environments.
  • Implement auto-scaling policies based on usage metrics.
  • Optimize container logic to reduce runtime.
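
The "Cloud Scheduler to shut down idle VMs" item can be realised with a small Cloud Functions handler invoked on a schedule. The sketch below is one possible shape of that handler; the project ID, zone, and the env=dev label are placeholder assumptions.

```python
from google.cloud import compute_v1

PROJECT = "my-project"    # placeholder project ID
ZONE = "asia-south1-a"    # placeholder zone

def stop_dev_vms(request):
    """HTTP Cloud Functions entry point, triggered nightly by Cloud Scheduler."""
    instances = compute_v1.InstancesClient()
    stopped = []
    for vm in instances.list(project=PROJECT, zone=ZONE):
        # Only touch running VMs explicitly labelled as dev/test resources.
        if vm.status == "RUNNING" and vm.labels.get("env") == "dev":
            instances.stop(project=PROJECT, zone=ZONE, instance=vm.name)
            stopped.append(vm.name)
    return f"Stopped: {', '.join(stopped) or 'none'}"
```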

4. High Compute Engine Cost

Problem Statement: 

  • Compute Engine instances are running beyond their required capacity or lifecycle, including idle or oversized instances that continue to incur costs unnecessarily.

Probable Causes: 

  • Instances left running after usage (forgotten/abandoned resources).
  • Over-provisioned VMs (unoptimized machine types for workloads).
  • Lack of automation for stopping or terminating instances.
  • Persistent disks and snapshots not deleted post VM termination.
  • VMs running in premium zones instead of cost-effective regions.

Impact:

  • Elevated monthly compute costs.
  • Inefficient resource utilization.
  • Budget overruns and reduced ROI from cloud migration.

Mitigation Approach:

Upon further detailed analysis and diagnosis, the following methods can be applied; a code sketch follows the list.

  • Implement VM lifecycle automation (auto-stop and auto-delete policies).
  • Right-size VM instances based on usage metrics (use Recommender API).
  • Use instance schedules for dev/test environments.
  • Regularly audit and delete unused persistent disks and snapshots.
  • Prefer standard zones over premium zones for deployment.
  • Enforce labels and budgets with alerts to track and govern usage.
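
For the right-sizing item, the Recommender API exposes machine-type recommendations per zone. A minimal sketch that lists them with the Python client; the project and zone are placeholders.

```python
from google.cloud import recommender_v1

client = recommender_v1.RecommenderClient()
parent = (
    "projects/my-project/locations/asia-south1-a/"
    "recommenders/google.compute.instance.MachineTypeRecommender"
)

# Each recommendation describes a resize (e.g. a smaller machine type)
# together with its projected cost impact.
for rec in client.list_recommendations(parent=parent):
    print(rec.description, rec.primary_impact.cost_projection.cost)
```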

5. High Dataproc Cost

Problem Statement:

  • Dataproc clusters often run continuously or at full scale even during non-peak hours.

Probable Causes:

  • Always-on mode.
  • No auto-decommissioning.

Impact:

  • Large compute and storage bills.

Mitigation Approach:

Upon further detailed analysis and diagnosis, the following methods can be applied; a code sketch follows the list.

  • Enable auto-cluster termination.
  • Use single-node clusters for dev/testing.
  • Schedule shutdowns.
  • Move batch workloads to BigQuery or Dataflow if feasible.
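
The auto-termination item maps to Dataproc's scheduled-deletion settings. A sketch of creating an ephemeral cluster with an idle TTL is shown below; the project, region, and cluster name are placeholder assumptions.

```python
from google.cloud import dataproc_v1

PROJECT = "my-project"      # placeholder project ID
REGION = "asia-south1"      # placeholder region

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT,
    "cluster_name": "etl-ephemeral",
    "config": {
        "lifecycle_config": {
            "idle_delete_ttl": {"seconds": 30 * 60},      # delete after 30 min idle
            "auto_delete_ttl": {"seconds": 6 * 60 * 60},  # hard stop after 6 hours
        },
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
)
operation.result()  # block until the cluster is created
```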

Other Major Challenges:

1. Application / Pipeline Not Running Properly due to Null Values in Source Data

Problem Statement:

Pipelines fail or behave unpredictably when nulls are present in critical fields (e.g., primary keys, partition fields), causing downstream job failures and unnecessary reruns.

Causes:

  • Ingesting unclean source data from Cloud Storage, Pub/Sub, etc.
  • Lack of schema enforcement.
  • No null checks during transformation.

Impact:

  • Dataflow/Dataproc/Composer jobs fail, triggering retries.
  • High costs due to repeated job execution.
  • Unreliable analytics and reporting.

Mitigation Approach:

  • Add null-checks at ingestion in Dataflow/Dataproc jobs.
  • Use Cloud Functions to validate files before triggering pipelines.
  • Store only validated data in BigQuery to avoid appending nulls.
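
As a sketch of the "validate files before triggering pipelines" step, a Cloud Functions handler fired on object finalize can reject files whose key fields contain nulls. The bucket trigger, key column names, and CSV format below are illustrative assumptions.

```python
import io

import pandas as pd
from google.cloud import storage

KEY_COLUMNS = ["customer_id", "invoice_date"]  # hypothetical key fields

def validate_source_file(event, context):
    """GCS-triggered Cloud Functions entry point (object finalize event)."""
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    df = pd.read_csv(io.BytesIO(blob.download_as_bytes()))

    null_rows = df[KEY_COLUMNS].isnull().any(axis=1)
    if null_rows.any():
        # Fail fast so the downstream pipeline is never triggered with bad data;
        # the file could instead be moved to a quarantine bucket for review.
        raise ValueError(
            f"{int(null_rows.sum())} rows in {event['name']} have null key fields"
        )
    # Clean file: hand off to the pipeline, e.g. by publishing a Pub/Sub message.
```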

2. Null Values Causing Extra Cost

Problem Statement:

Null values can indirectly cause extra cost under BigQuery's slot-based pricing, depending on query logic and pipeline design, even though slots are billed at a flat rate. For example, if ETL jobs or queries scan large tables with mostly null or sparse columns, slot time is wasted without producing value. Nulls can also complicate joins and filters, requiring extra processing and re-runs when results are invalid due to incorrect handling. Queries that scan tables with large volumes of null values therefore lead to unnecessary processing and cost.

Causes:

  • Poor data quality at source.
  • No filters to exclude nulls.

Impact:

  • More scanned bytes = higher BigQuery query costs.

Mitigation Approach:

  • Use Python scripts in Apache Airflow to filter nulls before loading into BigQuery.
  • Apply WHERE field IS NOT NULL in all queries.
  • Use data preprocessing pipelines (Dataflow) to cleanse before ingestion.
  • Use BigQuery storage optimization with column pruning and partition filters.
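
A minimal sketch of the query-side guards above (the IS NOT NULL filter, column pruning, and partition filters); the table, column, and partitioning scheme are illustrative assumptions. Selecting only needed columns, restricting partitions, and excluding null keys all reduce the bytes BigQuery has to scan.

```python
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT customer_id, invoice_date, amount      -- column pruning: only needed fields
    FROM `my-project.analytics.sales`
    WHERE invoice_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)  -- partition filter
      AND customer_id IS NOT NULL                 -- drop null keys before joins/aggregates
"""

# Iterate over the filtered result set; downstream handling is omitted here.
for row in client.query(query).result():
    print(row.customer_id, row.amount)
```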