
Problem
Legacy on-prem data pipelines and monolithic ETL scripts were causing high compute costs, fragile data workflows, and slow analytics turnaround. The organization needed a scalable, governable cloud-based solution to process high-volume data reliably while reducing cost and enabling self-serve analytics.
Approach
- Build high-throughput ETL pipelines
Developed three Spark-based ETL pipelines, orchestrated with Airflow, to reliably process 500M+ rows per day, with a focus on compute efficiency and robust failure handling (a minimal orchestration sketch follows this list).
- Replatform & modernize
Replatformed 80+ datasets and refactored 70+ legacy scripts from on-prem MSSQL/Oracle to GCP (BigQuery/Dataproc), optimizing storage and query performance with columnar formats (Parquet) and schema improvements (see the replatforming sketch below).
- Embed data quality and governance
Introduced dbt and Great Expectations to automate data validation and lineage checks, turning reporting pipelines into analytics-ready data products and improving trust in downstream models (a validation sketch follows).
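To make the orchestration concrete, here is a minimal sketch of how one of these Airflow-orchestrated Spark pipelines could be wired up, assuming Dataproc as the Spark runtime. The DAG ID, project, cluster, bucket, and schedule are placeholders rather than the production values.

```python
# Hypothetical Airflow DAG submitting a PySpark ETL job to Dataproc.
# All identifiers (project, cluster, bucket, DAG ID) are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

PYSPARK_JOB = {
    "reference": {"project_id": "example-project"},
    "placement": {"cluster_name": "etl-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://example-bucket/jobs/daily_events_etl.py"
    },
}

with DAG(
    dag_id="daily_events_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        # Robust failure handling: retry transient cluster/IO errors
        # instead of paging on the first failure.
        "retries": 3,
        "retry_delay": timedelta(minutes=10),
    },
) as dag:
    run_etl = DataprocSubmitJobOperator(
        task_id="run_daily_events_etl",
        job=PYSPARK_JOB,
        region="us-central1",
        project_id="example-project",
    )
```

Keeping retries and scheduling at the orchestration layer leaves the Spark jobs themselves stateless and safely rerunnable.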
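The replatforming step in miniature: a hypothetical PySpark job that pulls a legacy MSSQL table over JDBC and lands it as partitioned Parquet on GCS. Connection details, table names, and bucket paths are illustrative; credentials would come from a secret manager in practice.

```python
# Hypothetical replatforming job: legacy MSSQL table -> partitioned Parquet.
# URLs, table names, and paths are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("replatform_orders").getOrCreate()

# Read the legacy table over JDBC (requires the SQL Server JDBC driver).
legacy_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://legacy-host:1433;databaseName=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", "***")  # illustrative; use a secret manager
    .load()
)

# Schema improvement example: derive a typed DATE column to partition on.
cleaned = legacy_df.withColumn("order_date", F.to_date("order_ts"))

# Columnar, partitioned layout: Parquet compresses well, and
# Dataproc/BigQuery can prune partitions instead of scanning everything.
(
    cleaned.write.mode("overwrite")
    .partitionBy("order_date")
    .parquet("gs://example-bucket/warehouse/orders/")
)
```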
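And a minimal sketch of the validation layer, assuming the classic pandas-backed Great Expectations API; the column names and thresholds are illustrative, not the actual suite. In the real pipelines these checks run as gates before a dataset is published, alongside dbt tests.

```python
# Hypothetical data-quality gate using the classic pandas-backed
# Great Expectations API; columns and thresholds are illustrative.
import great_expectations as ge
import pandas as pd

# Validate a landed Parquet extract before publishing it downstream.
df = pd.read_parquet("orders_sample.parquet")  # placeholder path
gdf = ge.from_pandas(df)

gdf.expect_column_values_to_not_be_null("order_id")
gdf.expect_column_values_to_be_unique("order_id")
gdf.expect_column_values_to_be_between("amount", min_value=0)

result = gdf.validate()
if not result.success:
    # Fail the pipeline run rather than publish suspect data.
    raise ValueError("Data quality checks failed; blocking publish.")
```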
Challenges & Lessons Learned
- Migration complexity
Mapping legacy schemas and ETL logic to modern BigQuery models required careful refactor planning and cross-team coordination to avoid data regressions.
- Cost vs. performance tradeoffs
Balancing compute cost, latency, and maintainability led to targeted optimizations (e.g., partitioning, Parquet layout, and Spark tuning) rather than one-size-fits-all changes (see the partitioning sketch after this list).
- Operationalizing quality
Embedding automated tests and lineage into CI/CD pipelines reduced firefighting but required investment in test design and clear SLAs for dataset owners.
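As one example of the targeted-optimization pattern above, here is a hypothetical partitioned and clustered BigQuery table created through the Python client; project, dataset, and column names are placeholders. Partitioning bounds what each daily query scans (and therefore costs), while clustering co-locates rows for the most common filters.

```python
# Hypothetical targeted optimization: partition + cluster a BigQuery table
# so daily queries scan only the relevant slices. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

ddl = """
CREATE TABLE IF NOT EXISTS analytics.orders
PARTITION BY order_date        -- prune scans (and cost) to queried days
CLUSTER BY customer_id         -- co-locate rows for common filters
AS
SELECT * FROM staging.orders_raw
"""

client.query(ddl).result()  # block until the DDL job completes
```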
Outcomes & Next Steps
- Impact: Reduced compute time by 92%, saving an estimated $220K annually, while processing 500M+ rows per day.
- Operational gains: Replatforming and refactoring cut storage by ~2.5 TB and improved maintainability across 80+ datasets.
- Data trust: Data quality automation (dbt + Great Expectations) increased confidence in analytics and shortened time-to-insight for downstream teams.
- Next steps: Expand CDC-based ingestion for near-real-time analytics, evaluate Iceberg for time-travel and schema evolution, and scale the data quality framework across additional domains.
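For context on the Iceberg evaluation, the sketch below exercises the two features being assessed, time travel and schema evolution, from Spark SQL. It assumes Spark 3.3+ with the Iceberg Spark runtime on the classpath; the catalog, warehouse path, and table names are placeholders.

```python
# Exploratory sketch of the Iceberg features under evaluation.
# Assumes the iceberg-spark-runtime jar is available; names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg_eval")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "gs://example-bucket/iceberg/")
    .getOrCreate()
)

# Time travel: query the table as it existed at an earlier point in time.
spark.sql(
    "SELECT COUNT(*) FROM lake.db.orders "
    "TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()

# Schema evolution: add a column as a metadata-only change,
# without rewriting existing data files.
spark.sql("ALTER TABLE lake.db.orders ADD COLUMNS (discount_pct DOUBLE)")
```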