
Problem
Legacy on-prem data pipelines and monolithic ETL scripts were causing high compute costs, fragile data workflows, and slow analytics turnaround. The organization needed a scalable, governable cloud-based solution to process high-volume data reliably while reducing cost and enabling self-serve analytics.
Approach
- Build high-throughput ETL pipelines
Developed three Spark-based ETL pipelines, orchestrated with Airflow, to reliably process 500M+ rows per day, with a focus on compute efficiency and robust failure handling (a minimal orchestration sketch follows this list).
- Replatform & modernize
Replatformed 80+ datasets and refactored 70+ legacy scripts from on-prem MSSQL/Oracle to GCP (BigQuery/Dataproc), optimizing storage and query performance with columnar formats (Parquet) and schema improvements (see the replatforming sketch below).
- Embed data quality and governance
Introduced dbt and Great Expectations to automate data validation and lineage checks, turning reporting pipelines into analytics-ready data products and improving trust in downstream models (a validation sketch follows).
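To make the orchestration concrete, here is a minimal sketch of how one of these Airflow-orchestrated Spark pipelines could be wired up, assuming Dataproc as the Spark runtime. The DAG ID, project, cluster, bucket, and schedule are placeholders rather than the production values.

```python
# Hypothetical Airflow DAG submitting a PySpark ETL job to Dataproc.
# All identifiers (project, cluster, bucket, DAG ID) are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

PYSPARK_JOB = {
    "reference": {"project_id": "example-project"},
    "placement": {"cluster_name": "etl-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://example-bucket/jobs/daily_events_etl.py"
    },
}

with DAG(
    dag_id="daily_events_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        # Robust failure handling: retry transient cluster/IO errors
        # instead of paging on the first failure.
        "retries": 3,
        "retry_delay": timedelta(minutes=10),
    },
) as dag:
    run_etl = DataprocSubmitJobOperator(
        task_id="run_daily_events_etl",
        job=PYSPARK_JOB,
        region="us-central1",
        project_id="example-project",
    )
```

Keeping retries and scheduling at the orchestration layer leaves the Spark jobs themselves stateless and safely rerunnable.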
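The replatforming step in miniature: a hypothetical PySpark job that pulls a legacy MSSQL table over JDBC and lands it as partitioned Parquet on GCS. Connection details, table names, and bucket paths are illustrative; credentials would come from a secret manager in practice.

```python
# Hypothetical replatforming job: legacy MSSQL table -> partitioned Parquet.
# URLs, table names, and paths are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("replatform_orders").getOrCreate()

# Read the legacy table over JDBC (requires the SQL Server JDBC driver).
legacy_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://legacy-host:1433;databaseName=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", "***")  # illustrative; use a secret manager
    .load()
)

# Schema improvement example: derive a typed DATE column to partition on.
cleaned = legacy_df.withColumn("order_date", F.to_date("order_ts"))

# Columnar, partitioned layout: Parquet compresses well, and
# Dataproc/BigQuery can prune partitions instead of scanning everything.
(
    cleaned.write.mode("overwrite")
    .partitionBy("order_date")
    .parquet("gs://example-bucket/warehouse/orders/")
)
```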
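And a minimal sketch of the validation layer, assuming the classic pandas-backed Great Expectations API; the column names and thresholds are illustrative, not the actual suite. In the real pipelines these checks run as gates before a dataset is published, alongside dbt tests.

```python
# Hypothetical data-quality gate using the classic pandas-backed
# Great Expectations API; columns and thresholds are illustrative.
import great_expectations as ge
import pandas as pd

# Validate a landed Parquet extract before publishing it downstream.
df = pd.read_parquet("orders_sample.parquet")  # placeholder path
gdf = ge.from_pandas(df)

gdf.expect_column_values_to_not_be_null("order_id")
gdf.expect_column_values_to_be_unique("order_id")
gdf.expect_column_values_to_be_between("amount", min_value=0)

result = gdf.validate()
if not result.success:
    # Fail the pipeline run rather than publish suspect data.
    raise ValueError("Data quality checks failed; blocking publish.")
```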
Challenges & Lessons Learned
- Migration complexity
Mapping legacy schemas and ETL logic to modern BigQuery models required careful refactor planning and cross-team coordination to avoid data regressions.
- Cost vs. performance tradeoffs
Balancing compute cost, latency, and maintainability led to targeted optimizations (e.g., partitioning, Parquet layout, and Spark tuning) rather than one-size-fits-all changes (see the partitioning sketch after this list).
- Operationalizing quality
Embedding automated tests and lineage into CI/CD pipelines reduced firefighting but required investment in test design and clear SLAs for dataset owners.
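As one example of the targeted-optimization pattern above, here is a hypothetical partitioned and clustered BigQuery table created through the Python client; project, dataset, and column names are placeholders. Partitioning bounds what each daily query scans (and therefore costs), while clustering co-locates rows for the most common filters.

```python
# Hypothetical targeted optimization: partition + cluster a BigQuery table
# so daily queries scan only the relevant slices. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

ddl = """
CREATE TABLE IF NOT EXISTS analytics.orders
PARTITION BY order_date        -- prune scans (and cost) to queried days
CLUSTER BY customer_id         -- co-locate rows for common filters
AS
SELECT * FROM staging.orders_raw
"""

client.query(ddl).result()  # block until the DDL job completes
```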
Outcomes & Next Steps
- Impact: Reduced compute time by 92%, saving an estimated $220K annually, while processing 500M+ rows per day.
- Operational gains: Replatforming and refactoring cut storage by ~2.5 TB and improved maintainability across 80+ datasets.
- Data trust: Data quality automation (dbt + Great Expectations) increased confidence in analytics and shortened time-to-insight for downstream teams.
- Next steps: Expand CDC-based ingestion for near-real-time analytics, evaluate Iceberg for time-travel and schema evolution, and scale the data quality framework across additional domains.
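For context on the Iceberg evaluation, the sketch below exercises the two features being assessed, time travel and schema evolution, from Spark SQL. It assumes Spark 3.3+ with the Iceberg Spark runtime on the classpath; the catalog, warehouse path, and table names are placeholders.

```python
# Exploratory sketch of the Iceberg features under evaluation.
# Assumes the iceberg-spark-runtime jar is available; names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg_eval")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "gs://example-bucket/iceberg/")
    .getOrCreate()
)

# Time travel: query the table as it existed at an earlier point in time.
spark.sql(
    "SELECT COUNT(*) FROM lake.db.orders "
    "TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()

# Schema evolution: add a column as a metadata-only change,
# without rewriting existing data files.
spark.sql("ALTER TABLE lake.db.orders ADD COLUMNS (discount_pct DOUBLE)")
```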