Problem

Legacy on-prem data pipelines and monolithic ETL scripts were driving up compute costs, making workflows fragile, and slowing analytics turnaround. The organization needed a scalable, governable, cloud-based solution that could process high-volume data reliably while reducing cost and enabling self-serve analytics.

Approach

  1. Build high-throughput ETL pipelines

    Developed 3 Spark-based ETL pipelines, orchestrated with Airflow, to reliably process 500M+ daily rows, with a focus on compute efficiency and robust failure handling (a minimal orchestration sketch follows this list).

  2. Replatform & modernize

    Replatformed 80+ datasets and refactored 70+ legacy scripts from on-prem MSSQL/Oracle to GCP (BigQuery/Dataproc), optimizing storage and query performance with columnar formats (Parquet) and schema improvements (see the replatforming sketch after this list).

  3. Embed data quality and governance

    Introduced dbt and Great Expectations to automate data validation and lineage checks, turning reporting pipelines into analytics-ready data products and improving trust in downstream models (see the validation sketch after this list).
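
A minimal orchestration sketch for step 1: an Airflow DAG (2.4+ style) that submits one of the PySpark ETL jobs to Dataproc, with retries as a first line of failure handling. The project, bucket, and cluster names are placeholders rather than the real configuration.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Placeholder job spec: run one PySpark ETL script stored in GCS on an existing cluster.
PYSPARK_JOB = {
    "reference": {"project_id": "example-project"},
    "placement": {"cluster_name": "etl-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://example-bucket/jobs/daily_events_etl.py"},
}

with DAG(
    dag_id="daily_events_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={
        # Retry transient cluster/network failures before paging anyone.
        "retries": 3,
        "retry_delay": timedelta(minutes=10),
    },
) as dag:
    run_spark_etl = DataprocSubmitJobOperator(
        task_id="run_spark_etl",
        job=PYSPARK_JOB,
        region="us-central1",
        project_id="example-project",
    )
```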
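
For step 2, the replatforming pattern is roughly: pull a legacy table over JDBC with Spark on Dataproc, land it as date-partitioned Parquet in GCS, and publish it to BigQuery through the Spark-BigQuery connector. The connection string, table, bucket, and column names below are illustrative placeholders, not the project's actual values.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("replatform_orders").getOrCreate()

# Extract from the legacy on-prem database (Oracle shown; MSSQL is analogous).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//legacy-host:1521/ORCL")
    .option("dbtable", "SALES.ORDERS")
    .option("user", "etl_user")
    .option("password", "***")
    .option("fetchsize", 10000)
    .load()
)

# Land as columnar Parquet in the lake, partitioned by date to cut scan and storage costs.
(
    orders.write.mode("overwrite")
    .partitionBy("order_date")
    .parquet("gs://example-lake/bronze/orders/")
)

# Publish to BigQuery via the Spark-BigQuery connector available on Dataproc.
(
    orders.write.format("bigquery")
    .option("table", "analytics.orders")
    .option("temporaryGcsBucket", "example-temp-bucket")
    .mode("overwrite")
    .save()
)
```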
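
For step 3, the kind of validation automated with Great Expectations looks roughly like the sketch below (legacy PandasDataset-style API; the table, columns, and expectations are hypothetical stand-ins for the real suites).

```python
import great_expectations as ge
import pandas as pd

# Toy frame standing in for a staged dataset.
orders = pd.DataFrame(
    {"order_id": [1, 2, 3], "order_date": ["2024-01-01", "2024-01-01", "2024-01-02"]}
)

ge_orders = ge.from_pandas(orders)
ge_orders.expect_column_values_to_not_be_null("order_id")
ge_orders.expect_column_values_to_be_unique("order_id")
ge_orders.expect_column_values_to_match_strftime_format("order_date", "%Y-%m-%d")

# validate() aggregates expectation results; a failed run can block the publish step
# so bad data never reaches analytics-ready tables.
results = ge_orders.validate()
print(results.success)
```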

Challenges & Lessons Learned

  1. Migration complexity

    Mapping legacy schemas and ETL logic to modern BigQuery models required careful refactor planning and cross-team coordination to avoid data regressions.

  2. Cost vs. performance tradeoffs

    Balancing compute cost, latency, and maintainability led to targeted optimizations (e.g., partitioning, Parquet layout, and Spark tuning) rather than one-size-fits-all changes (see the partitioning example after this list).

  3. Operationalizing quality

    Embedding automated tests and lineage checks into CI/CD pipelines reduced firefighting, but required up-front investment in test design and clear SLAs for dataset owners (see the CI gate sketch after this list).
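
As a concrete example of the targeted optimizations in point 2, a date-partitioned, clustered BigQuery table keeps daily queries from scanning whole datasets. The sketch below uses the google-cloud-bigquery client; project, dataset, and column names are illustrative.

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "example-project.analytics.orders",
    schema=[
        bigquery.SchemaField("order_id", "INT64"),
        bigquery.SchemaField("customer_id", "INT64"),
        bigquery.SchemaField("order_date", "DATE"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ],
)

# Partition by date so daily queries scan only the slices they need...
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="order_date"
)
# ...and cluster on a common filter column to prune further within each partition.
table.clustering_fields = ["customer_id"]

client.create_table(table, exists_ok=True)
```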
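
For point 3, one low-ceremony way to wire quality into CI/CD is a pytest gate that shells out to `dbt test` and fails the build on any failing test; the actual commands and hooks in the project's pipelines may differ.

```python
import subprocess


def test_dbt_schema_tests_pass():
    # dbt exits non-zero when any test fails, which fails this pytest and
    # therefore blocks the CI/CD pipeline before deployment.
    result = subprocess.run(
        ["dbt", "test", "--project-dir", "."],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, result.stdout
```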

Outcomes & Next Steps

  • Impact: Reduced compute time by 92%, saving an estimated $220K annually, while processing 500M+ daily rows.
  • Operational gains: Replatforming and refactoring cut storage by ~2.5 TB and improved maintainability across 80+ datasets.
  • Data trust: Data quality automation (dbt + Great Expectations) increased confidence in analytics and shortened time-to-insight for downstream teams.
  • Next steps: Expand CDC-based ingestion for near-real-time analytics, evaluate Iceberg for time-travel and schema evolution, and scale the data quality framework across additional domains.