Background
This document describes the Data team's plans and implementation of its data engineering framework. Most of the pipelines follow an ELT pattern: raw, often unstructured, data is loaded immediately after extraction and transformed later. This approach lets the team ingest and process large amounts of data at a faster rate.
The table below summarizes the services used for the major steps in the pipeline. Detailed implementation of each part can be found in the links.
| Process | Source | Description | Link |
|---|---|---|---|
| Extract | API | Cloud Functions will be used for batch extracts | |
| Extract | Confluent Kafka | Kafka Sink Connector (maintained by the IT team) | |
| Load | BigQuery (Data Lake) | Cloud Functions will be used | |
| Transform | API and Kafka | dbt Core will be used | |
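For illustration, the sketch below shows how an HTTP-triggered Cloud Function could perform the batch extract from the API and the raw load into the BigQuery data lake. The project, endpoint, table, and secret names are placeholders, not the team's actual configuration.

```python
"""Minimal sketch of a batch extract-and-load Cloud Function (ELT style)."""
import requests
import functions_framework
from google.cloud import bigquery, secretmanager

PROJECT_ID = "my-project"                      # placeholder project
API_URL = "https://api.example.com/v1/orders"  # placeholder endpoint
TABLE_ID = "my-project.raw_lake.orders"        # placeholder data-lake table


def get_api_key(secret_name: str) -> str:
    """Read the API key from Secret Manager instead of hard-coding it."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{PROJECT_ID}/secrets/{secret_name}/versions/latest"
    return client.access_secret_version(name=name).payload.data.decode("utf-8")


@functions_framework.http
def extract_and_load(request):
    """Extract a batch of records from the API and append them, untransformed,
    to the BigQuery data-lake table; transformation is left to dbt."""
    api_key = get_api_key("orders-api-key")  # placeholder secret name
    resp = requests.get(
        API_URL, headers={"Authorization": f"Bearer {api_key}"}, timeout=60
    )
    resp.raise_for_status()
    rows = resp.json()  # assumes the API returns a JSON array of records

    bq = bigquery.Client(project=PROJECT_ID)
    job_config = bigquery.LoadJobConfig(
        autodetect=True,  # let BigQuery infer the schema of the raw data
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = bq.load_table_from_json(rows, TABLE_ID, job_config=job_config)
    job.result()  # wait for the load job to finish
    return f"Loaded {len(rows)} rows into {TABLE_ID}"
```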
Architecture
The diagram below illustrates the tools used at each step of the data pipeline:
Data Engineering Best Practices
Code Modularity - Python scripts reuse the same functions for the following services: BigQuery, API requests, Cloud Storage, and Secret Manager (see the sketch after this list).
Standardized Naming Convention
Version Control - all code is stored in GitHub
Ensure Data Quality
Pipeline Monitoring
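To make the code-modularity point concrete, below is a minimal sketch of the kind of shared helper module the pipeline scripts could import. All function names and signatures are illustrative assumptions, not the team's actual code.

```python
"""Illustrative shared helpers for BigQuery, API requests, Cloud Storage,
and Secret Manager; names and signatures are placeholders."""
import requests
from google.cloud import bigquery, secretmanager, storage


def get_secret(project_id: str, secret_name: str) -> str:
    """Read the latest version of a secret from Secret Manager."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_name}/versions/latest"
    return client.access_secret_version(name=name).payload.data.decode("utf-8")


def fetch_json(url: str, token: str, timeout: int = 60) -> list:
    """Call an API endpoint and return its JSON payload."""
    resp = requests.get(
        url, headers={"Authorization": f"Bearer {token}"}, timeout=timeout
    )
    resp.raise_for_status()
    return resp.json()


def upload_to_gcs(bucket_name: str, blob_name: str, data: str) -> None:
    """Stage a raw payload in Cloud Storage."""
    bucket = storage.Client().bucket(bucket_name)
    bucket.blob(blob_name).upload_from_string(data)


def load_json_to_bq(table_id: str, rows: list) -> None:
    """Append raw JSON rows to a BigQuery table, auto-detecting the schema."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_json(rows, table_id, job_config=job_config).result()
```

Keeping these calls in one module means each extract or load script only wires together the helpers it needs, rather than re-implementing authentication and client setup per pipeline.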