Background

This document describes the Data team's plan for, and implementation of, its data engineering framework. Most of the pipelines follow an ELT pattern: raw, often unstructured, data are loaded immediately after extraction and transformed afterwards. This approach lets the team ingest and process large amounts of data at a faster rate.

The table below summarizes the services used for the major steps in the pipeline. Detailed implementation notes for each part can be found in the links.

| Process   | Source               | Description                                      | Link                                          |
|-----------|----------------------|--------------------------------------------------|-----------------------------------------------|
| Extract   | API                  | Cloud Functions will be used for batch extracts  | https://safibank.atlassian.net/l/cp/P7A1KJE9  |
| Extract   | Confluent Kafka      | Kafka Sink Connector (IT Team)                   |                                               |
| Load      | BigQuery (Data Lake) | Cloud Functions will be used                     | https://safibank.atlassian.net/l/cp/P7A1KJE9  |
| Transform | API and Kafka        | DBT Core will be used                            | https://safibank.atlassian.net/l/cp/BtXShZG9  |
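To illustrate the extract and load steps above, the sketch below shows a Cloud Function that pulls a batch of records from a source API and appends them, raw, to a BigQuery data-lake table. The endpoint, secret name, table id, and function names are hypothetical placeholders, not the team's actual values; the real implementation is documented in the linked pages.

```python
# Hypothetical sketch of a batch extract-and-load Cloud Function.
# API_URL, RAW_TABLE, and the secret name are placeholders.
import json

import functions_framework  # Cloud Functions Python runtime
import requests
from google.cloud import bigquery, secretmanager

PROJECT_ID = "example-project"                      # placeholder
RAW_TABLE = "example-project.data_lake.raw_events"  # placeholder
API_URL = "https://api.example.com/v1/events"       # placeholder


def get_api_key(secret_name: str) -> str:
    """Fetch an API key from Secret Manager."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{PROJECT_ID}/secrets/{secret_name}/versions/latest"
    return client.access_secret_version(name=name).payload.data.decode("utf-8")


@functions_framework.http
def extract_and_load(request):
    """Extract a batch from the source API and load it raw into BigQuery."""
    api_key = get_api_key("events-api-key")
    response = requests.get(
        API_URL, headers={"Authorization": f"Bearer {api_key}"}, timeout=60
    )
    response.raise_for_status()
    records = response.json()

    # Load the raw JSON payloads as-is; transformation happens later in dbt.
    bq = bigquery.Client(project=PROJECT_ID)
    rows = [{"payload": json.dumps(r)} for r in records]
    errors = bq.insert_rows_json(RAW_TABLE, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")
    return f"Loaded {len(rows)} raw records", 200
```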

Architecture

The diagram below illustrates the different tools that will be used for each step of the data pipeline:

Data Engineering Best Practices

  1. Code Modularity - Python scripts reuse the same shared functions for the following services: BigQuery, API requests, Cloud Storage, and Secret Manager (a sketch follows this list).

  2. Standardized Naming Convention

  3. Version Control - all code is stored in GitHub

  4. Ensure Data Quality

  5. Pipeline Monitoring
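As an illustration of the code-modularity practice in item 1, the sketch below assumes a small shared module of helper functions that each pipeline script imports instead of re-implementing client setup per pipeline. The module name and function signatures are illustrative, not the team's actual code.

```python
# common/gcp_helpers.py - illustrative shared module; names are assumptions.
from google.cloud import bigquery, secretmanager, storage


def get_secret(project_id: str, secret_name: str) -> str:
    """Read the latest version of a secret from Secret Manager."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_name}/versions/latest"
    return client.access_secret_version(name=name).payload.data.decode("utf-8")


def upload_to_gcs(bucket_name: str, blob_name: str, data: bytes) -> None:
    """Write a raw extract file to Cloud Storage."""
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_string(data)


def load_json_rows(table_id: str, rows: list[dict]) -> None:
    """Append JSON rows to a BigQuery table, raising on any insert error."""
    errors = bigquery.Client().insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")
```

Pipeline scripts would then import these helpers rather than redefining client construction and credential handling in each extract or load job, which keeps the code consistent across sources.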
