SaFi Bank Space : DE Technical Documentation

This page documents the GCP and other services used by the Data Team.

Services Naming Conventions

Service names follow the format businessFunction-subcategory-frequency-source-task. The parts of the name are separated by “-” dashes.

businessFunction - Refers to the main function the service performs; services are grouped under it
- E.g. ingest-xxxxx
subcategory - Within a business function, each service can perform various subtasks; the subcategory identifies which one.
- Subcategory details are “-” dash separated
- Accepts up to 2 subcategories; they should be hierarchical, with the second intuitively a subset of the first
- E.g. ingest-batch-xxxxxx
frequency - Refers to how often the service runs (daily, hourly, etc.)
- E.g. ingest-batch-daily-xxxxxx
source - Refers to the name of the data source
- Accepts up to 2 source names (source-subset/report); they should be hierarchical, with the second intuitively a subset of the first
- E.g. ingest-batch-daily-jira-issues-xxxxxx
task - Will contain more specific descriptions of the subtask or job handled by each function. Can be anything as long as it is documented.
- E.g. ingest-batch-daily-jira-issues-bqwrite
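
The convention above can be sketched as a small helper that assembles a service name from its parts. This is only an illustration: the function name `build_service_name` is hypothetical, and the rule that each part is lowercase alphanumeric and starts with a letter is an assumption, not something this page states.

```python
import re

# Assumption: each part is lowercase letters/digits and starts with a letter.
_PART = re.compile(r"[a-z][a-z0-9]*")


def build_service_name(business_function, subcategories, frequency, sources, task):
    """Join the parts as businessFunction-subcategory-frequency-source-task."""
    if not 1 <= len(subcategories) <= 2:
        raise ValueError("expected 1 or 2 subcategories")
    if not 1 <= len(sources) <= 2:
        raise ValueError("expected 1 or 2 source names")
    parts = [business_function, *subcategories, frequency, *sources, task]
    for part in parts:
        if not _PART.fullmatch(part):
            raise ValueError(f"invalid name part: {part!r}")
    return "-".join(parts)


print(build_service_name("ingest", ["batch"], "daily", ["jira", "issues"], "bqwrite"))
# ingest-batch-daily-jira-issues-bqwrite
```

Passing the subcategories and sources as lists keeps the "up to 2" limit easy to check before the name is built.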

The data engineering team works on two (2) main workloads: ingestion and transformation.

Ingestion

Data ingestion runs on Cloud Functions. The implementation is described on this page: https://safibank.atlassian.net/l/cp/iuGrcfHp

Cloud Scheduler	Pub/Sub	Subscription	Cloud Function	Cloud Storage	Bigquery (Source Layer)	Description
`ingest-batch-daily-api-schedule`	`ingest-daily-api-topic`	`ingest-batch-daily-jira-issues`	`ingest-batch-daily-jira-issues-bqwrite`	`data-automation-raw-data/jira/issues/yyyy/mm/dd`	`jira.raw_jira_servicedesk_issues_v1`	Every 5AM
`ingest-batch-daily-api-schedule`	`ingest-daily-api-topic`	`ingest-batch-daily-genesys-conversation`	`ingest-batch-daily-genesys-conversation-bqwrite`	`data-automation-raw-data/genesys/conversations/yyyy/mm/dd`	`genesys.raw_genesys_conversations_details_v1`	Every 5AM
`ingest-batch-hourly-api-schedule`

Transformation

Cloud Scheduler	Cloud Build	Cloud Run	Description
`transform-batch-daily-dbt-schedule`		`transform-batch-daily-dbt-job`	Every 5:30 AM

Loading

Bigquery Naming Conventions

BigQuery table names follow the format <dag stage>_<source>_<subset>_<version>. The parts of the name are separated by “_” underscores.

dag stage - Refers to the sequence in the transformation pipeline
- E.g. raw_xxxxx
source - Refers to the name of the data source
- E.g. raw_jira_xxxxxx
subset - For each data source, there can be multiple sub data sources or reports.
- Accepts up to 2 subset names; they should be hierarchical, with the second intuitively a subset of the first
- E.g. raw_jira_servicedesk_issue_xxxxxx
version - Refers to the version of the table
- E.g. raw_jira_servicedesk_issues_v1
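
A minimal sketch of the table naming format, for illustration only. The helper name `build_table_name` is hypothetical, and the `v<number>` form of the version suffix is an assumption taken from the table names listed under Ingestion (for example `raw_jira_servicedesk_issues_v1`); this page does not define the version format.

```python
def build_table_name(dag_stage, source, subsets, version):
    """Join the parts as <dag stage>_<source>_<subset>_<version>."""
    if not 1 <= len(subsets) <= 2:
        raise ValueError("expected 1 or 2 subset names")
    # Assumption: the version is written as "v" followed by an integer.
    return "_".join([dag_stage, source, *subsets, f"v{version}"])


print(build_table_name("raw", "jira", ["servicedesk", "issues"], 1))
# raw_jira_servicedesk_issues_v1
```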

Cloud Storage Naming Conventions

Cloud Storage paths follow the format <bucketname>/<source>/<subset>/<yyyy>/<mm>/<dd>. Path segments are separated by “/”; words within a segment are separated by “-” hyphens.

bucketname - Refers to the name of the bucket in cloud storage
- E.g. data-automation-raw-data/xxxxxx
source - Refers to the name of the data source
- E.g. data-automation-raw-data/jira/xxxxxx
subset - For each data source, there can be multiple sub data sources or reports.
- Accepts up to 2 subset names; they should be hierarchical, with the second intuitively a subset of the first
- E.g. data-automation-raw-data/jira/servicedesk-issues/xxxxxx
date - Refers to the ingestion date. Format is yyyy/mm/dd
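
The path format above could be produced with a helper along these lines. The function name `build_object_prefix` and the example date are hypothetical; only the bucket, source, and subset values come from this page.

```python
from datetime import date


def build_object_prefix(bucket, source, subset, ingestion_date):
    """Return <bucketname>/<source>/<subset>/<yyyy>/<mm>/<dd>."""
    # %Y/%m/%d zero-pads the month and day, matching the yyyy/mm/dd format.
    return f"{bucket}/{source}/{subset}/{ingestion_date:%Y/%m/%d}"


print(build_object_prefix("data-automation-raw-data", "jira", "servicedesk-issues", date(2023, 1, 5)))
# data-automation-raw-data/jira/servicedesk-issues/2023/01/05
```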

Limits

Cloud Function - The name must start with a lowercase letter, followed by up to 62 lowercase letters, numbers, or hyphens, and must end with a letter or a number.

Pub/Sub topic - The name must start with a letter and can contain up to 255 characters. It may contain only the following characters: letters [A-Za-z], numbers [0-9], dashes -, underscores _, periods ., tildes ~, plus signs +, and percent signs %.

BigQuery (datasets and tables) - Dataset names and table names can contain up to 1024 characters and may contain only letters (uppercase or lowercase), numbers, and underscores.

Cloud Storage - Bucket names must contain 3 to 63 characters. Object names can contain any sequence of valid Unicode characters, of length 1-1024 bytes when UTF-8 encoded, and must not contain Carriage Return or Line Feed characters.
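
The limits above could be checked before a resource is created, for example with regular expressions. This is a rough sketch: the `is_valid_name` helper and the dictionary keys are hypothetical, and the patterns encode only the rules listed on this page. The services themselves may enforce additional rules, and for buckets only the length is checked here.

```python
import re

# Each pattern mirrors one of the limits listed above (assumed complete).
_LIMITS = {
    # Letter first, up to 62 more characters, ending with a letter or number.
    "cloud_function": re.compile(r"[a-z](?:[a-z0-9-]{0,61}[a-z0-9])?"),
    # Letter first, up to 255 characters from the allowed set.
    "pubsub_topic": re.compile(r"[A-Za-z][A-Za-z0-9\-_.~+%]{0,254}"),
    # Letters, numbers, and underscores, up to 1024 characters.
    "bigquery": re.compile(r"[A-Za-z0-9_]{1,1024}"),
    # Length check only: 3 to 63 characters.
    "bucket": re.compile(r".{3,63}"),
}


def is_valid_name(kind, name):
    """Return True if the whole name matches the pattern for its kind."""
    return _LIMITS[kind].fullmatch(name) is not None


print(is_valid_name("cloud_function", "ingest-batch-daily-jira-issues-bqwrite"))  # True
print(is_valid_name("bigquery", "raw-jira"))  # False
```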