SaFi Bank Space : DE Technical Documentation

This page documents the GCP services and other services used by the Data Team.

Services Naming Conventions

Service names follow the format businessFunction-subcategory-frequency-source-task. The parts of the name are “-” hyphen separated.

  • businessFunction - Refers to the main function that the service performs and under which it is grouped

    • E.g. ingest-xxxxx

  • subcategory - Within a business function, there can be various subtasks. The subcategory identifies which one the service performs.

    • Subcategory details are “-” hyphen separated

    • Accepts up to 2 subcategories, which should be hierarchical in nature: the second category is intuitively a subset of the first

    • E.g. ingest-batch-xxxxxx

  • frequency - Refers to how often the service runs (daily, hourly, etc.)

    • E.g. ingest-batch-daily-xxxxxx

  • source - Refers to the name of the data source

    • Accepts up to 2 source names (source-subset/report), which should be hierarchical in nature: the second name is intuitively a subset of the first

    • E.g. ingest-batch-daily-jira-issues-xxxxxx

  • task - Will contain more specific descriptions of the subtask or job handled by each function. Can be anything as long as it is documented.

    • E.g. ingest-batch-daily-jira-issues-bqwrite
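
The convention above can be checked mechanically. The following is a minimal sketch in Python, not an existing team utility: the function name build_service_name and the rule that each part contains only lowercase letters and digits are assumptions made for illustration. The task part is treated as optional because some resource names on this page (e.g. ingest-batch-daily-jira-issues) omit it.

```python
import re

# Assumption: a single part holds lowercase letters and digits only,
# because "-" is reserved as the separator between parts.
_PART = re.compile(r"[a-z][a-z0-9]*")


def build_service_name(business_function, subcategories, frequency, sources, task=None):
    """Join the parts into businessFunction-subcategory-frequency-source-task.

    subcategories and sources are lists of 1 or 2 items, ordered from the
    broader category to its subset. task is optional (assumption).
    """
    for label, items in (("subcategory", subcategories), ("source", sources)):
        if not 1 <= len(items) <= 2:
            raise ValueError(f"{label} accepts 1 or 2 values, got {len(items)}")

    parts = [business_function, *subcategories, frequency, *sources]
    if task is not None:
        parts.append(task)

    for part in parts:
        if not _PART.fullmatch(part):
            raise ValueError(f"invalid name part: {part!r}")

    return "-".join(parts)
```

For example, build_service_name("ingest", ["batch"], "daily", ["jira", "issues"], "bqwrite") reproduces the name ingest-batch-daily-jira-issues-bqwrite used in the example above.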

There are two (2) main workloads that the data engineering team works on: ingestion and transformation.

Ingestion

Data ingestion runs on Cloud Functions. The implementation is documented on this page: https://safibank.atlassian.net/l/cp/iuGrcfHp

| Cloud Scheduler | Pub/Sub | Subscription | Cloud Function | Cloud Storage | BigQuery (Source Layer) | Description |
|---|---|---|---|---|---|---|
| ingest-batch-daily-api-schedule | ingest-daily-api-topic | ingest-batch-daily-jira-issues | ingest-batch-daily-jira-issues-bqwrite | data-automation-raw-data/jira/issues/yyyy/mm/dd | jira.raw_jira_servicedesk_issues_v1 | Every 5AM |
| | | ingest-batch-daily-genesys-conversation | ingest-batch-daily-genesys-conversation-bqwrite | data-automation-raw-data/genesys/conversations/yyyy/mm/dd | genesys.raw_genesys_conversations_details_v1 | |
| ingest-batch-hourly-api-schedule | | | | | | |

Transformation

Cloud Scheduler

Cloud Build

Cloud Run

Description

transform-batch-daily-dbt-schedule

transform-batch-daily-dbt-job

Every 5:30 AM

Loading

BigQuery Naming Conventions

BigQuery table names follow the format <dag stage>_<source>_<subset>_<version>. The parts of the name are “_” underscore separated.

  • dag stage - Refers to the stage of the table in the transformation pipeline

    • E.g. raw_xxxxx

  • source - Refers to the name of the data source

    • E.g. raw_jira_xxxxxx

  • subset - For each data source, there can be multiple sub data sources or reports.

    • Accepts up to 2 subset names, which should be hierarchical in nature: the second name is intuitively a subset of the first

    • E.g. raw_jira_servicedesk_issues_xxxxxx

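  • version - Refers to the version of the table, written as “v” followed by a number

    • E.g. raw_jira_servicedesk_issues_v1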
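
The table naming format can be sketched as a small Python helper. This is an illustration only: the function name build_table_name is hypothetical, and the version suffix written as v followed by an integer is an assumption inferred from table names on this page such as jira.raw_jira_servicedesk_issues_v1.

```python
import re

# Assumption: a single part holds lowercase letters and digits only,
# because "_" is reserved as the separator between parts.
_PART = re.compile(r"[a-z][a-z0-9]*")


def build_table_name(dag_stage, source, subsets, version):
    """Return a name of the form <dag stage>_<source>_<subset>_v<version>.

    subsets is a list of 1 or 2 names, broader one first.
    version is a positive integer (assumed format: v1, v2, ...).
    """
    if not 1 <= len(subsets) <= 2:
        raise ValueError(f"subset accepts 1 or 2 names, got {len(subsets)}")
    if version < 1:
        raise ValueError("version must be a positive integer")

    parts = [dag_stage, source, *subsets]
    for part in parts:
        if not _PART.fullmatch(part):
            raise ValueError(f"invalid name part: {part!r}")

    return "_".join(parts) + f"_v{version}"
```

For example, build_table_name("raw", "jira", ["servicedesk", "issues"], 1) gives raw_jira_servicedesk_issues_v1.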

Cloud Storage Naming Conventions

Cloud Storage object paths follow the format <bucketname>/<source>/<subset>/<yyyy>/<mm>/<dd>. Path components are separated by “/”, and words within a component are “-” hyphen separated.

  • bucketname - Refers to the name of the bucket in cloud storage

    • E.g. data-automation-raw-data/xxxxxx

  • source - Refers to the name of the data source

    • E.g. data-automation-raw-data/jira/xxxxxx

  • subset - For each data source, there can be multiple sub data sources or reports.

    • Accepts up to 2 subset names, which should be hierarchical in nature: the second name is intuitively a subset of the first

    • E.g. data-automation-raw-data/jira/servicedesk-issues/xxxxxx

  • date - Refers to the ingestion date. Format is yyyy/mm/dd
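
The path format above can be sketched in Python as follows. The function name build_object_prefix is hypothetical, and joining two subset names with a hyphen is an assumption based on the servicedesk-issues example.

```python
from datetime import date


def build_object_prefix(bucket, source, subsets, ingestion_date):
    """Return <bucketname>/<source>/<subset>/<yyyy>/<mm>/<dd>.

    subsets is a list of 1 or 2 names, broader one first; two names are
    joined with "-" (assumption, following "servicedesk-issues").
    ingestion_date is a datetime.date.
    """
    if not 1 <= len(subsets) <= 2:
        raise ValueError(f"subset accepts 1 or 2 names, got {len(subsets)}")

    subset = "-".join(subsets)
    # The format spec zero-pads month and day, giving yyyy/mm/dd.
    return f"{bucket}/{source}/{subset}/{ingestion_date:%Y/%m/%d}"
```

For example, build_object_prefix("data-automation-raw-data", "jira", ["servicedesk", "issues"], date(2023, 1, 5)) gives data-automation-raw-data/jira/servicedesk-issues/2023/01/05. The date used here is illustrative.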

Limits

Cloud Function - The name must start with a lowercase letter, followed by up to 62 lowercase letters, numbers, or hyphens, and must end with a letter or a number.

Pub/Sub topic - The name must start with a letter and can contain up to 255 characters. It may contain only the following characters: letters [A-Za-z], numbers [0-9], dashes -, underscores _, periods ., tildes ~, plus signs +, and percent signs %

BigQuery (datasets and tables) - Dataset names and table names can contain up to 1024 characters and must consist of letters (uppercase or lowercase), numbers, and underscores.

Cloud Storage - Bucket names must contain 3 to 63 characters. Object names can contain any sequence of valid Unicode characters, of length 1-1024 bytes when UTF-8 encoded, and must not contain carriage return or line feed characters.
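
The limits above can be checked before a resource is created. The following Python sketch encodes only the rules as stated on this page, not the full GCP naming rules (for example, it checks only the length of bucket names). All function names are hypothetical.

```python
import re

# Cloud Function: a lowercase letter, then up to 62 lowercase letters,
# digits, or hyphens, ending with a letter or digit (63 characters max).
_CLOUD_FUNCTION = re.compile(r"[a-z](?:[a-z0-9-]{0,61}[a-z0-9])?")

# Pub/Sub topic: starts with a letter, up to 255 characters in total.
_PUBSUB_TOPIC = re.compile(r"[A-Za-z][A-Za-z0-9._~+%-]{0,254}")

# BigQuery dataset or table: letters, digits, underscores, up to 1024 characters.
_BIGQUERY_NAME = re.compile(r"[A-Za-z0-9_]{1,1024}")


def is_valid_cloud_function_name(name):
    return _CLOUD_FUNCTION.fullmatch(name) is not None


def is_valid_pubsub_topic(name):
    return _PUBSUB_TOPIC.fullmatch(name) is not None


def is_valid_bigquery_name(name):
    return _BIGQUERY_NAME.fullmatch(name) is not None


def is_valid_bucket_name(name):
    # Only the length limit from this page is checked here.
    return 3 <= len(name) <= 63


def is_valid_object_name(name):
    # Length is measured in bytes after UTF-8 encoding.
    size = len(name.encode("utf-8"))
    return 1 <= size <= 1024 and "\r" not in name and "\n" not in name
```

For example, is_valid_cloud_function_name("ingest-batch-daily-jira-issues-bqwrite") returns True, while is_valid_bigquery_name("raw-jira") returns False because hyphens are not allowed.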