This page documents GCP and other services used by the Data Team.
Services Naming Conventions
The naming convention for services follows the format businessFunction-subcategory-frequency-source-task. The strings in the name are dash (“-”) separated.
businessFunction
- Refers to the main function that the service is used for, and under which it will be grouped.
E.g.
ingest-xxxxx
subcategory
- Within a business function, there can be various subtasks that each function performs, related to the categories. Subcategory details are dash (“-”) separated.
Can accept up to 2 subcategories, which should be hierarchical: the second category is intuitively a subset of the first.
E.g.
ingest-batch-xxxxxx
frequency
- Refers to how often the service runs (daily, hourly, etc.).
E.g.
ingest-batch-daily-xxxxxx
source
- Refers to the name of the data source.
Can accept up to 2 source names (source-subset/report), which should be hierarchical: the second name is intuitively a subset of the first.
E.g.
ingest-batch-daily-jira-issues-xxxxxx
task
- Contains a more specific description of the subtask or job handled by each function. Can be anything as long as it is documented.
E.g.
ingest-batch-daily-jira-issues-bqwrite
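The convention above can be sketched as a small helper that assembles a service name from its five components. This is only an illustration: the function name `build_service_name`, the requirement of at least one subcategory and one source, and the rule that a single component may not itself contain a dash are assumptions, not part of the documented convention.

```python
def build_service_name(business_function, subcategories, frequency, sources, task):
    """Join the components into a dash-separated service name.

    Hypothetical helper: the argument names mirror the convention's fields.
    """
    # The convention allows up to 2 subcategories and up to 2 source names.
    # Requiring at least one of each is an assumption of this sketch.
    if not 1 <= len(subcategories) <= 2:
        raise ValueError("expected 1 or 2 subcategories")
    if not 1 <= len(sources) <= 2:
        raise ValueError("expected 1 or 2 source names")

    parts = [business_function, *subcategories, frequency, *sources, task]

    # Assumption: a component must not contain '-', otherwise the
    # dash-separated name could no longer be read unambiguously.
    if any(not part or "-" in part for part in parts):
        raise ValueError("components must be non-empty and must not contain '-'")

    # Lowercased because Cloud Function names must be lowercase (see Limits).
    return "-".join(parts).lower()


name = build_service_name("ingest", ["batch"], "daily", ["jira", "issues"], "bqwrite")
print(name)  # ingest-batch-daily-jira-issues-bqwrite
```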
There are two (2) main workloads that the data engineering team is working on: ingestion and transformation.
Ingestion
Data ingestion runs on Cloud Functions; the implementation is described on this page: https://safibank.atlassian.net/l/cp/iuGrcfHp
Cloud Scheduler | Pub/Sub | Subscription | Cloud Function | Cloud Storage | Bigquery (Source Layer) | Description |
---|---|---|---|---|---|---|
| | | | | | | Every 5AM |
Transformation
Cloud Scheduler | Cloud Build | Cloud Run | Description |
---|---|---|---|
| | | | Every 5:30 AM |
Loading
BigQuery Naming Conventions
The naming convention for BigQuery tables follows the format dag stage_source_subset_version. The strings in the name are underscore (“_”) separated.
dag stage
- Refers to the sequence in the transformation pipeline.
E.g.
raw_xxxxx
source
- Refers to the name of the data source.
E.g.
raw_jira_xxxxxx
subset
- For each data source, there can be multiple sub data sources or reports.
Can accept up to 2 subset names, which should be hierarchical: the second subset is intuitively a subset of the first.
E.g.
raw_jira_servicedesk_issue_xxxxxx
version
-
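As a rough sketch, the table naming convention could be applied with a helper like the one below. The function name `build_table_name` is hypothetical, and because this page does not specify the format of the version component, the sketch treats it as an optional string supplied by the caller. Restricting each component to letters and digits is also an assumption, made so that the underscore separators stay unambiguous.

```python
import re


def build_table_name(dag_stage, source, subsets, version=None):
    """Join the components into an underscore-separated table name.

    Hypothetical helper; `version` is optional because its format is
    not documented on this page.
    """
    # The convention allows up to 2 subset names.
    if len(subsets) > 2:
        raise ValueError("expected at most 2 subset names")

    parts = [dag_stage, source, *subsets]
    if version is not None:
        parts.append(version)

    # Assumption: each component is letters and digits only, so that
    # '_' appears in the final name purely as a separator.
    for part in parts:
        if not re.fullmatch(r"[A-Za-z0-9]+", part):
            raise ValueError(f"invalid component: {part!r}")

    return "_".join(parts)


print(build_table_name("raw", "jira", ["servicedesk", "issue"]))
# raw_jira_servicedesk_issue
```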
Cloud Storage Naming Conventions
The naming convention for Cloud Storage follows the format <bucketname>/<source>/<subset>/<yyyy>/<mm>/<dd>. Path components are separated by “/”, and the strings within a component are hyphen (“-”) separated.
bucketname
- Refers to the name of the bucket in Cloud Storage.
E.g.
data-automation-raw-data/xxxxxx
source
- Refers to the name of the data source.
E.g.
data-automation-raw-data/jira/xxxxxx
subset
- For each data source, there can be multiple sub data sources or reports.
Can accept up to 2 subset names, which should be hierarchical: the second subset is intuitively a subset of the first.
E.g.
data-automation-raw-data/jira/servicedesk-issues/xxxxxx
date
- Refers to the ingestion date. Format is yyyy/mm/dd
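The path layout above can be illustrated with a short helper that builds the prefix for a given ingestion date. This is a sketch under assumptions: the function name `build_object_prefix` is hypothetical, the date used is arbitrary, and joining two subset names with a hyphen follows the servicedesk-issues example rather than an explicit rule.

```python
from datetime import date


def build_object_prefix(bucket, source, subsets, ingestion_date):
    """Return <bucketname>/<source>/<subset>/<yyyy>/<mm>/<dd> for one ingestion date.

    Hypothetical helper for illustration only.
    """
    # The convention allows up to 2 subset names; this sketch requires at least one.
    if not 1 <= len(subsets) <= 2:
        raise ValueError("expected 1 or 2 subset names")

    # Assumption: multiple subset names are hyphen-joined into one path
    # component, as in the 'servicedesk-issues' example.
    subset = "-".join(subsets)

    # The date format spec zero-pads month and day, giving yyyy/mm/dd.
    return f"{bucket}/{source}/{subset}/{ingestion_date:%Y/%m/%d}"


prefix = build_object_prefix(
    "data-automation-raw-data", "jira", ["servicedesk", "issues"], date(2023, 1, 5)
)
print(prefix)  # data-automation-raw-data/jira/servicedesk-issues/2023/01/05
```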
Limits
Cloud Function - the name must be lowercase. It must start with a letter, followed by up to 62 letters, numbers, or hyphens, and must end with a letter or a number.
Pub/Sub topic - the name must start with a letter and can contain up to 255 characters. It should only contain the following characters: letters [A-Za-z], numbers [0-9], dashes -, underscores _, periods ., tildes ~, plus signs +, and percent signs %.
BigQuery (datasets and tables) - dataset names and table names can contain up to 1024 characters and may contain only letters (uppercase or lowercase), numbers, and underscores.
Cloud Storage - bucket names must contain 3 to 63 characters. Object names can contain any sequence of valid Unicode characters, of length 1-1024 bytes when UTF-8 encoded, and must not contain Carriage Return or Line Feed characters.
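The limits listed above can be checked before deployment with a few simple validators. The sketch below encodes only the rules as stated on this page; the function names are hypothetical, and GCP may enforce additional constraints that are not captured here.

```python
import re

# Lowercase; starts with a letter, then up to 62 letters, numbers, or
# hyphens, ending with a letter or a number (63 characters at most).
CLOUD_FUNCTION_RE = re.compile(r"[a-z]([a-z0-9-]{0,61}[a-z0-9])?")

# Starts with a letter; up to 255 characters from the allowed set.
PUBSUB_TOPIC_RE = re.compile(r"[A-Za-z][A-Za-z0-9\-_.~+%]{0,254}")

# Letters, numbers, and underscores; up to 1024 characters.
BIGQUERY_NAME_RE = re.compile(r"[A-Za-z0-9_]{1,1024}")


def is_valid_cloud_function_name(name):
    return CLOUD_FUNCTION_RE.fullmatch(name) is not None


def is_valid_pubsub_topic(name):
    return PUBSUB_TOPIC_RE.fullmatch(name) is not None


def is_valid_bigquery_name(name):
    return BIGQUERY_NAME_RE.fullmatch(name) is not None


def is_valid_bucket_name_length(name):
    # Only the 3 to 63 character length rule from this page is checked.
    return 3 <= len(name) <= 63


def is_valid_object_name(name):
    # 1-1024 bytes when UTF-8 encoded, with no carriage return or line feed.
    size = len(name.encode("utf-8"))
    return 1 <= size <= 1024 and "\r" not in name and "\n" not in name


print(is_valid_cloud_function_name("ingest-batch-daily-jira-issues-bqwrite"))  # True
print(is_valid_cloud_function_name("Ingest-batch"))  # False
```

Running such checks in a deployment script or CI step would catch names that follow the convention but exceed a service limit, for example a Cloud Function name longer than 63 characters.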