Background

This document describes the Data team's framework for ingesting data from third-party APIs and loading it into the Data Lake. Python scripts request data from the third-party endpoints and load the responses into the Data Lake. The scripts run on Cloud Functions, Google's serverless compute service.

The following GCP services are used:

Cloud Scheduler - schedules the runs of the Cloud Functions.

Pub/Sub - triggers the Cloud Functions.

Cloud Functions - sends requests to the API and ingests the responses into the Data Lake.

BigQuery - stores the data in table form.

Cloud Storage - stores the data in its raw format.

Secret Manager - stores sensitive information such as the client ID and client token.
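To illustrate the trigger chain, the following is a minimal sketch of a Pub/Sub-triggered entry point. It assumes a first-generation background function, where Pub/Sub delivers the message body base64-encoded under the "data" key of the event. The helper name decode_pubsub_message and the "source" and "job" message fields are hypothetical and are not taken from the framework's main.py.

```python
import base64
import json


def decode_pubsub_message(event):
    """Return the JSON payload carried by a Pub/Sub-triggered event.

    Pub/Sub passes the message body base64-encoded under the "data" key.
    An event without data yields an empty dictionary.
    """
    raw = event.get("data")
    if not raw:
        return {}
    return json.loads(base64.b64decode(raw).decode("utf-8"))


def main(event, context):
    """Hypothetical entry point: report which job the message asks for."""
    payload = decode_pubsub_message(event)
    # "source" and "job" are assumed field names, not part of the framework.
    source = payload.get("source")
    job = payload.get("job")
    print(f"Triggered ingestion for source={source!r}, job={job!r}")
    return source, job
```

Under these assumptions, a Cloud Scheduler job would publish a message such as {"source": "genesys", "job": "conversation"} to the topic, and a single function could then serve several data sources.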

Architecture

Repository Folder Structure

.
├── poc
    ├── __init__.py
    └── cloud_functions_template
        ├── __init__.py
        ├── config
        │   ├── genesys
        │   │   ├── conversation.json
        │   │   ├── evaluation.json
        │   │   └── general.json
        │   └── jira
        │       └── servicedesk_issues.json
        ├── main.py
        ├── requirements.txt
        ├── schema
        │   ├── genesys
        │   │   ├── conversation.json
        │   │   └── evaluation.json
        │   └── jira
        │       └── servicedesk_issues.json
        └── utils
            ├── __init__.py
            ├── bq_helper.py
            ├── gcs_helper.py
            ├── general.py
            ├── requests_api.py
            └── secret_manager_helper.py

main.py - contains the entry point function that Cloud Functions executes.

config - folder where the JSON configuration files are stored, grouped by data source.

requirements.txt - text file that lists the library dependencies of the Python script.

schema - folder where the JSON schema files are stored, grouped by data source.

utils - folder where the Python utility classes are stored.
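Because config and schema share the same per-source layout, the files for a job can be located from the data source and job name alone. The sketch below shows one way to do that. The function names (job_paths, load_json, load_job_files) are hypothetical and do not describe the helpers in utils. It also assumes that a schema file is optional, since config/genesys/general.json has no counterpart under schema.

```python
import json
from pathlib import Path


def job_paths(source, job, base_dir="."):
    """Build the config and schema paths for one job of one data source."""
    base = Path(base_dir)
    return {
        "config": base / "config" / source / f"{job}.json",
        "schema": base / "schema" / source / f"{job}.json",
    }


def load_json(path):
    """Read a JSON file and return its parsed content."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def load_job_files(source, job, base_dir="."):
    """Return (config, schema) for a job; schema is None when no file exists."""
    paths = job_paths(source, job, base_dir)
    config = load_json(paths["config"])
    # Assumption: the schema is optional (general.json has none in the tree).
    schema = load_json(paths["schema"]) if paths["schema"].exists() else None
    return config, schema
```

For example, job_paths("jira", "servicedesk_issues") points at config/jira/servicedesk_issues.json and schema/jira/servicedesk_issues.json, relative to the function's root folder.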

Environment Variables

The Cloud Functions require the following environment variables:

  1. PROJECT_ID - The GCP project ID used throughout the Python script (e.g. PROJECT_ID="datatest-348502").

  2. SECRET_ID_[DATA SOURCE] - The ID of the Secret Manager secret used by the job (e.g. SECRET_ID_GENESYS="data-genesys-secret").
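The two variables above could be read and validated at start-up roughly as follows. This is a sketch only: the function names are hypothetical, and upper-casing the data source to form the variable name is an assumption based on the SECRET_ID_GENESYS example.

```python
import os


def secret_env_name(data_source):
    """Return the variable name for a data source, e.g. SECRET_ID_GENESYS."""
    # Assumption: the suffix is the upper-cased data source name.
    return f"SECRET_ID_{data_source.upper()}"


def read_settings(data_source, environ=None):
    """Read PROJECT_ID and the source's secret ID, failing fast if one is unset."""
    env = os.environ if environ is None else environ
    secret_var = secret_env_name(data_source)
    missing = [name for name in ("PROJECT_ID", secret_var) if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {"project_id": env["PROJECT_ID"], "secret_id": env[secret_var]}
```

Checking both variables up front makes a misconfigured deployment fail with a clear message instead of failing later inside an API call.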

Secret Manager

The jobs in the Cloud Functions use Secret Manager to store the sensitive information they need. The value of a secret should be in JSON format. For example:

{
  "clientid": "ac29bed-b20210-805bf4-3621b0eae72d0",
  "clientsecret": "yo1wr30u3fb7dochoieboiz2b2jrk9clih5i02ki"
}

The JSON value is parsed into a Python dictionary, so the job can read each field by key.
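A minimal sketch of fetching and parsing such a secret with the google-cloud-secret-manager client library is shown below. The function names are hypothetical and do not describe the actual contents of secret_manager_helper.py. Reading the "latest" version is also an assumption.

```python
import json


def secret_version_name(project_id, secret_id, version="latest"):
    """Build the resource name that Secret Manager expects for a version."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"


def parse_secret_payload(payload):
    """Turn the raw secret payload (bytes or str) into a dictionary."""
    if isinstance(payload, bytes):
        payload = payload.decode("utf-8")
    secret = json.loads(payload)
    if not isinstance(secret, dict):
        raise ValueError("Secret value must be a JSON object")
    return secret


def get_secret(project_id, secret_id):
    """Fetch a secret from Secret Manager and return it as a dictionary."""
    # Imported lazily so the helpers above work without the client library.
    from google.cloud import secretmanager

    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(
        request={"name": secret_version_name(project_id, secret_id)}
    )
    return parse_secret_payload(response.payload.data)
```

With the example secret above, get_secret(project_id, secret_id)["clientid"] would return the client ID, provided the function's service account has permission to access the secret.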
