Background
This document describes the Data team's process for extracting data from the Jira Cloud API. Currently, the team extracts data from the Jira Service Management product, which contains the list of issues/tickets per project.
Report | Endpoint | Description |
---|---|---|
Servicedesk Issues | /search | Searches issues using JQL. |
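For a quick sanity check, the endpoint can be called directly outside the pipeline. The snippet below is only an illustration; the placeholder credentials stand in for the username and API token that the pipeline reads from Secret Manager.

```python
import requests

BASE_URL = "https://safibank.atlassian.net/rest/api/2"

# Placeholder credentials; the pipeline reads the real ones from Secret Manager.
auth = ("<username>", "<api_token>")

response = requests.get(
    f"{BASE_URL}/search",
    headers={"Accept": "application/json"},
    params={"jql": "project IN (DT, BOSD)", "startAt": 0, "maxResults": 100},
    auth=auth,
)
response.raise_for_status()
print(response.json()["total"])  # total number of issues matching the JQL
```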
Workflow
Ingestion
Preparing the configuration files:
{ "base_url": "https://safibank.atlassian.net/rest/api/2", "url": "/search", "headers": { "Accept": "application/json" }, "parameters": { "query": { "jql": "project IN (DT, BOSD) and (updated >= '<start>' and updated < '<end>')" }, "startAt": 0, "maxResults": 100 }, "gcs_bucket_uri": "gs://data-automation-raw-data", "gcs_bucket_name": "data-automation-raw-data", "gcs_source": "jira", "gcs_format": "json", "gcs_report": "issues", "gcs_date_format": "%Y%m%d", "gcs_filename": "jira_servicedesk-issues_<date>_<page>.json", "bq_dataset": "jira", "bq_table": "issues_v7", "schema_filename": "schema/jira_servicedesk_issues.json" }
base_url - The main URL of the Jira API.
url - Specific endpoint that will be invoked in the Cloud Function.
headers - Metadata associated with the API request and response.
parameters - The values passed in the API GET request, including the JQL that filters on the project and updated fields.
gcs_* - These fields are used for the Cloud Storage bucket.
bq_* - These fields are used for the BigQuery load.
schema_filename - The full path of the schema file used for configuring the BigQuery table. Sample schema is located in the GitHub repository.
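main.py also relies on a few gen_util helpers that are referenced but not listed in this document. Below is a minimal sketch of how they could look, assuming get_api_config simply deserializes the configuration file and generate_gcs_filename fills the <date> and <page> placeholders; add_string_to_key_names is omitted because its exact key-renaming rules are not documented here.

```python
import json

def get_api_config(config_path):
    # Deserialize a configuration file such as config/jira/servicedesk_issues.json
    with open(config_path) as f:
        return json.load(f)

def get_year_month_day(run_date):
    # Zero-padded date parts used to build the GCS blob path
    return {'year': f'{run_date:%Y}', 'month': f'{run_date:%m}', 'day': f'{run_date:%d}'}

def generate_gcs_filename(filename_template, run_date, date_format, page):
    # jira_servicedesk-issues_<date>_<page>.json -> jira_servicedesk-issues_20230101_1.json
    return (filename_template
            .replace('<date>', run_date.strftime(date_format))
            .replace('<page>', str(page)))
```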
Python Script
The code for the Jira data ingestion is contained in main.py. When it runs, the Cloud Function passes data and context as parameters to the jira_api function.
```python
def jira_api(data, context):
    # Fetch Jira credentials from Secret Manager
    sm = SecretManagerHelper(os.environ['SECRET_ID_JIRA'])
    secret_in_json = sm.get_secret()
    api_config = gen_util.get_api_config('config/jira/servicedesk_issues.json')
    auth = (secret_in_json['username'], secret_in_json['password'])

    # Extract issues updated within the last day
    end_date = date.today()
    start_date = end_date - timedelta(1)

    gcs_filename = api_config['gcs_filename']
    gcs_file_ext = gcs_filename.split('.')[1]
    date_dct = gen_util.get_year_month_day(start_date)
    gcs_blob_path = f'{api_config["gcs_source"]}/' \
                    f'{api_config["gcs_report"]}/' \
                    f'{date_dct["year"]}/' \
                    f'{date_dct["month"]}/' \
                    f'{date_dct["day"]}'

    page_size = api_config['parameters']['maxResults']
    start_at = api_config['parameters']['startAt']
    more_results_flag = True
    page = 1

    # Substitute the <start>/<end> placeholders in the JQL, if present
    jql_query = api_config['parameters']['query']['jql']
    if jql_query.find('updated') != -1:
        parameters = {
            'jql': jql_query
            .replace('<start>', str(start_date))
            .replace('<end>', str(end_date))
        }
    else:
        parameters = api_config['parameters']['query']

    # Loop for fetching paginated data
    while more_results_flag:
        url = f"{api_config['url']}?startAt={start_at}&maxResults={page_size}"
        jira_api_client = RequestsApi(api_config['base_url'],
                                      headers=api_config['headers'])
        response = jira_api_client.get(url, params=parameters, auth=auth)
        data = json.loads(response.text)
        full_filename = gen_util.generate_gcs_filename(
            gcs_filename, start_date, api_config['gcs_date_format'], page)
        if response.status_code == 200:
            start_at += page_size
            final_data = gen_util.add_string_to_key_names(data['issues'])
            results = '\n'.join(map(json.dumps, final_data))

            # Upload files to GCS
            gcs_client = GcsHelper(api_config['gcs_bucket_name'])
            gcs_client.blob = f'{gcs_blob_path}/{full_filename}'
            gcs_client.upload_from_string(results, gcs_client.blob, gcs_file_ext)
            more_results_flag = start_at < data["total"]
            page += 1
        else:
            more_results_flag = False

    # Load all pages for the day from GCS into BigQuery
    bq = BqHelper(api_config['bq_dataset'], api_config['bq_table'])
    job = bq.insert_to_bq_from_gcs(
        api_config['schema_filename'],
        f"{api_config['gcs_bucket_uri']}/{gcs_blob_path}/*.json")
```
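The script also depends on the RequestsApi and GcsHelper classes from the shared helper modules in the GitHub repository. Their actual implementations may differ; the sketch below shows one plausible shape, assuming RequestsApi prefixes the base URL onto every request and GcsHelper wraps the google-cloud-storage client:

```python
import requests
from google.cloud import storage

class RequestsApi:
    """Thin wrapper around requests that prepends a base URL and default headers."""

    def __init__(self, base_url, headers=None):
        self.base_url = base_url
        self.session = requests.Session()
        if headers:
            self.session.headers.update(headers)

    def get(self, url, **kwargs):
        # Forwards params, auth, etc. straight through to requests
        return self.session.get(self.base_url + url, **kwargs)

class GcsHelper:
    """Small helper around the Cloud Storage client for string uploads."""

    def __init__(self, bucket_name):
        self.client = storage.Client()
        self.bucket = self.client.bucket(bucket_name)
        self.blob = None

    def upload_from_string(self, data, blob_path, file_ext):
        # Upload one page of newline-delimited JSON as a single blob
        content_type = 'application/json' if file_ext == 'json' else 'text/plain'
        self.bucket.blob(blob_path).upload_from_string(data, content_type=content_type)
```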
Load
To store the data in BigQuery, the load job configuration must be set accordingly. It can be found inside the bq_helper file in the GitHub repository:
```python
@staticmethod
def get_load_job_config(table_schema):
    return bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        schema=table_schema,
        ignore_unknown_values=True,
        time_partitioning=bigquery.table.TimePartitioning()
    )
```
write_disposition=bigquery.WriteDisposition.WRITE_APPEND
- If the table already exists, BigQuery appends the data to the table.
source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
- The format of the source is newline-delimited JSON.
schema=table_schema
- The destination table's schema, loaded from the JSON schema file.
ignore_unknown_values=True
- Ignore extra values not represented in the table schema.
time_partitioning=bigquery.table.TimePartitioning()
- Specifies time-based partitioning for the destination table.
The BigQuery table is populated when the function insert_to_bq_from_gcs() is called.
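insert_to_bq_from_gcs also lives in the bq_helper file. The sketch below shows one way it could combine the load configuration above with the GCS wildcard URI; the table naming in __init__ is an assumption, and get_load_job_config is the static method shown earlier.

```python
from google.cloud import bigquery

class BqHelper:
    def __init__(self, dataset, table):
        self.client = bigquery.Client()
        # Assumed naming convention: <project>.<dataset>.<table>
        self.table_id = f'{self.client.project}.{dataset}.{table}'

    # get_load_job_config is the static method shown in the Load section above
    @staticmethod
    def get_load_job_config(table_schema):
        return bigquery.LoadJobConfig(
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
            source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
            schema=table_schema,
            ignore_unknown_values=True,
            time_partitioning=bigquery.table.TimePartitioning(),
        )

    def insert_to_bq_from_gcs(self, schema_filename, gcs_uri):
        """Load newline-delimited JSON files from GCS into the destination table."""
        # Parse the BigQuery JSON schema file into SchemaField objects
        table_schema = self.client.schema_from_json(schema_filename)
        job_config = self.get_load_job_config(table_schema)
        load_job = self.client.load_table_from_uri(
            gcs_uri, self.table_id, job_config=job_config)
        return load_job.result()  # wait for the load job to finish
```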
Transformation
Structuring Models
Within the dbt project, there is a parent directory called "models" that contains two sub-directories, one for each type of model: marts and staging.
```
models
|---jira
|   |---marts
|   |   |---prd_jira_serviceproject_issues_v1.sql
|   |---staging
|       |---stg_jira_serviceproject_issues_v1.sql
dbt_project.yml
```
prd_jira_serviceproject_issues_v1.sql - contains the "final" version of the data, and is updated by transforming and merging data from the source.
stg_jira_serviceproject_issues_v1.sql - temporary area where data is tested and cleaned up before being loaded into the production table.
dbt_project.yml - a configuration file that contains settings for the dbt project. It is used to define settings such as the project name, the target database, and the schema to which the models should be deployed.
Lineage Graph
Schema
fullname | mode | type | description | testing |
---|---|---|---|---|
key | NULLABLE | STRING | | |
id | NULLABLE | INTEGER | | not_null, unique |
PROJECT | NULLABLE | STRING | | |
reporter_name | NULLABLE | STRING | | |
creator_name | NULLABLE | STRING | | |
department | NULLABLE | STRING | | |
watch_count | NULLABLE | INTEGER | | |
description | NULLABLE | STRING | | |
priority | NULLABLE | STRING | | |
issue_type | NULLABLE | STRING | | |
status | NULLABLE | STRING | | |
resolution | NULLABLE | STRING | | |
assignee | NULLABLE | STRING | | |
request_type | NULLABLE | STRING | | |
resposible_team | NULLABLE | STRING | | |
summary | NULLABLE | STRING | | |
back_office_option | NULLABLE | STRING | | |
customer_channel | NULLABLE | STRING | | |
customer_type | NULLABLE | STRING | | |
product_type | NULLABLE | STRING | | |
created | NULLABLE | TIMESTAMP | | |
updated | NULLABLE | TIMESTAMP | | is_updated |
time_to_first_response_ongoing_start | NULLABLE | TIMESTAMP | | |
time_to_first_response_completed_start | NULLABLE | STRING | | |
time_to_first_response_completed_breached | NULLABLE | BOOLEAN | | |
time_to_first_resolution_ongoing_start | NULLABLE | TIMESTAMP | | |
time_to_first_resolution_completed_start | NULLABLE | STRING | | |
time_to_first_resolution_completed_breached | NULLABLE | BOOLEAN | | |
due_date | NULLABLE | DATE | | |
Macros
Macros are defined in .sql files within the macros directory of the dbt project.
generate_schema_name.sql - this macro is used to generate a schema name dynamically.
```
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
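Compared with dbt's built-in generate_schema_name, which deploys models with a custom schema to <target_schema>_<custom_schema>, this override deploys them to the custom schema name exactly as written, falling back to the target schema when no custom schema is set.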
Source Freshness
Source freshness refers to the timeliness of the data in the source tables that are used to create the dbt models. Here's the configuration in the .yml file that determines the source freshness of the source table jira.issues_v7:
```yaml
version: 2

sources:
  - name: jira_src
    description: "Loading dataset for staging"
    database: datatest-348502
    schema: jira
    tables:
      - name: issues_v7
        freshness:
          warn_after:
            count: 1
            period: day
          error_after:
            count: 2
            period: day
        loaded_at_field: _PARTITIONTIME
```
To check the source freshness, run dbt source freshness --select models/jira.
Data Quality
To monitor data quality of the tables, the following tests are executed:
Generic Tests
source_not_null_jira_prd_prd_jira_serviceproject_issues_v1_id - checks that the "id" column contains no null values.
source_unique_jira_prd_prd_jira_serviceproject_issues_v1_id - checks that the "id" column contains no duplicate values.
Singular Tests
assert_singular_prd_jira_issues_is_updated - compares the "updated" column of the prod table against the source table to verify that the latest "updated" date has made it into the prod table.
To execute data quality testing, run dbt test --select jira.
Deployment
Running dbt in production means setting up a system to run a dbt job on a schedule, rather than running dbt commands manually from the command line. Here's the job configuration:
Job Name: jira_serviceproject_issues_daily_run
Environment: Deployment
Target Name: default
Run Timeout: Never
Generate docs on run: True
Run source freshness: True
Commands: dbt build --select jira
Custom Cron Schedule (UTC): At 01:15 AM
dbt build --select jira - this command runs all tests and builds all views and tables.