Background

High-quality data are a precondition for analyzing and using big data. This document lists the data quality checks that the data team applies to each data source used by the company.

Checklist

| Dimension | Elements | Indicators |
|---|---|---|
| Uniqueness | Is there a unique identifier in the data source? | Check for a unique identifier in the data source |
| | How are changes made and processed? | Check for update date/change date fields |
| | How far back in the data set will changes be made? | |
| Timeliness | Does the time interval from data collection and processing to release meet requirements? | Check documentation for data update frequency |
| | Are the data reported as soon as possible after collection? | |
| Integrity | Is the data format clear? | Correct metadata definition; required fields are not nullable |
| | Data are consistent with content integrity | Units are appropriate and defined; value ranges are valid; timezones are consistent |
| | Data and the data from other data sources are consistent or verifiable | Lookup table or foreign key constraints |
| Completeness | How to identify whether the data are complete | Compare record counts of source and extracted data |
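As an illustration only, the Timeliness checks above could be partly automated with dbt's source freshness feature (dbt is the tool used for the checks in the next section). The sketch below assumes a hypothetical source named `genesys` with a `conversations` table and a hypothetical load-timestamp column `_loaded_at`; the thresholds are placeholders, and the exact placement of these keys can vary between dbt versions.

```yaml
sources:
  - name: genesys                  # hypothetical source name
    loaded_at_field: _loaded_at    # hypothetical column recording when each row was loaded
    freshness:
      warn_after: {count: 12, period: hour}    # placeholder threshold: warn if newest row is older than 12 hours
      error_after: {count: 24, period: hour}   # placeholder threshold: fail if newest row is older than 24 hours
    tables:
      - name: conversations        # hypothetical table name
```

With a definition like this, running `dbt source freshness` compares the most recent `_loaded_at` value against the thresholds and reports a warning or error when the data is older than expected.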

Data Quality Checks in dbt

Most of the data quality checks are implemented in dbt. Below are sample checks applied by the data team.

1. not_null

models:
  - name: raw_genesys_conversations_v1
    columns:
      - name: conversationId
        tests:
          - not_null

2. unique

models:
  - name: raw_genesys_conversations_v1
    columns:
      - name: conversationId
        tests:
          - unique 

3. accepted_values

models:
  - name: raw_genesys_conversations_v1
    columns:
      - name: direction
        tests:
          - accepted_values:
              values: [inbound, outbound]

4. relationships

models:
  - name: raw_genesys_conversations_v1
    columns:
      - name: user_id
        tests:
          - relationships:
              to: ref('raw_genesys_users_v1')
              field: id

5. Singular Tests

-- Fails if any evaluation has a negative score (a singular test fails when it returns rows)
select
    evaluationId
from {{ ref('evaluations') }}
where answer.score < 0
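Two checklist items, valid value ranges and comparing record counts between source and extracted data, are not covered by the built-in tests shown here. One possible way to express them is with generic tests from the `dbt_utils` package. This is a sketch, not the team's actual configuration: it assumes `dbt_utils` is installed, and the column `duration_seconds` and the source `genesys.conversations` are hypothetical names used only for illustration.

```yaml
models:
  - name: raw_genesys_conversations_v1
    tests:
      # Completeness: row count of the model should match the row count of the source
      - dbt_utils.equal_rowcount:
          compare_model: source('genesys', 'conversations')   # hypothetical source
    columns:
      - name: duration_seconds        # hypothetical numeric column
        tests:
          # Integrity: values must be greater than or equal to 0
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: true
```

Both tests run with the usual `dbt test` command. `equal_rowcount` only makes sense when the model is expected to be a one-to-one copy of its source; a model that filters or deduplicates rows would need a different comparison.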