Background
High-quality data are a precondition for analyzing and using big data. This document lists the data quality checks the data team applies to each data source used by the company.
Checklist
| Dimension | Elements | Indicators |
| --- | --- | --- |
| Uniqueness | Is there a unique identifier in the data source? | Check for a unique identifier in the data source |
| | How are changes made and processed? | Check for update date/change date fields |
| | How far back in the data set will changes be made? | |
| Timeliness | Does the time interval from data collection and processing to release meet requirements? | Check documentation for the data update frequency |
| | Are the data reported as soon as possible after collection? | |
| Integrity | Is the data format clear? | Correct metadata definitions; required fields are not nullable |
| | Data are consistent with content integrity | Units are appropriate and defined; value ranges are valid; timezones are consistent |
| | Data are consistent with, or verifiable against, other data sources | Lookup tables or foreign key constraints |
| Completeness | How to identify whether the data are complete | Compare the record count of the source and the extracted data |
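The completeness indicator above can be written as a dbt singular test. This is a minimal sketch, not the team's actual implementation; the source name `('genesys', 'conversations')` is a hypothetical placeholder, while `raw_genesys_conversations_v1` is the model used in the examples below.

```sql
-- Singular test: returns rows (i.e. fails) when the extracted model's
-- record count differs from the source's record count.
-- The source name ('genesys', 'conversations') is illustrative only.
with source_count as (
    select count(*) as n from {{ source('genesys', 'conversations') }}
),
extracted_count as (
    select count(*) as n from {{ ref('raw_genesys_conversations_v1') }}
)
select s.n as source_rows, e.n as extracted_rows
from source_count s
cross join extracted_count e
where s.n != e.n
```

A test like this passes only when it returns zero rows, which is dbt's convention for singular tests.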
Data Quality Checks in dbt
Most of the data quality checks are implemented in dbt. Below are sample checks applied by the data team.
1. not_null
```yaml
models:
  - name: raw_genesys_conversations_v1
    columns:
      - name: conversationId
        tests:
          - not_null
```
2. unique
```yaml
models:
  - name: raw_genesys_conversations_v1
    columns:
      - name: conversationId
        tests:
          - unique
```
3. accepted_values
```yaml
models:
  - name: raw_genesys_conversations_v1
    columns:
      - name: direction
        tests:
          - accepted_values:
              values: ['inbound', 'outbound']
```
4. relationships
```yaml
models:
  - name: raw_genesys_conversations_v1
    columns:
      - name: user_id
        tests:
          - relationships:
              to: ref('raw_genesys_users_v1')
              field: id
```
5. Singular Tests
```sql
select evaluationId
from {{ ref('evaluations') }}
where answer.score < 0
```
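Singular tests can also cover the "value ranges are valid" indicator from the checklist above. The following is a minimal sketch: the column `duration_ms` is a hypothetical example and does not come from this document.

```sql
-- Singular test for a value-range check:
-- returns (and therefore fails on) any conversation with a negative duration.
-- 'duration_ms' is an illustrative column name, not taken from this document.
select conversationId
from {{ ref('raw_genesys_conversations_v1') }}
where duration_ms < 0
```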