The Datalake (also called the Data Warehouse; the terms are used interchangeably in this document) is the place where data from the individual services within the bank system is consolidated, serving as the basis for building ETLs for various reporting purposes.

The Datalake (DL) consumes the data provided by the bank system and applies additional transformations on top of it. It cannot change the original data coming from the bank system.
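To illustrate, here is a minimal sketch of such a transformation, written in Python against BigQuery (all dataset, table, and column names are hypothetical). The derived data is built as a view, so the original data is only ever read, never modified:

    # Hypothetical derived reporting view over raw snapshot data.
    from google.cloud import bigquery

    client = bigquery.Client()

    client.query("""
        CREATE OR REPLACE VIEW `reporting.account_balances` AS
        SELECT account_id, balance, currency, snapshot_time
        FROM `raw_events.account_snapshots`  -- raw data is only read, never changed
    """).result()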

Construction of the Datalake

The DL does not connect directly to the databases of the individual microservices, as this would break the principle that each microservice is the sole owner and user of its database. Instead, the DL subscribes to (almost) all of the Kafka topics on which the microservices within the bank system publish their data.

The DL is subscribed to:

  • event and snapshot topics of all the microservices within the system

  • topics exposed by TM, carrying facts about the entities TM manages.

The data flows into BigQuery through a GCP sink connector.
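As an illustration, the sink could be registered through the Kafka Connect REST API roughly as follows. This is a sketch using the open-source WePay BigQuery sink connector; the endpoint URL, topic regex, project, dataset, and keyfile path are assumptions, and the exact configuration keys vary between connector versions.

    import json
    import requests

    connector = {
        "name": "datalake-bigquery-sink",
        "config": {
            "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
            # Subscribe to (almost) all event and snapshot topics at once.
            "topics.regex": ".*\\.(events|snapshots)",
            "project": "bank-datalake",        # hypothetical GCP project
            "defaultDataset": "raw_events",    # hypothetical BigQuery dataset
            "keyfile": "/etc/secrets/bq-writer.json",
        },
    }

    resp = requests.post(
        "http://kafka-connect:8083/connectors",  # assumed Connect endpoint
        headers={"Content-Type": "application/json"},
        data=json.dumps(connector),
    )
    resp.raise_for_status()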

Because snapshot messages are entity views emitted on every change to the respective entity, this design gives the DL visibility not only into the current state of the data owned by the microservices but also into its full history.
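For example, reconstructing the state of every entity as of some past instant reduces to taking the latest snapshot at or before that instant. A sketch, assuming a hypothetical raw_events.customer_snapshots table with one row per snapshot message:

    from google.cloud import bigquery

    client = bigquery.Client()

    rows = client.query("""
        SELECT customer_id, status, address
        FROM `raw_events.customer_snapshots`
        WHERE snapshot_time <= TIMESTAMP('2024-01-01 00:00:00')
        QUALIFY ROW_NUMBER() OVER (
            PARTITION BY customer_id ORDER BY snapshot_time DESC
        ) = 1  -- keep only the newest snapshot per customer
    """).result()

    for row in rows:
        print(row.customer_id, row.status)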

Data privacy

Some fields considered PII must not be exposed to specific roles querying the Datalake, per the Customer Identification File requirement.

PII data may be stored in the Datalake, but only in a secure vault. The secure vault does not expose PII data to everyone; instead, access is granted based on roles (BigQuery supports this), for example for:

  • fraud investigations

  • reporting to credit bureaus

Everybody else sees only masked (hashed) PII data; one possible realization is sketched below.
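The sketch implements the masking as a BigQuery view over the secure vault: privileged principals see the clear-text value, everybody else a SHA-256 hash. The table, dataset, and principal names are hypothetical assumptions, and BigQuery policy tags with column-level security would be an alternative mechanism.

    from google.cloud import bigquery

    client = bigquery.Client()

    client.query("""
        CREATE OR REPLACE VIEW `reporting.customers_masked` AS
        SELECT
          customer_id,
          IF(
            SESSION_USER() IN ('fraud-investigations@bank.example',
                               'credit-bureau-reporting@bank.example'),
            national_id,                 -- privileged roles: clear text
            TO_HEX(SHA256(national_id))  -- everybody else: hashed
          ) AS national_id
        FROM `secure_vault.customers`
    """).result()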