SaFi Bank Space : Avro major version upgrade strategy

Status: DONE

Impact: MEDIUM

Driver: Juraj Macháč

Approver: Juraj Macháč

Contributors: Michal Glaus

Informed:

Due date:

Resources:

Background

There are three types of messages (see the article “Kafka message types”) in the Avro topics within the bank: snapshots, commands, and events. Please read that article for more detail on the meaning of each type. The schemas for these messages rely on Avro schema evolution for minor upgrades, but a major, breaking upgrade is sometimes needed.

Although frequent major upgrades are discouraged, a process is needed that has the following properties:

  • No message loss for the consumers - consumers that follow the process are assured of consuming every message produced by the producer (in either the old major version or the new one).

  • No duplicate message processing - consumers should not create the same side effects twice when processing messages from two different major versions.

  • As few blocking parts as possible - the process should allow a safe but also efficient upgrade of the major schema version. It should not contain many dependent steps where different parties have to wait for one another.

This decision log is split into 3 separate parts, one per message type, each with its own decision on the versioning approach. For illustration, each decision assumes an upgrade from V1 to V2.

Snapshots

Snapshots are used by a consumer to materialize a view of an entity provided by the producer. It should be possible for a new consumer to connect to the snapshot topic and build a full materialized view of all entities. Additionally, snapshots are consumed in the Datalake to build a history of changes to the entity and an end-of-day snapshot of it.

Snapshots are owned by the producer. There is a single producer of the snapshot, and multiple consumers.

Options considered

  • Consumer-first upgrade in separate topics

  • Consumer-first upgrade in a single topic

  • Producer-first upgrade in separate topics

Description

Consumer-first upgrade in separate topics:

  • The consumers will first start accepting both V1 and V2 snapshots in the different topics.

  • Once all consumers are able to process V2, the producer will atomically switch to sending V2 into the new topic and stop sending V1.

  • The producer will “republish” the snapshots of entities into the new V2 topic so that the consumers of the new topic have full information.

  • The consumers can stop listening on V1 snapshots.

Consumer-first upgrade in a single topic:

  • The consumer will create a new listener and consume both V1 and V2 versions of the snapshot.

  • Once all consumers are able to process V2, the producer will atomically switch from producing V1 to producing V2 into the same topic.

Producer-first upgrade in separate topics:

  • The producer of the snapshot starts publishing the snapshots in both versions, V1 and V2, to two separate topics.

  • The producer will “republish” the snapshots of entities into the new V2 topic so that the consumers of the new topic have full information.

  • During a transition period, the consumers will upgrade their listeners to listen on V2 instead of V1. The switch may be atomic, but it’s not required. The consumer can listen on both V1 and V2 until the consumer is fully up to date with V2 to avoid any lag.

  • After all consumers have transitioned, the producer stops publishing the V1 snapshot.
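The producer-first steps above can be sketched as follows. This is a minimal illustration, not the bank's implementation: the in-memory broker stands in for Kafka, and the topic names, the entity fields, and the `to_v1`/`to_v2` mappers are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical topic names; the decision log does not prescribe a naming scheme.
TOPIC_V1 = "customer.snapshot.v1"
TOPIC_V2 = "customer.snapshot.v2"


@dataclass
class InMemoryBroker:
    """Stand-in for Kafka: one append-only list of (key, value) per topic."""
    topics: dict = field(default_factory=dict)

    def send(self, topic, key, value):
        self.topics.setdefault(topic, []).append((key, value))


def to_v1(entity):
    # Assumed V1 shape: a single "name" field.
    return {"id": entity["id"], "name": f'{entity["first"]} {entity["last"]}'}


def to_v2(entity):
    # Assumed V2 shape (the breaking change): the name is split in two.
    return {"id": entity["id"], "firstName": entity["first"], "lastName": entity["last"]}


class SnapshotProducer:
    def __init__(self, broker, store):
        self.broker = broker
        self.store = store        # entity id -> current entity state
        self.publish_v1 = True    # switched off once all consumers read V2

    def republish_all_to_v2(self):
        """One-off backfill so the V2 topic holds every existing entity."""
        for entity in self.store.values():
            self.broker.send(TOPIC_V2, entity["id"], to_v2(entity))

    def on_entity_changed(self, entity):
        """During the transition, every change is published in both versions."""
        self.store[entity["id"]] = entity
        if self.publish_v1:
            self.broker.send(TOPIC_V1, entity["id"], to_v1(entity))
        self.broker.send(TOPIC_V2, entity["id"], to_v2(entity))
```

In this sketch the end of the transition is a single flag flip (`producer.publish_v1 = False`); in practice the V1 topic would also be retired once no consumer reads it.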

Pros and cons

Consumer-first upgrade in separate topics:

  • (plus) The latest topic (almost) always contains all snapshots - new consumers only need to understand the latest schema.

  • (plus) There are no “duplicate” messages published. Every message is in either V1 or V2.

  • (minus) The upgrade desired by the owner (producer) is blocked by the consumers until all of them have transitioned. No consumer is able to process the new version V2 until all of the consumers are ready to transition.

  • (minus) Snapshot republishing is required with every new major version.

Consumer-first upgrade in a single topic:

  • (plus) No need for snapshot republishing.

  • (plus) A single topic contains all of the messages - simpler topic management.

  • (plus) There are no “duplicate” messages published. Every message is in either V1 or V2.

  • (minus) A new consumer will need to understand all previous major versions in order to build a materialized view.

  • (minus) The upgrade desired by the owner (producer) is blocked by the consumers until all of them have transitioned. No consumer is able to process the new version V2 until all of the consumers are ready to transition.

  • (minus) The schema management will need to change, as the default setting is a single schema per topic.

Producer-first upgrade in separate topics:

  • (plus) The latest topic always contains all snapshots - new consumers only need to understand the latest schema.

  • (plus) The upgrade is less blocking: the owner (producer) can upgrade without waiting for the consumers, so an interested consumer can upgrade as soon as possible.

  • (info) The consumers must expect to process the same snapshot in both V1 and V2. However, this should not be a problem, as a consumer can determine the latest representation by comparing the publishedAt Kafka timestamp.

  • (minus) Snapshot republishing is required with every new major version.
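The publishedAt comparison mentioned for the producer-first option could look roughly like the sketch below. The record type, its field names, and the tie-break rule (prefer V2 when both versions carry the same timestamp) are assumptions made for illustration; the decision log only states that the timestamps are compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotRecord:
    entity_id: str
    schema_version: int    # 1 or 2, i.e. which topic the record came from
    published_at_ms: int   # assumed to carry the publishedAt timestamp
    payload: dict


class MaterializedView:
    """Keeps only the latest representation of each entity."""

    def __init__(self):
        self._latest = {}

    def apply(self, record):
        """Store the record if it is newer; return True when it was stored."""
        current = self._latest.get(record.entity_id)
        if current is None or self._is_newer(record, current):
            self._latest[record.entity_id] = record
            return True
        return False

    @staticmethod
    def _is_newer(candidate, current):
        # Assumption: on equal timestamps the higher schema version wins.
        return (candidate.published_at_ms, candidate.schema_version) > (
            current.published_at_ms,
            current.schema_version,
        )

    def get(self, entity_id):
        return self._latest.get(entity_id)
```

A consumer that listens on both topics during the transition would pass every record from either topic through `apply`, so the order in which the two topics are read does not change the resulting view.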

Estimated cost

Commands

Commands are an asynchronous way to invoke the consumer’s API. The service exposes a command topic to which its clients (the producers) can send a command.

Commands are owned by the consumer. There is a single consumer of the command and multiple producers.

Options considered

  • Consumer-first upgrade in separate topics

  • Consumer-first upgrade in a single topic

  • Producer-first upgrade in separate topics

Description

Consumer-first upgrade in separate topics:

  • Much like in REST API major upgrades, the consumer will create a new listener and support consuming both V1 and V2 versions of the command.

  • The clients (producers) can switch one by one and start sending V2 commands during a transition period.

  • After all clients have switched to V2, the consumer can stop accepting V1 messages and remove the V1 topic.

Consumer-first upgrade in a single topic:

  • The consumer will enhance the listener to also accept V2 commands in the single topic.

  • The clients (producers) can switch one by one and start sending V2 commands into the same topic during the transition period.

  • After all clients have switched to V2, the consumer can stop accepting V1 messages in the topic.

Producer-first upgrade in separate topics:

  • All of the clients (producers) will start publishing both the V1 and V2 commands to separate topics.

  • Once all of the producers are upgraded, the consumer will switch from consuming the V1 commands to consuming the V2 commands.

  • The producers will now be able to stop publishing the V1 command.
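For the consumer-first option in separate topics, one way to support both versions is to upgrade each V1 command to the V2 shape and route everything through a single handler. The sketch below assumes a hypothetical "block card" command; the field names, the `UNKNOWN` default, and the two listener methods are illustrative and not taken from any real SaFi schema.

```python
def upgrade_v1_to_v2(cmd_v1):
    """Map the hypothetical V1 command onto the assumed V2 shape."""
    return {
        "idempotencyKey": cmd_v1["idempotencyKey"],
        "cardId": cmd_v1["cardId"],
        "reasonCode": cmd_v1["reason"],   # field renamed in V2 (assumption)
        "requestedBy": "UNKNOWN",         # new required V2 field, defaulted
    }


class BlockCardCommandConsumer:
    def __init__(self):
        self.blocked = []   # records the side effects for this sketch

    def on_v1(self, cmd_v1):
        """Listener bound to the V1 topic; removed after the transition."""
        self.handle(upgrade_v1_to_v2(cmd_v1))

    def on_v2(self, cmd_v2):
        """Listener bound to the V2 topic."""
        self.handle(cmd_v2)

    def handle(self, cmd):
        # Business logic is written once, against the V2 shape only.
        self.blocked.append((cmd["cardId"], cmd["reasonCode"], cmd["requestedBy"]))
```

Because each producer sends a given command in exactly one version, the consumer does not need to correlate V1 and V2 messages; dropping V1 support later means deleting `on_v1` and the mapper.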

Pros and cons

Consumer-first upgrade in separate topics:

  • (plus) The upgrade is not blocking for the owner that desires to upgrade. Interested clients (producers) can start leveraging the new functionality as soon as possible.

  • (plus) No need to correlate commands in V1 and V2, as each command is published in only a single version.

  • (plus) Clear contract in topics with a single-schema-per-topic approach.

Consumer-first upgrade in a single topic:

  • (plus) The upgrade is not blocking for the owner that desires to upgrade. Interested clients (producers) can start leveraging the new functionality as soon as possible.

  • (plus) No need to correlate commands in V1 and V2, as each command is published in only a single version.

  • (plus) Less topic management, as a single topic is used.

  • (minus) The schema management will need to change, as the default setting is a single schema per topic.

Producer-first upgrade in separate topics:

  • (minus) The consumer needs to ensure it doesn’t process the same command twice when switching from V1 to V2. It’s impossible to start consuming V2 commands from “where it left off in V1”. (info) This could be done by correlating the messages by idempotencyKey if the producer ensures that the same key is used in V1 and V2.

  • (minus) The upgrade is blocking for the owner (consumer), as it needs to wait until all of the producers are upgraded before allowing any of its clients (producers) to use new functionality in V2.

  • (minus) More work for the producers, which is more work in general, because for commands there is one consumer and many producers.

Estimated cost

Events

Events are published by services whenever a business action occurs. These are used by the clients (consumers) to react to different types of business events and kick off their business processes.

Events are owned by the producer. There is a single producer of the event, and multiple consumers.

Options considered

  • Consumer-first upgrade in separate topics

  • Consumer-first upgrade in a single topic

  • Producer-first upgrade in separate topics

Description

Consumer-first upgrade in separate topics:

  • The consumers will first start accepting both V1 and V2 in separate topics.

  • Once all consumers are accepting V2, the producer will switch to sending the V2 version of the event to the new topic.

  • The consumers can stop supporting the V1 event.

Consumer-first upgrade in a single topic:

  • The consumers will enhance their listeners to also accept V2 events in the single topic.

  • Once all consumers are accepting V2, the producer will switch to sending the V2 version of the event to the same topic.

  • The consumers can stop supporting the V1 event.

Producer-first upgrade in separate topics:

  • The producer (owner) will first start publishing both the V1 and V2 events to separate topics.

  • The clients (consumers) can start accepting V2 instead of V1 during the transition period.

  • Once all consumers accept the new V2, the producer can stop sending V1 events and remove the V1 topic.

Pros and cons

Consumer-first upgrade in separate topics:

  • (plus) The switch is “atomic” on the producer side. The consumers do not need to correlate V1 and V2 messages and ensure that exactly one of them was processed.

  • (minus) The upgrade is blocking for the owner (producer), as it needs to wait until all of the consumers are upgraded. No consumer will be able to trigger new functionality with new V2 events until all consumers are upgraded.

  • (plus) Clear contract in topics with a single-schema-per-topic approach.

Consumer-first upgrade in a single topic:

  • (plus) The switch is “atomic” on the producer side. The consumers do not need to correlate V1 and V2 messages and ensure that exactly one of them was processed.

  • (minus) The upgrade is blocking for the owner (producer), as it needs to wait until all of the consumers are upgraded. No consumer will be able to trigger new functionality with new V2 events until all consumers are upgraded.

  • (plus) Less topic management, as a single topic is used.

  • (minus) The schema management will need to change, as the default setting is a single schema per topic.

Producer-first upgrade in separate topics:

  • (minus) The consumer needs to ensure it doesn’t process the same event twice when switching from V1 to V2. It’s impossible to start consuming V2 events from “where it left off in V1”. (info) This could be done by correlating the messages by idempotencyKey if the producer ensures that the same key is used in V1 and V2.

  • (plus) The upgrade is not blocking for the owner (producer). An interested consumer of the V2 event can start leveraging the new functionality as soon as possible, without having to wait for the rest of the consumers to upgrade.
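The idempotencyKey correlation suggested for the producer-first option can be sketched as a small deduplicating wrapper around the consumer's side effect. It assumes, as the note above requires, that the producer puts the same idempotencyKey on the V1 and V2 versions of an event. The class name and the in-memory set are illustrative only; a real consumer would need a durable store for the seen keys.

```python
class DeduplicatingEventHandler:
    """Runs the side effect at most once per idempotencyKey across versions."""

    def __init__(self, side_effect):
        self._side_effect = side_effect
        self._seen = set()   # in-memory only; assumed durable in a real service

    def on_event(self, version, event):
        """Process an event from the V1 or V2 topic; return True if handled."""
        key = event["idempotencyKey"]
        if key in self._seen:
            return False     # the other version of this event was already handled
        self._seen.add(key)
        self._side_effect(version, event)
        return True
```

With this in place the consumer can listen on both topics for a while and then drop the V1 listener, without needing to find the V2 offset that matches where it stopped in V1.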

Estimated cost

Action items

Outcome

Major upgrades of all message types will be “owner-first” in separate topics, because this makes the upgrade most efficient in terms of team communication. More specifically:

  • Snapshots: Producer-first upgrade in separate topics

  • Commands: Consumer-first upgrade in separate topics

  • Events: Producer-first upgrade in separate topics