SaFi Bank Space : DLQ Management & Error Handling in Kafka



General approach for Kafka error handling

Typically there are two types of errors when we are dealing with Kafka message consuming. They are Transient Error and Non-Transient Error

  • Transient Error is the type of error which we can do retry on the consumer side when it occurs. For example if the database is temporarily unavailable, this would be a Transient error as when the database becomes available again the message will be processed successfully.

  • Non-Transient Error is the type of error which is unlikely to be successfully handled even we do the automatic retry at a later time. For example, if the message cannot be deserialized by the consumer, attempts to retry processing will not be successful.

There will be different steps to take for these two types of error

  • For Transient Errors, we could first retry it for a pre-configured times and if it succeed after the retry, this is considered as a successful processing for the message. And if it still fails after we do all the retries, we could consider it as a Non-Transient Error and it can be sent to a pre-defined DLQ topic for further investigation.

  • For Non-Transient Errors, if we think it needs to be recorded and investigated later, we should send it to a pre-defined DLQ topic and then current consumer can continue to process new messages.

The general step can be showed like this:

What is DLQ ?

Dead Letter Queues (DLQs) are an error handling mechanism for Kafka. In cases messages that cannot be handled correctly by a consumer when read from a topic are instead sent to another topic, the DLQ, for further investigation and recovery. This ensures that subsequent messages on the original topic can continue to be processed, while any messages that cannot be processed are retained in full, i.e. there is no data loss.

Once a message has been sent to a DLQ this indicates that the standard handling of the message was not able to complete. We need to take additional steps in order to recover the original operation that was unable to complete.

When to do retry & send to DLQ

There are no fixed rules which can simply defined what kind of messages need to have retry or need to be sent to DLQ, here are some guidelines to help to do the decision:

When to do retry

For the handling logic when we do retry, we need to make sure the logic we use to process the message is idempotent, for more details on how we should do idempotency please refer to Idempotency

  1. The error is Transient Error which means it can be some error like network timeout etc. We can do retry to process the message again and it may succeed on the retry.

  2. Based on business requirement to decide whether we need retry and how many times we need to take to do the retry.

How to do retry

We can do retry on consumer side by using the https://micronaut-projects.github.io/micronaut-kafka/latest/guide/#kafkaErrors built-in retry mechanism.

When to send to DLQ

  1. The error is Non-Transient Error and the reason for this error is caused by deserialization problem such as the data is invalid for the required format(invalid Avro data or protobuf data)

  2. After retry for all the pre-defined times, if the error still exists. And based on the business scenario we do not want to loss this data(such as payment error), then we need to put this to a DLQ to have further investigation and recovery.

How to catch deserialization error

We can use https://micronaut-projects.github.io/micronaut-kafka/latest/guide/#kafkaErrors to define customized exception handler which can help us catch deserialization error and send to DLQ. Please refer to https://micronaut-projects.github.io/micronaut-kafka/latest/guide/#kafkaErrors for more details about define customized exception handling.

DLQ naming

For a message from a topic which needs to be sent to a DLQ, the DLQ should only contains the message from this topic and the naming for the DLQ should follow the same name with the original topic and followed by .dlq in the end.

For example if a message from topic transaction.interbank failed to process and we decide to send it to DLQ, the DLQ topic for this message should be named as transaction.interbank.dlq .

DLQ message format

When a message is sent to a DLQ topic, the format of the message on the DLQ will be as follows:

  • The body of the message is identical to the original message, in the same format (Protobuf/Json or Avro). For example if the originally consumed message is an LoanSummaryAvro, then the DLQ’s message body will be the same LoanSummaryAvro message.

  • DLQ metadata detailing information about the original message is present in the x-dlq-metadata message header in the form of a standard DLQ metadata message (see below for the full format). DLQ metadata values will always be formatted as JSON key / value pairs.

The DLQ metadata contains four mandatory fields and one optional field all as JSON key / value pairs:

  • original_headers. This is a repeated field containing all the Kafka headers present in the original message.

  • original_topic. This is a single string whose value is the topic that the original message was consumed from.

  • original_consumer_group. This is a single string whose value is the consumer group used by the consumer which encountered the error while processing the message.

  • original_timestamp. This is a single string whose value is the timestamp of when the original message was created.

  • error_info. This is an optional single string field which, if present, will contain the error info.

Message handling in DLQ

When the message has been put on DLQ, we need to do manual check on it. For example if the DLQ message is payment related message which is really important for us, then we may need to do manual check on it and see what steps we need to take for solving this issue.

And if we find out the error need to be solved in a long run we could create a Jira ticket to track it. For more details about creating tickets, please refer to Jira Ticketing System.

Monitor for DLQ

We need to setup monitor mechanism for DLQ topics, one of way to do this is to use Prometheus and AlertManager to handle the monitoring and alerting for DLQ status.

For more details for how we are doing monitoring, please refer to Monitoring.

Handling DLQ messages sent by Thought Machine

ThoughtMachine has defined different DLQs for different topic/message types. Please refer to the document for the suggested & required handling steps for all the DLQs. We need to follow the TM guideline for handling the DLQ they provide, especially for posting related DLQs, we should always setup proper error handling process for all these errors. Please follow the document from TM to do all the required work and implementation.

Detailed information about each individual DLQ can be found on the Documentation Hub under the associated API section. This includes the processing being performed, the impact of a message being placed on that DLQ, and potential recovery steps to follow

Attachments:

~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~Steps for handling Transient error.tmp (application/vnd.jgraph.mxfile)
Steps for handling Transient error (application/vnd.jgraph.mxfile)
Steps for handling Transient error.png (image/png)