SaFi Bank Space : Idempotency

Idempotency guarantees that a requested side-effect is executed at most once. This property is important to maintain at all entry points (REST APIs and Kafka processors) of the backend services so that they can handle retries without creating unwanted side-effects (DB updates or API calls) in case of a temporary network issue.

The contract provided by the backend

A request is idempotently retried if:

  • the idempotencyKey specified is the same as in the previous request

  • the payload used is the same as in the previous request

  • the request arrived within the idempotency retention period (Period to be specified).

The backend service should ensure that for each request that is idempotently retried the following holds:

  • Any side-effect created while handling the request was executed at most once across all the idempotent retries coming from the client, or was executed again idempotently (e.g. a subsequent API was itself called idempotently).

  • The same response is returned if the operation was successful
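To make the contract concrete, below is a minimal sketch of the check an entry point could perform when a request arrives. The record shape, the payload hashing, and the 24-hour retention value are illustrative assumptions; the actual retention period is still to be specified.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical record of a previously handled request, kept for the retention period.
data class StoredRequest(
    val idempotencyKey: String,
    val payloadHash: String,
    val receivedAt: Instant,
    val response: String // serialized response returned to the original caller
)

// The retention period is still to be specified; 24 hours is a placeholder.
val RETENTION: Duration = Duration.ofHours(24)

// Returns the previously stored response if the incoming request is an idempotent retry
// (same key, same payload, within retention), or null if it should be treated as new.
fun idempotentRetryResponse(
    previous: StoredRequest?,
    idempotencyKey: String,
    payloadHash: String,
    now: Instant = Instant.now()
): String? {
    if (previous == null) return null
    val sameKey = previous.idempotencyKey == idempotencyKey
    val samePayload = previous.payloadHash == payloadHash
    val withinRetention = Duration.between(previous.receivedAt, now) <= RETENTION
    return if (sameKey && samePayload && withinRetention) previous.response else null
}
```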

Idempotency entry points

Idempotency applies to any entry point of a service which mutates the data of the service, thus creating a side-effect:

  • REST APIs

    • POST, PUT, DELETE (TBC: DELETE / PUT are idempotent by nature)

  • Kafka message types:

    • commands - meant to make a change in the consuming service, hence subject to idempotency

    • events - consuming an event can kick off a business process, hence creating side-effects (TBC: Should events contain idempotencyKeys then?)

Implementation approach - Phase 1

The entry points of the service can invoke different types of operations. Some of them may be transactional (operating only on the DB), while others invoke different APIs in sequence - essentially a business process.

Transactional APIs

If the API is transactional, then the service can rely on reverting the changes whenever an error occurs. To cater for successful requests, the service can store a mapping between the idempotencyKey and the result of the operation in the same transaction as the operation itself. Retries of the request would first check the mapping table before actually executing the logic.

This ensures that the side-effect is applied at most once, because the request mapping is stored if and only if the operation succeeded.
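A minimal sketch of this approach, assuming a hypothetical mapping table and transaction abstraction (the names below are illustrative, not an existing internal API):

```kotlin
// Hypothetical repository over the idempotencyKey -> response mapping table.
interface IdempotencyRepository {
    fun findResponse(idempotencyKey: String): String?
    fun saveResponse(idempotencyKey: String, response: String)
}

// Hypothetical helper that runs a block inside a single DB transaction.
interface TransactionRunner {
    fun <T> inTransaction(block: () -> T): T
}

class TransactionalEndpoint(
    private val repository: IdempotencyRepository,
    private val tx: TransactionRunner
) {
    // Executes `operation` at most once per idempotencyKey: the response mapping is
    // written in the same transaction as the operation's own DB changes, so either
    // both are committed or neither is.
    fun handle(idempotencyKey: String, operation: () -> String): String =
        tx.inTransaction {
            repository.findResponse(idempotencyKey)
                ?: operation().also { repository.saveResponse(idempotencyKey, it) }
        }
}
```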

Specifics for Kafka consumers - The Kafka consumer should commit offsets only after successfully processing the message, that is, after the DB transaction has been committed. If the DB commit succeeds but the Kafka offset commit fails, the message is re-delivered and the request is idempotently retried through the request-response mapping already committed to the DB. (TBC - order of committing Kafka consumer + DB)
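A sketch of this consume-loop ordering, using the plain Kafka consumer API with auto-commit disabled; deserialization, error handling and the actual processing (which commits its own DB transaction, including the request-response mapping) are left abstract:

```kotlin
import java.time.Duration
import org.apache.kafka.clients.consumer.KafkaConsumer

// Assumes the consumer was created with enable.auto.commit=false.
fun consumeLoop(consumer: KafkaConsumer<String, String>, process: (String) -> Unit) {
    while (true) {
        val records = consumer.poll(Duration.ofMillis(500))
        for (record in records) {
            // 1. Process the message; the DB transaction (business changes plus the
            //    idempotencyKey -> response mapping) commits inside this call.
            process(record.value())
        }
        // 2. Commit Kafka offsets only after processing succeeded. If this commit fails,
        //    the messages are re-delivered and the request-response mapping in the DB
        //    makes the retry idempotent.
        consumer.commitSync()
    }
}
```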

Business processes

If the service API executes calls against other APIs and updates its DB in a specific sequence, then the original approach with a request-response mapping cannot generally be applied, because the other API calls made during execution of the API endpoint cannot be reverted as part of the transaction.

Why can't the original request-response mapping be applied?

Imagine a situation where service A offers an API in which it fetches an entity from its DB and calls service B using data from that entity.

  1. Request R1 arrives in A

  2. A fetches entity E1 from DB

  3. A sends a request constructed from R1 and E1 to service B

  4. B processes the request but fails to deliver the response to A.

  5. Entity E1 gets changed to E2 as a result of an independent process

  6. Request R1 gets idempotently retried in A

  7. A fetches the entity from the DB, which has in the meantime changed to E2

  8. A sends request constructed from R1 and E2 to service B.

  9. B receives a request which is not an idempotent retry of the original one, so its side-effect may be executed a second time.

Instead, the service needs to be able to cache the side-effects before making them in order to ensure that they are executed idempotently (with the same data) upon every retry, even if the data used as input to that request have changed in the meantime.
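One possible shape of this caching, sketched below: the outbound payload is persisted under the incoming idempotencyKey before calling service B, so a retry replays the stored payload instead of rebuilding it from entity data that may have changed. All names are illustrative.

```kotlin
// Hypothetical store for outbound side-effect payloads, keyed by the incoming idempotencyKey.
interface SideEffectStore {
    fun find(idempotencyKey: String): String?
    fun save(idempotencyKey: String, payload: String)
}

class ServiceAHandler(
    private val store: SideEffectStore,
    private val callServiceB: (payload: String) -> Unit // outbound call to service B
) {
    // On the first attempt, the payload built from R1 and E1 is persisted before the call;
    // a later retry of R1 replays that stored payload even if the entity has changed to E2.
    fun callBIdempotently(idempotencyKey: String, buildPayload: () -> String) {
        val payload = store.find(idempotencyKey)
            ?: buildPayload().also { store.save(idempotencyKey, it) }
        callServiceB(payload)
    }
}
```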

Use of business process engine

Such a service can delegate the responsibility of managing the process execution with idempotent retries to a business process engine (temporal.io was chosen, as per the decision log).

The service would create a Temporal workflow which guarantees that:

  • any activity (side-effect) which was successfully executed within the workflow is not executed again.

  • any activity executed within the workflow but failed to finish is executed again with exactly the same data as it was executed before, hence idempotently.
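A minimal sketch of such a workflow using the Temporal Java SDK from Kotlin; the fund-reservation workflow and activity names are purely illustrative, not an existing SaFi flow:

```kotlin
import io.temporal.activity.ActivityInterface
import io.temporal.activity.ActivityMethod
import io.temporal.activity.ActivityOptions
import io.temporal.workflow.Workflow
import io.temporal.workflow.WorkflowInterface
import io.temporal.workflow.WorkflowMethod
import java.time.Duration

// Hypothetical activities, each representing one side-effect of the business process.
@ActivityInterface
interface ExampleActivities {
    @ActivityMethod
    fun reserveFunds(accountId: String, amount: Long): String

    @ActivityMethod
    fun notifyLedger(reservationId: String)
}

@WorkflowInterface
interface ExampleWorkflow {
    @WorkflowMethod
    fun execute(accountId: String, amount: Long)
}

class ExampleWorkflowImpl : ExampleWorkflow {
    private val activities = Workflow.newActivityStub(
        ExampleActivities::class.java,
        ActivityOptions.newBuilder()
            .setStartToCloseTimeout(Duration.ofSeconds(30))
            .build()
    )

    override fun execute(accountId: String, amount: Long) {
        // Temporal records the result of each completed activity in the workflow history,
        // so a completed activity is never executed again on replay, and a failed one is
        // retried with exactly the same inputs.
        val reservationId = activities.reserveFunds(accountId, amount)
        activities.notifyLedger(reservationId)
    }
}
```

Using the incoming idempotencyKey as the Temporal workflow ID would additionally prevent a retried request from starting a second workflow execution for the same operation.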

While Temporal.io is capable of idempotent retries by design, there are a few downsides:

  • The flow becomes completely asynchronous and cannot return a synchronous response after the process has executed idempotently

  • Temporal.io does not offer enterprise support strong enough for us to make it a business-critical component. Hence, Temporal.io should not be used for business-critical flows for now.

Implementation approach - Phase 2 - Guardrails

If all clients behave correctly and idempotently, the Phase 1 approach works properly. However, if a client does not fully follow the rules and contains an error, this might have unwanted effects on downstream execution.

Hence, in addition to the entry points behaving idempotently on an idempotent retry, we need to properly handle broken idempotent retries. An idempotent retry is “broken” when:

  • it contains an idempotencyKey identical to that of a past request

  • it contains a payload which is different from the past request with the same idempotencyKey

  • the request arrived within the idempotency retention period (Period to be specified).

In such a case, the service would return the same response as for the old request, which might not make sense at all for the current request arriving at the service. As a result, the faulty client calling the API with a broken idempotent retry would combine data in its business logic which do not belong together (new request data with old response data), leading to an error downstream. This makes the code difficult to debug, as the original error is potentially not caught early in the execution.

To prevent broken idempotent retries from breaking the execution down the flow, the entry points should detect and validate such retries.

This detection can be done by:

  • storing the request payload

  • validating the original request payload against the current one whenever an idempotent retry is detected by matching the idempotencyKey within the retention period

If the request does not pass the validation, the entry point should reject it outright and the development team should look into fixing the broken client.
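A minimal sketch of such a guardrail, assuming the original request payload is stored as a SHA-256 hash next to the idempotencyKey; the names and the retention placeholder are illustrative:

```kotlin
import java.security.MessageDigest
import java.time.Duration
import java.time.Instant

// Thrown (and mapped to an error response) when a retry reuses an idempotencyKey
// with a different payload.
class BrokenIdempotentRetryException(key: String) :
    RuntimeException("idempotencyKey=$key was already used with a different payload")

// Hypothetical record of the original request, kept for the retention period.
data class RecordedRequest(val payloadHash: String, val receivedAt: Instant, val response: String)

fun sha256(payload: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(payload.toByteArray())
        .joinToString("") { "%02x".format(it) }

// The retention period is still to be specified; 24 hours is a placeholder.
val RETENTION_PERIOD: Duration = Duration.ofHours(24)

// Returns the stored response for a valid idempotent retry, null for a new request,
// and throws for a broken retry (same key, different payload, within retention).
fun validateRetry(
    previous: RecordedRequest?,
    idempotencyKey: String,
    payload: String,
    now: Instant = Instant.now()
): String? {
    if (previous == null || Duration.between(previous.receivedAt, now) > RETENTION_PERIOD) return null
    if (previous.payloadHash != sha256(payload)) throw BrokenIdempotentRetryException(idempotencyKey)
    return previous.response
}
```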