Introduction

This document describes the technical implementation of the SMS failover strategy. We currently have two SMS vendors, Infobip and Macrokiosk. Based on their delivery performance and reliability, Infobip is the primary delivery vendor and Macrokiosk is the secondary one. If the primary vendor fails to deliver an SMS to the customer’s mobile phone, we ask the secondary vendor to deliver it instead.

Data Flow

  1. At startup, output-manager, infobip-sms-gateway and macrokiosk-sms-gateway subscribe to their respective Kafka topics. output-manager also loads PENDING records from the database and starts a checker for each of them to wait for its delivery status.

  2. A caller asks output-manager to deliver an SMS message.

  3. output-manager saves [idempotencyKey, expireTimeInMillisecond, message, PENDING] to the database, publishes an SMS delivery request to topic commons.send-message-infobip-sms.command.v2, and then starts a checker that waits for the delivery status of this SMS. Note that idempotencyKey is part of the callback data sent to Infobip.

  4. infobip-sms-gateway consumes topic commons.send-message-infobip-sms.command.v2 and, for each new message, sends an SMS delivery request to Infobip via Infobip’s REST endpoint.

  5. Infobip tries to deliver the SMS to the customer’s mobile phone, and then notifies output-delivery-status-manager of the delivery status via a webhook/callback endpoint exposed by output-delivery-status-manager.

  6. output-delivery-status-manager receives the delivery status notification, extracts idempotencyKey and mobileNumber from the callback data, and publishes a message to topic commons.notification-delivery-status.event.v1.

  7. output-manager consumes topic commons.notification-delivery-status.event.v1. For each new delivery status message, it looks up the checker by idempotencyKey and, if one is found, cancels and deletes it. It then checks whether the database record for the given idempotencyKey and mobileNumber is PENDING. If it is, output-manager fetches the message from the record, marks the record as DONE, and publishes an SMS request to topic commons.send-message-macrokiosk-sms.command.v2.

  8. macrokiosk-sms-gateway consumes topic commons.send-message-macrokiosk-sms.command.v2 and, for each new message, builds and sends an SMS request to Macrokiosk using the latter’s REST endpoint.

  9. If a checker expires and there is an associated record in the database, output-manager deletes the checker, fetches the message from the database, builds and publishes the SMS request to topic commons.send-message-macrokiosk-sms.command.v2 directly, and then marks the database record as DONE.
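The output-manager side of steps 3, 7 and 9 can be sketched as follows. This is a minimal single-threaded illustration, not the actual service: OutputManagerSketch, SmsRecord, failOver and the publish callback are hypothetical names, the in-memory map stands in for the database, and the callback stands in for the Kafka producer. Only the topic names are taken from the steps above.

```kotlin
enum class Status { PENDING, DONE }

// Hypothetical in-memory shape of a database record (see step 3).
data class SmsRecord(
    val idempotencyKey: String,
    val expireTimeInMillisecond: Long,
    val message: String,
    val status: Status
)

const val INFOBIP_TOPIC = "commons.send-message-infobip-sms.command.v2"
const val MACROKIOSK_TOPIC = "commons.send-message-macrokiosk-sms.command.v2"

// `publish` is an assumed stand-in for the Kafka producer: (topic, idempotencyKey, message).
class OutputManagerSketch(private val publish: (String, String, String) -> Unit) {
    // Stand-in for the database table; suitable for single-threaded illustration only.
    private val records = mutableMapOf<String, SmsRecord>()

    // Step 3: persist the record as PENDING, then hand the SMS to the primary vendor.
    fun send(idempotencyKey: String, message: String, expireTimeInMillisecond: Long) {
        records[idempotencyKey] =
            SmsRecord(idempotencyKey, expireTimeInMillisecond, message, Status.PENDING)
        publish(INFOBIP_TOPIC, idempotencyKey, message)
        // The real service would also start a checker for this key here.
    }

    // Steps 7 and 9: hand the SMS to the secondary vendor, but only while the record is PENDING.
    // Returns true if this call performed the failover.
    fun failOver(idempotencyKey: String): Boolean {
        val record = records[idempotencyKey] ?: return false
        if (record.status != Status.PENDING) return false
        records[idempotencyKey] = record.copy(status = Status.DONE)
        publish(MACROKIOSK_TOPIC, idempotencyKey, record.message)
        return true
    }

    fun statusOf(idempotencyKey: String): Status? = records[idempotencyKey]?.status
}
```

Because failOver only acts on PENDING records, a second trigger for the same key (for example, an expired checker after a status message was already handled) publishes nothing.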

Key Considerations

Database Record

A database record consists of the following fields.

Field                    Type     Description
idempotency_key          String   idempotencyKey of the SMS message
expire_in_milliseconds   Long     time at which this record and its associated checker expire, in milliseconds
message                  String   the SMS message
metadata                 JSON     metadata map
status                   Enum     status of this record; one of PENDING, DONE

These records are kept for at most 7 days for debugging purposes.

Checker

Java’s DelayQueue is recommended for managing checkers. It is an unbounded, thread-safe blocking queue of Delayed elements, backed internally by a PriorityQueue (a binary min-heap) ordered by compareTo, so the element whose delay expires first is always at the head, and an element can be taken only once its delay has expired. Please refer to https://www.geeksforgeeks.org/delayqueue-class-in-java-with-example/ for details. A checker is an element in the DelayQueue; a rough implementation is shown below.

import java.util.concurrent.Delayed
import java.util.concurrent.TimeUnit

class DeliveryStatusChecker(
    val idempotencyKey: String,
    val expireInMillisecond: Long // absolute expiry time in epoch milliseconds
) : Delayed {
    // Remaining time until expiry; zero or negative means the checker has expired.
    override fun getDelay(unit: TimeUnit): Long {
        val remaining = expireInMillisecond - System.currentTimeMillis()
        return unit.convert(remaining, TimeUnit.MILLISECONDS)
    }

    // DelayQueue orders its elements with compareTo, so checkers are ordered by remaining delay.
    override fun compareTo(other: Delayed): Int =
        getDelay(TimeUnit.MILLISECONDS).compareTo(other.getDelay(TimeUnit.MILLISECONDS))
}
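To illustrate how DelayQueue treats such checkers, the sketch below inserts two checkers out of order and drains them. The checker class is repeated in compact form so that the snippet compiles on its own; the drainInExpiryOrder function, the keys and the 100 ms / 300 ms delays are made up for the illustration.

```kotlin
import java.util.concurrent.DelayQueue
import java.util.concurrent.Delayed
import java.util.concurrent.TimeUnit

// Compact copy of the checker, repeated so that this snippet is self-contained.
class DeliveryStatusChecker(
    val idempotencyKey: String,
    val expireInMillisecond: Long // absolute expiry time in epoch milliseconds
) : Delayed {
    override fun getDelay(unit: TimeUnit): Long =
        unit.convert(expireInMillisecond - System.currentTimeMillis(), TimeUnit.MILLISECONDS)

    override fun compareTo(other: Delayed): Int =
        getDelay(TimeUnit.MILLISECONDS).compareTo(other.getDelay(TimeUnit.MILLISECONDS))
}

// Returns whether the queue was still "empty" right after insertion (nothing expired yet),
// together with the order in which the checkers were released.
fun drainInExpiryOrder(): Pair<Boolean, List<String>> {
    val queue = DelayQueue<DeliveryStatusChecker>()
    val now = System.currentTimeMillis()
    queue.put(DeliveryStatusChecker("late", now + 300))
    queue.put(DeliveryStatusChecker("early", now + 100))

    // poll() does not block: it returns null while no checker has expired.
    val nothingExpiredYet = queue.poll() == null

    // take() blocks until the head has expired, so checkers come out in expiry order,
    // regardless of the order in which they were inserted.
    val order = listOf(queue.take().idempotencyKey, queue.take().idempotencyKey)
    return nothingExpiredYet to order
}
```

The insertion order does not matter: the checker with the earliest expiry time is always released first.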

Several things to note here.

A checker is uniquely identified by idempotencyKey, so we can cancel/delete a checker this way.

delayQueue.removeIf { it.idempotencyKey == "xxxxx" }

A dedicated thread/coroutine manages the DelayQueue. The getDelay method reports the remaining time of a checker; once it is less than or equal to 0, the checker has expired. The thread/coroutine calls DelayQueue.take() to obtain the next expired checker and then sends the SMS message to the secondary vendor. If no checker has expired yet, take() blocks the caller until one does.

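import kotlin.concurrent.thread

thread(start = true, isDaemon = true, name = "checkerManager") {
    while (true) {
        // blocks until the checker at the head of the queue has expired
        val checker = checkerQueue.take()
        // check database record status
        // if the record is PENDING, mark the record as DONE & send SMS to secondary vendor
    }
}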

Database records provide persistence across restarts/reboots. At startup, output-manager loads PENDING records and creates the corresponding checkers, so pending failovers are not lost.
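The three notes above (cancel by key, a dedicated consumer thread, and restoring checkers at startup) can be combined into one small component. The following is a sketch under assumptions: CheckerManager, Checker, restore, schedule, cancel and the onExpired callback are hypothetical names, and the map of key to expiry time stands in for the PENDING records loaded from the database.

```kotlin
import java.util.concurrent.DelayQueue
import java.util.concurrent.Delayed
import java.util.concurrent.TimeUnit
import kotlin.concurrent.thread

// Compact checker, repeated so that this sketch compiles on its own.
class Checker(val idempotencyKey: String, private val expireAtMillis: Long) : Delayed {
    override fun getDelay(unit: TimeUnit): Long =
        unit.convert(expireAtMillis - System.currentTimeMillis(), TimeUnit.MILLISECONDS)

    override fun compareTo(other: Delayed): Int =
        getDelay(TimeUnit.MILLISECONDS).compareTo(other.getDelay(TimeUnit.MILLISECONDS))
}

// `onExpired` is an assumed callback that would check the record and fail over if it is PENDING.
class CheckerManager(private val onExpired: (String) -> Unit) {
    private val queue = DelayQueue<Checker>()

    // Startup: one checker per PENDING record (idempotency key -> expiry in epoch milliseconds).
    fun restore(pending: Map<String, Long>) =
        pending.forEach { (key, expireAt) -> schedule(key, expireAt) }

    fun schedule(idempotencyKey: String, expireAtMillis: Long) {
        queue.put(Checker(idempotencyKey, expireAtMillis))
    }

    // Called when a delivery status arrives; returns true if a checker was removed.
    fun cancel(idempotencyKey: String): Boolean =
        queue.removeIf { it.idempotencyKey == idempotencyKey }

    // Dedicated daemon thread: take() blocks until the earliest checker has expired.
    fun start() = thread(start = true, isDaemon = true, name = "checkerManager") {
        while (true) onExpired(queue.take().idempotencyKey)
    }
}
```

A record whose expiry time passed while the service was down yields a checker with a non-positive delay, so take() releases it immediately after restart and the failover is not skipped.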

Multiple Kafka Topic Partitions & Multiple output-manager Instances

If several output-manager instances are running and the Kafka topic sms.delivery.status has more than one partition, the instance that published the SMS request and holds the ticking checker may not be the one that receives the delivery status message, because consumers in the same consumer group are assigned different partitions. To address this, an output-manager instance has to check the status of the database record, and sends the SMS to the secondary vendor only if the record is PENDING. In-memory databases such as Redis may be considered in the future if necessary.

How can a record be checked and updated in one atomic operation? The following SQL statement can be used.

UPDATE delivery_records SET status = 'DONE' WHERE idempotency_key = 'key value' AND status = 'PENDING'

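If a record was updated, the database returns an affected-row count of 1; otherwise it returns 0. An instance uses this value to decide whether it should resend the SMS. If the count is 1, this instance has claimed the record and should resend the SMS to the secondary vendor for delivery. If the count is 0, the record was not PENDING (for example, another instance has already handled it) and nothing should be sent.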
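The effect of this conditional update can be illustrated without a database. In the sketch below, an in-memory map stands in for the delivery_records table, and ConcurrentHashMap.replace(key, PENDING, DONE) plays the role of the UPDATE statement; DeliveryRecords, markDoneIfPending and raceForFailover are hypothetical names. With JDBC, the same decision would rest on the int returned by PreparedStatement.executeUpdate().

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger

enum class RecordStatus { PENDING, DONE }

// In-memory stand-in for the delivery_records table (an assumption for illustration).
class DeliveryRecords {
    private val status = ConcurrentHashMap<String, RecordStatus>()

    fun insertPending(idempotencyKey: String) {
        status[idempotencyKey] = RecordStatus.PENDING
    }

    // Mirrors UPDATE ... SET status = 'DONE' WHERE ... AND status = 'PENDING':
    // returns the number of "rows" changed, 1 or 0.
    fun markDoneIfPending(idempotencyKey: String): Int =
        if (status.replace(idempotencyKey, RecordStatus.PENDING, RecordStatus.DONE)) 1 else 0
}

// Simulates several output-manager instances racing for the same record
// and returns how many of them would resend the SMS.
fun raceForFailover(instances: Int): Int {
    val records = DeliveryRecords().apply { insertPending("key-1") }
    val resends = AtomicInteger()
    val startGate = CountDownLatch(1)
    val pool = Executors.newFixedThreadPool(instances)
    repeat(instances) {
        pool.execute {
            startGate.await() // release all workers at the same moment
            if (records.markDoneIfPending("key-1") == 1) resends.incrementAndGet()
        }
    }
    startGate.countDown()
    pool.shutdown()
    pool.awaitTermination(5, TimeUnit.SECONDS)
    return resends.get()
}
```

However many instances race, exactly one of them observes a count of 1, so the SMS is handed to the secondary vendor at most once.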

Kafka Transaction and Retrying

Publishing SMS delivery requests to Kafka topics may fail because of network disruption, congestion and similar problems, so Kafka transactions and retries are necessary. Please refer to https://micronaut-projects.github.io/micronaut-kafka/latest/guide/ for details. A common Kafka library that provides both is planned; until it is available, we enable them within this feature and will switch to the common library later.
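As an illustration of the retrying part only, a generic helper with exponential backoff could look like the sketch below. It is not the Micronaut Kafka API and does not cover transactions; retryWithBackoff and its parameters are hypothetical, and the action passed to it would wrap whatever publish call the service uses.

```kotlin
// Runs `action` up to `maxAttempts` times, doubling the pause after each failure.
// The attempt number (starting at 1) is passed to the action, e.g. for logging.
fun <T> retryWithBackoff(
    maxAttempts: Int,
    initialDelayMillis: Long,
    action: (attempt: Int) -> T
): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var lastError: Exception? = null
    var delayMillis = initialDelayMillis
    for (attempt in 1..maxAttempts) {
        try {
            return action(attempt)
        } catch (e: Exception) {
            lastError = e
            if (attempt < maxAttempts) {
                Thread.sleep(delayMillis) // back off before the next attempt
                delayMillis *= 2
            }
        }
    }
    // All attempts failed: surface the last error to the caller.
    throw checkNotNull(lastError)
}
```

A caller would wrap the publish step, for example retryWithBackoff(3, 200) { publishToKafka() }, where publishToKafka is a placeholder. Since a retried publish can produce duplicates, the PENDING/DONE check on the database record remains the guard against sending the same SMS twice.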

Attachments:

smsfailover.jpeg (image/jpeg)