Exponential backoff for AWS Lambda
August 19, 2019
I recently set up a Lambda function that reads data from an SQS Queue and makes an API call to one of our microservices. Naturally, this calls for an error handling mechanism, considering that the microservice could be down or unresponsive.
AWS Lambda provides its own retry mechanism for SQS. A message is picked up from the queue by the Lambda consumer and becomes invisible to other consumers for a specific duration called the visibility timeout. If the consumer completes execution successfully, the message is automatically deleted from the queue. In case of unsuccessful execution (such as a RuntimeException), the approximate receive count of the message is incremented and the message becomes available to consumers again once the visibility timeout passes. The number of times a message can be re-read from the queue before it is finally sent to a Dead Letter Queue (DLQ) is configured in the redrive policy of the SQS queue and is tracked via the approximate receive count.
This retry mechanism was not exactly what I had in mind for our use case. I was thinking along the lines of a backoff strategy that keeps retrying the API call with an exponentially increasing wait time and finally sends the message to a DLQ after a set number of retries. This would give us ample time to fix any issues with our microservice and prevent it from being bombarded with failing API calls.
This is what I ended up with:
First, a very basic Java function to calculate the exponential wait time, given the number of retries recvCount:
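Random rand = new Random(); // java.util.Random
int randomInt = rand.nextInt(60); // jitter: 0 to 59 seconds
// 2^recvCount + 30 seconds + jitter, capped at the 43200-second SQS maximum
Long result = Math.min((long) Math.pow(2, recvCount) + 30 + randomInt, 43200L);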
Notice the addition of randomInt. That is ‘jitter’: a bit of randomness, so that messages that failed at the same time do not all come back for a retry at the same moment. I read about it in Google Cloud documentation and included it as good practice.
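To make the calculation easier to try on its own, here is a minimal, self-contained sketch of the same formula. The class name, method name, and constants are hypothetical and not part of my utility class; the only facts it relies on are the formula above and the 43200-second SQS limit. It uses a bit shift instead of Math.pow and clamps the exponent, which is one possible way to avoid overflow for large receive counts.

```java
import java.util.Random;

// Hypothetical helper: names are illustrative, not taken from the original utility class.
final class BackoffSketch {
    static final long MAX_VISIBILITY_TIMEOUT_SECONDS = 43_200L; // 12 hours, the SQS maximum
    static final long BASE_DELAY_SECONDS = 30L;
    static final int JITTER_BOUND_SECONDS = 60; // jitter is drawn from 0..59

    // Returns 2^recvCount + 30 + jitter seconds, capped at the SQS maximum.
    static long visibilityTimeoutSeconds(int recvCount, Random rand) {
        // Clamp the exponent: 2^16 already exceeds the cap, so larger values add nothing.
        int exponent = Math.max(0, Math.min(recvCount, 16));
        long exponential = 1L << exponent;
        long jitter = rand.nextInt(JITTER_BOUND_SECONDS);
        return Math.min(exponential + BASE_DELAY_SECONDS + jitter,
                        MAX_VISIBILITY_TIMEOUT_SECONDS);
    }
}
```

With these numbers, a receive count of 1 gives a wait between 32 and 91 seconds, a receive count of 10 gives between 1054 and 1113 seconds, and from a receive count of 16 onward the result is pinned at 43200 seconds.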
Next up, set the visibility timeout of the message to the value that we just calculated above. The maximum value allowed by AWS is 43200 seconds or 12 hours.
sqs.changeMessageVisibility(queueUrl, msg.getReceiptHandle(), newVisibilityTimeout.intValue());
Finally, we check the response to our API call. If it is a 400- or 500-series response, we change the visibility timeout of the message and then throw a RuntimeException. This is the easiest way I could come up with to signal unsuccessful execution of the Lambda function. Plus, we can only throw unchecked exceptions in our handler method.
...
// api call
...
if (response.getStatusLine().getStatusCode() >= 400) {
    new ExponentialBackoff().setVisibilityTimeout(msg);
    throw new RuntimeException("Request to server failed");
}
ExponentialBackoff is my utility class where the code that calculates and sets the visibility timeout lives. It also has some other utility functions that are not essential for this demonstration.
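For readers who want to see the failure path end to end, the following is a rough sketch under some assumptions, not the actual ExponentialBackoff class. Every name in it is hypothetical. The VisibilityChanger interface stands in for the SQS client so that the sketch runs without the AWS SDK, and the receive count is read from the message's ApproximateReceiveCount attribute, which I am assuming is available as a string in the message's attribute map. Jitter is left out to keep the result deterministic.

```java
import java.util.Map;

// All names here are hypothetical stand-ins so the sketch runs without the AWS SDK.
final class FailureHandlingSketch {

    // Stand-in for the SQS client's changeMessageVisibility call.
    interface VisibilityChanger {
        void change(String receiptHandle, int timeoutSeconds);
    }

    // Reads the receive count from the message attributes, defaulting to 1 if absent.
    static int receiveCount(Map<String, String> attributes) {
        return Integer.parseInt(attributes.getOrDefault("ApproximateReceiveCount", "1"));
    }

    // On a 4xx/5xx status, push the visibility timeout out first,
    // then throw so the invocation is treated as failed.
    static void handleResponse(int statusCode, String receiptHandle,
                               Map<String, String> attributes, VisibilityChanger changer) {
        if (statusCode >= 400) {
            int exponent = Math.max(0, Math.min(receiveCount(attributes), 16));
            // 2^recvCount + 30 seconds, capped at the 43200-second SQS maximum (no jitter here).
            long timeout = Math.min((1L << exponent) + 30, 43_200L);
            changer.change(receiptHandle, (int) timeout);
            throw new RuntimeException("Request to server failed");
        }
    }
}
```

In a real handler, the VisibilityChanger would simply wrap the sqs.changeMessageVisibility(queueUrl, receiptHandle, timeout) call shown earlier. The order matters: the timeout has to be changed before the exception is thrown, because nothing after the throw will run.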
There you have it: a bare-bones exponential backoff implementation for AWS Lambda.