When implementing event-driven architectures based on queues, it is essential to monitor that the pipeline is functioning.
For AWS queues built on the SQS service, you can create these three CloudWatch alarms to help ensure everything is working correctly:
Message production
Consumer inactive: there are messages in the queue, but the consumer is not retrieving them
Presence of messages in the dead letter queue (DLQ)
In my case, each alarm sends a notification to an SNS topic. A Lambda subscribed to that topic forwards the notification to a Slack channel through a webhook, so I find out as soon as something goes wrong.
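As a rough sketch of that wiring, the Lambda can be subscribed to the topic directly in serverless.yml. The function name `slackNotifier`, the handler path, and the `SLACK_WEBHOOK_URL` environment variable are hypothetical names chosen for illustration. The snippet also assumes the topic already exists and that its ARN is stored in `custom.snsTopicArn`, the same variable the alarms use in their `AlarmActions`.

```yaml
functions:
  slackNotifier:                          # hypothetical function name
    handler: src/slackNotifier.handler    # hypothetical handler that posts to Slack
    environment:
      # Assumption: the webhook URL is provided through an environment variable
      SLACK_WEBHOOK_URL: ${env:SLACK_WEBHOOK_URL}
    events:
      # Subscribe the function to the existing SNS topic that receives the alarms
      - sns:
          arn: ${self:custom.snsTopicArn}
```

The handler itself is not shown here. It only needs to read the SNS message from the event and send it to the webhook URL.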
📤 Message production
In some cases, the production of messages follows unpredictable patterns, so you might not produce any messages for a long time.
In other scenarios, messages are produced continuously, as in my case, where I used an SQS queue in a Change Data Capture (CDC) pipeline to propagate updates from one database to another.
It can be useful to check that messages are produced consistently, so that you catch problems upstream in the process. Be aware that this alarm can raise false positives when a quiet period is legitimate, so it's up to you to find the right threshold based on how your architecture works.
You can use the NumberOfMessagesSent metric that AWS provides for each queue you create.
With the Serverless Framework, here is the snippet you can add to your serverless.yml file:
Resources:
  MySQLChangesMaxwellMainQueueMessagesSentAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: MainQueueMessagesSentAlarm-${self:provider.stage}
      AlarmDescription: Triggered when no messages are sent to the main queue for a prolonged period.
      MetricName: NumberOfMessagesSent
      Namespace: AWS/SQS
      Dimensions:
        - Name: QueueName
          Value: ${self:custom.Queue.name}
      Statistic: Sum
      Period: 1800 # 30 minutes
      EvaluationPeriods: 3 # 3 consecutive datapoints (1 every 30 minutes)
      DatapointsToAlarm: 3 # all the last 3 datapoints must verify the condition
      Threshold: 0
      ComparisonOperator: LessThanOrEqualToThreshold
      AlarmActions:
        - ${self:custom.snsTopicArn}
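One detail that may be worth checking in your setup: SQS only publishes metrics for queues that CloudWatch considers active (a queue stays active for up to six hours after it last contained messages or was accessed by any action). If the producer and the consumer both stop completely, the metric can go missing instead of reporting 0, and the alarm may move to INSUFFICIENT_DATA rather than ALARM. A possible variant, shown here only as a fragment of the alarm above, treats missing datapoints as breaching. Whether that suits you depends on how often your queue is legitimately idle.

```yaml
Resources:
  MySQLChangesMaxwellMainQueueMessagesSentAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      # ...keep every property from the alarm above unchanged...
      # Assumption: a missing datapoint should count as "no messages sent"
      TreatMissingData: breaching
```

The other accepted values are `notBreaching`, `ignore`, and `missing` (the default).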
😴 Consumer inactive
A process failure or other issues could occur, preventing the consumer from retrieving messages even if your queue contains them.
You can use the ApproximateNumberOfMessagesVisible and NumberOfMessagesReceived metrics together. If there are messages in the queue (ApproximateNumberOfMessagesVisible is greater than 0) but no one is retrieving them (NumberOfMessagesReceived is 0), there is a problem with the consumer, and you should be notified.
Resources:
  ConsumerNotReceivingMessagesAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ConsumerNotReceivingMessagesAlarm-${self:provider.stage}
      AlarmDescription: Triggered when there are messages in the queue but no messages are being consumed.
      Metrics:
        - Id: visibleMessages
          ReturnData: false
          MetricStat:
            Metric:
              Namespace: AWS/SQS
              MetricName: ApproximateNumberOfMessagesVisible
              Dimensions:
                - Name: QueueName
                  Value: ${self:custom.Queue.name}
            Period: 300 # 5 minutes
            Stat: Sum
        - Id: receivedMessages
          ReturnData: false
          MetricStat:
            Metric:
              Namespace: AWS/SQS
              MetricName: NumberOfMessagesReceived
              Dimensions:
                - Name: QueueName
                  Value: ${self:custom.Queue.name}
            Period: 300 # 5 minutes
            Stat: Sum
        - Id: expression
          ReturnData: true # only ONE entry may return data: the alarm evaluates this expression
          Expression: "visibleMessages > 0 AND receivedMessages == 0"
          Label: IsConsumerNotReceivingMessages
      ComparisonOperator: GreaterThanThreshold
      Threshold: 0
      EvaluationPeriods: 3 # 3 consecutive datapoints (1 every 5 minutes)
      DatapointsToAlarm: 3 # all the last 3 datapoints must verify the condition
      AlarmActions:
        - ${self:custom.snsTopicArn}
📬 Dead letter not empty
The DLQ retains messages that the consumer can't process. With SQS, you can set a retention period of up to 14 days for these messages. After this period they are deleted, so it's crucial to notice them before they are lost.
You can use the ApproximateNumberOfMessagesVisible metric to check if there are any messages in the DLQ.
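For context, here is a minimal sketch of how a main queue and its DLQ could be declared, assuming both are created in the same serverless.yml. The logical IDs `MainQueue` and `DeadLetterQueue` and the `maxReceiveCount` value are illustrative assumptions, not taken from my actual setup.

```yaml
Resources:
  MainQueue:                        # hypothetical logical ID
    Type: AWS::SQS::Queue
    Properties:
      QueueName: ${self:custom.Queue.name}
      RedrivePolicy:
        # Messages that fail processing maxReceiveCount times move to the DLQ
        deadLetterTargetArn:
          Fn::GetAtt: [DeadLetterQueue, Arn]
        maxReceiveCount: 5          # assumption: tune to your retry policy
  DeadLetterQueue:                  # hypothetical logical ID
    Type: AWS::SQS::Queue
    Properties:
      QueueName: ${self:custom.DLQ.name}
      MessageRetentionPeriod: 1209600 # 14 days in seconds, the SQS maximum
```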
Resources:
  DLQNotEmptyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: DLQNotEmptyAlarm-${self:provider.stage}
      AlarmDescription: Triggered when the DLQ has at least one message.
      MetricName: ApproximateNumberOfMessagesVisible
      Namespace: AWS/SQS
      Dimensions:
        - Name: QueueName
          Value: ${self:custom.DLQ.name}
      Statistic: Sum
      Period: 300 # 5 minutes
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - ${self:custom.snsTopicArn}
And that’s it for today! If you are finding this newsletter valuable, consider doing any of these:
👥 Follow me on LinkedIn.
📣 Provide your feedback: please share your opinions or suggestions for improving the newsletter. Your input helps us adapt the content to your tastes.
I wish you a great day! ☀️
Marco