Benchmarked Serverless observability tools, got very disappointed

05 March, 2019

Observability is the ability to troubleshoot unknown issues that might arise in your application. If you are not familiar with it, I recommend watching How to Build Observable Distributed Systems and The Present and Future of Serverless Observability from QCon 2018.

In this article, I explain how some of the most prominent serverless observability tools¹ performed against my test scenarios, and I give an overview of each tool’s pros and cons. I tested these tools against my Node.js-based serverless app, which is deployed on AWS and uses Proxy Integration. You can find the code on my GitHub.

P.S.: Since I published this post, some vendors report that they have improved on the negative points and test results described here. I’m planning to write a revised blog post that tests their claims and reflects their changes. Until then, please read their comments at the end of this article to learn about their claimed improvements.

Test Scenarios

I tested all of these tools against three scenarios, using Gatling for load testing:

  • Test Scenario 1: The Lambda function times out because of a high number of concurrent users and repetitions. As a reminder, Lambda’s built-in “Throttles” metric currently does not show timeout errors.
  • Test Scenario 2: The Lambda function’s response doesn’t have the proper format: the “body” property hasn’t been stringified. This error can easily happen through negligence.
  • Test Scenario 3: DynamoDB throws ConditionalCheckFailedException because the app tries to write an item with a duplicate value for the partition key.
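
Scenario 3 can be sketched roughly as follows. This is an illustration rather than the code from the test app: it assumes the write is guarded by a condition expression such as `attribute_not_exists`, because a plain DynamoDB put overwrites an existing item instead of failing. The table name `Items`, the key `id`, the `createItem` helper and the in-memory `fakeDocumentClient` (a simplified stand-in for a real DynamoDB client, so the sketch runs without AWS) are all hypothetical.

```javascript
// In-memory stand-in for a DynamoDB client; it only mimics the conditional-put behaviour.
const fakeDocumentClient = {
  items: new Map(),
  async put(params) {
    const key = params.Item.id;
    const mustNotExist = params.ConditionExpression === "attribute_not_exists(id)";
    if (mustNotExist && this.items.has(key)) {
      const error = new Error("The conditional request failed");
      // Assumption: how the real SDK exposes this name depends on the SDK version.
      error.name = "ConditionalCheckFailedException";
      throw error;
    }
    this.items.set(key, params.Item);
  },
};

// Writes an item only if no item with the same partition key exists yet.
async function createItem(client, item) {
  const params = {
    TableName: "Items", // hypothetical table name
    Item: item,
    ConditionExpression: "attribute_not_exists(id)",
  };
  try {
    await client.put(params);
    return { created: true };
  } catch (error) {
    if (error.name === "ConditionalCheckFailedException") {
      // Duplicate partition key: log it explicitly so it is visible in the logs.
      console.error(JSON.stringify({ level: "error", errorType: error.name, id: item.id }));
      return { created: false };
    }
    throw error;
  }
}

async function main() {
  console.log(await createItem(fakeDocumentClient, { id: "42", name: "first" }));
  console.log(await createItem(fakeDocumentClient, { id: "42", name: "duplicate" }));
}

main();
```

The second call hits the duplicate key, so an observability tool that passes this test should surface the logged exception for that invocation.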

In scenarios 1 and 2, the user receives 502 Bad Gateway and the vague response “internal server error”. That’s why a proper observability tool is needed to troubleshoot these cases, especially in a large distributed application.
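
To make scenario 2 concrete, here is a minimal sketch of the difference between the faulty and the correct response, assuming a Node.js handler behind Proxy Integration. The handler names and the `isValidProxyResponse` helper are illustrative and not taken from the test app; the check only covers the two properties relevant here.

```javascript
// Hypothetical handlers illustrating test scenario 2.

// Faulty: "body" is a plain object, which is the mistake that leads to the 502.
const brokenHandler = async () => ({
  statusCode: 200,
  body: { message: "item saved" },
});

// Correct: "body" is serialised with JSON.stringify().
const fixedHandler = async () => ({
  statusCode: 200,
  body: JSON.stringify({ message: "item saved" }),
});

// Illustrative check: an integer statusCode and, if present, a string body.
function isValidProxyResponse(response) {
  return (
    response !== null &&
    typeof response === "object" &&
    Number.isInteger(response.statusCode) &&
    (response.body === undefined || typeof response.body === "string")
  );
}

async function main() {
  console.log("broken:", isValidProxyResponse(await brokenHandler()));
  console.log("fixed:", isValidProxyResponse(await fixedHandler()));
}

main();
```

Neither handler throws, which is why this failure leaves no exception behind for a tool to pick up.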

A tool passes a test if it detects the problem and shows it to the admin, which enables faster troubleshooting. If a tool fails to meet these criteria, the result is marked as failed. Some tools partly passed a test; in those cases I explain their behaviour.

AWS X-Ray



  • Test 1: Partly passed.

    There has been an improvement in the last few weeks: in case of a timeout, X-Ray now shows that there is an error, but it still doesn’t clarify what the problem is. The user is expected to guess it or to troubleshoot it with other tools. The UI is also confusing: in one place it indicates that there is an error, but a few lines below it indicates that there is no error (pic1). Apparently, this improvement is still under development.

  • Test 2: Failed.

    Everything looks OK according to X-Ray, even though the end user gets an error.

  • Test 3: Passed.

    It indicates that there is an error and also shows the exception stack trace.

Pic1: X-Ray doesn’t clarify what the error type is. The UI is also contradictory: hovering over the clock icon shows “no faults or error”.

Pros


  • Lambda has a built-in agent for X-Ray. The agent sends data in batches and asynchronously, so using it doesn’t add extra latency to your function.
  • Responsive support team. You can communicate with them in the AWS Developers forum.
  • Managed service by AWS, so it’s supposed to be richer in functionality and more compliant with AWS best practices.

Cons


  • Confusing UI: Besides pic1, look at pic2: even though I chose to see only faulty traces, I see a confusing “200” response for each trace. The 200 response indicates that the X-Ray service returned a response; it doesn’t mean that the trace was successful. This is not what most users expect to see, and it can lead to wrong interpretations. Yan Cui has addressed this in his blog post aws x-ray and lambda : the good, the bad and the ugly. Worse, this issue hasn’t been solved even though a year has passed.
  • Slow-paced integration: Only a few AWS services are integrated with X-Ray so far. If you are using DynamoDB or S3, X-Ray gives you only inferred segments (which means a lack of detail), because at the time of writing those services haven’t been actively integrated with X-Ray.
  • Immature: There is still room for improvement. For example, more functionality is needed for better debugging, especially an easier and neater way to include custom errors. Apparently, they are working on making the Lambda-generated segment accessible.

Dashbird



  • Test 1: Passed.
  • Test 2: Failed.
  • Test 3: Partly passed.

    Dashbird’s UI is unclear and can be confusing: at a high level it shows the trace as successful, but digging into the trace reveals the error (pic3). Surprisingly, this differs from X-Ray’s result, even though Dashbird’s trace is based on X-Ray. The behaviour can be justified by the argument that “an exception might not necessarily mean the function has failed.” This is true, but if there is an exception in the trace, the user should at least be notified of it (e.g. in the high-level view); showing just a pretty green “success” on the trace is misleading. In my opinion, an exception usually means that something is wrong and worth investigating, unless it has been deliberately caught and handled.

Pros


  • Is very easy to set up: Dashbird’s CloudFormation template does almost everything. Once the CloudFormation stack is set up, Dashbird starts observing all your functions in all regions (this can be bad from a security point of view; I address this in the Cons section).
  • Has a nice UI and some nice features such as live tailing, which enables real-time monitoring of specific functions, as well as alerting.
  • Gets its data from CloudWatch logs and AWS X-Ray, so it doesn’t add extra latency to your functions.
  • Has friendly and supportive customer service.
  • Shows logs from the whole function execution, not just the error. This makes debugging easier (pic4).

Cons


  • Documentation is outdated and misleading: the “getting started” section asks you to set up a specific IAM policy, but this policy has no effect and is there only because the documentation hasn’t been updated.
  • Provides few, basic and somewhat misleading statistics, e.g. average duration and average memory usage. An average is a misleading measure for web performance analysis because it says little about the slow requests that some users actually experience, so the team needs to provide percentile-based statistics.
  • Shows cold starts, but only individually. Aggregate cold start statistics are needed.
  • Inherits the limitations of CloudWatch (e.g. granularity, delay) and X-Ray, because it’s based on them.
  • Has a major security concern: it has access to all your data, and you cannot limit that. You can filter data in the Dashbird app, but this doesn’t stop it from receiving your data.
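
To illustrate the point about averages, here is a small sketch with made-up invocation durations (these are not measurements from my tests). It uses the nearest-rank method for percentiles, which is only one of several common definitions.

```javascript
// Made-up durations in milliseconds: 18 fast invocations and 2 slow ones.
const durationsMs = [
  100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
  100, 100, 100, 100, 100, 100, 100, 100, 3000, 3000,
];

function mean(values) {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Nearest-rank percentile: the smallest value with at least p% of the data at or below it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

console.log("mean:", mean(durationsMs)); // 390
console.log("p50:", percentile(durationsMs, 50)); // 100
console.log("p95:", percentile(durationsMs, 95)); // 3000
```

The mean of 390 ms describes neither the typical invocation (100 ms) nor the slow ones (3000 ms), whereas the two percentiles show both.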


Thanks to Taavi Rehemägi, co-founder of Dashbird, for extending my trial period and enabling me to investigate their SaaS.

Thundra



  • Test 1: Failed.
  • Test 2: Failed.
  • Test 3: Failed.

Pros


  • Is easy to set up.
  • Instruments the code, which enables Thundra to provide a deep technical overview. This can also be an alternative to AWS X-Ray.
  • Publishes data asynchronously, so it doesn’t add extra latency to function execution time.
  • Has informative diagrams and dashboards. It also provides a clear list of functions with their statistics (including the total number of cold starts each function has experienced).
  • Takes a more conservative approach to security than e.g. Dashbird: it doesn’t need access to all your data.
  • Is based on a good idea, and it’s great that there is at least one alternative to X-Ray.
  • Has friendly and supportive customer support.

I talked with its product manager about support for Node.js apps. Apparently, Thundra is focused on Java applications, and its Node.js features are far behind. I haven’t had time to investigate its Java features, but if you are running a Java-based serverless app, I recommend taking a look at Thundra.

Cons


  • Poor support for Node.js. At the time of writing, I don’t see any convincing reason to observe my Node.js app with Thundra.
  • Statistics are based on averages, which are misleading.

IOpipe



  • Test 1: Passed.
  • Test 2: Failed.
  • Test 3: Failed.

Pros


  • Provides a few percentile-based statistics.
  • Is easy to set up.
  • Has an alerting feature.
  • Has search functionality: you can search through your invocations by different keywords, e.g. the requestId from CloudWatch. However, my rough initial guess is that, to troubleshoot a complex distributed application, you may need a more sophisticated and comprehensive logging tool than IOpipe’s. This, of course, depends on your use case.

Cons


  • In its current approach, IOpipe sends data to its own system synchronously. This adds extra latency to your function execution time and goes against the best practice described in Serverless Architectures with AWS Lambda: “Capture the metric within your Lambda function code and log it using the provided logging mechanisms in Lambda.” IOpipe’s approach has been investigated further in Tips and tricks for logging and monitoring AWS Lambda functions as well as in Dashbird vs Datadog vs IOpipe. This is a serious concern, which is why I don’t recommend using IOpipe until they resolve it. Their team seems to be working on an asynchronous and optimised alternative. Let’s wait for the result!
  • Doesn’t show statistics for cold starts.
  • Doesn’t provide tracing.
  • Shows logs only for the error stack trace, which might not be very convenient. It would be helpful if it showed logs from the whole function execution, similar to what Dashbird does.

Monitoring with CloudWatch


To detect errors, you can act proactively and use monitoring, instead of or in combination with observability tools. I advise using CloudWatch. Lambda has a built-in agent that sends logs to CloudWatch, and using it doesn’t add extra latency to your functions, unless you publish custom metrics.

To do this in an optimised way, you can create a Metric Filter. For example, for test scenario 1, your Metric Filter can have a filter pattern such as “Task timed out after”. The Metric Filter then searches your log events and, whenever it finds a match, increments the value of the corresponding CloudWatch metric. You can then set a CloudWatch alarm on that metric and publish it, e.g. via SNS. To get the full potential of CloudWatch, you can also use structured logging with JSON.
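
As a rough sketch of such a Metric Filter: the object below follows the request shape of the CloudWatch Logs PutMetricFilter API, and the `matchesTimeout` helper only imitates the filter locally with a plain substring match, so the example runs without AWS. The log group, filter name, metric name and namespace are hypothetical, and the sample log lines are made up to resemble a Lambda timeout message.

```javascript
// Request shape for the CloudWatch Logs PutMetricFilter API (all names are hypothetical).
const metricFilterParams = {
  logGroupName: "/aws/lambda/my-function",
  filterName: "lambda-timeouts",
  filterPattern: '"Task timed out after"', // quoted so the phrase is matched as a whole
  metricTransformations: [
    {
      metricName: "LambdaTimeouts",
      metricNamespace: "MyApp",
      metricValue: "1", // each matching log event adds 1 to the metric
    },
  ],
};

// Local imitation of the filter: a substring match on the quoted phrase.
function matchesTimeout(logLine) {
  const phrase = metricFilterParams.filterPattern.replace(/"/g, "");
  return logLine.includes(phrase);
}

// Made-up log lines for illustration.
const sampleLogLines = [
  "START RequestId: 1111 Version: $LATEST",
  "2019-03-05T10:00:06.000Z 1111 Task timed out after 6.01 seconds",
  "END RequestId: 1111",
];

const timeouts = sampleLogLines.filter(matchesTimeout).length;
console.log(`timeouts counted: ${timeouts}`);
```

In a real setup, parameters of this shape would be submitted to CloudWatch Logs (for example through the AWS SDK or a CloudFormation template), and the alarm would be defined on the resulting metric.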


Conclusion

I haven’t had time to investigate every serverless observability tool. However, based on my investigation of the most prominent ones, all of them are immature or incomplete at some level and need to improve. There is no single solution that lets you thoroughly observe your distributed app.

Surprisingly, some issues, like test scenario 2, haven’t been addressed by any of the solutions (not even in CloudWatch logs). Yet that error can easily happen through negligence: I encountered it when a friend was wondering why his end users just got an error and asked me to debug his app. Everything looked OK, with no error or exception, but the end user was getting an error. After debugging for around an hour, the only thing that came to my mind was the output format. And I was right: he had forgotten to JSON.stringify() the body property of his function’s output, and AWS Proxy Integration was failing silently. It was a simple application, but imagine this happening in a big, complex distributed app. How are you supposed to find it?

To achieve observability, you need to use different solutions in tandem and complement them with deep monitoring through structured logging. Pierre Vincent addressed this in his QCon presentation How to Build Observable Distributed Systems.
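
As a minimal sketch of structured logging, assuming a Node.js Lambda function whose console output ends up in CloudWatch logs: each log call emits one JSON object per line, so individual fields can be searched later. The field names (`level`, `message`, `requestId`) and the helpers `formatLogEntry`, `log` and `processEvent` are illustrative choices, not a standard.

```javascript
// Builds one JSON log line; extra fields are merged into the entry.
function formatLogEntry(level, message, fields = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...fields,
  });
}

function log(level, message, fields) {
  console.log(formatLogEntry(level, message, fields));
}

// Hypothetical business logic that can fail.
function processEvent(event) {
  if (!event.path) throw new Error("missing path");
  return { ok: true, path: event.path };
}

// Hypothetical handler: every entry carries the request ID so related lines can be correlated.
async function handler(event, context) {
  const requestId = context.awsRequestId;
  log("info", "request received", { requestId, path: event.path });
  try {
    const result = processEvent(event);
    return { statusCode: 200, body: JSON.stringify(result) };
  } catch (error) {
    log("error", "request failed", { requestId, errorType: error.name, stack: error.stack });
    throw error;
  }
}

// Local invocation with a fake event and context.
handler({ path: "/items" }, { awsRequestId: "local-test" }).then((response) => {
  log("info", "response sent", { statusCode: response.statusCode });
});
```

With entries in this form, a CloudWatch filter pattern on a JSON field, such as `{ $.level = "error" }`, can select only the error entries.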

My three test scenarios are just examples. Which other issues and errors do you think should be a priority to observe? Do you know of any other tool that outperforms the ones above? What’s your opinion on the current state of serverless observability IN PRACTICE?


  1. This is just my opinion.