At this point TraceView should be installed and traces should be flowing into your dashboard. You should also be familiar with the major features after having gone through the tour traceview section. What now?

Tracing overhead is proportionate to the work done in generating a page, effectively minimizing the impact on user experience. Optimal configuration comes from tracing only a fraction of incoming requests, see smart tracing. In a sampled environment, average overhead is lower—tracing selected requests means that the average trace overhead is reduced proportionately.

Using varnish or a similar full page html cache will further reduce tracing overhead by preventing requests from hitting the application server.

An example Trace

How much of your application stack you want to trace is entirely up to you. TraceView can interpret a single trace from just a web server hosting static content, or right through a load balancer into one of many app servers, through to an API call, and back again. Where a trace starts and how it flows through your stack depends on settings within each instrumentation component, primarily the ‘trace mode’ parameter.

This diagram illustrates the most basic form of tracing, which is collecting only web server statistics. The web server starts traces when the trace mode is set to ‘always’, but the traces start and end there.


But if you’re already monitoring server stats and would rather get some insight into your application code you can do that too. The resulting traces don’t offer much insight.


This diagram shows a stack with Trace Mode configured to start tracing only at the PHP layer. Here the PHP layer is set to trace mode ‘always’ and traces begin there and flow out into the supported library calls.


Somewhat more interesting, now we’re getting some insight into the application.

Going deeper

When TraceView really starts to shine though is when you trace your whole stack end to end! Here’s a fully instrumented LAMP stack configured to pass traces all the way along. Apache starts the traces with trace mode ‘always’ and php continues them, never starting its own by having trace mode set to ‘through’.


Now we’re starting to see the big picture, we’ve even captured an error!


The bottom line for configuring traces through your stack is that wherever you want traces to start the instrumentation should be configured with Trace Mode “always” and at each point beyond you want traces to flow into the instrumentation should be configured with Trace Mode “through”.
Now that we’ve gained some insight into a single host LAMP stack let’s start thinking bigger. Big sites run on multiple hosts and make tie together data from multiple services, and the added complexity can make finding bottlenecks that much more difficult. Here’s a web app making an RPC call to an authentication service. Trace data is starting to pile up!


If you thought the diagram had a lot of data, check out the dashboard for all this infrastructure.


Breaking it down

Now we’re at the point where separating traffic into individual App definitions can turn complexity into visibility. Moving from generic concepts to specifics, the above example diagram shows two particular web stacks, an Apache fronted public facing web server powered by a PHP web app with its own database, and a back end authentication service hosted by nginx with a wsgi interface to a python app using a separate user database. By selecting Apache as the entry point for the front end App and nginx as the entry point for the authentication App I’ve moved the traces from these apps from my Default App to two custom Apps.


Here is the resulting compartmentalized App dashboard for the parent app, in this case ProductionWeb.


Notice that all the infrastructure associated with the back end authentication service has been bundled into one layer with our App icon. Clicking that layer will take you straight to the dashboard for the AuthService and all of it’s components.


When several services are combined to power a front end app and that diagram above get’s even larger, compartmentalizing services this way makes finding latency culprits much easier. From this perspective that diagram starts to look like this.

Now we’re down to the important question. We have two web servers. We have traces moving all over the place. What should the trace mode be on the back end App? In nearly every situation where a trace is starting in a service which will make calls to another instrumented service, that back end service should be set to trace mode ‘through’. When debugging custom instrumentation, or when testing / staging new code flipping this to ‘always’ is sometimes useful, but if you’re not sure then default to ‘through’.


Rooting out performance issues

Here are some posts from our blog on how to find and attack common problems in complex web applications.

Responding to production incidents

A successful trip to Traceview is one where you come back with traces. When we talk about bringing back traces, we’re talking about the trace details page. The reason this is the end result you’re looking for is that you can take the link to that page and throw it into an issue tracker or you can send it to a colleague, and they’ll immediately have everything they need to understand the problem you’re experiencing. In some cases the trace details page also includes error data, which you can use to dig into logs and figure out what happened.

If you get into TraceView and you don’t come back with a trace details page, you might have missed what you went in there looking for. Many times people will go to TraceView with no specific question in mind. A common pattern is that someone mentions that something is slow, you wander around until you find some traces that qualify, and then take those traces to someone else and say ‘here are the things that didn’t perform very well, please take care of this.’ You can consider this approach to be basically a fishing trip. During an incident you’re not on a fishing trip, you are on a hunting trip, and you are after a specific problem. It’s not that fishing is a bad approach. You’re findings might very well be noteworthy, just not during incident response.

In the this article what we’re going to walk through some ways to go on that hunting trip:

  1. Disproportionate layer growth
  2. Layers stop moving together
  3. Variations in table data

Disproportionate layer growth

Your pager has gone off and something has gone horribly wrong. You’ve come to TraceView and now you are not sure where to look next. The easiest thing to note is something we call ‘disproportionate layer growth’. Your typical entry point into TraceView is the app server page where your eyes are probably drawn immediately to the layer breakdown. What you’ll notice nearly every time you look at a layer breakdown is that all the peaks and troughs generally move together. When one moves non-uniformly with the others, it is usually an indicator that something interesting is going on there. Of course your pager has gone off so you’ve probably got a timestamp you’ll want to zoom in on, but if you are looking at a time range scratching your head and wondering what to look for, disproportionate layer growth is one of those things.


Now, we’re going to apply some filters. When you do this yourself, open up the results in new tabs so you can compare the views. This will make the differences pop out. First, we’ll filter down to just the layer of interest. You notice that the spike is nowhere near as interesting as it was in comparison to the other layers. So it’s really the delta between the other layers that makes it of note. The other thing you’ll notice though is the bar chart under the stacked area chart. Whenever a particular layer is isolated, the bar chart indicates number of times that layer was executed in a given time range. You can see in this example that there’s a huge spike in trace requests. These are events are a SOAP-client making outbound calls.


1, 2, 3 workflow

It’s no coincidence that the charts on the app server page are organized the way they are: layer breakdown, heatmap, view traces. It’s designed to be what we call a 1, 2, 3 workflow, and it’s how you’re going to get from looking at a bunch of layers to looking at individual traces. Starting from the layer breakdown, step one is to filter to a layer, which we’ve already done.
Number two is to go look at the heatmap, and right away you’ll see a cluster of requests. We’re going to draw a bounding box around that, which applies a filter, and move on to the third step where the results are pared down based on the box we drew.
On the view traces tab, notice that although we selected several events there’s only a single trace. This is the difference between looking at inbound requests to the app and looking at events that happen inside of a trace. All of these SOAP calls happened in a single request; in other words, one call to this web app incurred 207 outbound SOAP calls. Next we’ll look at the trace details page, and again it’s better to do this in a new browser tab.



Anomalous spike to single trace

So finally we’ve gone from anomalous spike to a single trace that clearly illustrates a sub-optimal design. These SOAP calls are happening over and over again. Maybe it needs to happen over and over again, maybe they can be batched, or maybe they’re retries that should have failed early on. But this trace should be enough to hand someone who is involved with this application and say ‘hey are you sure 207 times in one request is the right amount? Maybe we can do something different.’


Layers stop moving together

The next anomaly we’re going to look at is one that is similar but subtly different than a span that grows disproportionately compared to its neighbors. As we covered in the previous section, layers tend to move together, and the noteworthy bits can lie where they stop exhibiting that behavior. In this example, you can see that when the SOAP client is up, the database layer is up, and the spring render time is up. That makes sense: maybe the SOAP client returned a bunch of values that were used for multiple concurrent database requests, which were then rendered into a very complicated template.


Averages lie

Right around the circled area, the database and SOAP client are up but the spring doesn’t correspond. This is a good indicator that there’s something interesting happening. Following the 1, 2, 3 pattern laid out in the previous section, let’s filter down to database layer. Isolating that layer shows a a bit of a change, 40 milliseconds to 10 milliseconds, not that crazy.

Since we know that averages sometimes lie, let’s pull up the heatmap and see what’s going on there. On the heatmap we’ll look at 100% of calls rather than the default 95% so that we can see the outliers. Once that filter is applied, it’s pretty obvious that one single outlier caused an expansion of the database layer without a corresponding expansion of the spring layer underneath. At this point you can probably guess what we’re going to do next: draw a bounding box and move on to view traces.


Before we move on, something worth calling out is that is you might have been able to find those outliers without pre-filtering down to the database layer first. However, assuming that those outliers are the calls causing you grief doesn’t always work. Ultimately, the 1, 2, 3 workflow gets you more quickly to the calls you need to be looking at.

Something noteworthy to individual request

So just like in the previous section we’ve moved from something noteworthy down to one individual request. In this example, the request problematic because it has a 60-second call.


When you find requests that have a nice even time amount on them, it usually means you’ve hit a timeout somewhere. If we take a look at the query span details, it is pretty obvious that it was one specific database query. It’s not super convoluted, but whatever it was it pinned at 60 seconds and then finished. This is something you might not take to your application developer like we saw with the SOAP client, instead you might take this to a DBA and say ‘what is happening with the server, do we need some indexes, or better queries? Can you give us some help?’ Regardless, this would be the single artifact that comes from just noticing a variance and then drilling into that.


Variations in table data

1, 2, 3 is a good starting point. It can get you to those individual traces with those useful bits of data in them like RPC calls, database queries, cash operations, etc.—but not always. This next app will show you why, and how to use table data to spot trouble.

Traffic to the app in this example is very cyclic. It goes from having both layers move in unison to having one layer grow disproportionately. As usual we’ll start off by isolating the layer that grows disproportionately and pull up an individual trace in the area that’s growing. The trace shows disproportionate growth in the nginx layer, but really there isn’t a lot go on: we don’t have a query, cache operation, or anything else of that nature. So 1, 2, 3 isn’t panning out, now we need to look at other methods. Let’s clear all the filters and go back to the default view.


The real action is in the data tables

If you’ve read the other sections in this article, you might have noticed that we’ve spent most of our time up at the top of the app server page. This is something that users often do when they start using the product for the first time. Generally it’s because the graphs catch your eye and cause you to think this is where the action is. Meanwhile the data tables are tucked away below the fold. But there’s some interesting information down there too, for example call counts, average duration, and hosts names.
For the most part, the data tables reveal this example application to be pretty boring. It does no queries so there’s no sequel data, nor does it have any controller action data. One thing it does show however is that one host is taking quite a bit longer than the other host, on average. Understanding why this is significant requires some knowledge of the application: this app is basically two web worker nodes, sitting behind a load balancer, and the load balancer is designed to be 100% fair. But, because of the disproportionate amount of time spent on one worker node, the other one actually wound up servicing more requests.


Let’s open the same time slice for these two worker nodes and compare them. In doing that we see that the graphs are a lot different, even though these workers were meant to be identical. nginx is indeed growing disproportionately but only on one node; on the other one it’s growing pretty much in unison with the php layer.



Introducing host metrics

Okay, there are two nodes with two different performance profiles and obviously there is something that’s causing this difference. The next step, is to start adding some host metrics, things like load average, memory usage, etc. Specifically we’ll bring up cpu usage and load average for both workers, compare them side-by-side, and see what emerges.

Worth noting, when it’s the webserver layer that’s growing in disproportion, this is where you want to start thinking about performance—about cpu abusage, load average, things like that. Were it the app server layer growing disproportionately you should start to wonder about what’s going on in your business logic. The latter would be a really good candidate for custom instrumentation, which would get you a little bit more visibility into what is happening in that layer.

You can see that load average on web1 is pretty good. In case you’re unfamiliar, load average is a measure of how many processes are waiting to get access to the cpu. Any number less than one basically means no one is waiting and everyone gets the cpu when they want it. Over one means there is lots of queueing going on. Based on what we see with these nodes, the first thing that should come to mind is that the requests are waiting for access to cpu. What has happened is that web2 is under provisioned. Alternatively, there are just more requests happening than there are workers to execute them.

The interesting thing to note is that the layer that grows disproportionately is the webserver layer, but the problem is not the webserver. The problem is a lack of resources to execute the inbound requests. The growth is happening in nginx is because nginx is sitting there holding the requests just spinning, waiting for someone to come along and do the work. It’s still noteworthy that we see disproportionate growth, just remember that the problem might not be with with layer that grows disproportionately.