At this point you or someone in your organization has deployed TraceView in a production environment, and the data is steadily coming into TraceView. If you’re new to TraceView, the density of the information presented can be overwhelming. Before you can use TraceView to identify performance issues, you’ll need to know how to read and filter the incoming data.
The app server page is the highest level view of a single app, and it’s your starting point for identifying performance issues and drilling down on the component or code causing those issues.
Total and traced requests
Below the layer breakdown are two bar charts. The time slice for each is 30 seconds—hovering over two adjacent bars will quickly confirm that, plus the number of requests during that period.
The top chart, ‘total’, is straightforward enough: if a request hits your endpoint, and instrumentation is configured to trace it (which by default it is), that request is counted. Having a total request count helps you correlate load with application latency and also identify resources that struggle to keep up.
The lower chart, ‘traced’, is a little more involved. As you might guess, not every inbound request is traced. As in any scientific inquiry, we don’t need to trace every single request to determine the health of your app: a representative sample is sufficient and keeps overhead to a minimum. Whether an inbound request is traced or not is governed by an algorithm we call ‘smart tracing’. Smart tracing dynamically adjusts the number of requests traced as application load changes. It does this as follows:
- Determine whether or not a request is eligible for tracing—with out-of-box instrumentation, all requests are eligible.
- Subject it to the sample rate, which by default is 100%.
- If the request is both eligible and sampled, subject it to random selection based on a token bucket algorithm.
For applications with low traffic volume and good system capacity, smart tracing will examine every request. As application load increases, smart tracing adjusts the fraction of requests traced to the minimum required to accurately represent the actual call volume. The end result is that you can expect to see well under 1% overhead while having statistically significant data, 100% backed by traces.
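The three steps above can be sketched roughly as follows. This is a minimal illustration and not TraceView’s actual implementation: the `TokenBucket` class, the `should_trace` function, and the capacity and refill values are all assumptions made for the example, and the bucket here is a plain deterministic one.

```python
import random


class TokenBucket:
    """Generic token bucket: holds up to `capacity` tokens, refilled at `rate` per second."""

    def __init__(self, capacity, rate, now=0.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = now

    def consume(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def should_trace(eligible, sample_rate, bucket, now, rng=random.random):
    """Hypothetical decision function mirroring the three steps above."""
    if not eligible:              # step 1: eligibility
        return False
    if rng() >= sample_rate:      # step 2: sample rate (1.0 means 100%)
        return False
    return bucket.consume(now)    # step 3: the bucket caps traces under load
```

With low traffic the bucket never runs dry, so every request is traced; under heavy load only as many requests as the refill rate allows are traced.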
Total requests isn’t available on the default app: it is only available for entry layers, that is, the layer created by the component that started the trace. Individual urls, layers, filtered views, etc. won’t have it.
In the life of a trace, points of interest are declared as ‘events’; examples include a start-trace event for when a request enters your stack, and an end-trace event when the request has been serviced. There are other types of events along the way, and a single trace may contain tens or even thousands of events. All of those events are logically grouped according to the application component in which they occurred for the single purpose of exposing how much time was spent in each while servicing a request.
You’ll see separate layers for high-level, ubiquitous components like webserver and database. But application code is almost always composed of discrete units of code, like libraries, frameworks, and modules, which our instrumentation understands and will map to layers as well. You’ll see the distinction between high-level and code-level layer types in the legend. There are dedicated layer types and icons for the webserver, database, cache, language, etc., while code-level layers are designated ‘generic’ and all have the same gearset icon.
The layer breakdown chart displays at 30 second granularity by default. You can confirm this by hovering over a few adjacent data points and checking out the timestamp. The tooltip will also have a latency value and a count. That count is not the number of traces, but rather the number of times, among all of the traces during the previous 30 seconds, that the layer was hit; the latency for that layer then is the average amount of time spent in that layer over the course of those hits. Take a second to find any data point in the top layer and pull it straight down, examining how the count value changes at each layer. Then pull it all the way down to the traced bar chart and examine the number of traces during that same time slice.
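To make the count and latency values in the tooltip concrete, here is a small sketch of how such numbers could be derived. It assumes each layer hit is available as a `(timestamp, layer, duration)` tuple; that input shape and the `layer_breakdown` name are hypothetical, not part of TraceView.

```python
from collections import defaultdict


def layer_breakdown(layer_hits, slice_seconds=30):
    """Group (timestamp, layer, duration) hits into time slices.

    Returns {(slice_start, layer): (count, average_duration)}.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for timestamp, layer, duration in layer_hits:
        slice_start = int(timestamp // slice_seconds) * slice_seconds
        entry = totals[(slice_start, layer)]
        entry[0] += 1          # number of times the layer was hit
        entry[1] += duration   # total time spent in the layer
    return {key: (count, total / count) for key, (count, total) in totals.items()}
```

Note that the count is per layer hit, so a single trace that queries the database twice contributes two to the database layer's count.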
While the layer breakdown shows how each application component contributes to average latency, the heatmap shows how each request contributes to it. Color is used to indicate the number of requests at each latency level; ideally you want that dark blue band as close to zero as possible. The heatmap has a few more controls than the layer breakdown chart:
- Every single trace is represented, but by default only the fastest 95% are shown; click ‘show all calls’ to display the outliers.
- The percentile filter only toggles between 95 and 100%, but you get much finer control via the adjacent custom range selector. A few notes here: one, return to the default view by clicking ‘clear custom range’ at the top of the dialog box, or just click ‘show 95th percentile’; two, the button label shows the maximum value of the current range, which is convenient for seeing the exact latency that corresponds to the 95th and 100th percentiles.
- Click-and-drag a box around interesting traces. The filters at the bottom of the page will update, and the ‘view traces’ tab will show you how many traces you snagged.
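The percentile filter described above amounts to hiding calls slower than a cutoff latency. A minimal sketch, assuming a simple nearest-rank percentile (TraceView’s exact method isn’t stated here) and hypothetical function names:

```python
import math


def percentile_cutoff(durations, percentile=95):
    """Nearest-rank percentile: the latency at or below which `percentile`% of calls fall."""
    if not durations:
        raise ValueError("no durations")
    ordered = sorted(durations)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def visible_calls(durations, percentile=95):
    """Calls that remain on screen once the slowest outliers are hidden."""
    cutoff = percentile_cutoff(durations, percentile)
    return [d for d in durations if d <= cutoff]
```

The cutoff value plays the same role as the maximum shown on the button label: it is the exact latency that corresponds to the chosen percentile.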
More on our blog: This post explains how heatmaps reveal information that is obscured by normal timeseries charts.
TraceView allows you to load multiple additional host-related timeseries charts under the layer breakdown and heatmap. The charts that you can load here are the same ones that are available on the hosts page, which is accessible through the left-side navigation panel. The difference is that via this facility, metrics can be displayed in the context of the applications they serve. This is crucial because when latency spikes, the first fork in the road will be whether the host is causing your application grief, or the other way around, and you don’t want to have to toggle back and forth between pages to correlate total traced with average load, etc.
More on our blog: This post has more about how TraceView places host metrics in context.
TraceView gives you access to our API to annotate the layer breakdown and heatmap. Use it to compare performance before and after new code deployments, database upgrades, or any other operational event that might affect performance. On Linux hosts, there’s a script called ‘tlog’ which makes it easy to record events; on Windows we offer the same ease of use through PowerShell. Regardless of platform, you can access this function directly via the API.
- tlog. tlog is a script available on Linux hosts. It’s automatically installed during TraceView deployment, so you can run it wherever you are tracing. Try:
- REST API. You can also record events directly via the log_message method in our REST API. There’s a single endpoint and a few parameters. The time parameter is optional but highly encouraged to ensure that annotation times line up with the times reported by instrumentation on your servers.
- PowerShell. For Windows applications, you can annotate your graphs using PowerShell. Read about it on our blog.
App server filters
Filters help you focus on a set of traces matching particular criteria. You can take advantage of most of them right away with only stock instrumentation. But custom filters will require customization, and apdex groups will require some additional configuration via the TraceView console. Apart from those last two, all of your filter widgets come with several interactive components, and it’s worth pointing them out independent of any filter criteria.
- Sort. Float interesting traces to the top. ‘freq x avg. duration’ sorts by the product of frequency and average duration, which will float common traces with high latency.
- Filter search. A magnifying glass in the top left of a table means its contents are searchable. Click it to bring up an auto-completing search box that’ll help you find the domain, url, controller, or action you’re interested in. Selecting one of the results will apply the filter.
- Row toggle. Every row can be expanded to show max, min, and standard deviation.
- Show 10 20 40. The caveat here is that while there might be more than 40 rows, only 40 can be shown. That’s okay because you’ll either be looking for something interesting or something specific. In case of the former, use the sorting facility. For the latter, if you know at least some portion of the entry you’re looking for, you should be able to find it via the search auto-complete.
- Export data. Click the down arrow in the bottom left to export the table as .csv.
- Table toggle. Just to make sure every clickable related to the filter tables is listed, note that you can collapse the tables via the caret at the top left.
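As an illustration of the ‘freq x avg. duration’ sort and the per-row statistics, here is a minimal sketch. It assumes you have exported or otherwise collected a list of call durations per table entry; the `summarize` function and the input shape are hypothetical.

```python
import statistics


def summarize(rows):
    """rows maps a table entry (url, controller, etc.) to its list of call durations."""
    table = []
    for name, durations in rows.items():
        table.append({
            "name": name,
            "count": len(durations),
            "avg": statistics.mean(durations),
            "min": min(durations),
            "max": max(durations),
            "stddev": statistics.pstdev(durations),
        })
    # 'freq x avg. duration': common entries with high latency float to the top.
    table.sort(key=lambda row: row["count"] * row["avg"], reverse=True)
    return table
```

Sorting by the product rather than by average alone keeps a rare but very slow entry from outranking a moderately slow entry that is hit constantly.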
Any entry that highlights when you hover over it can be used as a filter; click to apply it. The data on the page will reload and you can see the newly applied filter labels at the top of the page. You’ll need to clear some or all of the filter labels before you can apply another filter at the same level.
Filter by domain or url. Domain filtering is useful when you’re running several different sites on the same host and want to view their performance data separately. Click on the parameter you want to filter by. The data on the page refreshes, and the filter labels indicating the currently applied filters are shown at the top of the page.
The hosts table shows hosts that are explicitly assigned to the application, have reported host metrics within the selected time period, or traced a request within the selected time range.
In MVC frameworks there’s the notion of controllers and actions; if you’re unfamiliar watch this and read this. The gist is that in an MVC framework multiple urls can map to a single endpoint in the application code. For example, all urls following the form /user/* might be served by a single piece of code that generates user profile pages. It might be helpful in this case to see the performance of the user profile page across all users, which is why TraceView enables you to filter based on the underlying implementation.
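The idea of many urls mapping to one controller/action can be sketched as below. The route table, the `resolve` helper, and the controller and action names are made up for the example; real frameworks have their own routing mechanisms.

```python
import re
from collections import defaultdict

# Hypothetical routes: url pattern -> (controller, action)
ROUTES = [
    (re.compile(r"^/user/[^/]+$"), ("user", "profile")),
    (re.compile(r"^/$"), ("home", "index")),
]


def resolve(url):
    """Return the (controller, action) serving a url, or None if no route matches."""
    for pattern, endpoint in ROUTES:
        if pattern.match(url):
            return endpoint
    return None


def average_by_endpoint(requests):
    """requests: iterable of (url, duration). Returns {(controller, action): average}."""
    grouped = defaultdict(list)
    for url, duration in requests:
        grouped[resolve(url)].append(duration)
    return {endpoint: sum(d) / len(d) for endpoint, d in grouped.items()}
```

Here `/user/alice` and `/user/bob` are different urls but land in the same group, which is what filtering by controller and action gives you.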
If you’re running Django, Pylons, or Rails, you’ll get controller/action instrumentation out of the box. TraceView supports a number of other frameworks, though they might require installing a framework-specific instrumentation module. Even if we don’t support your framework you can still add controller/action instrumentation using our custom instrumentation API. If your framework doesn’t have a conventional notion of controllers and actions, you can still take advantage of this filtering capability: just put values into controller and action that make sense to you. For example, in PHP instrumentation the Drupal module uses controllers and actions to represent different Drupal menu items.
Before you get into layer filtering, you should note that the app server page doesn’t always show traces. Notice that the heatmap legend and the view traces tab say ‘calls’. Call is a generic term indicating that some application entity—a component, endpoint, layer, etc.— has been accessed. In the case of urls, call is synonymous with ‘request’, which is synonymous with ‘trace’ most of the time, with the exception of ‘total requests’. When it comes to layers, a call refers to a trace descending into a layer, which could be multiple times during a trace. In the layer filter table, this is what ‘# calls’ indicates, i.e., the number of descents into the layer among all of the traces in the filtered time range. This is easiest to see when you filter for the database layer, because the number of calls to this layer is usually far greater than the number of traces.
Filters are a way to find all of the traces that have some particular attribute in common. TraceView already provides a way to find all of the traces that have a common request domain, host, and database query. But what if you want to find traces that have some other attribute in common? This is what partitions are for; just think of them as custom filters.
The best use for partitions is to identify traces that pass through a particular section of code. Assigning a trace to a partition involves customizing your instrumentation. The gist is that you place a partition flag in the section of code you’re interested in, along with the name of the partition to which the trace should belong. Then you can come back to the app server page and limit the display to just the traces in a particular partition. Remember too that filters can be layered, so for example you can limit traces by host, then by url, and then again by partition if you want to. Traces are not partitioned by default.
The effect of end-user page load latency on conversions, perceived user experience, bounce rates, etc. is well-documented. Total latency is determined by three main components: the client’s browser, the network, and the servers responsible for providing the data. While often the most pernicious and difficult to optimize performance problems occur on the server side, most of the total latency actually occurs in your end users’ browsers! The ratio of time spent in the client’s browser fetching and rendering resources versus server-side computation can be as high as 80/20. It’s crucial then to understand the full end-to-end pageload experience from the perspective of real users around the world.
To set up RUM data collection for your app you’ll either need to manually insert two scripts into your page templates, or enable auto-rum, which allows your instrumented web server to inject them automatically.
The end user page, accessible via the left-side navigation in TraceView, shows the results of the data collected by our RUM js. Like the app server page, you can use either the visualizations or the table filters to drill down into a subset of transactions and their individual traces.
How much time was spent running your application code—shown on the app server page—isn’t the only form of latency. In fact latency is introduced at several points between the time a browser request is issued, and when the page is fully loaded. This chart shows the other points at which latency is introduced, and the relative size of those latencies.
- Network time
- The time it takes for the browser to establish a connection with the server and for the server to transmit data back to the browser. This is determined using the browser’s navigation timing API: basically whenever the user attempts to load a new page, the browser fires an event. Our RUM js simply calls the API for that timestamp and reports it back to TraceView.
- Server time
- Since all RUM data is backed up by traces, TraceView knows how much time was spent running the application code before the requested resource is sent off to the browser.
- DOM processing
- The time it takes for the browser to assemble the DOM; it corresponds to the jquery document.ready() callback. That reference is for coders; if it’s unfamiliar to you, don’t worry about it. The start of DOM processing is marked by execution of _header.js and the end is marked by the execution of _footer.js. These are the two RUM scripts that you added to your page templates to collect timing data.
- Page rendering
- The time between when the DOM has been assembled and the browser fires the window.onload event, which signals that the html, CSS, images, and all other resources have been loaded and rendered.
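The four components above can be pictured as slices of a single timeline. The sketch below is only an approximation under stated assumptions: the parameter names are hypothetical, all values are in milliseconds, and network time is taken to be whatever precedes DOM processing that isn’t accounted for by server time.

```python
def rum_breakdown(navigation_start, header_js, footer_js, onload, server_ms):
    """Split a page load into the four latency components (all values in ms).

    navigation_start: when the browser began loading the page
    header_js / footer_js: when the two RUM scripts executed
    onload: when window.onload fired
    server_ms: server time reported by the corresponding trace
    """
    before_dom = header_js - navigation_start
    return {
        # Assumption: time before DOM processing that wasn't server time is network time.
        "network": max(0, before_dom - server_ms),
        "server": server_ms,
        "dom_processing": footer_js - header_js,
        "page_rendering": onload - footer_js,
    }
```

For a page where the header script runs at 300 ms, the footer script at 900 ms, and onload fires at 1500 ms, with 120 ms of server time, most of the latency sits in the browser rather than on the server.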
If you find that the bulk of latency is in the network it would be a good time to choose infrastructure and template changes over code optimizations. For example, add servers closer to low bandwidth regions and move assets to a CDN. You could also decide to simply send less data to some domains by disabling content. If you find that DOM processing and page rendering are too slow you might try prefetching or browser caching. See this blog post for more ideas on how to improve page load times by reducing the client-side burden.
Your users will experience your app differently based on where they are for a few reasons. One is that geography impacts network latency. Some countries like South Korea and Japan have relatively reliable, high bandwidth cellular and fixed line networks, other countries less so. In lower bandwidth regions network latency is going to be higher. Similarly, the greater the physical distance between the user and the app server, the more internet that exists between the two, which ultimately translates to more network latency. Finally, you’re likely serving up different content for each regional domain, which means that your application code is going to be different. In this case it’s going to be helpful to attribute performance differences to differences on the server side, namely in application paths taken.
This visual shows you apdex score by country; for the US and Canada you can drill down to state/province. Clicking on a region applies the same filter as clicking a row in the geographic regions table. When comparing regions, you might find it helpful to open each region in a separate browser tab.
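For reference, an apdex score is conventionally computed from a target threshold T: calls at or under T count as satisfied, calls between T and 4T count as tolerating (at half weight), and the rest count as frustrated. A minimal sketch of a per-region score, with hypothetical function names and an input shape assumed for the example:

```python
from collections import defaultdict


def apdex(durations, threshold):
    """Standard apdex: (satisfied + tolerating / 2) / total, or None with no data."""
    if not durations:
        return None
    satisfied = sum(1 for d in durations if d <= threshold)
    tolerating = sum(1 for d in durations if threshold < d <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(durations)


def apdex_by_region(page_views, threshold):
    """page_views: iterable of (region, duration in seconds)."""
    by_region = defaultdict(list)
    for region, duration in page_views:
        by_region[region].append(duration)
    return {region: apdex(d, threshold) for region, d in by_region.items()}
```

A score of 1.0 means every page view in the region met the threshold; lower scores point to regions worth drilling into.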
Examine the difference in network latency between two regions, and if you find it to be excessive you can mitigate as described in average latency. If you find that the performance differences are attributable to the application code, you’ll want to move on to the page views tab where you can drill down to individual traces based on the filters you’ve applied.
Not all browsers are created equal. In fact you’ll find an entire website, caniuse.com, dedicated to enumerating the differences between browsers. These differences can impact application performance when developers are forced to make functionality cross-browser and backwards compatible. Add to that the substantially different requirements for achieving snappy response times on mobile.
Creating multiple variants of the thing you’re serving up to accommodate all of these use cases amounts to multiple application paths, and ultimately the potential for different application performance. Despite any variances in the application code, most of the impact will be on the client side, as browser-related design considerations usually involve the way assets are acquired and loaded.
In short, keep an eye out for disparity between browsers. If present, use the visual or the browser table to filter to the type/version of interest, and then drill down on those individual traces from the page views tab.
The overview map is a high-level visualization of the apps in your account and how they interact. We automatically detect resources that are used by two or more applications, and pull them out as their own nodes on the map. This allows you to quickly identify possible bottlenecks in your system. The overview map is updated in real-time and integrated with alerting so that you have one single, comprehensive view that can be used for both continuous monitoring and incident response.
The map is available out-of-box with no additional configuration required. It automatically discovers all of your components and updates as your app changes and scales.
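The notion of pulling out resources used by two or more applications can be illustrated with a short sketch. This is not how TraceView discovers components; it simply assumes you already have a mapping from each app to the resources it calls, and the names below are hypothetical.

```python
from collections import defaultdict


def shared_resources(app_resources):
    """app_resources maps an app name to the set of resources it calls.

    Returns {resource: sorted list of apps} for resources used by two or more apps.
    """
    users = defaultdict(set)
    for app, resources in app_resources.items():
        for resource in resources:
            users[resource].add(app)
    return {r: sorted(apps) for r, apps in users.items() if len(apps) >= 2}
```

A resource that shows up in the result is a candidate bottleneck, since trouble there can affect every app that depends on it.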
A tool for DevOps plus everyone else
DevOps engineers are tasked with ensuring the operation of business-critical applications. They’re the front line of APM: they’re the first ones notified of issues, and they’re the ones responsible for restoring service as quickly as possible. Yet, DevOps isn’t likely to be familiar with all aspects of the code base. This is where the map comes in handy. It provides a clear and current picture of the dependencies between components, the dependencies between apps, and the resources shared between apps. It also provides insight into the upstream and downstream systems that are likely to be impacted by, or might be the underlying cause of, troubled applications. By extension, it’ll also help you focus robustness and scalability efforts when TraceView isn’t alerting.
Developers on the other hand might have intimate knowledge of the specific apps they work on, but little or no knowledge about the other systems they interact with in production. The overview map solves this problem at several points: when a developer needs to ramp up on an unfamiliar component for the first time, but also when a developer is returning to an area previously worked on. Remember that your app is constantly evolving. Anyone who was at one time familiar with the application environment will at some point come back to find that things have changed. Conveniently, the overview map automatically discovers and updates, so there’s always an up-to-date reference available.
The point is—regardless of role—in large organizations it is not realistic for any one person to have mastery of all aspects of the application, and that’s why the overview map will quickly prove essential across your engineering team.
The map as a central dashboard
The overview map is updated in real-time and is integrated with TraceView alerting. Keeping it open all the time will help you verify redundancy and auto-scaling implementations, and significantly improve your time to resolution when alerts trigger. How the latter bit happens is largely via the map’s impact on the human element of DevOps.
When alerts trigger, the first thing that’ll happen is that you’ll have the opportunity to observe exactly which components are affected, and quickly drill down into the underlying hosts or layers. You’ll also have a better idea about how individual alerts are related. The overview map is a huge advantage when you receive multiple alert emails and have to quickly put the pieces together.
The second thing that’ll happen is that you’ll be better informed about how to prioritize alerts, and when to escalate issues if necessary. If other team members do need to be brought in, everyone will be using the same up-to-date reference, so there’s better coordination and less time spent catching people up.
Third, you’ll be able to keep a closer eye on upstream and downstream components in case issues cascade. You’ll be able to quickly jump on those issues to try to contain them, and at least as important you’ll gain an understanding of not just what components are related, but how they’re related.