As a web support engineer, I have used many different monitoring applications. When I started out, open source tools such as Zabbix and Nagios were the most common.
The reasons, apart from being “free”, were their widespread use and the support communities that contributed numerous plugins and documentation. But everything has a price: maintaining and developing this type of system, especially in large and varied environments, is expensive in both workforce and hardware.
Recently at Enimbos we started working with New Relic, a very complete real-time SaaS monitoring system. The following features stand out at first glance:
- Ease of use: it is installed and configured in about 15 minutes, the interface is quite intuitive and there is a lot of documentation. It also has a very active community.
- Almost non-existent overhead on the system: even with the standard instrumentation, the overhead is minimal. It also has a mechanism called circuit breaker that protects the system under stress.
- The amount of data collected and the tools that facilitate the analysis of said data.
The advantages of the SaaS model are also evident:
- The cost of maintenance is practically zero, unlike on-premise tools: you do not have to deal with hardware, patches or updates.
- You can use the same tool wherever your application runs.
- Scalability and speed for the analysis of the data collected by the monitors.
We mainly use the APM module and Insights dashboards, although Browser and Synthetics are also very useful for web diagnostics.
APM has helped us a lot with the most feared incident that can be reported on the web: “the application is slow”. For a systems engineer this is the Achilles heel. The usual symptoms:
- high memory consumption
- high CPU consumption
- both
- or neither…
In the end the decision to restart is made and, although the situation improves, you cannot relax because it will happen again if the underlying problem has not been solved.
Although in 90% of situations these symptoms are due to a bottleneck or a leak in the code, these incidents usually remain in the systems engineering court, mainly because without further data developers cannot locate the problem. In short, these problems may become entrenched over time.
Usually you end up taking palliative measures such as periodic restarts, and only after laborious analysis of thread dumps and heap dumps can you guess where things went wrong.
With New Relic this does not happen, since it makes the analysis much easier by providing information about what is happening in real time. Let’s see it with an example:
Let’s suppose that you get warnings of slowness in a web application: the first step would be to go to the Overview screen, where a summary of the general performance of the application is offered. This screen serves to guide us in the right direction but not to diagnose specific problems:
Here we find information on web response time, Apdex, throughput, transaction response time, error percentage and metrics of the Java virtual machine.
If any of these sections shows something out of the ordinary, simply clicking on it takes us to a page with more detailed data on that aspect of the application.
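For readers unfamiliar with Apdex, the score on the Overview screen can be approximated with the standard Apdex formula: requests at or below a threshold T count as satisfied, those up to 4T count as tolerating (half weight), and the rest as frustrated. The following is a minimal sketch of that formula only; the function name, the sample response times and the 0.5-second threshold are assumptions chosen for illustration, and it does not reproduce how New Relic computes the value internally.

```python
def apdex(response_times, threshold):
    """Return the Apdex score (0.0 to 1.0) for response times in seconds.

    satisfied:  t <= threshold
    tolerating: threshold < t <= 4 * threshold (counts as half)
    frustrated: everything slower (counts as zero)
    """
    if not response_times:
        raise ValueError("at least one response time is required")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical sample: two fast requests, two tolerable ones, one very slow one.
samples = [0.2, 0.4, 0.6, 1.5, 3.0]
print(round(apdex(samples, threshold=0.5), 2))  # 0.6
```

A score close to 1 means almost every request finished within the threshold, so a sudden drop is a good hint that it is worth clicking through to the transaction details.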
In this scenario I will focus on transaction analysis, because it has been especially useful for this type of incident. On the Overview screen:
As you can see, some transactions are taking more than a minute. Clicking directly on one takes us to a screen with all the details of that particular transaction:
The amount of detail is very high: we can see which hosts it ran on, how long it took and a list of its components ordered by execution time. In our case the heaviest appears to be a SELECT against an Oracle database. There are also two more tabs where we can see detailed traces of the execution and, yes, the queries that caused the slowness:
As you can see, in a matter of minutes we have discovered exactly what the problem is, and we can simply send the information to our development colleagues, who will surely put it to good use.
Some at this point will accuse me of having selected a very simple case… Challenge accepted… let’s try a harder one:
Incoming incident: the application is extraordinarily slow and traditional monitoring shows only an increase in CPU. Following the traditional debugging procedure, we would perform the following steps:
- Take a Java thread dump (to obtain the stack traces).
- At the same time, list CPU consumption per thread of the Linux process.
- Convert the Linux thread IDs (TIDs) to hexadecimal to match Java threads with system threads.
- Analyze the stack traces of the threads that consume the most CPU to find out which part of the code is responsible.
- Perform load tests on that part of the code to analyze its performance.
- Etc.
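To give an idea of how tedious the manual route is, the matching step between system threads and Java threads can be sketched as follows. This is only an illustration: it assumes the thread dump is in the usual `jstack` text format, where each thread header carries a hexadecimal `nid=0x...` field, and that the per-thread CPU figures have already been collected (for example from `top -H`) into a dictionary keyed by decimal TID. The sample dump, the TIDs and the function names are all hypothetical.

```python
import re

# Header line of a thread in a jstack-style dump, for example:
# "http-nio-8080-exec-7" #42 daemon prio=5 ... tid=0x... nid=0x1a2b runnable
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)".*\bnid=(?P<nid>0x[0-9a-fA-F]+)')


def threads_by_nid(dump_text):
    """Map each native thread id (as a decimal int) to its Java thread name."""
    names = {}
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            names[int(match.group("nid"), 16)] = match.group("name")
    return names


def hottest_java_threads(cpu_by_tid, dump_text, limit=3):
    """Return (tid_in_hex, cpu_percent, java_thread_name) for the busiest threads."""
    names = threads_by_nid(dump_text)
    ranked = sorted(cpu_by_tid.items(), key=lambda item: item[1], reverse=True)
    return [(hex(tid), cpu, names.get(tid, "<not in dump>")) for tid, cpu in ranked[:limit]]


# Hypothetical data: a fragment of a thread dump and CPU usage per Linux TID.
SAMPLE_DUMP = '''\
"http-nio-8080-exec-7" #42 daemon prio=5 os_prio=0 tid=0x00007f3c2c01a000 nid=0x1a2b runnable [0x00007f3c1b2f1000]
   java.lang.Thread.State: RUNNABLE
        at com.example.ReportService.build(ReportService.java:87)

"GC Thread#0" os_prio=0 tid=0x00007f3c2c02b800 nid=0x1a10 runnable
'''
SAMPLE_CPU = {6699: 93.5, 6672: 4.0, 6700: 0.3}

for tid_hex, cpu, name in hottest_java_threads(SAMPLE_CPU, SAMPLE_DUMP, limit=2):
    print(tid_hex, cpu, name)
```

Even with a helper like this, you still have to capture the dump and the CPU listing at the same moment, read the stack traces by hand and repeat the exercise several times before a pattern emerges.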
Analysis of the problem with New Relic: as in the previous case, we see that certain transactions are taking longer:
Let’s look in more detail at the slowest one in the Transactions tab:
More screens follow. At first glance we can confirm that, although the individual queries are fast, there are far too many of them:
The problem at this point is clear: some transaction makes numerous concurrent calls to the database, and that is what is consuming the CPU. The solution, therefore, is to optimize that part of the code.
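The pattern described here, many fast queries rather than one slow one, is easy to miss if you only look at individual query times. As a rough sketch of the reasoning, assuming you had a list of (transaction, query duration) samples exported from any source, you could group them and look at call counts next to averages. The transaction names, the figures and the `max_calls` limit below are invented for illustration and are not taken from New Relic.

```python
from collections import defaultdict


def summarize_queries(samples):
    """Group (transaction, duration_ms) samples into per-transaction statistics."""
    grouped = defaultdict(list)
    for transaction, duration_ms in samples:
        grouped[transaction].append(duration_ms)
    return {
        transaction: {
            "calls": len(durations),
            "avg_ms": sum(durations) / len(durations),
            "total_ms": sum(durations),
        }
        for transaction, durations in grouped.items()
    }


def chatty_transactions(summary, max_calls=100):
    """Return the transactions that issue more than max_calls queries."""
    return sorted(name for name, stats in summary.items() if stats["calls"] > max_calls)


# Hypothetical samples: /report fires 500 queries of 2 ms, /login fires 3 of 40 ms.
samples = [("/report", 2.0)] * 500 + [("/login", 40.0)] * 3
summary = summarize_queries(samples)

print(summary["/report"])            # every query is fast, but there are 500 of them
print(chatty_transactions(summary))  # ['/report']
```

Here each `/report` query takes only 2 ms, yet together they add up to a full second of database time, which is the kind of picture the transaction screens reveal.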
Obviously, not only have we saved time in the diagnosis (weeks, in fact), but the result is also more accurate than with traditional methods.
To see it even more clearly and confirm the problem we can use another module, Insights. It collects data from the other modules, such as APM and Browser, and presents it as graphs so that we can relate the data and analyze it in more depth.
In our case we are going to look at the graph of the number of calls to the database:
The problem is visible at a glance: an increase in concurrent calls to the database. After seeing this we decided to add the graph to our application’s dashboard so we can keep an eye on the newly discovered problem. Another way to stay aware is to mark this transaction as a key transaction, which lets us follow it closely and analyze it in more depth:
It is an overview, but only of that particular transaction, so you can isolate the part you want to study and even compare the data with that of previous days.
Finally, we can use these metrics to configure alerts so New Relic warns us in case it happens again.
Lastly, this is just the tip of the iceberg. I have left out topics such as the database screen, error analysis and reports, but this post is already very long. We will continue sharing our experiences and lessons learned in the next post.