APM, Logging, Monitoring – three legs of a stool or redundant tools waiting to be consolidated?

Gartner’s Jonah Kowall recently published a research note on the size of the Network Performance Monitoring and Diagnostics (NPMD) and Application Performance Monitoring (APM) markets.

“At an estimated $1 billion, the NPMD market is a fast-growing segment of the larger network management space ($1.8 billion in 2012), and overlaps slightly with aspects of the application performance monitoring (APM) space ($2 billion in 2012).”

http://blogs.gartner.com/jonah-kowall/2014/03/09/new-magic-quadrant-network-performance-monitoring-and-diagnostics/

Clearly, the APM market is large and growing, yet there are many misconceptions about APM. One of the diagrams he published provides insight into how the APM toolset is positioned.

http://blogs.gartner.com/jonah-kowall/2014/03/09/the-three-topologies-in-applications-and-infrastructure/

The marketplace is clearly evolving with the rapid rise of tools like AppDynamics, New Relic, Boundary, and many others. It is interesting to see that APM tools are still maturing and have limitations, such as the inability to work in an agentless environment and the inability to consume and correlate log data for tracing and tracking transactions and errors. These gaps drive the need for multiple solutions within the IT domain, as expressed in the diagram above. However, with the advent of real-time event processing using technologies like Apache Flume, HDFS/Hadoop, and stream processors like Spark (or AWS Kinesis), can’t we have a single solution for monitoring the entire stack? Does one really need different technologies for monitoring different layers of the stack and for satisfying the various user constituencies? Or will unified and integrated tools like Suro from Netflix become the blueprint for future architectures?
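To make the idea concrete, here is a minimal sketch, in plain Python rather than Flume or Spark, of what "consume and correlate" could look like if every layer of the stack emitted events tagged with a shared transaction identifier. The event format and field names (source, txn_id, latency_ms) are purely hypothetical.

import json
from collections import defaultdict

# Hypothetical unified event stream: APM metrics, application log lines, and
# network events all arrive as JSON records tagged with a transaction id.
events = [
    '{"source": "apm", "txn_id": "t-42", "latency_ms": 912}',
    '{"source": "app", "txn_id": "t-42", "level": "ERROR", "msg": "payment gateway timeout"}',
    '{"source": "net", "txn_id": "t-42", "retransmits": 7}',
    '{"source": "apm", "txn_id": "t-43", "latency_ms": 45}',
]

# Group every record that belongs to the same transaction, regardless of
# which layer of the stack produced it.
by_txn = defaultdict(list)
for raw in events:
    record = json.loads(raw)
    by_txn[record["txn_id"]].append(record)

# Flag slow transactions and print the correlated evidence from every layer.
for txn_id, records in by_txn.items():
    if any(r.get("latency_ms", 0) > 500 for r in records):
        print("Slow transaction", txn_id)
        for r in records:
            print("  ", r)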


Network Performance Monitoring and Diagnostics – Security Onion and Snorby

Gartner’s Jonah Kowall just released new research in the area of Network Performance Monitoring and Diagnostics: “At an estimated $1 billion, the NPMD market is a fast-growing segment of the larger network management space ($1.8 billion in 2012), and overlaps slightly with aspects of the application performance monitoring (APM) space ($2 billion in 2012).”

Read more on his blog site by clicking the link below.

http://blogs.gartner.com/jonah-kowall/2014/03/09/new-magic-quadrant-network-performance-monitoring-and-diagnostics/

For those of you who are interested in open source options, you may want to look at Security Onion and Snorby. Security Onion (SO) is a Linux distribution for intrusion detection (IDS) and Network Security Monitoring (NSM). It is based on Xubuntu 10.04 and contains Snort®, Suricata, Sguil, Snorby, Squert, argus, Xplico, tcpreplay, scapy, hping, and many other security tools. Click on the link below to learn more about Security Onion.

http://blog.securityonion.net/p/securityonion.html

Snorby is another interesting project that provides a Ruby on Rails front-end application that integrates closely with the various NSM tools.

On a related note, Jonah published an excellent report on APM use cases. Read more on his blog.

http://blogs.gartner.com/jonah-kowall/2014/03/09/the-three-topologies-in-applications-and-infrastructure/

Is Logging for Machines or for Humans? Is Structured Logging the Right Approach?

The ability to monitor and log application operations data is an increasingly challenging problem for system owners. This is driven in large part by highly distributed, scale-out architectures that use clustering, load balancing, and, increasingly, cloud computing elements with many moving parts. The range and diversity of logging formats makes it very difficult to collect and act upon this data stream to support application reliability, troubleshooting, and performance testing/management use cases. A fundamental issue with logging is the subsequent need for parsing before the data can be analyzed. The irony is that most applications tend to generate log and error information for human consumption, but in reality that data must be consumed and acted upon by machines!

Here is a very informative blog post on structured logging with some practical tips; a snippet is included below.

Logging for Humans or Machines

I would argue that every log message is intended to be interpreted by either a human or a machine.

If a log message is only going to be used to help a human debug a system or casually see what’s going on, it’s intended for humans. If a log message is going to be used for machine analysis later (e.g. monitoring, alerting, business analytics, security auditing, etc), it’s intended for machines.

This is an important distinction. If it isn’t clear, please re-read the prior paragraphs. Now might also be a good time for you to audit the logs in your life and assign human or machine labels to them.

The Problem

The problem with logging as most of us know it is that it is optimized for human consumption, even if machines are the intended consumer.

At the heart of most logging systems we log a string, a char *, an array of characters that can easily be read by humans. For example, a simple web application might produce log files containing lines like:

2012:11:24T17:32:23.3435 INFO gps@mozilla.com successfully logged in

Human readability is a great feature to have. However, it comes at a price: more difficult machine consumption.

By optimizing for human consumption, we’ve introduced a new problem: log parsing.

In the above example log message, what the program has done is combined a few distinct fields of data (an event - logged in, the username, and the time) into a single text string. This is great for humans glancing at a log file. But that decision now necessitates downstream machine consumers to parse/decode that text back into its original, distinct fields.

I call this style of logging destructured or unstructured logging. We start with distinct fields of data and then destructure them into a fully free-form or semi-formed text. By doing so, we lose part of the original data structure and necessitate the need to reconstruct it (via parsing).

Read more at the blog post below for the solution.

http://gregoryszorc.com/blog/2012/12/06/thoughts-on-logging---part-1---structured-logging/
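To make the post’s point concrete, the snippet below is a small sketch (not the author’s code) that first shows the fragile regex parsing an unstructured line forces on machine consumers, and then a minimal structured alternative that emits the original fields as JSON. The JsonFormatter class and the fields attribute are illustrative assumptions, not a standard library feature.

import json
import logging
import re
from datetime import datetime, timezone

# The unstructured approach: log a sentence, then parse it back later.
line = "2012:11:24T17:32:23.3435 INFO gps@mozilla.com successfully logged in"
pattern = re.compile(r"^(?P<ts>\S+) (?P<level>\S+) (?P<user>\S+) successfully logged in$")
match = pattern.match(line)
if match:
    print("recovered fields:", match.groupdict())  # fragile: breaks if the wording changes

# A structured alternative: emit the fields themselves, one JSON object per line.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "time": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.msg,  # a stable event name, not a free-form sentence
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Downstream consumers simply json.loads() each line; no parsing rules to maintain.
log.info("user_login", extra={"fields": {"user": "gps@mozilla.com", "success": True}})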

Here is another interesting article on Project Lumberjack, an effort to provide a universal, structured logging solution.

http://blog.gerhards.net/2012/02/announcing-project-lumberjack.html
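Lumberjack builds on CEE-style structured syslog, where the message body carries an “@cee:” cookie followed by a JSON document. Here is a rough sketch of emitting such a message from Python, assuming a local syslog daemon listening on /dev/log; the field names are illustrative only.

import json
import logging
import logging.handlers

# Emit a CEE-style structured syslog message: "@cee:" cookie plus a JSON payload.
syslog = logging.handlers.SysLogHandler(address="/dev/log")
log = logging.getLogger("cee-demo")
log.addHandler(syslog)
log.setLevel(logging.INFO)

event = {"msg": "user login", "user": "gps@mozilla.com", "outcome": "success"}
log.info("@cee: " + json.dumps(event))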


Troubleshooting Applications with Microsoft IIS Logs

The Web server is a critical element of most enterprise solutions today. In order to effectively troubleshoot and monitor the integrity of services, it is essential to have a solid understanding of the underlying logging framework. Microsoft IIS is a very commonly deployed web server in Windows-oriented shops, and this article in MSDN Magazine provides a good overview of how to use IIS logs for troubleshooting.

Here is an abbreviated extract from the article; click on the link to read more: http://msdn.microsoft.com/en-us/magazine/dn519926.aspx

The first step is to turn on Windows logging on the server. The actual steps can vary (sometimes greatly) depending on which version of Windows Server is running. You’ll find a lot of useful information in these two MSDN articles for Windows Server 2003 and 2012: “How to configure Web site logging in Windows Server 2003” (bit.ly/cbS3xZ) and “Configure Logging in IIS” (bit.ly/18vvSgT). Once logging is on you need to find out the ID number in IIS of the Web site you’re troubleshooting. This is crucial, as servers typically host more than one Web site, and trying to find the log folder manually can be daunting. (I attempted it on a server running 45 Web sites and it was almost impossible.)
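Once the right W3SVC site folder is located, the W3C-format log files themselves are plain text: comment lines start with #, and a #Fields directive names each space-separated column. That makes scripted triage straightforward. Below is a minimal sketch in Python (the path and file name are hypothetical examples) that pulls out HTTP 5xx responses:

# Hypothetical example path; on IIS 7 and later the logs live under
# %SystemDrive%\inetpub\logs\LogFiles\W3SVC<siteID>\
LOG_PATH = r"C:\inetpub\logs\LogFiles\W3SVC1\u_ex140301.log"

fields = []
errors = []
with open(LOG_PATH, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        line = line.strip()
        if line.startswith("#Fields:"):
            # e.g. "#Fields: date time s-ip cs-method cs-uri-stem ... sc-status ..."
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            row = dict(zip(fields, line.split()))
            if row.get("sc-status", "").startswith("5"):  # keep server errors only
                errors.append(row)

for row in errors:
    print(row.get("date"), row.get("time"), row.get("cs-uri-stem"), row.get("sc-status"))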

Windows Event Log Monitoring

The ability to monitor and manage logs, especially in distributed and cloud environments, is critical to ensuring service integrity and availability. The NSA’s National Information Assurance Directorate has published a 40-plus-page guide covering how to configure and deploy central log collection, how to harden event collection, and which events to collect and log.


An extract from the guide is provided below. Click the link to read more: http://www.nsa.gov/ia/_files/app/Spotting_the_Adversary_with_Windows_Event_Log_Monitoring.pdf

It is increasingly difficult to detect malicious activity, which makes it extremely important to monitor and collect log data from as many useful sources as possible. This paper provides an introduction to collecting important Windows workstation event logs and storing them in a central location for easier searching and monitoring of network health.

This paper focuses on using the built-in tools already available in the Microsoft Windows operating system (OS). Central event log collection requires a Windows Server operating system version 2003 R2 or above. Many commercially available tools exist for central event log collection. Using a Windows Server 2008 R2 or above server version is recommended. There are no additional licensing costs for using the event log collection feature. The cost of using this feature is based on the amount of additional storage hardware needed to support the amount of log data collected. This factor is dependent on the number of workstations within the local log collection network.

Windows includes monitoring and logging capabilities and logs data for many activities occurring within the operating system. The vast number of events which can be logged does not make it easy for an administrator to identify specific important events. This document defines a recommended set of events to collect and review on a frequent basis. The recommended set of events is common to both client and server versions of Windows. Product specific events, such as Microsoft Exchange or Internet Information Services (IIS), are not discussed in this document, but should be centrally collected and reviewed as well.
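As a small illustration of acting on a recommended event set with nothing but the built-in wevtutil tool and a script, the sketch below queries a handful of commonly watched Security-log event IDs from Python. The IDs and descriptions are illustrative examples, not the guide’s full recommended list, and on a central collector the log name would typically be ForwardedEvents rather than Security.

import subprocess

# Illustrative event IDs only; the NSA guide defines the full recommended set.
WATCHED = {
    "1102": "Security audit log was cleared",
    "4625": "An account failed to log on",
    "4720": "A user account was created",
}

for event_id, description in WATCHED.items():
    xpath = "*[System[(EventID={})]]".format(event_id)
    # wevtutil is built into Windows; query the most recent five matching events.
    result = subprocess.run(
        ["wevtutil", "qe", "Security", "/q:" + xpath, "/c:5", "/rd:true", "/f:text"],
        capture_output=True, text=True,
    )
    print("=== {}: {} ===".format(event_id, description))
    print(result.stdout or "(no matching events)")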