ServiceOps bridging the gap left by DevOps and Application Performance Monitoring (APM)

Application Performance Monitoring (APM) and DevOps are drawing a lot of interest in an IT community that is looking for ways to deliver faster and more reliable information services. DevOps largely bridges the gap between development and operational deployment through automation. It enables the rapid delivery of software through Continuous Integration and Deployment, with some limited forms of largely system-level monitoring. APM is another rapidly emerging technology that helps deliver a more reliable service through application-level metrics and monitoring, which accelerate incident resolution, performance issue detection and diagnostics. APM addresses issues that traditional monitoring and logging do not adequately cover. Adoption of APM solutions has been fast and furious, driven in large part by easy-to-consume, cloud-enabled SaaS services such as AppDynamics, New Relic, Loggly, LogEntries, SumoLogic, Boundary and many others, as well as open source alternatives such as Elasticsearch, Logstash and Kibana (ELK).

However, both DevOps and APM fall short in providing a truly holistic and light-weight operations solution.

ServiceOps addresses gaps left by DevOps and APM


Hard-core systems operations and management, which typically include patching, vulnerability management, backup and restore, continuous security monitoring and the financial management of the platform, are largely left out of both DevOps and APM. This is a serious gap in the current “state of the art”, given that typically 60-70% of a total system cost is associated with the Operations & Maintenance activity. Although ITIL is a robust framework that gained some traction in operations management over the past decade, it is arguably heavyweight, considered costly and lacks agility. Just as DevOps emerged as a logical implementation-level methodology for delivering agile application services through automation, ServiceOps provides an integrated, data-driven framework for platform operations that integrates with DevOps.

The key technology drivers for ServiceOps are Cloud Computing and Big Data. As infrastructure becomes more software driven and telemetry data becomes easily available across the whole “stack”, we now have the ability to collect, process and act on large amounts of data. The foundation elements of ServiceOps are in place.



ServiceOps is an implementation and delivery focused methodology that uses full-stack telemetry data and automation to help organizations deliver a reliable, secure and cost-effective IT service that is continuously optimized and includes End-User, System, Security and Financial operations.

For example, in pay-as-you-go cloud computing models, the ability to optimize the performance of cloud-based applications pays rich dividends in operational savings. Some organizations report saving up to 20% of their IaaS spend through rigorous monthly tracking and optimization: cleaning up “orphaned storage”, right-sizing VMs and using the right pricing model. Most IT organizations are ill-equipped for, and not focused on, the financial aspects of cloud computing. Similarly, there are serious emerging challenges in security operations. Traditional security frameworks tend to be reactive in nature, with forensic and trending analytics as the primary use cases. But with the advent of the NIST Cybersecurity Framework, high-profile incidents like the Target breach and the increased cyber threat, organizations must implement real-time, automated solutions that contain security costs and yet deliver a viable “armor” against threats.
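The monthly IaaS review described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Volume` and `VM` records, the 20% CPU threshold and the storage price are hypothetical, and a real review would pull inventory and utilization data from the cloud provider's billing and monitoring exports.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Volume:
    volume_id: str
    size_gb: int
    attached_to: Optional[str]  # VM id, or None when the volume is orphaned


@dataclass
class VM:
    vm_id: str
    vcpus: int
    avg_cpu_pct: float  # average CPU utilization over the review month


def orphaned_volumes(volumes: List[Volume]) -> List[Volume]:
    """Return volumes that are not attached to any VM."""
    return [v for v in volumes if v.attached_to is None]


def oversized_vms(vms: List[VM], cpu_threshold_pct: float = 20.0) -> List[VM]:
    """Return multi-vCPU VMs whose average CPU stayed below the threshold.

    The threshold is an assumed rule of thumb, not a provider recommendation.
    """
    return [m for m in vms if m.vcpus > 1 and m.avg_cpu_pct < cpu_threshold_pct]


def monthly_storage_waste(volumes: List[Volume], price_per_gb_month: float) -> float:
    """Estimate the monthly cost of storage that nothing is using."""
    return sum(v.size_gb for v in orphaned_volumes(volumes)) * price_per_gb_month


# Hypothetical inventory for one review cycle.
inventory_volumes = [
    Volume("vol-a", 100, "vm-1"),
    Volume("vol-b", 500, None),
]
inventory_vms = [
    VM("vm-1", 8, 6.5),
    VM("vm-2", 2, 71.0),
]

print("Orphaned volumes:", [v.volume_id for v in orphaned_volumes(inventory_volumes)])
print("Right-sizing candidates:", [m.vm_id for m in oversized_vms(inventory_vms)])
print("Monthly storage waste:", monthly_storage_waste(inventory_volumes, 0.10))
```

Running a report like this every month turns the financial side of operations into a routine, data-driven task rather than an occasional clean-up.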


Logging and Monitoring with Lucidworks Silk – Lucene-Solr, Kibana and Logstash

The folks at Lucidworks recently announced Silk – a logging and monitoring solution for enterprises using open source components like Solr, Kibana and Logstash.


You can learn more about their product on the Lucidworks website. Clearly, the logging and monitoring space has spawned many interesting open source options; Elasticsearch, for example, is also a Lucene-based solution that integrates with Kibana and Logstash. Based on discussions with early adopters, Lucidworks seems to be making an enterprise play by offering a more complete, enterprise-ready solution, because they have a “Big Data” offering as well.



Building a Logging and Monitoring Solution – How Cloudant did it

Application Performance Monitoring, Logging and Systems Monitoring continue to be critical for secure and reliable platform operations. A key decision is whether to buy or build such a service. There are any number of commercial services that cover a wide spectrum, and an equal number of mature open source “do-it-yourself” options such as Elasticsearch, Logstash and Kibana. Major platform and service providers such as Netflix (Suro), LinkedIn (Naarad) and Google (Dapper) have developed their own logging, monitoring and service reliability support systems. There is a lot of innovation in this space, in both the open source and commercial arenas. Here is an interesting article about Cloudant, a database-as-a-service provider that decided to build its own solution using open source components.

Read more about it on TechRepublic.

Data Collection and Real-time Performance Analytics at LinkedIn

The ability to rapidly collect, process and present log data from the full stack is becoming increasingly important for troubleshooting, ensuring service reliability and proactive service management. LinkedIn, Netflix and Google continue to set new standards in implementing new types of architectures and solutions for collecting and processing telemetry data.


LinkedIn relies on inGraphs and Naarad for operational monitoring and real-time application monitoring. Specifically, Naarad is a framework for performance analysis and rating of sharded and stateful services. Its supported use cases include Continuous Integration and Performance Investigation, and it parses JVM Garbage Collection (GC), System/Network (SAR), MySQL (Innotop) and JMeter (JTL/XML) logs, as well as VMStat, ZoneInfo and MemInfo data.

The link to the GitHub repository is here.

You can read more about how Naarad works by viewing this presentation.
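To make the idea of "rating" a service from collected metrics more concrete, here is a minimal Python sketch. It does not use Naarad or reproduce its API; the function names, the sample latencies and the p95 limit are all hypothetical, and the nearest-rank percentile is just one simple choice of summary statistic.

```python
import math
from typing import Dict, List, Union


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list, for pct in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def rate_metric(samples: List[float], p95_limit: float) -> Dict[str, Union[float, bool]]:
    """Summarize a metric and flag whether its p95 stays within the limit.

    The single-threshold rule is an assumption made for illustration.
    """
    p50 = percentile(samples, 50)
    p95 = percentile(samples, 95)
    return {"p50": p50, "p95": p95, "passed": p95 <= p95_limit}


# Hypothetical response times (ms) parsed from a load-test log.
latencies_ms = [12, 15, 11, 14, 90, 13, 12, 16, 15, 14]
print(rate_metric(latencies_ms, p95_limit=50))
```

A check like this could run as a gate in a Continuous Integration pipeline, failing the build when a new release pushes the tail latency past the agreed limit.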

Advanced Analytics on Operations Data

The ability to collect, process and synthesize IT operations data is increasingly critical for ensuring the service reliability and security of software applications. With the increased ability to collect and process large amounts of operational telemetry data, there is growing interest in applying advanced analytics techniques to help detect patterns and predict service failures. Here are two interesting articles on the use of analytical techniques for performance monitoring.

Structured Comparative Analysis of Systems Logs to Diagnose Performance Problems

“Diagnosis and correction of performance issues in modern, large-scale distributed systems can be a daunting task, since a single developer is unlikely to be familiar with the entire system and it is hard to characterize the behavior of a software system without completely understanding its internal components. Moreover, distributed systems are extremely complex because of the innate complexity of their code, combined with the network that can cause unpredictable delays and orderings.”


Using Survival Analysis for reliability of software systems

Survival analysis (SA) is a discipline of statistics that focuses on estimating time to events. Survival analysis methods are typically applied to clinical studies to help determine the effectiveness of certain drugs (time to patient death), to the reliability of software systems (time to failure) and to credit analytics (time to loan default).
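As an illustration of the time-to-failure case, the sketch below implements the Kaplan-Meier estimator, a standard survival analysis technique, in plain Python. It is not taken from the article referenced above, and the uptime figures are invented. At each failure time the running survival probability is multiplied by (1 - failures / units still at risk); units that were still healthy when observation stopped (censored) count as at risk until they leave.

```python
from collections import Counter
from typing import List, Tuple


def kaplan_meier(durations: List[float], observed: List[bool]) -> List[Tuple[float, float]]:
    """Return [(time, survival_probability)] at each time with a failure.

    durations: time until failure, or until observation stopped
    observed:  True if a failure was seen, False if the unit was censored
    """
    failures = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # every unit leaving the risk set, per time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical uptimes (days) for five service instances; two were still
# running when the observation window closed, so they are censored.
uptime_days = [5, 8, 8, 12, 15]
failed = [True, True, False, True, False]

for day, prob in kaplan_meier(uptime_days, failed):
    print(f"day {day}: estimated probability of surviving beyond = {prob:.2f}")
```

With this sample the estimated survival probability steps down to about 0.80 at day 5, 0.60 at day 8 and 0.30 at day 12. For production use, an established statistics library would be a safer choice than a hand-rolled estimator.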