Jesse Rothstein, who was the lead architect of F5's flagship product line, founded ExtraHop in 2007 to develop products that derive IT operations intelligence from data gleaned from the network. John Dix, Editor in Chief of Techworld's sister title Network World, recently caught up with Rothstein for an update on the company and what it has learned about things like virtual packet loss - the bane of highly virtualised environments.
How does your background at F5 help you at ExtraHop?
My co-founder Raja Mukerji and I were both at F5 for many years. What we did at F5 was bring application awareness and application fluency to the load balancer, and that created a whole new product category called the application delivery controller. At ExtraHop, we leverage that same domain expertise in high-speed packet processing and application fluency, but we've brought it to a new space, much more on the IT operations side, and we're starting to call this IT operations intelligence.
Raja and I had conversations with IT organisations and people we'd worked with in the past, and it became apparent to us that megatrends like server virtualisation, where VMs spin up and spin down and jump across the data centre, and agile development, where we roll out new versions of applications every two weeks or every two days, were producing an unprecedented level of scale, complexity and dynamism. The previous generation of tools and technologies that companies use to manage these environments is no longer tenable. And that's if they have those tools at all. More often than not, companies just throw smart people at the problem of figuring out what's going on.
So I would say, No.1, the situation has become such that we're beyond the capability of just throwing smart people at the problem and pulling a few all-nighters and ordering pizza. And No.2, the previous generation of tools was built for much smaller environments that were not dynamic. Those tools basically start off as bricks, and you parachute in teams of sales engineers and systems engineers and consultants to configure them in order to provide the visibility you need. Then if the environment changes, rather than automatically detecting the changes, you have to rinse and repeat that process.
So we started with the notion that these IT megatrends were occurring, that we had the domain expertise to solve some of the problems around scale and dynamism, and that we could provide visibility into these environments.
What are you lumping into the current generation of tools?
This is a taxonomy I've been thinking about for a while. In enterprise IT there are four or so sources of data that you can use to derive some intelligence about your environment.
So No.1 we have machine data, and I'm using a term that Splunk popularised. Machine data includes log files, SNMP and WMI, and all of these data sources are largely unstructured. Splunk and others like them realised that enterprises are producing a lot of this unstructured machine data and not really doing anything with it. So they built a platform to index it, archive it, and analyse it to derive some intelligence from it.
I sometimes joke that it's been transformational in the same way as fracking has been in the energy market. What I mean by that is, the value was always there, but by applying new technology we can now access it and extract it. So I think one source of data in the IT environment is this unstructured machine data.
Another source is what I would call code-level instrumentation. And this is what traditional Application Performance Management is based upon. Wily (acquired by CA) really founded that market, but companies like DynaTrace and AppDynamics and even New Relic make use of code-level instrumentation. They have agents that instrument the Java JVM or the .NET common language runtime, and they can derive some intelligence and some performance metrics around how that service performs. Where are the hotspots and bottlenecks? What's it doing? These are very useful tools for developers who have intimate knowledge of the code and want to see how it runs in production.
The third source of data I call service checks. There are lots of facilities for doing this. If you're running some sort of synthetic transaction (basically a script mirroring common user actions), you can use internal checks, which is what HP's Mercury SiteScope and Nagios do today, or external service checks like a Keynote or Compuware's Gomez. These give you a sense of whether your service or application is up or down and, to some degree, how it is performing. But there are challenges with this approach: because these checks are periodic in nature, there's an inherent undersampling problem. That means that if you've got any sort of intermittent issue, you very well might miss it.
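To make the undersampling point concrete, here is a toy simulation (not any vendor's actual check logic) of a hypothetical service with a 30-second intermittent outage. A check that polls every 60 seconds can sample entirely outside the outage window and never see it:

```python
# Toy illustration of service-check undersampling: an intermittent fault
# shorter than the polling interval can fall entirely between two checks.
# The service model and timings here are invented for the example.

def service_is_up(t: float) -> bool:
    """Hypothetical service: down only between t=70s and t=100s."""
    return not (70 <= t < 100)

def run_checks(interval: int, duration: int) -> list:
    """Poll the service every `interval` seconds; record (time, up) pairs."""
    return [(t, service_is_up(t)) for t in range(0, duration, interval)]

# A 60-second check samples at t=0, 60, 120, 180, 240 and never lands
# inside the 30-second outage window, so the outage goes undetected.
results = run_checks(interval=60, duration=300)
missed = all(up for _, up in results)
print("outage detected by polling:", not missed)   # → False
```

Tightening the interval helps but never eliminates the gap; a fault shorter than the new interval can still slip through.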
And finally the fourth fundamental source of data for intelligence is what we call wire data. That's everything on the network, from the packets to the payload of individual transactions. It is a very deep, very rich source of data. In fact, all indications are that wire data is at least one or two orders of magnitude larger than other sources of data, because there is just so much moving across our networks. And it's definitive. We know that a transaction completes if we can observe it completing on the wire and we can observe the peers in this conversation acknowledge that that transaction completed.
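The "definitive" quality of wire data can be sketched in a few lines. This is not ExtraHop's implementation, just an illustrative pairing of passively observed, timestamped request and response events (an invented event format) to produce per-transaction completion proof and response times:

```python
# Minimal sketch of the wire-data idea: pair each observed request with its
# response on the same connection. Seeing the response on the wire is direct
# evidence the transaction completed. Timestamps are in milliseconds.

def transaction_times(events):
    """events: iterable of (timestamp_ms, conn_id, kind), kind in
    {'request', 'response'}. Returns [(conn_id, response_time_ms), ...]."""
    pending = {}      # conn_id -> timestamp of the outstanding request
    completed = []
    for ts, conn, kind in events:
        if kind == "request":
            pending[conn] = ts
        elif kind == "response" and conn in pending:
            # The transaction observably completed; record its duration.
            completed.append((conn, ts - pending.pop(conn)))
    return completed

wire = [
    (0,   "c1", "request"),
    (10,  "c2", "request"),
    (45,  "c1", "response"),
    (900, "c2", "response"),
]
print(transaction_times(wire))   # → [('c1', 45), ('c2', 890)]
```

A real analyser reassembles TCP streams and parses application protocols, but the principle is the same: the measurement comes from observed behaviour, not from what a host reports about itself.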
To a large degree wire data has been neglected. Yes, there have been products like network probes and packet sniffers for three decades or more, but I would say they only scratch the surface of what's available on the wire. At ExtraHop we founded the company on the premise that there is this tremendously rich, tremendously deep source of data on the wire, and that by leveraging gains in processing power and storage capacity we could extract and analyse and derive intelligence from that data. It has required a completely different technology approach than any of the other sources of data. But it is, I believe, every bit as valuable.
I tell organisations that, as a best practice, they should probably have a product that is focused on each of these four sources. I wish I could say that there's one that does it all, but there isn't, because these do require pretty fundamentally different approaches.
APM providers argue they can see it all, embedded as they are in the applications. What are you providing they can't?
APM is really focused on code-level instrumentation, and there are probably three fundamental differences between us and APM. One is philosophical. We define the application differently. APM tends to define the application as the code running on a server, and they instrument that. At ExtraHop we define the application as the entire application delivery chain. That includes the client devices, the network transport, the front end, the middleware, the transaction queuing, back-end storage and even other ancillary services. It's a chain because if any one link fails, the entire application is down, and any one link can be a bottleneck. I can't tell you how many applications I've seen where the code is running fine but the application fails because of something as simple as DNS resolutions not completing. That has to be considered part of the delivery chain.
No.2 is audience. Traditional APM tends to be used more by developers who have intimate knowledge of the application code, whereas IT operation teams can get more out of our wire data analysis because it is focused on production-level systems. We answer the questions they care about most, like "What's happening right now? Did something change in my environment? Are transactions succeeding or failing? Is this better or worse than it usually is? What resources are people trying to access?"
And the third difference is between custom applications versus off-the-shelf packaged applications. APM solutions are much more popular with organisations that are developing custom applications because they're writing the code and the code is changing and they need to see how that's performing. We really sell to both. Yes, we absolutely are used by organisations that are writing custom applications, but we're also used by organisations who are dependent on packaged applications that they don't have very intimate knowledge of, but still absolutely care how well it's working.
You guys deliver as an appliance, right?
Yes. We're sold as a physical or a virtual appliance.
And where do you plug in?
For us, we just take a copy of the network traffic with no overhead at all. We're not in line, we're out of line. And how we get a copy of the traffic really depends on the environment. Sometimes it's directly from one or more switches using a SPAN port or a VACL capture. Sometimes there is a whole aggregation-tapping layer that's in place. Some organisations even use some pretty advanced SDN techniques to get us traffic to analyse. At the end of the day, if we get a feed of the traffic, we can make sense of it.
But I want to stress that, even though we're a network deployment and we analyse what I'm calling the wire data, we're really answering questions about the health and performance of business-critical applications. So it's not just network teams that use an ExtraHop system. And that's an important distinction, because I see that confusion a lot.
Do you have a sweet spot in terms of customer size?
Our high-end physical appliances can support 20 gigabits of line-rate analysis, and hundreds of thousands of transactions per second. So we have large enterprises and carriers that use multiple EH8000 appliances across the data centre with an ExtraHop Central Manager to provide a unified view. Our initial customers were larger enterprises, but we're starting to see more adoption at mid-size organisations because we also have virtual appliances that can analyse a gigabit of traffic and cost less than $10,000.
How are the virtual appliances used?
First of all, a virtual appliance can actually terminate traffic from physical systems as well as virtual systems, so the fact that it runs as a virtual appliance is really just a delivery form factor for us. We're certified by Cisco to run in the Cisco UCS environment, where there is great flexibility around tapping virtual traffic. With VMware vSphere 5.1 and the distributed vSwitch, they introduced support for both RSPAN and ERSPAN and the ability to tap virtual traffic for security and monitoring purposes. And some of the announcements at VMworld around the new NSX offering afford even greater flexibility. So there are a number of approaches to take there, but I think the short answer is that virtual networking has matured rapidly in the past 24 months or so, and we're seeing great capabilities for tapping virtual traffic much as you would tap physical traffic.
Do efforts to virtualise everything increase the need for your type of product?
Absolutely. Any time there are additional layers of abstraction it increases the need for not just our product, but solutions to help manage that complexity. That's a general trend. And certainly server virtualisation and SDN are additional layers of abstraction and complexity. But we've worked with a lot of customers around things as simple as physical-to-virtual migrations, where they need to prove to the application owners that when they migrate an application from a physical environment to a virtual environment the performance and availability are the same or better. Or if they're not, they need to be able to measure that they're not.
And in a virtual environment, you can't measure performance by looking at resource utilisation -- how much CPU it takes or how much memory is required. Resource utilisation is not the same thing as performance; it's not the same thing as response time. In fact, in the virtual environment, we derive greater efficiency and cost savings by not leaving as much headroom and by utilising CPU and memory resources more efficiently. You actually want the CPU of your physical host to be highly utilised, but you don't want it to be under-provisioned. That's the balance.
A great example of additional complexity in these environments is something we call virtual packet loss. Hypervisors are basically schedulers. They have to share resources across multiple guest machines and packets can get delayed, sometimes delayed enough that they're considered lost by the underlying network stack. Now TCP is very resilient to loss. If loss occurs on the network TCP will retransmit, so you might see extra packets on the network, and that might affect your throughput, but it doesn't necessarily affect your performance.
However, a sufficient amount of loss can actually be devastating to performance. It can cause one- to two-second stalls in the application flows. So what's particularly insidious about virtual packet loss is that if, due to hypervisor scheduling, packets are delayed to the point where they're considered lost, and this loss causes poor performance, it's really hard to track down. Because if you query your switches and routers using traditional means and ask, "How many frames did you drop?", the answer is none. No frames were dropped; they were just delayed. So a system like ours that's analysing the wire data in real time can detect when this occurs, and it's essentially a provisioning problem, where certain physical hosts were under-provisioned for the load that was on them. But it's an example of complexity and some challenges that exist in these virtual environments that we simply don't see in physical environments.
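The wire-side signature Rothstein describes is visible in TCP sequence numbers: a sender re-sends a byte range the observer has already seen, even though every switch reports zero drops. This is a simplified sketch of that detection, with an invented segment format (real analysis must also handle sequence wraparound, SACK, out-of-order delivery and so on):

```python
# Simplified sketch: count TCP retransmissions per flow by watching for
# segments whose sequence range was already observed. Such segments reveal
# loss (including hypervisor-induced "virtual" loss) on the wire even when
# device drop counters read zero.

def count_retransmissions(segments):
    """segments: iterable of (flow_id, seq, length).
    Returns {flow_id: retransmitted_segment_count}."""
    next_expected = {}   # flow_id -> highest sequence byte seen so far
    retrans = {}
    for flow, seq, length in segments:
        hi = next_expected.get(flow, 0)
        if seq < hi:
            # This byte range was already seen: the sender re-sent it, so
            # the original was lost, or delayed past the retransmit timeout.
            retrans[flow] = retrans.get(flow, 0) + 1
        next_expected[flow] = max(hi, seq + length)
    return retrans

segments = [
    ("f1", 0, 1460), ("f1", 1460, 1460),
    ("f1", 1460, 1460),          # same range again: a retransmission
    ("f1", 2920, 1460),
]
print(count_retransmissions(segments))   # → {'f1': 1}
```

The point of the example is the asymmetry: this evidence exists only in the traffic itself, which is why querying the switches turns up nothing.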
Another similar example, but perhaps higher level, is around things like auto-discovery. I don't really want to get into a debate about whether or not configuration management databases should exist anymore or how some organisations use them and some don't, but I think we can all say that virtual environments are much more dynamic. VMs spin up and they spin down. IP addresses are reused. It's a lot more difficult to figure out exactly what's running, who's using it, what the dependencies are, where stuff is located.
And that means the level of auto-discovery required in our virtual environments is far beyond the more static asset management we could get away with in physical environments. Our customers have used our auto-discovery capabilities for a variety of things: to find dependencies they inherited in systems that weren't well understood, or to manage data centre consolidations, where they need to understand application-level dependencies in order to migrate an application to a new data centre. Just yesterday a customer told me about a troubleshooting scenario I hadn't heard before. A system that was mislabeled as a test system went down, and they didn't know what it was doing. So they used the ExtraHop system to go back in time a bit and say, "OK, what was this system doing and who was talking to it?" It sounds bad, but in large, complex environments things like that can occur.
Do you guys get pulled into security scenarios at all?
We do. This is probably another parallel between us and Splunk. We're not a security company, but we are being used more and more for security use cases. And it's because we're providing visibility into things that are occurring on the wire. If there is an anomaly - if all of a sudden a database serves a 100-megabyte response to a user when usually the responses are a couple of kilobytes - that might have security implications. It could be a data leakage event or a rogue application. It could be something that's broken. As we see more sophisticated intrusions and more zero-day vulnerabilities being exploited, better visibility into these environments and the ability to detect when things are abnormal or anomalous become that much more important. So there are definitely security implications with the type of visibility we provide, and a lot of that is in the interpretation.
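The database example above - kilobyte responses punctuated by a sudden 100-megabyte one - lends itself to a very simple baseline check. This sketch is a deliberately naive stand-in for the richer statistics a real product would use, flagging responses far above the median size:

```python
# Toy anomaly flag for the scenario in the interview: a response orders of
# magnitude larger than the usual baseline. The threshold (100x the median)
# is an arbitrary choice for illustration, not a recommended setting.

from statistics import median

def flag_anomalies(sizes, factor=100):
    """Return indices of responses more than `factor` times the median size."""
    baseline = median(sizes)
    return [i for i, s in enumerate(sizes) if s > factor * baseline]

# Typical responses of a few KB, then one 100 MB response.
responses = [2_000, 3_500, 2_800, 4_100, 100_000_000]
print(flag_anomalies(responses))   # → [4]
```

Using the median rather than the mean keeps the baseline itself from being dragged upward by the very outlier you are trying to catch; as the interview notes, whether the flagged event is leakage, a rogue application or a bug is a matter of interpretation.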
What typically gets you in the customer's door?
There is so much noise in the market and there are so many different vendors and they all use the same vocabulary and they all say the same words. A lot of what gets us in the door is word of mouth. Some sort of reference. And I think one of the reasons that occurs is we are very focused on customer success as a company. It really permeates our culture. It comes from my co-founder and myself. I don't have a sales background, I have an engineering background. We like building great products. We like solving hard problems. We like making our customers successful. And when you do that, that's good business, because customers come back for repeat purchases and they recommend you to other people.
How does your customer base break out between service providers and enterprise IT shops?
Most of our customers are enterprise customers. However, we do have several carrier customers.
All right. Any closing thoughts?
We're seeing tremendous traction across a number of vertical industries. And we recently introduced a free virtual appliance called the ExtraHop Discovery Edition that provides analysis for up to a gigabit of traffic, and we're seeing a lot of interest in that as well. I think the megatrends I mentioned earlier, whether it's server virtualisation or agile development or just modern, globally distributed architectures and applications, are driving a much greater need for visibility. We're seeing organisations that want to become more proactive, that want to be able to detect little problems before they turn into big disasters. They want to mitigate risk for new application rollouts, data centre consolidations or physical-to-virtual migrations, and ExtraHop can help them with all of that.