Wednesday, September 16, 2009

Detailed Diagnosis in Enterprise Networks

Summary
This paper identifies a major problem with existing diagnostic tools: they identify faulty *hosts*, but operators almost always already know which host is faulty and instead need to track down some application-specific problem. Isolating these application faults is what operators really need. Thus the authors identify a conundrum: we want our diagnostic tool to be general (application agnostic), yet it must still give application-specific information.

As a resolution the paper takes an inference-based approach, looking at feature vectors of various generic and application-specific metrics and comparing their current values to historical data. A dependency graph is generated over the system, with directed edges between components that can directly affect each other. During analysis the algorithm assigns high weights to edges whose endpoints have correlated state. That is, if there is an edge from node A to node B and, at past times when A's state resembled its current state, B's state also resembled its current state, it is likely that something in A's state is causing B's state.
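
To make the edge-weighting idea concrete, here is a rough sketch in Python. It is my own simplification, not the paper's exact formula: the function names, the range-normalized similarity measure, and the choice of averaging over the `k` most similar past snapshots are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def metric_ranges(history: List[Vector], current: Vector) -> List[float]:
    """Per-metric spread (max - min) over the history plus the current state."""
    columns = zip(*history, current)
    return [max(col) - min(col) for col in columns]


def similarity(x: Vector, y: Vector, ranges: Sequence[float]) -> float:
    """1.0 for identical states, dropping toward 0.0 as the per-metric
    differences approach each metric's observed range."""
    diffs = [
        min(abs(a - b) / r, 1.0) if r > 0 else 0.0
        for a, b, r in zip(x, y, ranges)
    ]
    return 1.0 - sum(diffs) / len(diffs)


def edge_weight(hist_a: List[Vector], hist_b: List[Vector],
                now_a: Vector, now_b: Vector, k: int = 3) -> float:
    """Weight for the edge A -> B (hypothetical helper).

    Find the k past time steps where A looked most like it does now,
    then ask how much B at those same time steps looked like B now.
    """
    ranges_a = metric_ranges(hist_a, now_a)
    ranges_b = metric_ranges(hist_b, now_b)
    # Rank past time steps by how closely A's state then matches A's state now.
    ranked = sorted(
        range(len(hist_a)),
        key=lambda t: similarity(hist_a[t], now_a, ranges_a),
        reverse=True,
    )
    top = ranked[:k]
    return sum(similarity(hist_b[t], now_b, ranges_b) for t in top) / len(top)


# Toy history: one metric per component, four past snapshots.
hist_a = [[10.0], [12.0], [90.0], [95.0]]
hist_b = [[1.0], [2.0], [50.0], [55.0]]

# B is high now, as it was whenever A was high in the past: strong edge.
w_high = edge_weight(hist_a, hist_b, now_a=[92.0], now_b=[52.0], k=2)
# B is low now even though A is high: A probably is not driving B.
w_low = edge_weight(hist_a, hist_b, now_a=[92.0], now_b=[1.0], k=2)
```

The appeal of this style of weighting is that nothing in it knows what the metrics mean, which is how the tool can stay application agnostic while still reasoning over application-specific state.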

Using this weighted graph the system can identify probable culprits for an anomaly. A study of a number of reports sent to Microsoft showed that they could identify the correct culprit 80% of the time, and almost always had the culprit in their top 5 list.
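
One way to picture the ranking step is the sketch below. It is an illustrative stand-in rather than the paper's ranking function: I assume edge weights lie in [0, 1] and score each candidate by the strongest path (product of edge weights) from it to the affected component. The `rank_culprits` name and the toy component names are made up.

```python
import heapq
from typing import Dict, List, Tuple

# graph[src][dst] = weight in [0, 1] of the directed edge src -> dst
Graph = Dict[str, Dict[str, float]]


def rank_culprits(graph: Graph, affected: str) -> List[Tuple[str, float]]:
    """Rank components by the strongest weighted path to `affected`.

    Assumption: a path's strength is the product of its edge weights,
    and a candidate's score is the strength of its best path.
    """
    # Reverse the edges so we can walk backwards from the affected node.
    reverse: Graph = {}
    for src, outs in graph.items():
        for dst, weight in outs.items():
            reverse.setdefault(dst, {})[src] = weight

    best: Dict[str, float] = {affected: 1.0}
    heap = [(-1.0, affected)]  # max-heap via negated scores
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if score < best.get(node, 0.0):
            continue  # stale heap entry
        for pred, weight in reverse.get(node, {}).items():
            candidate = score * weight
            if candidate > best.get(pred, 0.0):
                best[pred] = candidate
                heapq.heappush(heap, (-candidate, pred))

    ranked = [(name, s) for name, s in best.items() if name != affected]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Toy dependency graph: db -> app -> web, plus a weakly correlated cache -> web.
graph = {
    "db": {"app": 0.9},
    "app": {"web": 0.8},
    "cache": {"web": 0.2},
}
ranking = rank_culprits(graph, "web")
top_names = [name for name, _ in ranking]
```

In this toy graph the strongly correlated chain through `app` and `db` outranks the weakly correlated `cache`, which is the kind of ordered shortlist the top-5 figure above refers to.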


Comments
This paper reads a bit differently, being an industry paper, and can feel a bit like a marketing pitch at times. Nevertheless, the technique was quite interesting, and it clearly worked, since I remember NetMedic being quite popular.

I also found the overall approach interesting. I got the sense that they began by looking at problem reports, figured out what a good solution would look like, and then set about finding a method to produce that solution.

One concern I have with this approach is that it requires you to run NetMedic on all hosts in your network, and statistics gathering software is prone to making systems unstable. Norton was/is notorious for this and one worries that the cure becomes worse than the problem.

1 comment:

  1. Well it is from Microsoft Research, so it isn't really industrial. But they do get to integrate over several real world networks of organizations that work with Microsoft. One does get the feeling in reading the paper that a future product is in the works.
