The X-Trace paper presents a system for tracing causal paths through multi-host network architectures to help with debugging distributed applications. The authors identify the problem that modern systems are often made up of many heterogeneous hosts, and that a request must traverse a number of these hosts in its lifetime. Finding out why a given request is not performing as expected requires investigating every host along that path, which is difficult and time-consuming.
X-Trace proposes to deal with this problem by adding a small amount of metadata in-band with each request. Each participating host must be modified to append its own data to the request: for example, an HTTP proxy adds some metadata, then pushes the request to the appropriate host, which adds its own data, and so on. Because the propagation is recursive, it assumes no particular architecture a priori. Once the request is finished, the collected metadata can be examined and used to build the resulting task tree. This tree shows how much time is spent at each component and identifies which component, if any, failed along the path.
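The propagation described above can be sketched in a few lines of Python. This is a toy model that follows the summary in this post, not the X-Trace wire format or API: the `Record` fields, the `visit` helper, the host names, and the timings are all made up for illustration.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional

_ids = itertools.count(1)  # hypothetical source of operation ids


@dataclass
class Record:
    op_id: int
    parent_id: Optional[int]  # None marks the root of the task tree
    host: str
    elapsed_ms: float
    ok: bool = True


@dataclass
class Request:
    payload: str
    trace: list = field(default_factory=list)  # metadata carried in-band


def visit(req, host, parent_id, elapsed_ms, ok=True):
    """A participating host appends its own record and returns its op id."""
    op_id = next(_ids)
    req.trace.append(Record(op_id, parent_id, host, elapsed_ms, ok))
    return op_id


def build_tree(trace):
    """Group records by parent op id so the tree can be walked top-down."""
    children = {}
    for rec in trace:
        children.setdefault(rec.parent_id, []).append(rec)
    return children


def render(children, parent_id=None, depth=0):
    """Return one indented line per operation, in depth-first order."""
    lines = []
    for rec in children.get(parent_id, []):
        status = "ok" if rec.ok else "FAILED"
        lines.append(f"{'  ' * depth}{rec.host} {rec.elapsed_ms}ms {status}")
        lines.extend(render(children, rec.op_id, depth + 1))
    return lines


# Made-up request path: proxy -> web server -> (PHP script, database).
req = Request("GET /index.php")
proxy = visit(req, "http-proxy", None, 3.0)
web = visit(req, "web-server", proxy, 12.0)
visit(req, "php-script", web, 850.0, ok=False)
visit(req, "database", web, 40.0)

tree = build_tree(req.trace)
slowest = max(req.trace, key=lambda r: r.elapsed_ms)
failed = [r.host for r in req.trace if not r.ok]
print("\n".join(render(tree)))
```

Because each record names its parent, the tree falls out of the data regardless of how the hosts are arranged, which is the point of not assuming an architecture up front.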
The paper then evaluates X-Trace in three example situations:
- A web request with recursive DNS queries. Using X-Trace, the authors are able to follow the recursive DNS query and isolate faults along the way. This also lets them see which cache a result came from and understand why a returned address might be out of date.
- A web hosting provider. X-Trace made it possible to determine that a request was failing in, say, a PHP script rather than in a DB query.
- An overlay network. This is probably the most complex scenario. X-Trace distinguishes between a number of possible faults in Chord, including the receiving node failing or a middlebox failing, whether because a particular process crashed or because the whole host went down.
I like this paper a lot. It gives a clear outline of the problem and proposes a useful and practical solution. I've used X-Trace in a couple of projects and it certainly helps when debugging a distributed application. In one case we were confused about why a request was taking so long. X-Trace allowed us to pinpoint the spot in some client code that was doing a slow type conversion. At least four other systems were involved in the request, and without the X-Trace data each one would have had to be examined.
There is some concern that requiring modification to the client will hurt adoption. While this might be true, it's hard to see how something like X-Trace could work without such modification. I also think inter-node communication libraries like Thrift might help the situation. We've added X-Trace support to the Thrift protocol, making the actual client modification a trivial one-line change.
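To illustrate why pushing the tracing into the communication library keeps the client change small, here is a minimal sketch. It does not use the Thrift API: `Transport`, `TracedTransport`, the `x-trace` header name, and the `task:op` header format are all hypothetical stand-ins.

```python
import uuid


class Transport:
    """Hypothetical stand-in for an RPC library's send path."""

    def __init__(self):
        self.sent = []  # records calls instead of hitting the network

    def send(self, method, args, headers=None):
        self.sent.append((method, args, dict(headers or {})))


class TracedTransport(Transport):
    """Same interface, but stamps every outgoing call with trace metadata."""

    def __init__(self, task_id=None):
        super().__init__()
        self.task_id = task_id or uuid.uuid4().hex
        self._next_op = 0

    def send(self, method, args, headers=None):
        headers = dict(headers or {})
        self._next_op += 1
        # Made-up header layout: "<task id>:<operation counter>".
        headers["x-trace"] = f"{self.task_id}:{self._next_op}"
        super().send(method, args, headers)


def lookup_user(transport, user_id):
    """Application code: unaware of whether the transport is traced."""
    transport.send("lookup_user", {"id": user_id})


transport = TracedTransport(task_id="demo")  # the only line the client changes
lookup_user(transport, 42)
lookup_user(transport, 43)
```

The application function is untouched; only the line that constructs the transport differs between the traced and untraced versions.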
I would keep this paper on the syllabus.
I agree with your final thoughts. It turns out that retrofitting X-Trace onto existing code is hard because the parts that generate packets are spread throughout the code base. For new apps written with the right libraries, it is not so difficult to make the code xtraceable. George evangelized this inside Sun, but he doesn't work there anymore.