News
19 Aug 2006: illuminati database reaches 5 million unique clients.
13 May 2006: Plugin for incorporating illuminati into WordPress blogs.
5 Apr 2006: Introduce support for teams and live statistics.
24 Mar 2006: Measurements go live on all CoralCDN servers.
Overview
The goal of this project is to explore the transparency (or, conversely, the opacity) of the Internet's edge. We seek to grapple with a number of questions about client populations and deployments that have been difficult to answer up until now, including things like:
This is an academic endeavor. While we plan to publish our results at refereed conference(s), all identifying information will be kept private and only released in anonymized or aggregate form.
This analysis is important because significant decisions are often based on the IP addresses available to servers when answering an online request. As an example, a web content distribution network (CDN), such as Akamai or CoralCDN, may use the public IP address of the DNS request to choose a web server "close" to the client. However, that address belongs to the client's DNS resolver, not the client itself, and it is unclear, quantitatively, how close clients are to their DNS resolvers on average. Further, filtering decisions such as blacklisting and whitelisting are often expressed over the public IP addresses of clients. However, it is well known that, with the pervasive deployment of proxies and NATs, an IP address is a poor approximation for a unique host. Yet, to our knowledge, there are no public studies that attempt to explore the extent of NAT and proxy usage, nor practical methods for doing so in a large-scale measurement effort.
Our study, hereafter called illuminati for its attempt to "illuminate the shadows of the Internet", seeks to perform opportunistic measurements using oblivious Internet clients. We apply a number of measurement techniques to explore the extent, type, location, and configuration of web clients, their proxies, and their DNS resolvers.
We interpose ourselves on real web traffic, and thus on real traffic patterns. Integrating illuminati is as simple as inserting a small embedded object (a "web-bug") on your webpage, which loads in the background. The web-bug is requested from our measurement servers and subsequently helps us perform some local measurements.
Alternatively, a webpage author can integrate illuminati by passing links' click-through traffic through our measurement servers before the client gets redirected back to the desired content.
To cast a wide net, we are asking third-party server operators to integrate such web-bugs on their own pages. We'll even track your individual page's stats for you and show how they compare with those of other adopters. Please see our teams page for more information. We also transparently redirect some HTTP requests for CoralCDN through illuminati using our redirection scheme, which accounts for a significant amount of traffic spread worldwide.
Using illuminati within web pages
More specifically, when a client accesses a webpage with an illuminati web-bug, it performs the following:
Interposing illuminati on web links
When illuminati is integrated via link redirection, as is the case with CoralCDN, a client performs the following:
Measurement techniques: Detecting NATs, proxies, and private addresses
At a high level, when a client downloads the measurement object from our servers, it performs several additional actions:
These actions include performing a fresh DNS request to our
nameservers, as well as several subsequent GET requests and TCP
connections to detect NAT and proxy use. We force a fresh DNS
request (and can subsequently correlate clients directly with their
DNS resolvers) by synthesizing a random hostname in the returned
object, as described earlier.
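The random-hostname trick can be illustrated with a minimal sketch. It assumes a measurement zone named `measure.example.org` and simple `(label, ip)` log tuples; both are hypothetical stand-ins, since the text does not give the real domain or log format. Because the random label cannot already sit in any resolver's cache, the lookup must reach the authoritative nameserver, and the same label later appears in the client's HTTP request, which lets the two observations be joined.

```python
import secrets
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical measurement zone; the real domain is not given in the text.
MEASUREMENT_ZONE = "measure.example.org"


def fresh_hostname(zone: str = MEASUREMENT_ZONE, nbytes: int = 8) -> str:
    """Return a random hostname under `zone` that no resolver has cached."""
    label = secrets.token_hex(nbytes)  # 2 * nbytes lowercase hex characters
    return f"{label}.{zone}"


def label_of(hostname: str, zone: str = MEASUREMENT_ZONE) -> Optional[str]:
    """Recover the random label from a hostname, or None if outside the zone."""
    suffix = "." + zone
    if hostname.endswith(suffix):
        return hostname[: -len(suffix)]
    return None


def correlate(
    dns_log: Iterable[Tuple[str, str]],   # (label, resolver_ip) seen by the nameserver
    http_log: Iterable[Tuple[str, str]],  # (label, client_ip) seen by the web server
) -> Dict[str, str]:
    """Map each client IP to the resolver IP that looked up the same label."""
    resolver_by_label = {label: resolver for label, resolver in dns_log}
    return {
        client: resolver_by_label[label]
        for label, client in http_log
        if label in resolver_by_label
    }
```

In this sketch the label is the only join key, so it needs enough randomness that two clients are very unlikely to receive the same one; 8 random bytes is an arbitrary choice for illustration.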
Java Applet: For clients that support Java, we run a small applet which performs the following tests:
Compare SYN-fingerprint with HTTP User-Agent: Our modified web server captures the SYN packet of every incoming request and uses it to generate a SYN fingerprint. The SYN fingerprint is then used to try to determine the type of the sending host. For common operating systems such as Windows, Mac OS X, BSD and Linux, SYN fingerprinting is relatively accurate. If the OS guess from the SYN fingerprint differs from the User-Agent field in the client's HTTP headers, or if the client's User-Agent names a common operating system and the OS type cannot be gleaned from the SYN fingerprint, it is likely that the request went through a proxy (or the true client is within a virtual machine, which is often the case).
Check Client Headers: Some proxies add headers to clients' requests that not only reveal the proxy's existence, but sometimes also reveal the IP address of the client being forwarded. We use the following headers to indicate the presence of a proxy:
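As a rough illustration of the two checks above, the following sketch combines them into one predicate. Everything named here is an assumption made for the example: the header names in `PROXY_HEADERS` are commonly seen proxy headers and are not the project's actual list (which is not reproduced in this text), the `UA_OS_HINTS` substring table is a simplification, and `syn_os` stands for whatever OS family a SYN fingerprinter reports, or `None` when it cannot tell.

```python
from typing import Mapping, Optional

# Assumed, illustrative header names; NOT the project's actual list.
PROXY_HEADERS = ("via", "x-forwarded-for")

# Illustrative User-Agent substrings mapped to a coarse OS family.
UA_OS_HINTS = (
    ("windows", "windows"),
    ("mac os x", "macos"),
    ("linux", "linux"),
    ("bsd", "bsd"),
)


def os_from_user_agent(user_agent: str) -> Optional[str]:
    """Return the OS family claimed by the User-Agent, or None if unrecognized."""
    ua = user_agent.lower()
    for needle, family in UA_OS_HINTS:
        if needle in ua:
            return family
    return None


def likely_proxy(
    syn_os: Optional[str],        # OS family guessed from the SYN packet, if any
    user_agent: str,
    headers: Mapping[str, str],
) -> bool:
    """Apply the header check, then the SYN-versus-User-Agent comparison."""
    # Check 1: a proxy-added header is present (header names are case-insensitive).
    names = {name.lower() for name in headers}
    if any(h in names for h in PROXY_HEADERS):
        return True

    # Check 2: the User-Agent claims a common OS, but the SYN fingerprint
    # either disagrees or could not identify an OS at all.
    ua_os = os_from_user_agent(user_agent)
    if ua_os is None:
        return False  # sketch's choice: an unrecognized User-Agent gives no signal
    return syn_os is None or syn_os != ua_os
```

For example, `likely_proxy("linux", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", {})` flags the request, because the SYN packet looks like Linux while the browser claims Windows. As the text notes, a virtual machine can produce the same mismatch, so a positive result is a hint rather than proof.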
DNS Server Usage: We've installed a modified DNS server
(described above) to correlate client requests with their DNS
requests. Many clients will use 2 or 3 DNS servers, yet few will use
more. If a particular public IP address is using a large number of
DNS servers (say 5 or more) over multiple requests, it is likely that
the public IP corresponds to a proxy, not to an end-user.
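The resolver-count heuristic can be sketched as follows, assuming the correlated measurements are available as `(client_ip, resolver_ip)` pairs. That input shape and the function name are hypothetical; the threshold of 5 follows the "say 5 or more" figure in the text and would need tuning against real data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

RESOLVER_THRESHOLD = 5  # "say 5 or more" in the text; an adjustable guess


def likely_proxies(
    observations: Iterable[Tuple[str, str]],  # (client public IP, resolver IP)
    threshold: int = RESOLVER_THRESHOLD,
) -> Set[str]:
    """Return public IPs seen with at least `threshold` distinct DNS resolvers."""
    resolvers: Dict[str, Set[str]] = defaultdict(set)
    for client_ip, resolver_ip in observations:
        resolvers[client_ip].add(resolver_ip)  # sets ignore repeated pairs
    return {ip for ip, seen in resolvers.items() if len(seen) >= threshold}
```

Counting distinct resolvers, rather than total requests, keeps a busy single-user host that always uses the same two or three resolvers from being flagged.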