"Illuminating the shadows of the Internet"

Opportunistic network and web measurement

News

    19 Aug 2006:   illuminati database reaches 5 million unique clients.

    13 May 2006:   Plugin for incorporating illuminati into WordPress blogs.

      5 Apr 2006:   Introduce support for teams and live statistics.

    24 Mar 2006:   Measurements go live on all CoralCDN servers.

Overview

The goal of this project is to explore the transparency (or conversely opacity) of the Internet's edge. We seek to grapple with a number of questions about client populations and deployments that have been difficult to answer up until now, including things like:
  • What is the distance between clients and their DNS resolvers?
  • How suboptimal (in terms of network distance) is clients' use of web proxies?
  • What do client populations behind NATs and web proxies look like?
  • How effective is blacklisting/whitelisting IPs for admissions control to websites? What additional information can be applied to make these approaches more effective?
In other words, given an online service request from a client, such as an HTTP GET, what is the expected accuracy of decisions based on the client's public IP address or on other public information accompanying the request (e.g., HTTP headers)?

This is an academic endeavor. While we plan to publish our results at refereed conference(s), all identifying information will be kept private and only released in anonymized or aggregate form.

This analysis is important because significant decisions are often based on the IP addresses available to servers when answering an online request. As an example, a web content distribution network (CDN), such as Akamai or CoralCDN, may use the public IP address of the DNS request to choose a web-server "close" to the client. However, that address belongs to the client's DNS resolver, not the client itself, and it is unclear, quantitatively, how close clients are to their DNS resolvers on average.

Further, filtering decisions such as blacklisting and whitelisting often are expressed over public IP addresses of the client. However, it is well known that with the pervasive deployment of proxies and NATs, an IP address is a poor approximation for a unique host. Yet, to our knowledge, there are no public studies which attempt to explore the extent of NAT and proxy usage, nor practical methods for doing so in a large-scale measurement effort.

Our study, hereafter called illuminati for its attempt to "illuminate the shadows of the Internet", seeks to perform opportunistic measurements using oblivious Internet clients. We apply a number of measurement techniques to explore the space around the extent, type, location, and configuration of web clients, their proxies, and their DNS resolvers.

To interpose ourselves on real web traffic, and thus on real traffic patterns, we offer two ways to integrate illuminati. The simplest is to insert a small embedded object (a "web-bug") on your webpage, which loads in the background. The web-bug is requested from our measurement servers and subsequently helps us perform some local measurements. Alternatively, a webpage author can integrate illuminati by passing links' click-through traffic through our measurement servers, before the client gets redirected back to the desired content.

To cast a wide net, we are asking third-party server operators to integrate such web-bugs on their own pages. We'll even track your individual page's stats for you, along with how they compare with those of other adopters. Please see our teams page for more information. We also transparently redirect some HTTP requests for CoralCDN through illuminati using our redirection scheme, which accounts for a significant amount of traffic spread world-wide.

Using illuminati within web pages

More specifically, when a client accesses a webpage with an illuminati web-bug, it performs the following:

  1. Client sends an HTTP GET request to your webserver. Your server returns the requested web object, which includes an embedded link to our measurement servers, shown here.

  2. The client transparently sends an HTTP GET request to our measurement servers, which return a small HTML object:
        <html><head></head><body>
        <img src="http://$rand-$pubip.img.cdn.coralcdn.org/404.jpg" width=0 height=0>
        <a href ="http://$rand-$pubip.href.cdn.coralcdn.org/404.html"> </a>
        <APPLET CODE="prxytrkr.class" WIDTH=0 HEIGHT=0></APPLET>
        </body></html>
    
    where $rand is a new random 32-bit number and $pubip is the client's public IP address. prxytrkr.class is a small (~1500 byte) Java applet that performs some additional measurement tasks. Note that server operators can choose a "diet" version of the web-bug, which does not include this Java applet (see here).

    The client will perform a few HTTP GET requests to our servers at *.cdn.coralcdn.org, which we log and use for our measurement study.
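
As a rough illustration of step 2, the following sketch fills in the web-bug template for one client. It is not the project's server code: the function name `render_web_bug`, the `diet` flag, and the choice to write `$pubip` in dotted form inside the hostname are assumptions made for this example.

```python
import ipaddress
import secrets
from string import Template

# Template mirroring the HTML object shown above; $applet is a hypothetical
# slot so that the "diet" variant can omit the Java applet.
WEB_BUG = Template(
    '<html><head></head><body>\n'
    '<img src="http://$rand-$pubip.img.cdn.coralcdn.org/404.jpg" width=0 height=0>\n'
    '<a href ="http://$rand-$pubip.href.cdn.coralcdn.org/404.html"> </a>\n'
    '$applet'
    '</body></html>\n'
)
APPLET_TAG = '<APPLET CODE="prxytrkr.class" WIDTH=0 HEIGHT=0></APPLET>\n'


def render_web_bug(client_ip: str, diet: bool = False) -> str:
    """Return the measurement object for one client request."""
    # Reject anything that is not an IPv4 address before embedding it in a
    # hostname (raises ValueError on malformed input).
    ipaddress.IPv4Address(client_ip)
    # A new random 32-bit number makes the hostname unique, so the client's
    # resolver cannot answer from cache.
    rand = secrets.randbits(32)
    return WEB_BUG.substitute(
        rand=rand,
        pubip=client_ip,  # assumption: dotted form is used in the hostname
        applet="" if diet else APPLET_TAG,
    )
```

Calling `render_web_bug("192.0.2.7", diet=True)` would produce the same object without the applet tag.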

Interposing illuminati on web links

When illuminati is integrated via link redirection, as is the case with CoralCDN, a client performs the following:

  1. Client clicks on a link on your web page, issuing a GET request to your webserver. Your server issues an HTTP redirect to one of the following URLs:
        http://www.cdn.coralcdn.org/redirect.html?url=$url
        http://www.cdn.coralcdn.org/redirect.diet.html?url=$url
    
    using some dynamic link rewriting technique, such as PHP scripts or Apache's mod_rewrite. Of course, this link could also just be hard-coded in the HTML originally returned to the client, thus avoiding this GET request.

    Because we use HTML redirects (via javascript and META REFRESH commands) later, it is important that such redirection only occur for web objects that obey such directives, i.e., no <img src or <embed src links.

  2. The client sends an HTTP GET request to our measurement servers, which return a small HTML object as before, with the addition of a redirect back to the URL specified before.
        <html><head>
        <META http-equiv="refresh" content="2;URL=$url">
        </head><body>
        <img src="http://$rand-$pubip.img.cdn.coralcdn.org/404.jpg" width=0 height=0>
        <a href ="http://$rand-$pubip.href.cdn.coralcdn.org/404.html"> </a>
        <APPLET CODE="prxytrkr.class" WIDTH=0 HEIGHT=0></APPLET>
    
        <script language="JavaScript"><!--
        window.onerror = null;
        setTimeout('Redirect()',1000);
        function Redirect() { location.href = '$url'; }
        // -->
        </script>
        </body></html>
    
    The client will perform a few HTTP GET requests to our servers at *.cdn.coralcdn.org, which we log and use for our measurement study.

  3. The client issues an HTTP GET request to $url and fetches the desired web page. While interposing illuminati on a click-through does add a small amount of latency (mostly the time to bring up the Java VM), its functionality should appear transparent to the client.
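
For site operators who rewrite links dynamically, the redirect URL from step 1 could be built along these lines. This is a minimal sketch: the helper name `illuminati_redirect` is hypothetical, and percent-encoding the target before placing it in the `url` parameter is an assumption about what the redirect page expects, not something the description above states.

```python
from urllib.parse import quote

REDIRECT_BASE = "http://www.cdn.coralcdn.org/"


def illuminati_redirect(url: str, diet: bool = False) -> str:
    """Wrap a destination URL in one of the two redirect pages listed above."""
    page = "redirect.diet.html" if diet else "redirect.html"
    # Assumption: the target is percent-encoded so that its own '?' and '&'
    # characters are not read as part of the redirect page's query string.
    return f"{REDIRECT_BASE}{page}?url={quote(url, safe='')}"
```

As noted in step 1, only links to objects that honor HTML redirects should be wrapped this way, so image and embed sources should be left untouched.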

Measurement techniques: Detecting NATs, proxies, and private addresses

At a high level, when a client downloads the measurement object from our servers, it performs several additional actions: a fresh DNS request to our nameservers, followed by several GET requests and TCP connections used to detect NAT and proxy use. We force a fresh DNS request (and can subsequently correlate clients directly with their DNS resolvers) by synthesizing a random hostname in the returned object, as described earlier:

    <img src="http://$rand-$pubip.img.cdn.coralcdn.org/404.jpg" width=0 height=0>

We use a combination of techniques to determine whether or not a web client is behind a proxy or a NAT. This includes information collected from a Java applet (provided the client supports Java), inferences using HTTP headers, network stack fingerprinting, and DNS use patterns. We discuss each of these techniques now:

Java Applet: For clients that support Java, we run a small applet which performs the following tests:

  • Creates a socket connection back to our modified web-servers. From the connection, it grabs the local IP address and ephemeral port and sends them back to the web-server as a standard GET request. From this GET request we record the public IP address, the local IP address, the public source port (as seen by the web-server) and the local source port. A difference between the two IP addresses is a sure indication of NAT. Differences in the ports can help detect transparent proxies and provide some insight into how the middleboxes are doing port remapping.

  • Creates a socket connection back to our modified server and issues the following GET request:
    GET / HTTP/1.1
    host: www.google.com
    
    If a proxy is interposed between the client and the web-server, it will dutifully respond with Google's index page. If our web-server receives the request instead, the test fails, i.e., no proxy was detected on the path.
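
The comparison made possible by the first applet test can be sketched as follows. The record type `AppletReport` and the function `classify` are hypothetical names for this example, and the private-address check is an extra signal added here for illustration rather than something the applet description specifies.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class AppletReport:
    local_ip: str     # address the applet read from its own socket
    local_port: int   # ephemeral port the applet read from its own socket
    public_ip: str    # source address seen by the web-server
    public_port: int  # source port seen by the web-server


def classify(report: AppletReport) -> dict:
    """Derive the NAT and port-remapping signals described above."""
    return {
        # Differing addresses mean something rewrote the source IP.
        "nat": report.local_ip != report.public_ip,
        # Differing ports hint at a transparent proxy or a remapping middlebox.
        "port_remapped": report.local_port != report.public_port,
        # Extra signal (assumption): is the local address in a private range?
        "local_is_private": ipaddress.ip_address(report.local_ip).is_private,
    }
```

A client at `10.0.0.5` that appears to the server as a different address would be flagged as NATed, whether or not its source port was preserved.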

Compare SYN-fingerprint with HTTP User-Agent: Our modified web-server captures the SYN packet of all incoming requests and uses it to generate a SYN fingerprint. The SYN fingerprint is then used to try to determine the sending host-type. For common operating systems such as Windows, MacOS X, BSD and Linux, SYN fingerprinting is relatively accurate. If the OS guess from the SYN fingerprint differs from the User-Agent field in the client's HTTP headers, or if the client's User-Agent is a common operating system type and the OS type cannot be gleaned from the SYN fingerprint, it is likely that the request went through a proxy (or the true client is within a virtual machine, which is often the case).
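
The decision rule in this paragraph can be written down compactly. The sketch below assumes the SYN fingerprinter has already produced an OS name (or `None` when it could not tell); the substring table `UA_OS_MARKERS` and both function names are hypothetical, and treating an unrecognized User-Agent as "no signal" is a simplification.

```python
from typing import Optional

# Hypothetical substring table mapping User-Agent fragments to the OS
# families named above; checked in order.
UA_OS_MARKERS = (
    ("Windows", "Windows"),
    ("Mac OS X", "MacOS X"),
    ("Linux", "Linux"),
    ("BSD", "BSD"),
)


def os_from_user_agent(user_agent: str) -> Optional[str]:
    """Return the common OS family claimed by a User-Agent, if any."""
    for marker, name in UA_OS_MARKERS:
        if marker in user_agent:
            return name
    return None


def likely_proxied(syn_os_guess: Optional[str], user_agent: str) -> bool:
    """Apply the rule: mismatch, or common UA with no SYN guess, is suspicious."""
    ua_os = os_from_user_agent(user_agent)
    if ua_os is None:
        return False  # simplification: uncommon client OS gives no signal
    # True when the guesses differ, including when the SYN guess is missing.
    return syn_os_guess != ua_os
```

As the paragraph notes, a positive result may also mean the client runs inside a virtual machine, so this signal is best combined with the others.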

Check Client Headers: Some proxies add headers to clients' requests which not only reveal their existence, but sometimes reveal the IP address of the client whose request is being forwarded. We use the following headers to indicate the presence of a proxy:

  • Via
  • X-Bluecoat-Via
  • X-Forwarded-For
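
A header check of this kind might look like the sketch below. The function names are hypothetical, and reading the leftmost entry of X-Forwarded-For as the original client follows the header's common usage; the value is supplied by the sender and can be forged, so it is a hint rather than proof.

```python
from typing import Dict, Optional

# Header names from the list above, lowercased because HTTP header names
# are case-insensitive.
PROXY_HEADERS = ("via", "x-bluecoat-via", "x-forwarded-for")


def proxy_evidence(headers: Dict[str, str]) -> Dict[str, str]:
    """Return the proxy-revealing headers present in a request."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered[name] for name in PROXY_HEADERS if name in lowered}


def forwarded_client_ip(headers: Dict[str, str]) -> Optional[str]:
    """Return the first address listed in X-Forwarded-For, if present."""
    xff = proxy_evidence(headers).get("x-forwarded-for")
    if not xff:
        return None
    # The header is a comma-separated list; the leftmost entry is
    # conventionally the original client.
    return xff.split(",")[0].strip()
```

A request with no such headers yields an empty dictionary, which says nothing either way: transparent proxies need not announce themselves.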

DNS Server Usage: We've installed a modified DNS server (described above) to correlate client requests with their DNS requests. Many clients will use 2 or 3 DNS servers, yet few will use more. If a particular public IP address is using a large number of DNS servers (say 5 or more) over multiple requests, it is likely that the public IP corresponds to a proxy, not to an end-user.
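
The resolver-counting heuristic can be sketched as follows. It assumes the DNS server sees query names of the form `$rand-$pubip.img.cdn.coralcdn.org` with `$pubip` written in dotted form, which is a guess about the encoding; the class `ResolverLog` and the helper `client_ip_from_name` are hypothetical names.

```python
import ipaddress
from collections import defaultdict
from typing import Optional, Set


def client_ip_from_name(qname: str) -> Optional[str]:
    """Extract $pubip from a synthesized query name, or None if absent."""
    _rand, sep, rest = qname.partition("-")
    if not sep:
        return None
    # Assumption: the four labels after the dash are the dotted client IP.
    candidate = ".".join(rest.split(".")[:4])
    try:
        return str(ipaddress.IPv4Address(candidate))
    except ValueError:
        return None


class ResolverLog:
    """Track which DNS resolvers have queried on behalf of each public IP."""

    def __init__(self) -> None:
        self._resolvers = defaultdict(set)  # client public IP -> resolver IPs

    def record(self, qname: str, resolver_ip: str) -> None:
        """Log one query seen by the modified DNS server."""
        client_ip = client_ip_from_name(qname)
        if client_ip is not None:
            self._resolvers[client_ip].add(resolver_ip)

    def likely_proxies(self, threshold: int = 5) -> Set[str]:
        """Public IPs seen with at least `threshold` distinct resolvers."""
        return {ip for ip, seen in self._resolvers.items() if len(seen) >= threshold}
```

The default threshold of 5 follows the "say 5 or more" figure above and would presumably be tuned against observed data.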