Successful societies and institutions recognize the need to record their history - this provides a way to review the past, find explanations for current behavior, and spot emerging trends. In 1996 Brewster Kahle realized the cultural significance of the Internet and the need to record its history. As a result he founded the Internet Archive which collects and permanently stores the Web's digitized content.
In addition to the content of web pages, it's important to record how this digitized content is constructed and served. The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized. This performance information allows us to see trends in how the Web is built and provides a common data set from which to conduct web performance research.
About this fork
This is a port of the original codebase to Python, using the Pyramid framework for the pages, SQLAlchemy for managing database queries, and GViz-Data-Table for visualisation. The data differs minimally from httparchive.org: only unique, fully qualified domains are included, and folder-based URLs are excluded. Furthermore, websites are only included once they have been crawled more than once. The aim of the port was to make working with the dataset easier and to focus on query optimisation.
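The inclusion rule described above can be sketched as a small filter. This is an illustration only: the function and argument names are hypothetical, not taken from the actual codebase.

```python
from urllib.parse import urlparse

def is_included(url: str, crawl_count: int) -> bool:
    """Sketch of the fork's inclusion rule: keep only fully
    qualified domains (no folder-based URLs) that have been
    crawled more than once."""
    parts = urlparse(url)
    # Exclude folder-based URLs such as http://example.com/blog/
    if parts.path not in ("", "/"):
        return False
    # Require more than one crawl so trends can be computed
    return crawl_count > 1

print(is_included("http://www.example.com/", 2))   # True
print(is_included("http://example.com/blog/", 5))  # False
```

Deduplication of domains across the source list would happen in a separate step, e.g. by keying on the hostname.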
- Q: How is the list of URLs generated?
- Q: How is the data gathered?
- Q: How accurate is the data, in particular the time measurements?
- Q: What are the limitations of this testing methodology (using lists)?
- Q: What's a "HAR file"?
- Q: How is the HTTP waterfall chart generated?
- Q: When looking at Trends what does it mean to choose the "intersection" URLs?
- Q: What are the definitions for the table columns for a website's requests?
- Q: How do I add a website to the HTTP Archive?
- Q: How do I get my website removed from the HTTP Archive?
- Q: How do I report inappropriate (adult only) content?
- Q: Who created the HTTP Archive?
- Q: Who sponsors the HTTP Archive?
- Q: How do I make a donation to support the HTTP Archive?
- Q: Who do I contact for more information?
How is the list of URLs generated?
Starting in November 2011, the list of URLs is based solely on the Alexa Top 1,000,000 Sites (zip). Only unique, fully qualified domains with more than one crawl are included.
From November 2010 through October 2011 there were 18,026 URLs analyzed. This list was based on the union of the following lists:
- Alexa 500 (source)
- Alexa US 500 (source)
- Alexa 10,000 (source, zip)
- Fortune 500 (source)
- Global 500 (source)
- Quantcast10K (source)
How is the data gathered?
The list of URLs is fed to WebPagetest.org. (Huge thanks to Pat Meenan!)
The WebPagetest settings are:
- Internet Explorer 8
- Dulles, VA
- empty cache
Each URL is loaded 3 times. The data from the median run (based on load time) is collected via a HAR file. The HTTP Archive collects these HAR files, parses them, and populates our database with the relevant information.
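Picking the median of the three runs can be sketched as below; the `load_time_ms` field name is an assumption for illustration, not the actual export format.

```python
def median_run(runs):
    """Pick the median of an odd number of test runs by load time.

    Each run is a dict with a 'load_time_ms' key (hypothetical name).
    """
    ordered = sorted(runs, key=lambda r: r["load_time_ms"])
    return ordered[len(ordered) // 2]

runs = [
    {"run": 1, "load_time_ms": 3100},
    {"run": 2, "load_time_ms": 2800},
    {"run": 3, "load_time_ms": 3500},
]
print(median_run(runs)["run"])  # 1 (3100 ms is the median load time)
```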
For the HTTP Archive Mobile the data is gathered using Blaze.io's mobile web performance tool, Mobitest, running on an iPhone with iOS 4.3. Please see their methodology page for more information.
How accurate is the data, in particular the time measurements?
The "static" measurements (# of bytes, HTTP headers, etc. - everything but time) are accurate at the time the test was performed. It's entirely possible that the web page has changed since it was tested. The tests were performed using Internet Explorer 8. If the page's content varies by browser this could be a source of differences.
The time measurements are gathered in a test environment, and thus have all the potential biases that come with that:
- browser - All tests are performed using Internet Explorer 8. Page load times can vary depending on browser.
- location - The HAR files are generated from WebPagetest.org's location in Redwood City, CA. The distance to the site's servers can affect time measurements.
- sample size - Each URL is loaded three times. The HAR file is generated from the median test run. This is not a large sample size.
- Internet connection - The connection speed, latency, and packet loss from the test location is another variable that affects time measurements.
Given these conditions it's virtually impossible to compare WebPagetest.org's time measurements with those gathered in other browsers, locations, or connection speeds. They are best used for comparisons within the HTTP Archive data itself, where the test conditions are consistent.
What are the limitations of this testing methodology (using lists)?
The HTTP Archive examines each URL in the list, but does not crawl the website's other pages. Although these lists of websites (Fortune 500 and Alexa Top 500, for example) are well known, an entire website doesn't necessarily map well to a single URL.
- Most websites are comprised of many separate web pages. The landing page may not be representative of the overall site.
- Some websites, such as http://www.facebook.com/, require logging in to see typical content.
- Some websites, such as http://www.googleusercontent.com/, don't have a landing page. Instead, they are used for hosting other URLs and resources. In this case http://www.googleusercontent.com/ is the domain used for resources inserted by users into Google documents, etc.
Because of these issues and more, it's possible that the actual HTML document analyzed is not representative of the website.
What's a "HAR file"?
HAR files are based on the HTTP Archive specification. They capture web page loading information in a JSON format. See the list of tools that support the HAR format.
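Because a HAR file is plain JSON, its request list can be read with the standard library alone. The fields used below (`log.entries`, `request.url`, `response.bodySize`) are defined by the HAR specification.

```python
import json

def summarize_har(path):
    """Return (url, response body size in bytes) for each request
    recorded in a HAR file."""
    with open(path, encoding="utf-8") as f:
        har = json.load(f)
    return [(e["request"]["url"], e["response"]["bodySize"])
            for e in har["log"]["entries"]]
```

This is the kind of parsing step the HTTP Archive performs before populating its database, reduced to a minimal sketch.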
How is the HTTP waterfall chart generated?
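The answer for this question is missing here; presumably the chart is drawn from the per-request timings recorded in the HAR file. A minimal sketch of computing each bar's start offset and duration, using the HAR-specified `startedDateTime` and `time` fields:

```python
from datetime import datetime

def waterfall_bars(har):
    """Compute (offset_ms, duration_ms) for each request, relative
    to the earliest request start, as a waterfall chart needs."""
    entries = har["log"]["entries"]

    def start(e):
        # HAR timestamps are ISO 8601; normalize a trailing "Z"
        return datetime.fromisoformat(
            e["startedDateTime"].replace("Z", "+00:00"))

    t0 = min(start(e) for e in entries)
    return [((start(e) - t0).total_seconds() * 1000.0, e["time"])
            for e in entries]

har = {"log": {"entries": [
    {"startedDateTime": "2011-01-01T12:00:00.000+00:00", "time": 120.0},
    {"startedDateTime": "2011-01-01T12:00:00.250+00:00", "time": 300.0},
]}}
print(waterfall_bars(har))  # [(0.0, 120.0), (250.0, 300.0)]
```

A real waterfall chart would additionally break each bar into phases (DNS, connect, send, wait, receive) from the entry's `timings` object.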
When looking at Trends what does it mean to choose the "intersection" URLs?
The number and exact list of URLs changes from run to run. Comparing trends for "All" the URLs from run to run is a bit like comparing apples and oranges. For more of an apples to apples comparison you can choose the "intersection" URLs. This is the maximum set of URLs that were measured in every run.
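The "intersection" set can be computed directly with set operations; the run data below is illustrative only.

```python
def intersection_urls(runs):
    """Return the URLs present in every run.

    Each run is an iterable of the URLs measured in that crawl.
    """
    sets = [set(r) for r in runs]
    return set.intersection(*sets) if sets else set()

run1 = {"http://a.com/", "http://b.com/", "http://c.com/"}
run2 = {"http://b.com/", "http://c.com/", "http://d.com/"}
print(sorted(intersection_urls([run1, run2])))
# ['http://b.com/', 'http://c.com/']
```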
What are the definitions for the table columns for a website's requests?
The View Site page contains a table with information about each HTTP request in an individual page, for example http://www.w3.org/. The less obvious columns are defined here:
- Req# - The sequence number for each HTTP request - 1 = first, 2 = second, etc.
- URL - The URL of the HTTP request. These are often truncated in the display. Hold your mouse over the link to see the full URL in the browser's status bar.
- MIME Type - The request's MIME type.
- Method - The HTTP request method.
- Status - The HTTP response status code.
- Time - The number of milliseconds it took to complete the request.
- Response Size - The size of the response transferred over the wire. If the response was compressed the actual size of the response content is larger.
- Request Cookie Len - The size of the Cookie: request header.
- Response Cookie Len - The size of the Set-Cookie: response header.
- Request HTTP Ver - The HTTP version number sent in the request.
- Response HTTP Ver - The HTTP version number received in the response.
- other HTTP request headers - Any remaining HTTP request headers and their values.
- other HTTP response headers - Any remaining HTTP response headers and their values.
Definitions for each of the HTTP headers can be found in the HTTP/1.1: Header Field Definitions.
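The Response Size distinction above (bytes over the wire vs. uncompressed content) maps directly onto two fields of a HAR entry, `response.bodySize` and `response.content.size`:

```python
def size_info(entry):
    """Given one HAR entry, return (wire size, uncompressed content
    size); for compressed responses the latter is larger."""
    wire = entry["response"]["bodySize"]
    content = entry["response"]["content"]["size"]
    return wire, content

# Illustrative entry: a gzipped response that expands 3x
entry = {"response": {"bodySize": 4096, "content": {"size": 12288}}}
print(size_info(entry))  # (4096, 12288)
```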
How do I add a website to the HTTP Archive?
You can add a website to the HTTP Archive via the Add a Site page.
How do I get my website removed from the HTTP Archive?
You can have your site removed from the HTTP Archive via the Remove Your Site page.
How do I report inappropriate (adult only) content?
Please report any inappropriate content by creating a new issue. You may come across inappropriate content when viewing a website's filmstrip screenshots. You can help us flag these websites. Screenshots are not shown for websites flagged as adult only.
Who created the HTTP Archive?
Steve Souders created the HTTP Archive. It's built on the shoulders of Pat Meenan's WebPagetest system. Several folks on Google's Make the Web Faster team chipped in. I've received patches from several individuals including Jonathan Klein, Yusuke Tsutsumi, Carson McDonald, James Byers, Ido Green, Charlie Clark, and Mike Pfirrmann. Guy Leech helped early on with the design. More recently, Stephen Hay created the new logo.
The HTTP Archive Mobile test framework is provided by Blaze.io with much help from Guy (Guypo) Podjarny.
Who sponsors the HTTP Archive?
The HTTP Archive is possible through the support of these sponsors: Google, Mozilla, New Relic, O’Reilly Media, Etsy, Strangeloop, dynaTrace Software, and Torbit.
How do I make a donation to support the HTTP Archive?
The HTTP Archive is part of the Internet Archive, a 501(c)(3) non-profit. Donations in support of the HTTP Archive can be made through the Internet Archive's donation page. Make sure to designate your donation is for the "HTTP Archive".
Who do I contact for more information?
Please go to the HTTP Archive discussion list and submit a post.