About This Research

Background

The proliferation of AI-powered web crawlers has changed how websites are scraped and indexed. Companies such as OpenAI (GPTBot), Anthropic (ClaudeBot), and Amazon (Amazonbot) now operate large-scale crawlers that visit millions of websites daily. These crawlers can behave differently from traditional search engine bots, and this research measures where those differences lie.

Research Questions

  1. Which file types do AI crawlers request versus ignore?
  2. Do crawlers follow links found in CSS, JavaScript, and JSON files?
  3. How do crawlers handle files without extensions?
  4. Do crawlers parse and follow links in XML sitemaps and RSS feeds?
  5. What are the download size limits for different crawler implementations?
  6. How do crawlers identify themselves (User-Agent strings)?
  7. Do crawlers respect robots.txt directives specific to their bot name?
  8. How frequently do crawlers revisit previously fetched content?
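
As an illustration of question 7, the following is a minimal sketch of how a logged request could be checked against bot-specific robots.txt rules using Python's standard urllib.robotparser module. The robots.txt content, the paths, and the helper names are hypothetical and are not taken from this site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one bot-specific group plus a catch-all group.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def violates_robots(parser, bot_name, path):
    """Return True if a request by bot_name for path is disallowed by the rules."""
    return not parser.can_fetch(bot_name, path)


parser = build_parser(ROBOTS_TXT)
for bot, path in [("GPTBot", "/private/data.json"),
                  ("GPTBot", "/index.html"),
                  ("ClaudeBot", "/private/data.json")]:
    print(bot, path, "disallowed" if violates_robots(parser, bot, path) else "allowed")
```

A request that appears in the access logs for a path this check reports as disallowed would suggest that the crawler did not honor the directive written for its bot name.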

Setup

This site is hosted as an AWS S3 static website in us-east-1. All files are publicly accessible and served with correct Content-Type headers. S3 server access logging is enabled to capture detailed request metadata including timestamps, IP addresses, HTTP methods, response codes, referrers, and user-agent strings.

File Categories

Text Files

HTML pages with semantic markup, CSS stylesheets, JavaScript with DOM manipulation code, JSON data structures, XML sitemaps, RSS feeds, and plain text files. These test whether crawlers parse different text formats and follow embedded links.

Image Files

PNG, JPEG, GIF (animated), SVG, and WebP images. These test whether crawlers download binary image content or skip it, and whether they process SVG (which can contain links and text).
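
To make the SVG case concrete, here is a small sketch of how links and text can be pulled out of an SVG document with Python's standard xml.etree.ElementTree module. The sample SVG, its URLs, and the function name are invented for illustration; whether any given crawler performs this kind of extraction is exactly what the test is meant to reveal.

```python
import xml.etree.ElementTree as ET

# Older SVG files use xlink:href; newer ones use a plain href attribute.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical test image containing two links and one text element.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
  <a href="/svg-linked-page.html"><text x="10" y="20">Linked text</text></a>
  <a xlink:href="/legacy-link.html"><rect width="10" height="10"/></a>
</svg>"""


def extract_svg_links_and_text(svg_source):
    """Return (links, texts) found in an SVG document, in document order."""
    root = ET.fromstring(svg_source)
    links, texts = [], []
    for element in root.iter():
        for attr in ("href", XLINK_HREF):
            if attr in element.attrib:
                links.append(element.attrib[attr])
        # Tags are namespace-qualified, e.g. "{http://www.w3.org/2000/svg}text".
        if element.tag.endswith("}text") and element.text:
            texts.append(element.text.strip())
    return links, texts


links, texts = extract_svg_links_and_text(SAMPLE_SVG)
print(links)
print(texts)
```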

Other Files

Files without extensions, application manifests, and structured data files. These test how crawlers handle ambiguous content types and web application metadata.
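
The ambiguity of extensionless files can be seen in a short sketch using Python's standard mimetypes module, which guesses a content type from the file name alone. The object keys, the override table, and the fallback type below are assumptions made for illustration, not the configuration used on this site.

```python
import mimetypes

# Hypothetical explicit overrides for keys whose names carry no type information.
OVERRIDES = {"no-extension": "text/plain"}

# Generic fallback chosen for this sketch when nothing else applies.
FALLBACK = "application/octet-stream"


def content_type_for(key):
    """Pick a Content-Type for an object key: override, then guess, then fallback."""
    if key in OVERRIDES:
        return OVERRIDES[key]
    guessed, _encoding = mimetypes.guess_type(key)
    return guessed or FALLBACK


for key in ("index.html", "data.json", "no-extension", "unknownfile"):
    print(key, "->", content_type_for(key))
```

Because a name without an extension gives the guesser nothing to work with, the Content-Type for such a file has to be chosen deliberately, and a crawler that relies on the URL rather than the header may treat the file differently.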

Data Collection

Access logs are collected from S3 and analyzed using Python scripts. Each log entry contains the full User-Agent string, which identifies the crawler implementation as the crawler itself reports it. When a crawler sends a referrer, that field indicates how the URL was discovered (for example, by following a link from another page); requests without a referrer may reflect direct access, sitemap parsing, or a crawler that does not send the header.
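
The following is a minimal sketch of the kind of analysis script described above, not the project's actual code. It assumes the standard space-delimited S3 server access log layout, in which the timestamp is bracketed and the request line, referrer, and User-Agent are quoted. The sample line, the User-Agent string in it, and the list of crawler name tokens are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# A field is a [bracketed] timestamp, a "quoted string", or a run of non-spaces.
TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

# Assumed name tokens to look for in the User-Agent string.
CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "Amazonbot")

# Illustrative log line, truncated after the version-id field.
SAMPLE_LINE = (
    'ownerid example-bucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 - '
    '3E57427F3EXAMPLE WEBSITE.GET.OBJECT index.html "GET /index.html HTTP/1.1" '
    '200 - 1024 1024 12 11 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)" -'
)


def parse_line(line):
    """Split one access log line into the fields used here, or None if too short."""
    fields = TOKEN.findall(line)
    if len(fields) < 17:
        return None
    return {
        "time": datetime.strptime(fields[2], "[%d/%b/%Y:%H:%M:%S %z]"),
        "remote_ip": fields[3],
        "key": fields[7],
        "request": fields[8].strip('"'),
        "status": fields[9],
        "referrer": fields[15].strip('"'),
        "user_agent": fields[16].strip('"'),
    }


def classify(user_agent):
    """Map a User-Agent string to a crawler name token, or 'other'."""
    lowered = user_agent.lower()
    for token in CRAWLER_TOKENS:
        if token.lower() in lowered:
            return token
    return "other"


def count_by_crawler(lines):
    """Count parsed requests per crawler, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[classify(entry["user_agent"])] += 1
    return counts


print(count_by_crawler([SAMPLE_LINE]))
```

Matching on a name token keeps the sketch short, but it trusts the self-reported User-Agent. A stricter analysis could also compare the remote IP against address ranges published by each crawler operator.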