Welcome to the Research Site

This site hosts various file types to study how web crawlers and AI bots interact with different content formats, sizes, and structures. Each file is designed to test specific aspects of crawler behavior.

Research Overview

Modern web crawlers from search engines and AI companies traverse the web at massive scale. Understanding how they handle different file types, follow links, respect robots.txt directives, and process various content formats is critical for web developers and researchers alike.

This site provides a controlled environment with known file types, sizes, and link structures. By analyzing server access logs, we can observe crawler behavior patterns, including which file types crawlers request, how they handle binary content, whether they parse and follow links in HTML and XML files, and how they identify themselves via User-Agent strings.
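As a rough illustration of this kind of analysis, the sketch below tallies requests per crawler and file extension. It assumes each log record has already been reduced to a dict with hypothetical `user_agent` and `key` entries. The token list is only a small illustrative sample of crawler names, not the set this project tracks.

```python
from collections import Counter
from pathlib import PurePosixPath

# Illustrative substrings to look for in User-Agent headers (an assumption,
# not a complete or authoritative list); extend as needed.
CRAWLER_TOKENS = ("Googlebot", "bingbot", "GPTBot", "ClaudeBot", "CCBot")


def classify_user_agent(user_agent: str) -> str:
    """Return the first known crawler token found in the header, else 'other'."""
    lowered = user_agent.lower()
    for token in CRAWLER_TOKENS:
        if token.lower() in lowered:
            return token
    return "other"


def tally_requests(records):
    """Count requests per (crawler, file extension) pair.

    Each record is assumed to be a dict with 'user_agent' and 'key' entries.
    Keys without an extension are counted under '(none)'.
    """
    counts = Counter()
    for record in records:
        crawler = classify_user_agent(record["user_agent"])
        extension = PurePosixPath(record["key"]).suffix.lower() or "(none)"
        counts[(crawler, extension)] += 1
    return counts
```

Matching on substrings only shows what a client claims to be. User-Agent strings are easy to spoof, so counts like these are best treated as self-reported identity.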

Available Files

Below is the complete list of files hosted on this site. Each file has been crafted with substantial content to provide meaningful data for analysis.

Methodology

Files are served from AWS S3 with static website hosting enabled. S3 server access logging records requests with the timestamp, remote IP, requested key, HTTP status, referrer, and user agent. S3 delivers these logs on a best-effort basis, so an occasional request may be missing. The logs are periodically downloaded and analyzed to identify crawler behavior patterns.
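A minimal sketch of how one such log record might be split into named fields follows. It assumes the space-delimited S3 server access log layout, in which the timestamp is wrapped in square brackets and the request line, referrer, and user agent are wrapped in double quotes. Only the first 17 fields are named. The function name is hypothetical, and quoted values that themselves contain a double quote are not handled.

```python
import re

# Names for the first 17 fields of an S3 server access log record, in the
# order AWS documents them. Any later fields are ignored in this sketch.
FIELDS = (
    "bucket_owner", "bucket", "time", "remote_ip", "requester", "request_id",
    "operation", "key", "request_uri", "http_status", "error_code",
    "bytes_sent", "object_size", "total_time", "turn_around_time",
    "referer", "user_agent",
)

# A token is a [bracketed] timestamp, a "quoted" string, or a run of non-spaces.
TOKEN = re.compile(r'\[([^\]]*)\]|"([^"]*)"|(\S+)')


def parse_access_log_line(line: str) -> dict:
    """Split one log record into named fields; return {} if it is too short."""
    # Exactly one group matches per token, so lastindex points at its text.
    tokens = [m.group(m.lastindex) for m in TOKEN.finditer(line)]
    if len(tokens) < len(FIELDS):
        return {}
    return dict(zip(FIELDS, tokens))
```

All values are kept as strings. A later step could convert `http_status` to an integer or parse `time` into a datetime, depending on what the analysis needs.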

The file set includes text formats (HTML, CSS, JavaScript, JSON, XML, plain text), image formats (PNG, JPEG, GIF, SVG, WebP), and binary formats (files without extensions, log files, compressed archives). This diversity allows us to compare how crawlers treat different MIME types and file extensions.
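To make the comparison concrete, the sketch below groups object keys by the top-level media type that Python's standard `mimetypes` module guesses from the file extension. The function name and the "unknown" bucket are our own illustrative choices. The guessed type may also differ from the Content-Type header that S3 actually serves for an object.

```python
import mimetypes
from collections import defaultdict


def group_by_media_type(keys):
    """Group object keys by the top-level media type guessed from the extension."""
    groups = defaultdict(list)
    for key in keys:
        # guess_type returns (type, encoding); type is None when unrecognized.
        mime_type, _encoding = mimetypes.guess_type(key)
        top_level = mime_type.split("/")[0] if mime_type else "unknown"
        groups[top_level].append(key)
    return dict(groups)
```

Keys without an extension land in the "unknown" group, which makes them easy to compare against files whose type is declared by their extension.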

Architecture

[Image: Architecture diagram]
[Image: Research photo]

Links

For more information about this research, see the about page or subscribe to the RSS feed for updates. The humans.txt file credits the team behind this project.