CRAWLER BEHAVIOR RESEARCH - README
===================================

Project: Site 4726837462198733423
Date: February 2026
Status: Active data collection

OVERVIEW
--------

This site is a controlled research environment for studying how modern
web crawlers -- particularly AI-powered bots from companies like OpenAI,
Anthropic, Amazon, Google, and Microsoft -- interact with different
types of web content.

The site hosts a variety of file types including HTML documents,
stylesheets, JavaScript files, structured data (JSON, XML), images in
multiple formats (PNG, JPEG, GIF, SVG, WebP), and other file types.
Each file contains substantial, non-trivial content designed to provide
meaningful observations about crawler behavior.

METHODOLOGY
-----------

All files are served from AWS S3 with static website hosting enabled.
S3 server access logging captures every HTTP request with the following
fields:

  - Timestamp (date and time of the request)
  - Remote IP address
  - HTTP method and URI
  - Response status code
  - Bytes transferred
  - Referrer (HTTP Referer header)
  - User-Agent string

This data allows us to identify specific crawlers, track their
navigation patterns, measure how they handle different content types,
and observe whether they respect directives in robots.txt.

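The fields listed under METHODOLOGY can be pulled out of an S3 server
access log record with a short script. The following is a minimal
sketch, not the project's actual tooling. It assumes the standard
space-delimited S3 access log layout and reads only the leading fields
through the User-Agent. The names parse_line, classify_agent and
CRAWLER_TOKENS are hypothetical, the token list is illustrative rather
than exhaustive, and SAMPLE is a synthetic record, not collected data.

```python
import re
from datetime import datetime

# Leading fields of one S3 server access log record, through the User-Agent.
# Trailing fields (version ID, host ID, TLS details, ...) are ignored.
# Assumption: quoted fields contain no embedded double quotes.
S3_LOG_PATTERN = re.compile(
    r'(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<request>[^"]*)" (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) '
    r'(?P<size>\S+) (?P<total_time>\S+) (?P<turnaround>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative User-Agent substrings mapped to operators (not exhaustive).
CRAWLER_TOKENS = {
    "GPTBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "Amazonbot": "Amazon",
    "Googlebot": "Google",
    "bingbot": "Microsoft",
}


def parse_line(line):
    """Return the README's seven fields as a dict, or None if no match."""
    m = S3_LOG_PATTERN.match(line)
    if m is None:
        return None
    # The request field looks like "GET /index.html HTTP/1.1".
    method, _, rest = m["request"].partition(" ")
    uri = rest.rsplit(" ", 1)[0] if rest else ""
    return {
        "timestamp": datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "ip": m["ip"],
        "method": method,
        "uri": uri,
        # S3 writes "-" for fields that have no value.
        "status": int(m["status"]) if m["status"].isdigit() else None,
        "bytes": int(m["bytes"]) if m["bytes"].isdigit() else 0,
        "referer": m["referer"],
        "user_agent": m["agent"],
    }


def classify_agent(user_agent):
    """Return the operator whose token appears in the User-Agent, else None."""
    ua = user_agent.lower()
    for token, operator in CRAWLER_TOKENS.items():
        if token.lower() in ua:
            return operator
    return None


# Synthetic record for illustration only.
SAMPLE = (
    'exampleowner example-bucket [06/Feb/2026:00:00:38 +0000] 192.0.2.10 - '
    '3E57427F3EXAMPLE WEBSITE.GET.OBJECT index.html '
    '"GET /index.html HTTP/1.1" 200 - 2048 2048 12 11 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" -'
)

if __name__ == "__main__":
    record = parse_line(SAMPLE)
    print(record["ip"], record["uri"], classify_agent(record["user_agent"]))
```

Substring matching on the User-Agent is only a first pass: the header
is self-reported and can be spoofed, so any attribution made this way
would need to be confirmed by other means.
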
FILE INVENTORY
--------------

Text formats:
  - index.html     : Main page with links to all other files
  - about.html     : Detailed research description
  - error.html     : Custom 404 error page
  - style.css      : Complete CSS stylesheet
  - app.js         : Client-side JavaScript
  - data.json      : Structured research dataset
  - sitemap.xml    : Standard XML sitemap
  - robots.txt     : Crawler directives
  - readme.txt     : This file
  - feed.xml       : RSS 2.0 feed
  - humans.txt     : Team credits
  - manifest.json  : Web app manifest
  - server.log     : Simulated log file (large)

Image formats:
  - logo.png       : 256x256 PNG with geometric pattern
  - banner.jpg     : 800x200 JPEG gradient
  - icon.gif       : 32x32 animated GIF
  - diagram.svg    : SVG with shapes and paths
  - photo.webp     : 400x300 WebP image

Other:
  - config         : No file extension (binary/text content)
  - archive.zip    : ZIP archive containing multiple files

CONTACT
-------

This is a research project. Data collected is limited to publicly
available HTTP request metadata (IP addresses, user-agent strings,
referrers) and is used solely for academic analysis of crawler behavior
patterns.