ArchiveBox MCP Setup Guide
Instructions for setting up mcp-server-webcrawl with ArchiveBox. This allows your LLM (e.g. Claude Desktop) to search content and metadata from websites you’ve archived using ArchiveBox.
Follow along with the video, or the step-action guide below.
Requirements
Before you begin, ensure you have:
Claude Desktop installed
Python 3.10 or later installed
ArchiveBox installed
Basic familiarity with command line interfaces
What is ArchiveBox?
ArchiveBox is a powerful open-source web archiving solution that offers:
Multiple output formats (HTML, PDF, screenshots, WARC, etc.)
Comprehensive metadata
A CLI and a web admin interface for browsing and managing archives
Support for various input sources (URLs, browser bookmarks, RSS feeds)
Self-hosted solution for long-term web content preservation
Installation Steps
1. Install mcp-server-webcrawl
Open your terminal or command line and install the package:
pip install mcp-server-webcrawl
Verify installation was successful:
mcp-server-webcrawl --help
2. Install and Set Up ArchiveBox
macOS/Linux only. Windows may work under Docker but is untested.
Install ArchiveBox (macOS/Linux):
pip install archivebox
On macOS only, install wget with Homebrew (install Homebrew first if you don't already have it):
brew install wget
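Once both packages are installed, a quick check can confirm that the tools are reachable before you continue. The following is a minimal sketch, not part of either project: check_requirements is a hypothetical helper, and it only tests the Python version and whether the two executables are found on PATH.

```python
import shutil
import sys


def check_requirements() -> dict:
    """Report whether the prerequisites from this guide appear to be met."""
    return {
        # The guide asks for Python 3.10 or later.
        "python>=3.10": sys.version_info >= (3, 10),
        # shutil.which returns None when the executable is not on PATH.
        "archivebox": shutil.which("archivebox") is not None,
        "mcp-server-webcrawl": shutil.which("mcp-server-webcrawl") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'not found'}")
```

A "not found" result usually means the package was installed into a different Python environment than the one on your PATH.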
Create ArchiveBox collections. Unlike other crawlers that focus on single websites, ArchiveBox uses a collection-based approach where each collection can contain multiple URLs. You can create separate collections for different projects or group related URLs together:
# Create a directory structure for your collections
mkdir ~/archivebox-data

# Create an "example" collection
mkdir ~/archivebox-data/example
cd ~/archivebox-data/example
archivebox init
archivebox add https://example.com

# Create a "pragmar" collection
mkdir ~/archivebox-data/pragmar
cd ~/archivebox-data/pragmar
archivebox init
archivebox add https://pragmar.com
Each archivebox init creates a complete ArchiveBox instance with its own database and archive directory structure. The typical structure includes:
collection-name/
├── archive/          # Archived content organized by timestamp
├── logs/             # ArchiveBox operation logs
├── sources/          # Source URL lists and metadata
└── index.sqlite3     # Database containing all metadata
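The typical structure can be checked with a short script before you point the MCP server at it. The following is a minimal sketch, not part of ArchiveBox or mcp-server-webcrawl: find_collections is a hypothetical helper, and the rule that a collection needs both index.sqlite3 and archive/ is an assumption based on the typical layout described here.

```python
from pathlib import Path


def find_collections(datasrc: Path) -> dict:
    """Map each subdirectory of datasrc to whether it looks like a collection.

    Assumption: a directory counts as a collection when it contains both an
    index.sqlite3 file and an archive/ directory.
    """
    results = {}
    for child in sorted(datasrc.iterdir()):
        if not child.is_dir():
            continue  # ignore stray files next to the collections
        has_index = (child / "index.sqlite3").is_file()
        has_archive = (child / "archive").is_dir()
        results[child.name] = has_index and has_archive
    return results


if __name__ == "__main__":
    root = Path.home() / "archivebox-data"
    if root.is_dir():
        for name, ok in find_collections(root).items():
            status = "looks complete" if ok else "missing index.sqlite3 or archive/"
            print(f"{name}: {status}")
    else:
        print(f"{root} does not exist yet")
```

A directory reported as incomplete most likely has not had archivebox init run inside it.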
3. Configure Claude Desktop
Open Claude Desktop
Go to File → Settings → Developer → Edit Config
Add the following configuration (modify paths as needed):
{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
      "args": ["--crawler", "archivebox", "--datasrc", "/path/to/archivebox-data/"]
    }
  }
}
Note
On Windows, use "mcp-server-webcrawl" as the command.
On macOS/Linux, use the absolute path (output of which mcp-server-webcrawl).
The datasrc path should point to the parent directory containing your ArchiveBox collections (e.g., ~/archivebox-data/), not to individual collection directories.
Each collection directory (example, pragmar, etc.) will appear as a separate “site” in MCP.
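If you would rather generate the configuration than edit it by hand, the rules in the note can be sketched in a few lines of Python. This is an illustration only: build_config is a hypothetical helper, and falling back to the bare command name when the executable is not found on PATH is an assumption, not documented behavior.

```python
import json
import shutil
import sys
from pathlib import Path
from typing import Optional


def build_config(datasrc: str, command: Optional[str] = None) -> dict:
    """Return the mcpServers entry from step 3 as a Python dict."""
    if command is None:
        if sys.platform == "win32":
            # On Windows the bare command name is used.
            command = "mcp-server-webcrawl"
        else:
            # Equivalent of `which mcp-server-webcrawl`; falls back to the
            # bare name if the executable is not found on PATH (assumption).
            command = shutil.which("mcp-server-webcrawl") or "mcp-server-webcrawl"
    return {
        "mcpServers": {
            "webcrawl": {
                "command": command,
                "args": ["--crawler", "archivebox", "--datasrc", datasrc],
            }
        }
    }


if __name__ == "__main__":
    # Parent directory of the collections, not an individual collection.
    datasrc = str(Path.home() / "archivebox-data") + "/"
    print(json.dumps(build_config(datasrc), indent=2))
```

The script only prints the JSON. If your existing configuration already lists other MCP servers, merge the "webcrawl" entry into it rather than replacing the whole file.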
Save the file and completely exit Claude Desktop (not just close the window)
Restart Claude Desktop
4. Verify and Use
In Claude Desktop, you should now see MCP tools available under Search and Tools
Ask Claude to list your archived sites:
Can you list the crawled sites available?
Try searching content from your archives:
Can you find information about [topic] on [archived site]?
Use the rich metadata for content discovery:
Can you find all the archived pages related to [keyword] from [archive]?
Troubleshooting
If Claude doesn’t show MCP tools after restart, verify that your configuration file is valid JSON and follows the structure shown in step 3
Ensure Python and mcp-server-webcrawl are properly installed
Check that the datasrc path in the configuration is correct and points to the parent directory of your collections, not to an individual collection
Make sure ArchiveBox has successfully archived the websites and created the database
Verify that files exist in your archive/[timestamp] directories
Remember that the first time you use a function, Claude will ask for permission
For large archives, initial indexing may take some time during the first search
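The first troubleshooting item, a malformed configuration file, is the easiest one to check mechanically. The following is a minimal sketch under stated assumptions: check_config is a hypothetical helper, it inspects only the JSON syntax and the mcpServers structure used in this guide, and it does not know where Claude Desktop stores its configuration, so you pass in the file contents yourself.

```python
import json


def check_config(text: str) -> list:
    """Return a list of problems found in a Claude Desktop config string."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]

    if not isinstance(config, dict):
        return ["top level is not a JSON object"]
    servers = config.get("mcpServers")
    if not isinstance(servers, dict) or not servers:
        return ["no mcpServers entries found"]

    problems = []
    for name, entry in servers.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry is not a JSON object")
            continue
        if not isinstance(entry.get("command"), str):
            problems.append(f"{name}: missing command")
        args = entry.get("args", [])
        if not isinstance(args, list):
            problems.append(f"{name}: args is not a list")
            continue
        # --datasrc must be followed by the path to the collections' parent.
        if "--datasrc" in args and args.index("--datasrc") + 1 >= len(args):
            problems.append(f"{name}: --datasrc has no path after it")
    return problems
```

To use it, open the file reached through Edit Config, read its contents into a string, and print the returned list. An empty list means the sketch found nothing wrong, which does not guarantee that the server will start.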
ArchiveBox’s comprehensive archiving capabilities combined with mcp-server-webcrawl provide powerful tools for content preservation, research, and analysis across your archived web content.
For more details, including API documentation and other crawler options, visit the mcp-server-webcrawl documentation.