Installation

Install the package via pip:

pip install mcp-server-webcrawl
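
To verify the install and locate the absolute path of the mcp-server-webcrawl executable (used later in the "command" field of the MCP configuration), the standard pip and shell tools are sufficient; the exact path depends on your Python environment:

# show metadata for the installed package
$ pip show mcp-server-webcrawl
# print the absolute path of the executable (use `where` on Windows)
$ which mcp-server-webcrawl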

Requirements

To use mcp-server-webcrawl effectively, you need:

  • An MCP-capable LLM client such as Claude Desktop

  • Python installed and available from your command line

  • Basic familiarity with installing and running Python packages

After ensuring these prerequisites are met, run the pip install command above to add the package to your environment.
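
A quick way to confirm the Python prerequisites from your terminal (on some systems the commands are python3 and pip3):

$ python --version
$ pip --version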

MCP Configuration

To enable your LLM client to access your web crawl data, you’ll need to add an MCP server configuration. From Claude’s developer settings, locate the MCP configuration section and add the appropriate configuration for your crawler type.
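
In Claude Desktop, this configuration lives in claude_desktop_config.json; typical locations, which you should verify against your own installation, are:

# macOS
~/Library/Application Support/Claude/claude_desktop_config.json
# Windows
%APPDATA%\Claude\claude_desktop_config.json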

Below are configurations for each supported crawler type. Choose the one that matches your crawler and modify the --datasrc path to point to your specific data location.

wget

{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
       "args": ["--crawler", "wget", "--datasrc",
         "/path/to/wget/archives/"]
    }
  }
}

Tested wget commands:

# (macOS Terminal/Windows WSL)
# --adjust-extension for file extensions, e.g. *.html
$ wget --mirror https://example.com
$ wget --mirror https://example.com --adjust-extension
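
wget --mirror writes one directory per host into the current working directory, so run it from inside the folder referenced by --datasrc. A sketch of the expected layout, with illustrative paths:

# run from the archives directory referenced by --datasrc
$ cd /path/to/wget/archives/
$ wget --mirror --adjust-extension https://example.com
# result: /path/to/wget/archives/example.com/ containing the mirrored site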

WARC

{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
       "args": ["--crawler", "warc", "--datasrc",
         "/path/to/warc/archives/"]
    }
  }
}

Tested WARC commands:

# (macOS Terminal/Windows WSL)
$ wget --warc-file=example --recursive https://example.com
$ wget --warc-file=example --recursive --page-requisites https://example.com
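
With default settings, --warc-file=example typically produces a gzip-compressed example.warc.gz in the working directory, alongside the normally downloaded files. Keep the .warc/.warc.gz output in the folder referenced by --datasrc; a sketch with illustrative paths:

# write the WARC output directly into the --datasrc folder
$ cd /path/to/warc/archives/
$ wget --warc-file=example --recursive --page-requisites https://example.com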

InterroBot

{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
       "args": ["--crawler", "interrobot", "--datasrc",
         "[homedir]/Documents/InterroBot/interrobot.v2.db"]
    }
  }
}

Notes for InterroBot:

  • Crawls must be executed in InterroBot (windowed application)

  • On Windows: replace [homedir] with /Users/… (see the illustrative path below)

  • On macOS: path is provided on InterroBot settings page
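
As an illustration only, with a hypothetical username, a resolved --datasrc value following the Windows note above might look like:

# hypothetical resolved path; substitute your own home directory
/Users/yourusername/Documents/InterroBot/interrobot.v2.db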

Katana

{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
       "args": ["--crawler", "katana", "--datasrc",
         "/path/to/katana/crawls/"]
    }
  }
}

Tested Katana commands:

# (macOS Terminal/PowerShell/WSL)
# -store-response to save crawl contents
# -store-response-dir sets the output directory; consistent with
#   default Katana behavior, assets are spread across per-origin-host
#   directories within it

$ katana -u https://example.com -store-response -store-response-dir crawls/example.com/
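
To tie this to the configuration above: if --datasrc points at /path/to/katana/crawls/, give each site a -store-response-dir one level beneath it. A sketch with illustrative paths, equivalent to the tested command:

# store responses under the --datasrc folder, one subdirectory per site
$ cd /path/to/katana/crawls/
$ katana -u https://example.com -store-response -store-response-dir example.com/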

SiteOne

{
  "mcpServers": {
    "webcrawl": {
      "command": "/path/to/mcp-server-webcrawl",
       "args": ["--crawler", "siteone", "--datasrc",
         "/path/to/siteone/archives/"]
    }
  }
}

Notes for SiteOne:

  • Crawls must be executed in SiteOne (windowed application)

  • The "Generate offline website" option must be checked

Multiple Configurations

You can set up multiple mcp-server-webcrawl connections under the mcpServers section if you want to access different crawler data sources simultaneously.

{
  "mcpServers": {
    "webcrawl_warc": {
      "command": "/path/to/mcp-server-webcrawl",
       "args": ["--crawler", "warc", "--datasrc", "/path/to/warc/archives/"]
    },
    "webcrawl_wget": {
      "command": "/path/to/mcp-server-webcrawl",
       "args": ["--crawler", "wget", "--datasrc", "/path/to/wget/archives/"]
    }
  }
}

After adding the configuration, save the file and restart your LLM client to apply the changes.