mcp_server_webcrawl.crawlers.archivebox package

Submodules

mcp_server_webcrawl.crawlers.archivebox.adapter module

class ArchiveBoxManager

Bases: IndexedManager

Manages ArchiveBox in-memory SQLite databases for session-level reuse.

__init__()

Initialize the ArchiveBox manager with empty cache and statistics.

Return type: None

get_sites(datasrc, ids=None, fields=None)

List ArchiveBox instances as separate sites. Each subdirectory of datasrc that contains an “archive” folder is treated as a separate ArchiveBox instance.

Parameters:

datasrc (Path) – path to the directory containing ArchiveBox instance directories
ids (list[int] | None) – optional list of site IDs to filter by
fields (list[str] | None) – optional list of fields to include in the response

Returns:

List of SiteResult objects, one for each ArchiveBox instance

Return type:

list[SiteResult]
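
The discovery rule described for get_sites (a subdirectory of datasrc counts as an instance when it contains an "archive" folder) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the adapter's actual implementation: find_archivebox_instances is a hypothetical helper name, and it returns plain paths rather than SiteResult objects.

```python
from pathlib import Path


def find_archivebox_instances(datasrc: Path) -> list[Path]:
    """Return the subdirectories of datasrc that contain an 'archive' folder.

    Hypothetical helper for illustration only; the real adapter builds
    SiteResult objects and assigns site IDs, which this sketch omits.
    """
    return sorted(
        child
        for child in datasrc.iterdir()
        # Plain files and directories without an 'archive' folder are skipped.
        if child.is_dir() and (child / "archive").is_dir()
    )
```

Sorting the result keeps the order stable between calls, which matters if IDs are derived from position; how the library actually derives site IDs is not stated here.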

get_resources(datasrc, sites=None, query='', fields=None, sort=None, limit=20, offset=0)

Get resources from ArchiveBox instances using in-memory SQLite.

Parameters:

datasrc (Path) – path to the directory containing ArchiveBox instance directories
sites (list[int] | None) – optional list of site IDs to filter by
query (str) – search query string
fields (list[str] | None) – optional list of fields to include in response
sort (str | None) – sort order for results
limit (int) – maximum number of results to return
offset (int) – number of results to skip for pagination

Returns:

Tuple of (list of ResourceResult objects, total count, IndexState)

Return type:

tuple[list[ResourceResult], int, IndexState]
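
The way limit, offset, and the total count relate in get_resources can be illustrated with a small in-memory SQLite example. This is a sketch under assumptions, not the library's query logic: the resources table, its columns, and the function names search_resources and build_demo_db are hypothetical, matching is a simple substring test on the URL, and the IndexState element of the real return value is omitted.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with 45 sample rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE resources (id INTEGER PRIMARY KEY, url TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO resources (url) VALUES (?)",
        [(f"https://example.com/page/{i}",) for i in range(1, 46)],
    )
    conn.commit()
    return conn


def search_resources(conn, query="", limit=20, offset=0):
    """Return (rows, total): one page of matches plus the unpaginated count.

    LIKE wildcards inside query are not escaped in this sketch.
    """
    pattern = f"%{query}%"
    # The total ignores limit and offset so callers can compute page counts.
    total = conn.execute(
        "SELECT COUNT(*) FROM resources WHERE url LIKE ?", (pattern,)
    ).fetchone()[0]
    rows = conn.execute(
        "SELECT id, url FROM resources WHERE url LIKE ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (pattern, limit, offset),
    ).fetchall()
    return rows, total
```

With the 45 demo rows, requesting limit=20 and offset=40 yields the final, partial page of five rows while the total stays at 45.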

mcp_server_webcrawl.crawlers.archivebox.crawler module

class ArchiveBoxCrawler

Bases: IndexedCrawler

A crawler implementation for ArchiveBox archived sites. Provides functionality for accessing and searching web content from ArchiveBox archives. ArchiveBox creates single-URL archives with metadata stored in JSON files and HTML content preserved in index.html files.

__init__(datasrc)

Initialize the ArchiveBox crawler with a data source directory.

Parameters: datasrc (Path) – a directory containing ArchiveBox archive directories, each of which contains individual URL entries
Raises: AssertionError – if datasrc is None or not a directory
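
The precondition on datasrc can be expressed as a short validation step. The following is a sketch of that check only, assuming nothing about the rest of the constructor; validate_datasrc is a hypothetical name and not part of the package.

```python
from pathlib import Path


def validate_datasrc(datasrc) -> Path:
    """Return datasrc as a Path, or raise AssertionError if it is unusable.

    Hypothetical helper. AssertionError is raised explicitly rather than
    through an assert statement, so the check still runs under python -O.
    """
    if datasrc is None:
        raise AssertionError("datasrc is required")
    path = Path(datasrc)
    if not path.is_dir():
        raise AssertionError(f"datasrc must be a directory: {path}")
    return path
```

Checking for None before building the Path keeps the error type consistent, since Path(None) would otherwise raise a TypeError.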

mcp_server_webcrawl.crawlers.archivebox.tests module

class ArchiveBoxTests

Bases: BaseCrawlerTests

Test suite for the ArchiveBox crawler implementation. Uses wrapped test methods from BaseCrawlerTests adapted for ArchiveBox’s multi-instance structure.

Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.

setUp()[source]: Set up the test environment with fixture data.

test_archivebox_pulse()[source]: Test basic crawler initialization.

test_archivebox_sites()[source]: Test site retrieval API functionality.

test_archivebox_search()[source]: Test boolean search functionality.

test_pragmar_tokenizer()[source]: Test tokenizer search functionality.

test_archivebox_resources()[source]: Test resource retrieval API functionality with various parameters.

test_interrobot_images()[source]: Test InterroBot-specific image handling and thumbnails.

test_archivebox_sorts()[source]: Test random sort functionality using the ‘?’ sort parameter.

test_archivebox_content_parsing()[source]: Test content type detection and parsing for ArchiveBox resources.

test_archivebox_url_reconstruction()[source]: Test URL reconstruction from ArchiveBox metadata.

test_archivebox_deduplication()[source]: Test resource deduplication across timestamped entries.

test_archivebox_metadata_parsing()[source]: Test JSON metadata parsing from ArchiveBox files.

test_archivebox_timestamped_structure()[source]: Test handling of ArchiveBox’s timestamped entry structure.

test_archivebox_error_resilience()[source]: Test resilience to malformed JSON and missing files.

test_archivebox_multi_site()[source]: Test that multiple ArchiveBox working directories are treated as separate sites.

test_report()[source]: Run test report for ArchiveBox archive.
