mcp_server_webcrawl.crawlers.archivebox package
Submodules
mcp_server_webcrawl.crawlers.archivebox.adapter module
- class ArchiveBoxManager[source]
Bases:
IndexedManager
Manages ArchiveBox in-memory SQLite databases for session-level reuse.
Initialize the ArchiveBox manager with an empty cache and empty statistics.
- get_sites(datasrc, ids=None, fields=None)[source]
List ArchiveBox instances as separate sites. Each subdirectory of datasrc that contains an “archive” folder is treated as a separate ArchiveBox instance.
- Parameters:
- Returns:
List of SiteResult objects, one for each ArchiveBox instance
- Return type:
list[SiteResult]
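The discovery rule described for get_sites can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the helper name find_archivebox_instances is hypothetical, and it returns plain paths rather than SiteResult objects.

```python
from pathlib import Path


def find_archivebox_instances(datasrc: Path) -> list[Path]:
    """Return subdirectories of datasrc that contain an "archive" folder.

    Hypothetical helper: it mirrors the documented rule only, and does not
    build SiteResult objects or assign site IDs.
    """
    return sorted(
        child
        for child in datasrc.iterdir()
        # A subdirectory counts as an instance only if it holds an "archive" directory.
        if child.is_dir() and (child / "archive").is_dir()
    )
```

Sorting the result keeps the listing stable between calls, which matters if site IDs are later derived from position or name; whether the package does so is not stated here.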
- get_resources(datasrc, sites=None, query='', fields=None, sort=None, limit=20, offset=0)[source]
Get resources from ArchiveBox instances using in-memory SQLite.
- Parameters:
datasrc (Path) – path to the directory containing ArchiveBox instance directories
sites (list[int] | None) – optional list of site IDs to filter by
query (str) – search query string
fields (list[str] | None) – optional list of fields to include in response
sort (str | None) – sort order for results
limit (int) – maximum number of results to return
offset (int) – number of results to skip for pagination
- Returns:
Tuple of (list of ResourceResult objects, total count, IndexState)
- Return type:
tuple[list[ResourceResult], int, IndexState]
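The interplay of sites, query, limit, and offset with the returned total count can be sketched against an in-memory SQLite database. Everything below is an assumption made for illustration: the resources table, its columns, the LIKE-based matching, and the function names are hypothetical and do not reflect the package's actual schema or search syntax. The IndexState element of the real return value is omitted.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, site_id INTEGER, url TEXT)")
    conn.executemany(
        "INSERT INTO resources (id, site_id, url) VALUES (?, ?, ?)",
        [
            (1, 1, "https://example.com/"),
            (2, 1, "https://example.com/about"),
            (3, 1, "https://example.com/blog"),
            (4, 2, "https://example.org/"),
            (5, 1, "https://other.test/"),
        ],
    )
    return conn


def search_resources(conn, sites=None, query="", limit=20, offset=0):
    """Return (rows, total): one page of matches plus the unpaginated match count."""
    where, params = [], []
    if sites:
        # One placeholder per site ID keeps the query parameterized.
        placeholders = ", ".join("?" for _ in sites)
        where.append(f"site_id IN ({placeholders})")
        params.extend(sites)
    if query:
        # Simple substring match; a stand-in for whatever search the package uses.
        where.append("url LIKE ?")
        params.append(f"%{query}%")
    clause = (" WHERE " + " AND ".join(where)) if where else ""

    # The total ignores limit/offset so callers can compute page counts.
    total = conn.execute(f"SELECT COUNT(*) FROM resources{clause}", params).fetchone()[0]
    rows = conn.execute(
        f"SELECT id, site_id, url FROM resources{clause} ORDER BY id LIMIT ? OFFSET ?",
        [*params, limit, offset],
    ).fetchall()
    return rows, total
```

Calling search_resources(build_demo_db(), sites=[1], query="example", limit=2) yields the first two matching rows together with the full match count, so a caller can request the next page by raising offset by limit.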
mcp_server_webcrawl.crawlers.archivebox.crawler module
- class ArchiveBoxCrawler[source]
Bases:
IndexedCrawler
A crawler implementation for ArchiveBox archived sites. Provides functionality for accessing and searching web content from ArchiveBox archives. ArchiveBox creates single-URL archives with metadata stored in JSON files and HTML content preserved in index.html files.
- __init__(datasrc)[source]
Initialize the ArchiveBox crawler with a data source directory.
- Parameters:
datasrc (Path) – path to a directory containing ArchiveBox archive directories, each of which contains individual URL entries
- Raises:
AssertionError – If datasrc is None or not a directory
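The documented precondition on datasrc can be sketched as follows. The class SketchCrawler is a hypothetical stand-in that reproduces only the assertion behavior described above; it does not subclass IndexedCrawler or mirror the real constructor's other work.

```python
from pathlib import Path


class SketchCrawler:
    """Hypothetical stand-in showing the datasrc checks only."""

    def __init__(self, datasrc):
        # Both failure modes surface as AssertionError, matching the documented contract.
        assert datasrc is not None, "datasrc is required"
        assert Path(datasrc).is_dir(), f"datasrc must be a directory: {datasrc}"
        self.datasrc = Path(datasrc)
```

Because the checks are assert statements, they are skipped when Python runs with the -O flag, so callers who need a guaranteed failure should validate the path themselves before constructing the crawler.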
mcp_server_webcrawl.crawlers.archivebox.tests module
- class ArchiveBoxTests[source]
Bases:
BaseCrawlerTests
Test suite for the ArchiveBox crawler implementation. Uses wrapped test methods from BaseCrawlerTests adapted for ArchiveBox’s multi-instance structure.
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
- test_archivebox_resources()[source]
Test resource retrieval API functionality with various parameters.
- test_archivebox_content_parsing()[source]
Test content type detection and parsing for ArchiveBox resources.
- test_archivebox_timestamped_structure()[source]
Test handling of ArchiveBox’s timestamped entry structure.