mcp_server_webcrawl.crawlers.warc package
Submodules
mcp_server_webcrawl.crawlers.warc.adapter module
- class WarcManager[source]
Bases:
IndexedManager
Manages WARC file data in in-memory SQLite databases. Provides connection pooling and caching for efficient access.
Initialize the WARC manager with empty cache and statistics.
- get_sites(datasrc, ids=None, fields=None)[source]
List WARC files in the datasrc directory as sites.
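As a rough illustration of treating each archive file in a directory as a site, the sketch below scans a directory for the file types the crawler documents as supported (.txt, .warc, .warc.gz). The helper name list_warc_sites, the dictionary shape, and the sequential IDs are assumptions made for this example; the library's own ID scheme and result type are not shown in this reference.

```python
from pathlib import Path

# File suffixes the WarcCrawler documentation lists as supported.
WARC_SUFFIXES = (".txt", ".warc", ".warc.gz")


def list_warc_sites(datasrc: Path) -> list[dict]:
    """Hypothetical helper: describe each supported archive file as a site.

    IDs are assigned sequentially in filename order, purely for illustration.
    """
    files = sorted(
        p for p in datasrc.iterdir()
        if p.is_file() and p.name.lower().endswith(WARC_SUFFIXES)
    )
    return [{"id": i, "name": p.name} for i, p in enumerate(files, start=1)]
```

Subdirectories and files with other extensions are skipped, so only archive files in the top level of datasrc are reported.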
- get_resources(datasrc, sites=None, query='', fields=None, sort=None, limit=20, offset=0)[source]
Get resources from WARC files using in-memory SQLite.
- Parameters:
datasrc (Path) – path to the directory containing WARC files
sites (list[int] | None) – optional list of site IDs to filter by
query (str) – search query string
fields (list[str] | None) – optional list of fields to include in response
sort (str | None) – sort order for results
limit (int) – maximum number of results to return
offset (int) – number of results to skip for pagination
- Returns:
Tuple of (list of ResourceResult objects, total count)
- Return type:
tuple[list[ResourceResult], int]
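The limit, offset, and total-count contract described above can be sketched with the standard sqlite3 module. This is a minimal stand-in, not the adapter's implementation: the resources table, its columns, the query_resources function, and the use of a LIKE match in place of the library's search syntax are all assumptions made for this example.

```python
import sqlite3


def query_resources(conn, query="", limit=20, offset=0):
    """Hypothetical stand-in: return (page of rows, total match count)."""
    pattern = f"%{query}%"  # simple substring match, not the real query syntax
    total = conn.execute(
        "SELECT COUNT(*) FROM resources WHERE url LIKE ?", (pattern,)
    ).fetchone()[0]
    rows = conn.execute(
        "SELECT id, url FROM resources WHERE url LIKE ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (pattern, limit, offset),
    ).fetchall()
    return rows, total


# Build a throwaway in-memory database with 50 illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO resources (url) VALUES (?)",
    [(f"https://example.com/page/{i}",) for i in range(50)],
)

# Third page of 20: only the last 10 rows remain, but total still reports 50.
page, total = query_resources(conn, query="page", limit=20, offset=40)
```

Returning the total alongside the page lets a caller work out how many further pages exist without issuing a second request.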
mcp_server_webcrawl.crawlers.warc.crawler module
- class WarcCrawler[source]
Bases:
IndexedCrawler
A crawler implementation for WARC (Web ARChive) files. Provides functionality for accessing and searching web archive content.
- __init__(datasrc)[source]
Initialize the WARC crawler with a data source directory. Supported file types: .txt, .warc, and .warc.gz
- Parameters:
datasrc (Path) – path to a directory containing WARC files
- Raises:
AssertionError – If datasrc is None or not a directory
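The documented precondition on datasrc can be approximated as follows. This is a sketch under stated assumptions: validate_datasrc is a hypothetical function written for this example, and the assertion messages are invented; only the behavior (AssertionError when datasrc is None or not a directory) comes from the description above.

```python
from pathlib import Path


def validate_datasrc(datasrc):
    """Hypothetical check mirroring the documented AssertionError cases."""
    assert datasrc is not None, "datasrc is required"
    path = Path(datasrc)
    assert path.is_dir(), f"datasrc must be a directory: {path}"
    return path
```

Because the contract is expressed as an AssertionError, callers that need to handle a bad path gracefully may prefer to check the directory themselves before constructing the crawler. Note also that Python skips assert statements when run with the -O flag.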
mcp_server_webcrawl.crawlers.warc.tests module
- class WarcTests[source]
Bases:
BaseCrawlerTests
Test suite for the WARC crawler implementation. Uses all wrapped test methods from BaseCrawlerTests.
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.