mcp_server_webcrawl.crawlers.warc package
Submodules
mcp_server_webcrawl.crawlers.warc.adapter module
- class WarcManager
Bases: IndexedManager
Manages WARC file data in in-memory SQLite databases. Provides connection pooling and caching for efficient access.
Initialize the WARC manager with empty cache and statistics.
- get_sites(datasrc, ids=None, fields=None)
List WARC files in the datasrc directory as sites.
- get_resources(datasrc, sites=None, query='', fields=None, sort=None, limit=20, offset=0)
Get resources from WARC files using in-memory SQLite.
- Parameters:
datasrc (Path) – path to the directory containing WARC files
sites (list[int] | None) – optional list of site IDs to filter by
query (str) – search query string
fields (list[str] | None) – optional list of fields to include in response
sort (str | None) – sort order for results
limit (int) – maximum number of results to return
offset (int) – number of results to skip for pagination
- Returns:
Tuple of (list of ResourceResult objects, total count)
- Return type:
tuple[list[ResourceResult], int]
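The limit, offset, and total-count behaviour described above can be sketched with the standard sqlite3 module. This is only an illustration of the pagination pattern, not the package's implementation: the resources table, its columns, and the search_resources helper are hypothetical names chosen for this example.

```python
import sqlite3


def search_resources(conn, query="", limit=20, offset=0):
    """Return (rows, total) for URLs containing `query`, one page at a time."""
    pattern = f"%{query}%"
    # Total number of matches, independent of paging.
    total = conn.execute(
        "SELECT COUNT(*) FROM resources WHERE url LIKE ?", (pattern,)
    ).fetchone()[0]
    # One page of matches, in a stable order so pages do not overlap.
    rows = conn.execute(
        "SELECT id, url FROM resources WHERE url LIKE ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (pattern, limit, offset),
    ).fetchall()
    return rows, total


# Hypothetical in-memory table standing in for indexed WARC records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO resources (url) VALUES (?)",
    [(f"https://example.com/page{i}",) for i in range(1, 6)],
)

# Second page of two results; total still reports all five matches.
page, total = search_resources(conn, query="page", limit=2, offset=2)
```

Returning the total alongside the page lets a caller work out how many further pages remain without issuing a separate request.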
mcp_server_webcrawl.crawlers.warc.crawler module
- class WarcCrawler
Bases: IndexedCrawler
A crawler implementation for WARC (Web ARChive) files. Provides functionality for accessing and searching web archive content.
- __init__(datasrc)
Initialize the WARC crawler with a data source directory. Supported file types: .txt, .warc, and .warc.gz
- Parameters:
datasrc (Path) – path to the data source; must be a directory containing WARC files
- Raises:
AssertionError – If datasrc is None or not a directory
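A minimal sketch of the checks described above, assuming the crawler validates the directory and then looks for files with the listed extensions. The find_warc_files helper and the SUPPORTED_SUFFIXES constant are hypothetical names for illustration; the actual discovery logic in the package may differ.

```python
import tempfile
from pathlib import Path

# Extensions listed in the docstring above.
SUPPORTED_SUFFIXES = (".txt", ".warc", ".warc.gz")


def find_warc_files(datasrc):
    """Validate datasrc and return supported files in it, sorted by path."""
    assert datasrc is not None, "datasrc is required"
    datasrc = Path(datasrc)
    assert datasrc.is_dir(), f"{datasrc} is not a directory"
    # str.endswith accepts a tuple, which handles the two-part ".warc.gz".
    return sorted(
        p for p in datasrc.iterdir()
        if p.is_file() and p.name.lower().endswith(SUPPORTED_SUFFIXES)
    )


# Usage with a throwaway directory holding two archives and one other file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.warc", "b.warc.gz", "notes.md"):
        (root / name).touch()
    found = [p.name for p in find_warc_files(root)]
```

Matching on the full file name rather than Path.suffix avoids treating b.warc.gz as a plain .gz file.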
mcp_server_webcrawl.crawlers.warc.tests module
- class WarcTests
Bases: BaseCrawlerTests
Test suite for the WARC crawler implementation. Uses all wrapped test methods from BaseCrawlerTests.
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
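The constructor description above matches the wording of the standard library's unittest.TestCase, so the behaviour can be illustrated with a plain TestCase subclass. This sketch assumes WarcTests ultimately derives from unittest.TestCase; ExampleTests is a hypothetical stand-in and is not part of the package.

```python
import unittest


class ExampleTests(unittest.TestCase):
    """Hypothetical stand-in for a crawler test suite."""

    def test_truth(self):
        self.assertTrue(True)


# Naming an existing method binds the instance to that single test.
case = ExampleTests("test_truth")
result = unittest.TestResult()
case.run(result)

# Naming a method that does not exist raises ValueError at construction.
try:
    ExampleTests("test_missing")
    missing_raises = False
except ValueError:
    missing_raises = True
```

In normal use a test runner creates these instances, so the constructor is rarely called directly.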