mcp_server_webcrawl.crawlers.warc package

Submodules

mcp_server_webcrawl.crawlers.warc.adapter module

class WarcManager

Bases: IndexedManager

Manages WARC file data in in-memory SQLite databases. Provides connection pooling and caching for efficient access.

__init__()

Initialize the WARC manager with empty cache and statistics.

Return type:

None
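
As a rough illustration of the caching described for WarcManager, the sketch below keeps one in-memory SQLite connection per WARC path and counts cache hits and misses. The class name InMemoryIndexCache, its attributes, and the table schema are hypothetical and are not taken from the package.

```python
import sqlite3
from pathlib import Path


class InMemoryIndexCache:
    """Hypothetical cache: one in-memory SQLite connection per WARC path."""

    def __init__(self) -> None:
        # Start with an empty cache and zeroed statistics.
        self._connections: dict[Path, sqlite3.Connection] = {}
        self.hits = 0
        self.misses = 0

    def get(self, warc_path: Path) -> sqlite3.Connection:
        key = Path(warc_path)
        if key in self._connections:
            self.hits += 1
            return self._connections[key]
        self.misses += 1
        conn = sqlite3.connect(":memory:")
        # Assumed minimal schema; a real index would be populated from the WARC records.
        conn.execute(
            "CREATE TABLE resources (id INTEGER PRIMARY KEY, url TEXT, status INTEGER)"
        )
        self._connections[key] = conn
        return conn

    def close_all(self) -> None:
        # Release every cached connection and forget it.
        for conn in self._connections.values():
            conn.close()
        self._connections.clear()
```

Requesting the same path twice returns the same connection object, so any index built on the first request is reused on the second.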

get_sites(datasrc, ids=None, fields=None)

List WARC files in the datasrc directory as sites.

Parameters:
  • datasrc (Path) – path to the directory containing WARC files

  • ids (list[int] | None) – optional list of site IDs to filter by

  • fields (list[str] | None) – list of fields to include in the response

Returns:

List of SiteResult objects, one for each WARC file

Return type:

list[SiteResult]
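
To make the directory-as-sites idea concrete, here is a minimal sketch that lists WARC files in a directory and applies the ids and fields filters. The function list_warc_sites, the dictionary keys, and the CRC32-based ID are assumptions made for illustration; the package's SiteResult objects and its actual ID scheme are not shown in this reference.

```python
import tempfile
import zlib
from pathlib import Path

# Extensions named in the WarcCrawler documentation.
WARC_EXTENSIONS = (".warc", ".warc.gz", ".txt")


def list_warc_sites(datasrc, ids=None, fields=None):
    """Hypothetical stand-in for get_sites: one dict per WARC file."""
    assert datasrc is not None and datasrc.is_dir(), "datasrc must be a directory"
    sites = []
    for path in sorted(datasrc.iterdir()):
        if not path.is_file() or not path.name.lower().endswith(WARC_EXTENSIONS):
            continue
        # Assumption: a stable integer ID derived from the file name.
        site = {"id": zlib.crc32(path.name.encode("utf-8")), "url": path.name}
        # Optional fields are only added when explicitly requested.
        if fields and "modified" in fields:
            site["modified"] = path.stat().st_mtime
        sites.append(site)
    if ids is not None:
        sites = [site for site in sites if site["id"] in ids]
    return sites


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "example.warc.gz").write_bytes(b"")
        for site in list_warc_sites(Path(tmp)):
            print(site["id"], site["url"])
```

Deriving the ID from the file name keeps it stable across runs, which matters if callers store site IDs and pass them back through ids later.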

get_resources(datasrc, sites=None, query='', fields=None, sort=None, limit=20, offset=0)

Get resources from WARC files using in-memory SQLite.

Parameters:
  • datasrc (Path) – path to the directory containing WARC files

  • sites (list[int] | None) – optional list of site IDs to filter by

  • query (str) – search query string

  • fields (list[str] | None) – optional list of fields to include in response

  • sort (str | None) – sort order for results

  • limit (int) – maximum number of results to return

  • offset (int) – number of results to skip for pagination

Returns:

Tuple of (list of ResourceResult objects, total count, index state)

Return type:

tuple[list[ResourceResult], int, IndexState]
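
The limit and offset parameters, together with the total count in the return value, suggest a standard pagination pattern. The sketch below shows that pattern against an in-memory SQLite table. The schema, the helper names build_demo_index and query_resources, and the status filter (standing in for the real search query) are all assumptions, and the IndexState element of the documented return type is omitted.

```python
import sqlite3


def build_demo_index():
    """Create a throwaway in-memory table with 50 fake resources."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE resources ("
        "id INTEGER PRIMARY KEY, url TEXT NOT NULL, status INTEGER NOT NULL)"
    )
    rows = [
        (i, f"https://example.com/page{i}", 200 if i % 5 else 404)
        for i in range(1, 51)
    ]
    conn.executemany("INSERT INTO resources VALUES (?, ?, ?)", rows)
    return conn


def query_resources(conn, status=None, limit=20, offset=0):
    """Return (page of rows, total matching rows)."""
    where, params = "", []
    if status is not None:
        where, params = " WHERE status = ?", [status]
    # Count all matches first so callers can compute the number of pages.
    total = conn.execute(
        f"SELECT COUNT(*) FROM resources{where}", params
    ).fetchone()[0]
    # Then fetch only the requested window, with a deterministic order.
    rows = conn.execute(
        f"SELECT id, url, status FROM resources{where} ORDER BY id LIMIT ? OFFSET ?",
        params + [limit, offset],
    ).fetchall()
    return rows, total
```

Returning the total alongside the page lets a caller decide whether another request with a larger offset is needed, without issuing a separate count query.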

mcp_server_webcrawl.crawlers.warc.crawler module

class WarcCrawler

Bases: IndexedCrawler

A crawler implementation for WARC (Web ARChive) files. Provides functionality for accessing and searching web archive content.

__init__(datasrc)

Initialize the WARC crawler with a data source directory. Supported file types: .txt, .warc, and .warc.gz

Parameters:

datasrc (Path) – the input argument as Path, must be a directory containing WARC files

Raises:

AssertionError – If datasrc is None or not a directory
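
The documented precondition (datasrc must be a Path to an existing directory, otherwise AssertionError) can be sketched with a stand-in class. DirectoryBackedCrawler and try_create are hypothetical names used only for this illustration; they do not reproduce WarcCrawler's actual constructor.

```python
import tempfile
from pathlib import Path


class DirectoryBackedCrawler:
    """Hypothetical stand-in showing the documented precondition."""

    def __init__(self, datasrc):
        # Mirrors the documented behavior: None or a non-directory is rejected.
        assert datasrc is not None, "datasrc is required"
        assert datasrc.is_dir(), f"datasrc must be a directory: {datasrc}"
        self.datasrc = datasrc


def try_create(datasrc):
    """Return a crawler, or None when the precondition fails."""
    try:
        return DirectoryBackedCrawler(datasrc)
    except AssertionError:
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(try_create(Path(tmp)) is not None)
    print(try_create(None) is None)
```

Because assert statements are skipped when Python runs with the -O flag, callers that depend on this check may prefer to validate the path themselves before constructing the crawler.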

mcp_server_webcrawl.crawlers.warc.tests module

class WarcTests

Bases: BaseCrawlerTests

Test suite for the WARC crawler implementation. Uses all wrapped test methods from BaseCrawlerTests.

Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.

setUp()

Set up the test environment with fixture data.

test_warc_pulse()

Test basic crawler initialization.

test_warc_sites()

Test site retrieval API functionality.

Test boolean search functionality.

test_warc_resources()

Test resource retrieval API functionality with various parameters.

test_warc_random_sort()

Test random sort functionality using the ‘?’ sort parameter.
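
One plausible way a '?' sort value could be honored with an SQLite backend is to translate it into ORDER BY RANDOM(). The sketch below shows that translation. The helper names, the allowed-column table, and the "+field" / "-field" direction prefixes are assumptions made for illustration, not the package's confirmed behavior.

```python
import sqlite3

# Hypothetical whitelist of sortable columns; prevents SQL injection via sort.
SORT_COLUMNS = {"id": "id", "url": "url"}


def order_by_clause(sort):
    """Translate a sort string into an SQL ORDER BY clause."""
    if sort is None:
        return " ORDER BY id ASC"
    if sort == "?":
        # SQLite evaluates RANDOM() per row, which shuffles the result set.
        return " ORDER BY RANDOM()"
    direction = "DESC" if sort.startswith("-") else "ASC"
    column = SORT_COLUMNS.get(sort.lstrip("+-"))
    if column is None:
        raise ValueError(f"unsupported sort: {sort!r}")
    return f" ORDER BY {column} {direction}"


def fetch_urls(conn, sort=None):
    """Return all URLs from the demo table in the requested order."""
    sql = "SELECT url FROM resources" + order_by_clause(sort)
    return [row[0] for row in conn.execute(sql)]
```

A test for random ordering cannot assert a fixed sequence; comparing the sorted random result with the deterministic result checks that the same rows come back regardless of order.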

test_warc_content_parsing()

Test content type detection and parsing for WARC files.

test_report()

Test thumbnail generation functionality (InterroBot-specific).

Module contents