mcp_server_webcrawl.crawlers.katana package

Submodules

mcp_server_webcrawl.crawlers.katana.adapter module

class KatanaManager[source]

Bases: BaseManager

Manages HTTP text files in in-memory SQLite databases. Provides connection pooling and caching for efficient access.

Initialize the HTTP text manager with empty cache and statistics.

__init__()[source]

Initialize the HTTP text manager with empty cache and statistics.

Return type:

None

get_sites(datasrc, ids=None, fields=None)[source]

List the site subdirectories of the datasrc directory as sites.

Parameters:
  • datasrc (Path) – Path to the directory containing site subdirectories

  • ids (list[int] | None) – Optional list of site IDs to filter by

  • fields (list[str] | None) – Optional list of fields to include in the response

Returns:

List of SiteResult objects, one for each site directory

Return type:

list[SiteResult]

Notes

Returns an empty list if the datasrc directory doesn’t exist.
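
A minimal usage sketch, assuming a hypothetical /data/katana datasrc laid out as one subdirectory per crawled site (the path is illustrative, not part of the API):

    from pathlib import Path

    from mcp_server_webcrawl.crawlers.katana.adapter import KatanaManager

    datasrc = Path("/data/katana")  # hypothetical datasrc: one subdirectory per site

    manager = KatanaManager()
    sites = manager.get_sites(datasrc)  # one SiteResult per site directory
    for site in sites:
        print(site)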

get_resources(datasrc, ids=None, sites=None, query='', types=None, fields=None, statuses=None, sort=None, limit=20, offset=0)[source]

Get resources from HTTP text files using in-memory SQLite.

Parameters:
  • datasrc (Path) – Path to the directory containing site directories

  • ids (list[int] | None) – Optional list of resource IDs to filter by

  • sites (list[int] | None) – Optional list of site IDs to filter by

  • query (str) – Search query string

  • types (list[ResourceResultType] | None) – Optional list of resource types to filter by

  • fields (list[str] | None) – Optional list of fields to include in response

  • statuses (list[int] | None) – Optional list of HTTP status codes to filter by

  • sort (str | None) – Sort order for results

  • limit (int) – Maximum number of results to return

  • offset (int) – Number of results to skip for pagination

Returns:

Tuple of (list of ResourceResult objects, total count)

Return type:

Tuple[list[ResourceResult], int]
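
A hedged sketch of a filtered search, again against a hypothetical /data/katana datasrc; the keyword arguments mirror the documented signature:

    from pathlib import Path

    from mcp_server_webcrawl.crawlers.katana.adapter import KatanaManager

    datasrc = Path("/data/katana")  # hypothetical datasrc directory
    manager = KatanaManager()

    # full-text query for "login", restricted to HTTP 200 responses,
    # first page of ten results
    results, total = manager.get_resources(
        datasrc,
        query="login",
        statuses=[200],
        limit=10,
        offset=0,
    )
    print(f"{total} matches; showing {len(results)}")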

mcp_server_webcrawl.crawlers.katana.crawler module

class KatanaCrawler[source]

Bases: IndexedCrawler

A crawler implementation for HTTP text files. Provides functionality for accessing and searching web content from captured HTTP exchanges.

Initialize the HTTP text crawler with a data source directory.

Parameters:

datasrc – Path to a directory containing subdirectories with HTTP text files

__init__(datasrc)[source]

Initialize the HTTP text crawler with a data source directory.

Parameters:

datasrc (Path) – Path to a directory containing subdirectories with HTTP text files
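
A minimal construction sketch (the path is hypothetical; KatanaCrawler inherits its search interface from IndexedCrawler, which is not documented here):

    from pathlib import Path

    from mcp_server_webcrawl.crawlers.katana.crawler import KatanaCrawler

    # datasrc must be a directory whose subdirectories hold HTTP text files
    crawler = KatanaCrawler(Path("/data/katana"))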

mcp_server_webcrawl.crawlers.katana.tests module

class KatanaTests[source]

Bases: BaseCrawlerTests

Test suite for the HTTP text crawler implementation. Tests parsing and retrieval of web content from HTTP text files.

Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.

setUp()[source]

Set up the test environment with fixture data.

test_katana_pulse()[source]

Test basic crawler initialization.

test_katana_sites()[source]

Test site retrieval API functionality.

test_katana_resources()[source]

Test resource retrieval API functionality with various parameters.

test_katana_random_sort()[source]

Test random sort functionality using the ‘?’ sort parameter.

test_katana_content_parsing()[source]

Test content type detection and parsing for HTTP text files.
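
A sketch of running this suite directly with the standard unittest runner; it assumes the fixture data referenced by setUp() is available locally:

    import unittest

    from mcp_server_webcrawl.crawlers.katana.tests import KatanaTests

    suite = unittest.TestLoader().loadTestsFromTestCase(KatanaTests)
    unittest.TextTestRunner(verbosity=2).run(suite)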

Module contents