mcp_server_webcrawl.crawlers.base package

Submodules

mcp_server_webcrawl.crawlers.base.adapter module

class IndexStatus[source]

Bases: Enum

Status values for the indexing lifecycle.

UNDEFINED = ''
IDLE = 'idle'
INDEXING = 'indexing'
PARTIAL = 'partial'
COMPLETE = 'complete'
REMOTE = 'remote'
FAILED = 'failed'
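The members above round-trip through their string values, which makes them convenient to serialize in API responses. A minimal sketch, with the class redefined locally so the snippet is self-contained (the real class lives in mcp_server_webcrawl.crawlers.base.adapter):

```python
from enum import Enum

# Mirror of the documented IndexStatus values, redefined here so the
# snippet runs standalone.
class IndexStatus(Enum):
    UNDEFINED = ""
    IDLE = "idle"
    INDEXING = "indexing"
    PARTIAL = "partial"
    COMPLETE = "complete"
    REMOTE = "remote"
    FAILED = "failed"

# Constructing from the value returns the singleton member.
status = IndexStatus("indexing")
print(status is IndexStatus.INDEXING)  # True
print(status.value)                    # indexing
```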
class IndexState[source]

Bases: object

Shared state between crawler and manager for indexing progress

status: IndexStatus = ''
processed: int = 0
time_start: datetime | None = None
time_end: datetime | None = None
set_status(status)[source]
Parameters:

status (IndexStatus) –

increment_processed()[source]
property duration: str
is_timeout()[source]

Check if the indexing operation has exceeded the timeout threshold

Return type:

bool

to_dict()[source]

Convert the IndexState to a dictionary representation

Return type:

dict

__init__(status=IndexStatus.UNDEFINED, processed=0, time_start=None, time_end=None)
Parameters:
Return type:

None

class SitesGroup[source]

Bases: object

Container class that supports searching one or more sites at once.

Parameters:
  • datasrc – site datasrc

  • site_ids – site ids of the sites

  • site_paths – paths to site contents (directories)

__init__(datasrc, site_ids, site_paths)[source]

Container class that supports searching one or more sites at once.

Parameters:
  • datasrc (Path) – site datasrc

  • site_ids (list[int]) – site ids of the sites

  • site_paths (list[Path]) – paths to site contents (directories)

Return type:

None

get_sites()[source]
Return type:

dict[int, str]
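A minimal sketch of the container, assuming get_sites() maps each site id to the string form of its content directory; the ids and paths below are hypothetical:

```python
from pathlib import Path

# Simplified stand-in for the documented SitesGroup container.
class SitesGroup:
    def __init__(self, datasrc: Path, site_ids: list[int],
                 site_paths: list[Path]) -> None:
        self.datasrc = datasrc
        self.site_ids = list(site_ids)
        self.site_paths = list(site_paths)

    def get_sites(self) -> dict[int, str]:
        # Pair each id with its content directory, as a string.
        return {sid: str(p) for sid, p in zip(self.site_ids, self.site_paths)}

group = SitesGroup(Path("/data"), [101, 102],
                   [Path("/data/example.com"), Path("/data/pragmar.com")])
print(group.get_sites())
```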

class SitesStat[source]

Bases: object

Basic bookkeeping for troubleshooting.

__init__(group, cached)[source]

Basic bookkeeping for troubleshooting.

Parameters:
Return type:

None

class BaseManager[source]

Bases: object

Base class for managing web crawler data in in-memory SQLite databases. Provides connection pooling and caching for efficient access.

Initialize the manager with statistics.

__init__()[source]

Initialize the manager with statistics.

Return type:

None

static string_to_id(value)[source]

Convert a string, such as a directory name, to a numeric ID suitable for a database primary key.

Hash space and collision probability notes:
  • [:8] = 32 bits (4.29 billion values), ~1% collision chance with 10,000 items

  • [:12] = 48 bits (280 trillion values), ~0.0000001% collision chance with 10,000 items

  • [:16] = 64 bits (max safe SQLite INTEGER), near-zero collision, 9.22 quintillion values

  • SQLite INTEGER type is 64-bit signed, with a max value of 9,223,372,036,854,775,807.

  • The main drawback of larger hash spaces is the length of the IDs they generate for presentation.

Parameters:

value (str) – Input string to convert to an ID

Returns:

Integer ID derived from the input string

Return type:

int
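A hypothetical sketch consistent with the notes above: hash the string and parse the first 12 hex digits (48 bits) as an integer. The hash function and slice width are assumptions; the real implementation may differ in both.

```python
import hashlib

def string_to_id(value: str) -> int:
    # SHA-256 is an assumption; any stable hash works for this pattern.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    # 12 hex digits = 48 bits, well inside SQLite's signed 64-bit INTEGER.
    return int(digest[:12], 16)

site_id = string_to_id("example.com")
print(site_id)  # deterministic for a given input
```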

static get_basic_headers(file_size, resource_type)[source]
Parameters:
Return type:

str

static read_files(paths)[source]
Parameters:

paths (list[Path]) –

Return type:

dict[Path, str | None]

static read_file_contents(file_path, resource_type)[source]

Read content from text files, with error handling and encoding detection.

Return type:

str | None

static decruft_path(path)[source]

Light-touch cleanup of file names: temporary-file suffixes add noise, and extensions are useful for classifying resources.

Parameters:

path (str) –

Return type:

str
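A hypothetical illustration of "light-touch" cleanup: strip a temp-file suffix while preserving the real extension, since extensions drive resource classification. The actual rules in decruft_path may differ.

```python
import re

def decruft_path(path: str) -> str:
    # Drop a trailing ".tmpXXXX" marker, e.g.
    # "page.html.tmp8421" -> "page.html". Pattern is illustrative.
    return re.sub(r"\.tmp\w*$", "", path)

print(decruft_path("about/index.html.tmp8421"))  # about/index.html
```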

get_stats()[source]
Return type:

list[SitesStat]

get_resources_for_sites_group(sites_group, query, fields, sort, limit, offset, swap_values={})[source]

Get resources from directories using structured query parsing with SearchQueryParser.

This method extracts types, fields, and statuses from the querystring instead of accepting them as separate arguments, using the new SearchSubquery functionality.

Parameters:
  • sites_group (SitesGroup) – Group of sites to search in

  • query (str) – Search query string that can include field:value syntax for filtering

  • fields (list[str] | None) – resource fields to be returned by the API (Content, Headers, etc.)

  • sort (str | None) – Sort order for results

  • limit (int) – Maximum number of results to return

  • offset (int) – Number of results to skip for pagination

  • swap_values (dict) – per-field parameterized values to check for (and replace)

Returns:

Tuple of (list of ResourceResult objects, total count, connection_index_state)

Return type:

tuple[list[ResourceResult], int, IndexState]

Notes

Returns empty results if sites is empty or not provided. If the database is being built, it will log a message and return empty results.

This method extracts field-specific filters from the query string using SearchQueryParser:
  • type:html (to filter by resource type)

  • status:200 (to filter by HTTP status)

Any fields present in the SearchSubquery will be included in the response.
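For illustration only, here is a toy split of a query string into field:value filters and plain fulltext terms, in the spirit of the parsing described above. The real SearchQueryParser also handles Boolean operators, quoting, and SearchSubquery objects.

```python
import re

def split_query(query: str) -> tuple[dict, list]:
    # Recognize the two documented filter fields; everything else is a
    # fulltext term. This tokenization is deliberately simplistic.
    filters, terms = {}, []
    for token in query.split():
        match = re.fullmatch(r"(type|status):(\S+)", token)
        if match:
            filters[match.group(1)] = match.group(2)
        else:
            terms.append(token)
    return filters, terms

print(split_query("type:html status:200 crawler"))
# ({'type': 'html', 'status': '200'}, ['crawler'])
```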

mcp_server_webcrawl.crawlers.base.api module

class BaseJsonApiEncoder[source]

Bases: JSONEncoder

Custom JSON encoder for BaseJsonApi objects and ResourceResultType enums.

Constructor for JSONEncoder, with sensible defaults.

If skipkeys is false, then it is a TypeError to attempt encoding of keys that are not str, int, float or None. If skipkeys is True, such items are simply skipped.

If ensure_ascii is true, the output is guaranteed to be str objects with all incoming non-ASCII characters escaped. If ensure_ascii is false, the output can contain non-ASCII characters.

If check_circular is true, then lists, dicts, and custom encoded objects will be checked for circular references during encoding to prevent an infinite recursion (which would cause an OverflowError). Otherwise, no such check takes place.

If allow_nan is true, then NaN, Infinity, and -Infinity will be encoded as such. This behavior is not JSON specification compliant, but is consistent with most JavaScript based encoders and decoders. Otherwise, it will be a ValueError to encode such floats.

If sort_keys is true, then the output of dictionaries will be sorted by key; this is useful for regression tests to ensure that JSON serializations can be compared on a day-to-day basis.

If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0 will only insert newlines. None is the most compact representation.

If specified, separators should be an (item_separator, key_separator) tuple. The default is (', ', ': ') if indent is None and (',', ': ') otherwise. To get the most compact JSON representation, you should specify (',', ':') to eliminate whitespace.

If specified, default is a function that gets called for objects that can’t otherwise be serialized. It should return a JSON encodable version of the object or raise a TypeError.

default(obj)[source]

Override default encoder to handle custom types.

Parameters:

obj – Object to encode

Returns:

JSON serializable representation of the object

Return type:

Any
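The default() override pattern described above can be sketched as follows: emit an Enum's value, call to_dict() on objects that provide it, and defer to the base class for everything else. The ResourceType enum here is a stand-in for ResourceResultType:

```python
import json
from enum import Enum

class ResourceType(Enum):  # hypothetical stand-in for ResourceResultType
    PAGE = "page"

class SketchEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Enum):
            return obj.value
        if hasattr(obj, "to_dict"):
            return obj.to_dict()
        # Unknown types fall through to the base class, which raises
        # TypeError as the contract requires.
        return super().default(obj)

print(json.dumps({"type": ResourceType.PAGE}, cls=SketchEncoder))
# {"type": "page"}
```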

class BaseJsonApi[source]

Bases: object

Base class for JSON API responses.

Provides a standardized structure for API responses including metadata, results, and error handling.

Construct with the arguments of creation; these are echoed back in the JSON response. The object collapses to JSON on json dumps, with everything it contains implementing to_dict.

Parameters:
  • method – API method name

  • args – Dictionary of API arguments

  • index_state – indexing, complete, remote, etc.

__init__(method, args, index_state=None)[source]

Construct with the arguments of creation; these are echoed back in the JSON response. The object collapses to JSON on json dumps, with everything it contains implementing to_dict.

Parameters:
  • method (str) – API method name

  • args (dict[str, Any]) – Dictionary of API arguments

  • index_state (IndexState | None) – indexing, complete, remote, etc.

property total: int

Returns the total number of results.

Returns:

Integer count of total results

get_results()[source]
Return type:

list[SiteResult | ResourceResult]

set_results(results, total, offset, limit)[source]

Set the results of the API response.

Parameters:
  • results (list[SiteResult | ResourceResult]) – List of result objects

  • total (int) – Total number of results (including those beyond limit)

  • offset (int) – Starting position in the full result set

  • limit (int) – Maximum number of results to include

Return type:

None

append_error(message)[source]

Add an error to the JSON response, visible to the endpoint LLM.

Parameters:

message (str) – Error message to add

Return type:

None

to_dict()[source]

Convert the object to a JSON-serializable dictionary.

Returns:

Dictionary representation of the API response

Return type:

dict[str, str | int | float | bool | list[str] | list[int] | list[float] | None]

to_json()[source]

Return a JSON serializable representation of this object.

Returns:

JSON string representation of the API response

Return type:

str
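A hedged sketch of the response envelope implied by the methods above: the method and arguments of creation are echoed back alongside results, pagination counts, and accumulated errors. The field names in to_dict() are illustrative, not the library's actual schema:

```python
import json

class SketchJsonApi:
    def __init__(self, method: str, args: dict) -> None:
        self.method, self.args = method, args
        self.results, self.total = [], 0
        self.offset, self.limit = 0, 20
        self.errors: list[str] = []

    def set_results(self, results, total, offset, limit) -> None:
        self.results, self.total = results, total
        self.offset, self.limit = offset, limit

    def append_error(self, message: str) -> None:
        self.errors.append(message)

    def to_dict(self) -> dict:
        # Echo the arguments of creation alongside the result set.
        return {"method": self.method, "args": self.args,
                "results": self.results, "total": self.total,
                "offset": self.offset, "limit": self.limit,
                "errors": self.errors}

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

api = SketchJsonApi("resources", {"query": "type:html"})
api.set_results([{"id": 1}], total=1, offset=0, limit=20)
print(api.to_json())
```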

mcp_server_webcrawl.crawlers.base.crawler module

class BaseCrawler[source]

Bases: object

Base crawler class that implements MCP server functionality.

This class provides the foundation for specialized crawlers to interact with the MCP server and handle tool operations for web resources.

Initialize the BaseCrawler with a data source path and required adapter functions.

Parameters:
  • datasrc – path to the data source

  • get_sites_func – function to retrieve sites from the data source

  • get_resources_func – function to retrieve resources from the data source

  • resource_field_mapping – mapping of resource field names to display names

__init__(datasrc, get_sites_func, get_resources_func, resource_field_mapping={'content': 'ResourcesFullText.Content', 'created': 'Resources.Created', 'fulltext': 'ResourcesFullText', 'headers': 'ResourcesFullText.Headers', 'id': 'ResourcesFullText.Id', 'modified': 'Resources.Modified', 'site': 'ResourcesFullText.Project', 'size': 'Resources.Size', 'status': 'Resources.Status', 'time': 'Resources.Time', 'type': 'ResourcesFullText.Type', 'url': 'ResourcesFullText.Url'})[source]

Initialize the BaseCrawler with a data source path and required adapter functions.

Parameters:
  • datasrc (Path) – path to the data source

  • get_sites_func (Callable) – function to retrieve sites from the data source

  • get_resources_func (Callable) – function to retrieve resources from the data source

  • resource_field_mapping (dict[str, str]) – mapping of resource field names to display names

Return type:

None

property datasrc: Path
async mcp_list_prompts()[source]

List available prompts (currently none).

Return type:

list

async mcp_list_resources()[source]

List available resources (currently none).

Return type:

list

async serve(stdin, stdout)[source]

Launch the awaitable server.

Parameters:
  • stdin (AsyncFile[str] | None) – input stream for the server

  • stdout (AsyncFile[str] | None) – output stream for the server

Returns:

The MCP server over stdio

Return type:

dict[str, Any]

get_initialization_options()[source]

Get the MCP initialization object.

Returns:

Dictionary containing project information

Return type:

InitializationOptions

get_sites_api_json(**kwargs)[source]

Get sites API result as JSON.

Returns:

JSON string of sites API results

Return type:

str

get_resources_api_json(**kwargs)[source]

Get resources API result as JSON.

Returns:

JSON string of resources API results

Return type:

str

get_sites_api(ids=None, fields=None)[source]
Parameters:
Return type:

BaseJsonApi

get_resources_api(sites=None, query='', fields=None, sort=None, limit=20, offset=0, extras=None)[source]
Parameters:
Return type:

BaseJsonApi

async mcp_list_tools()[source]

List available tools.

Returns:

List of available tools

Raises:

NotImplementedError – This method must be implemented by subclasses

Return type:

list[Tool]

async mcp_call_tool(name, arguments)[source]

Handle tool execution requests. Essentially a passthrough: override it, or call super() and adjust the result.

Parameters:
  • name (str) – name of the tool to call

  • arguments (dict[str, Any] | None) – arguments to pass to the tool

Returns:

List of content objects resulting from the tool execution

Raises:

ValueError – If the specified tool does not exist

Return type:

list[TextContent | ImageContent | EmbeddedResource]

get_thumbnails(results)[source]
Parameters:

results (list[ResourceResult]) –

Return type:

list[ImageContent]

mcp_server_webcrawl.crawlers.base.indexed module

class IndexedManager[source]

Bases: BaseManager

Initialize the manager with statistics.

__init__()[source]

Initialize the manager with statistics.

get_connection(group)[source]

Get database connection for sites in the group, creating if needed.

Parameters:

group (SitesGroup) – group of sites to connect to

Returns:

Tuple of (SQLite connection to in-memory database with data loaded or None if building,

IndexState associated with this database)

Return type:

tuple[Connection | None, IndexState]
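A sketch of the lazy, pooled-connection pattern get_connection() describes: build an in-memory database per site group on first request and reuse it afterward. The cache key and schema here are illustrative assumptions:

```python
import sqlite3

_connections: dict[str, sqlite3.Connection] = {}

def get_connection(group_key: str) -> sqlite3.Connection:
    if group_key not in _connections:
        # First request for this group: build the in-memory database.
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE Resources (Id INTEGER PRIMARY KEY, Url TEXT)")
        _connections[group_key] = conn
    return _connections[group_key]

first = get_connection("example.com")
second = get_connection("example.com")
print(first is second)  # True: same pooled connection
```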

get_sites_for_directories(datasrc, ids=None, fields=None)[source]

List site directories in the datasrc directory as sites.

Parameters:
  • datasrc (Path) – path to the directory containing site subdirectories

  • ids (list[int] | None) – optional list of site IDs to filter by

  • fields (list[str] | None) – optional list of fields to include in the response

Returns:

List of SiteResult objects, one for each site directory

Return type:

list[SiteResult]

Notes

Returns an empty list if the datasrc directory doesn’t exist.

class IndexedCrawler[source]

Bases: BaseCrawler

A crawler implementation for data sources that load into an in-memory SQLite database. Shares common behavior between specialized crawlers.

Initialize the IndexedCrawler with a data source path and required adapter functions.

Parameters:
  • datasrc – path to the data source

  • get_sites_func – function to retrieve sites from the data source

  • get_resources_func – function to retrieve resources from the data source

  • resource_field_mapping – mapping of resource field names to display names

__init__(datasrc, get_sites_func, get_resources_func, resource_field_mapping={'content': 'ResourcesFullText.Content', 'created': 'Resources.Created', 'fulltext': 'ResourcesFullText', 'headers': 'ResourcesFullText.Headers', 'id': 'ResourcesFullText.Id', 'modified': 'Resources.Modified', 'site': 'ResourcesFullText.Project', 'size': 'Resources.Size', 'status': 'Resources.Status', 'time': 'Resources.Time', 'type': 'ResourcesFullText.Type', 'url': 'ResourcesFullText.Url'})[source]

Initialize the IndexedCrawler with a data source path and required adapter functions.

Parameters:
  • datasrc (Path) – path to the data source

  • get_sites_func (Callable) – function to retrieve sites from the data source

  • get_resources_func (Callable) – function to retrieve resources from the data source

  • resource_field_mapping (dict[str, str]) – mapping of resource field names to display names

Return type:

None

async mcp_list_tools()[source]

List available tools for this crawler.

Returns:

List of Tool objects

Return type:

list[Tool]

mcp_server_webcrawl.crawlers.base.tests module

class BaseCrawlerTests[source]

Bases: TestCase

Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.

setUp()[source]

Hook method for setting up the test fixture before exercising it.

run_pragmar_search_tests(crawler, site_id)[source]

Run a battery of database checks on the crawler, including Boolean query validation.

Parameters:
run_pragmar_image_tests(crawler, pragmar_site_id)[source]

Test InterroBot-specific image handling and thumbnails.

Parameters:
run_sites_resources_tests(crawler, pragmar_site_id, example_site_id)[source]
Parameters:
run_pragmar_tokenizer_tests(crawler, site_id)[source]

FTS hyphens and underscores are particularly challenging and thus have a dedicated test. These must be configured in multiple places, including the CREATE TABLE … tokenizer, and handled by the query parser.

Parameters:
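The tokenizer concern behind that test can be demonstrated directly with SQLite FTS5: the default unicode61 tokenizer splits at hyphens and underscores, while adding tokenchars keeps them inside a single token. This is a standalone illustration, not the library's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# tokenchars tells unicode61 to treat '-' and '_' as token characters,
# so "mcp-server_webcrawl" indexes as one token instead of three.
conn.execute(
    "CREATE VIRTUAL TABLE docs USING fts5("
    "body, tokenize=\"unicode61 tokenchars '-_'\")"
)
conn.execute("INSERT INTO docs VALUES ('mcp-server_webcrawl docs')")
hit = conn.execute(
    "SELECT body FROM docs WHERE docs MATCH '\"mcp-server_webcrawl\"'"
).fetchone()
print(hit)  # ('mcp-server_webcrawl docs',)
```

The same tokenchars choice must be mirrored on the query side, which is why the query parser handling is called out above.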
run_pragmar_site_tests(crawler, site_id)[source]
Parameters:
run_pragmar_sort_tests(crawler, site_id)[source]
Parameters:
run_pragmar_content_tests(crawler, site_id, html_leniency)[source]
Parameters:
run_pragmar_report(crawler, site_id, heading)[source]

Generate a comprehensive report of all resources for a site. Returns a formatted string with counts and URLs by type.

Parameters:

Module contents