mcp_server_webcrawl.crawlers.base package

Submodules

mcp_server_webcrawl.crawlers.base.adapter module

class IndexStatus[source]

Bases: Enum

Status values for the indexing lifecycle.

UNDEFINED = ''
IDLE = 'idle'
INDEXING = 'indexing'
PARTIAL = 'partial'
COMPLETE = 'complete'
REMOTE = 'remote'
FAILED = 'failed'
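The members above round-trip through their string values, which makes them convenient to serialize in API responses. A minimal sketch, with the class redefined locally so the snippet is self-contained (the real class lives in mcp_server_webcrawl.crawlers.base.adapter):

```python
from enum import Enum

# Mirror of the documented IndexStatus values, redefined here so the
# snippet runs standalone.
class IndexStatus(Enum):
    UNDEFINED = ""
    IDLE = "idle"
    INDEXING = "indexing"
    PARTIAL = "partial"
    COMPLETE = "complete"
    REMOTE = "remote"
    FAILED = "failed"

# Constructing from the value returns the singleton member.
status = IndexStatus("indexing")
print(status is IndexStatus.INDEXING)  # True
print(status.value)                    # indexing
```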
class IndexState[source]

Bases: object

Shared state between crawler and manager for indexing progress

status: IndexStatus = ''
processed: int = 0
time_start: datetime | None = None
time_end: datetime | None = None
set_status(status)[source]
Parameters:

status (IndexStatus) –

increment_processed()[source]
property duration: str
is_timeout()[source]

Check if the indexing operation has exceeded the timeout threshold

Return type:

bool

to_dict()[source]

Convert the IndexState to a dictionary representation

Return type:

dict

__init__(status=IndexStatus.UNDEFINED, processed=0, time_start=None, time_end=None)
Parameters:
Return type:

None

class SitesGroup[source]

Bases: object

Container class that supports searching one or more sites at once.

Parameters:
  • datasrc – site datasrc

  • site_ids – site ids of the sites

  • site_paths – paths to site contents (directories)

__init__(datasrc, site_ids, site_paths)[source]

Container class that supports searching one or more sites at once.

Parameters:
  • datasrc (Path) – site datasrc

  • site_ids (list[int]) – site ids of the sites

  • site_paths (list[Path]) – paths to site contents (directories)

Return type:

None

get_sites()[source]
Return type:

dict[int, str]
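A minimal sketch of the container, assuming get_sites() maps each site id to the string form of its content directory; the ids and paths below are hypothetical:

```python
from pathlib import Path

# Simplified stand-in for the documented SitesGroup container.
class SitesGroup:
    def __init__(self, datasrc: Path, site_ids: list[int],
                 site_paths: list[Path]) -> None:
        self.datasrc = datasrc
        self.site_ids = list(site_ids)
        self.site_paths = list(site_paths)

    def get_sites(self) -> dict[int, str]:
        # Pair each id with its content directory, as a string.
        return {sid: str(p) for sid, p in zip(self.site_ids, self.site_paths)}

group = SitesGroup(Path("/data"), [101, 102],
                   [Path("/data/example.com"), Path("/data/pragmar.com")])
print(group.get_sites())
```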

class SitesStat[source]

Bases: object

Basic bookkeeping for troubleshooting.

__init__(group, cached)[source]

Basic bookkeeping for troubleshooting.

Parameters:
Return type:

None

class BaseManager[source]

Bases: object

Base class for managing web crawler data in in-memory SQLite databases. Provides connection pooling and caching for efficient access.

Initialize the manager with statistics.

__init__()[source]

Initialize the manager with statistics.

Return type:

None

static string_to_id(value)[source]

Convert a string, such as a directory name, to a numeric ID suitable for a database primary key.

Hash space and collision probability notes:
  • [:8] = 32 bits (4.29 billion values), ~1% collision chance with 10,000 items

  • [:12] = 48 bits (280 trillion values), ~0.0000001% collision chance with 10,000 items

  • [:16] = 64 bits (max safe SQLite INTEGER), near-zero collision, 9.22 quintillion values

  • SQLite INTEGER type is 64-bit signed, with a max value of 9,223,372,036,854,775,807.

  • The main drawback of larger hash spaces is the length of the IDs they generate for presentation.

Parameters:

value (str) – Input string to convert to an ID

Returns:

Integer ID derived from the input string

Return type:

int
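A hypothetical sketch consistent with the notes above: hash the string and parse the first 12 hex digits (48 bits) as an integer. The hash function and slice width are assumptions; the real implementation may differ in both.

```python
import hashlib

def string_to_id(value: str) -> int:
    # SHA-256 is an assumption; any stable hash works for this pattern.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    # 12 hex digits = 48 bits, well inside SQLite's signed 64-bit INTEGER.
    return int(digest[:12], 16)

site_id = string_to_id("example.com")
print(site_id)  # deterministic for a given input
```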

static get_basic_headers(file_size, resource_type)[source]
Parameters:
Return type:

str

static read_files(paths)[source]
Parameters:

paths (list[Path]) –

Return type:

dict[Path, str | None]

static read_file_contents(file_path, resource_type)[source]

Read content from text files, with error handling and encoding detection.

Return type:

str | None

static decruft_path(path)[source]

Light-touch cleanup of file names: temporary-file suffixes add noise, and extensions are useful for classifying resources.

Parameters:

path (str) –

Return type:

str
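A hypothetical illustration of "light-touch" cleanup: strip a temp-file suffix while preserving the real extension, since extensions drive resource classification. The actual rules in decruft_path may differ.

```python
import re

def decruft_path(path: str) -> str:
    # Drop a trailing ".tmpXXXX" marker, e.g.
    # "page.html.tmp8421" -> "page.html". Pattern is illustrative.
    return re.sub(r"\.tmp\w*$", "", path)

print(decruft_path("about/index.html.tmp8421"))  # about/index.html
```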

get_stats()[source]
Return type:

list[SitesStat]

get_resources_for_sites_group(sites_group, query, fields, sort, limit, offset, swap_values={})[source]

Get resources from directories using structured query parsing with SearchQueryParser.

This method extracts types, fields, and statuses from the querystring instead of accepting them as separate arguments, using the new SearchSubquery functionality.

Parameters:
  • sites_group (SitesGroup) – Group of sites to search in

  • query (str) – Search query string that can include field:value syntax for filtering

  • fields (list[str] | None) – resource fields to be returned by the API (Content, Headers, etc.)

  • sort (str | None) – Sort order for results

  • limit (int) – Maximum number of results to return

  • offset (int) – Number of results to skip for pagination

  • swap_values (dict) – per-field parameterized values to check for (and replace)

Returns:

Tuple of (list of ResourceResult objects, total count, connection_index_state)

Return type:

tuple[list[ResourceResult], int, IndexState]

Notes

Returns empty results if sites is empty or not provided. If the database is being built, it will log a message and return empty results.

This method extracts field-specific filters from the query string using SearchQueryParser:
  • type:html (to filter by resource type)

  • status:200 (to filter by HTTP status)

Any fields present in the SearchSubquery will be included in the response.
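For illustration only, here is a toy split of a query string into field:value filters and plain fulltext terms, in the spirit of the parsing described above. The real SearchQueryParser also handles Boolean operators, quoting, and SearchSubquery objects.

```python
import re

def split_query(query: str) -> tuple[dict, list]:
    # Recognize the two documented filter fields; everything else is a
    # fulltext term. This tokenization is deliberately simplistic.
    filters, terms = {}, []
    for token in query.split():
        match = re.fullmatch(r"(type|status):(\S+)", token)
        if match:
            filters[match.group(1)] = match.group(2)
        else:
            terms.append(token)
    return filters, terms

print(split_query("type:html status:200 crawler"))
# ({'type': 'html', 'status': '200'}, ['crawler'])
```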

mcp_server_webcrawl.crawlers.base.api module

class BaseJsonApiEncoder[source]

Bases: JSONEncoder

Custom JSON encoder for BaseJsonApi objects and ResourceResultType enums.

Constructor for JSONEncoder, with sensible defaults.

If skipkeys is false, then it is a TypeError to attempt encoding of keys that are not str, int, float or None. If skipkeys is True, such items are simply skipped.

If ensure_ascii is true, the output is guaranteed to be str objects with all incoming non-ASCII characters escaped. If ensure_ascii is false, the output can contain non-ASCII characters.

If check_circular is true, then lists, dicts, and custom encoded objects will be checked for circular references during encoding to prevent an infinite recursion (which would cause an OverflowError). Otherwise, no such check takes place.

If allow_nan is true, then NaN, Infinity, and -Infinity will be encoded as such. This behavior is not JSON specification compliant, but is consistent with most JavaScript based encoders and decoders. Otherwise, it will be a ValueError to encode such floats.

If sort_keys is true, then the output of dictionaries will be sorted by key; this is useful for regression tests to ensure that JSON serializations can be compared on a day-to-day basis.

If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0 will only insert newlines. None is the most compact representation.

If specified, separators should be an (item_separator, key_separator) tuple. The default is (', ', ': ') if indent is None and (',', ': ') otherwise. To get the most compact JSON representation, you should specify (',', ':') to eliminate whitespace.

If specified, default is a function that gets called for objects that can’t otherwise be serialized. It should return a JSON encodable version of the object or raise a TypeError.

default(obj)[source]

Override default encoder to handle custom types.

Parameters:

obj – Object to encode

Returns:

JSON serializable representation of the object

Return type:

Any
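The default() override pattern described above can be sketched as follows: emit an Enum's value, call to_dict() on objects that provide it, and defer to the base class for everything else. The ResourceType enum here is a stand-in for ResourceResultType:

```python
import json
from enum import Enum

class ResourceType(Enum):  # hypothetical stand-in for ResourceResultType
    PAGE = "page"

class SketchEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Enum):
            return obj.value
        if hasattr(obj, "to_dict"):
            return obj.to_dict()
        # Unknown types fall through to the base class, which raises
        # TypeError as the contract requires.
        return super().default(obj)

print(json.dumps({"type": ResourceType.PAGE}, cls=SketchEncoder))
# {"type": "page"}
```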

class BaseJsonApi[source]

Bases: object

Base class for JSON API responses.

Provides a standardized structure for API responses including metadata, results, and error handling.

Construct with the arguments of creation; these are echoed back in the JSON response. The object collapses to JSON on json dumps, with everything it contains implementing to_dict.

Parameters:
  • method – API method name

  • args – Dictionary of API arguments

  • index_state – indexing, complete, remote, etc.

__init__(method, args, index_state=None)[source]

Construct with the arguments of creation; these are echoed back in the JSON response. The object collapses to JSON on json dumps, with everything it contains implementing to_dict.

Parameters:
  • method (str) – API method name

  • args (dict[str, Any]) – Dictionary of API arguments

  • index_state (IndexState | None) – indexing, complete, remote, etc.

property total: int

Returns the total number of results.

Returns:

Integer count of total results

get_results()[source]
Return type:

list[SiteResult | ResourceResult]

set_results(results, total, offset, limit)[source]

Set the results of the API response.

Parameters:
  • results (list[SiteResult | ResourceResult]) – List of result objects

  • total (int) – Total number of results (including those beyond limit)

  • offset (int) – Starting position in the full result set

  • limit (int) – Maximum number of results to include

Return type:

None

append_error(message)[source]

Add an error to the JSON response, visible to the endpoint LLM.

Parameters:

message (str) – Error message to add

Return type:

None

to_dict()[source]

Convert the object to a JSON-serializable dictionary.

Returns:

Dictionary representation of the API response

Return type:

dict[str, str | int | float | bool | list[str] | list[int] | list[float] | None]

to_json()[source]

Return a JSON serializable representation of this object.

Returns:

JSON string representation of the API response

Return type:

str
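A hedged sketch of the response envelope implied by the methods above: the method and arguments of creation are echoed back alongside results, pagination counts, and accumulated errors. The field names in to_dict() are illustrative, not the library's actual schema:

```python
import json

class SketchJsonApi:
    def __init__(self, method: str, args: dict) -> None:
        self.method, self.args = method, args
        self.results, self.total = [], 0
        self.offset, self.limit = 0, 20
        self.errors: list[str] = []

    def set_results(self, results, total, offset, limit) -> None:
        self.results, self.total = results, total
        self.offset, self.limit = offset, limit

    def append_error(self, message: str) -> None:
        self.errors.append(message)

    def to_dict(self) -> dict:
        # Echo the arguments of creation alongside the result set.
        return {"method": self.method, "args": self.args,
                "results": self.results, "total": self.total,
                "offset": self.offset, "limit": self.limit,
                "errors": self.errors}

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

api = SketchJsonApi("resources", {"query": "type:html"})
api.set_results([{"id": 1}], total=1, offset=0, limit=20)
print(api.to_json())
```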

mcp_server_webcrawl.crawlers.base.crawler module

class BaseCrawler[source]

Bases: object

Base crawler class that implements MCP server functionality.

This class provides the foundation for specialized crawlers to interact with the MCP server and handle tool operations for web resources.

Initialize the BaseCrawler with a data source path and required adapter functions.

Parameters:
  • datasrc – path to the data source

  • get_sites_func – function to retrieve sites from the data source

  • get_resources_func – function to retrieve resources from the data source

  • resource_field_mapping – mapping of resource field names to display names

__init__(datasrc, get_sites_func, get_resources_func, resource_field_mapping={'content': 'ResourcesFullText.Content', 'created': 'Resources.Created', 'fulltext': 'ResourcesFullText', 'headers': 'ResourcesFullText.Headers', 'id': 'ResourcesFullText.Id', 'modified': 'Resources.Modified', 'site': 'ResourcesFullText.Project', 'size': 'Resources.Size', 'status': 'Resources.Status', 'time': 'Resources.Time', 'type': 'ResourcesFullText.Type', 'url': 'ResourcesFullText.Url'})[source]

Initialize the BaseCrawler with a data source path and required adapter functions.

Parameters:
  • datasrc (Path) – path to the data source

  • get_sites_func (Callable) – function to retrieve sites from the data source

  • get_resources_func (Callable) – function to retrieve resources from the data source

  • resource_field_mapping (dict[str, str]) – mapping of resource field names to display names

Return type:

None

property datasrc: Path
async mcp_list_prompts()[source]

List available prompts (currently none).

Return type:

list

async mcp_list_resources()[source]

List available resources (currently none).

Return type:

list

async serve(stdin, stdout)[source]

Launch the awaitable server.

Parameters:
  • stdin (AsyncFile[str] | None) – input stream for the server

  • stdout (AsyncFile[str] | None) – output stream for the server

Returns:

The MCP server over stdio

Return type:

dict[str, Any]

get_initialization_options()[source]

Get the MCP initialization object.

Returns:

Dictionary containing project information

Return type:

InitializationOptions

get_sites_api_json(**kwargs)[source]

Get sites API result as JSON.

Returns:

JSON string of sites API results

Return type:

str

get_resources_api_json(**kwargs)[source]

Get resources API result as JSON.

Returns:

JSON string of resources API results

Return type:

str

get_sites_api(ids=None, fields=None)[source]
Parameters:
Return type:

BaseJsonApi

get_resources_api(sites=None, query='', fields=None, sort=None, limit=20, offset=0, extras=None)[source]
Parameters:
Return type:

BaseJsonApi

async mcp_list_tools()[source]

List available tools.

Returns:

List of available tools

Raises:

NotImplementedError – This method must be implemented by subclasses

Return type:

list[Tool]

async mcp_call_tool(name, arguments)[source]

Handle tool execution requests. Essentially a passthrough: override it, or call super() and adjust the result.

Parameters:
  • name (str) – name of the tool to call

  • arguments (dict[str, Any] | None) – arguments to pass to the tool

Returns:

List of content objects resulting from the tool execution

Raises:

ValueError – If the specified tool does not exist

Return type:

list[TextContent | ImageContent | EmbeddedResource]

get_thumbnails(results)[source]
Parameters:

results (list[ResourceResult]) –

Return type:

list[ImageContent]

mcp_server_webcrawl.crawlers.base.indexed module

class IndexedManager[source]

Bases: BaseManager

Initialize the manager with statistics.

__init__()[source]

Initialize the manager with statistics.

get_connection(group)[source]

Get database connection for sites in the group, creating if needed.

Parameters:

group (SitesGroup) – group of sites to connect to

Returns:

Tuple of (SQLite connection to in-memory database with data loaded or None if building,

IndexState associated with this database)

Return type:

tuple[Connection | None, IndexState]
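A sketch of the lazy, pooled-connection pattern get_connection() describes: build an in-memory database per site group on first request and reuse it afterward. The cache key and schema here are illustrative assumptions:

```python
import sqlite3

_connections: dict[str, sqlite3.Connection] = {}

def get_connection(group_key: str) -> sqlite3.Connection:
    if group_key not in _connections:
        # First request for this group: build the in-memory database.
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE Resources (Id INTEGER PRIMARY KEY, Url TEXT)")
        _connections[group_key] = conn
    return _connections[group_key]

first = get_connection("example.com")
second = get_connection("example.com")
print(first is second)  # True: same pooled connection
```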

get_sites_for_directories(datasrc, ids=None, fields=None)[source]

List site directories in the datasrc directory as sites.

Parameters:
  • datasrc (Path) – path to the directory containing site subdirectories

  • ids (list[int] | None) – optional list of site IDs to filter by

  • fields (list[str] | None) – optional list of fields to include in the response

Returns:

List of SiteResult objects, one for each site directory

Return type:

list[SiteResult]

Notes

Returns an empty list if the datasrc directory doesn’t exist.

class IndexedCrawler[source]

Bases: BaseCrawler

A crawler implementation for data sources that load into an in-memory SQLite database. Shares common behavior between specialized crawlers.

Initialize the IndexedCrawler with a data source path and required adapter functions.

Parameters:
  • datasrc – path to the data source

  • get_sites_func – function to retrieve sites from the data source

  • get_resources_func – function to retrieve resources from the data source

  • resource_field_mapping – mapping of resource field names to display names

__init__(datasrc, get_sites_func, get_resources_func, resource_field_mapping={'content': 'ResourcesFullText.Content', 'created': 'Resources.Created', 'fulltext': 'ResourcesFullText', 'headers': 'ResourcesFullText.Headers', 'id': 'ResourcesFullText.Id', 'modified': 'Resources.Modified', 'site': 'ResourcesFullText.Project', 'size': 'Resources.Size', 'status': 'Resources.Status', 'time': 'Resources.Time', 'type': 'ResourcesFullText.Type', 'url': 'ResourcesFullText.Url'})[source]

Initialize the IndexedCrawler with a data source path and required adapter functions.

Parameters:
  • datasrc (Path) – path to the data source

  • get_sites_func (Callable) – function to retrieve sites from the data source

  • get_resources_func (Callable) – function to retrieve resources from the data source

  • resource_field_mapping (dict[str, str]) – mapping of resource field names to display names

Return type:

None

async mcp_list_tools()[source]

List available tools for this crawler.

Returns:

List of Tool objects

Return type:

list[Tool]

mcp_server_webcrawl.crawlers.base.tests module

class BaseCrawlerTests[source]

Bases: TestCase

Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.

setUp()[source]

Hook method for setting up the test fixture before exercising it.

run_pragmar_search_tests(crawler, site_id)[source]

Run a battery of database checks on the crawler, including Boolean query validation.

Parameters:
run_pragmar_image_tests(crawler, pragmar_site_id)[source]

Test InterroBot-specific image handling and thumbnails.

Parameters:
run_sites_resources_tests(crawler, pragmar_site_id, example_site_id)[source]
Parameters:
run_pragmar_tokenizer_tests(crawler, site_id)[source]

FTS hyphens and underscores are particularly challenging and thus have a dedicated test. These must be configured in multiple places, including the CREATE TABLE … tokenizer, and handled by the query parser.

Parameters:
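The tokenizer concern behind that test can be demonstrated directly with SQLite FTS5: the default unicode61 tokenizer splits at hyphens and underscores, while adding tokenchars keeps them inside a single token. This is a standalone illustration, not the library's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# tokenchars tells unicode61 to treat '-' and '_' as token characters,
# so "mcp-server_webcrawl" indexes as one token instead of three.
conn.execute(
    "CREATE VIRTUAL TABLE docs USING fts5("
    "body, tokenize=\"unicode61 tokenchars '-_'\")"
)
conn.execute("INSERT INTO docs VALUES ('mcp-server_webcrawl docs')")
hit = conn.execute(
    "SELECT body FROM docs WHERE docs MATCH '\"mcp-server_webcrawl\"'"
).fetchone()
print(hit)  # ('mcp-server_webcrawl docs',)
```

The same tokenchars choice must be mirrored on the query side, which is why the query parser handling is called out above.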
run_pragmar_site_tests(crawler, site_id)[source]
Parameters:
run_pragmar_sort_tests(crawler, site_id)[source]
Parameters:
run_pragmar_content_tests(crawler, site_id, html_leniency)[source]
Parameters:
run_pragmar_report(crawler, site_id, heading)[source]

Generate a comprehensive report of all resources for a site. Returns a formatted string with counts and URLs by type.

Parameters:

Module contents