mcp_server_webcrawl.crawlers.base package
Submodules
mcp_server_webcrawl.crawlers.base.adapter module
- class IndexStatus[source]
Bases:
Enum
Enumeration of indexing statuses.
- UNDEFINED = ''
- IDLE = 'idle'
- INDEXING = 'indexing'
- PARTIAL = 'partial'
- COMPLETE = 'complete'
- REMOTE = 'remote'
- FAILED = 'failed'
- class IndexState[source]
Bases:
object
Shared state between crawler and manager for indexing progress
- status: IndexStatus = ''
- set_status(status)[source]
- Parameters:
status (IndexStatus) –
- is_timeout()[source]
Check if the indexing operation has exceeded the timeout threshold
- Return type:
bool
- __init__(status=IndexStatus.UNDEFINED, processed=0, time_start=None, time_end=None)
- Parameters:
status (IndexStatus) –
processed (int) –
time_start (datetime | None) –
time_end (datetime | None) –
- Return type:
None
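A minimal usage sketch (illustrative only; the timeout threshold checked by is_timeout is internal to the class):

    from mcp_server_webcrawl.crawlers.base.adapter import IndexState, IndexStatus

    # crawler side: record progress while the in-memory index is built
    state = IndexState()
    state.set_status(IndexStatus.INDEXING)

    # manager side: poll the shared state and bail out if indexing stalls
    if state.is_timeout():
        state.set_status(IndexStatus.FAILED)
    elif state.status == IndexStatus.COMPLETE:
        print("index ready")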
- class SitesGroup[source]
Bases:
object
Container class that supports searching one or more sites at once.
- Parameters:
datasrc – path to the site data source
site_ids – IDs of the sites in the group
site_paths – paths to the site contents (directories)
- class SitesStat[source]
Bases:
object
Basic bookkeeping, for troubleshooting.
- __init__(group, cached)[source]
Basic bookkeeping, for troubleshooting.
- Parameters:
group (SitesGroup) –
cached (bool) –
- Return type:
None
- class BaseManager[source]
Bases:
object
Base class for managing web crawler data in in-memory SQLite databases. Provides connection pooling and caching for efficient access.
Initialize the manager with statistics.
- static string_to_id(value)[source]
Convert a string, such as a directory name, to a numeric ID suitable for a database primary key.
Hash space and collision probability notes:
- [:8] = 32 bits (4.29 billion values), ~1% collision chance with 10,000 items
- [:12] = 48 bits (280 trillion values), ~0.0000001% collision chance with 10,000 items
- [:16] = 64 bits (max safe SQLite INTEGER), near-zero collision, 9.22 quintillion values
- SQLite INTEGER type is 64-bit signed, with a max value of 9,223,372,036,854,775,807.
- The big problem with larger hash spaces is the length of the IDs they generate for presentation.
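The trade-off described above reads roughly like this sketch (a hypothetical stand-in, not the actual implementation; the hash function and slice length are assumptions):

    import hashlib

    def string_to_id_sketch(value: str, hex_chars: int = 12) -> int:
        # hash the string and keep the first N hex characters;
        # 12 hex chars = 48 bits, balancing collision risk against
        # the length of the ids shown in results
        digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
        return int(digest[:hex_chars], 16)

    print(string_to_id_sketch("example.com"))  # stable numeric primary key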
- static get_basic_headers(file_size, resource_type)[source]
- Parameters:
file_size (int) –
resource_type (ResourceResultType) –
- Return type:
- static read_file_contents(file_path, resource_type)[source]
Read content from text files, with error handling and encoding detection.
- Return type:
str | None
- static decruft_path(path)[source]
Light-touch cleanup of file naming; temporary files create noise, and extensions are useful in classifying resources.
- get_resources_for_sites_group(sites_group, query, fields, sort, limit, offset, swap_values={})[source]
Get resources from directories using structured query parsing with SearchQueryParser.
Using the SearchSubquery functionality, this method extracts types, fields, and statuses from the query string instead of accepting them as separate arguments.
- Parameters:
sites_group (SitesGroup) – Group of sites to search in
query (str) – Search query string that can include field:value syntax for filtering
fields (list[str] | None) – resource fields to be returned by the API (Content, Headers, etc.)
sort (str | None) – Sort order for results
limit (int) – Maximum number of results to return
offset (int) – Number of results to skip for pagination
swap_values (dict) – per-field parameterized values to check for (and replace)
- Returns:
Tuple of (list of ResourceResult objects, total count, connection_index_state)
- Return type:
tuple[list[ResourceResult], int, IndexState]
Notes
Returns empty results if sites is empty or not provided. If the database is being built, it will log a message and return empty results.
This method extracts field-specific filters from the query string using SearchQueryParser, for example:
- type:html (filter by resource type)
- status:200 (filter by HTTP status)
Any fields present in the SearchSubquery will be included in the response.
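An illustrative call, assuming a BaseManager subclass instance and a previously constructed SitesGroup (the query values and the url attribute access are hypothetical):

    # search the group for HTML pages that returned HTTP 200 and mention "privacy"
    results, total, index_state = manager.get_resources_for_sites_group(
        sites_group=group,
        query="privacy AND type:html AND status:200",
        fields=["content", "headers"],
        sort=None,
        limit=20,
        offset=0,
    )
    for resource in results:
        print(resource.url)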
mcp_server_webcrawl.crawlers.base.api module
- class BaseJsonApiEncoder[source]
Bases:
JSONEncoder
Custom JSON encoder for BaseJsonApi objects and ResourceResultType enums.
Constructor for JSONEncoder, with sensible defaults.
If skipkeys is false, then it is a TypeError to attempt encoding of keys that are not str, int, float or None. If skipkeys is True, such items are simply skipped.
If ensure_ascii is true, the output is guaranteed to be str objects with all incoming non-ASCII characters escaped. If ensure_ascii is false, the output can contain non-ASCII characters.
If check_circular is true, then lists, dicts, and custom encoded objects will be checked for circular references during encoding to prevent an infinite recursion (which would cause an OverflowError). Otherwise, no such check takes place.
If allow_nan is true, then NaN, Infinity, and -Infinity will be encoded as such. This behavior is not JSON specification compliant, but is consistent with most JavaScript based encoders and decoders. Otherwise, it will be a ValueError to encode such floats.
If sort_keys is true, then the output of dictionaries will be sorted by key; this is useful for regression tests to ensure that JSON serializations can be compared on a day-to-day basis.
If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0 will only insert newlines. None is the most compact representation.
If specified, separators should be an (item_separator, key_separator) tuple. The default is (', ', ': ') if indent is None and (',', ': ') otherwise. To get the most compact JSON representation, you should specify (',', ':') to eliminate whitespace.
If specified, default is a function that gets called for objects that can't otherwise be serialized. It should return a JSON encodable version of the object or raise a TypeError.
- class BaseJsonApi[source]
Bases:
object
Base class for JSON API responses.
Provides a standardized structure for API responses including metadata, results, and error handling.
Construct with the arguments of creation (aoc); these will be echoed back in the JSON response. The object collapses into JSON on json.dumps, which works because everything within implements to_dict.
- Parameters:
method – API method name
args – Dictionary of API arguments
index_state – indexing, complete, remote, etc.
- __init__(method, args, index_state=None)[source]
Construct with the arguments of creation (aoc); these will be echoed back in the JSON response. The object collapses into JSON on json.dumps, which works because everything within implements to_dict.
- Parameters:
method (str) – API method name
args (dict) – Dictionary of API arguments
index_state (IndexState | None) – indexing, complete, remote, etc.
- set_results(results, total, offset, limit)[source]
Set the results of the API response.
- Parameters:
results (list[SiteResult | ResourceResult]) – List of result objects
total (int) – Total number of results (including those beyond limit)
offset (int) – Starting position in the full result set
limit (int) – Maximum number of results to include
- Return type:
None
- append_error(message)[source]
Add an error to the JSON response, visible to the endpoint LLM.
- Parameters:
message (str) – Error message to add
- Return type:
None
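A sketch of the intended flow, assuming the encoder handles BaseJsonApi via its default hook (the method name and argument values are hypothetical):

    import json

    from mcp_server_webcrawl.crawlers.base.api import BaseJsonApi, BaseJsonApiEncoder

    api = BaseJsonApi(method="webcrawl_search", args={"query": "privacy", "limit": 20})
    api.set_results(results=[], total=0, offset=0, limit=20)
    api.append_error("example error visible to the endpoint LLM")

    # everything within implements to_dict, so the object collapses to JSON
    print(json.dumps(api, cls=BaseJsonApiEncoder, indent=2))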
mcp_server_webcrawl.crawlers.base.crawler module
- class BaseCrawler[source]
Bases:
object
Base crawler class that implements MCP server functionality.
This class provides the foundation for specialized crawlers to interact with the MCP server and handle tool operations for web resources.
Initialize the BaseCrawler with a data source path and required adapter functions.
- Parameters:
datasrc – path to the data source
get_sites_func – function to retrieve sites from the data source
get_resources_func – function to retrieve resources from the data source
resource_field_mapping – mapping of resource field names to display names
- __init__(datasrc, get_sites_func, get_resources_func, resource_field_mapping={'content': 'ResourcesFullText.Content', 'created': 'Resources.Created', 'fulltext': 'ResourcesFullText', 'headers': 'ResourcesFullText.Headers', 'id': 'ResourcesFullText.Id', 'modified': 'Resources.Modified', 'site': 'ResourcesFullText.Project', 'size': 'Resources.Size', 'status': 'Resources.Status', 'time': 'Resources.Time', 'type': 'ResourcesFullText.Type', 'url': 'ResourcesFullText.Url'})[source]
Initialize the BaseCrawler with a data source path and required adapter functions.
- Parameters:
- Return type:
None
- get_initialization_options()[source]
Get the MCP initialization object.
- Returns:
Dictionary containing project information
- Return type:
InitializationOptions
- get_sites_api_json(**kwargs)[source]
Get sites API result as JSON.
- Returns:
JSON string of sites API results
- Return type:
str
- get_resources_api_json(**kwargs)[source]
Get resources API result as JSON.
- Returns:
JSON string of resources API results
- Return type:
str
- get_resources_api(sites=None, query='', fields=None, sort=None, limit=20, offset=0, extras=None)[source]
- async mcp_list_tools()[source]
List available tools.
- Returns:
List of available tools
- Raises:
NotImplementedError – This method must be implemented by subclasses
- Return type:
list[Tool]
- async mcp_call_tool(name, arguments)[source]
Handle tool execution requests. You can override this, or call super() and then tweak the result; it is essentially a passthrough.
- Parameters:
- Returns:
List of content objects resulting from the tool execution
- Raises:
ValueError – If the specified tool does not exist
- Return type:
list[TextContent | ImageContent | EmbeddedResource]
- get_thumbnails(results)[source]
- Parameters:
results (list[ResourceResult]) –
- Return type:
list[ImageContent]
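For orientation, a subclass might wire in its adapter functions and override mcp_call_tool as in this sketch (the adapter functions and their return shapes are assumptions):

    from mcp_server_webcrawl.crawlers.base.crawler import BaseCrawler

    def get_my_sites(*args, **kwargs):
        # hypothetical adapter: return SiteResult objects from the data source
        return []

    def get_my_resources(*args, **kwargs):
        # hypothetical adapter: return (resources, total, index_state)
        return [], 0, None

    class MyCrawler(BaseCrawler):
        def __init__(self, datasrc):
            super().__init__(datasrc, get_my_sites, get_my_resources)

        async def mcp_call_tool(self, name, arguments):
            # essentially a passthrough: delegate to the base implementation,
            # then tweak the returned content objects if needed
            return await super().mcp_call_tool(name, arguments)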
mcp_server_webcrawl.crawlers.base.indexed module
- class IndexedManager[source]
Bases:
BaseManager
Initialize the manager with statistics.
- get_connection(group)[source]
Get database connection for sites in the group, creating if needed.
- Parameters:
group (SitesGroup) – group of sites to connect to
- Returns:
Tuple of (SQLite connection to the in-memory database with data loaded, or None if still building; IndexState associated with this database)
- Return type:
tuple[Connection | None, IndexState]
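Callers are expected to handle the still-building case, roughly as in this sketch (the table and column names are assumptions drawn from the field mapping above):

    connection, index_state = manager.get_connection(group)
    if connection is None:
        # the in-memory database is still being built; retry later
        print(f"index status: {index_state.status}")
    else:
        cursor = connection.execute("SELECT Url FROM ResourcesFullText LIMIT 5")
        for (url,) in cursor.fetchall():
            print(url)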
- class IndexedCrawler[source]
Bases:
BaseCrawler
A crawler implementation for data sources that load into an in-memory SQLite database. Shares common functionality between specialized crawlers.
Initialize the IndexedCrawler with a data source path and required adapter functions.
- Parameters:
datasrc – path to the data source
get_sites_func – function to retrieve sites from the data source
get_resources_func – function to retrieve resources from the data source
resource_field_mapping – mapping of resource field names to display names
- __init__(datasrc, get_sites_func, get_resources_func, resource_field_mapping={'content': 'ResourcesFullText.Content', 'created': 'Resources.Created', 'fulltext': 'ResourcesFullText', 'headers': 'ResourcesFullText.Headers', 'id': 'ResourcesFullText.Id', 'modified': 'Resources.Modified', 'site': 'ResourcesFullText.Project', 'size': 'Resources.Size', 'status': 'Resources.Status', 'time': 'Resources.Time', 'type': 'ResourcesFullText.Type', 'url': 'ResourcesFullText.Url'})[source]
Initialize the IndexedCrawler with a data source path and required adapter functions.
- Parameters:
- Return type:
None
mcp_server_webcrawl.crawlers.base.tests module
- class BaseCrawlerTests[source]
Bases:
TestCase
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
- run_pragmar_search_tests(crawler, site_id)[source]
Run a battery of database checks on the crawler, including Boolean query validation.
- Parameters:
crawler (BaseCrawler) –
site_id (int) –
- run_pragmar_image_tests(crawler, pragmar_site_id)[source]
Test InterroBot-specific image handling and thumbnails.
- Parameters:
crawler (BaseCrawler) –
pragmar_site_id (int) –
- run_sites_resources_tests(crawler, pragmar_site_id, example_site_id)[source]
- Parameters:
crawler (BaseCrawler) –
pragmar_site_id (int) –
example_site_id (int) –
- run_pragmar_tokenizer_tests(crawler, site_id)[source]
FTS hyphens and underscores are particularly challenging and thus have a dedicated test. They must be configured in multiple places, including the CREATE TABLE … tokenizer, and must also be handled by the query parser (see the sketch after this entry).
- Parameters:
crawler (BaseCrawler) –
site_id (int) –
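A minimal illustration of the tokenizer configuration the docstring refers to (the table name and columns are assumptions; the tokenchars option is the point):

    import sqlite3

    connection = sqlite3.connect(":memory:")
    # declaring "-" and "_" as token characters keeps terms like
    # "mcp-server" and "full_text" searchable as single tokens
    connection.execute(
        "CREATE VIRTUAL TABLE ResourcesFullText USING fts5("
        "Url, Content, tokenize=\"unicode61 tokenchars '-_'\")"
    )
    connection.execute(
        "INSERT INTO ResourcesFullText (Url, Content) VALUES (?, ?)",
        ("https://example.com/", "notes on mcp-server and full_text search"),
    )
    rows = connection.execute(
        "SELECT Url FROM ResourcesFullText WHERE ResourcesFullText MATCH ?",
        ('"mcp-server"',),
    ).fetchall()
    print(rows)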
- run_pragmar_site_tests(crawler, site_id)[source]
- Parameters:
crawler (BaseCrawler) –
site_id (int) –
- run_pragmar_sort_tests(crawler, site_id)[source]
- Parameters:
crawler (BaseCrawler) –
site_id (int) –
- run_pragmar_content_tests(crawler, site_id, html_leniency)[source]
- Parameters:
crawler (BaseCrawler) –
site_id (int) –
html_leniency (bool) –
- run_pragmar_report(crawler, site_id, heading)[source]
Generate a comprehensive report of all resources for a site. Returns a formatted string with counts and URLs by type.
- Parameters:
crawler (BaseCrawler) –
site_id (int) –
heading (str) –