mcp_server_webcrawl.crawlers.siteone package
Submodules
mcp_server_webcrawl.crawlers.siteone.adapter module
- class SiteOneManager[source]
Bases:
BaseManager
Manages SiteOne directory data in in-memory SQLite databases. Wraps the wget archive format (shared by SiteOne and wget) and provides connection pooling and caching for efficient access.
Initialize the SiteOne manager with empty cache and statistics.
- get_resources(datasrc, ids=None, sites=None, query='', types=None, fields=None, statuses=None, sort=None, limit=20, offset=0)[source]
Get resources from wget directories using in-memory SQLite.
- Parameters:
datasrc (Path) – Path to the directory containing wget captures
ids (list[int] | None) – Optional list of resource IDs to filter by
sites (list[int] | None) – Optional list of site IDs to filter by
query (str) – Search query string
types (list[ResourceResultType] | None) – Optional list of resource types to filter by
fields (list[str] | None) – Optional list of fields to include in response
statuses (list[int] | None) – Optional list of HTTP status codes to filter by
sort (str | None) – Sort order for results
limit (int) – Maximum number of results to return
offset (int) – Number of results to skip for pagination
- Returns:
Tuple of (list of ResourceResult objects, total count)
- Return type:
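
The limit/offset pagination and the (results, total count) return shape described above can be sketched with the standard-library sqlite3 module. This is a minimal illustration, not the SiteOneManager implementation: the resources table, its columns, the fetch_page helper, and the use of LIKE as a stand-in for the search query are all assumptions made for the example.

```python
import sqlite3

def fetch_page(conn, query="", statuses=None, limit=20, offset=0):
    """Hypothetical helper: return (rows, total) for a filtered, paginated lookup."""
    clauses, params = [], []
    if query:
        # LIKE is a stand-in here; the real search syntax is not shown in these docs.
        clauses.append("url LIKE ?")
        params.append(f"%{query}%")
    if statuses:
        placeholders = ", ".join("?" for _ in statuses)
        clauses.append(f"status IN ({placeholders})")
        params.extend(statuses)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    # The total counts every match, independent of limit and offset.
    total = conn.execute(
        f"SELECT COUNT(*) FROM resources{where}", params
    ).fetchone()[0]
    rows = conn.execute(
        f"SELECT id, url, status FROM resources{where} ORDER BY id LIMIT ? OFFSET ?",
        params + [limit, offset],
    ).fetchall()
    return rows, total

# Build a throwaway in-memory database with an assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resources (id INTEGER PRIMARY KEY, url TEXT, status INTEGER)"
)
conn.executemany(
    "INSERT INTO resources (url, status) VALUES (?, ?)",
    [(f"https://example.com/page{i}", 200 if i % 2 == 0 else 404) for i in range(5)],
)

# First page of two results among the resources with HTTP status 200.
rows, total = fetch_page(conn, statuses=[200], limit=2, offset=0)
```

Returning the total alongside the page lets a caller work out how many further pages exist without issuing a second request.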
mcp_server_webcrawl.crawlers.siteone.crawler module
- class SiteOneCrawler[source]
Bases:
IndexedCrawler
A crawler implementation for SiteOne captured sites. Provides functionality for accessing and searching web content from SiteOne captures. SiteOne merges a wget archive with a custom SiteOne-generated log to acquire more fields than wget can alone.
Initialize the SiteOne crawler with a data source directory.
- Parameters:
datasrc – The data source as a Path; it must be a directory containing SiteOne captures organized as subdirectories
- Raises:
AssertionError – If datasrc is None or not a directory
- __init__(datasrc)[source]
Initialize the SiteOne crawler with a data source directory.
- Parameters:
datasrc (Path) – The data source as a Path; it must be a directory containing SiteOne captures organized as subdirectories
- Raises:
AssertionError – If datasrc is None or not a directory
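
The datasrc contract above (not None, must be a directory, captures stored as subdirectories) can be sketched as follows. The helper names validate_datasrc and list_capture_dirs are hypothetical and are not part of the package; the sketch only mirrors the documented AssertionError behavior and the one-subdirectory-per-capture layout.

```python
from pathlib import Path
import tempfile

def validate_datasrc(datasrc):
    """Hypothetical check mirroring the documented AssertionError conditions."""
    assert datasrc is not None, "datasrc is required"
    datasrc = Path(datasrc)
    assert datasrc.is_dir(), f"{datasrc} is not a directory"
    return datasrc

def list_capture_dirs(datasrc):
    """Hypothetical helper: names of the subdirectories, one per capture."""
    root = validate_datasrc(datasrc)
    return sorted(p.name for p in root.iterdir() if p.is_dir())

# Build a throwaway layout: two capture subdirectories and one stray file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "example.com").mkdir()
    (root / "docs.example.com").mkdir()
    (root / "notes.txt").write_text("not a capture")
    captures = list_capture_dirs(root)

# A missing data source is rejected with AssertionError.
try:
    validate_datasrc(None)
    rejected_none = False
except AssertionError:
    rejected_none = True
```

Note that assert statements are skipped when Python runs with the -O flag, so a check written this way is only enforced in unoptimized runs.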
mcp_server_webcrawl.crawlers.siteone.tests module
- class SiteOneTests[source]
Bases:
BaseCrawlerTests
Test suite for the SiteOne crawler implementation.
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
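
The constructor behavior described above is inherited from the standard library's unittest.TestCase. A minimal sketch using a stand-in class (ExampleTests is hypothetical and unrelated to SiteOneTests):

```python
import unittest

class ExampleTests(unittest.TestCase):
    """Stand-in test case used only to show the constructor contract."""

    def test_truth(self):
        self.assertTrue(True)

# Naming an existing test method yields an instance bound to that method.
case = ExampleTests("test_truth")

# Naming a method that does not exist raises ValueError.
try:
    ExampleTests("test_missing")
    raised = False
except ValueError:
    raised = True
```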