mcp_server_webcrawl.crawlers.siteone package

Submodules

mcp_server_webcrawl.crawlers.siteone.adapter module

class SiteOneManager[source]

Bases: BaseManager

Manages SiteOne directory data in in-memory SQLite databases. Wraps the wget archive format (shared by SiteOne and wget). Provides connection pooling and caching for efficient access.

__init__()[source]

Initialize the SiteOne manager with empty cache and statistics.

Return type:

None

get_resources(datasrc, ids=None, sites=None, query='', types=None, fields=None, statuses=None, sort=None, limit=20, offset=0)[source]

Get resources from SiteOne (wget-format) directories using in-memory SQLite.

Parameters:
  • datasrc (Path) – Path to the directory containing wget captures

  • ids (list[int] | None) – Optional list of resource IDs to filter by

  • sites (list[int] | None) – Optional list of site IDs to filter by

  • query (str) – Search query string

  • types (list[ResourceResultType] | None) – Optional list of resource types to filter by

  • fields (list[str] | None) – Optional list of fields to include in response

  • statuses (list[int] | None) – Optional list of HTTP status codes to filter by

  • sort (str | None) – Sort order for results

  • limit (int) – Maximum number of results to return

  • offset (int) – Number of results to skip for pagination

Returns:

Tuple of (list of ResourceResult objects, total count)

Return type:

Tuple[list[ResourceResult], int]
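
The filter and pagination contract described above (a page of results plus a total count) can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the resources table, its columns, the LIKE-based matching, and the get_resources_sketch function are all assumptions made for the example.

```python
import sqlite3


def get_resources_sketch(conn, query="", statuses=None, limit=20, offset=0):
    """Hypothetical stand-in returning (rows, total), like the documented tuple."""
    where, params = [], []
    if query:
        # Assumption: a plain substring match stands in for the real query syntax.
        where.append("url LIKE ?")
        params.append(f"%{query}%")
    if statuses:
        placeholders = ", ".join("?" for _ in statuses)
        where.append(f"status IN ({placeholders})")
        params.extend(statuses)
    clause = f" WHERE {' AND '.join(where)}" if where else ""

    # The total ignores limit/offset so callers can compute the page count.
    total = conn.execute(
        f"SELECT COUNT(*) FROM resources{clause}", params
    ).fetchone()[0]
    rows = conn.execute(
        f"SELECT id, url, status FROM resources{clause} "
        "ORDER BY id LIMIT ? OFFSET ?",
        [*params, limit, offset],
    ).fetchall()
    return rows, total


# Build a small in-memory database with invented sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resources (id INTEGER PRIMARY KEY, url TEXT, status INTEGER)"
)
conn.executemany(
    "INSERT INTO resources (url, status) VALUES (?, ?)",
    [(f"https://example.com/page{i}", 200 if i % 2 == 0 else 404) for i in range(10)],
)

# Second page of two results among the rows with status 200.
page, total = get_resources_sketch(
    conn, query="page", statuses=[200], limit=2, offset=2
)
```

Returning the total alongside the page lets a caller decide whether another request with a larger offset is needed without issuing a separate count query.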

mcp_server_webcrawl.crawlers.siteone.crawler module

class SiteOneCrawler[source]

Bases: IndexedCrawler

A crawler implementation for SiteOne captured sites. Provides functionality for accessing and searching web content from SiteOne captures. SiteOne merges a wget archive with a custom SiteOne-generated log to acquire more fields than wget can alone.

Initialize the SiteOne crawler with a data source directory.

Parameters:

datasrc – Path to a directory containing SiteOne captures organized as subdirectories

Raises:

AssertionError – If datasrc is None or not a directory

__init__(datasrc)[source]

Initialize the SiteOne crawler with a data source directory.

Parameters:

datasrc (Path) – Path to a directory containing SiteOne captures organized as subdirectories

Raises:

AssertionError – If datasrc is None or not a directory
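
The documented precondition on datasrc (not None, and a directory whose subdirectories are captures) can be sketched as follows. The helpers check_datasrc and list_capture_dirs are hypothetical names for this illustration and are not part of the package; the directory names are invented.

```python
import tempfile
from pathlib import Path


def check_datasrc(datasrc):
    """Hypothetical helper mirroring the documented AssertionError contract."""
    assert datasrc is not None, "datasrc is required"
    assert Path(datasrc).is_dir(), f"{datasrc} is not a directory"
    return Path(datasrc)


def list_capture_dirs(datasrc):
    """Return the names of subdirectories, each assumed to be one capture."""
    root = check_datasrc(datasrc)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Build a throwaway data source with two capture directories and a stray file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "example.com").mkdir()
    (root / "docs.example.org").mkdir()
    (root / "notes.txt").write_text("not a capture")
    captures = list_capture_dirs(root)
```

Note that assert statements are skipped when Python runs with the -O flag, so a check like this only guards against misconfiguration in normal runs.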

mcp_server_webcrawl.crawlers.siteone.tests module

class SiteOneTests[source]

Bases: BaseCrawlerTests

Test suite for the SiteOne crawler implementation.

Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.

setUp()[source]

Set up the test environment with fixture data.

test_siteone_pulse()[source]

Test basic crawler initialization.

test_siteone_sites()[source]

Test site retrieval API functionality.

test_siteone_resources()[source]

Test resource retrieval API functionality with various parameters.

test_siteone_random_sort()[source]

Test the random sort functionality using the ‘?’ sort parameter.
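
One way a '?' sort value could map onto an SQLite query is ORDER BY RANDOM(). The sketch below assumes that mapping; the order_clause helper, the fallback ordering, and the resources table are hypothetical and are not taken from the package.

```python
import sqlite3


def order_clause(sort):
    """Hypothetical mapping from a sort parameter to an ORDER BY clause."""
    if sort == "?":
        # SQLite's RANDOM() yields a fresh value per row, shuffling the result.
        return "ORDER BY RANDOM()"
    # Assumption: fall back to a stable ordering by id.
    return "ORDER BY id"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO resources (url) VALUES (?)",
    [(f"https://example.com/{i}",) for i in range(20)],
)

ordered_ids = [
    row[0] for row in conn.execute(f"SELECT id FROM resources {order_clause(None)}")
]
random_ids = [
    row[0] for row in conn.execute(f"SELECT id FROM resources {order_clause('?')}")
]
```

A test for random ordering cannot rely on a specific sequence; comparing the shuffled IDs as a set against the stable ordering is a check that holds on every run.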

Module contents