mcp_server_webcrawl.crawlers.httrack package

Submodules

mcp_server_webcrawl.crawlers.httrack.adapter module

class HtTrackManager[source]

Bases: IndexedManager

Manages HTTrack project data in in-memory SQLite databases.

__init__()[source]

Initialize the HTTrack manager with empty cache and statistics.

Return type:

None

get_sites(datasrc, ids=None, fields=None)[source]

List HTTrack project directories as sites.

Parameters:
  • datasrc (Path) – path to the directory containing HTTrack projects

  • ids (list[int] | None) – optional list of site IDs to filter by

  • fields (list[str] | None) – optional list of fields to include in the response

Returns:

List of SiteResult objects, one for each HTTrack project

Return type:

list[SiteResult]
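
The directory-to-site mapping described above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the package's implementation: list_project_dirs is a hypothetical stand-in for get_sites, the returned dictionaries stand in for SiteResult objects, and the ID scheme (1-based position in sorted order) is invented for the example.

```python
from pathlib import Path
from typing import Optional


def list_project_dirs(
    datasrc: Path,
    ids: Optional[list[int]] = None,
    fields: Optional[list[str]] = None,
) -> list[dict]:
    """Hypothetical stand-in for get_sites: one record per project directory."""
    records = []
    # Only subdirectories count as projects; loose files are ignored.
    subdirs = sorted(p for p in datasrc.iterdir() if p.is_dir())
    for position, project in enumerate(subdirs, start=1):
        # Assumption: the site ID is the 1-based position in sorted order.
        if ids is not None and position not in ids:
            continue
        record = {"id": position, "name": project.name, "path": str(project)}
        if fields is not None:
            # Always keep "id"; include other keys only when requested.
            record = {k: v for k, v in record.items() if k == "id" or k in fields}
        records.append(record)
    return records
```

Calling list_project_dirs(datasrc) with no filters returns every project directory; passing ids or fields narrows the rows or the keys, mirroring the optional arguments documented above.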

get_resources(datasrc, sites=None, query='', fields=None, sort=None, limit=20, offset=0)[source]

Get resources from HTTrack project directories using in-memory SQLite.

Parameters:
  • datasrc (Path) – path to the directory containing HTTrack projects

  • sites (list[int] | None) – optional list of site IDs to filter by

  • query (str) – search query string

  • fields (list[str] | None) – optional list of fields to include in the response

  • sort (str | None) – sort order for results

  • limit (int) – maximum number of results to return

  • offset (int) – number of results to skip for pagination

Returns:

Tuple of (list of ResourceResult objects, total count, IndexState)

Return type:

tuple[list[ResourceResult], int, IndexState]
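
To make the interplay of query, sort, limit, offset and the returned total concrete, here is a minimal sketch against an in-memory SQLite database using the standard sqlite3 module. Everything in it is an assumption made for illustration: the resources table, its columns, the search_resources helper, and the "+url" / "-url" sort values are hypothetical, a plain LIKE match stands in for whatever search the package actually performs, and the IndexState element of the real return tuple is omitted.

```python
import sqlite3


def search_resources(connection, query="", sort=None, limit=20, offset=0):
    """Hypothetical helper: return (rows for one page, total matching rows)."""
    where, params = "", []
    if query:
        # Simplification: substring match on the URL column.
        where = "WHERE url LIKE ?"
        params.append(f"%{query}%")
    # The total ignores limit/offset so callers can compute page counts.
    total = connection.execute(
        f"SELECT COUNT(*) FROM resources {where}", params
    ).fetchone()[0]
    # Map sort values through a whitelist; never interpolate user input.
    order = {"+url": "url ASC", "-url": "url DESC"}.get(sort, "id ASC")
    rows = connection.execute(
        f"SELECT id, url FROM resources {where} ORDER BY {order} LIMIT ? OFFSET ?",
        params + [limit, offset],
    ).fetchall()
    return rows, total


# Build a throwaway in-memory database with five example rows.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, url TEXT)")
connection.executemany(
    "INSERT INTO resources (url) VALUES (?)",
    [(f"https://example.com/page-{n}",) for n in range(1, 6)],
)

# Second page of two results: rows 3 and 4, with a total of 5 matches.
page, total = search_resources(connection, query="page", limit=2, offset=2)
```

Returning the total alongside the page is what lets a caller paginate: with limit=2 and a total of 5, there are three pages, and offset advances by limit for each one.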

mcp_server_webcrawl.crawlers.httrack.crawler module

class HtTrackCrawler[source]

Bases: IndexedCrawler

A crawler implementation for HTTrack captured sites. Provides functionality for accessing and searching web content from HTTrack projects. HTTrack creates offline mirrors of websites with preserved directory structure and metadata in hts-log.txt files.

__init__(datasrc)[source]

Initialize the HTTrack crawler with a data source directory.

Parameters:

datasrc (Path) – Path to the data source. It must be a directory containing HTTrack project directories, each of which may contain multiple domains

Raises:

AssertionError – If datasrc is None or not a directory
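
The documented failure mode (an AssertionError when datasrc is None or not a directory) can be illustrated with a small stand-in class. DirectoryBackedCrawler is a hypothetical name used only for this sketch; it shows the assertion pattern, not HtTrackCrawler's actual code.

```python
from pathlib import Path


class DirectoryBackedCrawler:
    """Hypothetical stand-in showing assertion-based validation of datasrc."""

    def __init__(self, datasrc: Path) -> None:
        # Both checks raise AssertionError, matching the documented behavior.
        assert datasrc is not None, "datasrc is required"
        assert datasrc.is_dir(), f"datasrc must be a directory: {datasrc}"
        self.datasrc = datasrc
```

One consequence of validating with assert is that the checks are skipped when Python runs with the -O flag, so callers should not rely on them as the only guard in optimized deployments.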

mcp_server_webcrawl.crawlers.httrack.tests module

class HtTrackTests[source]

Bases: BaseCrawlerTests

Test suite for the HTTrack crawler implementation. Runs the wrapped test methods from BaseCrawlerTests and adds HTTrack-specific tests.

setUp()[source]

Set up the test environment with fixture data.

test_httrack_pulse()[source]

Test basic crawler initialization.

test_httrack_sites()[source]

Test site retrieval API functionality.

Test boolean search functionality.

test_httrack_resources()[source]

Test resource retrieval API functionality with various arguments.

test_httrack_images()[source]

Test HTTrack image handling and thumbnails.

test_httrack_sorts()[source]

Test random sort functionality using the sort argument.

test_httrack_content_parsing()[source]

Test content type detection and parsing.

test_httrack_tokenizer()[source]

Test HTTrack-specific tokenizer functionality for hyphenated terms.

test_httrack_log_parsing_features()[source]

Test HTTrack-specific features related to hts-log.txt parsing.

test_httrack_url_reconstruction()[source]

Test HTTrack URL reconstruction from project and domain structure.

test_httrack_domain_detection()[source]

Test HTTrack domain directory detection and multi-domain handling.

test_httrack_file_exclusion()[source]

Test that HTTrack-generated files are properly excluded.

test_httrack_advanced_features()[source]

Test HTTrack-specific advanced features not covered by base tests.

test_report()[source]

Run the test report and save it to the data directory.

Module contents