mcp_server_webcrawl.crawlers.httrack package
Submodules
mcp_server_webcrawl.crawlers.httrack.adapter module
- class HtTrackManager
Bases: IndexedManager
Manages HTTrack project data in in-memory SQLite databases.
Initialize the HTTrack manager with empty cache and statistics.
- get_sites(datasrc, ids=None, fields=None)
List HTTrack project directories as sites.
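- Parameters:
datasrc (Path) – path to the directory containing HTTrack projects
ids (list[int] | None) – optional list of site IDs to filter by
fields (list[str] | None) – optional list of fields to include in response
- Returns:
List of SiteResult objects, one for each HTTrack project
- Return type:
list[SiteResult]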
- get_resources(datasrc, sites=None, query='', fields=None, sort=None, limit=20, offset=0)
Get resources from HTTrack project directories using in-memory SQLite.
- Parameters:
datasrc (Path) – path to the directory containing HTTrack projects
sites (list[int] | None) – optional list of site IDs to filter by
query (str) – search query string
fields (list[str] | None) – optional list of fields to include in response
sort (str | None) – sort order for results
limit (int) – maximum number of results to return
offset (int) – number of results to skip for pagination
- Returns:
Tuple of (list of ResourceResult objects, total count, IndexState)
- Return type:
tuple[list[ResourceResult], int, IndexState]
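The paginated return shape described for get_resources (one page of results plus the total match count) can be sketched with the standard-library sqlite3 module. This is a minimal illustration under stated assumptions, not the package's implementation: the `resources` table, its columns, and the `search_resources` helper are hypothetical, a plain `LIKE` match stands in for the real query syntax, and the IndexState element of the tuple is omitted.

```python
import sqlite3


def search_resources(conn, query="", limit=20, offset=0):
    """Return (rows, total) for URLs containing `query`, one page at a time."""
    # Hypothetical schema: resources(id, url). LIKE is a stand-in for real search.
    pattern = f"%{query}%"
    total = conn.execute(
        "SELECT COUNT(*) FROM resources WHERE url LIKE ?", (pattern,)
    ).fetchone()[0]
    rows = conn.execute(
        "SELECT id, url FROM resources WHERE url LIKE ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (pattern, limit, offset),
    ).fetchall()
    return rows, total


# Build a throwaway in-memory database with 45 sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO resources (url) VALUES (?)",
    [(f"https://example.com/page-{i}.html",) for i in range(1, 46)],
)

# Third page of 20: only the last 5 rows remain, but total still reports 45.
page, total = search_resources(conn, query="page", limit=20, offset=40)
```

Returning the total alongside the page lets a caller tell whether more results remain without issuing a second request.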
mcp_server_webcrawl.crawlers.httrack.crawler module
- class HtTrackCrawler
Bases: IndexedCrawler
A crawler implementation for HTTrack captured sites. Provides functionality for accessing and searching web content from HTTrack projects. HTTrack creates offline mirrors of websites with preserved directory structure and metadata in hts-log.txt files.
- __init__(datasrc)
Initialize the HTTrack crawler with a data source directory.
- Parameters:
datasrc (Path) – path to the data source; must be a directory containing HTTrack project directories, each of which may contain multiple domains
- Raises:
AssertionError – If datasrc is None or not a directory
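The documented precondition on datasrc can be sketched as a pair of assertions. This is only an illustration of the behavior described above: `check_datasrc` is a hypothetical helper, not a function of the package, and the error messages are invented.

```python
from pathlib import Path
import tempfile


def check_datasrc(datasrc):
    """Hypothetical helper mirroring the documented precondition."""
    assert datasrc is not None, "datasrc is required"
    assert Path(datasrc).is_dir(), f"{datasrc} is not a directory"
    return Path(datasrc)


# A real directory passes the check.
with tempfile.TemporaryDirectory() as tmp:
    accepted = check_datasrc(Path(tmp)).is_dir()

# None is rejected with AssertionError, as the Raises entry describes.
try:
    check_datasrc(None)
    rejected = False
except AssertionError:
    rejected = True
```

Note that Python strips `assert` statements when run with the `-O` flag, so a check written this way is not enforced in optimized mode.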
mcp_server_webcrawl.crawlers.httrack.tests module
- class HtTrackTests
Bases: BaseCrawlerTests
Test suite for the HTTrack crawler implementation. Uses all wrapped test methods from BaseCrawlerTests plus HTTrack-specific features.
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
- test_httrack_tokenizer()
Test HTTrack-specific tokenizer functionality for hyphenated terms.
- test_httrack_log_parsing_features()
Test HTTrack-specific features related to hts-log.txt parsing.
- test_httrack_url_reconstruction()
Test HTTrack URL reconstruction from project and domain structure.
- test_httrack_domain_detection()
Test HTTrack domain directory detection and multi-domain handling.
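The ideas behind the domain detection and URL reconstruction tests can be sketched against a throwaway directory tree. An HTTrack project typically holds one subdirectory per mirrored domain alongside its hts-cache directory. Everything else here is an assumption for illustration: `detect_domains` and `reconstruct_url` are hypothetical helpers, the "contains a dot" heuristic is a simplification, and the scheme defaults to https because it cannot be recovered from the directory layout alone.

```python
from pathlib import Path
import tempfile

SKIP_DIRS = {"hts-cache"}  # HTTrack's cache directory, not a mirrored domain


def detect_domains(project_dir):
    """Return names of subdirectories that look like mirrored domains."""
    # Assumed heuristic: a domain directory name contains at least one dot.
    return sorted(
        p.name
        for p in Path(project_dir).iterdir()
        if p.is_dir() and p.name not in SKIP_DIRS and "." in p.name
    )


def reconstruct_url(project_dir, file_path, scheme="https"):
    """Map <project>/<domain>/<path> back to <scheme>://<domain>/<path>."""
    relative = Path(file_path).relative_to(project_dir)
    domain, *rest = relative.parts
    return f"{scheme}://{domain}/" + "/".join(rest)


# Build a small fake project with two domain directories.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "myproject"
    (project / "hts-cache").mkdir(parents=True)
    (project / "example.com" / "docs").mkdir(parents=True)
    (project / "cdn.example.net").mkdir()
    page = project / "example.com" / "docs" / "index.html"
    page.write_text("<html></html>")

    domains = detect_domains(project)
    url = reconstruct_url(project, page)
```

A real implementation would likely consult hts-log.txt for the original URLs rather than guess the scheme.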