mcp_server_webcrawl.extras package

Submodules

mcp_server_webcrawl.extras.markdown module

class MarkdownTransformer[source]

Bases: object

Memoizes the XSLT transformer.

classmethod get_xslt_transform()[source]

Get the XSLT transformer that converts HTML to Markdown text.

get_markdown(content)[source]

Transform HTML content to Markdown using XSLT.

Parameters:

content (str) – The HTML content to transform.

Returns:

The transformed Markdown string, or None if the input is empty or the transformation fails (e.g., due to invalid HTML or XSLT errors).

Return type:

str | None
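The memoization and the None-on-failure contract described above can be sketched with the standard library alone. This is not the package's implementation: `MemoizedTransformerSketch`, `get_markdown_sketch`, and `build_count` are hypothetical names, and a compiled tag-stripping regex stands in for the compiled XSLT stylesheet, so the output is plain text rather than Markdown.

```python
import html
import re
from typing import Optional


class MemoizedTransformerSketch:
    """Hypothetical stand-in: builds a costly object once and reuses it."""

    _compiled: Optional[re.Pattern] = None  # class-level cache
    build_count = 0  # illustration only: how many times the build step ran

    @classmethod
    def get_transform(cls) -> re.Pattern:
        # Build on first use; later calls return the cached object.
        if cls._compiled is None:
            cls.build_count += 1
            # Assumption: a tag-stripping regex replaces the real XSLT stylesheet.
            cls._compiled = re.compile(r"<[^>]+>")
        return cls._compiled


def get_markdown_sketch(content: str) -> Optional[str]:
    # Mirror the documented contract: None for empty input or on failure.
    if not content or not content.strip():
        return None
    try:
        stripped = MemoizedTransformerSketch.get_transform().sub("", content)
        return html.unescape(stripped).strip()
    except Exception:
        # Stand-in for transformation errors; the real error types are not documented here.
        return None
```

Calling `get_markdown_sketch` repeatedly builds the stand-in transformer only on the first non-empty input, which is the point of memoizing it at class level.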

mcp_server_webcrawl.extras.regex module

get_regex(headers, content, patterns)[source]

Searches the headers and content for matches of the given regex patterns.

Parameters:

headers (str) – The headers to search
content (str) – The content to search
patterns (list[str]) – The regex patterns

Returns:

A list of dicts, with selector, value, groups, position info, and source

Return type:

list[dict[str, str | int]]
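As a rough illustration of the return shape described above, the following sketch collects matches from both sources with the standard `re` module. The function name `get_regex_sketch` and the dictionary keys (`selector`, `value`, `groups`, `start`, `end`, `source`) are assumptions chosen to mirror the description, not the package's confirmed field names, and skipping invalid patterns is likewise an assumption.

```python
import re
from typing import Any, Dict, List


def get_regex_sketch(headers: str, content: str, patterns: List[str]) -> List[Dict[str, Any]]:
    """Hypothetical sketch: one dict per regex match in headers or content."""
    results: List[Dict[str, Any]] = []
    sources = (("headers", headers), ("content", content))
    for pattern in patterns:
        try:
            compiled = re.compile(pattern)
        except re.error:
            continue  # assumption: skip invalid patterns rather than raise
        for source_name, text in sources:
            if not text:
                continue
            for match in compiled.finditer(text):
                results.append({
                    "selector": pattern,            # the pattern that matched
                    "value": match.group(0),        # the full matched text
                    "groups": list(match.groups()), # captured groups, if any
                    "start": match.start(),         # position info
                    "end": match.end(),
                    "source": source_name,          # "headers" or "content"
                })
    return results


hits = get_regex_sketch(
    "Content-Type: text/html",
    "<a href='/a'>A</a> <a href='/b'>B</a>",
    [r"href='([^']+)'", r"Content-Type: (\S+)"],
)
```

Here `hits` holds two matches from the content followed by one from the headers, each tagged with the pattern that produced it.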

mcp_server_webcrawl.extras.snippets module

class SnippetContentExtractor[source]

Bases: object

lxml-based HTML parser for extracting different types of content from HTML. Content separates into components: text, markup, attributes (values), and comments. These can be prioritized in search so that text is the displayed hit over noisier types.

PRIORITY_ORDER: list[str] = ['url', 'document_text', 'document_attributes', 'document_comments', 'headers', 'document_markup']
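One way to read the priority list is that the first group in the list with any hit supplies the displayed snippet. The sketch below shows that selection under this assumption; `first_hit_by_priority` and the shape of the `hits` mapping are hypothetical and not part of the documented API.

```python
from typing import Dict, List, Optional, Tuple

# Copied from the class attribute documented above.
PRIORITY_ORDER: List[str] = [
    "url", "document_text", "document_attributes",
    "document_comments", "headers", "document_markup",
]


def first_hit_by_priority(hits: Dict[str, List[str]]) -> Optional[Tuple[str, str]]:
    """Return (group, hit) for the highest-priority group that has a hit."""
    for group in PRIORITY_ORDER:
        group_hits = hits.get(group) or []
        if group_hits:
            return group, group_hits[0]
    return None


# Text outranks markup, so the text hit is the one displayed.
best = first_hit_by_priority({
    "document_markup": ["<div id='pricing'>"],
    "document_text": ["pricing starts at"],
})
```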

__init__(url, headers, content)[source]

Parameters:

url (str) –
headers (str) –
content (str) –

get_snippets(url, headers, content, query)[source]

Takes a query and content, reduces the HTML to text content and extracts hits as excerpts of text.

Parameters:

headers (str) – Header content to search
content (str) – The HTML or text content to search in
query (str) – The search query string
url (str) –

Returns:

A string of snippets with context around matched terms, separated by “ … ”, or None

Return type:

str | None

find_snippets_in_text(text, terms, max_snippets=15, group_name='')[source]

Searches for whole-word matches of the given terms in the text and extracts surrounding context to create highlighted snippets. Each snippet shows the matched term in context with markdown-style bold highlighting (**term**).

Parameters:

text (str) – The text to search within
terms (list[str]) – List of search terms to find (case-insensitive, whole words only)
max_snippets (int) – Maximum number of snippets to return (default: MAX_SNIPPETS_MATCHED_COUNT, which is 15 per the signature)
group_name (str) – Regex group identifier (reserved for future use)

Returns:

List of unique snippet strings with matched terms highlighted using bold markdown. Each snippet includes surrounding context of up to MAX_SNIPPETS_CONTEXT_SIZE characters on each side of the match. Returns an empty list if no matches are found or the input is invalid.

Return type:

list[str]
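The behavior described above (case-insensitive whole-word matching, bold highlighting, bounded context, unique results) can be approximated as follows. This is a sketch, not the package's code: `find_snippets_sketch` and its `context_size` parameter are hypothetical, with `context_size` standing in for MAX_SNIPPETS_CONTEXT_SIZE, whose actual value is not given here.

```python
import re
from typing import List


def find_snippets_sketch(
    text: str,
    terms: List[str],
    max_snippets: int = 15,
    context_size: int = 30,  # assumption: stand-in for MAX_SNIPPETS_CONTEXT_SIZE
) -> List[str]:
    """Hypothetical sketch: whole-word, case-insensitive snippets with **bold** hits."""
    if not text or not terms:
        return []
    snippets: List[str] = []
    seen = set()
    for term in terms:
        if not term:
            continue
        # \b anchors restrict matches to whole words; escape the term so it is literal.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            start = max(0, match.start() - context_size)
            end = min(len(text), match.end() + context_size)
            snippet = (
                text[start:match.start()]
                + "**" + match.group(0) + "**"
                + text[match.end():end]
            ).strip()
            if snippet not in seen:  # keep snippets unique
                seen.add(snippet)
                snippets.append(snippet)
            if len(snippets) >= max_snippets:
                return snippets
    return snippets


found = find_snippets_sketch(
    "The crawler indexes pages. Crawlers differ. The CRAWLER stops.",
    ["crawler"],
    context_size=10,
)
```

In this example "Crawlers" is not highlighted because the trailing "s" means there is no word boundary after the term, while "crawler" and "CRAWLER" both match.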

mcp_server_webcrawl.extras.thumbnails module

class ThumbnailManager[source]

Bases: object

Manages thumbnail generation and caching for image files and URLs.

__init__()[source]

get_thumbnails(paths)[source]

Convert URLs or file paths to base64-encoded strings.

Parameters:

paths (list[str]) – List of URLs or file paths to convert

Returns:

Dictionary mapping each path to its base64 representation, or None if conversion failed

Return type:

dict[str, str | None]
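The path-to-base64 mapping with None for failures can be sketched with the standard library. This simplified version only reads local files; it does not fetch URLs, resize images, or cache results, all of which the real ThumbnailManager may do. The name `encode_files_sketch` is hypothetical.

```python
import base64
import os
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def encode_files_sketch(paths: List[str]) -> Dict[str, Optional[str]]:
    """Map each local file path to its base64 text, or None if unreadable."""
    results: Dict[str, Optional[str]] = {}
    for path in paths:
        try:
            raw = Path(path).read_bytes()
            results[path] = base64.b64encode(raw).decode("ascii")
        except OSError:
            # Missing or unreadable files map to None instead of raising.
            results[path] = None
    return results


# Demo with a throwaway file and a path that does not exist.
with tempfile.TemporaryDirectory() as tmp_dir:
    present = os.path.join(tmp_dir, "sample.bin")
    missing = os.path.join(tmp_dir, "missing.png")
    Path(present).write_bytes(b"\x89PNG")
    encoded = encode_files_sketch([present, missing])
    present_value = encoded[present]
    missing_value = encoded[missing]
```

Returning None per entry, rather than raising, lets one bad path fail without discarding the results for the others.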

mcp_server_webcrawl.extras.xpath module

get_xpath(content, xpaths)[source]

Evaluates the XPath selectors against the content and returns the hits.

Parameters:

content (str) – The HTML source
xpaths (list[str]) – The xpath selectors

Returns:

A list of dicts, with selector and value

Return type:

list[dict[str, str | int | float]]
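The selector-and-value result shape can be illustrated with the standard library's `xml.etree.ElementTree`. This is only an approximation: ElementTree requires well-formed markup and supports a limited XPath subset, whereas the module documented here presumably relies on a full HTML parser. The name `get_xpath_sketch` and the choice to use element text as the value are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def get_xpath_sketch(content: str, xpaths: List[str]) -> List[Dict[str, str]]:
    """Hypothetical sketch: one {"selector", "value"} dict per matched element."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        return []  # assumption: unparseable input yields no hits
    results: List[Dict[str, str]] = []
    for xpath in xpaths:
        try:
            elements = root.findall(xpath)
        except SyntaxError:
            continue  # ElementTree raises SyntaxError for unsupported expressions
        for element in elements:
            # Use the element's combined text content as the value.
            value = "".join(element.itertext()).strip()
            results.append({"selector": xpath, "value": value})
    return results


page = "<html><body><h1>Title</h1><p class='x'>One</p><p>Two</p></body></html>"
xpath_hits = get_xpath_sketch(page, [".//h1", ".//p[@class='x']"])
```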
