mcp_server_webcrawl.extras package

Submodules

mcp_server_webcrawl.extras.markdown module

get_markdown(content)
Parameters:

content (str)

Return type:

str | None

mcp_server_webcrawl.extras.regex module

get_regex(headers, content, patterns)

Searches the headers and content for matches of the given regex patterns.

Parameters:
  • headers (str) – The headers to search

  • content (str) – The content to search

  • patterns (list[str]) – The regex patterns

Returns:

A list of dicts, with selector, value, groups, position info, and source

Return type:

list[dict[str, str | int]]
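The behavior described above can be approximated with the standard library. This is a minimal sketch, not the module's implementation: the function name get_regex_sketch and the dict keys (selector, value, groups, start, end, source) are assumptions chosen to mirror the documented return description.

```python
import re


def get_regex_sketch(headers: str, content: str, patterns: list[str]) -> list[dict]:
    """Hypothetical stand-in for get_regex; the key names are assumptions."""
    results = []
    # Search each source separately so every hit records where it came from.
    for source, text in (("headers", headers), ("content", content)):
        if not text:
            continue
        for pattern in patterns:
            try:
                compiled = re.compile(pattern)
            except re.error:
                # Skip patterns that do not compile instead of failing the call.
                continue
            for match in compiled.finditer(text):
                results.append({
                    "selector": pattern,
                    "value": match.group(0),
                    "groups": list(match.groups()),
                    "start": match.start(),
                    "end": match.end(),
                    "source": source,
                })
    return results


hits = get_regex_sketch(
    "Content-Type: text/html",
    "<a href='/about'>About</a>",
    [r"href='([^']+)'", r"Content-Type: (\S+)"],
)
```

In this sketch, hits are ordered by source first (headers, then content) and by pattern within each source; the real function may order or shape its results differently.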

mcp_server_webcrawl.extras.snippets module

class SnippetContentExtractor

Bases: object

An lxml-based HTML parser that extracts different types of content from HTML. Content is separated into components: text, markup, attributes (values), and comments. Search can prioritize these components so that a text hit is displayed in preference to hits from noisier types.

PRIORITY_ORDER: list[str] = ['url', 'document_text', 'document_attributes', 'document_comments', 'headers', 'document_markup']

__init__(url, headers, content)
Parameters:
  • url (str)

  • headers (str)

  • content (str)

get_snippets(url, headers, content, query)

Takes a query and content, reduces the HTML to text content and extracts hits as excerpts of text.

Parameters:
  • headers (str) – Header content to search

  • content (str) – The HTML or text content to search in

  • query (str) – The search query string

  • url (str)

Returns:

A string of snippets with context around matched terms, separated by “ … ”, or None

Return type:

str | None
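One way the priority ordering could feed into the joined result is sketched below. This is an illustration under assumptions, not the class's actual logic: join_by_priority and the hits_by_group mapping are hypothetical, and only the PRIORITY_ORDER values and the “ … ” separator come from the documentation above.

```python
from typing import Optional

PRIORITY_ORDER = ["url", "document_text", "document_attributes",
                  "document_comments", "headers", "document_markup"]


def join_by_priority(hits_by_group: dict, max_snippets: int = 15) -> Optional[str]:
    """Hypothetical helper: take snippets group by group, highest priority first."""
    selected = []
    for group in PRIORITY_ORDER:
        for snippet in hits_by_group.get(group, []):
            # Keep each distinct snippet once, in priority order.
            if snippet not in selected:
                selected.append(snippet)
            if len(selected) >= max_snippets:
                break
        if len(selected) >= max_snippets:
            break
    # No hits in any group yields None rather than an empty string.
    return " … ".join(selected) if selected else None


joined = join_by_priority({
    "document_markup": ["<b>**cat**</b>"],
    "document_text": ["a **cat** sat"],
})
```

Because document_text precedes document_markup in PRIORITY_ORDER, the plain-text excerpt leads the joined string even though the markup hit was listed first in the input.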

find_snippets_in_text(text, terms, max_snippets=15, group_name='')

Searches for whole-word matches of the given terms in the text and extracts surrounding context to create highlighted snippets. Each snippet shows the matched term in context with markdown-style bold highlighting (**term**).

Parameters:
  • text (str) – The text to search within

  • terms (list[str]) – List of search terms to find (case-insensitive, whole words only)

  • max_snippets (int) – Maximum number of snippets to return (default: MAX_SNIPPETS_MATCHED_COUNT)

  • group_name (str) – Regex group identifier (reserved for future use)

Returns:

List of unique snippet strings with matched terms highlighted using bold markdown. Each snippet includes surrounding context of up to MAX_SNIPPETS_CONTEXT_SIZE characters on each side of the match. Returns an empty list if no matches are found or the input is invalid.

Return type:

list[str]
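The whole-word, case-insensitive matching and bold highlighting described above can be sketched with the re module. This is an approximation, not the method's source: find_snippets_sketch is a hypothetical name, and the two constant values are placeholders (the documentation names the constants, and the signature shows a default of 15 for max_snippets, but the context size is not given).

```python
import re

# Placeholder values; the library defines the real constants.
MAX_SNIPPETS_MATCHED_COUNT = 15
MAX_SNIPPETS_CONTEXT_SIZE = 20


def find_snippets_sketch(text, terms, max_snippets=MAX_SNIPPETS_MATCHED_COUNT):
    """Hypothetical stand-in for find_snippets_in_text."""
    if not text or not terms:
        return []
    # re.escape keeps terms literal; \b restricts hits to whole words.
    alternation = "|".join(re.escape(term) for term in terms if term)
    if not alternation:
        return []
    pattern = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - MAX_SNIPPETS_CONTEXT_SIZE)
        end = min(len(text), match.end() + MAX_SNIPPETS_CONTEXT_SIZE)
        snippet = (
            text[start:match.start()]
            + "**" + match.group(0) + "**"
            + text[match.end():end]
        ).strip()
        # Keep only unique snippets, up to the requested maximum.
        if snippet not in snippets:
            snippets.append(snippet)
        if len(snippets) >= max_snippets:
            break
    return snippets


snippets = find_snippets_sketch("The cat sat. Concatenate.", ["cat"])
```

Here "Concatenate" is not reported as a hit, because the word-boundary anchors reject a term embedded inside a longer word.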

mcp_server_webcrawl.extras.thumbnails module

class ThumbnailManager

Bases: object

Manages thumbnail generation and caching for image files and URLs.

__init__()

get_thumbnails(paths)

Convert URLs or file paths to base64-encoded strings.

Parameters:

paths (list[str]) – List of URLs or file paths to convert

Returns:

Dictionary mapping each path to its base64 representation, or to None if the conversion failed

Return type:

dict[str, str | None]
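The shape of the return value can be illustrated with a small standard-library sketch. It covers only local files and plain base64 encoding; it does not fetch URLs, resize images, or cache results, all of which the manager may do. The name encode_files_sketch and the sample file names are hypothetical.

```python
import base64
import tempfile
from pathlib import Path
from typing import Optional


def encode_files_sketch(paths: list[str]) -> dict[str, Optional[str]]:
    """Hypothetical stand-in: base64-encode local files, mapping failures to None."""
    results = {}
    for path in paths:
        try:
            raw = Path(path).read_bytes()
            results[path] = base64.b64encode(raw).decode("ascii")
        except OSError:
            # Missing or unreadable files map to None instead of raising.
            results[path] = None
    return results


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "pixel.bin"
    sample.write_bytes(b"abc")
    missing = Path(tmp) / "missing.png"
    encoded = encode_files_sketch([str(sample), str(missing)])
```

Returning None per path, rather than raising on the first failure, lets a caller process a batch and handle individual failures afterwards, which matches the documented return type.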

mcp_server_webcrawl.extras.xpath module

get_xpath(content, xpaths)

Evaluates the XPath selectors against the content and returns the hits.

Parameters:
  • content (str) – The HTML source

  • xpaths (list[str]) – The xpath selectors

Returns:

A list of dicts, with selector and value

Return type:

list[dict[str, str | int | float]]
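The selector-and-value result shape can be sketched with the standard library's ElementTree. This is an assumption-laden stand-in: ElementTree supports only a subset of XPath and requires well-formed markup, and the module itself may use a different parser. The name get_xpath_sketch and the choice to return element text as the value are hypothetical.

```python
import xml.etree.ElementTree as ET


def get_xpath_sketch(content: str, xpaths: list[str]) -> list[dict]:
    """Hypothetical stand-in for get_xpath using ElementTree's limited XPath."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        # Markup that is not well-formed cannot be parsed by ElementTree.
        return []
    results = []
    for xpath in xpaths:
        try:
            elements = root.findall(xpath)
        except SyntaxError:
            # ElementTree raises SyntaxError for expressions it does not support.
            continue
        for element in elements:
            # Use the element's combined text content as the hit value.
            value = "".join(element.itertext()).strip()
            results.append({"selector": xpath, "value": value})
    return results


xpath_hits = get_xpath_sketch(
    "<html><body><h1>Title</h1><p>One</p><p>Two</p></body></html>",
    [".//p", ".//h1"],
)
```

Each hit carries the selector that produced it, so results from several selectors can be returned in a single flat list and still be told apart.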

Module contents