mcp_server_webcrawl.extras package
Submodules
mcp_server_webcrawl.extras.markdown module
mcp_server_webcrawl.extras.regex module
mcp_server_webcrawl.extras.snippets module
- class SnippetContentExtractor
Bases: object
lxml-based HTML parser for extracting different types of content from HTML. Content separates into components: text, markup, attributes (values), and comments. These can be prioritized in search so that text is the displayed hit over noisier types.
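The separation into components described above can be illustrated with a minimal sketch. It is not the package's implementation: it swaps in the standard library's `html.parser` for lxml so that the example is self-contained, and the names `ComponentCollector` and `split_components` are hypothetical.

```python
from html.parser import HTMLParser


class ComponentCollector(HTMLParser):
    """Hypothetical helper: buckets HTML into text, markup, attributes, comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = []
        self.markup = []
        self.attributes = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # Tag names stand in for "markup" in this sketch.
        self.markup.append(tag)
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        self.attributes.extend(value for _, value in attrs if value)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)

    def handle_comment(self, data):
        self.comments.append(data.strip())


def split_components(html):
    """Return a dict of component lists for the given HTML string."""
    collector = ComponentCollector()
    collector.feed(html)
    collector.close()
    return {
        "text": collector.text,
        "markup": collector.markup,
        "attributes": collector.attributes,
        "comments": collector.comments,
    }
```

Keeping the components in separate lists is what allows a search to check the text bucket first and fall back to the noisier buckets only when the text contains no hit.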
- get_snippets(url, headers, content, query)
Takes a query and HTML content, reduces the HTML to text, and extracts hits as text excerpts.
- find_snippets_in_text(text, terms, max_snippets=15, group_name='')
Searches for whole-word matches of the given terms in the text and extracts surrounding context to create highlighted snippets. Each snippet shows the matched term in context with markdown-style bold highlighting (**term**).
- Parameters:
- Returns:
List of unique snippet strings with matched terms highlighted using bold markdown. Each snippet includes surrounding context of up to MAX_SNIPPETS_CONTEXT_SIZE characters on each side of the match. Returns an empty list if no matches are found or the input is invalid.
- Return type:
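The behavior documented for find_snippets_in_text can be approximated with the sketch below. It is an illustration under stated assumptions, not the library's code: `find_snippets_sketch` and `CONTEXT_SIZE` are hypothetical names, the value 40 is a placeholder because the real MAX_SNIPPETS_CONTEXT_SIZE is not given here, case-insensitive matching is assumed, and the `group_name` parameter is omitted.

```python
import re

# Placeholder for MAX_SNIPPETS_CONTEXT_SIZE; the real value is not stated in this page.
CONTEXT_SIZE = 40


def find_snippets_sketch(text, terms, max_snippets=15):
    """Return unique snippets with whole-word matches wrapped in ** markers."""
    if not isinstance(text, str) or not text or not terms:
        return []
    cleaned = [t for t in terms if isinstance(t, str) and t]
    if not cleaned:
        return []

    # \b anchors give whole-word matching; re.escape guards special characters.
    # Case-insensitive matching is an assumption of this sketch.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in cleaned) + r")\b",
        re.IGNORECASE,
    )

    snippets = []
    seen = set()
    for match in pattern.finditer(text):
        start = max(0, match.start() - CONTEXT_SIZE)
        end = min(len(text), match.end() + CONTEXT_SIZE)
        snippet = (
            text[start:match.start()]
            + "**" + match.group(0) + "**"
            + text[match.end():end]
        ).strip()
        # Keep only the first occurrence of each distinct snippet.
        if snippet not in seen:
            seen.add(snippet)
            snippets.append(snippet)
        if len(snippets) >= max_snippets:
            break
    return snippets
```

For example, `find_snippets_sketch("The crawler indexes pages.", ["crawler"])` yields a single snippet with `**crawler**` highlighted, while searching the same term in `"crawlers only"` yields an empty list because the match must be a whole word.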