mcp webscraper
Patrick Plate edited this page 2026-04-05 10:11:47 +02:00

🕸️ mcp-webscraper — Web Scraping

mcp-webscraper is a FastMCP server for web scraping, data extraction, and search. It fetches pages, converts HTML to clean Markdown, extracts tables, links, CSS-selected sections, metadata, and sitemap URLs, and performs web searches via Brave Search.

Tools

| Tool | Description |
|------|-------------|
| webscraper_fetch(url, max_chars=5000) | Title + full page as Markdown + metadata |
| webscraper_fetch_links(url, deduplicate=True) | All href links found on the page |
| webscraper_fetch_tables(url) | All HTML tables converted to Markdown |
| webscraper_fetch_all(url, max_chars=5000) | Everything in one call (fetch + links + tables + meta) |
| webscraper_fetch_section(url, selector) | Specific CSS selector section only |
| webscraper_fetch_meta(url) | Title, description, Open Graph tags |
| webscraper_fetch_sitemap(url, max_urls=100) | Parse sitemap.xml, return URL list |
| webscraper_search_hint(query, max_results=5) | Brave Search: top URLs + snippets for a query |
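
To make the deduplicate flag of webscraper_fetch_links concrete, here is a minimal sketch of link extraction with order-preserving deduplication. It is not the server's code: the server uses BeautifulSoup4 + lxml, while this sketch swaps in the standard library's html.parser so it runs without dependencies. LinkCollector and extract_links are hypothetical names, and resolving relative links against the page URL is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html: str, base_url: str, deduplicate: bool = True) -> list[str]:
    """Hypothetical helper: resolve hrefs against base_url, optionally dropping repeats."""
    parser = LinkCollector()
    parser.feed(html)
    # Assumption: relative links are made absolute against the page URL.
    links = [urljoin(base_url, href) for href in parser.hrefs]
    if deduplicate:
        # dict.fromkeys keeps the first occurrence and preserves order.
        links = list(dict.fromkeys(links))
    return links


sample = (
    '<a href="/docs">Docs</a> '
    '<a href="/docs">Docs again</a> '
    '<a href="https://example.org/x">X</a>'
)
links = extract_links(sample, "https://example.com/start")
```

With deduplicate=False the same call would keep the repeated /docs link, which can be useful when the number of occurrences matters.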

Stack

  • HTTP client: httpx (async, with SSL support, Chrome/Linux User-Agent)
  • HTML parser: BeautifulSoup4 + lxml
  • Markdown converter: html2text
  • Search backend: Brave Search (search.brave.com) — works without CAPTCHA
  • SSL: Custom cert bundle for Fedora 43 compatibility

🔍 Search: The Two-Step Research Pattern

webscraper_search_hint is the entry point for all web research. The recommended workflow is:

Step 1: webscraper_search_hint("your query") → get candidate URLs + snippets
Step 2: webscraper_fetch(best_url)           → get full page content

This avoids scraping irrelevant pages and gives you an overview before committing to a deep read.

webscraper_search_hint uses Brave Search (search.brave.com) because:

  • It returns real results without CAPTCHA or consent walls
  • It needs no API key and works with a plain HTTP GET
  • It handles special characters (C++, &, %, etc.) once the query is URL-encoded
  • Google, by contrast, answers plain HTTP requests with a 302 redirect to a consent page
  • DuckDuckGo blocks such requests with a CAPTCHA
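
The URL-encoding point can be illustrated with a short sketch. It assumes the search URL has the shape shown in the search_url field of the return value (q plus source=web); build_search_url is a hypothetical helper, not a function taken from the server.

```python
from urllib.parse import urlencode

BRAVE_SEARCH = "https://search.brave.com/search"


def build_search_url(query: str) -> str:
    """Hypothetical helper: percent-encode the query into a Brave Search URL."""
    # urlencode turns spaces into '+' and escapes reserved characters,
    # so a literal '+' becomes %2B and ':' becomes %3A.
    return f"{BRAVE_SEARCH}?{urlencode({'q': query, 'source': 'web'})}"


url_plain = build_search_url("FastMCP tool decorator")
url_special = build_search_url("C++ std::optional usage")
# url_special == "https://search.brave.com/search?q=C%2B%2B+std%3A%3Aoptional+usage&source=web"
```

Without encoding, the two plus signs in "C++" would be read as spaces by the server, which is why the raw query should never be concatenated into the URL directly.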

Return Value

The tool returns a structured dict:

{
  "query": "FastMCP tool decorator",
  "search_url": "https://search.brave.com/search?q=FastMCP+tool+decorator&source=web",
  "result_count": 5,
  "hint": "FastMCP Docs (https://docs.fastmcp.dev): The @mcp.tool() decorator registers a function as... | PyPI FastMCP (https://pypi.org/project/fastmcp/): FastMCP 2.x — modern MCP server framework... | ...",
  "results": [
    {
      "title": "FastMCP Docs",
      "url": "https://docs.fastmcp.dev",
      "snippet": "The @mcp.tool() decorator registers a function as an MCP tool..."
    },
    ...
  ]
}

The hint field is a pipe-separated string of "Title (url): snippet" entries, with each snippet cut to its first 120 characters (snippet[:120]). It is meant to be immediately actionable for deciding which URL to fetch next.
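
As a rough illustration of that format, the hint string could be derived from the results list as sketched below. build_hint is a hypothetical name, and the " | " separator with surrounding spaces is an assumption read off the example above.

```python
def build_hint(results: list[dict], snippet_chars: int = 120) -> str:
    """Hypothetical helper: join results as 'Title (url): snippet' entries."""
    entries = [
        f"{r['title']} ({r['url']}): {r['snippet'][:snippet_chars]}"
        for r in results
    ]
    # Assumption: entries are separated by a pipe with one space on each side.
    return " | ".join(entries)


results = [
    {"title": "FastMCP Docs", "url": "https://docs.fastmcp.dev", "snippet": "A" * 200},
    {"title": "PyPI FastMCP", "url": "https://pypi.org/project/fastmcp/", "snippet": "short"},
]
hint = build_hint(results)
```

Snippets shorter than the limit pass through unchanged, since slicing past the end of a string is harmless in Python.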

Example: Two-Step Research Flow

# Step 1: Orient — what pages exist about this topic?
result = webscraper_search_hint("httpx async client timeout settings", max_results=5)
# hint: "HTTPX Docs (https://www.python-httpx.org/...): Configure timeout... | ..."

# Step 2: Deep-dive the most relevant result
content = webscraper_fetch("https://www.python-httpx.org/advanced/timeouts/", max_chars=8000)
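
Between the two steps, something has to choose which URL to fetch. One possible heuristic, sketched here under the assumption that the result dict has the shape documented under Return Value, is to prefer the first result that actually carries a snippet, since some platforms return empty ones. pick_url is a hypothetical helper and not part of the server.

```python
from typing import Optional


def pick_url(result: dict) -> Optional[str]:
    """Hypothetical helper: choose a URL to pass on to webscraper_fetch."""
    results = result.get("results", [])
    for entry in results:
        # Prefer entries whose snippet is non-empty after stripping whitespace.
        if entry.get("snippet", "").strip():
            return entry["url"]
    # Fall back to the top result, or None when the search returned nothing.
    return results[0]["url"] if results else None


search_result = {
    "query": "httpx async client timeout settings",
    "result_count": 2,
    "results": [
        {"title": "Forum thread", "url": "https://example.com/thread", "snippet": ""},
        {"title": "HTTPX Docs", "url": "https://www.python-httpx.org/advanced/timeouts/",
         "snippet": "Configure timeout..."},
    ],
}
best_url = pick_url(search_result)
```

A None return is a signal to rephrase the query or to check whether the scraper's selectors still match, rather than to fetch anything.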

Known Limitations

  • Reddit / Stack Overflow snippets may be empty — these platforms block snippet extraction
  • Brave CSS selectors use Svelte-generated class names that may change. If you get 0 results, the scraper's selectors may need updating (last verified: 2026-04-05)
  • Use sparingly — once per research task to get oriented, not for every query

SSL Note — Fedora 43 Comodo Root CA

Fedora 43 is missing the Comodo AAA Services Root CA needed for Cloudflare-protected sites. The fix is bundled at mcp/webscraper/certs/comodo-aaa-services-root.pem.

The server automatically uses this cert bundle — no manual configuration needed.
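
For readers who want to reproduce the fix outside the server, the sketch below shows one way to extend the system trust store with the bundled certificate using only the standard library. How the server itself loads the bundle is not documented on this page, so build_ssl_context is a hypothetical helper, and the path is assumed to be relative to the repository root.

```python
import ssl
from pathlib import Path

# Assumption: the path from this page, resolved relative to the repository root.
EXTRA_CA = Path("mcp/webscraper/certs/comodo-aaa-services-root.pem")


def build_ssl_context(extra_ca: Path = EXTRA_CA) -> ssl.SSLContext:
    """Hypothetical helper: system trust store plus one extra root CA, if present."""
    context = ssl.create_default_context()
    if extra_ca.is_file():
        # Adds the extra root on top of the default CAs instead of replacing them.
        context.load_verify_locations(cafile=str(extra_ca))
    return context


context = build_ssl_context()
```

A context built this way can be passed as the verify argument of httpx.AsyncClient, which accepts an ssl.SSLContext. Certificate verification and hostname checking stay enabled, so the extra root widens trust only for chains that end in that CA.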

Quick Start

cd mcp/webscraper
uv sync
uv run python src/server.py

Run Tests

cd mcp/webscraper
uv run pytest tests/ -v
# 28/28 tests passing

Usage Examples

# Step 1: Search — get candidate URLs for a topic
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)

# Step 2: Deep-dive the most relevant URL
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)

# Extract all links from a Gitea repo
webscraper_fetch_links("http://192.168.188.119:30008/pplate/pi_mcps")

# Get all tables from a documentation page
webscraper_fetch_tables("https://pypi.org/project/fastmcp/")

# Get Open Graph metadata
webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")

# Fetch specific section by CSS selector
webscraper_fetch_section("https://docs.python.org", "#content")

# Search with special characters (C++, &, % all work)
webscraper_search_hint("C++ std::optional usage", max_results=3)