🕸️ mcp-webscraper — Web Scraping

mcp-webscraper is a FastMCP server for web scraping, data extraction, and search. It fetches pages, converts HTML to clean Markdown, extracts tables, links, CSS-selected sections, metadata, and sitemap URLs, and runs web searches via Brave Search.

Tools

Tool	Description
`webscraper_fetch(url, max_chars=5000)`	Title + full page as Markdown + metadata
`webscraper_fetch_links(url, deduplicate=True)`	All `href` links found on the page
`webscraper_fetch_tables(url)`	All HTML tables converted to Markdown
`webscraper_fetch_all(url, max_chars=5000)`	Everything in one call (fetch + links + tables + meta)
`webscraper_fetch_section(url, selector)`	Specific CSS selector section only
`webscraper_fetch_meta(url)`	Title, description, Open Graph tags
`webscraper_fetch_sitemap(url, max_urls=100)`	Parse sitemap.xml, return URL list
`webscraper_search_hint(query, max_results=5)`	Brave Search — top URLs + snippets for a query

Stack

HTTP client: httpx (async, with SSL support, Chrome/Linux User-Agent)
HTML parser: BeautifulSoup4 + lxml
Markdown converter: html2text
Search backend: Brave Search (search.brave.com) — works without CAPTCHA
SSL: Custom cert bundle for Fedora 43 compatibility

🔍 Search: The Two-Step Research Pattern

webscraper_search_hint is the entry point for all web research. The recommended workflow is:

Step 1: webscraper_search_hint("your query") → get candidate URLs + snippets
Step 2: webscraper_fetch(best_url)           → get full page content

This avoids scraping irrelevant pages and gives you an overview before committing to a deep read.

Why Brave Search?

webscraper_search_hint uses Brave Search (search.brave.com) because:

✅ Returns real results without CAPTCHA or consent walls
✅ No API key required — works with plain HTTP GET
✅ Handles special characters (C++, &, %, etc.) via URL encoding
❌ Google blocks plain HTTP requests with a 302 consent redirect
❌ DuckDuckGo blocks them with a CAPTCHA
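The URL-encoding point can be illustrated with the standard library. This is a sketch of how a search URL like the one in the return value might be built; `build_search_url` is a hypothetical helper, and the `q` and `source=web` parameters are taken from the example `search_url` in this README, not from Brave's documentation.

```python
from urllib.parse import urlencode

BRAVE_SEARCH_URL = "https://search.brave.com/search"


def build_search_url(query: str) -> str:
    """Build a Brave Search URL for a plain HTTP GET (hypothetical helper)."""
    # urlencode uses quote_plus: spaces become "+", "+" becomes "%2B",
    # "&" becomes "%26", and "%" becomes "%25".
    return f"{BRAVE_SEARCH_URL}?{urlencode({'q': query, 'source': 'web'})}"


print(build_search_url("FastMCP tool decorator"))
print(build_search_url("C++ & 100% coverage"))
```

Encoding the query this way keeps characters such as `+`, `&`, and `%` from being read as URL syntax.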

Return Value

The tool returns a structured dict:

{
  "query": "FastMCP tool decorator",
  "search_url": "https://search.brave.com/search?q=FastMCP+tool+decorator&source=web",
  "result_count": 5,
  "hint": "FastMCP Docs (https://docs.fastmcp.dev): The @mcp.tool() decorator registers a function as... | PyPI FastMCP (https://pypi.org/project/fastmcp/): FastMCP 2.x — modern MCP server framework... | ...",
  "results": [
    {
      "title": "FastMCP Docs",
      "url": "https://docs.fastmcp.dev",
      "snippet": "The @mcp.tool() decorator registers a function as an MCP tool..."
    },
    ...
  ]
}

The hint field is a pipe-separated string of entries in the form "Title (url): snippet", with each snippet truncated to 120 characters. It gives a quick overview for deciding which URL to fetch next.
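The hint format can be reproduced from the results list. The sketch below assumes the `" | "` separator seen in the example above; `build_hint` is a hypothetical name, not the server's function.

```python
def build_hint(results: list[dict], snippet_len: int = 120) -> str:
    """Join results into a 'Title (url): snippet' string (hypothetical helper)."""
    parts = []
    for r in results:
        # Snippets may be missing or empty for some sites; treat both as "".
        snippet = (r.get("snippet") or "")[:snippet_len]
        parts.append(f"{r['title']} ({r['url']}): {snippet}")
    return " | ".join(parts)


results = [
    {"title": "FastMCP Docs", "url": "https://docs.fastmcp.dev", "snippet": "x" * 200},
    {"title": "PyPI FastMCP", "url": "https://pypi.org/project/fastmcp/", "snippet": ""},
]
print(build_hint(results))
```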

Example: Two-Step Research Flow

# Step 1: Orient — what pages exist about this topic?
result = webscraper_search_hint("httpx async client timeout settings", max_results=5)
# hint: "HTTPX Docs (https://www.python-httpx.org/...): Configure timeout... | ..."

# Step 2: Deep-dive the most relevant result
content = webscraper_fetch("https://www.python-httpx.org/advanced/timeouts/", max_chars=8000)

Known Limitations

Reddit / Stack Overflow snippets may be empty — these platforms block snippet extraction
The scraper's CSS selectors target Brave's Svelte-generated class names, which may change. If you get 0 results, the selectors may need updating (last verified: 2026-04-05)
Use sparingly — once per research task to get oriented, not for every query
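Callers can guard against the first two limitations defensively. The following is a sketch under assumptions: `pick_urls` is a hypothetical helper that works only on the dict shape shown under Return Value. It fails loudly on zero results and ranks results with empty snippets last.

```python
def pick_urls(search_result: dict, limit: int = 3) -> list[str]:
    """Choose candidate URLs from a search_hint result dict (hypothetical helper)."""
    results = search_result.get("results", [])
    if not results:
        # Zero results may mean the scraper's selectors are out of date.
        raise RuntimeError(f"no results for query {search_result.get('query')!r}")
    # Stable sort: results with a snippet first, empty-snippet ones as fallbacks.
    ranked = sorted(results, key=lambda r: not r.get("snippet"))
    return [r["url"] for r in ranked[:limit]]


example = {
    "query": "httpx timeouts",
    "results": [
        {"title": "Forum thread", "url": "https://example.com/thread", "snippet": ""},
        {"title": "Docs", "url": "https://example.com/docs", "snippet": "Configure timeouts"},
        {"title": "Blog", "url": "https://example.com/blog", "snippet": "A guide"},
    ],
}
print(pick_urls(example, limit=2))
```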

SSL Note — Fedora 43 Comodo Root CA

Fedora 43 does not ship the Comodo AAA Services Root CA, which is needed to verify certificates for Cloudflare-protected sites. The missing certificate is bundled at mcp/webscraper/certs/comodo-aaa-services-root.pem.

The server automatically uses this cert bundle — no manual configuration needed.
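For readers who want to reuse the bundled certificate elsewhere, here is one way an extra root CA can be added on top of the system trust store. This is a sketch, not the server's actual wiring: `build_ssl_context` is a hypothetical helper, and skipping a missing file is an assumption made for illustration.

```python
import ssl
from pathlib import Path

# Path from this README, relative to the repository root.
EXTRA_CA = Path("mcp/webscraper/certs/comodo-aaa-services-root.pem")


def build_ssl_context(extra_ca: Path = EXTRA_CA) -> ssl.SSLContext:
    """Return a default SSL context, plus the extra root CA if the file exists."""
    ctx = ssl.create_default_context()  # system CAs, certificate verification on
    if extra_ca.is_file():
        ctx.load_verify_locations(cafile=str(extra_ca))
    return ctx


ctx = build_ssl_context()
# With httpx installed, the context can be passed to the client:
#   httpx.AsyncClient(verify=ctx)
```

Adding the certificate to a default context keeps the system's other trusted roots in place.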

Quick Start

cd mcp/webscraper
uv sync
uv run python src/server.py

Run Tests

cd mcp/webscraper
uv run pytest tests/ -v
# 28/28 tests passing

Usage Examples

# Step 1: Search — get candidate URLs for a topic
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)

# Step 2: Deep-dive the most relevant URL
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)

# Extract all links from a Gitea repo
webscraper_fetch_links("http://192.168.188.119:30008/pplate/pi_mcps")

# Get all tables from a documentation page
webscraper_fetch_tables("https://pypi.org/project/fastmcp/")

# Get Open Graph metadata
webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")

# Fetch specific section by CSS selector
webscraper_fetch_section("https://docs.python.org", "#content")

# Search with special characters (C++, &, % all work)
webscraper_search_hint("C++ std::optional usage", max_results=3)