Patrick Plate dabdda167f docs(wiki): migrate to git-based workflow with persistent wiki/ clone

- Extract all wiki content from create_wiki_pages.py into docs/wiki/pages/*.md
- Add docs/wiki/deploy_wiki.sh: copies pages to wiki/ repo, commits, pushes
- Add /wiki/ to .gitignore (anchored — does not affect docs/wiki/)
- 12 pages: Home, MCP-Servers-Overview, mcp-image-gen, ComfyUI-Setup,
  mcp-webscraper (8 tools incl. search_hint), BigMind (schema v8),
  Development-Conventions, Java-Projects, Java-wellmann-shop,
  Java-mss-failsafe, Java-Architecture, _Sidebar
- Workflow: edit docs/wiki/pages/*.md → ./docs/wiki/deploy_wiki.sh

2026-04-05 09:48:19 +02:00


🕸️ mcp-webscraper — Web Scraping

mcp-webscraper is a FastMCP server for web scraping, data extraction, and search. It fetches pages, converts HTML to clean Markdown, extracts tables, links, CSS-selected sections, metadata, and sitemap URLs, and runs web searches via Brave Search.

Tools

| Tool | Description |
|------|-------------|
| `webscraper_fetch(url, max_chars=5000)` | Title + full page as Markdown + metadata |
| `webscraper_fetch_links(url, deduplicate=True)` | All `href` links found on the page |
| `webscraper_fetch_tables(url)` | All HTML tables converted to Markdown |
| `webscraper_fetch_all(url, max_chars=5000)` | Everything in one call (fetch + links + tables + meta) |
| `webscraper_fetch_section(url, selector)` | Specific CSS selector section only |
| `webscraper_fetch_meta(url)` | Title, description, Open Graph tags |
| `webscraper_fetch_sitemap(url, max_urls=100)` | Parse sitemap.xml, return URL list |
| `webscraper_search_hint(query, max_results=5)` | Brave Search: top URLs + snippets for a query |
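
To make the `deduplicate` and `max_urls` parameters concrete, here is a minimal sketch of link extraction and sitemap parsing. It is not the server's code: it uses only the Python standard library instead of the BeautifulSoup4 + lxml stack, and the helper names `extract_links` and `parse_sitemap` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html: str, base_url: str, deduplicate: bool = True) -> list[str]:
    """Return absolute link URLs; optionally drop repeats, keeping first-seen order."""
    collector = _LinkCollector()
    collector.feed(html)
    links = [urljoin(base_url, href) for href in collector.hrefs]
    if deduplicate:
        links = list(dict.fromkeys(links))  # dict keys preserve insertion order
    return links


def parse_sitemap(xml_text: str, max_urls: int = 100) -> list[str]:
    """Return up to max_urls <loc> values from a sitemap document."""
    root = ET.fromstring(xml_text)
    urls: list[str] = []
    for element in root.iter():
        if len(urls) >= max_urls:
            break
        # Tags are namespace-qualified, e.g. "{http://www.sitemaps.org/...}loc".
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            urls.append(element.text.strip())
    return urls
```

Both helpers take already-downloaded text, so they can be tried without network access. The real tools take a URL and fetch it themselves.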

Stack

HTTP client: httpx (async, with SSL support, Chrome/Linux User-Agent)
HTML parser: BeautifulSoup4 + lxml
Markdown converter: html2text
Search backend: Brave Search (search.brave.com) — works without CAPTCHA
SSL: Custom cert bundle for Fedora 43 compatibility

Search Hint Strategy

webscraper_search_hint uses Brave Search because:

✅ Returns real results without CAPTCHA or consent walls
❌ Google answers plain HTTP requests with a 302 redirect to a consent page
❌ DuckDuckGo blocks plain HTTP requests with a CAPTCHA

Use it sparingly — once per research task — to get oriented before deep-scraping individual pages.

# Get top 5 results for a query
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)
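
The "once per research task" guidance can be sketched as a small helper that searches a single time and then fetches each result. This is only an illustration: `research` is a hypothetical function, the tools are passed in as plain callables, and the result shape (a list of dicts with a `"url"` key) is an assumption, since the page only says the tool returns top URLs plus snippets.

```python
from typing import Callable


def research(
    query: str,
    search_hint: Callable[..., list[dict]],
    fetch: Callable[[str], str],
    max_results: int = 5,
) -> dict[str, str]:
    """Run one search for orientation, then fetch each distinct result page once."""
    hits = search_hint(query, max_results=max_results)  # the only search call
    pages: dict[str, str] = {}
    for hit in hits:
        url = hit["url"]  # assumed result shape: {"url": ..., "snippet": ...}
        if url not in pages:
            pages[url] = fetch(url)
    return pages
```

Passing the tools in as arguments keeps the sketch independent of how an MCP client actually invokes them.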

SSL Note — Fedora 43 Comodo Root CA

Fedora 43 is missing the Comodo AAA Services Root CA needed for Cloudflare-protected sites. The fix is bundled at mcp/webscraper/certs/comodo-aaa-services-root.pem.

The server automatically uses this cert bundle — no manual configuration needed.
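
For readers curious how a bundled root certificate can be combined with the system trust store, the following is a minimal sketch using Python's standard `ssl` module. It is an assumption about the approach, not the server's actual code, and `build_ssl_context` is a hypothetical name; only the certificate path comes from this page.

```python
import ssl
from pathlib import Path

# Path from the note above, relative to the repository root.
EXTRA_CA = Path("mcp/webscraper/certs/comodo-aaa-services-root.pem")


def build_ssl_context(extra_ca: Path = EXTRA_CA) -> ssl.SSLContext:
    """Return a verifying context: system trust store plus the extra root CA, if present."""
    context = ssl.create_default_context()
    if extra_ca.is_file():
        # Adds the certificate to the trusted roots; the system roots stay in place.
        context.load_verify_locations(cafile=str(extra_ca))
    return context


if __name__ == "__main__":
    ctx = build_ssl_context()
    print("certificate verification enabled:", ctx.verify_mode == ssl.CERT_REQUIRED)
```

A context like this can be handed to httpx through its `verify` argument, for example `httpx.AsyncClient(verify=ctx)`. Whether the server wires it up exactly this way is not stated here.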

Quick Start

cd mcp/webscraper
uv sync
uv run python src/server.py

Run Tests

cd mcp/webscraper
uv run pytest tests/ -v
# 23/23 tests passing

Usage Examples

# Fetch a page as Markdown
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)

# Extract all links from Gitea repo
webscraper_fetch_links("http://192.168.188.119:30008/pplate/pi_mcps")

# Get all tables from a documentation page
webscraper_fetch_tables("https://pypi.org/project/fastmcp/")

# Get Open Graph metadata
webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")

# Fetch specific section by CSS selector
webscraper_fetch_section("https://docs.python.org", "#content")

# Quick search orientation
webscraper_search_hint("Gitea wiki git clone", max_results=3)
