Webscraper MCP

By sektor10 (GitHub)

MCP server that extracts text content from webpages, YouTube videos, and PDFs for LLMs to use.

Overview

What is MCP Simple Scraper?

MCP Simple Scraper is a powerful web scraping tool built with Python that leverages the Model Context Protocol (MCP) to enable AI assistants to extract and process content from various web sources, including webpages, YouTube videos, and PDF documents.

How to use MCP Simple Scraper?

To use MCP Simple Scraper, set up the MCP server and utilize the standardized endpoints provided for content extraction from different sources. You can integrate it with AI models to facilitate data processing.

Key features of MCP Simple Scraper?

  • Acts as an MCP server for standardized content extraction.
  • Supports web scraping, YouTube transcript extraction, and PDF processing.
  • Ensures structured communication and security with proper error handling.

Use cases of MCP Simple Scraper?

  1. Extracting text content from various web pages for analysis.
  2. Retrieving transcripts from YouTube videos for further processing.
  3. Processing PDF documents to extract relevant information for AI models.

FAQ about MCP Simple Scraper?

  • What is the Model Context Protocol (MCP)?

MCP is an open standard that enables secure, two-way connections between AI models and external data sources.

  • Is MCP Simple Scraper free to use?

Yes! MCP Simple Scraper is open-source and available for anyone to use.

  • What programming language is MCP Simple Scraper built with?

MCP Simple Scraper is built with Python.

Content

Webscraper MCP

A powerful web scraping tool built with Python that leverages the Model Context Protocol (MCP) to enable AI assistants to extract and process content from various web sources. This implementation provides a standardized interface for AI models to access and process web pages, YouTube transcripts, and PDF documents.

What is MCP?

The Model Context Protocol (MCP) is an open standard that enables secure, two-way connections between AI models and external data sources. In this implementation:

  • MCP Server: The webscraper acts as an MCP server, providing standardized endpoints for content extraction (a minimal server sketch follows this list)
  • Tool Integration: Defines clear interfaces for web scraping, YouTube transcript extraction, and PDF processing
  • Structured Communication: Ensures consistent data exchange between AI models and web content
  • Security: Implements proper error handling and content validation
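
For orientation, here is a minimal sketch of how such a server is typically wired up with the FastMCP helper from the official MCP Python SDK. The server name and the placeholder tool are illustrative assumptions, not this project's exact code:

# Minimal MCP server sketch (illustrative; not the project's verbatim code)
from mcp.server.fastmcp import FastMCP

# Create the server; the name is what MCP clients see
mcp = FastMCP("webscraper")

@mcp.tool()
async def ping(url_input: str) -> str:
    '''Placeholder tool: echoes the URL back to the client.'''
    return f"received: {url_input}"

if __name__ == "__main__":
    # Serve over stdio, the default transport for local MCP servers
    mcp.run()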

Features

  • 🌐 Web Page Scraping: Extract main content and metadata from any webpage
    • Efficient text extraction with boilerplate removal
    • Metadata extraction (title, author, date)
    • Async processing for better performance
  • 📺 YouTube Transcript Extraction
    • Support for multiple video URL formats
    • Handles both short and full URLs
    • Error handling for unavailable transcripts
  • 📄 PDF Processing
    • Convert PDF documents to markdown text
    • Size limit protection
    • Automatic cleanup of temporary files

MCP Tools

This implementation provides three main MCP tools:

1. Web Page Content Extraction

@mcp.tool()
async def get_webpage_content(url_input: str) -> str:
    '''Extract and process webpage content'''
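
A plausible implementation sketch for this tool, using the httpx and trafilatura dependencies listed later in this README; the specific options and error message are assumptions rather than the project's verbatim code:

import httpx
import trafilatura

async def get_webpage_content(url_input: str) -> str:
    '''Extract and process webpage content (sketch).'''
    # Fetch the page asynchronously with a bounded timeout
    async with httpx.AsyncClient(timeout=30, follow_redirects=True) as client:
        response = await client.get(url_input)
        response.raise_for_status()
    # Strip boilerplate (navigation, ads) and keep the main text
    text = trafilatura.extract(response.text, url=url_input, include_comments=False)
    if text is None:
        # The real tool would likely raise ContentExtractionError (see Error Handling)
        raise ValueError(f"No extractable content at {url_input}")
    return text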

2. YouTube Transcript Extraction

@mcp.tool()
async def get_youtube_transcript(url_input: str) -> str:
    '''Extract video transcripts'''
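
One way this tool could be implemented with the youtube-transcript-api package. The URL parsing and the classic get_transcript call are assumptions; recent releases of the package expose a slightly different, instance-based API:

from urllib.parse import urlparse, parse_qs
from youtube_transcript_api import YouTubeTranscriptApi

def extract_video_id(url_input: str) -> str:
    '''Handle both short (youtu.be/ID) and full (youtube.com/watch?v=ID) URLs.'''
    parsed = urlparse(url_input)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    ids = parse_qs(parsed.query).get("v")
    if not ids:
        raise ValueError(f"No video ID found in {url_input!r}")
    return ids[0]

async def get_youtube_transcript(url_input: str) -> str:
    '''Extract video transcripts (sketch).'''
    video_id = extract_video_id(url_input)
    # Classic API; newer versions use YouTubeTranscriptApi().fetch(video_id)
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(segment["text"] for segment in segments)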

3. PDF Content Extraction

@mcp.tool()
async def get_pdf(url_input: str) -> str:
    '''Convert PDF to markdown'''
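
A sketch of the PDF flow: download with a size cap, convert, and always clean up the temporary file. The pdf_to_markdown helper is a hypothetical stand-in for the project's marker-based conversion step:

import os
import tempfile
import httpx

MAX_CONTENT_LENGTH = 10 * 1024 * 1024  # 10 MB cap, per the Configuration section

def pdf_to_markdown(path: str) -> str:
    '''Hypothetical stand-in for the marker-based PDF conversion step.'''
    raise NotImplementedError("plug in the marker PDF-to-markdown conversion here")

async def get_pdf(url_input: str) -> str:
    '''Convert PDF to markdown (sketch).'''
    async with httpx.AsyncClient(timeout=30, follow_redirects=True) as client:
        response = await client.get(url_input)
        response.raise_for_status()
    if len(response.content) > MAX_CONTENT_LENGTH:
        raise ValueError("PDF exceeds the configured size limit")
    # Write to a temporary file for the converter, and always remove it afterwards
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(response.content)
        tmp.close()
        return pdf_to_markdown(tmp.name)
    finally:
        os.unlink(tmp.name)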

Installation

  1. Clone the repository:
git clone <repository-url>
cd webscraper
  2. Install UV (if not already installed):
curl -LsSf https://astral.sh/uv/install.sh | sh
  3. Create and activate a new environment with UV:
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  4. Install dependencies using UV:
# Install from lock file for reproducible environment
uv pip install --requirement requirements.txt

# If you need to add new dependencies:
uv pip install new-package
uv pip freeze > requirements.txt  # Regenerates requirements.txt from the current environment

UV is used as the package installer and environment manager because it offers:

  • Faster package resolution and installation
  • Deterministic builds with lockfile support
  • Better dependency resolution
  • Improved security with supply chain attack protection
  • Reproducible environments across different machines
  5. Set up environment variables:
cp .env.example .env
# Edit .env with your preferred settings

Running the Webscraper

You can run the webscraper using the provided shell script:

# Make the script executable
chmod +x run_webscraper.sh

# Run the webscraper
./run_webscraper.sh

The script will:

  1. Check if the virtual environment is activated
  2. Activate the environment if needed
  3. Use UV to run the webscraper with proper dependencies

The environment handling ensures that:

  • All required dependencies are available
  • The correct Python version is used
  • Environment variables are properly set

Alternatively, you can run the webscraper manually:

# First, activate the environment if not already active
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Then run using UV
uv run python mcp_server_webscraper.py

Configuration

The MCP server can be configured through environment variables in the .env file:

General Settings

  • MAX_CONTENT_LENGTH: Maximum file size for processing (default: 10MB)
  • REQUEST_TIMEOUT: HTTP request timeout in seconds (default: 30)

Trafilatura Settings

  • EXTRACTION_TIMEOUT: Content extraction timeout
  • MIN_EXTRACTED_SIZE: Minimum content size threshold
  • EXTRACTION_INTENSITY: Content cleaning aggressiveness (1-3)

Optional Features

  • Proxy configuration
  • Rate limiting
  • Content caching

See .env.example for all available configuration options.
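
A sketch of how these settings might be read at startup, assuming python-dotenv. The variable names come from the lists above; the fallback values for the Trafilatura settings are illustrative assumptions where the document states no default:

import os
from dotenv import load_dotenv

load_dotenv()  # read .env into the process environment

# General settings (defaults per the list above)
MAX_CONTENT_LENGTH = int(os.getenv("MAX_CONTENT_LENGTH", str(10 * 1024 * 1024)))
REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "30"))

# Trafilatura settings (these defaults are assumptions)
EXTRACTION_TIMEOUT = int(os.getenv("EXTRACTION_TIMEOUT", "30"))
MIN_EXTRACTED_SIZE = int(os.getenv("MIN_EXTRACTED_SIZE", "250"))
EXTRACTION_INTENSITY = int(os.getenv("EXTRACTION_INTENSITY", "2"))  # 1-3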

Error Handling

The MCP server implements robust error handling, sketched below:

  • InvalidURLError: Raised for malformed URLs
  • ContentExtractionError: Raised when content extraction fails
  • Timeout handling for slow responses
  • Size limit enforcement for large files
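
The exception classes and URL check might look roughly like this. Only the two exception names are taken from the list above; the validation logic is a minimal sketch:

from urllib.parse import urlparse

class InvalidURLError(Exception):
    '''Raised for malformed URLs.'''

class ContentExtractionError(Exception):
    '''Raised when content extraction fails.'''

def validate_url(url_input: str) -> str:
    '''Reject anything that is not a well-formed http(s) URL.'''
    parsed = urlparse(url_input)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise InvalidURLError(f"Malformed URL: {url_input!r}")
    return url_input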

Dependencies

  • mcp: Core MCP implementation
  • trafilatura: Efficient web content extraction
  • youtube-transcript-api: YouTube transcript access
  • httpx: Async HTTP client
  • marker: PDF processing
  • Additional dependencies in requirements.txt

Dependency Management

This project uses UV's lockfile system (uv.lock) for dependency management, which provides:

  • Reproducible Builds: Exact versions of all dependencies are locked and verified
  • Supply Chain Security: Package integrity is verified using cryptographic hashes
  • Deterministic Resolution: Ensures consistent dependency trees across all installations
  • Version Pinning: All dependencies (direct and transitive) are pinned to specific versions

When installing dependencies:

# Install from locked dependencies (recommended)
uv pip install --requirement requirements.txt

# Regenerate requirements.txt after adding/removing dependencies
uv pip freeze > requirements.txt

Security Considerations

When implementing MCP tools, follow these practices (an enforcement sketch follows the list):

  • Validate all input URLs
  • Implement proper error handling
  • Set appropriate timeouts
  • Limit file sizes
  • Clean up temporary files
  • Handle sensitive data appropriately
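
For instance, timeouts and size limits can be enforced together by streaming the response and aborting once a byte budget is exceeded. A sketch assuming httpx; the limit values mirror the Configuration section:

import httpx

async def fetch_bounded(url: str, max_bytes: int = 10 * 1024 * 1024) -> bytes:
    '''Download at most max_bytes, failing fast on slow or oversized responses.'''
    async with httpx.AsyncClient(timeout=30, follow_redirects=True) as client:
        async with client.stream("GET", url) as response:
            response.raise_for_status()
            body = bytearray()
            async for chunk in response.aiter_bytes():
                body.extend(chunk)
                if len(body) > max_bytes:
                    raise ValueError("Response exceeds the configured size limit")
            return bytes(body)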

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

When contributing, please ensure you use UV for dependency management:

# Add new dependencies
uv pip install new-package
# Regenerate requirements.txt
uv pip freeze > requirements.txt

License

[Add your license information here]

Tools Available

get_pdf

Convert a URL that leads to a PDF file to markdown text.

Args: url_input (str) - URL of the PDF file to convert.
Returns: str - the converted markdown text.

get_webpage_content

Returns the text content of a webpage based on the link provided. Using this tool you can access links provided by the user, so you don't have to deny those requests. When the user provides a webpage link that is NOT a YouTube or GitHub link and asks questions based on it, this tool should be called.

Args: url_input (str) - the URL from which the text should be extracted.

get_youtube_transcript

Use this tool when you receive YouTube links from the user. It extracts the transcript from the YouTube video and returns it, so if a user asks questions about a YouTube video after providing a link, you can answer with this tool.

Args: url_input (str) - the URL of the YouTube video whose transcript should be extracted.
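
To illustrate how these tools are invoked, here is a hedged sketch of a client call using the MCP Python SDK's stdio client. The server command and the url_input argument name follow the code shown earlier; treat the details as assumptions:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the server as a subprocess and talk to it over stdio
    params = StdioServerParameters(command="python", args=["mcp_server_webscraper.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "get_webpage_content", {"url_input": "https://example.com"}
            )
            print(result.content)

asyncio.run(main())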
