Webscraper MCP

By sektor10 (GitHub)

MCP server that extracts text content from webpages, YouTube videos, and PDFs for LLMs to use.

Overview

What is MCP Simple Scraper?

MCP Simple Scraper is a powerful web scraping tool built with Python that leverages the Model Context Protocol (MCP) to enable AI assistants to extract and process content from various web sources, including webpages, YouTube videos, and PDF documents.

How to use MCP Simple Scraper?

To use MCP Simple Scraper, set up the MCP server and utilize the standardized endpoints provided for content extraction from different sources. You can integrate it with AI models to facilitate data processing.

Key features of MCP Simple Scraper?

  • Acts as an MCP server for standardized content extraction.
  • Supports web scraping, YouTube transcript extraction, and PDF processing.
  • Ensures structured communication and security with proper error handling.

Use cases of MCP Simple Scraper?

  1. Extracting text content from various web pages for analysis.
  2. Retrieving transcripts from YouTube videos for further processing.
  3. Processing PDF documents to extract relevant information for AI models.

FAQ about MCP Simple Scraper?

  • What is the Model Context Protocol (MCP)?

MCP is an open standard that enables secure, two-way connections between AI models and external data sources.

  • Is MCP Simple Scraper free to use?

Yes! MCP Simple Scraper is open-source and available for anyone to use.

  • What programming language is MCP Simple Scraper built with?

MCP Simple Scraper is built with Python.

Content

Webscraper MCP

A powerful web scraping tool built with Python that leverages the Model Context Protocol (MCP) to enable AI assistants to extract and process content from various web sources. This implementation provides a standardized interface for AI models to access and process web pages, YouTube transcripts, and PDF documents.

What is MCP?

The Model Context Protocol (MCP) is an open standard that enables secure, two-way connections between AI models and external data sources. In this implementation:

  • MCP Server: The webscraper acts as an MCP server, providing standardized endpoints for content extraction (a minimal server sketch follows this list)
  • Tool Integration: Defines clear interfaces for web scraping, YouTube transcript extraction, and PDF processing
  • Structured Communication: Ensures consistent data exchange between AI models and web content
  • Security: Implements proper error handling and content validation
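
For orientation, here is a minimal sketch of how such a server is typically wired up with the FastMCP helper from the official MCP Python SDK. The server name and the placeholder tool are illustrative assumptions, not this project's exact code:

# Minimal MCP server sketch (illustrative; not the project's verbatim code)
from mcp.server.fastmcp import FastMCP

# Create the server; the name is what MCP clients see
mcp = FastMCP("webscraper")

@mcp.tool()
async def ping(url_input: str) -> str:
    '''Placeholder tool: echoes the URL back to the client.'''
    return f"received: {url_input}"

if __name__ == "__main__":
    # Serve over stdio, the default transport for local MCP servers
    mcp.run()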

Features

  • 🌐 Web Page Scraping: Extract main content and metadata from any webpage
    • Efficient text extraction with boilerplate removal
    • Metadata extraction (title, author, date)
    • Async processing for better performance
  • 📺 YouTube Transcript Extraction
    • Support for multiple video URL formats
    • Handles both short and full URLs
    • Error handling for unavailable transcripts
  • 📄 PDF Processing
    • Convert PDF documents to markdown text
    • Size limit protection
    • Automatic cleanup of temporary files

MCP Tools

This implementation provides three main MCP tools:

1. Web Page Content Extraction

@mcp.tool()
async def get_webpage_content(url_input: str) -> str:
    '''Extract and process webpage content'''
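
A plausible implementation sketch for this tool, using the httpx and trafilatura dependencies listed later in this README; the specific options and error message are assumptions rather than the project's verbatim code:

import httpx
import trafilatura

async def get_webpage_content(url_input: str) -> str:
    '''Extract and process webpage content (sketch).'''
    # Fetch the page asynchronously with a bounded timeout
    async with httpx.AsyncClient(timeout=30, follow_redirects=True) as client:
        response = await client.get(url_input)
        response.raise_for_status()
    # Strip boilerplate (navigation, ads) and keep the main text
    text = trafilatura.extract(response.text, url=url_input, include_comments=False)
    if text is None:
        # The real tool would likely raise ContentExtractionError (see Error Handling)
        raise ValueError(f"No extractable content at {url_input}")
    return text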

2. YouTube Transcript Extraction

@mcp.tool()
async def get_youtube_transcript(url_input: str) -> str:
    '''Extract video transcripts'''
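
One way this tool could be implemented with the youtube-transcript-api package. The URL parsing and the classic get_transcript call are assumptions; recent releases of the package expose a slightly different, instance-based API:

from urllib.parse import urlparse, parse_qs
from youtube_transcript_api import YouTubeTranscriptApi

def extract_video_id(url_input: str) -> str:
    '''Handle both short (youtu.be/ID) and full (youtube.com/watch?v=ID) URLs.'''
    parsed = urlparse(url_input)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    ids = parse_qs(parsed.query).get("v")
    if not ids:
        raise ValueError(f"No video ID found in {url_input!r}")
    return ids[0]

async def get_youtube_transcript(url_input: str) -> str:
    '''Extract video transcripts (sketch).'''
    video_id = extract_video_id(url_input)
    # Classic API; newer versions use YouTubeTranscriptApi().fetch(video_id)
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(segment["text"] for segment in segments)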

3. PDF Content Extraction

@mcp.tool()
async def get_pdf(url_input: str) -> str:
    '''Convert PDF to markdown'''
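
A sketch of the PDF flow: download with a size cap, convert, and always clean up the temporary file. The pdf_to_markdown helper is a hypothetical stand-in for the project's marker-based conversion step:

import os
import tempfile
import httpx

MAX_CONTENT_LENGTH = 10 * 1024 * 1024  # 10 MB cap, per the Configuration section

def pdf_to_markdown(path: str) -> str:
    '''Hypothetical stand-in for the marker-based PDF conversion step.'''
    raise NotImplementedError("plug in the marker PDF-to-markdown conversion here")

async def get_pdf(url_input: str) -> str:
    '''Convert PDF to markdown (sketch).'''
    async with httpx.AsyncClient(timeout=30, follow_redirects=True) as client:
        response = await client.get(url_input)
        response.raise_for_status()
    if len(response.content) > MAX_CONTENT_LENGTH:
        raise ValueError("PDF exceeds the configured size limit")
    # Write to a temporary file for the converter, and always remove it afterwards
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(response.content)
        tmp.close()
        return pdf_to_markdown(tmp.name)
    finally:
        os.unlink(tmp.name)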

Installation

  1. Clone the repository:
git clone <repository-url>
cd webscraper
  2. Install UV (if not already installed):
curl -LsSf https://astral.sh/uv/install.sh | sh
  3. Create and activate a new environment with UV:
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  4. Install dependencies using UV:
# Install from lock file for reproducible environment
uv pip install --requirement requirements.txt

# If you need to add new dependencies:
uv pip install new-package
uv pip freeze > requirements.txt  # Regenerates requirements.txt from the current environment

UV is used as the package installer and environment manager because it offers:

  • Faster package resolution and installation
  • Deterministic builds with lockfile support
  • Better dependency resolution
  • Improved security with supply chain attack protection
  • Reproducible environments across different machines
  5. Set up environment variables:
cp .env.example .env
# Edit .env with your preferred settings

Running the Webscraper

You can run the webscraper using the provided shell script:

# Make the script executable
chmod +x run_webscraper.sh

# Run the webscraper
./run_webscraper.sh

The script will:

  1. Check if the virtual environment is activated
  2. Activate the environment if needed
  3. Use UV to run the webscraper with proper dependencies

The environment handling ensures that:

  • All required dependencies are available
  • The correct Python version is used
  • Environment variables are properly set

Alternatively, you can run the webscraper manually:

# First, activate the environment if not already active
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Then run using UV
uv run python mcp_server_webscraper.py

Configuration

The MCP server can be configured through environment variables in the .env file:

General Settings

  • MAX_CONTENT_LENGTH: Maximum file size for processing (default: 10MB)
  • REQUEST_TIMEOUT: HTTP request timeout in seconds (default: 30)

Trafilatura Settings

  • EXTRACTION_TIMEOUT: Content extraction timeout
  • MIN_EXTRACTED_SIZE: Minimum content size threshold
  • EXTRACTION_INTENSITY: Content cleaning aggressiveness (1-3)

Optional Features

  • Proxy configuration
  • Rate limiting
  • Content caching

See .env.example for all available configuration options.
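
A sketch of how these settings might be read at startup, assuming python-dotenv. The variable names come from the lists above; the fallback values for the Trafilatura settings are illustrative assumptions where the document states no default:

import os
from dotenv import load_dotenv

load_dotenv()  # read .env into the process environment

# General settings (defaults per the list above)
MAX_CONTENT_LENGTH = int(os.getenv("MAX_CONTENT_LENGTH", str(10 * 1024 * 1024)))
REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "30"))

# Trafilatura settings (these defaults are assumptions)
EXTRACTION_TIMEOUT = int(os.getenv("EXTRACTION_TIMEOUT", "30"))
MIN_EXTRACTED_SIZE = int(os.getenv("MIN_EXTRACTED_SIZE", "250"))
EXTRACTION_INTENSITY = int(os.getenv("EXTRACTION_INTENSITY", "2"))  # 1-3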

Error Handling

The MCP server implements robust error handling, sketched below:

  • InvalidURLError: Raised for malformed URLs
  • ContentExtractionError: Raised when content extraction fails
  • Timeout handling for slow responses
  • Size limit enforcement for large files
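
The exception classes and URL check might look roughly like this. Only the two exception names are taken from the list above; the validation logic is a minimal sketch:

from urllib.parse import urlparse

class InvalidURLError(Exception):
    '''Raised for malformed URLs.'''

class ContentExtractionError(Exception):
    '''Raised when content extraction fails.'''

def validate_url(url_input: str) -> str:
    '''Reject anything that is not a well-formed http(s) URL.'''
    parsed = urlparse(url_input)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise InvalidURLError(f"Malformed URL: {url_input!r}")
    return url_input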

Dependencies

  • mcp: Core MCP implementation
  • trafilatura: Efficient web content extraction
  • youtube-transcript-api: YouTube transcript access
  • httpx: Async HTTP client
  • marker: PDF processing
  • Additional dependencies in requirements.txt

Dependency Management

This project uses UV's lockfile system (uv.lock) for dependency management, which provides:

  • Reproducible Builds: Exact versions of all dependencies are locked and verified
  • Supply Chain Security: Package integrity is verified using cryptographic hashes
  • Deterministic Resolution: Ensures consistent dependency trees across all installations
  • Version Pinning: All dependencies (direct and transitive) are pinned to specific versions

When installing dependencies:

# Install from locked dependencies (recommended)
uv pip install --requirement requirements.txt

# Regenerate requirements.txt after adding/removing dependencies
uv pip freeze > requirements.txt

Security Considerations

When implementing MCP tools, follow these practices (an enforcement sketch follows the list):

  • Validate all input URLs
  • Implement proper error handling
  • Set appropriate timeouts
  • Limit file sizes
  • Clean up temporary files
  • Handle sensitive data appropriately
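
For instance, timeouts and size limits can be enforced together by streaming the response and aborting once a byte budget is exceeded. A sketch assuming httpx; the limit values mirror the Configuration section:

import httpx

async def fetch_bounded(url: str, max_bytes: int = 10 * 1024 * 1024) -> bytes:
    '''Download at most max_bytes, failing fast on slow or oversized responses.'''
    async with httpx.AsyncClient(timeout=30, follow_redirects=True) as client:
        async with client.stream("GET", url) as response:
            response.raise_for_status()
            body = bytearray()
            async for chunk in response.aiter_bytes():
                body.extend(chunk)
                if len(body) > max_bytes:
                    raise ValueError("Response exceeds the configured size limit")
            return bytes(body)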

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

When contributing, please ensure you use UV for dependency management:

# Add new dependencies
uv pip install new-package
# Regenerate requirements.txt
uv pip freeze > requirements.txt

License

[Add your license information here]

Tools Available

get_pdf

Convert a URL that leads to a PDF file to markdown text.

Args: url_input (str) - URL of the PDF file to convert.
Returns: str - the converted markdown text.

get_webpage_content

Returns the text content of a webpage based on the link provided. Using this tool you can access links provided by the user, so you don't have to deny those requests. When the user provides a webpage link that is NOT a YouTube or GitHub link and asks questions based on it, this tool should be called.

Args: url_input (str) - the URL from which the text should be extracted.

get_youtube_transcript

Use this tool when you receive YouTube links from the user. It extracts the transcript from the YouTube video and returns it, so if a user asks questions about a YouTube video after providing a link, you can answer with this tool.

Args: url_input (str) - the URL of the YouTube video whose transcript should be extracted.
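
To illustrate how these tools are invoked, here is a hedged sketch of a client call using the MCP Python SDK's stdio client. The server command and the url_input argument name follow the code shown earlier; treat the details as assumptions:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the server as a subprocess and talk to it over stdio
    params = StdioServerParameters(command="python", args=["mcp_server_webscraper.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "get_webpage_content", {"url_input": "https://example.com"}
            )
            print(result.content)

asyncio.run(main())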
