
# Webscraper MCP

A web scraping tool built with Python that leverages the Model Context Protocol (MCP) to let AI assistants extract and process content from the web. This implementation provides a standardized interface through which AI models can access web pages, YouTube transcripts, and PDF documents.
## What is MCP?

The Model Context Protocol (MCP) is an open standard that enables secure, two-way connections between AI models and external data sources. In this implementation:

- **MCP Server**: The webscraper acts as an MCP server, providing standardized endpoints for content extraction
- **Tool Integration**: Defines clear interfaces for web scraping, YouTube transcript extraction, and PDF processing
- **Structured Communication**: Ensures consistent data exchange between AI models and web content
- **Security**: Implements proper error handling and content validation
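To make the server role concrete, here is a minimal sketch of an MCP server exposing one tool, assuming the `FastMCP` helper from the official `mcp` Python SDK (the actual wiring in this repository may differ; the `echo` tool is purely illustrative):

```python
# Minimal MCP server sketch, assuming the official `mcp` Python SDK.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("webscraper")

@mcp.tool()
async def echo(text: str) -> str:
    """Illustrative tool: returns its input unchanged."""
    return text

if __name__ == "__main__":
    mcp.run()  # serves requests over stdio by default
```

The `@mcp.tool()` decorator is the same pattern used by the three scraping tools below.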
## Features

- 🌐 **Web Page Scraping**: Extract main content and metadata from any webpage
  - Efficient text extraction with boilerplate removal
  - Metadata extraction (title, author, date)
  - Async processing for better performance
- 📺 **YouTube Transcript Extraction**
  - Support for multiple video URL formats
  - Handles both short and full URLs (see the sketch after this list)
  - Error handling for unavailable transcripts
- 📄 **PDF Processing**
  - Convert PDF documents to markdown text
  - Size limit protection
  - Automatic cleanup of temporary files
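As an illustration of the URL-format handling mentioned above, this is the kind of normalization involved; the helper name `extract_video_id` is hypothetical, not taken from the repository:

```python
from urllib.parse import parse_qs, urlparse

def extract_video_id(url: str) -> str:
    """Hypothetical helper: pull the video ID from short or full YouTube URLs."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":                # short form: youtu.be/<id>
        return parsed.path.lstrip("/")
    if parsed.hostname and parsed.hostname.endswith("youtube.com"):
        if parsed.path == "/watch":                  # full form: youtube.com/watch?v=<id>
            return parse_qs(parsed.query)["v"][0]
    raise ValueError(f"Unrecognized YouTube URL: {url}")
```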
## MCP Tools

This implementation provides three main MCP tools:

### 1. Web Page Content Extraction

```python
@mcp.tool()
async def get_webpage_content(url_input: str) -> str:
    '''Extract and process webpage content'''
```

### 2. YouTube Transcript Extraction

```python
@mcp.tool()
async def get_youtube_transcript(url_input: str) -> str:
    '''Extract video transcripts'''
```

### 3. PDF Content Extraction

```python
@mcp.tool()
async def get_pdf(url_input: str) -> str:
    '''Convert PDF to markdown'''
```
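The tool bodies are elided above. As a rough sketch of what the webpage pipeline might look like, assuming `httpx` for the download and `trafilatura` for boilerplate removal (both listed under Dependencies below):

```python
import httpx
import trafilatura

async def fetch_and_extract(url: str, timeout: float = 30.0) -> str:
    """Sketch: download a page and strip boilerplate down to plain text."""
    async with httpx.AsyncClient(timeout=timeout, follow_redirects=True) as client:
        response = await client.get(url)
        response.raise_for_status()
    # trafilatura removes navigation, ads, and other boilerplate
    text = trafilatura.extract(response.text)
    if text is None:
        raise ValueError(f"No extractable content at {url}")
    return text
```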
## Installation

1. Clone the repository:

```bash
git clone <repository-url>
cd webscraper
```

2. Install UV (if not already installed):

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

3. Create and activate a new environment with UV:

```bash
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

4. Install dependencies using UV:

```bash
# Install pinned dependencies for a reproducible environment
uv pip install --requirement requirements.txt

# If you need to add new dependencies:
uv pip install new-package
uv pip freeze > requirements.txt  # regenerate requirements.txt
```

UV is used as the package installer and environment manager because it offers:

- Faster package resolution and installation
- Deterministic builds with lockfile support
- Better dependency resolution
- Improved security with supply chain attack protection
- Reproducible environments across different machines

5. Set up environment variables:

```bash
cp .env.example .env
# Edit .env with your preferred settings
```
## Running the Webscraper

You can run the webscraper using the provided shell script:

```bash
# Make the script executable
chmod +x run_webscraper.sh

# Run the webscraper
./run_webscraper.sh
```

The script will:

- Check if the virtual environment is activated
- Activate the environment if needed
- Use UV to run the webscraper with the proper dependencies

This environment handling ensures that:

- All required dependencies are available
- The correct Python version is used
- Environment variables are properly set

Alternatively, you can run the webscraper manually:

```bash
# First, activate the environment if not already active
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Then run using UV
uv run python mcp_server_webscraper.py
```
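Once the server is running, any MCP client can connect to it and invoke the tools. A sketch using the stdio client from the official `mcp` Python SDK (the tool and script names are taken from this document; the example URL is arbitrary):

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the server as a subprocess and talk to it over stdio.
    params = StdioServerParameters(command="python", args=["mcp_server_webscraper.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "get_webpage_content", {"url_input": "https://example.com"}
            )
            print(result)

asyncio.run(main())
```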
## Configuration

The MCP server can be configured through environment variables in the `.env` file:

### General Settings

- `MAX_CONTENT_LENGTH`: Maximum file size for processing (default: 10 MB)
- `REQUEST_TIMEOUT`: HTTP request timeout in seconds (default: 30)

### Trafilatura Settings

- `EXTRACTION_TIMEOUT`: Content extraction timeout
- `MIN_EXTRACTED_SIZE`: Minimum content size threshold
- `EXTRACTION_INTENSITY`: Content cleaning aggressiveness (1-3)

### Optional Features

- Proxy configuration
- Rate limiting
- Content caching

See `.env.example` for all available configuration options.
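A sketch of how these variables might be read at startup; the variable names come from the lists above, but `python-dotenv` is an assumption rather than a confirmed dependency:

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is available

load_dotenv()  # reads key=value pairs from .env into the environment

# Defaults mirror the documented values: 10 MB size cap, 30 s timeout.
MAX_CONTENT_LENGTH = int(os.getenv("MAX_CONTENT_LENGTH", str(10 * 1024 * 1024)))
REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "30"))
EXTRACTION_INTENSITY = int(os.getenv("EXTRACTION_INTENSITY", "2"))  # 1-3
```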
## Error Handling

The MCP server implements robust error handling:

- `InvalidURLError`: Raised for malformed URLs
- `ContentExtractionError`: Raised when content extraction fails
- Timeout handling for slow responses
- Size limit enforcement for large files
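The exception names suggest a structure along the following lines; only the class names are taken from this document, the rest is a sketch:

```python
from urllib.parse import urlparse

class InvalidURLError(ValueError):
    """Raised for malformed or unsupported URLs."""

class ContentExtractionError(RuntimeError):
    """Raised when content extraction fails."""

def validate_url(url: str) -> str:
    """Sketch: require an http(s) URL with a host before fetching."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise InvalidURLError(f"Malformed URL: {url!r}")
    return url
```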
## Dependencies

- `mcp`: Core MCP implementation
- `trafilatura`: Efficient web content extraction
- `youtube-transcript-api`: YouTube transcript access
- `httpx`: Async HTTP client
- `marker`: PDF processing
- Additional dependencies in `requirements.txt`
## Dependency Management

This project uses UV's lockfile system (`uv.lock`) for dependency management, which provides:

- **Reproducible Builds**: Exact versions of all dependencies are locked and verified
- **Supply Chain Security**: Package integrity is verified using cryptographic hashes
- **Deterministic Resolution**: Ensures consistent dependency trees across all installations
- **Version Pinning**: All dependencies (direct and transitive) are pinned to specific versions

When installing dependencies:

```bash
# Install from pinned dependencies (recommended)
uv pip install --requirement requirements.txt

# Regenerate requirements.txt after adding/removing dependencies
uv pip freeze > requirements.txt
```
## Security Considerations

When implementing MCP tools:

- Validate all input URLs
- Implement proper error handling
- Set appropriate timeouts
- Limit file sizes
- Clean up temporary files (see the sketch below)
- Handle sensitive data appropriately
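For the size-limit and cleanup points, a context manager keeps the temporary file from leaking even when processing fails. A sketch, not the repository's code:

```python
import tempfile

def stage_pdf(data: bytes, max_bytes: int = 10 * 1024 * 1024) -> str:
    """Sketch: enforce a size cap and guarantee temp-file cleanup."""
    if len(data) > max_bytes:
        raise ValueError(f"File exceeds size limit of {max_bytes} bytes")
    # delete=True removes the file when the with-block exits, even on error
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=True) as tmp:
        tmp.write(data)
        tmp.flush()
        # a real implementation would run the PDF-to-markdown converter here,
        # while the file still exists, and return the converted text
        return tmp.name
```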
## Contributing

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request

When contributing, please ensure you use UV for dependency management:

```bash
# Add new dependencies
uv pip install new-package

# Regenerate requirements.txt
uv pip freeze > requirements.txt
```
## License
[Add your license information here]
## Acknowledgments
- Model Context Protocol for the standardized AI integration framework
- Trafilatura for efficient web content extraction
- youtube-transcript-api for YouTube transcript access
- UV for fast, reliable Python package management
## Tools Available

### get_pdf

Converts a URL that points to a PDF file into markdown text.

- **Args**: `url_input` (str): URL of the PDF file to convert
- **Returns**: str: the markdown text

### get_webpage_content

Returns the text content of a webpage based on the link provided. With this tool the assistant can access links provided by the user, so it does not have to deny those requests. When the user provides a webpage link that is NOT a YouTube or GitHub link and asks questions about it, this function should be called.

- **Args**: `url_input` (str): The URL from which the text should be extracted

### get_youtube_transcript

Use this tool when the user provides YouTube links. It extracts the transcript from the YouTube video and returns it, so questions about a YouTube video can be answered after the user provides a link.

- **Args**: `url_input` (str): The URL from which the text should be extracted