A powerful and flexible web scraping tool built with Python that can discover, categorize, and download various types of digital files from websites.

    web-scrapper/
    │
    ├── main.py             # Main application entry point
    ├── script.py           # Alternative/legacy script
    ├── requirements.txt    # Python dependencies
    ├── README.md           # Project documentation
    └── utils/
        ├── config.py         # Configuration settings
        ├── scraper.py        # Core scraping functionality
        ├── downloader.py     # File download utilities
        ├── scraper_utils.py  # Helper utilities
        └── utils.py          # Additional utilities

- Multi-format Support: Scrapes various file types (documents, images, videos, audio, etc.)
- Smart Categorization: Automatically categorizes files by type
- Domain Filtering: Optional filtering by domain
- Progress Tracking: Visual progress bars for downloads
- Logging: Comprehensive logging for debugging and monitoring
- Safe Downloads: Validates URLs and handles errors gracefully
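
As a rough illustration of the discovery and URL-validation features listed above, the sketch below collects candidate file URLs from a page's HTML. It is not the project's actual code: `LinkCollector` and `discover_links` are hypothetical names, and the standard-library `html.parser` is swapped in for BeautifulSoup so the example runs without third-party packages.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs found in href and src attributes."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []  # discovered URLs, in page order, without duplicates

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page URL.
                absolute = urljoin(self.base_url, value)
                # Keep only web URLs; skip mailto:, javascript:, and similar.
                if urlparse(absolute).scheme in ("http", "https"):
                    if absolute not in self.links:
                        self.links.append(absolute)


def discover_links(html: str, base_url: str) -> List[str]:
    """Return every http(s) link referenced by the given HTML (hypothetical helper)."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

In a real run the HTML would come from an HTTP response for the target URL; here it is passed in as a string so the parsing step can be tried in isolation.
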

To install and run: `cd web-scrapper`, then `pip install -r requirements.txt`, then `python main.py`.

- Enter Target URL: Provide the website URL to scrape
- Set Domain Filter (Optional): Filter results by specific domain
- Select Files: Choose which files to download from the discovered list
- Confirm Download: Review selections and start the download process
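
The optional domain filter in the steps above could work along these lines. This is a minimal sketch under assumptions: `filter_by_domain` is a hypothetical name, and treating subdomains as matches is a design choice made for this example, not something the project documents.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlparse


def filter_by_domain(urls: Iterable[str], domain: Optional[str]) -> List[str]:
    """Keep URLs whose host equals `domain` or is a subdomain of it.

    An empty or missing domain disables filtering, mirroring the
    "optional" nature of the prompt.
    """
    if not domain:
        return list(urls)
    wanted = domain.lower().lstrip(".")
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        # Compare on a label boundary so "notexample.com" does not match "example.com".
        if host == wanted or host.endswith("." + wanted):
            kept.append(url)
    return kept
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives such as a domain name appearing in a query string.
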
- Documents: PDF, DOC, DOCX, TXT, etc.
- Images: JPG, PNG, GIF, SVG, etc.
- Videos: MP4, AVI, MOV, etc.
- Audio: MP3, WAV, FLAC, etc.
- Archives: ZIP, RAR, TAR, etc.
- And many more...
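
One plausible way to map the file types above to categories is an extension lookup. The sketch below assumes a hypothetical `FILE_CATEGORIES` mapping and `categorize` function, and lists only the extensions named in this README; the project's real mapping may be larger and named differently.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical mapping built from the extensions listed in this README.
FILE_CATEGORIES = {
    "Documents": {".pdf", ".doc", ".docx", ".txt"},
    "Images": {".jpg", ".png", ".gif", ".svg"},
    "Videos": {".mp4", ".avi", ".mov"},
    "Audio": {".mp3", ".wav", ".flac"},
    "Archives": {".zip", ".rar", ".tar"},
}


def categorize(url: str) -> str:
    """Return the category for a URL based on its file extension."""
    # Use only the path so query strings like "?dl=1" do not hide the extension.
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    for category, extensions in FILE_CATEGORIES.items():
        if ext in extensions:
            return category
    return "Others"
```

For example, `categorize("https://example.com/report.PDF?dl=1")` returns `"Documents"`, while a URL with no recognized extension falls back to `"Others"`.
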
Edit utils/config.py to customize:
- File type categories
- Save directory paths
- Download settings
- Logging preferences
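
To give a sense of what such settings might look like, here is a small sketch of a settings module. Every name in it (`SAVE_DIR`, `REQUEST_TIMEOUT`, `LOG_LEVEL`, `category_dir`, `configure_logging`) is an assumption made for illustration; check `utils/config.py` for the names the project actually uses.

```python
import logging
from pathlib import Path

# Hypothetical settings; the real names in utils/config.py may differ.
SAVE_DIR = Path("downloads")   # root folder for downloaded files
REQUEST_TIMEOUT = 10           # seconds to wait for a response
LOG_LEVEL = "INFO"             # one of DEBUG, INFO, WARNING, ERROR


def category_dir(category: str, base: Path = SAVE_DIR) -> Path:
    """Return the folder for a category, creating it if it is missing."""
    target = base / category
    target.mkdir(parents=True, exist_ok=True)
    return target


def configure_logging(level: str = LOG_LEVEL) -> logging.Logger:
    """Return a logger named "scraper" set to the configured level."""
    logger = logging.getLogger("scraper")
    logger.setLevel(getattr(logging, level.upper()))
    return logger
```

Keeping paths and levels as module-level constants means a user can change behavior by editing one file, without touching the scraping or download code.
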
- Downloads are saved to a `downloads/` folder (automatically created)
- Log files track all scraping activities
- Respects robots.txt and implements polite scraping practices
- Always ensure you have permission to scrape target websites
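
A robots.txt check of the kind mentioned above can be done with Python's standard `urllib.robotparser`. The sketch below is an assumption about how such a check might look, not the project's implementation; `build_robots_checker` and `is_allowed` are hypothetical helper names, and the rules are parsed from in-memory lines so the example runs offline.

```python
from typing import Iterable
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_lines: Iterable[str]) -> RobotFileParser:
    """Build a parser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "*") -> bool:
    """Return True if the rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


# Example rules, as they might appear in a site's robots.txt.
rules = ["User-agent: *", "Disallow: /private/"]
checker = build_robots_checker(rules)
allowed_public = is_allowed(checker, "https://example.com/public/a.pdf")
allowed_private = is_allowed(checker, "https://example.com/private/b.pdf")
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file over the network instead of parsing supplied lines.
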
# 🌐 Python Web Scraper
A simple, efficient, and easy-to-use Python web scraper built to extract digital files (images, videos, documents, etc.) from URLs on websites.
---
## ✨ Features
- ✅ **Simple URL scraping** for multiple file types.
- 📁 **Automatically organizes downloads** into categories: Images, Videos, Audio, Documents, and Others.
- 🛠️ **User-friendly prompts** for easy navigation.
- 📈 **Progress bars** to visually track downloads.
- 🔒 **Secure**: Automatically sanitizes filenames and URLs.
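
Filename sanitization, as mentioned in the last feature, might look like the sketch below. `safe_filename` is a hypothetical helper and the allowed-character set is an assumption; the idea is to derive a local filename from a URL without letting path separators or odd characters through.

```python
import posixpath
import re
from urllib.parse import unquote, urlparse


def safe_filename(url: str, default: str = "download") -> str:
    """Derive a filesystem-safe filename from a URL."""
    # Decode percent-escapes first so encoded slashes cannot smuggle in a path.
    path = unquote(urlparse(url).path)
    # Keep only the final path component, dropping any directories.
    name = posixpath.basename(path)
    # Replace anything outside a conservative whitelist with an underscore.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    # Avoid hidden files and names made only of dots.
    name = name.lstrip(".")
    return name or default
```

For instance, `safe_filename("https://example.com/files/my%20report.pdf?x=1")` returns `"my_report.pdf"`, and a URL with no filename falls back to `"download"`.
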
---
## 📌 Requirements
- Python 3.8 or higher
- Libraries: `requests`, `beautifulsoup4`, `tqdm`
```bash
pip install requests beautifulsoup4 tqdm
```

- Clone the repository: `git clone <repository-url>`, then `cd <repository-name>`
- Run the main script: `python main.py`
- Follow the on-screen prompts to input URLs and select files to download.
By using this tool, you agree to abide by the legal and ethical standards applicable in your jurisdiction.
Contributions are welcome! Please submit pull requests or open issues to improve this project.
For further inquiries, please open an issue or submit your questions through the repository.
🌟 Happy scraping! 🌟