Excel API Web Scraper is a Python-based project that automates the process of web scraping, downloading, and storing Excel files from NYC InfoHub. It features a modular, object-oriented design with built-in security checks (virus scanning and MIME-type validation) for downloaded Excel files, with an option to skip antivirus scans on Windows if ClamAV isn’t readily available.
- Asynchronous HTTP/2 downloads via `httpx.AsyncClient`
- Recursive subpage discovery with Selenium
- Parallel CPU-bound hashing with `ProcessPoolExecutor`
- Detailed logging with a rotating file handler
- Progress tracking via `tqdm`
- SecurityManager for optional virus scanning (ClamAV) and file-type checks
- Skips ClamAV scanning on Windows by default, avoiding setup complexities while still functioning seamlessly
- **Web Scraping with Selenium**
  - Automates loading InfoHub pages (and sub-pages) in a headless Chrome browser to discover Excel file links.
- **Retries for Slow Connections**
  - Uses `tenacity` to retry downloads when timeouts or transient errors occur.
- **Sub-Page Recursion**
  - Follows a regex-based pattern to find and crawl subpages (e.g., graduation results, attendance data).
- **HTTP/2 Async Downloads**
  - Downloads Excel files using `httpx` in streaming mode, allowing concurrent I/O while efficiently handling large files.
- **Year Filtering**
  - Only keeps Excel files that have at least one year ≥ 2018 in the link (skips older or irrelevant data).
- **Parallel Hashing**
  - Uses `ProcessPoolExecutor` to compute SHA-256 hashes in parallel, fully utilizing multi-core CPUs without blocking the async loop.
- **Security Checks**
  - In-memory virus scanning with ClamAV (via `pyclamd` or `clamd`)
  - MIME-type validation with `python-magic`, ensuring files are truly Excel
  - Scanning is skipped on Windows by default (see below)
- **Prevents Redundant Downloads**
  - Compares new file hashes with stored hashes; downloads only if the file has changed.
- **Progress & Logging**
  - `tqdm` for progress bars during downloads and hashing.
  - Detailed logs in `logs/excel_fetch.log` (rotated at 5 MB, up to 2 backups).
By default, ClamAV scanning is not performed on Windows to avoid environment complexities; ClamAV is primarily a Linux/UNIX daemon. The `SecurityManager` class checks `platform.system()`, and if it's `Windows`, it short-circuits scanning and returns a clean status. This behavior can be overridden by setting:

`security_manager = SecurityManager(skip_windows_scan=False)`

…in which case the code will attempt a normal ClamAV call. To make this work on Windows, you'd typically need:
- WSL or Docker to run the ClamAV daemon, or
- Another setup that exposes a ClamAV socket/port for Python to connect to.
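The snippet below is a minimal sketch of that skip-on-Windows pattern. Only the `SecurityManager` name, the `skip_windows_scan` flag, the `scan_for_viruses` method, and the clean-result behavior come from this README; the internals and helper names are illustrative and may differ from the project's actual code.

```python
import platform


class SecurityManagerSketch:
    """Illustrative sketch of the skip-on-Windows behavior described above.

    The constructor flag, method name, and clean-result convention come from
    this README; the internals are assumptions.
    """

    def __init__(self, skip_windows_scan: bool = True):
        self.skip_windows_scan = skip_windows_scan

    def scan_for_viruses(self, file_bytes: bytes) -> bool:
        # On Windows, short-circuit and report a clean result unless the
        # caller explicitly opted back in with skip_windows_scan=False.
        if self.skip_windows_scan and platform.system() == "Windows":
            return True

        # On Linux/WSL (or with a Docker-exposed socket), the bytes would be
        # handed to the ClamAV daemon here; see the clamd sketch further below.
        return self._clamav_scan(file_bytes)

    def _clamav_scan(self, file_bytes: bytes) -> bool:
        raise NotImplementedError("wire this up to clamd/pyclamd in a real setup")
```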
- Python 3.8 or higher
- ChromeDriver (installed and in your PATH for Selenium)
- ClamAV daemon (optional but recommended for virus scanning):
  - On Linux/WSL, install via `sudo apt-get install clamav clamav-daemon`.
  - On Windows, you may run ClamAV in WSL or a Docker container, or skip virus scanning.
Install required packages:
pip install -r requirements.txt
Dependencies:
- `httpx[http2]`: For performing asynchronous HTTP requests with HTTP/2 support
- `tenacity`: For retrying failed downloads
- `selenium`: For web scraping
- `pandas`: For processing Excel files (optional)
- `tqdm`: To display download progress
- `concurrent.futures`: For parallel hashing with `ProcessPoolExecutor` (Python standard library)
- `openpyxl`, `pyxlsb`, `xlrd`: For handling different Excel file types
- `pytest`, `pytest-asyncio`, `pytest-cov`: For module testing
- `clamd` or `pyclamd`: For ClamAV virus scanning (if enabled)
- `python-magic`: For MIME-type checks
This project supports two primary scheduling methods:
- WSL / Linux Cron Jobs
- Windows Task Scheduler
Depending on where you prefer to run the scraper, you can pick one or both. Each method automates running the script on a specified interval (e.g., daily, weekly, monthly).
- **Why Use Cron**: If you're working in a Linux-like environment (including WSL), cron allows you to schedule commands at fixed times, dates, or intervals without any additional GUIs.
- **Setup**:
  - Create a WSL-based virtual environment (e.g., `venv_wsl`) and install the project dependencies (including `selenium`, `httpx`, etc.).
  - Edit your cron table: `crontab -e`
  - Add a line for your desired schedule, for example, to run every Sunday at 2:30 AM:
    `30 2 * * 0 /path/to/project/venv_wsl/bin/python /path/to/project/src/main.py >> /path/to/logs/cron_crawl.log 2>&1`
  - Confirm that WSL remains active or open; if WSL is shut down, cron will not run.
- **Advantages**:
  - Well-known, lightweight system utility.
  - Great if you're primarily in Linux/WSL and comfortable with Bash.
- **Caveats**:
  - WSL suspends when no console is running, so if you close WSL entirely, cron won't run.
  - If you need a 24/7 environment, consider a real Linux server or keep WSL open.
- **Why Use Task Scheduler**: On a Windows system, Task Scheduler is the built-in service for scheduling tasks when the machine is on, without needing WSL open.
- **Setup**:
  - Create a Windows-based virtual environment (e.g., `venv_win`) in your project folder:
    - `cd C:\path\to\project`
    - `python -m venv venv_win`
    - `.\venv_win\Scripts\activate`
    - `pip install -r requirements.txt`
  - Open Task Scheduler (`taskschd.msc` from Run or Start).
  - Create a Basic Task:
    - Name: e.g. "NYC InfoHub Scraper (Weekly)."
    - Trigger: Weekly (pick day/time).
    - Action: "Start a program."
    - Program/Script: `C:\path\to\project\venv_win\Scripts\python.exe`
    - Arguments: `C:\path\to\project\src\main.py`
  - Finish & test by right-clicking the task → "Run."
- **Advantages**:
  - Runs natively in Windows, even if WSL is closed.
  - Easy to configure via a GUI.
- **Caveats**:
  - If your environment or script references Linux paths, adapt them to Windows.
  - For storing logs, you can rely on Python's rotating file handler or redirect output to a custom file (e.g., `task_crawl.log`).
project_root/
│
├── .github # Workflow CI/CD integration
├── .gitignore # Ignore logs, venv, data, and cache files
├── .env # Environment variables (excluded from version control)
├── README.md # Project documentation
├── requirements.txt # Project dependencies
├── setup.py # Project packaging file
├── pyproject.toml # Specify build system requirements
├── LICENSE # License file
│
├── venv_wsl/ # WSL Virtual Environment (ignored by version control)
├── venv_win/ # Windows Virtual Environment (ignored by version control)
│
├── src/
│ ├── __init__.py # Package initializer
│ ├── main.py # Main scraper script
│ └── excel_scraper.py # Web scraping module
│
├── logs/ # Directory for log files
│
├── tests/ # Directory for unit, integration, and end-to-end testing
│
├── data/ # Directory for downloaded Excel files
│ ├── graduation/
│ ├── attendance/
│ ├── demographics/
│ ├── test_results/
│ └── other_reports/
│
└── hashes/ # Directory for storing file hashes
The structure is well-organized for both manual execution and packaging as a Python module.
- Run the script to scrape and fetch new datasets:
  `python -m src.main`
  This treats `src` as a package and executes `main.py` accordingly (avoiding relative import errors).
- View logs for download status and debugging:
  `tail -f logs/excel_fetch.log`
- **Subpage Discovery**
  - The scraper uses a regex (`SUB_PAGE_PATTERN`) to find subpages like graduation-results, school-quality, etc.
- **Filtered Excel Links**
  - Each subpage is loaded in Selenium; `<a>` tags ending in `.xls`/`.xlsx`/`.xlsb` are collected, then filtered out if they do not contain a relevant year (≥ 2018). See the link-discovery sketch below.
- **Async Streaming Download**
  - Downloads use `httpx.AsyncClient(http2=True)` to fetch files in parallel. A progress bar (`tqdm`) shows how many files are in flight. See the download and hashing sketch below.
- **Security Checks**
  - In-memory virus scanning (ClamAV) via `SecurityManager.scan_for_viruses`.
  - MIME-type validation (ensures a `.xls`/`.xlsx` file is truly an Excel file).
- **Parallel Hashing**
  - Each downloaded file's hash is computed in a `ProcessPoolExecutor` so multiple CPU cores can do the work simultaneously.
- **Save if Changed**
  - If the file's new hash differs from the previously stored one, the file is saved and the `.hash` file updated.
- **Logs**
  - The rotating log captures successes, skips, errors, etc.
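As a rough illustration of the discovery and filtering steps, the sketch below loads a page in headless Chrome, collects `<a>` links ending in Excel extensions, and keeps only those that mention a year ≥ 2018. The function name and year regex are hypothetical; the project's actual `SUB_PAGE_PATTERN` and filtering logic may differ.

```python
import re
from typing import List

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Hypothetical year regex; the real pattern may be stricter.
YEAR_PATTERN = re.compile(r"(20\d{2})")


def discover_excel_links(url: str, min_year: int = 2018) -> List[str]:
    """Load a page in headless Chrome and return Excel links with a recent year."""
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        hrefs = [
            a.get_attribute("href")
            for a in driver.find_elements(By.CSS_SELECTOR, "a[href]")
        ]
    finally:
        driver.quit()

    excel_links = [
        href for href in hrefs
        if href and href.lower().endswith((".xls", ".xlsx", ".xlsb"))
    ]
    # Keep only links that mention at least one year >= min_year.
    return [
        href for href in excel_links
        if any(int(year) >= min_year for year in YEAR_PATTERN.findall(href))
    ]
```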
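The download-and-hash pipeline can be condensed to something like the following, assuming hypothetical helper names and the `hashes/` directory of `.hash` files described above. The real scraper also runs the ClamAV/MIME checks before anything is written; that step is omitted here for brevity.

```python
import asyncio
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor
from typing import Dict, List

import httpx
from tqdm import tqdm


def sha256_of_file(path: str) -> str:
    """CPU-bound hashing helper, executed in worker processes."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


async def download_file(client: httpx.AsyncClient, url: str, dest: str) -> str:
    # Stream the response so large workbooks never sit fully in memory.
    async with client.stream("GET", url) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as f:
            async for chunk in resp.aiter_bytes():
                f.write(chunk)
    return dest


async def fetch_and_hash(urls_to_paths: Dict[str, str]) -> None:
    # One persistent HTTP/2-capable client provides connection pooling.
    async with httpx.AsyncClient(http2=True, timeout=30.0) as client:
        tasks = [download_file(client, url, dest) for url, dest in urls_to_paths.items()]
        paths: List[str] = [
            await finished
            for finished in tqdm(asyncio.as_completed(tasks), total=len(tasks))
        ]

    # Hash across CPU cores, then update the stored .hash only when it changed.
    os.makedirs("hashes", exist_ok=True)
    with ProcessPoolExecutor() as pool:
        for path, digest in zip(paths, pool.map(sha256_of_file, paths)):
            hash_path = os.path.join("hashes", os.path.basename(path) + ".hash")
            previous = None
            if os.path.exists(hash_path):
                with open(hash_path) as f:
                    previous = f.read().strip()
            if digest != previous:
                with open(hash_path, "w") as f:
                    f.write(digest)
```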
- By default, if ClamAV is installed and `SecurityManager` is configured, files are scanned in memory before saving.
- If you run on Windows without ClamAV, you can disable scanning by returning `"OK"` or `True` in `scan_for_viruses`, or by using a stub.
- Make sure `StreamMaxLength` and other ClamAV config values (in `/etc/clamav/clamd.conf`) are large enough to handle your file sizes.
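When the daemon is available (on Linux/WSL, or exposed over TCP from Docker), an in-memory scan with the `clamd` package looks roughly like this; the socket defaults and function name are assumptions for a typical install, not the project's exact code.

```python
import io

import clamd

# Typical Linux/WSL setup: clamd listening on its default Unix socket.
# For a daemon exposed over TCP (e.g., from Docker), use:
#   clamd.ClamdNetworkSocket("127.0.0.1", 3310)
daemon = clamd.ClamdUnixSocket()


def is_clean(file_bytes: bytes) -> bool:
    # INSTREAM scans the bytes without writing them to disk. The reply looks
    # like {"stream": ("OK", None)} or {"stream": ("FOUND", "Signature-Name")}.
    # Files larger than StreamMaxLength are rejected by the daemon.
    status, _signature = daemon.instream(io.BytesIO(file_bytes))["stream"]
    return status == "OK"
```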
- Includes:
  - Log file: `logs/excel_fetch.log`
  - Rotating File Handler: Rolls over at 5 MB, retains 2 backups.
  - Console Output: Also mirrors log messages for convenience.
  - Progress Bars: `tqdm` for both downloading and hashing steps.
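For reference, a standard-library `logging` setup along these lines reproduces the rotation policy described above; the logger name and message format are illustrative rather than the project's exact configuration.

```python
import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("excel_scraper")
logger.setLevel(logging.INFO)

# Roll the file over at 5 MB and keep two backups, as described above.
file_handler = RotatingFileHandler(
    "logs/excel_fetch.log", maxBytes=5 * 1024 * 1024, backupCount=2
)
# Mirror the same messages to the console.
console_handler = logging.StreamHandler()

formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
for handler in (file_handler, console_handler):
    handler.setFormatter(formatter)
    logger.addHandler(handler)

logger.info("Logging configured")
```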
We use Pytest for our test suite, located in the `tests/` folder.
- Install dev/test dependencies (either via your `setup.py` or via `pip install -r requirements.txt` if you listed them there).
- Run tests:
  `python -m pytest tests/`
- View coverage (if you have `pytest-cov`):
  `python -m pytest tests/ --cov=src`
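Async code paths are exercised with `pytest-asyncio`. A minimal test in that style (the file and test names here are illustrative, not part of the actual suite) looks like:

```python
# tests/test_async_example.py
import asyncio

import pytest


@pytest.mark.asyncio
async def test_event_loop_is_available():
    # pytest-asyncio runs coroutine tests inside an event loop, which is how
    # the scraper's async download paths can be tested end to end.
    await asyncio.sleep(0)
    assert True
```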
A GitHub Actions workflow is set up in `.github/workflows/ci-cd.yml`. It:
- Builds and tests the project on each push or pull request to the `main` branch.
- If tests pass and you push a tagged release, builds a distribution and can upload it to PyPI using Twine.
- Check the Actions tab on your repo to see logs and statuses of each workflow run.
- Connection Pooling: Addressed by a persistent `httpx.AsyncClient`.
- Redundant Downloads: Prevented by storing file hashes and only updating on changes.
- Virus Scan Overhead: In-memory scanning might add overhead, but ensures security.
- Size Limit Errors: If you see "INSTREAM: Size limit reached" warnings, increase `StreamMaxLength` in `clamd.conf`.
- Windows Skipping: If you can't run ClamAV natively, the skip mechanism means the scraper still works without throwing errors.
- HTTP/2 Support: If the server lacks HTTP/2, it falls back to HTTP/1.1 automatically.
- Depth Control: Subpage recursion is limited (by default depth=1). Extend if deeper crawling is needed.
- Year Parsing: Skips links that don’t contain a straightforward 4-digit year (≥ 2018). Update the regex for more complex date formats.
- ClamAV on Windows: Typically requires WSL, Docker, or disabling the scanning code.
- Custom Retry Logic: For advanced back-off patterns, modify the `tenacity` settings or override them in your subclass (see the sketch below).
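As a starting point for such tweaks, a retry policy that combines exponential back-off with httpx's transient error types might look like the sketch below; the wait bounds, attempt count, and exception classes are assumptions, not the project's shipped defaults.

```python
import httpx
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)


# Retry only on transient network problems; give up after five attempts.
@retry(
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.TransportError)),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    stop=stop_after_attempt(5),
    reraise=True,
)
async def fetch_with_retries(client: httpx.AsyncClient, url: str) -> bytes:
    response = await client.get(url)
    response.raise_for_status()
    return response.content
```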
- Email Notifications: Notify users when a new dataset is fetched.
- Database Integration: Store metadata in a database for better tracking.
- More Robust Exception Handling: Log specific error types or integrate external alerting.
- Integrating BeautifulSoup: Incorporate `BeautifulSoup` instead of Selenium for a faster, lighter-weight approach to raw web scraping.
This project is licensed under the MIT License. See the LICENSE file for details.
Developed by Dylan Picart at Partnership With Children.
For questions or contributions, contact: dpicart@partnershipwithchildren.org.