
Commit ac4cb6f

Refactors codebase to enhance modularity and maintainability
Introduces an object-oriented architecture by implementing distinct classes for file management and scraping logic. Improves code organization, making it easier to maintain and test while ensuring a clean separation of concerns. Updates logging and error handling practices for better debugging and user feedback. Relates to ongoing efforts for improved project structure.
1 parent d51c74e commit ac4cb6f

6 files changed: +317 -295 lines changed

README.md (+36 -18)
@@ -2,7 +2,11 @@
## Description

**Excel API Web Scraper** is a Python-based project that automates the process of web scraping, downloading, and storing Excel files from the NYC InfoHub website. The scraper dynamically discovers subpages, detects relevant Excel links (filtered by year), downloads them asynchronously, and saves only new or changed files. It employs an **object-oriented architecture**, splitting functionality across multiple classes to improve maintainability and testability (a minimal sketch of this layout follows the list):

- **`NYCInfoHubScraper`** (subclass of `BaseScraper`) provides specialized logic for discovering and filtering NYC InfoHub Excel links.
- **`FileManager`** encapsulates file I/O (saving, hashing, directory organization).
- **`BaseScraper`** handles core scraping concerns such as Selenium setup, asynchronous HTTP downloads, concurrency, and hashing.
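
The commit shown here only touches the README, so the following is just a minimal sketch of how the three classes described above might be laid out; the method names, signatures, and constructor arguments are assumptions for illustration, not the project's actual API.

```python
# Hypothetical layout of the three classes named in the README.
# Only the class names come from the source; everything else is illustrative.
import hashlib
from concurrent.futures import ProcessPoolExecutor

import httpx


class FileManager:
    """Encapsulates file I/O: saving files, hashing, and directory organization."""

    def __init__(self, data_dir: str = "data", hash_dir: str = "hashes"):
        self.data_dir = data_dir
        self.hash_dir = hash_dir

    @staticmethod
    def compute_hash(content: bytes) -> str:
        # SHA-256 is the hash the README says is used for change detection.
        return hashlib.sha256(content).hexdigest()


class BaseScraper:
    """Core scraping concerns: HTTP client, concurrency, and hashing helpers."""

    def __init__(self, file_manager: FileManager):
        self.files = file_manager
        self.client = httpx.AsyncClient(http2=True)  # persistent HTTP/2 session
        self.hash_pool = ProcessPoolExecutor()       # CPU-bound hashing off the event loop

    async def download(self, url: str) -> bytes:
        resp = await self.client.get(url)
        resp.raise_for_status()
        return resp.content


class NYCInfoHubScraper(BaseScraper):
    """Specialized logic for discovering and filtering NYC InfoHub Excel links."""

    async def discover_excel_links(self, base_url: str) -> list[str]:
        # Selenium-driven page discovery would live here; omitted in this sketch.
        return []
```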
This version features:

@@ -11,34 +15,38 @@ This version features:
- **Parallel CPU-bound hashing** with `ProcessPoolExecutor`
- **Detailed logging** with a rotating file handler
- **Progress tracking** via `tqdm`
- **Clean separation** of concerns thanks to the new classes (`FileManager`, `NYCInfoHubScraper`, etc.)

---

## Features

- **Web Scraping with Selenium**
  Automatically loads InfoHub pages (and sub-pages) in a headless Chrome browser to discover Excel file links.

- **Retries for Slow Connections**
  Uses `tenacity` to retry downloads when timeouts or transient errors occur.

- **Sub-Page Recursion**
  Uses a regex-based pattern to find and crawl subpages (e.g., graduation results, attendance data).

- **HTTP/2 Async Downloads**
  Downloads Excel files using `httpx` in **streaming mode**, allowing concurrent I/O while efficiently handling large files.

- **Year Filtering**
  Keeps only Excel files whose links contain at least one year >= 2018 (skips older or irrelevant data).

- **Parallel Hashing**
  Uses `ProcessPoolExecutor` to compute SHA-256 hashes in parallel, fully utilizing multi-core CPUs without blocking the async loop.

- **Prevents Redundant Downloads**
  Compares new file hashes with stored hashes; downloads only if the file has changed.

- **Progress & Logging**
  Progress bars via `tqdm` for both downloads and hashing. Detailed logs go to `logs/excel_fetch.log` (rotated at 5 MB, up to 2 backups).

- **Refactored OOP Architecture**
  With the introduction of `FileManager` for file operations, `BaseScraper` for shared scraping logic, and `NYCInfoHubScraper` for specialized InfoHub routines, the code is more modular and maintainable. (A sketch of the filtering, retry, and streaming-download flow appears after this list.)
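
None of the scraper source appears in this diff, so here is a hedged sketch of how the year filtering, `tenacity` retries, and `httpx` streaming downloads listed above could fit together. The year cutoff (>= 2018) and the use of streaming come from the README; the regex, retry policy, and chunk size are assumed values.

```python
# Illustrative sketch of the filter + retry + streaming-download flow.
import re

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

YEAR_PATTERN = re.compile(r"20\d{2}")


def is_relevant_excel_link(url: str) -> bool:
    """Keep Excel links that mention at least one year >= 2018."""
    if not url.lower().endswith((".xlsx", ".xls", ".xlsb")):
        return False
    return any(int(year) >= 2018 for year in YEAR_PATTERN.findall(url))


@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
async def fetch_excel(client: httpx.AsyncClient, url: str) -> bytes:
    """Stream the file in chunks so large workbooks never sit in a single buffer."""
    chunks = []
    async with client.stream("GET", url) as response:
        response.raise_for_status()
        async for chunk in response.aiter_bytes(chunk_size=64 * 1024):
            chunks.append(chunk)
    return b"".join(chunks)
```

A caller would typically reuse a single `httpx.AsyncClient(http2=True)` across all downloads, which is the connection-reuse point made in the performance notes further down.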

---

@@ -55,17 +63,19 @@ To install required packages:
```bash
pip install -r requirements.txt
```

**Dependencies**:

- `httpx[http2]`: For performing asynchronous HTTP requests with HTTP/2 support
- `tenacity`: For retrying failed downloads
- `selenium`: For web scraping
- `pandas`: For processing Excel files (optional)
- `tqdm`: To display download progress
- `concurrent.futures`: For multithreading
- `openpyxl`, `pyxlsb`, `xlrd`: For handling different Excel file types
- `pytest`, `pytest-asyncio`, `pytest-cov`: For module testing

---
@@ -148,10 +158,11 @@ Depending on where you prefer to run the scraper, you can pick one or both. Each
## Directory Structure

```text
project_root/
├── __init__.py           # Package initializer
├── .github               # Workflow CI/CD integration
├── .gitignore            # Ignore logs, venv, data, and cache files
├── .env                  # Environment variables (excluded from version control)
@@ -163,25 +174,27 @@ project_root/
├── venv_wsl/             # WSL Virtual Environment (ignored by version control)
├── venv_win/             # Windows Virtual Environment (ignored by version control)

├── src/
│   ├── main.py           # Main scraper script
│   └── excel_scraper.py  # Web scraping module

├── logs/                 # Directory for log files

├── tests/                # Directory for unit, integration, and end-to-end testing

├── data/                 # Directory for downloaded Excel files
│   ├── graduation/
│   ├── attendance/
│   ├── demographics/
│   ├── test_results/
│   └── other_reports/

└── hashes/               # Directory for storing file hashes
```

The structure is well-organized for both manual execution and packaging as a Python module.

---

@@ -285,6 +298,7 @@ A GitHub Actions workflow is set up in `.github/workflows/ci-cd.yml`. It:
2. Persistent HTTP Sessions: Using `httpx.AsyncClient` ensures that HTTP connections are reused, reducing overhead.
3. Efficient Hashing: Files are saved only if they have changed, as determined by a computed hash, so no unnecessary downloads are kept (a sketch of this check follows the list).
4. Excluded older datasets by adding `re` filtering logic so that only the latest available data is scraped.
5. Clearer Architecture: Splitting logic into `FileManager`, `BaseScraper`, and `NYCInfoHubScraper` has improved modularity and test coverage.
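
The hash-comparison code itself is not part of this diff; the sketch below only illustrates the idea behind item 3 above, combined with the parallel-hashing feature: compute SHA-256 in a `ProcessPoolExecutor` so the event loop is not blocked, and skip the write when the stored hash matches. File layout and function names are assumptions.

```python
# Illustrative only: SHA-256 change detection with hashing in a process pool.
import asyncio
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor


def sha256_digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


async def save_if_changed(pool: ProcessPoolExecutor, path: str, hash_path: str, data: bytes) -> bool:
    """Write the file only when its hash differs from the stored one."""
    loop = asyncio.get_running_loop()
    new_hash = await loop.run_in_executor(pool, sha256_digest, data)

    old_hash = None
    if os.path.exists(hash_path):
        with open(hash_path, "r", encoding="utf-8") as f:
            old_hash = f.read().strip()

    if new_hash == old_hash:
        return False  # unchanged; skip the redundant write

    with open(path, "wb") as f:
        f.write(data)
    with open(hash_path, "w", encoding="utf-8") as f:
        f.write(new_hash)
    return True
```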

---

@@ -295,14 +309,18 @@ A GitHub Actions workflow is set up in `.github/workflows/ci-cd.yml`. It:
- **Year Parsing**: If year formats differ (e.g., “19-20” instead of “2019-2020”), the regex must be adjusted or enhanced.
- **Retries**: Now incorporates an automatic retry strategy via `tenacity`. For highly customized or advanced retry logic, the code can be extended further.
- **Dual Virtual Environments**: Separate venvs are maintained (one for WSL and one for Windows). Both can run the script successfully when properly configured, via cron on WSL or Task Scheduler on Windows.
- **File Download Security**: Currently relies on chunked streaming and SHA-256 hashing for change detection, but does not verify the authenticity of the files. For higher security (a signature-check sketch follows this list):
  - Integrate virus scanning or malware checks after download.
  - Validate the MIME type or file signature to confirm each download is really an Excel file.
  - Use TLS certificate pinning if the host supports it.
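
As a concrete example of the file-signature suggestion above, a lightweight magic-byte check can confirm that a download is actually an Excel workbook before it is saved. This is only a sketch of one possible check under that suggestion, not something present in the project.

```python
# Hypothetical magic-byte check for downloaded "Excel" payloads.
# .xlsx and .xlsb files are ZIP containers; legacy .xls files use the
# OLE2 compound-document signature. Anything else is rejected.
XLSX_MAGIC = b"PK\x03\x04"                       # ZIP container (xlsx, xlsb)
XLS_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file (legacy xls)


def looks_like_excel(payload: bytes) -> bool:
    """Cheap signature check before writing the file to disk."""
    return payload.startswith(XLSX_MAGIC) or payload.startswith(XLS_MAGIC)


def validate_or_raise(url: str, payload: bytes) -> None:
    if not looks_like_excel(payload):
        raise ValueError(f"{url} did not return a recognizable Excel file")
```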

---

## **Other Potential Improvements**

- **Email Notifications**: Notify users when a new dataset is fetched.
- **Database Integration**: Store metadata in a database for better tracking.
- **More Robust Exception Handling**: Log specific error types or integrate external alerting.

---
