AlexTkDev/PDF-to-Word-Conversion: A parser that retypes text from a PDF into an MS Word document according to given specifications

PDF-to-Word-Conversion

Description

The PDF-to-Word-Conversion project converts each page of a PDF document into a separate page of a Microsoft Word document while adhering to specific technical requirements. It offers two processing strategies: text extraction through the Google Cloud Vision API, and local processing with PyMuPDF and Tesseract.

Tech Stack

Google Cloud Vision API: For extracting text from images.
PyMuPDF: For working with PDF documents.
Tesseract OCR: For recognizing text in images.
pdf2image: For converting PDF pages into images.
Pillow: For image processing.
python-docx: For creating and editing Word documents.
tqdm: For displaying progress.
Flake8: For linting.
Pytest: For tests.

Installation

Clone the repository:

git clone https://github.com/AlexTkDev/PDF-to-Word-Conversion.git
cd PDF-to-Word-Conversion

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # For Windows use `venv\Scripts\activate`

Install dependencies:
```
pip install -r requirements.txt
```
Note: Ensure that Google Cloud SDK and Tesseract OCR are installed and configured.
Set up Google Cloud Vision API:
- Obtain a Google Cloud credentials file (JSON format).
- Set the path to this file in the environment variable GOOGLE_APPLICATION_CREDENTIALS.
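
A quick check that the variable is set and points to a real file can save a confusing authentication error later. The following is only a minimal sketch; the helper name `credentials_path` is hypothetical and not part of this repository.

```python
import os
from pathlib import Path

ENV_VAR = "GOOGLE_APPLICATION_CREDENTIALS"


def credentials_path(env=os.environ):
    """Return the credentials file path, or raise with a readable message.

    `env` is injectable so the check can be exercised without touching
    the real environment (an assumption made for this sketch).
    """
    raw = env.get(ENV_VAR)
    if not raw:
        raise RuntimeError(f"{ENV_VAR} is not set")
    path = Path(raw).expanduser()
    if not path.is_file():
        raise RuntimeError(f"{ENV_VAR} points to a missing file: {path}")
    return path
```

Calling `credentials_path()` before creating the Vision client would surface a missing or mistyped path early.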
Set up Tesseract OCR:
- For Windows:
  - Download the Tesseract OCR installer from the official repository or source.
  - Run the installer and follow the instructions.
  - Add the path to Tesseract in the PATH environment variable (e.g., C:\Program Files\Tesseract-OCR).
- For macOS:
  - Use Homebrew to install Tesseract. Open a terminal and run:
```
brew install tesseract
```
- For Linux:
  - Install Tesseract using a package manager. For example, on Ubuntu:
```
sudo apt-get update
sudo apt-get install tesseract-ocr
```
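- Installing Language Packages:
  - Tesseract needs a trained data file for each language it recognizes. On Ubuntu these are packaged as tesseract-ocr-<lang> (for example, tesseract-ocr-deu for German); on macOS, `brew install tesseract-lang` installs the additional languages; on Windows, the installer offers them as optional components.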
- Using in Python:
  - Install the pytesseract library:
```
pip install pytesseract
```
  - Ensure that pytesseract knows where Tesseract is installed:
```
import pytesseract

# Specify the path to Tesseract if it's not in PATH (Windows example)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
```
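
Hard-coding the Windows path makes the script less portable. One possible alternative, sketched below, is to look the binary up on PATH first and fall back to the default Windows location only if that fails. The function names `find_tesseract` and `configure_pytesseract` are illustrative assumptions, not part of this repository.

```python
import shutil
from pathlib import Path

# Default Windows install location mentioned above; adjust if yours differs.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(which=shutil.which, fallbacks=(WINDOWS_DEFAULT,)):
    """Return the path to the tesseract binary, or None if it cannot be found."""
    found = which("tesseract")  # searches the directories listed in PATH
    if found:
        return found
    for candidate in fallbacks:
        if Path(candidate).is_file():
            return candidate
    return None


def configure_pytesseract():
    """Point pytesseract at the detected binary (pytesseract is imported lazily)."""
    import pytesseract

    binary = find_tesseract()
    if binary is None:
        raise RuntimeError("Tesseract not found; install it or add it to PATH")
    pytesseract.pytesseract.tesseract_cmd = binary
    return binary
```

With this approach the same code runs unchanged on macOS and Linux, where the package managers place `tesseract` on PATH.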
Run the scripts:
- To process PDF using Google Cloud Vision API:
```
python PDF_processing_with_Google_Cloud_Vision_API.py
```
- To process PDF locally using PyMuPDF and Tesseract:
```
python processing_PDF_locally.py
```

Running tests

To run tests, use the command:

pytest tests/

Linting

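To check the code for standards compliance, use Flake8. The first command below prints the offending source line for each violation; the second reports violation counts without failing the run (`--exit-zero`), using a complexity limit of 10 and a maximum line length of 127: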

flake8 PDF_processing_with_Google_Cloud_Vision_API.py processing_PDF_locally.py --show-source --statistics
flake8 PDF_processing_with_Google_Cloud_Vision_API.py processing_PDF_locally.py --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

Notes

In the script PDF_processing_with_Google_Cloud_Vision_API.py, replace 'path_to_your_google_credentials.json' with the path to your Google Cloud credentials file.
In the script processing_PDF_locally.py, replace 'Open Project 100 images.pdf' with the path to your PDF document.
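
Editing the scripts for every new document is easy to get wrong. As a hedged alternative, the paths could be read from the command line instead. The sketch below is not how the scripts currently work; the `parse_args` function and the `-o/--output` option are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the PDF input path and the Word output path from the command line."""
    parser = argparse.ArgumentParser(description="Convert a PDF to a Word document")
    parser.add_argument("pdf", type=Path, help="path to the input PDF")
    parser.add_argument(
        "-o", "--output", type=Path, default=None,
        help="output .docx path (defaults to the PDF name with a .docx suffix)",
    )
    args = parser.parse_args(argv)
    if args.output is None:
        # Reuse the input file name, swapping only the extension.
        args.output = args.pdf.with_suffix(".docx")
    return args
```

For example, `parse_args(["Open Project 100 images.pdf"])` yields the input path as given and an output path with the same name and a `.docx` suffix.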

Project Structure

PDF_processing_with_Google_Cloud_Vision_API.py: Script for extracting text from images using Google Cloud Vision API and creating a Word document.
processing_PDF_locally.py: Script for extracting text from images using Tesseract and creating a Word document.
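
For orientation, the local strategy can be sketched roughly as follows: render each PDF page with PyMuPDF, run Tesseract on the image, and write each page's text to its own Word page. This is an illustration under stated assumptions, not the repository's actual code; the function names, the 300 DPI default, and the `eng` language are all hypothetical choices.

```python
import io
import re

# Control characters that XML (and therefore a .docx file) cannot store.
# Tesseract output often ends with a form feed (\x0c), which is one of them.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def sanitize(text):
    """Drop control characters that cannot be stored in a .docx file."""
    return _XML_ILLEGAL.sub("", text).strip()


def ocr_pdf_pages(pdf_path, dpi=300, lang="eng"):
    """Yield the OCR text of each page, rendered by PyMuPDF and read by Tesseract."""
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    with fitz.open(pdf_path) as pdf:
        for page in pdf:
            pixmap = page.get_pixmap(dpi=dpi)  # rasterize the page
            image = Image.open(io.BytesIO(pixmap.tobytes("png")))
            yield sanitize(pytesseract.image_to_string(image, lang=lang))


def write_docx(page_texts, docx_path):
    """Write one Word page per PDF page, separated by page breaks."""
    from docx import Document  # python-docx

    document = Document()
    for index, text in enumerate(page_texts):
        if index:
            document.add_page_break()  # start every page after the first on a new page
        document.add_paragraph(text)
    document.save(docx_path)
```

A call such as `write_docx(ocr_pdf_pages("input.pdf"), "output.docx")` would chain the two steps. The third-party imports sit inside the functions so that `sanitize` can be used and tested without PyMuPDF, Tesseract, or python-docx installed.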

Contributing

If you want to make changes to the project, fork the repository, make your changes, and create a Pull Request.

License

This project is licensed under the MIT License.

