PDF Extractor API

Overview

The PDF Extractor API is a FastAPI application that extracts text and metadata from PDF files. Access is protected with JWT tokens, and rate limiting keeps API usage in check. Clients upload a PDF as a base64-encoded string, extract header and item data from it by supplying keywords and a prompt, and get the results back in a user-friendly format.

Features

  • Authentication: Secure API access with JWT tokens.
  • File Upload: Upload PDF files as base64-encoded strings.
  • PDF Extraction: Extract headers and items from PDF files.
  • Rate Limiting: Protect the API from excessive usage.

Getting Started

Follow these instructions to set up your development environment and run the application.

Prerequisites

  • Python 3.11+
  • Docker (optional, for containerized deployment)

Installation

  1. Clone the Repository

    git clone https://github.com/yourusername/pdf-extractor-api.git
    cd pdf-extractor-api
  2. Set Up a Virtual Environment

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  3. Install Dependencies

    pip install -r requirements.txt
  4. Configure Environment

    Create a config.json file in the root directory with the following content:

       {
          "client_id": "your_client_id",
          "client_secret": "your_client_secret",
          "url_auth": "your_auth_url",
          "api_url": "your_api_url",
          "access_token": "",
          "expires_at": ""
       }

    Replace the placeholders with your actual configuration values.
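
Since a placeholder left in config.json is easy to miss, a small startup check can help. The following is a minimal sketch, not the application's actual loader: the function name `load_config` and the rule that values starting with `your_` count as unset are assumptions made for illustration. The key names come from the example above.

```python
import json
from pathlib import Path

# Keys taken from the config.json example in this README.
# access_token and expires_at are left out because they start empty.
REQUIRED_KEYS = ("client_id", "client_secret", "url_auth", "api_url")


def load_config(path="config.json"):
    """Read the config file and fail early on missing or placeholder values.

    Hypothetical helper: treats any value beginning with "your_" as unset.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [
        key
        for key in REQUIRED_KEYS
        if not config.get(key) or str(config[key]).startswith("your_")
    ]
    if missing:
        raise ValueError(f"config.json is missing values for: {', '.join(missing)}")
    return config
```

Calling `load_config()` before starting the server would surface a forgotten placeholder as a clear error instead of a failed authentication request later.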

Running the Application

  1. Start the Server

    uvicorn main:app --host 0.0.0.0 --port 8000
  2. Access the API

    Open your browser or API client and navigate to http://localhost:8000/docs to access the interactive API documentation provided by FastAPI.

  3. API Endpoints

    • POST /token: Obtain an access token.
    • GET /users/me: Get information about the current user.
    • POST /upload: Upload a PDF file in base64 format.
    • POST /extract-header: Extract header information from a PDF.
    • POST /extract-items: Extract item information from a PDF.
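
Because POST /upload takes the PDF as a base64 string rather than a multipart file, the client has to encode the file first. The sketch below shows one way to prepare that request body. The helper name `build_upload_payload` is hypothetical, the `%PDF-` header check is an added sanity check and not something the API is known to require, and the `base64_string` field name is the one used in the upload example in this README.

```python
import base64
import json


def build_upload_payload(pdf_bytes: bytes) -> str:
    """Return the JSON body for POST /upload from raw PDF bytes."""
    # PDF files begin with the "%PDF-" marker; reject obviously wrong input early.
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("input does not look like a PDF (missing %PDF- header)")
    # b64encode returns bytes; decode to str so it can be embedded in JSON.
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return json.dumps({"base64_string": encoded})
```

Reading a file with `Path("invoice.pdf").read_bytes()` (the file name is a placeholder) and passing the result to this helper yields a string that can be sent as the request body.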

Example Usage

  1. Authenticate and Get a Token

    curl -X POST "http://localhost:8000/token" -H "Content-Type: application/x-www-form-urlencoded" -d "username=your_username&password=your_password"
  2. Upload a PDF File

    curl -X POST "http://localhost:8000/upload" -H "Content-Type: application/json" -d '{"base64_string": "your_base64_encoded_pdf"}'
  3. Extract Header

    curl -X POST "http://localhost:8000/extract-header" -H "Authorization: Bearer your_access_token" -H "Content-Type: application/json" -d '{"file_id": "your_file_id", "keywords": ["keyword1", "keyword2"], "prompt": "Extract the header from the PDF."}'
  4. Extract Items

    curl -X POST "http://localhost:8000/extract-items" -H "Authorization: Bearer your_access_token" -H "Content-Type: application/json" -d '{"file_id": "your_file_id", "keywords": ["keyword1", "keyword2"], "prompt": "Extract the items from the PDF."}'
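
The same calls can be made from Python. The following is a rough sketch that mirrors the curl commands above using only the standard library. The function names are illustrative, and the request fields (`file_id`, `keywords`, `prompt`) are copied from the examples in this README, not from the API's schema.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the uvicorn command in this README


def token_request(username: str, password: str) -> urllib.request.Request:
    """Build the form-encoded POST /token request."""
    body = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(
        f"{BASE_URL}/token",
        data=body.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def extract_request(endpoint, token, file_id, keywords, prompt):
    """Build an authenticated JSON POST for extract-header or extract-items."""
    payload = {"file_id": file_id, "keywords": keywords, "prompt": prompt}
    return urllib.request.Request(
        f"{BASE_URL}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With the server running, `send(token_request("your_username", "your_password"))` returns the token response. Reading an `access_token` field from it assumes the OAuth2-style response that FastAPI password flows conventionally return, which this README does not spell out. Building the requests separately from sending them keeps the payload logic easy to inspect without a live server.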

License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0). See the LICENSE file for more details.

Contribution

We welcome contributions to improve the PDF Extractor API. Please follow these steps to contribute:

  • Fork the repository.
  • Create a new branch for your changes.
  • Make your changes and test them.
  • Submit a pull request with a detailed description of your changes.

Contact

For any questions or support, please open an issue in the repository.