add examples for Local models #59

Merged: 6 commits merged on Apr 12, 2024
2 changes: 1 addition & 1 deletion .github/workflows/pylint.yml
@@ -20,4 +20,4 @@ jobs:
pip install pylint
pip install -r requirements.txt
- name: Analysing the code with pylint
-  run: pylint --disable=C0114,C0115,C0116 --exit-zero scrapegraphai/**/*.py scrapegraphai/*.py examples/**/*.py tests/**/*.py
+  run: pylint --disable=C0114,C0115,C0116 --exit-zero scrapegraphai/**/*.py scrapegraphai/*.py
39 changes: 35 additions & 4 deletions README.md
@@ -43,14 +43,45 @@ Check out also the docusaurus [documentation](https://scrapegraph-doc.onrender.c
You can use the `SmartScraper` class to extract information from a website using a prompt.

The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
-### Case 1: Extracting information using a local LLM
+### Case 1: Extracting information using Ollama
Remember to download the model on Ollama separately!
```python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    }
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the news with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

```
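
Since the model has to be pulled into Ollama separately, a quick pre-flight check can save a failed run. The sketch below is not part of this PR. It assumes the Ollama server exposes its `/api/tags` endpoint (which lists locally available models) at the same `base_url` used above; `list_local_models` and `model_is_available` are hypothetical helper names.

```python
import json
import urllib.request


def list_local_models(base_url="http://localhost:11434", timeout=5):
    """Return the model names reported by Ollama's /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]


def model_is_available(config_model, local_names):
    """Check an 'ollama/<name>' config value against names such as 'mistral:latest'."""
    wanted = config_model.split("/", 1)[-1]  # "ollama/mistral" -> "mistral"
    if ":" in wanted:
        # An explicit tag was requested, so require an exact match.
        return wanted in local_names
    # No tag requested: accept any locally available tag of that model.
    return any(name.split(":", 1)[0] == wanted for name in local_names)
```

With a running server, `model_is_available("ollama/mistral", list_local_models())` would tell you whether the model still needs to be downloaded; `list_local_models` raises an `OSError` subclass if the server is not reachable.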

### Case 2: Extracting informations using Docker

Note: before using the local model, remember to create the Docker container!
```text
docker-compose up -d
docker exec -it ollama ollama run stablelm-zephyr
```
-You can use whichever model you want instead of stablelm-zephyr
+You can use any model available on Ollama, or your own model, instead of stablelm-zephyr
```python
from scrapegraphai.graphs import SmartScraperGraph

@@ -75,7 +106,7 @@ print(result)
```
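
The Ollama example in this diff configures the LLM and the embeddings with the same few keys, so switching model or server URL is mostly a matter of changing two strings. A minimal sketch of a builder for that dict follows. `make_ollama_config` is a hypothetical helper, not part of ScrapeGraphAI; the key names are copied from the Ollama example, and both the `http://ollama:11434` hostname and the assumption that `stablelm-zephyr` accepts the same keys should be checked against your own docker-compose setup.

```python
def make_ollama_config(llm_model,
                       embed_model="ollama/nomic-embed-text",
                       base_url="http://localhost:11434"):
    """Build a graph_config dict shaped like the Ollama example above."""
    return {
        "llm": {
            "model": llm_model,
            "temperature": 0,
            "format": "json",  # the example sets this explicitly for Ollama
            "base_url": base_url,
        },
        "embeddings": {
            "model": embed_model,
            "temperature": 0,
            "base_url": base_url,
        },
    }


# Host install vs. a container reachable under the hostname "ollama"
# (the hostname is an assumption; adjust it to your setup).
local_cfg = make_ollama_config("ollama/mistral")
docker_cfg = make_ollama_config("ollama/stablelm-zephyr",
                                base_url="http://ollama:11434")
print(local_cfg["llm"]["base_url"], docker_cfg["llm"]["base_url"])
```

Either dict can then be passed as `config=` to `SmartScraperGraph` in the same way as `graph_config`.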


-### Case 2: Extracting information using OpenAI model
+### Case 3: Extracting information using OpenAI model
```python
from scrapegraphai.graphs import SmartScraperGraph
OPENAI_API_KEY = "YOUR_API_KEY"
@@ -98,7 +129,7 @@ result = smart_scraper_graph.run()
print(result)
```

-### Case 3: Extracting information using Gemini
+### Case 4: Extracting information using Gemini
```python
from scrapegraphai.graphs import SmartScraperGraph
GOOGLE_APIKEY = "YOUR_API_KEY"
1 change: 1 addition & 0 deletions examples/gemini/readme.md
@@ -0,0 +1 @@
This folder contains an example of how to use ScrapeGraph-AI with Gemini, a large language model (LLM) from Google AI. The example shows how to extract information from a website using a natural language prompt.
2 changes: 0 additions & 2 deletions examples/gemini/results/result.csv

This file was deleted.

1 change: 0 additions & 1 deletion examples/gemini/results/result.json

This file was deleted.

Empty file.
@@ -6,27 +6,14 @@
# ************************************************
# Define the configuration for the graph
# ************************************************
"""
Available models:
- ollama/llama2
- ollama/mistral
- ollama/codellama
- ollama/dolphin-mixtral
- ollama/mistral-openorca
"""

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        # "model_tokens": 2000, # set context length arbitrarily,
        # "base_url": "http://ollama:11434", # set ollama URL arbitrarily
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
    }
}

# ************************************************
120 changes: 120 additions & 0 deletions examples/local_models/Ollama/inputs/books.xml
@@ -0,0 +1,120 @@
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
<book id="bk110">
<author>O'Brien, Tim</author>
<title>Microsoft .NET: The Programming Bible</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in
detail in this deep programmer's reference.</description>
</book>
<book id="bk111">
<author>O'Brien, Tim</author>
<title>MSXML3: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in
detail, with attention to XML DOM interfaces, XSLT processing,
SAX and more.</description>
</book>
<book id="bk112">
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
looking at how Visual Basic, Visual C++, C#, and ASP+ are
integrated into a comprehensive development
environment.</description>
</book>
</catalog>
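
Because `books.xml` is small and regular, a deterministic parse can serve as a reference when judging what an LLM-based scraper extracts from it. The sketch below is not part of this PR; it uses only the standard library on a two-record excerpt of the file above, and `books_from_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Two records copied from the catalog above, shortened to a few fields.
SAMPLE = """<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
  </book>
</catalog>"""


def books_from_xml(xml_text):
    """Return one dict per <book> element with a few of its fields."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": book.get("id"),
            "author": book.findtext("author"),
            "title": book.findtext("title"),
            "genre": book.findtext("genre"),
            "price": float(book.findtext("price")),
        }
        for book in root.iter("book")
    ]


books = books_from_xml(SAMPLE)
for entry in books:
    print(entry["id"], entry["title"], entry["price"])
```

Running the same function on the full file should yield twelve entries, which gives a simple count to compare a scraper's output against.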
105 changes: 105 additions & 0 deletions examples/local_models/Ollama/inputs/plain_html_example.txt
@@ -0,0 +1,105 @@
<body class="fixed-top-nav " style="padding-top: 57px;">
<header>
<nav id="navbar" class="navbar navbar-light navbar-expand-sm fixed-top">
<div class="container">
<a class="navbar-brand title font-weight-lighter" href="/"><span class="font-weight-bold">Marco&nbsp;</span>Perini</a> <button class="navbar-toggler collapsed ml-auto" type="button" data-toggle="collapse" data-target="#navbarNav" aria-controls="navbarNav" aria-expanded="false" aria-label="Toggle navigation"> <span class="sr-only">Toggle navigation</span> <span class="icon-bar top-bar"></span> <span class="icon-bar middle-bar"></span> <span class="icon-bar bottom-bar"></span> </button>
<div class="collapse navbar-collapse text-right" id="navbarNav">
<ul class="navbar-nav ml-auto flex-nowrap">
<li class="nav-item "> <a class="nav-link" href="/">About</a> </li>
<li class="nav-item dropdown active">
<a class="nav-link dropdown-toggle" href="#" id="navbarDropdown" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">Projects<span class="sr-only">(current)</span></a>
<div class="dropdown-menu dropdown-menu-right" aria-labelledby="navbarDropdown">
<a class="dropdown-item" href="/projects/">Projects</a>
<div class="dropdown-divider"></div>
<a class="dropdown-item" href="/competitions/">Competitions</a>
</div>
</li>
<li class="nav-item "> <a class="nav-link" href="/cv/">CV</a> </li>
<li class="toggle-container"> <button id="light-toggle" title="Change theme"> <i class="fa-solid fa-moon"></i> <i class="fa-solid fa-sun"></i> </button> </li>
</ul>
</div>
</div>
</nav>
<progress id="progress" value="0" max="284" style="top: 57px;">
<div class="progress-container"> <span class="progress-bar"></span> </div>
</progress>
</header>
<div class="container mt-5">
<div class="post">
<header class="post-header">
<h1 class="post-title">Projects</h1>
<p class="post-description"></p>
</header>
<article>
<div class="projects">
<div class="grid" style="position: relative; height: 861.992px;">
<div class="grid-sizer"></div>
<div class="grid-item" style="position: absolute; left: 0px; top: 0px;">
<a href="/projects/rotary-pendulum-rl/">
<div class="card hoverable">
<figure>
<picture> <img src="/assets/img/rotary_pybullet.jpg" width="auto" height="auto" alt="project thumbnail" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"> </picture>
</figure>
<div class="card-body">
<h4 class="card-title">Rotary Pendulum RL</h4>
<p class="card-text">Open Source project aimed at controlling a real life rotary pendulum using RL algorithms</p>
<div class="row ml-1 mr-1 p-0"> </div>
</div>
</div>
</a>
</div>
<div class="grid-sizer"></div>
<div class="grid-item" style="position: absolute; left: 260px; top: 0px;">
<a href="https://github.com/PeriniM/DQN-SwingUp" rel="external nofollow noopener" target="_blank">
<div class="card hoverable">
<figure>
<picture> <img src="/assets/img/value-policy-heatmaps.jpg" width="auto" height="auto" alt="project thumbnail" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"> </picture>
</figure>
<div class="card-body">
<h4 class="card-title">DQN Implementation from scratch</h4>
<p class="card-text">Developed a Deep Q-Network algorithm to train a simple and double pendulum</p>
<div class="row ml-1 mr-1 p-0"> </div>
</div>
</div>
</a>
</div>
<div class="grid-sizer"></div>
<div class="grid-item" style="position: absolute; left: 0px; top: 447.414px;">
<a href="https://github.com/PeriniM/Multi-Agents-HAED" rel="external nofollow noopener" target="_blank">
<div class="card hoverable">
<figure>
<picture> <img src="/assets/img/multi_agents_haed.gif" width="auto" height="auto" alt="project thumbnail" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"> </picture>
</figure>
<div class="card-body">
<h4 class="card-title">Multi Agents HAED</h4>
<p class="card-text">University project which focuses on simulating a multi-agent system to perform environment mapping. Agents, equipped with sensors, explore and record their surroundings, considering uncertainties in their readings.</p>
<div class="row ml-1 mr-1 p-0"> </div>
</div>
</div>
</a>
</div>
<div class="grid-sizer"></div>
<div class="grid-item" style="position: absolute; left: 260px; top: 370.172px;">
<a href="/projects/wireless-esc-drone/">
<div class="card hoverable">
<figure>
<picture> <img src="/assets/img/wireless_esc.gif" width="auto" height="auto" alt="project thumbnail" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"> </picture>
</figure>
<div class="card-body">
<h4 class="card-title">Wireless ESC for Modular Drones</h4>
<p class="card-text">Modular drone architecture proposal and proof of concept. The project received maximum grade.</p>
<div class="row ml-1 mr-1 p-0"> </div>
</div>
</div>
</a>
</div>
</div>
</div>
</article>
</div>
</div>
<footer class="fixed-bottom">
<div class="container mt-0"> © Copyright 2023 Marco Perini. Powered by <a href="https://jekyllrb.com/" target="_blank" rel="external nofollow noopener">Jekyll</a> with <a href="https://github.com/alshedivat/al-folio" rel="external nofollow noopener" target="_blank">al-folio</a> theme. Hosted by <a href="https://pages.github.com/" target="_blank" rel="external nofollow noopener">GitHub Pages</a>. </div>
</footer>
<div class="hiddendiv common"></div>
</body>
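
The project cards in this page follow a fixed pattern (`h4.card-title` followed by `p.card-text`), so the expected answer to a prompt about titles and descriptions can be derived without a model. The following is a minimal sketch using the standard library's `html.parser`, shown on a two-card excerpt of the markup above; `CardParser` is a hypothetical name and the class is not part of this PR.

```python
from html.parser import HTMLParser

# Two cards copied from the page above, reduced to the relevant elements.
SAMPLE = """
<div class="card-body">
<h4 class="card-title">Rotary Pendulum RL</h4>
<p class="card-text">Open Source project aimed at controlling a real life rotary pendulum using RL algorithms</p>
</div>
<div class="card-body">
<h4 class="card-title">DQN Implementation from scratch</h4>
<p class="card-text">Developed a Deep Q-Network algorithm to train a simple and double pendulum</p>
</div>
"""


class CardParser(HTMLParser):
    """Collect title/description pairs from card-title and card-text elements."""

    def __init__(self):
        super().__init__()
        self.cards = []
        self._field = None  # "title" or "text" while inside a matching tag

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h4" and "card-title" in classes:
            self.cards.append({"title": "", "text": ""})
            self._field = "title"
        elif tag == "p" and "card-text" in classes and self.cards:
            self._field = "text"

    def handle_endtag(self, tag):
        if tag in ("h4", "p"):
            self._field = None

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._field:
            self.cards[-1][self._field] += data


parser = CardParser()
parser.feed(SAMPLE)
parser.close()
for card in parser.cards:
    print(card["title"], "->", card["text"])
```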
Empty file.
55 changes: 55 additions & 0 deletions examples/local_models/Ollama/scrape_plain_text_ollama.py
@@ -0,0 +1,55 @@
"""
Basic example of scraping pipeline using SmartScraper from text
"""

import os
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import convert_to_csv, convert_to_json

# ************************************************
# Read the text file
# ************************************************

FILE_NAME = "inputs/plain_html_example.txt"
curr_dir = os.path.dirname(os.path.realpath(__file__))
file_path = os.path.join(curr_dir, FILE_NAME)

# It could also be an HTTP request using the requests module
with open(file_path, 'r', encoding="utf-8") as file:
    text = file.read()

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        # "model_tokens": 2000, # set context length arbitrarily
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        "base_url": "http://localhost:11434",
    }
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the news with their description.",
    source=text,
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# Save to json or csv
convert_to_csv(result, "result")
convert_to_json(result, "result")
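
The script above reads its input from a local file and notes that an HTTP request would work as well. A minimal sketch of a loader covering both cases follows, using only the standard library instead of a third-party HTTP client. `load_source` is a hypothetical helper that is not part of this PR, and the fallback to UTF-8 when the server declares no charset is an assumption.

```python
import urllib.request


def load_source(location, timeout=10):
    """Return text from an http(s) URL or from a local file path."""
    if location.startswith(("http://", "https://")):
        with urllib.request.urlopen(location, timeout=timeout) as resp:
            # Use the charset from the Content-Type header when present.
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    with open(location, "r", encoding="utf-8") as file:
        return file.read()
```

The returned string could be passed as `source=` in place of `text`, since the README in this PR states that the graph also accepts already downloaded HTML.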