diff --git a/APIExample.md b/APIExample.md
index 14a0322..defb429 100644
--- a/APIExample.md
+++ b/APIExample.md
@@ -325,7 +325,7 @@ print 'Tesseract-ocr version', tesseract_version
 print result_text
 ```
-Example of passing python file object to C-API can be found at [pastebin](http://pastebin.com/yDTkNfNm).
+Example of passing python file object to C-API can be found at [pastebin](https://pastebin.com/yDTkNfNm).
 Example of extracting orientation from Tesseract 4.0:
diff --git a/AddOns.md b/AddOns.md
index 11d6eca..d12e13c 100644
--- a/AddOns.md
+++ b/AddOns.md
@@ -10,7 +10,7 @@ Platform support depends on used language and experience of user.
 #### Box file editors
-[jTessBoxEditor](http://vietocr.sourceforge.net/training.html)
+[jTessBoxEditor](https://vietocr.sourceforge.net/training.html)
 ### For Tesseract 3.0x
@@ -18,13 +18,13 @@ Platform support depends on used language and experience of user.
 | **Name** | **Last update** | **Language** | Multipage support |
 |:---------|:----------------|:-------------|:------------------|
-| [jTessBoxEditor](http://vietocr.sourceforge.net/training.html) | 2023 | Java | yes |
-| [QT Box Editor](http://zdenop.github.com/qt-box-editor/) | 2019 | C++, Qt4/Qt5 | yes |
+| [jTessBoxEditor](https://vietocr.sourceforge.net/training.html) | 2023 | Java | yes |
+| [QT Box Editor](https://zdenop.github.com/qt-box-editor/) | 2019 | C++, Qt4/Qt5 | yes |
 | [tesseract-box-editor](https://github.com/scotts48/tesseract-box-editor) | 2013 | .NET 4 | yes |
-| [Tesseract-OCR boxfile AJAX editor](http://pp19dd.com/tesseract-ocr-chopper/) | 2012 | online tool |
-| [cowboxer](http://code.google.com/p/cowboxer/) | 2012 | C++, Qt4 | no |
-| [moshPyTT ](http://code.google.com/p/moshpytt/) | 2011 | Python, GTK2 | no |
-| [pytesseracttrainer](http://code.google.com/p/pytesseracttrainer/) | 2011 | Python, GTK2 | no |
+| [Tesseract-OCR boxfile AJAX editor](https://pp19dd.com/tesseract-ocr-chopper/) | 2012 | online tool |
+| [cowboxer](https://code.google.com/p/cowboxer/) | 2012 | C++, Qt4 | no |
+| [moshPyTT ](https://code.google.com/p/moshpytt/) | 2011 | Python, GTK2 | no |
+| [pytesseracttrainer](https://code.google.com/p/pytesseracttrainer/) | 2011 | Python, GTK2 | no |
 ### For Tesseract-OCR 2.0x
@@ -34,36 +34,36 @@ Platform support depends on used language and experience of user.
 | **Name** | **Last update** | **Language** |
 |:---------|:----------------|:-------------|
-| [Tesseract-OCR boxfile AJAX editor](http://pp19dd.com/tesseract-ocr-chopper/) | 2012 | online tool |
-| [owlboxer](http://code.google.com/p/owlboxer/) | 2010 | C++, Qt4 |
-| [Tessboxer](http://sites.google.com/site/spilkaondrej) | 2009 | .NET |
-| [boxfilereader.php](http://tesseract-ocr.googlecode.com/files/boxfilereader.php) | 2009 | php |
-| [tessboxes](http://www.lbreyer.com/tessboxes.html) | 2008 | C |
-| [JTesseract](http://code.google.com/p/jtesseract/) | 2008 | C# |
-| [wx-tetra](http://code.google.com/p/wx-tetra/) | 2008 | perl, wx |
-| [bbtesseract](http://code.google.com/p/bbtesseract/) | 2008 | VB.NET 2008 |
+| [Tesseract-OCR boxfile AJAX editor](https://pp19dd.com/tesseract-ocr-chopper/) | 2012 | online tool |
+| [owlboxer](https://code.google.com/p/owlboxer/) | 2010 | C++, Qt4 |
+| [Tessboxer](https://sites.google.com/site/spilkaondrej) | 2009 | .NET |
+| [boxfilereader.php](https://tesseract-ocr.googlecode.com/files/boxfilereader.php) | 2009 | php |
+| [tessboxes](https://www.lbreyer.com/tessboxes.html) | 2008 | C |
+| [JTesseract](https://code.google.com/p/jtesseract/) | 2008 | C# |
+| [wx-tetra](https://code.google.com/p/wx-tetra/) | 2008 | perl, wx |
+| [bbtesseract](https://code.google.com/p/bbtesseract/) | 2008 | VB.NET 2008 |
 ## Other Training Tools
- * [jTessBoxEditor](http://vietocr.sourceforge.net/training.html) - Box Editor and Training Tool
+ * [jTessBoxEditor](https://vietocr.sourceforge.net/training.html) - Box Editor and Training Tool
  * [MzTesseract](https://github.com/mazluta/MzTesseract) - MS Windows program that can train new language from top to bottom
- * [FrankenPlus](https://github.com/this-is-ari/python-tesseract-3.02-training) - tool for creating font training for Tesseract OCR engine from page images. More information about Franken+ is at at [IT'S ALIVE!](http://emop.tamu.edu/node/54Franken+:) and [Franken+ homepage](http://dh-emopweb.tamu.edu/Franken+/).
+ * [FrankenPlus](https://github.com/this-is-ari/python-tesseract-3.02-training) - tool for creating font training for Tesseract OCR engine from page images. More information about Franken+ is at at [IT'S ALIVE!](https://emop.tamu.edu/node/54Franken+:) and [Franken+ homepage](http://dh-emopweb.tamu.edu/Franken+/).
  * [python-tesseract-3.02-training](https://github.com/this-is-ari/python-tesseract-3.02-training) - script to automate the generation of Tesseract 3.02 training files
  * [tesseract-box-file](https://code.google.com/p/tesseract-box-file/) - autoit script to make editing the box file easier
  * [Serak Tesseract Trainer for Tesseract 3.02](https://code.google.com/p/serak-tesseract-trainer/) - a front end GUI for training tesseract 3.02
- * [BoxMaker](http://reza1615.github.com/index.html) is online tool for generating image&box pair. Offline version is available in download section of [PersianOCR project](https://github.com/reza1615/PersianOcr/downloads)
- * [boxFactory](http://www.dinosaursandmoustaches.com/boxFactory.php) is a tool for quickly creating box files to train the Tesseract OCR engine. You can identify characters in the image by simply drawing boxes around them.
+ * [BoxMaker](https://reza1615.github.com/index.html) is online tool for generating image&box pair. Offline version is available in download section of [PersianOCR project](https://github.com/reza1615/PersianOcr/downloads)
+ * [boxFactory](https://www.dinosaursandmoustaches.com/boxFactory.php) is a tool for quickly creating box files to train the Tesseract OCR engine. You can identify characters in the image by simply drawing boxes around them.
  * https://github.com/BaltoRouberol/TesseractTrainer - TesseractTrainer is a simple Python API, taking over the tedious process of manually training Tesseract3
  * [tess\_school](https://github.com/ddohler/tess_school) - a set of handy scripts to make the tesseract training process a bit easier
- * [txt2img](http://code.google.com/p/txt2img/) - Qt GUI application that generates image and box file based on text input
- * [DangAmbigs Generator](http://www.cs.toronto.edu/~mreimer/tesseract.html) - Creates a DangAmbigs file automatically given a set of OCR text output and correct text. _Requirements:_ Python
- * [train.ps1](http://sourceforge.net/p/vietocr/code/HEAD/tree/jTessBoxEditor/trunk/tools/) - Windows powershell script for Automate Tesseract 3.01 language data pack generation process.
- * [Update unicharambigs.exe](http://code.google.com/p/tesseract-ocr/issues/detail?id=544) - A small (windows) C# program for editing "lang.unicharambigs" file
- * [train\_tess.pl](http://code.google.com/p/tesseract-ocr/issues/detail?id=640) - perl script to facilitate training
+ * [txt2img](https://code.google.com/p/txt2img/) - Qt GUI application that generates image and box file based on text input
+ * [DangAmbigs Generator](https://www.cs.toronto.edu/~mreimer/tesseract.html) - Creates a DangAmbigs file automatically given a set of OCR text output and correct text. _Requirements:_ Python
+ * [train.ps1](https://sourceforge.net/p/vietocr/code/HEAD/tree/jTessBoxEditor/trunk/tools/) - Windows powershell script for Automate Tesseract 3.01 language data pack generation process.
+ * [Update unicharambigs.exe](https://code.google.com/p/tesseract-ocr/issues/detail?id=544) - A small (windows) C# program for editing "lang.unicharambigs" file
+ * [train\_tess.pl](https://code.google.com/p/tesseract-ocr/issues/detail?id=640) - perl script to facilitate training
  * [boxedit](https://github.com/danvk/boxedit/) - A web-based editor for Tesseract box files
- * [TrainYourTesseract](http://trainyourtesseract.com) - Free online "no-hassle" TTF file to trainedata converter
+ * [TrainYourTesseract](https://trainyourtesseract.com) - Free online "no-hassle" TTF file to trainedata converter
 ## Community training projects
@@ -72,20 +72,20 @@ Platform support depends on used language and experience of user.
  * **MRZ**: https://groups.google.com/group/tesseract-ocr/attach/10d7c711c9cc80/mrz.traineddata
  * **Latin**: https://github.com/ryanfb/latinocr-lattraining
  * **tesseract-georgian**: https://github.com/ddohler/tesseract-georgian
- * **Polish Fraktur**: training as [result of the IMPACT project](http://dl.psnc.pl/activities/projekty/impact/results/), [trained dataset](http://dl.psnc.pl/download/tesseract_traineddata.zip)
- * **Ancient Greek**: http://ancientgreekocr.org
- * **Indic**: http://code.google.com/p/tesseractindic/, https://github.com/debayan/Tesseract-Indic-OCR/, http://code.google.com/p/parichit/ (All are Obsolete)
- * **Indic-OCR** http://indic-ocr.github.io/tessdata/
+ * **Polish Fraktur**: training as [result of the IMPACT project](https://dl.psnc.pl/activities/projekty/impact/results/), [trained dataset](http://dl.psnc.pl/download/tesseract_traineddata.zip)
+ * **Ancient Greek**: https://ancientgreekocr.org
+ * **Indic**: https://code.google.com/p/tesseractindic/, https://github.com/debayan/Tesseract-Indic-OCR/, http://code.google.com/p/parichit/ (All are Obsolete)
+ * **Indic-OCR** https://indic-ocr.github.io/tessdata/
  * **Irish uncial**: https://github.com/jimregan/tesseract-gle-uncial
- * **Polish**: http://code.google.com/p/tesseract-polish/
+ * **Polish**: https://code.google.com/p/tesseract-polish/
  * **Fraktur** (dan, deu, swe): https://github.com/paalberti/tesseract-dan-fraktur
- * **Myanmar**: http://code.google.com/p/myaocr/
+ * **Myanmar**: https://code.google.com/p/myaocr/
  * **Persian (Farsi)**: https://github.com/reza1615/PersianOcr
  * **7 segments font**: https://github.com/arturaugusto/display_ocr/tree/master/letsgodigital
 ## Ports
- * [Project Naptha](http://projectnaptha.com/)
+ * [Project Naptha](https://projectnaptha.com/)
  * [tesseract.js-core](https://github.com/naptha/tesseract.js-core) - Emscripten port of Tesseract C++ API
  * [tesseract.js](https://github.com/naptha/tesseract.js) - Pure Javascript OCR
@@ -94,7 +94,7 @@ Platform support depends on used language and experience of user.
 ### Tesseract 4.0x
 **Java**
- * [tess4j](https://github.com/nguyenq/tess4j) - JNA wrapper. Docs and discussions - http://tess4j.sourceforge.net/
+ * [tess4j](https://github.com/nguyenq/tess4j) - JNA wrapper. Docs and discussions - https://tess4j.sourceforge.net/
  * [bytedeco](https://github.com/bytedeco/javacpp-presets/tree/master/tesseract) - Java configuration and interface classes for Tesseract based on the [JavaCPP-Presets](https://github.com/bytedeco/javacpp-presets) library from https://bytedeco.org
 **Python**
@@ -143,7 +143,7 @@ Platform support depends on used language and experience of user.
  * [tesseract-sip](https://github.com/virtuald/python-tesseract-sip) - A python SIP wrapper for libtesseract (Apache license)
  * [pytesseract](https://github.com/madmaze/pytesseract) - a wrapper class for Tesseract OCR (requires tesseract executable)
  * [python-tesseract](https://github.com/cookbrite/python-tesseract/commits/master) - A wrapper class for Tesseract OCR that allows any conventional image files (SWIG based)
- * http://code.google.com/p/pytess/ - A simple SWIG-based interface to Tesseract
+ * https://code.google.com/p/pytess/ - A simple SWIG-based interface to Tesseract
  * [aiopytesseract](https://github.com/amenezes/aiopytesseract) - asyncio tesseract wrapper for Tesseract-OCR.
 **R**
@@ -155,7 +155,7 @@ Platform support depends on used language and experience of user.
 **Java**
  * [bytedeco](https://github.com/bytedeco/javacpp-presets/tree/master/tesseract) - Java configuration and interface classes for Tesseract based on 'JavaCPP-Presets' library from https://bytedeco.org - https://github.com/bytedeco/javacpp-presets
- * [tess4j](https://github.com/nguyenq/tess4j) - JNA wrapper. Docs and discussions - http://tess4j.sourceforge.net/
+ * [tess4j](https://github.com/nguyenq/tess4j) - JNA wrapper. Docs and discussions - https://tess4j.sourceforge.net/
 **Node.js**
  * [penteract](https://github.com/kaelzhang/node-penteract) - The native node.js bindings to the Tesseract OCR project.
@@ -176,11 +176,11 @@ Platform support depends on used language and experience of user.
 ### Tesseract 2.0x
 **Python**
- * http://code.google.com/p/pytesser/
- * http://code.google.com/p/tesseract-python (pytesser clone)
+ * https://code.google.com/p/pytesser/
+ * https://code.google.com/p/tesseract-python (pytesser clone)
 **.NET**
- * http://www.pixel-technology.com/freeware/tessnet2/
+ * https://www.pixel-technology.com/freeware/tessnet2/
 **Java**
- * [tess4j (0.4)](https://github.com/nguyenq/tess4j) - JNA wrapper. Docs and discussions - http://tess4j.sourceforge.net/
+ * [tess4j (0.4)](https://github.com/nguyenq/tess4j) - JNA wrapper. Docs and discussions - https://tess4j.sourceforge.net/
diff --git a/Command-Line-Usage.md b/Command-Line-Usage.md
index 3f35f8d..3695c84 100644
--- a/Command-Line-Usage.md
+++ b/Command-Line-Usage.md
@@ -177,8 +177,8 @@ Partial Output
 ```
- + "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> +
diff --git a/Compiling.md b/Compiling.md
index a11b586..e6ab0c0 100644
--- a/Compiling.md
+++ b/Compiling.md
@@ -36,7 +36,7 @@ The following instructions are for building on Linux, which also can be applied
 * A compiler for C and C++: GCC or Clang
 * GNU Autotools: autoconf, automake, libtool
 * pkg-config
-* [Leptonica](http://www.leptonica.org/)
+* [Leptonica](https://www.leptonica.org/)
 * (optional) zlib, libpng, libjpeg, libtiff, giflib, openjpeg, webp, archive, curl
@@ -66,7 +66,7 @@ sudo apt-get install libcairo2-dev
 ### Leptonica
-You also need to install [Leptonica](http://www.leptonica.org/). Ensure that the development headers for Leptonica are installed before compiling Tesseract.
+You also need to install [Leptonica](https://www.leptonica.org/). Ensure that the development headers for Leptonica are installed before compiling Tesseract.
 Tesseract versions and the minimum version of Leptonica required:
@@ -74,8 +74,8 @@ Tesseract versions and the minimum version of Leptonica required:
 :-------------------: | :---------------------------------------: | :---------
 4.00 | 1.74.2 | [Ubuntu 18.04](https://packages.ubuntu.com/bionic/tesseract-ocr)
 3.05 | 1.74.0 | Must build from source
-3.04 | 1.71 | [Ubuntu 16.04](http://packages.ubuntu.com/xenial/tesseract-ocr)
-3.03 | 1.70 | [Ubuntu 14.04](http://packages.ubuntu.com/trusty/tesseract-ocr)
+3.04 | 1.71 | [Ubuntu 16.04](https://packages.ubuntu.com/xenial/tesseract-ocr)
+3.03 | 1.70 | [Ubuntu 14.04](https://packages.ubuntu.com/trusty/tesseract-ocr)
 3.02 | 1.69 | Ubuntu 12.04
 3.01 | 1.67 |
@@ -87,9 +87,9 @@ sudo apt-get install libleptonica-dev
 **but if you are using an oldish version of Linux, the Leptonica version may be too old, so you will need to build from source.**
-The sources are at https://github.com/DanBloomberg/leptonica . The instructions for building are given in [Leptonica README](http://www.leptonica.org/source/README.html).
+The sources are at https://github.com/DanBloomberg/leptonica . The instructions for building are given in [Leptonica README](https://www.leptonica.org/source/README.html).
-Note that if building Leptonica from source, you may need to ensure that /usr/local/lib is in your library path. This is a standard Linux bug, and the information at [Stackoverflow](http://stackoverflow.com/questions/4743233/is-usr-local-lib-searched-for-shared-libraries) is very helpful.
+Note that if building Leptonica from source, you may need to ensure that /usr/local/lib is in your library path. This is a standard Linux bug, and the information at [Stackoverflow](https://stackoverflow.com/questions/4743233/is-usr-local-lib-searched-for-shared-libraries) is very helpful.
 ## Installing Tesseract from Git
@@ -266,12 +266,12 @@ If you have Visual Studio 2015, checkout the https://github.com/peirick/VS2015_T
 ## 3.03rc-1
-Have a look at blog [How to build Tesseract 3.03 with Visual Studio 2013](http://vorba.ch/2014/tesseract-3.03-vs2013.html).
+Have a look at blog [How to build Tesseract 3.03 with Visual Studio 2013](https://vorba.ch/2014/tesseract-3.03-vs2013.html).
 ## 3.02
-For tesseract-ocr 3.02 please follow instruction in [Visual Studio 2008 Developer Notes for Tesseract-OCR](http://tesseract-ocr.googlecode.com/svn/trunk/vs2008/doc/setup.html#using-the-latest-tesseractocr-sources).
+For tesseract-ocr 3.02 please follow instruction in [Visual Studio 2008 Developer Notes for Tesseract-OCR](https://tesseract-ocr.googlecode.com/svn/trunk/vs2008/doc/setup.html#using-the-latest-tesseractocr-sources).
 ## 3.01
@@ -289,7 +289,7 @@ Windows relevant files are located in vs2008 directory (e.g. `tesseract-3.01\vs2
 ## Mingw+Msys
-For Mingw+Msys have a look at blog [Compiling Leptonica and Tesseract-ocr with Mingw+Msys](http://www.sk-spell.sk.cx/compiling-leptonica-and-tesseract-ocr-with-mingwmsys).
+For Mingw+Msys have a look at blog [Compiling Leptonica and Tesseract-ocr with Mingw+Msys](https://www.sk-spell.sk.cx/compiling-leptonica-and-tesseract-ocr-with-mingwmsys).
 ## Msys2
@@ -307,7 +307,7 @@ To build the tesseract-ocr release package, use PKGBUILD from https://github.com
 ## Cygwin
-To build on Cygwin have a look at blog [How to build Tesseract on Cygwin](http://vorba.ch/2014/tesseract-cygwin.html).
+To build on Cygwin have a look at blog [How to build Tesseract on Cygwin](https://vorba.ch/2014/tesseract-cygwin.html).
 Tesseract as well as the training utilities for 3.04.00 onwards are available as Cygwin packages.
@@ -324,7 +324,7 @@ tesseract-training-util 3.04.01-1
 ## Mingw-w64
-[Mingw-w64](http://mingw-w64.org/) allows building 32- or 64-bit executables for Windows.
+[Mingw-w64](https://mingw-w64.org/) allows building 32- or 64-bit executables for Windows.
 It can be used for native compilations on Windows, but also for cross compilations on Linux (which are easier and faster than native compilations).
 Most large Linux distributions already contain packages with the tools need for a cross build.
@@ -631,4 +631,4 @@ In this case you must create m4 directory (`mkdir m4`), and then rerun the above
 # Miscellaneous
-* [Standalone Tesseract build bash script](http://pastebin.com/VnGLHfbr)
+* [Standalone Tesseract build bash script](https://pastebin.com/VnGLHfbr)
diff --git a/Downloads.md b/Downloads.md
index 0dac013..a6d2a1d 100644
--- a/Downloads.md
+++ b/Downloads.md
@@ -12,7 +12,7 @@ Tesseract is included in most Linux distributions.
 ### Old Downloads
-[Downloads Archive on SourceForge](http://sourceforge.net/projects/tesseract-ocr-alt/files/).
+[Downloads Archive on SourceForge](https://sourceforge.net/projects/tesseract-ocr-alt/files/).
 There you can find, among other files, Windows installer for the **old** version 3.02.
 Currently, there is no **official** Windows installer for newer versions.
diff --git a/FAQ.md b/FAQ.md
index c2f0807..f7dce12 100644
--- a/FAQ.md
+++ b/FAQ.md
@@ -62,11 +62,11 @@ If you get an error message saying eng.traineddata not found, try setting `TESSD
 - tsv
 - pdf with text layer only
-Tesseract’s standard output is a plain txt file (UTF-8 encoded, with *' as [end-of-line marker](http://en.wikipedia.org/wiki/Newline)) and 'FF* as a form feed character after each page.
+Tesseract’s standard output is a plain txt file (UTF-8 encoded, with *' as [end-of-line marker](https://en.wikipedia.org/wiki/Newline)) and 'FF* as a form feed character after each page.
 With the configfile option set to `pdf`, tesseract will produce searchable PDF pages containing images with a hidden, searchable text layer.
-With the configfile option set to `hocr`, tesseract will produce XHTML output compliant with the [hOCR specification](https://docs.google.com/document/preview?id=1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0&pli=1) (the input image name must be ASCII if the operating system use something other than UTF-8 encoding for filenames - see [issue 809](https://web.archive.org/web/*/http://code.google.com/p/tesseract-ocr/issues/detail?id=809) for some details).
+With the configfile option set to `hocr`, tesseract will produce XHTML output compliant with the [hOCR specification](https://docs.google.com/document/preview?id=1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0&pli=1) (the input image name must be ASCII if the operating system use something other than UTF-8 encoding for filenames - see [issue 809](https://web.archive.org/web/*/https://code.google.com/p/tesseract-ocr/issues/detail?id=809) for some details).
 With the configfile option set to `tsv`, tesseract will produce [tab-separated values](https://en.wikipedia.org/wiki/Tab-separated_values) file.
@@ -124,7 +124,7 @@ Call the file `logfile` and put it in `tessdata/configs/`, then add `logfile` to
 ### How can I suppress tesseract info line?
-See [issue 579](https://web.archive.org/web/*/http://code.google.com/p/tesseract-ocr/issues/detail?id=579). On Linux you can redirect stderr and stdout output to `/dev/null`. E.g.:
+See [issue 579](https://web.archive.org/web/*/https://code.google.com/p/tesseract-ocr/issues/detail?id=579). On Linux you can redirect stderr and stdout output to `/dev/null`. E.g.:
     tesseract phototest.tif phototest 1>/dev/null 2>&1
@@ -214,11 +214,11 @@ No. Tesseract is for text recognition.
 ### Where is the documentation?
-You’re looking at it. If things aren’t clear, search on the [Tesseract Google Group](http://groups.google.com/group/tesseract-ocr/) or ask us there. If you want to help us write more, please do, and post it to the group!
+You’re looking at it.
If things aren’t clear, search on the [Tesseract Google Group](https://groups.google.com/group/tesseract-ocr/) or ask us there. If you want to help us write more, please do, and post it to the group! ### My question isn’t in here! -Try searching the forum: as well as open and closed issues on GitHub: , as your question may have come up before even if it is not listed here. +Try searching the forum: as well as open and closed issues on GitHub: , as your question may have come up before even if it is not listed here. *** If you have a question which is not answered by the FAQ, Wiki pages and Issues, please search in the [users mailing-list/forum](https://groups.google.com/d/forum/tesseract-ocr) before posting it there. diff --git a/Fonts.md b/Fonts.md index 0f57db7..82aef99 100644 --- a/Fonts.md +++ b/Fonts.md @@ -101,11 +101,11 @@ The installed fonts are shown by the command `fc-list`. See also the [Debian wik * https://fontlibrary.org/en (GFS Bodoni) * https://fonts.google.com/ -* http://iginomarini.com/fell/the-revival-fonts/ -* http://scholarsfonts.net/ (Cardo) -* http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=FontDownloads (SIL Fonts) -* http://www.ctan.org/tex-archive/fonts (GFS Bodoni) -* http://www.steffmann.de/wordpress/test-2/ +* https://iginomarini.com/fell/the-revival-fonts/ +* https://scholarsfonts.net/ (Cardo) +* https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=FontDownloads (SIL Fonts) +* https://www.ctan.org/tex-archive/fonts (GFS Bodoni) +* https://www.steffmann.de/wordpress/test-2/ #### Arabic Fonts @@ -113,64 +113,64 @@ The installed fonts are shown by the command `fc-list`. 
See also the [Debian wik
 #### Devanagari Fonts
-* [Aksharayogini2](http://aksharyogini.sudhanwa.com/download/Aksharyogini2Normal.ttf)
-* [AksharayoginiBoldItalic](http://aksharyogini.sudhanwa.com/download/AksharyoginiBoldItalic.ttf)
-* [AksharayoginiBold](http://aksharyogini.sudhanwa.com/download/AksharyoginiBold.ttf)
-* [AksharayoginiItalic](http://aksharyogini.sudhanwa.com/download/AksharyoginiItalic.ttf)
-* [Aksharayogini](http://aksharyogini.sudhanwa.com/download/AksharyoginiNormal.ttf)
-* [Ananda Akchyar Devanagari](http://www.deviantart.com/download/528435924/ananda_akchyar_devanagari_unicode_by_lalitkala-d8qm7ro.zip?token=93007db762db7368ba4846c0de5b4e5f3dfdadd8&ts=1501873924)
-* [AnnapurnaSIL](http://software.sil.org/downloads/d/annapurna/AnnapurnaSIL-1.201.zip)
-* [CDAC-Surekh Bold](http://biharvidhanparishad.gov.in/Fonts/CDACSRBT.TTF)
-* [CDAC-Surekh Normal](http://biharvidhanparishad.gov.in/Fonts/CDACSRNT.TTF)
-* [CDAC-Yogesh Bold](http://biharvidhanparishad.gov.in/Fonts/CDACOTYGB.TTF)
-* [CDAC-Yogesh Italic](http://biharvidhanparishad.gov.in/Fonts/CDACYGIT.TTF)
-* [CDAC-Yogesh Normal](http://biharvidhanparishad.gov.in/Fonts/CDACOTYGN.TTF)
-* [Chandas](http://www.sanskritweb.net/cakram/chandas.ttf)
+* [Aksharayogini2](https://aksharyogini.sudhanwa.com/download/Aksharyogini2Normal.ttf)
+* [AksharayoginiBoldItalic](https://aksharyogini.sudhanwa.com/download/AksharyoginiBoldItalic.ttf)
+* [AksharayoginiBold](https://aksharyogini.sudhanwa.com/download/AksharyoginiBold.ttf)
+* [AksharayoginiItalic](https://aksharyogini.sudhanwa.com/download/AksharyoginiItalic.ttf)
+* [Aksharayogini](https://aksharyogini.sudhanwa.com/download/AksharyoginiNormal.ttf)
+* [Ananda Akchyar Devanagari](https://www.deviantart.com/download/528435924/ananda_akchyar_devanagari_unicode_by_lalitkala-d8qm7ro.zip?token=93007db762db7368ba4846c0de5b4e5f3dfdadd8&ts=1501873924)
+* [AnnapurnaSIL](https://software.sil.org/downloads/d/annapurna/AnnapurnaSIL-1.201.zip)
+* [CDAC-Surekh Bold](https://biharvidhanparishad.gov.in/Fonts/CDACSRBT.TTF)
+* [CDAC-Surekh Normal](https://biharvidhanparishad.gov.in/Fonts/CDACSRNT.TTF)
+* [CDAC-Yogesh Bold](https://biharvidhanparishad.gov.in/Fonts/CDACOTYGB.TTF)
+* [CDAC-Yogesh Italic](https://biharvidhanparishad.gov.in/Fonts/CDACYGIT.TTF)
+* [CDAC-Yogesh Normal](https://biharvidhanparishad.gov.in/Fonts/CDACOTYGN.TTF)
+* [Chandas](https://www.sanskritweb.net/cakram/chandas.ttf)
 * [Gotu](https://ektype.in/gotu.html)
 * [Jaini](https://ektype.in/jaini-1096.html)
 * [Jaini Purva](https://ektype.in/jaini-1096.html)
 * [Lohit Devanagari](https://releases.pagure.org/lohit/Lohit-Devanagari.ttf)
-* [Nakula](http://bombay.indology.info/software/fonts/devanagari/nakula.ttf)
+* [Nakula](https://bombay.indology.info/software/fonts/devanagari/nakula.ttf)
 * [Mukta](https://ektype.in/mukta.html)
-* [Murty Hindi](http://www.murtylibrary.com/mcli-fonts.php)
-* [Murty Sanskrit](http://www.murtylibrary.com/mcli-fonts.php)
-* [Sahadeva](http://bombay.indology.info/software/fonts/devanagari/sahadeva.ttf)
-* [Sanskrit2003](http://www.sanskritweb.net/itrans/sanskrit2003.zip)
-* [Santipur OT](http://www.sanskritweb.net/itrans/santipurot.zip)
-* [Sharad76](http://www.setuadvertising.com/sharad76/)
+* [Murty Hindi](https://www.murtylibrary.com/mcli-fonts.php)
+* [Murty Sanskrit](https://www.murtylibrary.com/mcli-fonts.php)
+* [Sahadeva](https://bombay.indology.info/software/fonts/devanagari/sahadeva.ttf)
+* [Sanskrit2003](https://www.sanskritweb.net/itrans/sanskrit2003.zip)
+* [Santipur OT](https://www.sanskritweb.net/itrans/santipurot.zip)
+* [Sharad76](https://www.setuadvertising.com/sharad76/)
 * [Shobhika](https://github.com/Sandhi-IITBombay/Shobhika/releases/)
-* [Shree-DV0726-OT](http://biharvidhanparishad.gov.in/Fonts/SHREE-DV0726-OT.TTF)
+* [Shree-DV0726-OT](https://biharvidhanparishad.gov.in/Fonts/SHREE-DV0726-OT.TTF)
 * [Siddhanta](https://sites.google.com/site/bayaryn/siddhanta-variations.zip?attredirects=0)
-* [Uttara](http://www.sanskritweb.net/cakram/uttara.ttf)
+* [Uttara](https://www.sanskritweb.net/cakram/uttara.ttf)
 * [Yashomudra Fonts](https://github.com/RajyaMarathiVikasSanstha/Yashomudra/tree/master/TTF%20Files)
 * [Google Devanagari Fonts](https://fonts.google.com/?subset=devanagari)
-* [fonts from TDIL Hindi CD](http://ildc.in/Hindi/GIST/hindi_cd_2/windows/index.htm)
-* [Linked from Bihar Vidhan Parishad](http://biharvidhanparishad.gov.in/HindiFonts.htm)
-* [Linked from bih.nic.in](http://industries.bih.nic.in/HindiFonts.htm)
+* [fonts from TDIL Hindi CD](https://ildc.in/Hindi/GIST/hindi_cd_2/windows/index.htm)
+* [Linked from Bihar Vidhan Parishad](https://biharvidhanparishad.gov.in/HindiFonts.htm)
+* [Linked from bih.nic.in](https://industries.bih.nic.in/HindiFonts.htm)
 #### Fraktur Fonts
-* http://unifraktur.sourceforge.net/maguntia.html (UnifrakturMaguntia)
-* http://www.orbitals.com/self/ligature/ligature.htm (Wyld)
+* https://unifraktur.sourceforge.net/maguntia.html (UnifrakturMaguntia)
+* https://www.orbitals.com/self/ligature/ligature.htm (Wyld)
 * https://www.fontyukle.net/de/1,Walbaum
-* http://de.ffonts.net/Walbaum-Fraktur.font.download
-* http://www.1001fonts.com/fraktur-fonts.html
-* http://www.dafont.com/fette-unz-fraktur.font
-* http://www.1001freefonts.com/fette_fraktur.font
-* http://www.ligafaktur.de/Schriften.html
-* http://www.morscher.com/3r/fonts/fraktur.htm
+* https://de.ffonts.net/Walbaum-Fraktur.font.download
+* https://www.1001fonts.com/fraktur-fonts.html
+* https://www.dafont.com/fette-unz-fraktur.font
+* https://www.1001freefonts.com/fette_fraktur.font
+* https://www.ligafaktur.de/Schriften.html
+* https://www.morscher.com/3r/fonts/fraktur.htm
 #### Hebrew Fonts
-* [A list of Hebrew fonts from the Open Siddur Project](http://opensiddur.org/tools/fonts/)
+* [A list of Hebrew fonts from the Open Siddur Project](https://opensiddur.org/tools/fonts/)
 #### Collections of fonts
-* http://www.abstractfonts.com/
-* http://www.schriftarten-fonts.de/ (German)
+* https://www.abstractfonts.com/
+* https://www.schriftarten-fonts.de/ (German)
 ### More information on fonts
 * https://en.wikipedia.org/wiki/Fraktur
-* http://www.orbitals.com/self/ligature/ligature.htm 18th Century Ligatures and Fonts
-* http://www.steffmann.de/wordpress/ (German)
+* https://www.orbitals.com/self/ligature/ligature.htm 18th Century Ligatures and Fonts
+* https://www.steffmann.de/wordpress/ (German)
diff --git a/ImproveQuality.md b/ImproveQuality.md
index 15e53a2..d4b5f27 100644
--- a/ImproveQuality.md
+++ b/ImproveQuality.md
@@ -54,7 +54,7 @@ Noise is random variation of brightness or colour in an image, that can make the
 ### Dilation and Erosion
-Bold characters or Thin characters (especially those with [Serifs](https://en.wikipedia.org/wiki/Serif)) may impact the recognition of details and reduce recognition accuracy. Many image processing programs allow [Dilation and Erosion](http://www.mif.vu.lt/atpazinimas/dip/FIP/fip-Morpholo.html#Heading96) of edges of characters against a common background to dilate or grow in size (Dilation) or shrink (Erosion).
+Bold characters or Thin characters (especially those with [Serifs](https://en.wikipedia.org/wiki/Serif)) may impact the recognition of details and reduce recognition accuracy. Many image processing programs allow [Dilation and Erosion](https://www.mif.vu.lt/atpazinimas/dip/FIP/fip-Morpholo.html#Heading96) of edges of characters against a common background to dilate or grow in size (Dilation) or shrink (Erosion).
 Heavy ink bleeding from historical documents can be compensated for by using an Erosion technique. Erosion can be used to shrink characters back to their normal glyph structure.
@@ -80,7 +80,7 @@ A skewed image is when a page has been scanned when not straight. The quality of
 #### Missing borders
-If you OCR just text area without any border, tesseract could have problems with it. See for some details in [tesseract user forum](https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/v26a-RYPSOE/2Sppq61GBwAJ)[#427](https://github.com/tesseract-ocr/tesseract/issues/427) . You can easy add small border (e.g. 10 px) with [ImageMagick®](http://imagemagick.org/script/index.php):
+If you OCR just text area without any border, tesseract could have problems with it. See for some details in [tesseract user forum](https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/v26a-RYPSOE/2Sppq61GBwAJ)[#427](https://github.com/tesseract-ocr/tesseract/issues/427) . You can easy add small border (e.g. 10 px) with [ImageMagick®](https://imagemagick.org/script/index.php):
 ```
 convert 427-1.jpg -bordercolor White -border 10x10 427-1b.jpg
 ```
@@ -112,23 +112,23 @@ Tesseract 4.00 removes the alpha channel with leptonica function [pixRemoveAlpha
 ### Tools / Libraries
-* [Leptonica](http://leptonica.com)
-* [OpenCV](http://opencv.org/)
+* [Leptonica](https://leptonica.com)
+* [OpenCV](https://opencv.org/)
 * [ScanTailor Advanced](https://github.com/4lex4/scantailor-advanced#-scantailor-advanced)
-* [ImageMagick](http://www.imagemagick.org)
+* [ImageMagick](https://www.imagemagick.org)
 * [unpaper](https://www.flameeyes.eu/projects/unpaper)
-* [ImageJ](http://rsb.info.nih.gov/ij/)
-* [Gimp](http://www.gimp.org)
+* [ImageJ](https://rsb.info.nih.gov/ij/)
+* [Gimp](https://www.gimp.org)
 * [PRLib](https://github.com/leha-bot/PRLib) - Pre-Recognize Library with algorithms for improving OCR quality
 ### Examples
 If you need an example how to improve image quality programmatically, have a look at this examples:
-* [OpenCV - Rotation (Deskewing)](http://felix.abecassis.me/2011/10/opencv-rotation-deskewing/) - c++ example
-* [Fred's ImageMagick TEXTCLEANER](http://www.fmwconcepts.com/imagemagick/textcleaner/index.php) - bash script for processing a scanned document of text to clean the text background.
+* [OpenCV - Rotation (Deskewing)](https://felix.abecassis.me/2011/10/opencv-rotation-deskewing/) - c++ example
+* [Fred's ImageMagick TEXTCLEANER](https://www.fmwconcepts.com/imagemagick/textcleaner/index.php) - bash script for processing a scanned document of text to clean the text background.
 * [rotation\_spacing.py](https://gist.github.com/endolith/334196bac1cac45a4893#) - python script for automatic detection of rotation and line spacing of an image of text
-* [crop\_morphology.py](https://github.com/danvk/oldnyc/blob/master/ocr/tess/crop_morphology.py) - [Finding blocks of text in an image using Python, OpenCV and numpy](http://www.danvk.org/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html)
+* [crop\_morphology.py](https://github.com/danvk/oldnyc/blob/master/ocr/tess/crop_morphology.py) - [Finding blocks of text in an image using Python, OpenCV and numpy](https://www.danvk.org/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html)
 * [Credit card OCR with OpenCV and Python](https://www.pyimagesearch.com/2017/07/17/credit-card-ocr-with-opencv-and-python)
 * [noteshrink](https://github.com/mzucker/noteshrink) - python example how to clean up scans. Details in blog [Compressing and enhancing hand-written notes](https://mzucker.github.io/2016/09/20/noteshrink.html).
 * [uproject text](https://github.com/mzucker/unproject_text) - python example how to recover perspective of image. Details in blog [Unprojecting text with ellipses](https://mzucker.github.io/2016/10/11/unprojecting-text-with-ellipses.html).
diff --git a/Installation.md b/Installation.md
index 757bdb4..64c2ae0 100644
--- a/Installation.md
+++ b/Installation.md
@@ -1,6 +1,6 @@
 # Introduction
-Tesseract is an open source [text recognition (OCR)](https://en.wikipedia.org/wiki/Optical_character_recognition) Engine, available under the [Apache 2.0 license.](http://www.apache.org/licenses/LICENSE-2.0) It can be used directly, or (for programmers) using an [API](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/baseapi.h) to extract printed text from images. It supports a wide variety of languages.
+Tesseract is an open source [text recognition (OCR)](https://en.wikipedia.org/wiki/Optical_character_recognition) Engine, available under the [Apache 2.0 license.](https://www.apache.org/licenses/LICENSE-2.0) It can be used directly, or (for programmers) using an [API](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/baseapi.h) to extract printed text from images. It supports a wide variety of languages.
 Tesseract doesn't have a built-in GUI, but there are several available from the [3rdParty](User-Projects-%E2%80%93-3rdParty.md) page.
@@ -39,10 +39,10 @@ sudo apt install libtesseract-dev
 ```
 sudo vi /etc/apt/sources.list
-Copy the first line "deb http://archive.ubuntu.com/ubuntu bionic main" and paste it as shown below on the next line.
+Copy the first line "deb https://archive.ubuntu.com/ubuntu bionic main" and paste it as shown below on the next line.
 If you are using a different release of ubuntu, then replace bionic with the respective release name.
-deb http://archive.ubuntu.com/ubuntu bionic universe
+deb https://archive.ubuntu.com/ubuntu bionic universe
 ```
 ### Debian packages
@@ -114,7 +114,7 @@ The traineddata is currently not shipped with the snap package and must be place
 ### macOS
-You can install Tesseract using either [MacPorts](https://www.macports.org/) or [Homebrew](http://brew.sh).
+You can install Tesseract using either [MacPorts](https://www.macports.org/) or [Homebrew](https://brew.sh).
 A macOS wrapper for the Tesseract API is also available at [Tesseract macOS](https://github.com/scott0123/Tesseract-macOS).
@@ -147,7 +147,7 @@ Installer for Windows for Tesseract 3.05, Tesseract 4 and Tesseract 5 are availa
 An installer for the **OLD version 3.02** is available for Windows from our [download](Downloads.md) page. This includes the English training data.
 If you want to use another language, [download the appropriate training data](Data-Files.md),
-unpack it using [7-zip](http://www.7-zip.org), and copy the .traineddata file into the 'tessdata' directory, probably `C:\Program Files\Tesseract-OCR\tessdata`.
+unpack it using [7-zip](https://www.7-zip.org), and copy the .traineddata file into the 'tessdata' directory, probably `C:\Program Files\Tesseract-OCR\tessdata`.
 To access tesseract-OCR from any location you may have to add the directory where the tesseract-OCR binaries are located to the Path variables, probably `C:\Program Files\Tesseract-OCR`.
@@ -173,7 +173,7 @@ and the data files:
 pacman -S mingw-w64-{i686,x86_64}-tesseract-data-eng
 ```
-In the above command, "eng" may be replaced with the [ISO 639 3-letter language code](http://www.loc.gov/standards/iso639-2/php/code_list.php) for supported languages. For a list of available language packages use:
+In the above command, "eng" may be replaced with the [ISO 639 3-letter language code](https://www.loc.gov/standards/iso639-2/php/code_list.php) for supported languages.
For a list of available language packages use:
 ```
 pacman -Ss tesseract-data
@@ -232,7 +232,7 @@ It can also be trained to support other languages and scripts; for more details
 # Development
-Tesseract can also be used in your own project, under the terms of the [Apache License 2.0.](http://www.apache.org/licenses/LICENSE-2.0) It has a fully featured API, and can be compiled for a variety of targets including Android and the iPhone. See the [3rdParty](User-Projects-%E2%80%93-3rdParty) page for a sample of what has been done with it. Note that as yet there are very few 3rdParty Tesseract OCR projects [being developed for Mac](https://machow2.com/ocr-for-mac-best-software/#Tesseract_Freesoftware/) (with the only one being [Tesseract macOS](https://github.com/scott0123/Tesseract-macOS).md), although there are several online OCR services that can be used on Mac that may use Tesseract as their OCR engine.
+Tesseract can also be used in your own project, under the terms of the [Apache License 2.0.](https://www.apache.org/licenses/LICENSE-2.0) It has a fully featured API, and can be compiled for a variety of targets including Android and the iPhone. See the [3rdParty](User-Projects-%E2%80%93-3rdParty) page for a sample of what has been done with it. Note that as yet there are very few 3rdParty Tesseract OCR projects [being developed for Mac](https://machow2.com/ocr-for-mac-best-software/#Tesseract_Freesoftware/) (with the only one being [Tesseract macOS](https://github.com/scott0123/Tesseract-macOS).md), although there are several online OCR services that can be used on Mac that may use Tesseract as their OCR engine.
 Also, it is free software, so if you want to pitch in and help, please do!
If you find a bug and fix it yourself, the best thing to do is to attach the patch to your bug report in the [Issues List](https://github.com/tesseract-ocr/tesseract/issues)
@@ -240,5 +240,5 @@ If you find a bug and fix it yourself, the best thing to do is to attach the pat
 # Support
 First read the [documentation](https://tesseract-ocr.github.io/), particularly the [FAQ](FAQ.md) to see if your problem is addressed there.
-If not, search the [Tesseract user forum](http://groups.google.com/group/tesseract-ocr) or the
-[Tesseract developer forum](http://groups.google.com/group/tesseract-dev), and if you still can't find what you need, please ask us there.
+If not, search the [Tesseract user forum](https://groups.google.com/group/tesseract-ocr) or the
+[Tesseract developer forum](https://groups.google.com/group/tesseract-dev), and if you still can't find what you need, please ask us there.
diff --git a/README.md b/README.md
index a463adb..863b88b 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,7 @@ For versions `4.x.x`, `3.05.02` and older, see the [documentation for old versio
 ## Introduction
-Tesseract is an open source [text recognition (OCR)](https://en.wikipedia.org/wiki/Optical_character_recognition) Engine, available under the [Apache 2.0 license](http://www.apache.org/licenses/LICENSE-2.0).
+Tesseract is an open source [text recognition (OCR)](https://en.wikipedia.org/wiki/Optical_character_recognition) Engine, available under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).
 * Major version 5 is the current stable version and started with release [5.0.0](https://github.com/tesseract-ocr/tesseract/releases/tag/5.0.0) on November 30, 2021.
 * Newer minor versions and bugfix versions are available from [GitHub](https://github.com/tesseract-ocr/tesseract/releases/).
@@ -32,14 +32,14 @@ and [planning documentation](https://tesseract-ocr.github.io/tessdoc/Planning.ht
 Tesseract can be used directly via [command line](Command-Line-Usage.md), or (for programmers) by using an [API](https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/baseapi.h) to extract printed text from images. It supports a [wide variety of languages](Data-Files-in-different-versions.md). Tesseract doesn't have a built-in GUI, but there are several available from the [3rdParty](User-Projects-–-3rdParty.md) page. External tools, wrappers and training projects for Tesseract are listed under [AddOns](AddOns.md).
-Tesseract can be used in your own project, under the terms of the [Apache License 2.0.](http://www.apache.org/licenses/LICENSE-2.0) It has a fully featured API, and can be compiled for a variety of targets including Android and the iPhone. See the [3rdParty](User-Projects-–-3rdParty.md) and [AddOns](AddOns.md) pages for samples of what has been done with it.
+Tesseract can be used in your own project, under the terms of the [Apache License 2.0.](https://www.apache.org/licenses/LICENSE-2.0) It has a fully featured API, and can be compiled for a variety of targets including Android and the iPhone. See the [3rdParty](User-Projects-–-3rdParty.md) and [AddOns](AddOns.md) pages for samples of what has been done with it.
 If you have a question, first read the [documentation](https://tesseract-ocr.github.io/), particularly the **[FAQ](FAQ.md)** to see if your problem is addressed there.
 If not, search the [Issues List](https://github.com/tesseract-ocr/tesseract/issues),
-[Tesseract user forum](http://groups.google.com/group/tesseract-ocr),
+[Tesseract user forum](https://groups.google.com/group/tesseract-ocr),
 and if you still can't find what you need, please ask your question in
-[Tesseract user forum Google group](http://groups.google.com/group/tesseract-ocr).
+[Tesseract user forum Google group](https://groups.google.com/group/tesseract-ocr).
Tesseract is free software, so if you want to pitch in and help, please do!
 If you find a bug and fix it yourself, the best thing to do is to attach the patch to your bug report in the [Issues List](https://github.com/tesseract-ocr/tesseract/issues).
diff --git a/TesseractOpenCL.md b/TesseractOpenCL.md
index 2251638..b3cbb61 100644
--- a/TesseractOpenCL.md
+++ b/TesseractOpenCL.md
@@ -20,8 +20,8 @@ By using that compute power, Tesseract ideally can be made faster.
 3. Set up the OpenCL paths in “tesseract” project:
 * Right click on “tesseract” project and select Properties
- * Header file paths : Go to Configuration Properties -> C/C++ -> General -> Additional Include Directories. Add the directory path where OpenCL header files are located on the given machine. E.g: On a machine with [AMD APP SDK](http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/) installed, this path will be `$(AMDAPPSDKROOT)include`.
- * Library file path : Go to Configuration Properties -> Linker -> General -> Additional Library Directories. Add the directory path where OpenCL library file, `OpenCL.lib` is located on the given machine. E.g: On a machine with [AMD APP SDK](http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/) installed, this path will be `$(AMDAPPSDKROOT)lib\x86`.
+ * Header file paths : Go to Configuration Properties -> C/C++ -> General -> Additional Include Directories. Add the directory path where OpenCL header files are located on the given machine. E.g: On a machine with [AMD APP SDK](https://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/) installed, this path will be `$(AMDAPPSDKROOT)include`.
+ * Library file path : Go to Configuration Properties -> Linker -> General -> Additional Library Directories.
Add the directory path where OpenCL library file, `OpenCL.lib` is located on the given machine. E.g: On a machine with [AMD APP SDK](https://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/) installed, this path will be `$(AMDAPPSDKROOT)lib\x86`.
 * Library reference : Go to Configuration Properties -> Linker -> Input -> Additional Dependencies. Add OpenCL.lib to the list of dependent libraries.
 * Preprocessor definition : Go to Configuration Properties -> C/C++ -> Preprocessor -> Preprocessor Definitions. Add USE\_OPENCL to the list of preprocessor definitions list.
@@ -65,9 +65,9 @@ These Debian packages provide such drivers:
 * nvidia-legacy-304xx-opencl-icd – NVIDIA GPU
 * nvidia-legacy-340xx-opencl-icd – NVIDIA GPU
 * nvidia-opencl-icd – NVIDIA GPU
-* [pocl-opencl-icd](http://portablecl.org/) – native CPU
+* [pocl-opencl-icd](https://portablecl.org/) – native CPU
-It is possible to enable debug messages for some drivers by setting environment variables ([example](http://portablecl.org/docs/html/)).
+It is possible to enable debug messages for some drivers by setting environment variables ([example](https://portablecl.org/docs/html/)).
 ## OpenCL devices (examples)
diff --git a/UNLV-Testing-of-Tesseract.md b/UNLV-Testing-of-Tesseract.md
index 2c45440..062aabc 100644
--- a/UNLV-Testing-of-Tesseract.md
+++ b/UNLV-Testing-of-Tesseract.md
@@ -3,7 +3,7 @@
 ## Introduction
 Tesseract 2.0+ provided scripts that make it possible to run some of the UNLV tests published in the Fourth Annual Test of OCR Accuracy.
-See [AT-1995.pdf](https://github.com/tesseract-ocr/docs/blob/main/AT-1995.pdf) (originally available at http://www.isri.unlv.edu/). The main purpose of providing these test scripts is to enable Tesseract users to verify that their installation is correct, and that no architecture-specific problems are causing bad recognition accuracy.
It also serves as a benchmark to demonstrate accuracy improvements of each version. Developers working on Tesseract may find the benchmarking tools useful for measuring experimental new modules.
+See [AT-1995.pdf](https://github.com/tesseract-ocr/docs/blob/main/AT-1995.pdf) (originally available at https://www.isri.unlv.edu/). The main purpose of providing these test scripts is to enable Tesseract users to verify that their installation is correct, and that no architecture-specific problems are causing bad recognition accuracy. It also serves as a benchmark to demonstrate accuracy improvements of each version. Developers working on Tesseract may find the benchmarking tools useful for measuring experimental new modules.
 Note that **some** architecture-specific variation is bound to occur. Most of these should be caused by varying treatment and optimization of floating-point arithmetic between compilers. It is also possible of course that there are memory initialization errors that show up as differences between architectures, but we claim to have found most of these already in the unicodeization process.
diff --git a/ViewerDebugging.md b/ViewerDebugging.md
index db8162f..f971e5e 100644
--- a/ViewerDebugging.md
+++ b/ViewerDebugging.md
@@ -8,16 +8,16 @@ Tesseract has a built-in capability to display its internal state, so that you c
 The following components are required to run the viewer:
 * Java runtime
- * [piccolo2d-core-3.0.jar](http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-core/3.0/piccolo2d-core-3.0.jar)
- * [piccolo2d-extras-3.0.jar](http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-extras/3.0/piccolo2d-extras-3.0.jar)
- * [jaxb-api-2.3.1.jar](http://search.maven.org/remotecontent?filepath=javax/xml/bind/jaxb-api/2.3.1/jaxb-api-2.3.1.jar)
+ * [piccolo2d-core-3.0.jar](https://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-core/3.0/piccolo2d-core-3.0.jar)
+ * [piccolo2d-extras-3.0.jar](https://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-extras/3.0/piccolo2d-extras-3.0.jar)
+ * [jaxb-api-2.3.1.jar](https://search.maven.org/remotecontent?filepath=javax/xml/bind/jaxb-api/2.3.1/jaxb-api-2.3.1.jar)
 * `ScrollView.jar`, built from the source in tesseract/java or download [ScrollView.jar](ScrollView.jar) (build on 64bit Linux with jaxb-api-2.3.1.jar, piccolo2d-core-3.0.jar, piccolo2d-extras-3.0.jar and javac 1.8.0_181.md)
 `make ScrollView.jar` will download them automatically to `tesseract/java` if `curl `is present in your path.
 All these jar files need to go in a single directory. Tesseract learns the location either through the environment variable SCROLLVIEW\_PATH or a compiler define of the same name.
-Alternative download link by Dmitri Silaev is available from http://www.4shared.com/zip/FnP8RSu0/tess_debug_3_02.html.
+Alternative download link by Dmitri Silaev is available from https://www.4shared.com/zip/FnP8RSu0/tess_debug_3_02.html.
 Copy `piccolo-1.2.jar`, `piccolox-1.2.jar` and `ScrollView.jar` from the downloaded package to `C:\Tesseract-OCR\java`.
**On Linux:**
diff --git a/tess3/Technical-Documentation.md b/tess3/Technical-Documentation.md
index 3fb3698..9b35369 100644
--- a/tess3/Technical-Documentation.md
+++ b/tess3/Technical-Documentation.md
@@ -26,23 +26,23 @@ Spain July 25, 2009. https://dl.acm.org/citation.cfm?id=1577804
 ## Other publications from Ray Smith
- * [Ray Smith Publications](http://research.google.com/pubs/author4479.html)
- * [The extraction and recognition of text from multimedia document images](http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.380162) by Smith, R.W. (Ph.D. thesis), 1987
+ * [Ray Smith Publications](https://research.google.com/pubs/author4479.html)
+ * [The extraction and recognition of text from multimedia document images](https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.380162) by Smith, R.W. (Ph.D. thesis), 1987
 * [Slides from Tutorial on Tesseract presented at DAS2014](https://drive.google.com/file/d/0B7l10Bj_LprhbUlIUFlCdGtDYkE/edit?usp=sharing)
 * [Slides from Tutorial on Tesseract presented at DAS2016](https://github.com/tesseract-ocr/docs/tree/main/das_tutorial2016)
 ## Other
 * Video [PhotoTechEDU Day 11: Document Image Analysis with Leptonica](https://www.youtube.com/watch?v=pCZtGRUa_7s)
- * [Training Tesseract for Ancient Greek OCR](http://eutypon.gr/eutypon/pdf/e2012-29/e29-a01.pdf) by Nick White
- * [Shirorekha Chopping Integrated Tesseract OCR Engine for Enhanced Hindi Language Recognition](http://research.ijcaonline.org/volume39/number6/pxc3877076.pdf) by Nitin Mishra, C. Patvardhan, C.
Vasantha Lakshmi, Sarika Singh
- * [Report on the comparison of Tesseract and ABBYY FineReader OCR engines](http://lib.psnc.pl/dlibra/docmetadata?id=358&from=publication&showContent=true) by Heliński, Kmieciak, and Parkoła
+ * [Training Tesseract for Ancient Greek OCR](https://eutypon.gr/eutypon/pdf/e2012-29/e29-a01.pdf) by Nick White
+ * [Shirorekha Chopping Integrated Tesseract OCR Engine for Enhanced Hindi Language Recognition](https://research.ijcaonline.org/volume39/number6/pxc3877076.pdf) by Nitin Mishra, C. Patvardhan, C. Vasantha Lakshmi, Sarika Singh
+ * [Report on the comparison of Tesseract and ABBYY FineReader OCR engines](https://lib.psnc.pl/dlibra/docmetadata?id=358&from=publication&showContent=true) by Heliński, Kmieciak, and Parkoła
 * [The hOCR Embedded OCR Workflow and Output Format](https://github.com/kba/hocr-spec/) (hOCR specification)
 * [Text Detection on Nokia N900 Using Stroke Width Transform](https://sites.google.com/site/roboticssaurav/strokewidthnokia) (with source code)
 * [Generic Text Recognition using Long Short-Term Memory Networks - Ph.D. Thesis](https://kluedo.ub.uni-kl.de/files/4353/PhD_Thesis_Ul-Hasan.pdf)
 * [Creating a Modern OCR Pipeline Using Computer Vision and Deep Learning](https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/)
 * [Translation-Inspired OCR](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37260.pdf) by Dmitriy Genzel, Ashok C. Popat, Nemanja Spasojevic, Michael Jahr, Andrew Senior, Eugene le, Frank ... Keywords-Optical character recognition; statistical machine ... (character) locations in Arabic, English, and Hindi PRAN-data examples.
- * [Developing Multilingual OCR and Handwriting Recognition at Google](http://cdn.iiit.ac.in/cdn/cvit.iiit.ac.in/SSDA/slides/AshokPopat-IAPRJaipurJan2017.pdf) by Ashok Popat. Research Scientist, Google Inc. IAPR Summer School, Jaipur: Jan 23 2017.
+ * [Developing Multilingual OCR and Handwriting Recognition at Google](https://cdn.iiit.ac.in/cdn/cvit.iiit.ac.in/SSDA/slides/AshokPopat-IAPRJaipurJan2017.pdf) by Ashok Popat. Research Scientist, Google Inc. IAPR Summer School, Jaipur: Jan 23 2017.
 * [General-Purpose OCR Paragraph Identification by Graph Convolutional Neural Networks](https://arxiv.org/pdf/2101.12741.pdf) by Renshen Wang, Yasuhisa Fujii, Ashok C. Popat January 2021
\ No newline at end of file
diff --git "a/tess3/Training-Tesseract-3.00\342\200\2233.02.md" "b/tess3/Training-Tesseract-3.00\342\200\2233.02.md"
index 2f2a457..4b04ac1 100644
--- "a/tess3/Training-Tesseract-3.00\342\200\2233.02.md"
+++ "b/tess3/Training-Tesseract-3.00\342\200\2233.02.md"
@@ -75,8 +75,8 @@ The traineddata file is simply a concatenation of the input files, with a table
 ## Requirements for text input files
 Text input files (lang.config, lang.unicharambigs, font\_properties, box files, wordlists for dictionaries...) need to meet these criteria:
- * ASCII or UTF-8 encoding without [BOM](http://en.wikipedia.org/wiki/Byte_order_mark)
- * Unix [end-of-line marker](http://en.wikipedia.org/wiki/Newline) ('\n')
+ * ASCII or UTF-8 encoding without [BOM](https://en.wikipedia.org/wiki/Byte_order_mark)
+ * Unix [end-of-line marker](https://en.wikipedia.org/wiki/Newline) ('\n')
 * The last character must be an end of line marker ('\n'). Some text editors will show this as an empty line at the end of file. If you omit this you will get an error message containing "last\_char == '\n':Error:Assert failed..."
 ## How little can you get away with?
@@ -388,9 +388,9 @@ Seven of the files are coded as a Directed Acyclic Word Graph (DAWG), and the ot
 | fixed-length-dawgs | dawg | Several dawgs of different fixed lengths —— useful for languages like Chinese. [Not used since version 3.03] |
 | bigram-dawg | dawg | A dawg of word bigrams where the words are separated by a space and each digit is replaced by a _?_.
|
 | unambig-dawg | dawg | TODO: Describe. |
-| user-words | text | A list of extra words to add to the dictionary. Usually left empty to be added by users if they require it; see [tesseract(1)](http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data). |
+| user-words | text | A list of extra words to add to the dictionary. Usually left empty to be added by users if they require it; see [tesseract(1)](https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data). |
-To make the DAWG dictionary files, you first need a wordlist for your language. You may find an appropriate dictionary file to use as the basis for a wordlist from the spellcheckers (e. g. [ispell](http://ficus-www.cs.ucla.edu/geoff/ispell-dictionaries.html), [aspell](http://aspell.net/) or [hunspell](http://hunspell.sourceforge.net/)) - be careful about the license. The wordlist is formatted as a UTF-8 text file with one word per line. Split the wordlist into needed sets e.g.: the frequent words, and the rest of the words, and then use `wordlist2dawg` to make the DAWG files:
+To make the DAWG dictionary files, you first need a wordlist for your language. You may find an appropriate dictionary file to use as the basis for a wordlist from the spellcheckers (e. g. [ispell](https://ficus-www.cs.ucla.edu/geoff/ispell-dictionaries.html), [aspell](http://aspell.net/) or [hunspell](http://hunspell.sourceforge.net/)) - be careful about the license. The wordlist is formatted as a UTF-8 text file with one word per line. Split the wordlist into needed sets e.g.: the frequent words, and the rest of the words, and then use `wordlist2dawg` to make the DAWG files:
 ```
 wordlist2dawg frequent_words_list [lang].freq-dawg [lang].unicharset
@@ -449,7 +449,7 @@ The `unicharambigs` file may also be non-existent.
 # Putting it all together
-That is all there is to it!
All you need to do now is collect together all the files (`shapetable`, `normproto`, `inttemp`, `pffmtable`) and rename them with a `lang.` prefix, where lang is the 3-letter code for your language taken from http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes and then run `combine_tessdata` on them as follows:
+That is all there is to it! All you need to do now is collect together all the files (`shapetable`, `normproto`, `inttemp`, `pffmtable`) and rename them with a `lang.` prefix, where lang is the 3-letter code for your language taken from https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes and then run `combine_tessdata` on them as follows:
 ```
 combine_tessdata [lang].
 ```
diff --git "a/tess3/Training-Tesseract-3.03\342\200\2233.05.md" "b/tess3/Training-Tesseract-3.03\342\200\2233.05.md"
index 6133907..45c3821 100644
--- "a/tess3/Training-Tesseract-3.03\342\200\2233.05.md"
+++ "b/tess3/Training-Tesseract-3.03\342\200\2233.05.md"
@@ -112,8 +112,8 @@ The traineddata file is simply a concatenation of the input files, with a table
 ## Requirements for text input files
 Text input files (lang.config, lang.unicharambigs, font\_properties, box files, wordlists for dictionaries...) need to meet these criteria:
- * ASCII or UTF-8 encoding without [BOM](http://en.wikipedia.org/wiki/Byte_order_mark)
- * Unix [end-of-line marker](http://en.wikipedia.org/wiki/Newline) ('\n')
+ * ASCII or UTF-8 encoding without [BOM](https://en.wikipedia.org/wiki/Byte_order_mark)
+ * Unix [end-of-line marker](https://en.wikipedia.org/wiki/Newline) ('\n')
 * The last character must be an end of line marker ('\n'). Some text editors will show this as an empty line at the end of file. If you omit this you will get an error message containing `last_char == '\n':Error:Assert failed...`.
 ## How little can you get away with?
@@ -322,7 +322,7 @@ Seven of the files are coded as a Directed Acyclic Word Graph (DAWG), and the ot
 | bigram-dawg | dawg | A dawg of word bigrams where the words are separated by a space and each digit is replaced by a _?_. |
 | user-words | text | A list of extra words to add to the dictionary. Usually left empty to be added by users if they require it; see [tesseract(1)](https://github.com/tesseract-ocr/tesseract/blob/13b7900ebf21fbccbc3d89ebf63cc7165b6ae2ca/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data). |
-To make the DAWG dictionary files, you first need a wordlist for your language. You may find an appropriate dictionary file to use as the basis for a wordlist from the spellcheckers (e. g. [ispell](http://ficus-www.cs.ucla.edu/geoff/ispell-dictionaries.html), [aspell](http://aspell.net/) or [hunspell](http://hunspell.sourceforge.net/)) - be careful about the license. The wordlist is formatted as a UTF-8 text file with one word per line. Split the wordlist into needed sets e.g.: the frequent words, and the rest of the words, and then use `wordlist2dawg` to make the DAWG files:
+To make the DAWG dictionary files, you first need a wordlist for your language. You may find an appropriate dictionary file to use as the basis for a wordlist from the spellcheckers (e. g. [ispell](https://ficus-www.cs.ucla.edu/geoff/ispell-dictionaries.html), [aspell](http://aspell.net/) or [hunspell](http://hunspell.sourceforge.net/)) - be careful about the license. The wordlist is formatted as a UTF-8 text file with one word per line. Split the wordlist into needed sets e.g.: the frequent words, and the rest of the words, and then use `wordlist2dawg` to make the DAWG files:
 ```
 wordlist2dawg frequent_words_list lang.freq-dawg lang.unicharset
@@ -402,7 +402,7 @@ combine_tessdata lang.
```
 Although you can use any string you like for the language code, we recommend that you use a 3-letter code
-for your language matching one of the [ISO 639-2 codes](http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes).
+for your language matching one of the [ISO 639-2 codes](https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes).
 The resulting lang.traineddata goes in your tessdata directory. Tesseract can then recognize text in your language (in theory) with the following:
 ```
diff --git "a/tess3/Training-Tesseract-\342\200\223-Make-Box-Files.md" "b/tess3/Training-Tesseract-\342\200\223-Make-Box-Files.md"
index b01fcfb..eb9ba3a 100644
--- "a/tess3/Training-Tesseract-\342\200\223-Make-Box-Files.md"
+++ "b/tess3/Training-Tesseract-\342\200\223-Make-Box-Files.md"
@@ -111,7 +111,7 @@ This should make the 2nd box file easier to make, as there is a good chance that
 ### Tif/Box pairs provided!
-Tif/Box file pairs are available in the [Downloads Archive on SourceForge](http://sourceforge.net/projects/tesseract-ocr-alt/files/) for these languages:
+Tif/Box file pairs are available in the [Downloads Archive on SourceForge](https://sourceforge.net/projects/tesseract-ocr-alt/files/) for these languages:
 [Dutch](https://sourceforge.net/projects/tesseract-ocr-alt/files/boxtiff-2.01.nld.tar.gz/download)
 [English](https://sourceforge.net/projects/tesseract-ocr-alt/files/boxtiff-2.01.eng.tar.gz/download)
 [French](https://sourceforge.net/projects/tesseract-ocr-alt/files/boxtiff-2.01.fra.tar.gz/download)
diff --git "a/tess3/Training-Tesseract-\342\200\223-tesstrain.sh.md" "b/tess3/Training-Tesseract-\342\200\223-tesstrain.sh.md"
index 485477b..e719aff 100644
--- "a/tess3/Training-Tesseract-\342\200\223-tesstrain.sh.md"
+++ "b/tess3/Training-Tesseract-\342\200\223-tesstrain.sh.md"
@@ -62,7 +62,7 @@ These are general files that can affect multiple languages, but may be edited if
 Nick White's xheight tool can be used to find xheight of different fonts.To clone it and build the
xheights tool, do the following:
 ```
-$ git clone http://ancientgreekocr.org/grctraining.git
+$ git clone https://ancientgreekocr.org/grctraining.git
 $ cd grctraining
 $ make tools/xheight
 ```
diff --git a/tess3/TrainingTesseract2.md b/tess3/TrainingTesseract2.md
index 0afbdda..3fc9596 100644
--- a/tess3/TrainingTesseract2.md
+++ b/tess3/TrainingTesseract2.md
@@ -313,7 +313,7 @@ The DangAmbigs file may also be empty.
 # Putting it all together
-That is all there is to it! All you need to do now is collect together all 8 files and rename them with a `lang.` prefix, where lang is the 3-letter code for your language taken from http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes and put them in your tessdata directory. Tesseract can then recognize text in your language (in theory) with the following:
+That is all there is to it! All you need to do now is collect together all 8 files and rename them with a `lang.` prefix, where lang is the 3-letter code for your language taken from https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes and put them in your tessdata directory. Tesseract can then recognize text in your language (in theory) with the following:
 ```
 tesseract image.tif output -l lang
 ```
diff --git a/tess4/TrainingTesseract-4.00---Finetune.md b/tess4/TrainingTesseract-4.00---Finetune.md
index d2fbd50..87b2617 100644
--- a/tess4/TrainingTesseract-4.00---Finetune.md
+++ b/tess4/TrainingTesseract-4.00---Finetune.md
@@ -9,7 +9,7 @@
 ### Modified training scripts created by Tesseract users:
 * [By J Klein at pastebin](https://pastebin.com/gNLvXkiM)
-* [wiki.wareya.moe - info](http://wiki.wareya.moe/Tesseract)
+* [wiki.wareya.moe - info](https://wiki.wareya.moe/Tesseract)
 * [wiki.wareya.moe - tesstrain.sh at pastebin](https://pastebin.com/cD5wctUG)
 * [wiki.wareya.moe - tesstrain_utils.sh at pastebin](https://pastebin.com/TfqJUxSR)