
TextRipper (aka T-Rip)
An OCR (Optical Character Recognition) GUI application or CLI script
# Supports the Tesseract engine by default!
# Optionally supports the Ocrad engine for multi-column text.
# These recognition engines have a very high character recognition success rate compared to other OCR engines, including proprietary software.
# New: multi-page and multiple file selection support!
# Enhanced XSANE output and TIFF compatibility.
# New: now handles nearly any format out there!
# This script will convert any image of text into editable and indexable text. (for a full list of compatible file formats see the first filter below)
#
# Note: The cleaner, higher-contrast, and higher-resolution your image or scan is, the better the results.
#
# Dependencies: libtiff-dev (or -devel)(installed FIRST), tesseract-2.04 (latest stable-version), your chosen language data for Tesseract (2.00 and up) *1,
# ImageMagick, ghostscript, Zenity, and OpenOffice or other text editor *2
# This version of tesseract can be downloaded from here: http://code.google.com/p/tesseract-ocr/downloads/list
# Warning: This script will not work with the latest beta version (tesseract 3.00 pre-release) due to database structure modifications.
#
# Optional dependencies: ocrad -> an alternate recognition engine
# If initial results are unsatisfactory, this engine may do better. Most importantly, it supports basic page format recognition. *3
# The latest version of ocrad can be downloaded off the GNU mirror list here: http://www.gnu.org/software/ocrad/
#
# Also: Make sure to select Unicode UTF-8 in OpenOffice's pop-up window (or text editor of your choice).
#
#
#
# *1 Install Tesseract after libtiff-dev. Then extract all the language databases you need into the "wherever_you_installed/tesseract-2.04/tessdata" directory.
# This is done automatically if you extract the language databases from WITHIN the "tesseract-2.04" directory (and allow overwriting).
# This script allows the use of multiple language databases. Default is English and French. For adding others see comments below.
# You NEED at least one language database or tesseract will not work.
# *2 Simply change the occurrence of "soffice -writer" below to a text editor of your choice, e.g. gedit, KWrite, etc.
# Some systems launch OpenOffice Writer differently. If unsure, check the properties tab of your Writer launcher.
# E.g., on customized versions of OOo (such as the ones provided by Linux Mandrake or Gentoo), you start Writer with: oowriter
# *3 If ocrad is also installed, TextRipper detects it and prompts you to choose between the two engines: Tesseract for better recognition, ocrad for page format support.
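
The engine check described in note *3 could look roughly like the sketch below. It is not taken from TextRipper itself: the function name `available_engines` is hypothetical, and it assumes the engines are installed under the command names `tesseract` and `ocrad`.

```shell
# List which of the given commands are found on PATH, one per line.
# Hypothetical helper for illustration; not part of TextRipper.
available_engines() {
    for engine in "$@"; do
        # "command -v" exits non-zero when the command cannot be found
        if command -v "$engine" >/dev/null 2>&1; then
            printf '%s\n' "$engine"
        fi
    done
}
```

Calling `available_engines tesseract ocrad` prints one line per installed engine. Two lines of output would be the case where a script prompts the user to pick one.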
#
# Troubleshooting:
# If this script ends by saying your text editor can't open "OCR output-editable text.txt",
# or, when run from the CLI, reports: Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
# then run (as superuser):
# echo /usr/local/share /usr/share | xargs -n 1 cp -R wherever_you_installed/tesseract-2.04/tessdata
# Explanation: Tesseract may call on the tessdata directory from the /share directory of your filesystem,
# so you need to make your language databases available from there.
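
Before copying anything, it may help to check which share directories already hold the language data. The sketch below is an assumption-laden illustration, not part of the script: the function name `tessdata_dirs_with` is hypothetical, and it only looks for the `<lang>.unicharset` file named in the error message above.

```shell
# Print every directory from the argument list that contains
# tessdata/<lang>.unicharset.
# Usage: tessdata_dirs_with LANG DIR...
# Hypothetical helper for illustration; not part of TextRipper.
tessdata_dirs_with() {
    lang=$1
    shift
    for share_dir in "$@"; do
        if [ -f "$share_dir/tessdata/$lang.unicharset" ]; then
            printf '%s\n' "$share_dir"
        fi
    done
}
```

For example, `tessdata_dirs_with eng /usr/local/share /usr/share` prints the directories where the English data is present. If it prints nothing, the copy step above is probably what is missing.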
polardude1983
9 years ago
/home/christoph/Downloads/dog_petition10001.jpg (editable and indexable 001.txt does not exist
And I believe I installed everything correctly.
I have Zenity, Tesseract-ocr, Tesseract-ocr-eng, imagemagick, libtiff4-dev, ghostscript.
Any help would be appreciated. I have tried it on different images in different formats, jpg, png, pdf. Same error for all
kickass
9 years ago
try this:
do (as superuser):
# echo /usr/local/share /usr/share | xargs -n 1 cp -R wherever_you_installed/tesseract-2.04/tessdata
Explanation: Tesseract may call on the tessdata directory from the /share directory of your filesystem,
so you need to make your language databases available from there.
let me know if this was it.
d.
I4C
7 years ago
kickass
7 years ago
ps to all users: there's a great new one out there called YAGF. check it out.
agentkiller4
10 years ago
kickass
10 years ago
kayce
10 years ago
The download link is broken. Can somebody fix it please? Thanks.
kickass
10 years ago
HOWEVER, soon, very soon, I'll release the new version of Text Recognition now rebaptized TextRipper. It can rip text off anything!
till then,
d.
kayce
10 years ago
Well when I clicked on "Download", I have a new page coming up saying that the download popup should appear soon. But instead, no popup appears and it redirects me to the following link:
http://gtk-apps.org/CONTENT/content-files/132759-Text%20Recognition
Any insight regarding this? Thanks,
kickass
10 years ago
Otherwise, like i said above, in about a week i'll release TextRipper.
kayce
10 years ago
Btw, does it handle hand-written cases? Most OCR out there (including Tesseract) cannot handle hand-written characters. From my understanding, Tesseract expects well-segmented and well-defined fonts.
Cheers
kickass
10 years ago
d.
kickass
10 years ago
the main difficulty in recognizing handwritten text has less to do with the "font" (calligraphy) than with whether the letters are joined or distinctly separate. Tesseract does a pretty fine job if the letters aren't linked. Try it out for yourself. If you find an engine that beats tesseract in this please let me know.
dave
inameiname
10 years ago
/home/me/Tmp/OCR output-editable text.txt does not exist.
This occurred in both versions of your script. I don't understand.
Thanks in advance.
kickass
10 years ago
There are only two possible causes for this error message.
The first is treated clearly albeit concisely in the heading comments of the script itself under troubleshooting.
The second is an incompatible image format, because either 1) you are missing libraries such as libtiff-dev or 2) the tesseract engine just can't handle that particular file. In this case a conversion usually fails. You must rescan, preferably to a different format such as pnm. There have also been reports of success in such cases after switching to the ocrad engine.
Your pick.
cheers,
d.
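
For readers who want to test the format question by hand, the conversion and recognition steps can be sketched as below. This is an illustration under stated assumptions, not TextRipper's actual code: the function name `ocr_commands` is hypothetical, it only prints the commands instead of running them, it assumes the input file name has an extension, and it hard-codes English (`-l eng`) for Tesseract.

```shell
# Print the commands that would convert an image and run an OCR engine on it.
# Hypothetical helper for illustration; it runs nothing itself.
# Usage: ocr_commands IMAGE [tesseract|ocrad]
ocr_commands() {
    input=$1
    engine=${2:-tesseract}
    base=${input%.*}    # strip the extension: scan.jpg -> scan
    case "$engine" in
        tesseract)
            # ImageMagick converts to TIFF; tesseract writes <base>.txt
            printf 'convert "%s" "%s"\n' "$input" "$base.tif"
            printf 'tesseract "%s" "%s" -l eng\n' "$base.tif" "$base"
            ;;
        ocrad)
            # ocrad reads PNM files; -o names the output text file
            printf 'convert "%s" "%s"\n' "$input" "$base.pnm"
            printf 'ocrad -o "%s" "%s"\n' "$base.txt" "$base.pnm"
            ;;
        *)
            echo "unknown engine: $engine" >&2
            return 1
            ;;
    esac
}
```

Running `ocr_commands scan.jpg ocrad` shows the two commands to try by hand. If the `convert` step itself fails, the image format is the likely culprit rather than the engine.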
inameiname
10 years ago
chric
10 years ago
kickass
10 years ago
Thanks for your input.
You'll be happy with Ver 1.1.
It's made for tesseract but still allows for ocrad if you like.
cheers,
d.