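A Roundup of Optical Character Recognition (OCR) Tools, Libraries, Tutorials, Papers, and Other Resources

This post collects OCR-related software, libraries, articles, and other resources. Additions are welcome. A download link for the resources is given at the end of the post.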

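· Software

OCR engines

Older OCR engines

OCR file formats

hOCR

ALTO XML

TEI

OCR CLI

OCR GUI

OCR preprocessing

OCR services

OCR evaluation

OCR libraries (sorted by programming language)

Java

.Net

JavaScript

PHP

Python

Ruby

OCR training tools

· Academic

OCR-related publications and link lists

Blog posts and tutorials

OCR examples

Academic papers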

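Software

OCR engines

··· tesseract - the foundational open-source OCR engine, Apache 2.0

··· ocropus - LSTM-based OCR engine, Apache 2.0

··· ocropus 0.4 - the older v0.4 release of Ocropus, which bundles tesseract 2.04 and iulib, C++

··· Kraken - a fork of Ocropus

··· Ocrad - the GNU OCR, GPL

··· Digit - OCR for numeric displays such as power meters, uses Caffe

··· ocular - machine-learning OCR for historical documents

··· SwiftOCR - fast and simple OCR library written in Swift

··· Attention OCR - OCR engine that uses a visual attention mechanism

··· RWTH-OCR - the RWTH Aachen University optical character recognition system

··· simple-ocr-opencv and its fork - a simple pythonic OCR engine using opencv and numpy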

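Older and possibly abandoned OCR engines

··· Clara OCR - open-source OCR engine written in C, GPL

··· CuneiForm - CuneiForm OCR, developed by Cognitive Technologies

··· Eye - experimental Java OCR (image-to-text) application

··· kognition - omnifont OCR software for KDE

··· OCRchie - modular optical character recognition software

··· ocre - an easy-to-use OCR tool

··· xplab - GTK 2 tool for pattern matching

··· hebOCR - Hebrew character recognition library (formerly named hocr, see the Wikipedia article), GPL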

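OCR file formats

··· abby2hocr.xslt - XSLT script

··· OCR conversion scripts

hOCR

··· hocr-tools - tools for doing various useful things with the hOCR file format, Apache 2.0

··· hocr-spec - the hOCR 1.1 specification

··· ocr-transform - CLI tool for converting between hOCR and ALTO, MIT

··· hocr-parser - Python parser for the hOCR specification

··· hOCRTools - XSLT for converting hOCR to ALTO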
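
For readers new to the format: hOCR embeds OCR results in HTML and typically records each word's bounding box in the `title` attribute of an element with class `ocrx_word`, for example `title='bbox 36 92 96 116'`. The following is a minimal standard-library sketch of reading those boxes. The sample markup and the name `parse_hocr_words` are illustrative assumptions and are not taken from any of the tools listed above; the sketch also assumes each word is a `span` with no nested `span` inside it.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class _WordCollector(HTMLParser):
    """Collects (text, bbox) pairs for elements with class "ocrx_word"."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word being read; None when outside a word
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            # An empty tuple marks a word whose title carries no bbox.
            self._bbox = tuple(int(g) for g in match.groups()) if match else ()
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: a word ends at the first closing </span> after it opens.
        if self._bbox is not None and tag == "span":
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    """Return a list of (word_text, (x0, y0, x1, y1)) from an hOCR string."""
    collector = _WordCollector()
    collector.feed(hocr)
    collector.close()
    return collector.words


# Hypothetical two-word hOCR fragment, used only for illustration.
SAMPLE = """
<span class='ocr_line' title='bbox 36 92 580 122'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
  <span class='ocrx_word' title='bbox 109 92 180 116; x_wconf 91'>world</span>
</span>
"""

for text, bbox in parse_hocr_words(SAMPLE):
    print(text, bbox)
```

For real documents, the dedicated parsers and converters listed above are likely a better fit than this sketch, since they handle the full specification.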

ALTO XML

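··· ALTO XML Schema - the XML Schema and ongoing development of the ALTO XML format

··· ALTO XML Documentation - documentation and use cases for ALTO

··· alto-tools - various tools for working with ALTO files, Python

··· AbbyyToAlto - PHP script for converting Abbyy 6 output to ALTO XML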

TEI

··· TEI-OCR - a TEI customization for OCR-generated layout and content information

··· TEI SIG on Libraries - best practices for TEI in libraries

··· GDZ - the METS/TEI-based document format of GDZ

OCR CLI

··· OCRmyPDF - adds an OCR text layer to scanned PDF files so that they can be searched

··· Ocrocis - project management interface for Ocropy; see also the external project homepage

OCR GUI

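··· moz-hocr-editor - Firefox add-on for editing hOCR files (discontinued)

··· qt-box-editor - QT4 editor for tesseract-ocr box files.

··· ocr-gt-tools - client-server application for editing OCR ground truth.

··· PaperWork - uses a scanner and OCR to make paper documents easy to search (Linux only).

··· Paperless - scan, index, and archive all of your paper documents.

··· gImageReader - a simple Gtk/Qt front-end to tesseract-ocr.

··· VietOCR - a Java/.NET GUI front-end for the Tesseract OCR engine; includes jTessBoxEditor, a graphical editor for Tesseract box data

··· PoCoTo

··· OCRFeeder - GTK graphical user interface that lets users correct characters or bounding boxes, supports ODT export, and more.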

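OCR preprocessing

··· NoiseRemove.java in MathOCR - Java implementation

··· binarize.c in ZBar - C implementations of two binarization algorithms, based on Sauvola

··· typeface-corpus - a repository for training Tesseract and OCRopus for natural history collections and the digital humanities.

··· binarizewolfjolion - a comparison of binarization algorithms. Blog post

··· crop_morphology.py in oldnyc - crops a page to its text block

··· Whiteboard Picture Cleaner - one-liner/script for cleaning up and beautifying whiteboard photos

··· Fred's ImageMagick script textcleaner - processes scanned documents to clean up the text background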
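
Several of the entries above deal with binarization, and one names Sauvola's method. As a rough illustration, Sauvola thresholding computes a per-pixel threshold T = m * (1 + k * (s / R - 1)) from the local mean m and local standard deviation s in a window around the pixel. The sketch below is a plain-Python version written for readability, not speed. It is not the ZBar code; the function name `sauvola_binarize`, the window size, and the values k = 0.5 and R = 128 (commonly cited for 8-bit images) are assumptions chosen for this example.

```python
import math


def sauvola_binarize(gray, window=15, k=0.5, r=128.0):
    """Binarize a grayscale image given as a list of rows of 0-255 values.

    Returns rows of 0 (ink) and 255 (background). The window is clipped
    at the image borders. Parameter defaults are illustrative assumptions.
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            # Gather the local neighbourhood, clipped to the image bounds.
            values = [
                gray[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = sum(values) / len(values)
            variance = sum((v - mean) ** 2 for v in values) / len(values)
            std = math.sqrt(variance)
            # Sauvola threshold: T = m * (1 + k * (s / R - 1))
            threshold = mean * (1 + k * (std / r - 1))
            out_row.append(255 if gray[y][x] > threshold else 0)
        result.append(out_row)
    return result


# Hypothetical 5x5 test image: light background with one dark pixel.
image = [[200] * 5 for _ in range(5)]
image[2][2] = 20
binary = sauvola_binarize(image, window=5)
for row in binary:
    print(row)
```

The brute-force window scan costs O(window²) per pixel; practical implementations usually rely on integral images or an image-processing library instead.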

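OCR services

··· Open OCR - runs Tesseract in Docker containers

··· tesseract-web-service - RESTful web service for tesseract-OCR, implemented with tornado.

··· docker-ocropy - Docker container for running the ocropy OCR system.

··· ABBYY Cloud OCR SDK code samples - code samples for using the proprietary, commercial ABBYY OCR API.

··· nidaba - a scalable OCR pipeline

··· gamera - a framework for building document processing applications such as OCR

··· ocr-tools - a project providing CLI and web service interfaces to common OCR engines

··· ocrad-docker - runs the ocrad OCR engine in a docker container

··· Kraken-docker - runs the kraken OCR engine in docker

··· ocr.space - free online OCR and OCR API based on Tesseract (closed source)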

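OCR evaluation

··· ISRI OCR evaluation tools, with the 1996 user guide:

o isri-ocr-evaluation-tools - further development by @eddieantonio (2015, 2016)

o ancientgreekocr-evaluation-tools - further development by @nickjwhite (2013, 2014)

··· ocrevalUAtion - evaluation across formats, CLI and GUI

··· ngram-ocr-eval - quick and crude OCR evaluation using n-grams

··· quack - quality assurance tool for scans with corresponding ALTO files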
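
As background for the tools above, one widely used accuracy measure is the character error rate (CER): the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The following is a minimal sketch under that definition. It is not the method implemented by any of the listed tools, and the function names are chosen for this example only.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                          # delete ref_char
                current[j - 1] + 1,                       # insert hyp_char
                previous[j - 1] + (ref_char != hyp_char)  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance normalized by the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


# Hypothetical example: the OCR engine read the letter "o" as the digit "0".
print(character_error_rate("recognition", "recogniti0n"))
```

Note that CER can exceed 1.0 when the OCR output contains many spurious insertions, so it is an error rate rather than a bounded percentage.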

OCR libraries (sorted by programming language)

··· gosseract - Golang OCR library that wraps Tesseract-ocr.

Java

··· Tess4J - Java Native Access (JNA) bindings to Tesseract.

··· tess-two - tools for compiling Tesseract on Android, plus a Java API.

.Net

··· tesseract for .net - a .Net wrapper for Tesseract-ocr.

PHP

··· Tesseract OCR for PHP - a PHP wrapper for Tesseract

Python

··· pytesseract - a Python wrapper for Google Tesseract.

··· pyocr - a Python wrapper for Tesseract and Cuneiform.

··· ocrodjvu - a library and standalone tool for performing OCR on DjVu files; wraps Cuneiform, gocr, ocrad, ocropus, and tesseract

JavaScript

··· ocracy - pure JavaScript LSTM RNN implementation based on ocropus

··· gocr.js - JavaScript port of gocr (via emscripten)

··· ocrad.js - JavaScript port of ocrad (via emscripten)

··· tesseract.js - JavaScript port of Tesseract (via emscripten)

··· node-tesseract - a simple wrapper for the Tesseract OCR package.

··· node-tesseract-native - C++ module for Node that provides OCR using tesseract and leptonica.

Ruby

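··· rtesseract - Ruby library that wraps the tesseract and imagemagick executables.

··· ruby-tesseract - native Tesseract bindings for Ruby MRI and JRuby

··· ocr_space - API wrapper for the free OCR service ocr.space. Includes a CLI

OCR training tools

··· glyph-miner - a system for extracting glyphs from early prints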

Academic

OCR-related publications and link lists

··· IMPACT: Tools for text digitisation - a list of OCR-related software and tools

··· OCR-D - a list of OCR-related academic articles.

··· Mendeley Group "OCR - Optical Character Recognition" - a collection of 34 OCR-related articles

··· http://eadh.org projects - a list of digital humanities projects in Europe, some of them OCR-related

··· Wikipedia: Comparison of optical character recognition software

··· OCR [and Deep Learning] by @handong1587

··· Ocropus Wiki: Publications

Blog posts and tutorials

··· Tesseract Blends Old and New OCR Technology (2016) @theraysmith

-Tutorial@DAS2016, with the "What You Always Wanted to Know" slides added

··· What You Always Wanted To Know About Tesseract (2014) @theraysmith

-Tutorial@DAS2014, includes demos

··· Extracting text from an image using Ocropus (2015)

··· Training an Ocropus OCR model (2015) @danvk

··· Ocropus Wiki: Compute errors and confusions (2016) @zuphilip

··· Ocropus Wiki: Working with Ground Truth (2016) @zuphilip

··· OCRopus (2016) @jze

o ocropus

··· 10 Tips for making your OCR project succeed (2013) @cneud

-Things to consider for OCR projects

··· Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology

-Feature overview of a commercial image preprocessing product; covers preprocessing steps relevant to OCR.

··· Extracting Text from PDFs; Doing OCR; all within R @shawngraham

-How to run OCR on PDF files within the R programming environment

··· Tutorial: Command-line OCR on a Mac @bmschmidt

-Tutorial on running tesseract on Mac OS X

··· Practical Experience with OCRopus Model Training (2016) @jze

··· Homemade Manuscript OCR (1): OCRopy (2017) @Jean-Baptiste-Camps

-Tutorial on applying OCR to medieval manuscripts

··· Optimizing Binarization for OCRopus (2017) @jze

··· Prototype demo for OCR postfix in Danish Newspapers (2016) @thomasegense

··· How Can I OCR My Dictionary? (2016) @JessedeDoes

··· "Needlessly complex" blog (2016) @mzucker. Several Python-based image processing tutorials:

-Page dewarping (code)

-Compressing and enhancing hand-written notes (code)

-Unprojecting text with ellipses (code)

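OCR examples

··· abbyy-finereader-ocr-senate - uses OCR to parse scanned Senate financial statements.

··· cvOCR - an OCR system for recognizing resume (CV) text, based on tesseract, implemented in Python and C

··· MathOCR - a recognition system for printed scientific documents, pre-alpha

Selected important academic papers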

2011 and before

··· High performance document layout analysis (2003) Breuel

··· Adaptive degraded document image binarization (2006) Gatos, Pratikakis, Perantonis

··· [Internship Report] (2007) Gupta

··· OCRopus Addons (Internship Report) (2007) Dantrey

2012

··· Local Logistic Classifiers for Large Scale Learning (2012) Yousefi, Breuel

2013

··· High Performance OCR for Printed English and Fraktur using LSTM Networks (2013) Breuel, Ul-Hasan, Mayce Al Azawi, Shafait

··· Can we build language-independent OCR using LSTM networks? (2013) Ul-Hasan, Breuel

··· Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks (2013) Ul-Hasan, Ahmed, Rashid, Shafait, Breuel

2014

··· OCR of historical printings of Latin texts: Problems, Prospects, Progress. (2014) Springmann, Najock, Morgenroth, Schmid, Gotscharek, Fink

··· Correcting Noisy OCR: Context beats Confusion (2014) Evershed, Fitch

2015

··· TypeWright: An Experiment in Participatory Curation (2015) Bilansky

o On crowd-sourcing OCR postcorrection

··· Benchmarking of LSTM Networks (2015) Breuel

··· Recognition of Historical Greek Polytonic Scripts Using LSTM (2015) Simistira, Ul-Hasan, Papavassiliou, Basilis Gatos, Katsouros, Liwicki

··· A Segmentation-Free Approach for Printed Devanagari Script Recognition (2015) Karayil, Ul-Hasan, Breuel

··· A Sequence Learning Approach for Multiple Script Identification (2015) Ul-Hasan, Afzal, Shafait, Liwicki, Breuel

2016

··· Important New Developments in Arabographic Optical Character Recognition (OCR) (2016) Romanov, Miller, Savant, Kiessling

o on kraken

o using OpenArabic/OCR_GS_Data for ground truth data

··· OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus (2016) Springmann, Lüdeling

··· Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents (2016) Springmann, Fink, Schulz

··· Generic Text Recognition using Long Short-Term Memory Networks (2016) Ul-Hasan -- Ph.D. thesis

··· OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters (2016) Andreas Dengel, Ul-Hasan, Bukhari

2017

··· Telugu OCR Framework using Deep Learning (2015/2017) Achanta, Hastie

o see also TeluguOCR, banti_telugu_ocr, chamanti_ocr, #49

Download link for the resources:

Password: p85r
