Converting hOCR to an HTML table

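I am looking for a tool, or an idea for a Python implementation, that converts an hOCR file (generated by Tesseract in my application) into an HTML table.
The idea is to use the text position information in the hOCR file (provided in the bbox attribute) to create a table based on those positions.
Here is an example to illustrate the idea:
I used an image from slideshare.net as input to my application, which uses Tesseract, and got the following hOCR/XML file as output.
hOCR file: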

  <div class='ocr_page' id='page_2' title='image "sample_slide.jpg"; bbox 0 0 638 479; ppageno 1'>
   <div class='ocr_carea' id='block_1_1' title="bbox 0 0 638 479">
    <p class='ocr_par' dir='ltr' id='par_1' title="bbox 31 104 620 439">
     <span class='ocr_line' id='line_1' title="bbox 32 104 613 138"><span class='ocrx_word' id='word_1' title="bbox 32 105 119 131">done:</span> <span class='ocrx_word' id='word_2' title="bbox 132 104 262 138">working</span> <span class='ocrx_word' id='word_3' title="bbox 273 105 405 138">product,</span> <span class='ocrx_word' id='word_4' title="bbox 419 104 517 132">hotels</span> <span class='ocrx_word' id='word_5' title="bbox 528 104 613 132">listed</span> 
     </span>
     <span class='ocr_line' id='line_2' title="bbox 31 160 471 194"><span class='ocrx_word' id='word_6' title="bbox 31 164 62 187">to</span> <span class='ocrx_word' id='word_7' title="bbox 75 161 122 187">do:</span> <span class='ocrx_word' id='word_8' title="bbox 134 164 227 187">smart</span> <span class='ocrx_word' id='word_9' title="bbox 236 160 330 187">trafﬁc</span> <span class='ocrx_word' id='word_10' title="bbox 342 160 471 194">building</span> 
     </span>
     <span class='ocr_line' id='line_3' title="bbox 32 243 284 280"><span class='ocrx_word' id='word_11' title="bbox 32 243 128 280">seed</span> <span class='ocrx_word' id='word_12' title="bbox 148 243 284 280">round:</span> 
     </span>
     <span class='ocr_line' id='line_4' title="bbox 71 316 619 361"><span class='ocrx_word' id='word_13' title="bbox 71 321 156 356">CEO</span> <span class='ocrx_word' id='word_14' title="bbox 171 319 240 355">will</span> <span class='ocrx_word' id='word_15' title="bbox 260 321 384 356">invest</span> <span class='ocrx_word' id='word_16' title="bbox 517 316 619 361">$30k</span> 
     </span>
     <span class='ocr_line' id='line_5' title="bbox 75 392 620 439"><span class='ocrx_word' id='word_17' title="bbox 75 397 252 433">investor</span> <span class='ocrx_word' id='word_18' title="bbox 489 392 620 439">$120k</span> 
     </span>
    </p>
   </div>
  </div>

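What I need is to convert the hOCR file into an HTML table based on the positions of the text. The expected table should look similar to this table.
The size and position of the table cells should reflect the information provided in the hOCR file.
Image source: slideshare.net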
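
As a starting point for the idea described above, here is a minimal sketch of reading the bbox values of each word with Python's standard library. The class name `HocrWordParser`, the helper `parse_hocr_words`, and the `SAMPLE` fragment (copied from line_4 of the file above) are illustrative names, not part of Tesseract or any hOCR library. The sketch assumes words are marked with class `ocrx_word` and carry a `bbox x0 y0 x1 y1` entry in their title attribute, as in the sample.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" entry inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Fragment copied from line_4 of the sample hOCR file.
SAMPLE = """<span class='ocr_line' id='line_4' title="bbox 71 316 619 361"><span class='ocrx_word' id='word_13' title="bbox 71 321 156 356">CEO</span> <span class='ocrx_word' id='word_14' title="bbox 171 319 240 355">will</span> <span class='ocrx_word' id='word_15' title="bbox 260 321 384 356">invest</span> <span class='ocrx_word' id='word_16' title="bbox 517 316 619 361">$30k</span>
</span>"""


class HocrWordParser(HTMLParser):
    """Collect the text, line id and bounding box of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._line_id = None   # id of the enclosing ocr_line
        self._bbox = None      # bbox of the word being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        css_class = attributes.get("class", "")
        if css_class == "ocr_line":
            self._line_id = attributes.get("id")
        elif css_class == "ocrx_word":
            match = BBOX_RE.search(attributes.get("title", ""))
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # The first closing span after a word start ends that word.
        if tag == "span" and self._bbox is not None:
            self.words.append({
                "line": self._line_id,
                "text": "".join(self._text),
                "bbox": self._bbox,
            })
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of {'line', 'text', 'bbox'} dicts, in document order."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word in parse_hocr_words(SAMPLE):
        print(word["line"], word["text"], word["bbox"])
```

Each bbox is `(x0, y0, x1, y1)` in image pixels, so the horizontal gap between two neighbouring words is the next word's `x0` minus the previous word's `x1`.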

Best answer:

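Check this document. I believe it describes much (or all) of what you need.
From the introduction:
This document describes a representation of various aspects of OCR
output in an XML-like format. That is, we define a set of tags
containing text and other tags, together with attributes of those
tags. However, since the content we are representing is formatted
text, we are not actually using a new XML for the
representation; instead, we embed the representation in XHTML (or HTML),
because XHTML and XHTML processing already define many aspects of OCR
output representation that would otherwise need additional, separate,
and ad hoc definitions.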
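The XML can also be converted to HTML using XSLT. In fact, there is a project which plans to do just that.
Also, this project (hocr-tools) may help.
Finally, note that the FAQ of Tesseract mentions:
With the configuration file "hocr", Tesseract will produce XHTML output
that conforms to the hOCR specification.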
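
If none of the tools above fits, one possible approach in plain Python is sketched below. It is not taken from the hOCR specification or from hocr-tools. It assumes the word boxes have already been extracted from the hOCR file (here they are copied by hand from lines 3 to 5 of the sample), that words on a line belong to the same cell unless the horizontal gap between them exceeds `max_gap` pixels, and that cells whose left edges lie within `tolerance` pixels share a column. The function names and both thresholds are illustrative guesses that would need tuning for other images.

```python
from html import escape

# Word boxes copied from the sample hOCR: one list per ocr_line,
# each word given as (text, x0, y0, x1, y1).
LINES = [
    [("seed", 32, 243, 128, 280), ("round:", 148, 243, 284, 280)],
    [("CEO", 71, 321, 156, 356), ("will", 171, 319, 240, 355),
     ("invest", 260, 321, 384, 356), ("$30k", 517, 316, 619, 361)],
    [("investor", 75, 397, 252, 433), ("$120k", 489, 392, 620, 439)],
]


def split_into_cells(words, max_gap):
    """Merge neighbouring words into one cell unless the gap exceeds max_gap."""
    cells = []
    for text, x0, _y0, x1, _y1 in sorted(words, key=lambda w: w[1]):
        if cells and x0 - cells[-1]["x1"] <= max_gap:
            cells[-1]["text"] += " " + text
            cells[-1]["x1"] = x1
        else:
            cells.append({"text": text, "x0": x0, "x1": x1})
    return cells


def column_starts(rows, tolerance):
    """Cluster the left edges of all cells into column start positions."""
    starts = []
    for x0 in sorted(cell["x0"] for row in rows for cell in row):
        if not starts or x0 - starts[-1] > tolerance:
            starts.append(x0)
    return starts


def hocr_lines_to_table(lines, max_gap=40, tolerance=30):
    """Build an HTML table with one row per line and one column per cluster."""
    rows = [split_into_cells(words, max_gap) for words in lines]
    starts = column_starts(rows, tolerance)
    html = ["<table>"]
    for row in rows:
        texts = [""] * len(starts)
        for cell in row:
            # Place the cell in the column whose start is closest to its left edge.
            col = min(range(len(starts)),
                      key=lambda i: abs(starts[i] - cell["x0"]))
            texts[col] = (texts[col] + " " + cell["text"]).strip()
        cells_html = "".join(f"<td>{escape(t)}</td>" for t in texts)
        html.append(f"  <tr>{cells_html}</tr>")
    html.append("</table>")
    return "\n".join(html)


if __name__ == "__main__":
    print(hocr_lines_to_table(LINES))
```

With the sample data this puts "seed round:" in the first column and the indented entries ("CEO will invest", "investor") with their amounts in the second and third columns. The sketch only reproduces the row and column layout. To make cell sizes follow the image as well, widths could be derived from the differences between consecutive column starts and written as style attributes.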