=================================================================
TABLE DETECTION IN IMAGES AND OCR TO CSV WITH JAVA
Yan Shi
=================================================================
This Java package contains modules that help find tabular data in a PDF or image and extract it into CSV format.
Given an image that contains a table...
Extract the text into CSV format...
节次 星期,周一,周二,周三,周四,周五
一,语文,英语,英语,自然,数学
二,语文,英语,英语,语文,数学
三,数学,语文,数学,语文,英语
四,数学,语文,数学,体育,英语
五,体育,思想品德,语文,数学,手工
六,美术,音乐,语文,数学,写字
See the Maven dependencies (JAR packages):
- pdfbox 2.0.26
- javacv 1.5.7
- djl 0.17.0
- ...
There is a demo module that tries to extract tables from an image and process the cells into a CSV. You can try it out with one of the images included in this repo.
1. table/demo/MainDemo
That will run against the following image:
The following should be saved to your directory after running the class table/demo/MainDemo.java.
Extracted the following tables from the image:
[('/img_test/simple.png', ['/img_test/simple/table-0.png'])]
Extracted cells from /img_test/simple/table-0.png
Cells:
/img_test/simple/table-0/0-0.png
/img_test/simple/table-0/0-1.png
/img_test/simple/table-0/0-2.png
...
Here is the entire CSV output:
Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4
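The quoted formula in the output above, "=PMT(B4/12,C4,D4)", shows why cell text cannot simply be joined with commas. Below is a minimal sketch of that final step, assuming the OCR text of every cell is already in memory as rows of strings. The names CsvSketch, escape and toCsv are hypothetical and are not part of this package.

```java
import java.util.List;
import java.util.stream.Collectors;

public class CsvSketch {

    // Quote a field only when it contains a comma, a double quote, or a line break.
    // Embedded double quotes are doubled, as in common CSV practice (RFC 4180).
    public static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join the cells of each row with commas and the rows with newlines.
    public static String toCsv(List<List<String>> rows) {
        return rows.stream()
                .map(row -> row.stream()
                        .map(CsvSketch::escape)
                        .collect(Collectors.joining(",")))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
                List.of("Cell", "Format", "Formula"),
                List.of("E4", "Currency", "=PMT(B4/12,C4,D4)"),
                List.of("F4", "Currency", "=E4*C4"));
        System.out.println(toCsv(rows));
    }
}
```

Quoting only when needed keeps simple cells such as =E4*C4 unquoted, which matches the shape of the output shown above.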
The package is split into narrowly focused modules.
- pdf_to_images uses pdfbox to extract images from a PDF.
- extract_tables finds and extracts table-looking things from an image.
- extract_tables_dnn finds and extracts table-looking things from an image using a deep learning model.
- extract_cells extracts and orders cells from a table.
- ocr_image uses djl to OCR the text from an image of a cell.
- ocr_to_csv converts the directory structure that ocr_image outputs into a CSV.
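To illustrate what "orders cells" can mean in practice, here is a minimal sketch that puts cell bounding boxes into reading order: top to bottom by row, then left to right within each row. This is one plausible approach and not necessarily what extract_cells does. The CellOrderSketch class, its Box type, the orderCells method and the rowTolerance parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CellOrderSketch {

    // Hypothetical bounding box of one detected cell, in pixels.
    public static final class Box {
        public final int x, y, width, height;

        public Box(int x, int y, int width, int height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Group boxes into rows by their top edge, then sort each row left to right.
    // A box joins the current row when its top edge is within rowTolerance pixels
    // of the top edge of the first box in that row.
    public static List<List<Box>> orderCells(List<Box> boxes, int rowTolerance) {
        List<Box> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingInt(b -> b.y));

        List<List<Box>> rows = new ArrayList<>();
        for (Box box : sorted) {
            List<Box> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(box.y - last.get(0).y) <= rowTolerance) {
                last.add(box);
            } else {
                List<Box> row = new ArrayList<>();
                row.add(box);
                rows.add(row);
            }
        }
        for (List<Box> row : rows) {
            row.sort(Comparator.comparingInt(b -> b.x));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Box> boxes = List.of(
                new Box(110, 12, 100, 40), new Box(10, 10, 100, 40),
                new Box(10, 52, 100, 40), new Box(110, 50, 100, 40));
        List<List<Box>> rows = orderCells(boxes, 10);
        // Print row-column indices in the same style as the cell file names (0-0, 0-1, ...).
        for (int r = 0; r < rows.size(); r++) {
            for (int c = 0; c < rows.get(r).size(); c++) {
                Box b = rows.get(r).get(c);
                System.out.println(r + "-" + c + " at (" + b.x + ", " + b.y + ")");
            }
        }
    }
}
```

The tolerance absorbs the few pixels of skew that detected boxes in the same row usually have; a value that is too large would merge neighbouring rows.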
The outputs of a previous module can be used by a subsequent module so that they can be chained together to create the entire workflow.
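The chaining described above could be sketched as a list of stages, where each stage takes one path produced by the previous stage and returns the paths it produced. The stage bodies below are placeholders that only rewrite file names to mimic the directory layout in the demo output; a real stage would read and write image files. None of these names (PipelineSketch, run, stem) come from this package.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class PipelineSketch {

    // Feed every output path of one stage into the next stage.
    public static List<String> run(List<String> inputs,
                                   List<Function<String, List<String>>> stages) {
        List<String> current = inputs;
        for (Function<String, List<String>> stage : stages) {
            List<String> next = new ArrayList<>();
            for (String path : current) {
                next.addAll(stage.apply(path));
            }
            current = next;
        }
        return current;
    }

    // Strip the file extension: "/img_test/simple.png" becomes "/img_test/simple".
    public static String stem(String path) {
        int dot = path.lastIndexOf('.');
        return dot < 0 ? path : path.substring(0, dot);
    }

    public static void main(String[] args) {
        // Placeholder for a table extraction stage: one table image per input image.
        Function<String, List<String>> extractTables =
                image -> List.of(stem(image) + "/table-0.png");
        // Placeholder for a cell extraction stage: two cell images per table image.
        Function<String, List<String>> extractCells =
                table -> List.of(stem(table) + "/0-0.png", stem(table) + "/0-1.png");

        List<String> cells = run(
                List.of("/img_test/simple.png"),
                List.of(extractTables, extractCells));
        cells.forEach(System.out::println);
    }
}
```

Because every stage has the same shape (one path in, a list of paths out), a stage can be swapped for another, for example a deep learning table detector in place of the classical one, without touching the rest of the chain.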
1. GitHub: https://github.com/jiangnanboy
2. QQ: 2229029156
3. Email: 2229029156@qq.com
https://github.com/jiangnanboy/doc_ai
https://github.com/deepjavalibrary/djl

