docling_experiments

Docling Experiments

References

Docling OCR Engines: Default vs EasyOCR

Current Code (Default Engine)

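Your current working code does not set ocr_options, so it runs whichever OCR engine your Docling version selects by default. This README treats that default as Tesseract; several Docling releases default to EasyOCR instead, so inspect pipeline_options.ocr_options to confirm which engine you are actually running:

from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True  # keep extracted figures as images
pipeline_options.images_scale = 2.0  # scale factor applied to generated images
# ocr_options is left unset, so Docling's default OCR engine is used
# (assumed to be Tesseract throughout this README)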
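
If you would rather pick the engine explicitly than rely on the default, you can set ocr_options yourself. The sketch below assumes a recent Docling 2.x release in which `EasyOcrOptions` and `TesseractCliOcrOptions` are importable from `docling.datamodel.pipeline_options`. The helper name `build_pipeline_options` and the engine labels "easyocr" and "tesseract" are illustrative and not part of Docling.

```python
def build_pipeline_options(engine: str = "easyocr", languages=None):
    """Return PdfPipelineOptions with an explicitly chosen OCR engine.

    `engine` is an illustrative label ("easyocr" or "tesseract"),
    not a value defined by Docling.
    """
    if engine not in ("easyocr", "tesseract"):
        raise ValueError(f"unknown OCR engine: {engine!r}")

    # Imported lazily so the argument check above works without Docling installed.
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )

    options = PdfPipelineOptions()
    options.do_ocr = True
    options.generate_picture_images = True
    options.images_scale = 2.0

    if engine == "easyocr":
        # EasyOCR language codes, e.g. "en", "ch_sim"
        options.ocr_options = EasyOcrOptions(lang=languages or ["en"])
    else:
        # Tesseract language codes, e.g. "eng"; needs the tesseract binary installed
        options.ocr_options = TesseractCliOcrOptions(lang=languages or ["eng"])
    return options
```

The returned options are then typically passed to the converter through `DocumentConverter(format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)})`.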

What's the Difference?

Default Engine (Tesseract OCR)

Pros:

✅ Fast processing
✅ Good for typed text
✅ Works well with clean PDFs
✅ No extra dependencies
✅ Low memory usage

Cons:

❌ Less accurate with handwritten text
❌ Struggles with poor quality scans
❌ Limited language support out of the box (other languages need separate language data)
❌ Not great with complex backgrounds

Best for:

Digital PDFs (born-digital documents)
Clean, typed text
Standard fonts
High-quality scans
English and major Latin-script languages

EasyOCR Engine

Pros:

✅ More accurate on low-quality images
✅ Better with handwritten text
✅ Supports 80+ languages
✅ Works with complex backgrounds
✅ Better with non-Latin scripts (Chinese, Japanese, Korean, Arabic, Thai, etc.)
✅ Good with noisy images

Cons:

❌ Slower processing (uses deep learning)
❌ Higher memory usage (and needs a GPU for good speed)
❌ Larger model downloads
❌ Requires additional dependencies

Best for:

Scanned documents (photos of documents)
Handwritten notes
Poor quality images
Multi-language documents
Asian languages (Chinese, Japanese, Korean)
Complex backgrounds
Historical documents

Visual Comparison

Scenario 1: Clean Digital PDF

Document Type: Born-digital PDF with typed text
Quality: High
Background: Clean white

DEFAULT ENGINE:  ⭐⭐⭐⭐⭐ (Fast, accurate)
EASYOCR ENGINE:  ⭐⭐⭐⭐☆ (Slower, but accurate)

✅ Use: DEFAULT ENGINE (faster, no benefit from EasyOCR)

Scenario 2: Scanned Handwritten Notes

Document Type: Photo of handwritten notes
Quality: Medium
Background: Lined paper with shadows

DEFAULT ENGINE:  ⭐⭐☆☆☆ (Poor accuracy)
EASYOCR ENGINE:  ⭐⭐⭐⭐⭐ (Much better accuracy)

✅ Use: EASYOCR ENGINE (significantly better)

Scenario 3: Old Scanned Document

Document Type: 1980s photocopy scan
Quality: Low (faded, grainy)
Background: Gray with noise

DEFAULT ENGINE:  ⭐⭐☆☆☆ (Many errors)
EASYOCR ENGINE:  ⭐⭐⭐⭐☆ (Better recognition)

✅ Use: EASYOCR ENGINE (handles noise better)

Scenario 4: Multi-language Document

Document Type: English + Chinese mixed document
Quality: High
Background: Clean

DEFAULT ENGINE:  ⭐⭐☆☆☆ (English OK, Chinese poor)
EASYOCR ENGINE:  ⭐⭐⭐⭐⭐ (Both languages accurate)

✅ Use: EASYOCR ENGINE (multi-language support)

Scenario 5: Receipt/Invoice Photo

Document Type: Photo of receipt
Quality: Medium (smartphone photo)
Background: Counter, shadows, other items visible

DEFAULT ENGINE:  ⭐⭐⭐☆☆ (OK but some errors)
EASYOCR ENGINE:  ⭐⭐⭐⭐☆ (Better with complex background)

✅ Use: EASYOCR ENGINE (more robust)

When to Use Each

Use DEFAULT Engine When:

✅ Processing born-digital PDFs
✅ Clean, typed text
✅ High-quality scans
✅ Speed is priority
✅ English or major European languages
✅ Standard business documents

Examples:

Corporate reports
Books (ebooks)
Legal documents (digital)
Academic papers (PDFs)
Invoices (digital)

Use EasyOCR Engine When:

✅ Processing scanned documents
✅ Handwritten text
✅ Poor quality images
✅ Non-Latin scripts
✅ Multi-language documents
✅ Photos of documents
✅ Complex backgrounds

Examples:

Handwritten forms
Historical documents
Photos of receipts
Asian language documents
Old photocopies
Medical prescriptions
Field notes

Performance Comparison

Processing Speed

Document: 100-page PDF (rough estimates; actual times depend on hardware and page content)

DEFAULT ENGINE:    ~2-5 minutes
EASYOCR (CPU):    ~15-30 minutes
EASYOCR (GPU):    ~5-10 minutes
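
Because these timings vary so much with hardware, it is worth measuring on your own documents. The following is a minimal timing sketch using only the standard library. The function name `time_per_page` is an assumption, and `convert` stands for any callable that processes one file, such as a Docling converter's `convert` method.

```python
import time
from typing import Callable


def time_per_page(convert: Callable[[str], object], path: str,
                  pages: int, repeats: int = 3) -> float:
    """Return the best observed seconds-per-page over `repeats` runs."""
    if pages <= 0 or repeats <= 0:
        raise ValueError("pages and repeats must be positive")

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        convert(path)  # run one full conversion
        timings.append(time.perf_counter() - start)

    # The minimum is less sensitive to background noise than the mean.
    return min(timings) / pages
```

Run it once per engine on the same file and compare the two numbers. The first run with EasyOCR may include a one-time model download, which is another reason to take the best of several runs.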

Accuracy Comparison

Clean digital PDF:
DEFAULT:  95-98% accuracy
EASYOCR:  96-99% accuracy (+1-2%)

Poor quality scan:
DEFAULT:  70-80% accuracy
EASYOCR:  85-95% accuracy (+15-20%)

Handwritten:
DEFAULT:  50-70% accuracy
EASYOCR:  75-90% accuracy (+20-30%)

Chinese text:
DEFAULT:  20-40% accuracy
EASYOCR:  90-95% accuracy (+60%)
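
This README does not state how the accuracy figures were measured. One common way to check them on your own data is character-level accuracy: one minus the edit distance between the OCR output and a hand-typed reference, divided by the reference length. The sketch below implements that metric with the standard library only; the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, ocr_text: str) -> float:
    """Character accuracy in [0, 1] relative to the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_text)
    return max(0.0, 1.0 - errors / len(reference))
```

Transcribe a handful of representative pages by hand, run both engines on them, and compare `char_accuracy` for each engine before committing to one.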

Memory Usage

DEFAULT ENGINE:
- RAM: ~500MB - 1GB
- VRAM: Not required
- Disk: Minimal

EASYOCR ENGINE:
- RAM: ~2GB - 4GB
- VRAM: ~2GB (if using GPU)
- Disk: ~500MB (model downloads)

Language Support

DEFAULT (Tesseract)

Primary: English, Spanish, French, German, Italian
Good: Most Latin-script European languages
Limited (needs extra language data, often less accurate): Asian languages, Arabic, Hebrew

EasyOCR

Excellent: 80+ languages including:
- Asian: Chinese (Simplified/Traditional), Japanese, Korean, Thai, Vietnamese
- Middle East: Arabic, Persian, Hebrew, Urdu
- European: All major languages
- Other: Hindi, Bengali, Tamil, Telugu, and many more
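
Note that the two engines name languages differently: Tesseract uses three-letter codes such as "eng", while EasyOCR uses shorter codes such as "en". The sketch below shows one way to keep a single language list and translate it per engine. The table covers only a few languages, and the names `LANGUAGE_CODES` and `codes_for` are illustrative.

```python
# Engine-specific codes for a few languages (not an exhaustive list).
LANGUAGE_CODES = {
    "english": {"tesseract": "eng", "easyocr": "en"},
    "chinese_simplified": {"tesseract": "chi_sim", "easyocr": "ch_sim"},
    "japanese": {"tesseract": "jpn", "easyocr": "ja"},
    "korean": {"tesseract": "kor", "easyocr": "ko"},
    "arabic": {"tesseract": "ara", "easyocr": "ar"},
    "hindi": {"tesseract": "hin", "easyocr": "hi"},
}


def codes_for(engine: str, languages: list) -> list:
    """Translate language names into the codes one engine expects."""
    codes = []
    for name in languages:
        try:
            codes.append(LANGUAGE_CODES[name][engine])
        except KeyError:
            raise ValueError(f"no {engine!r} code known for {name!r}") from None
    return codes
```

For the English + Chinese document in Scenario 4, `codes_for("easyocr", ["english", "chinese_simplified"])` gives the list to pass as the engine's language option.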

Cost Implications

Infrastructure Costs

DEFAULT ENGINE:
- CPU: Standard server (~$50/month)
- Total: ~$50/month

EASYOCR (CPU):
- CPU: Higher specs (~$100/month)
- Total: ~$100/month

EASYOCR (GPU):
- GPU: Cloud GPU (~$200-500/month)
- Total: ~$200-500/month
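
Monthly prices are easier to compare once converted to a cost per page. The sketch below combines the monthly figures above with the processing times from the speed section. The 30-day month and the `utilization` parameter (the fraction of the month the machine is actually busy) are assumptions added for illustration.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month


def cost_per_1000_pages(monthly_cost_usd: float,
                        minutes_per_100_pages: float,
                        utilization: float = 1.0) -> float:
    """Estimate USD per 1000 pages for a machine billed monthly."""
    if minutes_per_100_pages <= 0 or not 0 < utilization <= 1:
        raise ValueError("need positive time and utilization in (0, 1]")
    busy_minutes = MINUTES_PER_MONTH * utilization
    pages_per_month = busy_minutes / minutes_per_100_pages * 100
    return monthly_cost_usd / pages_per_month * 1000


# Worst-case figures from this README: slowest time, highest price.
default_cost = cost_per_1000_pages(50, 5)
easyocr_cpu_cost = cost_per_1000_pages(100, 30)
easyocr_gpu_cost = cost_per_1000_pages(500, 10)
```

At low volume the machine sits idle most of the month, so lower `utilization` accordingly; the fixed monthly price then dominates the per-page cost for every option.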

When Cost Matters

Low volume + high quality: DEFAULT (cheaper)
High volume + low quality: EASYOCR with GPU (the accuracy gain justifies the cost)
Multi-language: EASYOCR (the stronger option in this comparison)

Decision Matrix

┌─────────────────────────┬──────────┬──────────┐
│ Use Case                │ Default  │ EasyOCR  │
├─────────────────────────┼──────────┼──────────┤
│ Digital PDFs            │    ✅    │    ☑️     │
│ High-quality scans      │    ✅    │    ☑️     │
│ Low-quality scans       │    ❌    │    ✅    │
│ Handwritten text        │    ❌    │    ✅    │
│ Asian languages         │    ❌    │    ✅    │
│ Multi-language          │    ❌    │    ✅    │
│ Complex backgrounds     │    ❌    │    ✅    │
│ Speed priority          │    ✅    │    ❌    │
│ Cost sensitive          │    ✅    │    ❌    │
│ English only            │    ✅    │    ☑️     │
│ Historical docs         │    ❌    │    ✅    │
│ Photos of documents     │    ❌    │    ✅    │
└─────────────────────────┴──────────┴──────────┘

✅ = Best choice
☑️  = Works fine
❌ = Not recommended

Quick Decision Tree

START
  │
  ├─ Is it a born-digital PDF? ──YES─→ DEFAULT
  │                            
  └─ NO
     │
     ├─ Is quality poor/handwritten? ──YES─→ EASYOCR
     │
     └─ NO
        │
        ├─ Non-Latin scripts (e.g. CJK, Arabic)? ──YES─→ EASYOCR
        │
        └─ NO
           │
           ├─ Speed is critical? ──YES─→ DEFAULT
           │
           └─ NO ──→ Try DEFAULT first, use EASYOCR if poor results

Hybrid Approach

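You can use both engines and choose one per document:

def choose_ocr_engine(document_path: str) -> str:
    """Pick an OCR engine label ("default" or "easyocr") for one document.

    NOTE: is_digital_pdf, contains_asian_languages and assess_image_quality
    are placeholder helpers that are not defined here; you must supply them.
    """

    # Born-digital PDFs already carry a text layer, so the default is enough
    if is_digital_pdf(document_path):
        return "default"  # Fast and accurate enough

    # Check language
    if contains_asian_languages(document_path):
        return "easyocr"  # Much better for CJK

    # Check quality (score assumed to lie in [0, 1], higher is better)
    quality = assess_image_quality(document_path)
    if quality < 0.5:
        return "easyocr"  # Better for poor quality

    # Everything else: clean scans in a supported language
    return "default"  # Faster for most cases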
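
The function above depends on three detection helpers that this README leaves undefined. One way to make the same logic runnable and testable without committing to a detection method is to pass the signals in explicitly. The sketch below does that; `DocumentTraits` and `choose_engine` are illustrative names, and the 0.5 threshold is carried over from the function above.

```python
from dataclasses import dataclass


@dataclass
class DocumentTraits:
    """Signals about one document, computed however you prefer."""
    is_digital: bool            # born-digital PDF with a text layer
    has_non_latin_script: bool  # e.g. Chinese, Japanese, Korean, Arabic
    quality: float              # 0.0 (worst) to 1.0 (best)


def choose_engine(traits: DocumentTraits, quality_threshold: float = 0.5) -> str:
    """Return "default" or "easyocr" using the same order of checks."""
    if traits.is_digital:
        return "default"
    if traits.has_non_latin_script:
        return "easyocr"
    if traits.quality < quality_threshold:
        return "easyocr"
    return "default"
```

Keeping detection separate from the decision makes it easy to unit-test the routing rules and to swap in better detectors later.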

Summary

Your Current Code (DEFAULT)

✅ Good for:

Most typed business documents
Clean PDFs
Fast processing needed

EasyOCR Version

✅ Good for:

Scanned documents
Handwritten text
Multi-language
Poor quality images

Rule of Thumb

Start with DEFAULT, switch to EasyOCR if:

Results are poor
You need non-Latin scripts or mixed-language documents
Documents are handwritten
Quality is low

Most users should stick with DEFAULT unless they have specific needs!
