anjijava16/docling_ocr_experiments

docling_experiments

Docling Experiments

References

  1. https://heidloff.net/article/docling/
  2. https://www.youtube.com/watch?v=B5XD-qpL0FU

Docling OCR Engines: Default vs EasyOCR

Current Code (Default Engine)

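Your current working code does not set ocr_options, so it runs whichever OCR engine Docling selects by default. This guide treats that default as Tesseract; several Docling releases default to EasyOCR instead, so confirm which engine your installed version uses: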

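from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True  # keep pictures found in the PDF as images
pipeline_options.images_scale = 2.0  # generate images at 2x scale
# ocr_options is not set, so the default OCR engine is used (assumed here to be Tesseract)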
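
To avoid depending on whatever the default happens to be, the engine can be set explicitly. The sketch below is one possible way to do that, assuming a Docling version that exposes EasyOcrOptions and TesseractOcrOptions in docling.datamodel.pipeline_options; build_converter is a hypothetical helper name, not part of Docling.

```python
def build_converter(engine="easyocr", languages=None):
    """Return a Docling DocumentConverter that uses the named OCR engine.

    `build_converter` is a hypothetical helper for this guide.
    """
    if engine not in ("easyocr", "tesseract"):
        raise ValueError(f"unknown OCR engine: {engine!r}")

    # Imported inside the function so that loading this module does not
    # pull in Docling until a converter is actually needed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        TesseractOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.generate_picture_images = True
    pipeline_options.images_scale = 2.0

    if engine == "easyocr":
        # EasyOCR language codes look like "en", "ch_sim", "ja"
        pipeline_options.ocr_options = EasyOcrOptions(lang=languages or ["en"])
    else:
        # Tesseract language codes look like "eng", "chi_sim", "jpn".
        # TesseractOcrOptions needs the tesserocr package; TesseractCliOcrOptions
        # is an alternative that calls the tesseract command-line binary.
        pipeline_options.ocr_options = TesseractOcrOptions(lang=languages or ["eng"])

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


# Usage (the file name is a placeholder):
# converter = build_converter("easyocr", ["en"])
# result = converter.convert("scan.pdf")
# print(result.document.export_to_markdown())
```

Pinning the engine this way keeps results comparable across Docling upgrades, at the cost of having to install that engine's dependencies yourself.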

What's the Difference?

Default Engine (Tesseract OCR)

Pros:

  • ✅ Fast processing
  • ✅ Good for typed text
  • ✅ Works well with clean PDFs
  • ✅ No extra dependencies
  • ✅ Low memory usage

Cons:

  • ❌ Less accurate with handwritten text
  • ❌ Struggles with poor quality scans
  • ❌ Non-Latin scripts need extra language data and are often less accurate
  • ❌ Not great with complex backgrounds

Best for:

  • Digital PDFs (born-digital documents)
  • Clean, typed text
  • Standard fonts
  • High-quality scans
  • English and major Latin-script languages

EasyOCR Engine

Pros:

  • ✅ More accurate on low-quality images
  • ✅ Better with handwritten text
  • ✅ Supports 80+ languages
  • ✅ Works with complex backgrounds
  • ✅ Better with non-Latin scripts (Chinese, Japanese, Korean, Arabic, Thai, etc.)
  • ✅ Good with noisy images

Cons:

  • ❌ Slower processing (uses deep learning)
  • ❌ Higher memory usage, and needs a GPU to reach practical speeds
  • ❌ Larger model downloads
  • ❌ Requires additional dependencies

Best for:

  • Scanned documents and photos of documents
  • Handwritten notes
  • Poor quality images
  • Multi-language documents
  • Asian languages (Chinese, Japanese, Korean)
  • Complex backgrounds
  • Historical documents

Visual Comparison

Scenario 1: Clean Digital PDF

Document Type: Born-digital PDF with typed text
Quality: High
Background: Clean white

DEFAULT ENGINE:  ⭐⭐⭐⭐⭐ (Fast, accurate)
EASYOCR ENGINE:  ⭐⭐⭐⭐☆ (Slower, but accurate)

✅ Use: DEFAULT ENGINE (faster, no benefit from EasyOCR)

Scenario 2: Scanned Handwritten Notes

Document Type: Photo of handwritten notes
Quality: Medium
Background: Lined paper with shadows

DEFAULT ENGINE:  ⭐⭐☆☆☆ (Poor accuracy)
EASYOCR ENGINE:  ⭐⭐⭐⭐⭐ (Much better accuracy)

✅ Use: EASYOCR ENGINE (significantly better)

Scenario 3: Old Scanned Document

Document Type: 1980s photocopy scan
Quality: Low (faded, grainy)
Background: Gray with noise

DEFAULT ENGINE:  ⭐⭐☆☆☆ (Many errors)
EASYOCR ENGINE:  ⭐⭐⭐⭐☆ (Better recognition)

✅ Use: EASYOCR ENGINE (handles noise better)

Scenario 4: Multi-language Document

Document Type: English + Chinese mixed document
Quality: High
Background: Clean

DEFAULT ENGINE:  ⭐⭐☆☆☆ (English OK, Chinese poor)
EASYOCR ENGINE:  ⭐⭐⭐⭐⭐ (Both languages accurate)

✅ Use: EASYOCR ENGINE (multi-language support)

Scenario 5: Receipt/Invoice Photo

Document Type: Photo of receipt
Quality: Medium (smartphone photo)
Background: Counter, shadows, other items visible

DEFAULT ENGINE:  ⭐⭐⭐☆☆ (OK but some errors)
EASYOCR ENGINE:  ⭐⭐⭐⭐☆ (Better with complex background)

✅ Use: EASYOCR ENGINE (more robust)

When to Use Each

Use DEFAULT Engine When:

  • ✅ Processing born-digital PDFs
  • ✅ Clean, typed text
  • ✅ High-quality scans
  • ✅ Speed is priority
  • ✅ English or major European languages
  • ✅ Standard business documents

Examples:

  • Corporate reports
  • Books (ebooks)
  • Legal documents (digital)
  • Academic papers (PDFs)
  • Invoices (digital)

Use EasyOCR Engine When:

  • ✅ Processing scanned documents
  • ✅ Handwritten text
  • ✅ Poor quality images
  • ✅ Non-Latin scripts
  • ✅ Multi-language documents
  • ✅ Photos of documents
  • ✅ Complex backgrounds

Examples:

  • Handwritten forms
  • Historical documents
  • Photos of receipts
  • Asian language documents
  • Old photocopies
  • Medical prescriptions
  • Field notes

Performance Comparison

Processing Speed

Document: 100-page PDF (rough estimates; actual times depend on hardware and page content)

DEFAULT ENGINE:    ~2-5 minutes
EASYOCR (CPU):    ~15-30 minutes
EASYOCR (GPU):    ~5-10 minutes

Accuracy Comparison

Clean digital PDF:
DEFAULT:  95-98% accuracy
EASYOCR:  96-99% accuracy (+1 point)

Poor quality scan:
DEFAULT:  70-80% accuracy
EASYOCR:  85-95% accuracy (+15 points)

Handwritten:
DEFAULT:  50-70% accuracy
EASYOCR:  75-90% accuracy (+20-25 points)

Chinese text:
DEFAULT:  20-40% accuracy
EASYOCR:  90-95% accuracy (+55-70 points)
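
The percentages above are easier to trust once you measure them on your own pages. One common way to score OCR output is character accuracy: 1 minus the edit distance to a hand-typed reference, divided by the reference length. The sketch below is a minimal pure-Python version; levenshtein and char_accuracy are names chosen for this guide, and the metric itself is an assumption about how "accuracy" is defined here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Character accuracy in [0.0, 1.0], relative to the reference text."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    distance = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - distance / len(reference))


# Usage: score both engines against the same hand-typed reference page.
# default_score = char_accuracy(reference_text, default_engine_text)
# easyocr_score = char_accuracy(reference_text, easyocr_text)
```

Scoring a handful of representative pages per document type is usually enough to see which engine is ahead for your material.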

Memory Usage

DEFAULT ENGINE:
- RAM: ~500MB - 1GB
- VRAM: Not required
- Disk: Minimal

EASYOCR ENGINE:
- RAM: ~2GB - 4GB
- VRAM: ~2GB (if using GPU)
- Disk: ~500MB (model downloads)

Language Support

DEFAULT (Tesseract)

Primary: English, Spanish, French, German, Italian
Good: Most Latin-script European languages
Limited: Asian languages, Arabic, Hebrew

EasyOCR

Excellent: 80+ languages including:
- Asian: Chinese (Simplified/Traditional), Japanese, Korean, Thai, Vietnamese
- Middle East: Arabic, Persian, Hebrew, Urdu
- European: All major languages
- Other: Hindi, Bengali, Tamil, Telugu, and many more
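
The two engines name languages differently, which is easy to trip over when switching. EasyOCR uses short codes such as en and ch_sim, while Tesseract uses three-letter style codes such as eng and chi_sim. The lookup below is a small illustrative sketch covering a few of the languages listed above; LANGUAGE_CODES and codes_for are hypothetical names, and the table is deliberately incomplete.

```python
# Language code per engine for a few common languages (not exhaustive).
LANGUAGE_CODES = {
    "english": {"easyocr": "en", "tesseract": "eng"},
    "chinese_simplified": {"easyocr": "ch_sim", "tesseract": "chi_sim"},
    "chinese_traditional": {"easyocr": "ch_tra", "tesseract": "chi_tra"},
    "japanese": {"easyocr": "ja", "tesseract": "jpn"},
    "korean": {"easyocr": "ko", "tesseract": "kor"},
    "arabic": {"easyocr": "ar", "tesseract": "ara"},
    "thai": {"easyocr": "th", "tesseract": "tha"},
}


def codes_for(engine: str, languages: list[str]) -> list[str]:
    """Translate readable language names into the codes `engine` expects."""
    try:
        return [LANGUAGE_CODES[name][engine] for name in languages]
    except KeyError as exc:
        raise ValueError(f"unsupported language or engine: {exc.args[0]!r}") from exc


# Usage: the same document languages, expressed for each engine.
# codes_for("easyocr", ["english", "chinese_simplified"])
# codes_for("tesseract", ["english", "chinese_simplified"])
```

The resulting list is what you would pass as the lang value of the engine's options object.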

Cost Implications

Infrastructure Costs

DEFAULT ENGINE:
- CPU: Standard server (~$50/month)
- Total: ~$50/month

EASYOCR (CPU):
- CPU: Higher specs (~$100/month)
- Total: ~$100/month

EASYOCR (GPU):
- GPU: Cloud GPU (~$200-500/month)
- Total: ~$200-500/month

When Cost Matters

  • Low volume + high quality: DEFAULT (cheaper)
  • High volume + low quality: EASYOCR with GPU (the accuracy gain is worth the cost)
  • Multi-language: EASYOCR (the default engine handles non-Latin scripts poorly)
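
Monthly prices are hard to compare without throughput. A rough way to combine the monthly figures with the processing times from the Performance Comparison section is cost per 1,000 pages. The sketch below assumes a 30-day month and a server that is busy around the clock, which is optimistic; cost_per_1000_pages is a hypothetical helper, and the midpoints used in the usage example are taken from the ranges in this guide, not from measurements.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month, running 24/7


def cost_per_1000_pages(monthly_cost_usd: float, minutes_per_100_pages: float) -> float:
    """Infrastructure cost in USD for every 1,000 pages processed."""
    pages_per_month = MINUTES_PER_MONTH / minutes_per_100_pages * 100
    return monthly_cost_usd / pages_per_month * 1000


# Usage with midpoints of the ranges quoted in this guide (illustrative only):
default_cost = cost_per_1000_pages(50, 3.5)        # ~$50/month, ~2-5 min
easyocr_cpu_cost = cost_per_1000_pages(100, 22.5)  # ~$100/month, ~15-30 min
easyocr_gpu_cost = cost_per_1000_pages(350, 7.5)   # ~$200-500/month, ~5-10 min
```

At lower utilization the per-page cost rises proportionally for every option, so the ranking matters more than the absolute values.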

Decision Matrix

┌─────────────────────────┬──────────┬──────────┐
│ Use Case                │ Default  │ EasyOCR  │
├─────────────────────────┼──────────┼──────────┤
│ Digital PDFs            │    ✅    │    ☑️     │
│ High-quality scans      │    ✅    │    ☑️     │
│ Low-quality scans       │    ❌    │    ✅    │
│ Handwritten text        │    ❌    │    ✅    │
│ Asian languages         │    ❌    │    ✅    │
│ Multi-language          │    ❌    │    ✅    │
│ Complex backgrounds     │    ❌    │    ✅    │
│ Speed priority          │    ✅    │    ❌    │
│ Cost sensitive          │    ✅    │    ❌    │
│ English only            │    ✅    │    ☑️     │
│ Historical docs         │    ❌    │    ✅    │
│ Photos of documents     │    ❌    │    ✅    │
└─────────────────────────┴──────────┴──────────┘

✅ = Best choice
☑️  = Works fine
❌ = Not recommended

Quick Decision Tree

START
  │
  ├─ Is it a born-digital PDF? ──YES─→ DEFAULT
  │                            
  └─ NO
     │
     ├─ Is quality poor/handwritten? ──YES─→ EASYOCR
     │
     └─ NO
        │
        ├─ Non-Latin scripts (CJK, Arabic)? ──YES─→ EASYOCR
        │
        └─ NO
           │
           ├─ Speed is critical? ──YES─→ DEFAULT
           │
           └─ NO ──→ Try DEFAULT first, use EASYOCR if poor results

Hybrid Approach

You can combine both engines by picking one per document:

def choose_ocr_engine(document_path):
    """Pick an OCR engine name ("default" or "easyocr") for one document.

    is_digital_pdf, contains_asian_languages and assess_image_quality are
    placeholder helpers that you need to implement for your own pipeline.
    """

    # Born-digital PDFs already contain clean text
    if is_digital_pdf(document_path):
        return "default"  # Fast and accurate enough

    # Check language
    if contains_asian_languages(document_path):
        return "easyocr"  # Much better for CJK

    # Check quality (score from 0.0 = worst to 1.0 = best)
    quality = assess_image_quality(document_path)
    if quality < 0.5:
        return "easyocr"  # Better for poor quality

    # Everything else
    return "default"  # Faster for most cases
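
The function above depends on helpers that still have to be written. A variant that is easier to test separates detection from decision: describe each document with a few traits, then apply the rules from the Quick Decision Tree. This is a minimal sketch under that assumption; DocumentTraits and pick_engine are hypothetical names, and the 0.5 threshold is simply carried over from the code above.

```python
from dataclasses import dataclass


@dataclass
class DocumentTraits:
    """What is known about a document before OCR (all fields are assumptions
    supplied by your own detection code)."""
    born_digital: bool
    non_latin_script: bool
    quality: float             # 0.0 = worst, 1.0 = best
    handwritten: bool = False


def pick_engine(traits: DocumentTraits, quality_threshold: float = 0.5) -> str:
    """Apply the decision tree: returns "default" or "easyocr"."""
    if traits.born_digital:
        return "default"
    if traits.handwritten or traits.quality < quality_threshold:
        return "easyocr"
    if traits.non_latin_script:
        return "easyocr"
    # Clean Latin-script scans: start with the faster engine
    return "default"


# Usage: a grainy photocopy in English
engine = pick_engine(DocumentTraits(born_digital=False, non_latin_script=False, quality=0.3))
```

Keeping the rules in a pure function makes it straightforward to adjust the threshold or add a trait without touching the detection code.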

Summary

Your Current Code (DEFAULT)

Good for:

  • Most business documents
  • Clean PDFs
  • Cases where fast processing is needed

EasyOCR Version

Good for:

  • Scanned documents
  • Handwritten text
  • Multi-language
  • Poor quality images

Rule of Thumb

Start with DEFAULT, switch to EasyOCR if:

  • Results are poor
  • You need non-Latin scripts
  • Documents are handwritten
  • Quality is low

Most users should stick with DEFAULT unless they have specific needs!
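
The rule of thumb can be automated as a fallback: run the default engine, check whether the text looks usable, and only then pay for an EasyOCR pass. The sketch below is one possible shape for that, assuming you already have two functions that return extracted text for a path. looks_poor is a crude hypothetical heuristic (share of word-like tokens), and it will misjudge number-heavy documents such as invoices.

```python
from typing import Callable, Tuple


def looks_poor(text: str, min_word_ratio: float = 0.6) -> bool:
    """Heuristic: treat output as poor when few tokens are purely alphabetic."""
    tokens = text.split()
    if not tokens:
        return True
    wordlike = sum(1 for t in tokens if t.strip(".,;:!?()\"'").isalpha())
    return wordlike / len(tokens) < min_word_ratio


def ocr_with_fallback(
    path: str,
    run_default: Callable[[str], str],
    run_easyocr: Callable[[str], str],
    is_poor: Callable[[str], bool] = looks_poor,
) -> Tuple[str, str]:
    """Return (text, engine_used), retrying with EasyOCR on poor output."""
    text = run_default(path)
    if is_poor(text):
        return run_easyocr(path), "easyocr"
    return text, "default"


# Usage with stand-in functions; replace them with real converter calls.
# text, engine = ocr_with_fallback("scan.pdf", run_default, run_easyocr)
```

Because the slower engine only runs when the first pass fails the check, clean documents keep the speed of the default path.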
