DeepSeek-OCR Tutorial: From Messy Documents to Structured Data with Python
Ready to unlock the data trapped in your scanned documents, images, and PDFs? DeepSeek-OCR isn’t just another OCR tool; it’s a generative AI model that understands document structure and outputs clean, structured Markdown. This tutorial is your step-by-step guide to harnessing its power with Python, transforming messy visual information into analysis-ready data.
Executive Overview
In our previous Radar post on DeepSeek-OCR, we explored how this multimodal vision-language model redefines document intelligence by treating OCR as a text generation task. This tutorial focuses on the practical application: setting up your environment, loading the model, and performing OCR on both single images and multi-page PDFs. By the end, you’ll have a working Python script to convert your documents into structured Markdown, ready for further processing or integration into AI workflows.
1. Setting Up Your Environment
To begin, ensure you have a suitable Python environment. A dedicated virtual environment is highly recommended.
Prerequisites:
- Python 3.10+
- NVIDIA GPU (Recommended): For optimal performance, a CUDA-enabled GPU with at least 8GB VRAM is highly recommended. While DeepSeek-OCR can run on CPU, inference will be significantly slower.
- Git LFS (optional): Only needed if you plan to clone the model repository with git. The from_pretrained call used below downloads the weights directly from the Hugging Face Hub.
Installation Steps:
Create and Activate Virtual Environment:
# Create a new virtual environment
python3 -m venv .venv
# Activate it (macOS/Linux)
source .venv/bin/activate
# On Windows, use: .venv\Scripts\activate
Install Core Libraries: Install PyTorch, Hugging Face Transformers, and Pillow.
# Install PyTorch (adjust for your CUDA version if needed)
pip install torch torchvision torchaudio
# Install Hugging Face Transformers, Tokenizers, and Pillow
pip install transformers tokenizers pillow
# Install pypdfium2 for PDF processing
pip install pypdfium2
# Optional: install Flash Attention for a GPU performance boost
# Note: may require specific GPU/driver compatibility
pip install flash-attn --no-build-isolation
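Before moving on, it's worth a quick sanity check that PyTorch can actually see your GPU. The snippet below uses only standard PyTorch calls and makes no DeepSeek-specific assumptions:
import torch
# Confirm the PyTorch install and report the available GPU (if any)
print(f"PyTorch version: {torch.__version__}")
if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"CUDA device: {torch.cuda.get_device_name(0)} ({vram_gb:.1f} GB VRAM)")
else:
    print("No CUDA GPU detected - inference will run on CPU and be much slower.")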
2. Loading the DeepSeek-OCR Model
Once your environment is set up, you can load the DeepSeek-OCR model and its tokenizer. The first time you run this, it will automatically download several gigabytes of model weights from the Hugging Face Hub.
from transformers import AutoModel, AutoTokenizer
import torch
from PIL import Image
# Define the model repository on Hugging Face
model_name = "deepseek-ai/DeepSeek-OCR"
# Load the tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Load the model
print("Loading model (this may download weights on first run)...")
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    # Use Flash Attention 2 on GPU; this requires the optional flash-attn package.
    # Remove the argument to fall back to the default attention implementation.
    _attn_implementation="flash_attention_2" if torch.cuda.is_available() else None
)
# Move model to GPU and set to bfloat16 for efficiency if CUDA is available
if torch.cuda.is_available():
    model = model.eval().cuda().to(torch.bfloat16)
else:
    model = model.eval()  # CPU fallback
print("DeepSeek-OCR model loaded successfully!")
3. OCR on Images: Your First Document
Now, let’s perform OCR on a single image. Ensure you have a sample image file (e.g., sample_document.png) in your working directory.
# Path to your sample image file
image_path = "sample_document.png"
try:
    image = Image.open(image_path).convert("RGB")
except FileNotFoundError:
    print(f"Error: The image file '{image_path}' was not found. Please ensure it exists.")
    exit()
# The prompt instructs the model to convert the document to Markdown
prompt = "<image>\n<|grounding|>Convert the document to markdown."
# Prepare inputs for the model
inputs = tokenizer(
    [prompt],
    [image],
    return_tensors="pt",
    padding="longest"
)
# Move inputs to GPU if available
if torch.cuda.is_available():
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
# Generate OCR output
print("Running inference on image...")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048)
# Decode the output to get the Markdown text
ocr_result = tokenizer.decode(output[0], skip_special_tokens=True)
print("\n--- OCR Result (Markdown) ---")
print(ocr_result)
# Optionally, save the output to a Markdown file
output_filename = "output_image_document.md"
with open(output_filename, "w", encoding="utf-8") as f:
    f.write(ocr_result)
print(f"\nOCR result saved to {output_filename}")
4. Mastering PDFs: Multi-Page Document Processing
DeepSeek-OCR processes images. To handle multi-page PDFs, you first need to convert each page into an image. The pypdfium2 library is ideal for this.
import pypdfium2 as pdfium
def convert_pdf_to_images(pdf_path):
    """Converts a PDF file into a list of PIL Image objects, one per page."""
    pdf = pdfium.PdfDocument(pdf_path)
    # Render at 2x scale (~144 DPI) so small text stays legible for the OCR model
    return [page.render(scale=2).to_pil() for page in pdf]
# Path to your sample PDF file
pdf_path = "sample_document.pdf"
all_pages_markdown = ""
try:
    pdf_images = convert_pdf_to_images(pdf_path)
except FileNotFoundError:
    print(f"Error: The PDF file '{pdf_path}' was not found. Please ensure it exists.")
    exit()
print(f"Processing {len(pdf_images)} pages from PDF...")
for i, page_image in enumerate(pdf_images):
    print(f"  Processing page {i+1}/{len(pdf_images)}...")
    # Use the same inference logic as for single images
    inputs = tokenizer([prompt], [page_image], return_tensors="pt", padding="longest")
    if torch.cuda.is_available():
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=2048)
    page_markdown = tokenizer.decode(output[0], skip_special_tokens=True)
    # Add a separator between pages
    all_pages_markdown += f"\n\n--- Page {i+1} ---\n\n" + page_markdown
# Save the combined Markdown output for the entire PDF
output_pdf_filename = "output_pdf_document.md"
with open(output_pdf_filename, "w", encoding="utf-8") as f:
    f.write(all_pages_markdown)
print(f"\nPDF processing complete. Combined Markdown saved to {output_pdf_filename}")
5. What’s Next: Integrating DeepSeek-OCR into Your Workflows
DeepSeek-OCR provides a powerful foundation for automating document-centric tasks. Here are some ideas for integrating it into your projects:
- Automated Data Entry: Extract structured data from invoices, receipts, and forms directly into databases or spreadsheets.
- Knowledge Base Creation: Convert scanned manuals, reports, and articles into searchable Markdown for LLM-powered knowledge bases (see the chunking sketch after this list).
- Process Automation: Combine DeepSeek-OCR with AI agents to create end-to-end workflows that read documents, extract information, and trigger subsequent actions.
- Data Analysis: Quickly digitize historical documents for quantitative analysis.
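To make the knowledge-base idea concrete, here is a minimal sketch that splits the Markdown produced in section 4 into heading-level chunks ready for embedding or indexing. The splitting rule is deliberately simple and purely illustrative; production pipelines typically use a dedicated Markdown-aware splitter:
# Split a Markdown document into chunks, one per top- or second-level heading
def split_markdown_by_heading(markdown_text):
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith(("# ", "## ")) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

with open("output_pdf_document.md", encoding="utf-8") as f:
    chunks = split_markdown_by_heading(f.read())
print(f"Created {len(chunks)} chunks ready for embedding or indexing.")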