DeepSeek-OCR Tutorial: From Messy Documents to Structured Data with Python
Ready to unlock the data trapped in your scanned documents, images, and PDFs? DeepSeek-OCR isn’t just another OCR tool; it’s a generative AI model that understands document structure and outputs clean, structured Markdown. This tutorial is your step-by-step guide to harnessing its power with Python, transforming messy visual information into analysis-ready data.
Executive Overview
In our previous Radar post on DeepSeek-OCR, we explored how this multimodal vision-language model redefines document intelligence by treating OCR as a text generation task. This tutorial focuses on the practical application: setting up your environment, loading the model, and performing OCR on both single images and multi-page PDFs. By the end, you’ll have a working Python script to convert your documents into structured Markdown, ready for further processing or integration into AI workflows.
1. Setting Up Your Environment
To begin, ensure you have a suitable Python environment. A dedicated virtual environment is highly recommended.
Prerequisites:
- Python 3.10+
- NVIDIA GPU (Recommended): For optimal performance, a CUDA-enabled GPU with at least 8GB VRAM is highly recommended. While DeepSeek-OCR can run on CPU, inference will be significantly slower.
- Git LFS (optional): Only needed if you plan to clone the model repository with git. The from_pretrained call used below downloads the weights directly from the Hugging Face Hub.
Installation Steps:
Create and Activate Virtual Environment:
# Create a new virtual environment
python3 -m venv .venv
# Activate it (macOS/Linux)
source .venv/bin/activate
# On Windows, use: .venv\Scripts\activate
Install Core Libraries: Install PyTorch, Hugging Face Transformers, and Pillow.
# Install PyTorch (adjust for your CUDA version if needed)
pip install torch torchvision torchaudio
# Install Hugging Face Transformers, Tokenizers, and Pillow
pip install transformers tokenizers pillow
# Install pypdfium2 for PDF processing
pip install pypdfium2
# Optional: install Flash Attention for a GPU performance boost
# Note: may require specific GPU/driver compatibility
pip install flash-attn --no-build-isolation
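Before moving on, it's worth a quick sanity check that PyTorch can actually see your GPU. The snippet below uses only standard PyTorch calls and makes no DeepSeek-specific assumptions:
import torch
# Confirm the PyTorch install and report the available GPU (if any)
print(f"PyTorch version: {torch.__version__}")
if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"CUDA device: {torch.cuda.get_device_name(0)} ({vram_gb:.1f} GB VRAM)")
else:
    print("No CUDA GPU detected - inference will run on CPU and be much slower.")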
2. Loading the DeepSeek-OCR Model
Once your environment is set up, you can load the DeepSeek-OCR model and its tokenizer. The first time you run this, it will automatically download several gigabytes of model weights from the Hugging Face Hub.
from transformers import AutoModel, AutoTokenizer
import torch
from PIL import Image
# Define the model repository on Hugging Face
model_name = "deepseek-ai/DeepSeek-OCR"
# Load the tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Load the model
print("Loading model (this may download weights on first run)...")
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    # Use Flash Attention 2 on GPU; this requires the optional flash-attn package.
    # Remove the argument to fall back to the default attention implementation.
    _attn_implementation="flash_attention_2" if torch.cuda.is_available() else None
)
# Move model to GPU and set to bfloat16 for efficiency if CUDA is available
if torch.cuda.is_available():
    model = model.eval().cuda().to(torch.bfloat16)
else:
    model = model.eval()  # CPU fallback
print("DeepSeek-OCR model loaded successfully!")
3. OCR on Images: Your First Document
Now, let’s perform OCR on a single image. Ensure you have a sample image file (e.g., sample_document.png) in your working directory.
# Path to your sample image file
image_path = "sample_document.png"
try:
    image = Image.open(image_path).convert("RGB")
except FileNotFoundError:
    print(f"Error: The image file '{image_path}' was not found. Please ensure it exists.")
    exit()
# The prompt instructs the model to convert the document to Markdown
prompt = "<image>\n<|grounding|>Convert the document to markdown."
# Prepare inputs for the model
inputs = tokenizer(
    [prompt],
    [image],
    return_tensors="pt",
    padding="longest"
)
# Move inputs to GPU if available
if torch.cuda.is_available():
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
# Generate OCR output
print("Running inference on image...")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048)
# Decode the output to get the Markdown text
ocr_result = tokenizer.decode(output[0], skip_special_tokens=True)
print("\n--- OCR Result (Markdown) ---")
print(ocr_result)
# Optionally, save the output to a Markdown file
output_filename = "output_image_document.md"
with open(output_filename, "w", encoding="utf-8") as f:
    f.write(ocr_result)
print(f"\nOCR result saved to {output_filename}")
4. Mastering PDFs: Multi-Page Document Processing
DeepSeek-OCR processes images. To handle multi-page PDFs, you first need to convert each page into an image. The pypdfium2 library is ideal for this.
import pypdfium2 as pdfium
def convert_pdf_to_images(pdf_path):
    """Converts a PDF file into a list of PIL Image objects, one per page."""
    pdf = pdfium.PdfDocument(pdf_path)
    # Render at 2x scale (~144 DPI) so small text stays legible for the OCR model
    return [page.render(scale=2).to_pil() for page in pdf]
# Path to your sample PDF file
pdf_path = "sample_document.pdf"
all_pages_markdown = ""
try:
    pdf_images = convert_pdf_to_images(pdf_path)
except FileNotFoundError:
    print(f"Error: The PDF file '{pdf_path}' was not found. Please ensure it exists.")
    exit()
print(f"Processing {len(pdf_images)} pages from PDF...")
for i, page_image in enumerate(pdf_images):
    print(f"  Processing page {i+1}/{len(pdf_images)}...")
    # Use the same inference logic as for single images
    inputs = tokenizer([prompt], [page_image], return_tensors="pt", padding="longest")
    if torch.cuda.is_available():
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=2048)
    page_markdown = tokenizer.decode(output[0], skip_special_tokens=True)
    # Add a separator between pages
    all_pages_markdown += f"\n\n--- Page {i+1} ---\n\n" + page_markdown
# Save the combined Markdown output for the entire PDF
output_pdf_filename = "output_pdf_document.md"
with open(output_pdf_filename, "w", encoding="utf-8") as f:
    f.write(all_pages_markdown)
print(f"\nPDF processing complete. Combined Markdown saved to {output_pdf_filename}")
5. What’s Next: Integrating DeepSeek-OCR into Your Workflows
DeepSeek-OCR provides a powerful foundation for automating document-centric tasks. Here are some ideas for integrating it into your projects:
- Automated Data Entry: Extract structured data from invoices, receipts, and forms directly into databases or spreadsheets.
- Knowledge Base Creation: Convert scanned manuals, reports, and articles into searchable Markdown for LLM-powered knowledge bases (see the chunking sketch after this list).
- Process Automation: Combine DeepSeek-OCR with AI agents to create end-to-end workflows that read documents, extract information, and trigger subsequent actions.
- Data Analysis: Quickly digitize historical documents for quantitative analysis.
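To make the knowledge-base idea concrete, here is a minimal sketch that splits the Markdown produced in section 4 into heading-level chunks ready for embedding or indexing. The splitting rule is deliberately simple and purely illustrative; production pipelines typically use a dedicated Markdown-aware splitter:
# Split a Markdown document into chunks, one per top- or second-level heading
def split_markdown_by_heading(markdown_text):
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith(("# ", "## ")) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

with open("output_pdf_document.md", encoding="utf-8") as f:
    chunks = split_markdown_by_heading(f.read())
print(f"Created {len(chunks)} chunks ready for embedding or indexing.")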