Donut (🍩), or Document Understanding Transformer, is an OCR-free, end-to-end Transformer model for document understanding. Unlike traditional pipelines, Donut does not depend on an external OCR engine or API, yet it achieves state-of-the-art results on visual document tasks such as document classification and information extraction (also known as document parsing). Datasets for left-to-right languages like English, Spanish, and Chinese are readily available and can be generated with SynthDoG, but creating datasets for right-to-left (RTL) languages, such as Arabic, Urdu, and Persian, is more involved.
We'll use SynthDoG 🐶, a Synthetic Document Generator, to make model pretraining adaptable across various languages and domains.
The main languages that use right-to-left scripts include Arabic, Urdu, Persian, and Hebrew.
Step 1: Installing Required Libraries
First, clone the SynthDoG-RTL GitHub repository to get access to all the necessary tools and configurations:
This repository contains everything you need to get started, including configuration examples, templates, and background resources.
Next, install the required dependencies:
Step 2: Setting Up Your Project Structure
Inside the cloned Synthdog-RTL directory, organize the project with the following structure:
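As a rough sketch, the two folders this guide refers to later (resources/corpus/ and resources/font/ur/) can be created from Python. The helper name scaffold is hypothetical and not part of SynthDoG, and the repository may expect additional folders that this sketch does not touch.

```python
from pathlib import Path


def scaffold(root: Path, lang: str = "ur") -> list[Path]:
    """Create the corpus and font folders named in this guide.

    Assumption: only these two folders are handled here; anything else
    the repository ships or expects is left as it is.
    """
    folders = [
        root / "resources" / "corpus",
        root / "resources" / "font" / lang,
    ]
    for folder in folders:
        # exist_ok avoids an error when the clone already has the folder
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

Calling scaffold(Path("."), "ur") from inside the cloned directory creates the folders for Urdu; passing a different language code creates a separate font folder for that language.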
Explanation:
Step 3: Creating config_ur.yaml Configuration File
Below is a sample config_ur.yaml file configured for Urdu. This file determines how your synthetic dataset will be generated, including text layout, image effects, and dataset size:
Step 4: Creating Sample Corpus (urdu_sample.txt)
Create a text file named urdu_sample.txt inside the resources/corpus/ folder. This file should contain sample Urdu text paragraphs. You can replace the text with any other RTL language content:
Example of urdu_sample.txt:
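A minimal sketch of writing such a corpus file from Python and sanity-checking that it really is right-to-left text. The two Urdu sentences are illustrative placeholders, and write_corpus and rtl_ratio are hypothetical helpers, not part of SynthDoG. The check relies on the Unicode bidirectional class of each letter ("R" or "AL" for right-to-left scripts).

```python
import unicodedata
from pathlib import Path

# Placeholder sentences; replace with real paragraphs in your target language.
SAMPLE_LINES = [
    "اردو پاکستان کی قومی زبان ہے۔",
    "یہ مصنوعی دستاویزات بنانے کے لیے ایک نمونہ متن ہے۔",
]


def write_corpus(path: Path, lines: list[str]) -> None:
    """Write one paragraph per line, always encoded as UTF-8."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def rtl_ratio(text: str) -> float:
    """Return the share of letters whose Unicode bidi class is RTL."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    rtl = sum(unicodedata.bidirectional(ch) in ("R", "AL") for ch in letters)
    return rtl / len(letters)
```

For example, write_corpus(Path("resources/corpus/urdu_sample.txt"), SAMPLE_LINES) creates the file, and a low rtl_ratio on its contents suggests the corpus is in a left-to-right script or was saved with the wrong encoding.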
Step 5: Adding Fonts
Place .ttf font files for Urdu in the resources/font/ur/ directory. Ensure the font supports the language you are targeting.
To get high-quality fonts for your target RTL language, you can download them from Google Fonts. Here's how you can do it:
Make sure that the font directory is set in config_ur.yaml.
This process can be repeated for other languages by creating new directories under resources/font/ (e.g., resources/font/arabic for Arabic fonts).
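Before generating anything, it can help to confirm that the font folder actually contains .ttf files and that the config file mentions that folder. The sketch below does only that: check_fonts is a hypothetical helper, the config check is a plain substring search rather than real YAML parsing, and it does not verify that a font has glyphs for your script.

```python
from pathlib import Path


def check_fonts(
    root: Path,
    config_name: str = "config_ur.yaml",
    font_rel: str = "resources/font/ur",
) -> list[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    font_dir = root / font_rel
    config_path = root / config_name

    # Collect .ttf files only if the folder exists at all
    fonts = sorted(font_dir.glob("*.ttf")) if font_dir.is_dir() else []
    if not fonts:
        problems.append(f"no .ttf files found in {font_rel}")

    if not config_path.is_file():
        problems.append(f"{config_name} does not exist")
    elif font_rel not in config_path.read_text(encoding="utf-8"):
        # Assumption: the config refers to the folder by this relative path
        problems.append(f"{config_name} does not mention {font_rel}")
    return problems
```

Running check_fonts(Path(".")) from the project root returns an empty list when both checks pass; for another language, pass its config name and font folder, for example font_rel="resources/font/arabic".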
Step 6: Generating the Dataset
Run the following command in a terminal to generate your dataset. Adjust the parameters -c (number of samples) and -w (number of workers) as needed:
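For scripted runs, the command line can be assembled in Python. This sketch assumes the RTL repository keeps the upstream SynthDoG invocation through the synthtiger CLI, with template.py as the script and SynthDoG as the template name; check the repository's README before relying on it. The output folder ./outputs/SynthDoG_ur and the helper build_command are hypothetical.

```python
import shlex


def build_command(
    config: str = "config_ur.yaml",
    output_dir: str = "./outputs/SynthDoG_ur",  # hypothetical output folder
    count: int = 50,
    workers: int = 4,
) -> list[str]:
    """Build the argument list for a SynthDoG-style synthtiger run."""
    return [
        "synthtiger",
        "-o", output_dir,    # where generated samples are written
        "-c", str(count),    # number of samples
        "-w", str(workers),  # number of workers
        "-v",                # verbose output
        "template.py",       # assumed: same script name as upstream SynthDoG
        "SynthDoG",          # assumed: same template name as upstream SynthDoG
        config,
    ]


# Print the command so it can be reviewed or pasted into a shell
print(shlex.join(build_command(count=100, workers=2)))
```

Passing the returned list to subprocess.run(..., check=True) would execute it; printing it first makes it easy to confirm the arguments before generating a large dataset.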
Parameter Explanation:
Step 7: Modifying Configuration for Different Effects
Here’s a brief guide to some parameters you can tweak in the config_ur.yaml file:
Step 8: Extending to Other RTL Languages
To generate synthetic datasets for other RTL languages, repeat the steps above and:
This method is suitable for generating synthetic data for Arabic, Urdu, Persian, Hebrew, and similar right-to-left languages.
Conclusion
This guide helps you create high-quality synthetic datasets for Donut using SynthDoG. With the flexibility of config.yaml, you can adjust parameters to match the specific needs of your project and target language.
For more information and updates, refer to the