Donut (🍩), or Document Understanding Transformer, is an OCR-free, end-to-end Transformer model for document understanding. Unlike traditional pipelines, Donut does not depend on external OCR engines or APIs, yet it achieves state-of-the-art results across a range of visual document tasks, such as document classification and information extraction (also known as document parsing). Datasets for left-to-right languages like English, Spanish, and Chinese are readily available and can be generated with SynthDoG, but creating datasets for right-to-left (RTL) languages, such as Arabic, Urdu, and Hebrew, is more involved.

We'll use SynthDoG 🐶, a Synthetic Document Generator, to make model pretraining adaptable across various languages and domains.

The main languages that use right-to-left scripts include Arabic, Urdu, Persian, and Hebrew.

Step 1: Installing Required Libraries

First, clone the SynthDoG-RTL GitHub repository to get access to all the necessary tools and configurations:

This repository contains everything you need to get started, including configuration examples, templates, and background resources.

Next, install the required dependencies:

Step 2: Setting Up Your Project Structure

Inside the cloned SynthDoG-RTL directory, organize the project with the following structure:
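If you prefer to create the folders from a script, here is a minimal sketch. It only covers the paths this guide mentions later (config_ur.yaml, resources/corpus/, and resources/font/ur/). The helper name `make_layout` is hypothetical, and the actual repository may contain more folders than these.

```python
import tempfile
from pathlib import Path


# Hypothetical helper: the paths mirror the ones this guide refers to.
# The real repository may ship additional folders.
def make_layout(root: Path, lang: str = "ur") -> list[Path]:
    """Create the corpus and font folders plus an empty config file."""
    corpus_dir = root / "resources" / "corpus"
    font_dir = root / "resources" / "font" / lang
    config_file = root / f"config_{lang}.yaml"
    for directory in (corpus_dir, font_dir):
        directory.mkdir(parents=True, exist_ok=True)
    config_file.touch(exist_ok=True)  # placeholder, filled in during Step 3
    return [corpus_dir, font_dir, config_file]


# Demonstrate in a throwaway directory rather than the real repository.
with tempfile.TemporaryDirectory() as tmp:
    for path in make_layout(Path(tmp)):
        print(path.relative_to(tmp))
```

To use it for real, call `make_layout(Path("."))` from inside the cloned directory.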

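Explanation: config_ur.yaml controls how the dataset is generated, resources/corpus/ holds the source text, and resources/font/ur/ holds the Urdu fonts.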

Step 3: Creating config_ur.yaml Configuration File

Below is a sample config_ur.yaml file configured for Urdu. This file determines how your synthetic dataset will be generated, including text layout, image effects, and dataset size:

Step 4: Creating Sample Corpus (urdu_sample.txt)

Create a text file named urdu_sample.txt inside the resources/corpus/ folder. This file should contain sample Urdu text paragraphs. You can replace the text with any other RTL language content:

Example of urdu_sample.txt:
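As a rough sketch, the snippet below writes a one-line corpus file and checks that its letters really are right-to-left, using the Unicode bidirectional classes "R" and "AL". The sample sentence (roughly "This is a sample Urdu sentence.") and the helper names `rtl_ratio` and `write_corpus` are illustrative assumptions, not part of SynthDoG.

```python
import tempfile
import unicodedata
from pathlib import Path

# Illustrative sample only; replace it with your own paragraphs.
SAMPLE_TEXT = "یہ اردو کا ایک نمونہ جملہ ہے۔\n"


def rtl_ratio(text: str) -> float:
    """Fraction of letters whose Unicode bidi class is right-to-left."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    rtl = sum(unicodedata.bidirectional(ch) in ("R", "AL") for ch in letters)
    return rtl / len(letters)


def write_corpus(path: Path, text: str) -> None:
    """Save the corpus as UTF-8, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")


# Demonstrate in a throwaway directory rather than the real repository.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "resources" / "corpus" / "urdu_sample.txt"
    write_corpus(corpus, SAMPLE_TEXT)
    print(rtl_ratio(corpus.read_text(encoding="utf-8")))
```

A low ratio usually means the file contains mostly Latin text or was saved with the wrong encoding.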

Step 5: Adding Fonts

Place .ttf font files for Urdu in the resources/font/ur/ directory. Ensure the font supports the language you are targeting.

To get high-quality fonts for your target RTL language, you can download them from Google Fonts. Here's how you can do it:

Make sure that the font directory is set in config_ur.yaml.
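The sketch below lists the .ttf files in a font folder and renders the matching font entry as YAML text. The key names (`font`, `paths`, `weights`) are assumed to match the upstream SynthDoG config_en.yaml, so verify them against your own config_ur.yaml. Both helper names are hypothetical.

```python
import tempfile
from pathlib import Path


def list_fonts(font_dir: Path) -> list[str]:
    """Return the names of the .ttf files found directly in font_dir."""
    return sorted(p.name for p in font_dir.glob("*.ttf"))


def font_fragment(font_dir: str) -> str:
    """Render the font section as YAML text.

    Assumption: the key names follow upstream SynthDoG's config_en.yaml.
    """
    return f"font:\n  paths: [{font_dir}]\n  weights: [1]\n"


# Demonstrate in a throwaway directory rather than the real repository.
with tempfile.TemporaryDirectory() as tmp:
    font_dir = Path(tmp) / "resources" / "font" / "ur"
    font_dir.mkdir(parents=True)
    print(list_fonts(font_dir))  # empty until you copy fonts in

print(font_fragment("resources/font/ur"))
```

An empty list from `list_fonts` is a quick sign that the fonts were copied to the wrong folder.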

This process can be repeated for other languages by creating new directories under resources/font/ (e.g., resources/font/arabic for Arabic fonts).

Step 6: Generating the Dataset

Run the following command in a terminal to generate your dataset. Adjust the parameters -c (number of samples) and -w (number of workers) as needed:
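For reference, here is a sketch that assembles the command from Python. It assumes the repository keeps the upstream SynthDoG invocation (the `synthtiger` CLI with `template.py` and the `SynthDoG` template class). The output folder name is an arbitrary choice and `build_command` is a hypothetical helper.

```python
import shlex


def build_command(output_dir: str, config: str, count: int, workers: int) -> list[str]:
    """Assemble the synthtiger argument list (assumes the upstream SynthDoG layout)."""
    return [
        "synthtiger",
        "-o", output_dir,    # where generated images and metadata are written
        "-c", str(count),    # number of samples to generate
        "-w", str(workers),  # number of worker processes
        "-v",                # verbose output
        "template.py", "SynthDoG", config,
    ]


cmd = build_command("./outputs/SynthDoG_ur", "config_ur.yaml", count=50, workers=4)
print(shlex.join(cmd))
# To launch generation from Python: subprocess.run(cmd, check=True)
```

Printing the command first lets you inspect it before starting a long generation run.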

Parameter Explanation:

Step 7: Modifying Configuration for Different Effects

Here’s a brief guide to some parameters you can tweak in the config_ur.yaml file:

Step 8: Extending to Other RTL Languages

To generate synthetic datasets for other RTL languages, repeat the steps above and:
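One way to keep the per-language files organized is a small helper that derives every path from a language code. This is only a sketch: the naming scheme (config_<code>.yaml, resources/font/<code>, outputs/SynthDoG_<code>) is a hypothetical convention modeled on the Urdu example, not something SynthDoG requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageAssets:
    """Paths needed to generate data for one language (hypothetical naming)."""
    config: str
    corpus: str
    font_dir: str
    output_dir: str


def assets_for(code: str, corpus_name: str) -> LanguageAssets:
    """Derive the per-language paths from a short code such as 'ur' or 'ar'."""
    return LanguageAssets(
        config=f"config_{code}.yaml",
        corpus=f"resources/corpus/{corpus_name}",
        font_dir=f"resources/font/{code}",
        output_dir=f"outputs/SynthDoG_{code}",
    )


print(assets_for("ur", "urdu_sample.txt"))
print(assets_for("ar", "arabic_sample.txt"))
```

The corpus and font paths returned here are the values that have to change in each language's configuration file.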

This method is suitable for generating synthetic data for Arabic, Urdu, Persian, Hebrew, and other right-to-left languages.

Conclusion

This guide helps you create high-quality synthetic datasets for Donut using SynthDoG. Because generation is driven by a YAML configuration file (config_ur.yaml in this guide), you can adjust parameters to match the specific needs of your project and target language.

For more information and updates, refer to the