BHDD: A Burmese Handwritten Digit Dataset
This paper introduces BHDD, the first public benchmark dataset for handwritten Burmese digits. Myanmar script's circular letterforms, originally developed for writing on palm leaves, create recognition challenges different from those posed by Latin digits: pairs such as 0 and 1 differ only in whether a circle is closed. The authors release 87,561 verified images (28×28 grayscale, MNIST-compatible format) from over 150 contributors, with writer-independent train/test splits and baseline models reaching up to 99.83% accuracy.
BHDD is a solid, carefully constructed dataset paper that fills a genuine gap in handwritten digit benchmarks. The methodology is sound: contributors are explicitly split between training and test sets to avoid writer overlap, quality assurance involves both automated deduplication and manual verification, and the mobile preprocessing app with real-time thresholding is a practical innovation for crowdsourced collection. The baseline experiments are sufficient to validate that the dataset is learnable while revealing script-specific challenge patterns.
The data collection methodology demonstrates care for real-world noise: using phone cameras rather than scanners introduces lighting and angle variation, and the Android app with adaptive thresholding allowed contributors to verify digit extraction quality. The statistical analysis goes beyond simple pixel histograms to examine per-class ink coverage (30.4% for class 2 up to 56.8% for class 0), mean images showing consistent stroke patterns, and variance heatmaps identifying where handwriting styles diverge most. The script-specific confusion analysis between visually similar pairs (0/1, 0/8) provides actionable insights for future model development.
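The per-class statistics discussed above can be sketched roughly as follows. This is an illustration only, not the authors' exploration script: the function name `per_class_stats`, the threshold of 128, and the definition of ink coverage as the fraction of pixels above that threshold are all assumptions, and random arrays stand in for the real images.

```python
import numpy as np

def per_class_stats(images, labels, num_classes=10, ink_threshold=128):
    """Return per-class ink coverage, mean images, and pixel-variance maps.

    images: uint8 array of shape (N, 28, 28); labels: int array of shape (N,).
    Ink coverage is taken here as the fraction of pixels brighter than
    ink_threshold. This definition is an assumption, not the paper's.
    """
    coverage = np.zeros(num_classes)
    mean_images = np.zeros((num_classes,) + images.shape[1:])
    variance_maps = np.zeros_like(mean_images)
    for c in range(num_classes):
        class_images = images[labels == c].astype(np.float64)
        if len(class_images) == 0:
            continue  # leave zeros for classes with no samples
        coverage[c] = (class_images > ink_threshold).mean()
        mean_images[c] = class_images.mean(axis=0)    # average stroke pattern
        variance_maps[c] = class_images.var(axis=0)   # where writers diverge
    return coverage, mean_images, variance_maps

# Synthetic stand-in data; replace with the real BHDD arrays.
rng = np.random.default_rng(42)
demo_images = rng.integers(0, 256, size=(200, 28, 28), dtype=np.uint8)
demo_labels = rng.integers(0, 10, size=200)
coverage, mean_images, variance_maps = per_class_stats(demo_images, demo_labels)
```

Plotting `variance_maps[c]` as a heatmap would highlight the pixels where handwriting styles differ most for class `c`, which is the kind of view the review refers to.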
The test set exhibits extreme class imbalance (class 0: 6,856 samples vs. class 9: 389 samples), a ratio of about 17.6:1 that complicates interpretation of overall accuracy even though macro-F1 is also reported. With the improved CNN producing only 47 misclassifications across 27,561 test samples, the error analysis, while informative, has limited statistical power for drawing general conclusions about difficult pairs. Additionally, most contributors came from Yangon, with smaller representation from other regions, which may limit demographic diversity. The paper notes these limitations but does not quantify their impact on generalization.
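The arithmetic behind these concerns can be checked with a short sketch. It uses only the figures quoted above; the Wilson score interval is one common choice for a binomial confidence interval and is not something the paper is stated to use, and the per-pair error count `k` is a made-up value for illustration.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p_hat = successes / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)
    ) / denom
    return center - half_width, center + half_width

# Figures quoted in the review.
test_size = 27_561
errors = 47
class0_count, class9_count = 6_856, 389

imbalance_ratio = class0_count / class9_count
accuracy = (test_size - errors) / test_size
acc_low, acc_high = wilson_interval(test_size - errors, test_size)

print(f"imbalance ratio: {imbalance_ratio:.1f}:1")
print(f"accuracy: {accuracy:.4%}, interval: [{acc_low:.4%}, {acc_high:.4%}]")

# Hypothetical: suppose k of the 47 errors fell on a single confusion pair.
# With n = 47 the interval on that pair's share of errors is wide.
k = 10  # assumed count, not taken from the paper
share_low, share_high = wilson_interval(k, errors)
print(f"pair share: {k / errors:.1%} [{share_low:.1%}, {share_high:.1%}]")
```

The interval on overall accuracy is narrow because it rests on 27,561 samples, whereas any statement about which pairs dominate the errors rests on only 47 observations.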
The placement within the MNIST family of datasets is appropriate, and the 28×28×1 format choice enables direct compatibility with existing data loaders. The comparison to Kuzushiji-MNIST's finding that 'different scripts need their own benchmarks' is fair and well-supported by the paper's own confusion analysis showing unique error patterns (particularly the 0-1 ambiguity specific to circular scripts). The baselines are standard but adequate for a dataset paper; the progression from MLP (99.40%) to CNN (99.75%) to improved CNN (99.83%) demonstrates that augmentation and batch normalization provide expected gains without requiring exotic architectures.
Reproducibility is excellent. The dataset is publicly available under CC BY-SA 4.0 on GitHub with both pickle and IDX formats. Baseline code, exploration scripts, and usage examples are included. Experimental details are thorough: random seed (42), exact layer dimensions (256/128 for MLP, 32/64 filters for CNN), dropout rates (0.25 spatial, 0.5 dense), learning rate ($10^{-3}$), augmentation parameters ($\pm 15°$ rotation, $\pm 2$ px translation, 0.9–1.1× scale), and optimizer settings are all specified. The use of standard frameworks (scikit-learn, PyTorch) further ensures independent reproduction is straightforward.
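For readers who want to load the IDX release without extra dependencies, a minimal reader for the standard IDX layout (the one used by MNIST) might look like the sketch below. The function names are hypothetical, the review does not give the repository's file names, and whether the files are gzip-compressed is not stated, so the file-based helper assumes an uncompressed file at a path you supply.

```python
import struct

def parse_idx_ubyte(buffer):
    """Parse an IDX buffer holding unsigned-byte data.

    Returns (dims, payload): dims is a tuple of dimension sizes and
    payload is the raw bytes in row-major order.
    """
    # Header: two zero bytes, a data-type code, and the number of dimensions.
    zero1, zero2, type_code, ndim = struct.unpack(">BBBB", buffer[:4])
    if zero1 != 0 or zero2 != 0:
        raise ValueError("not an IDX buffer: first two bytes must be zero")
    if type_code != 0x08:
        raise ValueError("this sketch only handles unsigned bytes (type 0x08)")
    # Each dimension size is a big-endian 32-bit unsigned integer.
    dims = struct.unpack(f">{ndim}I", buffer[4:4 + 4 * ndim])
    expected = 1
    for d in dims:
        expected *= d
    payload = buffer[4 + 4 * ndim:]
    if len(payload) != expected:
        raise ValueError("payload size does not match header dimensions")
    return dims, payload

def read_idx_file(path):
    """Read an uncompressed IDX file; `path` is supplied by the caller."""
    with open(path, "rb") as f:
        return parse_idx_ubyte(f.read())

# Round trip on a tiny synthetic buffer: two blank 28x28 images.
header = struct.pack(">BBBBIII", 0, 0, 0x08, 3, 2, 28, 28)
dims, payload = parse_idx_ubyte(header + bytes(2 * 28 * 28))
print(dims, len(payload))
```

If NumPy is available, `np.frombuffer(payload, dtype=np.uint8).reshape(dims)` turns the result into an image array.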
We introduce the Burmese Handwritten Digit Dataset (BHDD), a collection of 87,561 grayscale images of handwritten Burmese digits in ten classes. Each image is 28×28 pixels, following the MNIST format. The training set has 60,000 samples split evenly across classes; the test set has 27,561 samples with class frequencies as they arose during collection. Over 150 people of different ages and backgrounds contributed samples. We analyze the dataset's class distribution, pixel statistics, and morphological variation, and identify digit pairs that are easily confused due to the round shapes of the Myanmar script. Simple baselines (an MLP, a two-layer CNN, and an improved CNN with batch normalization and augmentation) reach 99.40%, 99.75%, and 99.83% test accuracy respectively. BHDD is available under CC BY-SA 4.0 at https://github.com/baseresearch/BHDD