A distributed transcription platform for Sanskrit texts
(Update from 11 August 2022: This proofing interface has been fully implemented. See: https://ambuda.org/proofing/)
7 October 2021
This document proposes a distributed transcription and proofreading platform for Sanskrit texts. Although our initial focus is on scanned books printed in Devanagari, our proposed platform is easily extensible to other document types.
Why transcription is important
For our purposes, transcription is the conversion of Sanskrit text into a machine-readable format. In this setting, the source format is typically some variety of image, such as a PDF of scanned pages. Transcription is highly desirable, as it allows wider and more convenient access to Sanskrit literature. In particular, transcription allows applications like:
full-document search
annotations
conversion to other Indian scripts
convenient display on a variety of devices and in a variety of formats, including websites, e-books, and perhaps even physical reprints
easier reuse when quoting or excerpting the text
reading environments in the mold of the Scaife Viewer from the Perseus Digital Library
Although the major Sanskrit texts have been transcribed and collected (through projects like SARIT, GRETIL, and the like), there is still an enormous selection of texts waiting to be transcribed, including:
important variants of major works, e.g. in different recensions
various minor works and commentaries
associated secondary literature, such as grammars and student volumes (e.g. the works of M. R. Kale)
The transcription problem
Transcription is difficult because it requires energy and expertise. To simplify the problem, let’s assume that we have a scanned PDF available already. Here are the major difficulties we see:
Transcription requires patience and stamina. It is lonely and often boring work. Partial effort is not especially useful to others, nor is it easy to share. It is one matter to transcribe three pages in a moment of energy and enthusiasm; but it is another to transcribe three hundred more.
Even a fully transcribed work can be riddled with errors that make it questionable for serious use. For example, the electronic text of the Ramayana critical edition is in dire need of thorough proofreading.
Transcription may depend on some level of expertise in Sanskrit and in the text’s genre and context. But at a minimum, it requires some working knowledge of Sanskrit.
Efforts to solve the transcription problem have focused on reducing the energy required. There are two common approaches for doing so:
using optical character recognition software to do automated transcription
sharing the transcription work with others
Optical character recognition
Optical character recognition (OCR) software performs automatic transcription by applying a statistical model to some input image and producing machine-readable text. OCR software varies in quality and ease of use, and all systems require some level of manual correction after the fact.
We identify two candidate solutions:
SanskritOCR is available as a standalone program for Windows computers at €129 per license. Created specifically for Sanskrit, it contains several useful features and configuration options, and it is reasonably user-friendly.
Google OCR is available as a web API at negligible cost. It is a raw API with no interface and limited configuration options.
At the token level, both systems have comparable error rates.
However, the errors that Google OCR makes are substantially less severe.
Further, Google OCR is easy to use in a cross-platform setting, e.g. through a web application.
It may be the case that SanskritOCR can improve substantially with better configuration settings. But finding and applying these settings requires some level of expertise, on top of being able to buy and install the program in the first place. On this basis, I think it is reasonable to invest in Google OCR and to build tooling and support around it.
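To make the comparison above concrete, here is a minimal sketch of how a token-level error rate and a rough measure of error severity could be computed against a hand-corrected reference. This document does not say how its error rates were measured, so the method here is an assumption: it uses plain Levenshtein edit distance, and the function names (edit_distance, token_error_rate, char_error_rate) are hypothetical names chosen for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of tokens or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # drop an item from the reference
                curr[j - 1] + 1,     # add an item from the OCR output
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def token_error_rate(reference, ocr_output):
    """Token edits divided by reference length; tokens are whitespace-separated."""
    ref = reference.split()
    hyp = ocr_output.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


def char_error_rate(reference, ocr_output):
    """Character (code point) edits divided by reference length.

    Assumption: used here as a rough proxy for how severe each error is.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)
```

Two OCR outputs can share the same token error rate while differing sharply in character error rate: one wrong vowel sign and a completely garbled word each count as a single token error, but the first is far quicker for a proofreader to correct. That is one way to read the observation that two systems can have comparable token-level error rates while one makes less severe errors.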