A distributed transcription platform for Sanskrit texts
(Update from 11 August 2022 — This proofing interface has been fully implemented. See: https://ambuda.org/proofing/)

7 October 2021

This document proposes a distributed transcription and proofreading platform for Sanskrit texts. Although our initial focus is on scanned books printed in Devanagari, our proposed platform is easily extensible to other document types.   

Why transcription is important

For our purposes, transcription is the conversion of Sanskrit text into a machine-readable format. In this setting, the source format is typically some variety of image, such as a PDF of scanned pages. Transcription is highly desirable, as it allows wider and more convenient access to Sanskrit literature. In particular, transcription allows applications like:

  • full-document search
  • annotations
  • conversion to other Indian scripts
  • convenient display on a variety of devices and in a variety of formats, including websites, e-books, and perhaps even physical reprints
  • easier reuse when quoting or excerpting the text
  • reading environments in the mold of the Scaife Viewer from the Perseus Digital Library

Although the major Sanskrit texts have been transcribed and collected through projects such as SARIT and GRETIL, there is still an enormous selection of texts waiting to be transcribed, including:

  • important variants of major works, e.g. in different recensions
  • various minor works and commentaries
  • associated secondary literature, such as grammars and student volumes (e.g. the works of M. R. Kale)

The transcription problem

Transcription is difficult because it requires energy and expertise. To simplify the problem, let’s assume that we have a scanned PDF available already. Here are the major difficulties we see:

  • Transcription requires patience and stamina. It is lonely and often boring work. Partial effort is not especially useful to others, nor is it easy to share. It is one matter to transcribe three pages in a moment of energy and enthusiasm; it is another to transcribe three hundred more.
  • Transcription may depend on some level of expertise in Sanskrit and in the text’s genre and context. But at a minimum, it requires some working knowledge of Sanskrit.

Efforts to solve the transcription problem have focused on reducing the energy required. There are two common approaches for doing so:

  1. using optical character recognition software to do automated transcription
  2. sharing the transcription work with others

Optical character recognition

Optical character recognition (OCR) software performs automatic transcription by applying a statistical model to some input image and producing machine-readable text. OCR software varies in quality and ease of use, and all systems require some level of manual correction after the fact.

We identify two candidate solutions:

  • SanskritOCR is available as a standalone program for Windows computers at €129 per license. Created specifically for Sanskrit, it contains several useful features and configuration options, and it is reasonably user-friendly.

  • Google OCR is available as a web API at negligible cost. It is a raw API with no interface and limited configuration options.
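
To give a sense of what working with a raw API involves, here is a minimal sketch of preparing a request for one page image. It assumes that "Google OCR" means the Cloud Vision REST API; the helper names `build_ocr_request` and `extract_text` are hypothetical, and the `"sa"` language hint is an assumption that should be checked against the service's list of supported languages. The sketch only builds the request body and reads a parsed response; sending the request and handling credentials are left out.

```python
import base64

# Assumption: "Google OCR" is taken to mean the Cloud Vision REST API.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_bytes: bytes, language_hint: str = "sa") -> dict:
    """Build the JSON body for a single page image (hypothetical helper)."""
    return {
        "requests": [
            {
                # The API expects the image as base64-encoded text.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                # Dense-text OCR, suited to scanned book pages.
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                # Assumption: a Sanskrit hint; verify it is supported.
                "imageContext": {"languageHints": [language_hint]},
            }
        ]
    }


def extract_text(response: dict) -> str:
    """Return the recognized text from a parsed response, or '' if absent."""
    first = (response.get("responses") or [{}])[0]
    return first.get("fullTextAnnotation", {}).get("text", "")
```

The body returned by `build_ocr_request` would be sent as JSON in a POST to `VISION_ENDPOINT`. Everything beyond this, such as page navigation or a correction view, is what the surrounding tooling would have to supply.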

I did a quick comparison of the two systems in 2018 and found the following:

  • At the token level, both systems have comparable error rates.
  • However, the errors that Google OCR makes are substantially less severe.
  • Further, Google OCR is easy to use in a cross-platform setting, e.g. through a web application.
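
For readers who want to repeat this kind of comparison, here is one way a token-level error rate could be computed. This is a sketch under my own assumptions, not a record of how the 2018 comparison was done: tokens are taken to be whitespace-separated, and the error rate is the token-level edit distance divided by the number of reference tokens. The function name `token_error_rate` is hypothetical.

```python
def token_error_rate(reference: str, hypothesis: str) -> float:
    """Token-level edit distance divided by the reference token count.

    Assumption: tokens are whitespace-separated. `reference` is a manually
    verified transcription and `hypothesis` is raw OCR output.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0

    # prev[j] holds the edit distance between the reference tokens
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_token == hyp_token else 1)
            deletion = prev[j] + 1       # reference token missing from OCR
            insertion = curr[j - 1] + 1  # extra token in OCR output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A metric like this treats every wrong token equally, so it cannot by itself show that one system's errors are less severe than another's. Applying the same edit-distance idea to the characters within each mismatched token would be one way to capture that difference.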

SanskritOCR might improve substantially with better configuration settings. But finding and applying these settings requires some level of expertise, on top of being able to buy and install the program in the first place. On this basis, I think it is reasonable to invest in Google OCR and to build tooling and support around it.

Shared transcription