Reviews, catalogs, archives, transcribing and OCR

This category is for notes on a few overlapping topics related to “Publishing Tools”.

Wikisource has access to a larger Wikimedia staff with technical support, and perhaps a better process than the one described below. Start there.

They go explicitly from raw scan data to a finished Wikimedia format. That is certainly going to be a standard markup, and easier to convert for and Gutenberg than the reverse direction.

They have a full “backend” for automating every step that can be automated, and Wikimedia also has plenty of offline editor support for proofreading and other editing.

The remainder of this post still needs rewriting.

Separate posts will follow later if needed. References to “Word” below are shorthand for any editing software that can import and export text files and well-structured documents produced with standard FOSS software such as LibreOffice and pandoc. See pandoc in particular to understand conversion between markup formats.
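As a rough illustration of what “conversion between markup formats” means in practice, here is a minimal sketch that builds a pandoc command line for a few of the output formats mentioned in this post. The file names and the extension-to-writer table are my own assumptions for illustration; only the pandoc options used (`--standalone`, `-t`, `-o`) and the writer names are standard pandoc.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed mapping from target file extension to pandoc writer name.
# The ".wiki" extension is a hypothetical convention, not a pandoc rule.
WRITERS = {".html": "html", ".epub": "epub", ".wiki": "mediawiki", ".docx": "docx"}


def pandoc_command(source: str, target: str) -> list[str]:
    """Build the pandoc command line that converts source into target."""
    suffix = Path(target).suffix.lower()
    if suffix not in WRITERS:
        raise ValueError(f"no pandoc writer configured for {suffix!r}")
    # pandoc infers the input format from the source file's extension.
    return ["pandoc", source, "--standalone", "-t", WRITERS[suffix], "-o", target]


def convert(source: str, target: str) -> None:
    """Run pandoc if it is installed; otherwise only print the command."""
    command = pandoc_command(source, target)
    if shutil.which("pandoc") is None:
        print("pandoc not found; would run:", " ".join(command))
        return
    subprocess.run(command, check=True)
```

Calling `convert("book.docx", "book.wiki")` on a hypothetical `book.docx` would ask pandoc for MediaWiki markup, and the same source could be sent to epub or html by changing only the target name.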


Single-paragraph and other short reviews will be needed to link to longer reviews and related material at different “landing spots” here (where related links for a specific audience and topic will be highlighted).

When there is enough here to support a “campaign” to draw wider attention, such links will be central to attracting traffic. As well as leaving user URLs in comments on blogs and discussion forums, it would be useful to post such short paragraphs or reviews, linking to the longer reviews here, alongside other works referenced here, so that they are seen by people with an interest in those works.

Examples of sites that support user-provided short reviews include library catalogs such as WorldCat, Open Library and Library Genesis, distributors like Amazon, Google Books and

Reaching an academic audience would be primarily via peer-reviewed journals that get cited, with reference lists, in Google Scholar.

Also important is Wikipedia, to be discussed in a separate entry (following its guidelines closely, so based on peer-reviewed articles already in Google Scholar).

Another is where many works that should be referred to here will be found by a wide range of readers. This is discussed under “archives” below.


Both and are repositories for many works to be referenced here. They should be enhanced to include any additional material or improved versions of important material referenced here, and reviews should be linked to all existing versions. Gutenberg is probably largely included in (but also has its own Wikimedia distribution in Kiwix format).

They all have detailed guidelines for contributors and rely heavily on OCR.

The has Python software with an API. See “Useful Links” from here:

Could presumably be integrated directly with Calibre, Library Genesis, Open Library, Zotero, Google Scholar etc.

Gutenberg also has a self-publishing facility with detailed guidelines for .pdf format and links to others. A separate page will cover other relevant options, including Leanpub (for updating from GitHub, with optional payments).


Both of the most important archives mentioned above rely heavily on transcribing from OCR scans of old editions.

The intermediate step is editing a Word file generated from an OCR text file.

There should be some way to automatically get an OCR text or xml file which includes markup for at least font changes like italics, and perhaps for some structure like pages, headers and footnotes. That would save a lot of work while editing the Word file. I am certain this can be found in the documentation of the OCR software that produced the text or xml file (and is more likely to be in the xml file or a collection of related xml files).
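To make the idea concrete, here is a minimal sketch of pulling italics and page breaks out of an OCR xml file. The xml layout used here (`page`, `line` and `run` elements, with an `italic="1"` attribute) is entirely hypothetical; I have not checked what any real OCR package emits, so the element and attribute names would need to be replaced after reading that package's documentation.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical OCR output: one <run> per stretch of uniformly styled text.
SAMPLE = """<document>
  <page number="12">
    <line><run>See </run><run italic="1">Capital</run><run>, vol. 1.</run></line>
    <line><run>Tom &amp; Jerry</run></line>
  </page>
</document>"""


def ocr_xml_to_html(xml_text: str) -> str:
    """Return one output line per OCR line, keeping italics and page numbers."""
    root = ET.fromstring(xml_text)
    out = []
    for page in root.iter("page"):
        # Keep the page number as a comment so it survives into proofreading.
        out.append(f"<!-- page {page.get('number')} -->")
        for line in page.iter("line"):
            parts = []
            for run in line.iter("run"):
                text = escape(run.text or "")
                parts.append(f"<i>{text}</i>" if run.get("italic") == "1" else text)
            out.append("".join(parts))
    return "\n".join(out)


if __name__ == "__main__":
    print(ocr_xml_to_html(SAMPLE))
```

If the real xml does carry this kind of per-run styling, a pass like this would put the italics into the text before any manual editing begins.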

When editing the Word file, it is important to use standard structure elements, such as chapters, heading levels and footnotes, that can be easily modified for the whole document as a “style”, rather than formatting specific text for headers etc. Plan this in advance to be compatible with and Wikimedia, discussed below, preferably by understanding the conceptual structure that could also be processed automatically by pandoc.
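One cheap way to check that a document really uses structural heading styles is to export it to html and look at the resulting outline. The sketch below, using only the Python standard library, lists headings whose level jumps by more than one step (for example an h4 directly under an h2), which usually means something was formatted by hand. The function and class names are my own, and the rule that levels should never skip is an assumption about good structure rather than a requirement of any of the archives.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect (level, text) for every h1 to h6 element in the document."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


def skipped_levels(html_text):
    """Return headings that sit more than one level below the previous one."""
    parser = HeadingOutline()
    parser.feed(html_text)
    parser.close()
    problems = []
    previous = 0
    for level, text in parser.outline:
        if level > previous + 1:
            problems.append((level, text))
        previous = level
    return problems
```

An empty result does not prove the styles were applied correctly, but a non-empty one points straight at the headings worth re-checking.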

The format at is a good reference point, as each work can easily be bulk downloaded from links in the Table of Contents and then read into Calibre for well-automated conversion to epub, pdf etc, as well as for export to editing and import back again after enhancement. Our own material should end up in a similar format, with source files using markup that pandoc can automatically transform into html or Wikimedia pages (or edits to pages) and Kiwix, as well as epub, pdf etc. Design it so that entire sections of the site can be distributed as collections of epub and pdf files with relative links that work between them, as Kiwix, or both.

Some material needs to be upgraded from the (best) files already available at to the .html used by

Best approach is to actually become a volunteer, follow their guidelines, use their templates and scripts and get their help when needed:

Important links from there include:

The above says to first convert the OCR text to html using a Perl script they provide, and perhaps also using Tx2html.

Then proofread the html. The script saves 90% of the proofreading work, so it is important to learn to use it first.
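I have not studied their script, so the following is not a description of it. It is only a minimal sketch of the general kind of work a text-to-html pass can take off the proofreader: joining hard-wrapped lines into paragraphs, escaping special characters, and turning a plain-text italics convention into tags. The use of `_underscores_` to mark italics in the input is an assumption made for this example.

```python
import re
from html import escape


def text_to_html(text: str) -> str:
    """Wrap blank-line-separated paragraphs in <p> and _word_ in <i>."""
    html_parts = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        if not paragraph.strip():
            continue
        # Join hard-wrapped lines back into a single paragraph.
        joined = " ".join(line.strip() for line in paragraph.splitlines())
        body = escape(joined, quote=False)
        # Assumed convention: _text_ in the plain text marks italics.
        body = re.sub(r"_([^_]+)_", r"<i>\1</i>", body)
        html_parts.append(f"<p>{body}</p>")
    return "\n".join(html_parts)
```

Anything a pass like this gets right is one less thing to fix by hand, which is presumably where the large saving comes from.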

I am surprised by that sequence. My intuition is that OCR plus an automated script could also recognize a lot of what has to be added through proofreading, such as italics, but I can see no reference to this.

It should be possible to automate nearly all of this by just exporting to html from a Word document and then running the Perl script. There may be some other page with more specific advice on that. Check first. I do not understand how to use the Perl script, or how to get an OCR text that includes markup for italics etc to save a lot of proofreading. Their volunteers will either know how, or know that it has to be done manually.

When learning this process, also keep in mind Wikipedia/Wikimedia editing, for which there are special offline editors, scripts etc. It will be harder to find relevant guidelines for just transcribing, as Wikipedia is not an archive but an encyclopedia. But they may have better technical guidelines for going from Word documents to their markup language, which goes to both html and Kiwix formats.

It is best to get help from one of the volunteers: give them a link to the source of a specific OCR text file you are willing to fully proofread and transcribe, and ask whether there is some way to get an OCR txt or xml file from the same source that already recognizes italics etc:

Gutenberg should have detailed guidelines for upgrading their archives from those that only have raw scans, pdf and OCR txt files to those that include their specially marked-up text and epubs. This must be a regular activity for their volunteers. The intermediate step of a Gutenberg marked-up text, which does automatically generate an epub, strikes me as exactly what is needed to minimize further proofreading for etc. There must be documentation that spells it out, but I have not found it from a very quick scan and it does seem to indicate their

This confirms that:

Working with HTML

In the survey, most volunteers preferred to handcraft their HTML using their normal editor. Those using a word processor edited the HTML as text, rather than composing a word processor file and then Saving As HTML. There was remarkable unanimity on this.

Gutenberg has organized web-based systems to coordinate volunteers through the production process, with extensive documentation:

They recommend participating in a distributed project before working alone.

This includes a “Smooth Reading” section for volunteers who want to read the whole text while proofreading it themselves.

From a very quick scan just to find these links, I still have not seen anything that sheds light on getting italics etc marked up directly from the OCR scan, but I remain convinced it must be there.

Especially given the recommendations from both Gutenberg and to work from text or html rather than a Word document when proofreading.

My guess is that a project to upgrade a raw book, OCR scan or OCR text to a page should go via a marked-up and proofread text that can be automatically converted for use by Gutenberg, and which can then also be automatically converted to html and Wikimedia Kiwix format.

It is therefore worth becoming a volunteer at all three, and explicitly working with each of them to get through each of those stages.

This is also worthwhile for familiarity with the massive phenomenon of the communist mode of production already breaking through capitalist social relations in such volunteer projects, especially those focussed on freeing “universal labour” from obsolete capitalist property relations, and with the problems that arise in organizing social labour without employers.

Likewise it is important to become a Wikipedia editor as early as possible, for the same reason and to become familiar with the linking process for making connections within related pages and to external reviews etc.

Optical Character Recognition

We will need to OCR some material for translation from Russian, and perhaps some other material not already conveniently available as txt (e.g. journal articles).

Both the important archives recommend ABBYY software which happens to be Russian and heavily oriented to multiple languages.

A free Linux command-line version is available, but presumably this would not include the most important visual OCR editor needed for enhancing existing scanned archives to adequate text versions as used at

