I scan a number of older books which are going to be uploaded to my institution's website using the original full-colour photographs, rather than bitonal images. However, since some of them are typewritten and OCRable, I was interested in seeing whether Scan Tailor output can provide cleaner OCR. My first instinct was that it would, but I wanted to verify that. To test, I compared the first 10 pages of the Brant County Gazetteer, 1869-70, which I recently scanned. These pages have a bit of variety in their contents: blocks of plain text, pages with multiple fonts and illustrations, and coloured pages. I processed these pages using Scan Tailor in "black and white" mode, and performed OCR on both the Scan Tailor output and the original TIFFs I fed into Scan Tailor.
I then counted the number of errors and missing segments per page, and averaged that over the 10 pages. For reference, an "error" was considered to be text interpreted incorrectly or non-text interpreted as text, while a missing "segment" is a word or line skipped by the OCR software because it was not recognized. I let the OCR software run completely automatically on both sets of pages. The software used was ABBYY FineReader 6.0 Sprint Plus, the simple edition that is bundled with some scanners.
For reference, this was tested using Scan Tailor 0.9.7.2; I have not had the chance to upgrade yet.
The unprocessed TIFFs produced an average of 3.4 errors per page, with an average of 1.8 missing segments per page. The Scan Tailor results produced an average of 3.8 errors per page, with an average of 1.1 missing segments per page.
I was surprised to see that the unprocessed TIFFs had a lower error rate than the Scan Tailor output. Looking at the results, it seems that this is for two reasons. First, the substantially higher number of missing segments in the unprocessed TIFFs meant that ABBYY skipped some lines outright; the Scan Tailor pages had a bit more text recognized overall, which compensates a bit for the higher error rate.
As well, I noticed that one particular segment of pages (pages 8-9, comprising the two indexes) produced substantially more errors in the Scan Tailor output. While the letters were recognized very accurately, something about the way Scan Tailor processed the text meant that a number of spurious spaces appeared, primarily in words ending in "-ford," such as Brantford. The Scan Tailor results produced a lower average error rate on the remaining pages.
I'll be uploading an archive containing both sets of TIFFs, and Word documents containing the OCR output for each, if anyone is interested in seeing the results for themselves. Be warned that it's about 120MB in 7zip thanks to the size of the source TIFFs. I've attached the Word documents themselves to this message. I can break down the errors and skipped portions per line if anyone wants that detail; I decided not to clutter this post for now.
Interesting! I think that up until now, OCR has been a black art. Fonts, font sizes, color, and the dictionaries used by the OCR program all contribute to errors. FineReader is particularly troubling because it is closed, and because I've found it to make errors depending even on the output format, which makes absolutely no sense. A support request to ABBYY, in which I provided a sample page image with sample outputs and pointed out the errors, resulted in the typical customer support non-answer that FineReader isn't guaranteed to logically do anything you expect it to do. I hope that OCRopus and Tesseract will finally open up the process (eventually, once the developers stop assuming that the only platform OCRopus should be used on is Ubuntu Linux). Pluggable dictionaries, trainable character sets, and transparent operations will make OCR more of a science. I guess all this goes to show that you can't count on anything in particular to improve your OCR results. The only thing which is apparently guaranteed to get better results is a perfectly flat page. You could be running into a subtle difference in the way FineReader binarizes, rotates, and dewarps and the way ST binarizes and rotates.
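For anyone who wants to put the two sets of averages reported in this thread side by side, here is a minimal Python sketch. It uses only the figures already given (3.4 and 1.8 per page for the unprocessed TIFFs, 3.8 and 1.1 for the Scan Tailor output, over 10 pages). The `OcrRun` class and its method names are made up for illustration. Adding errors and missing segments into one "problems per page" figure is also my own assumption, since a missing segment can be a single word or a whole line and so is not strictly comparable to one misread.

```python
from dataclasses import dataclass
from typing import Tuple

PAGES = 10  # both sets cover the same 10 pages


@dataclass(frozen=True)
class OcrRun:
    """Per-page averages for one OCR run (hypothetical helper, not from any tool)."""
    name: str
    errors_per_page: float
    missing_per_page: float

    def totals(self, pages: int = PAGES) -> Tuple[int, int]:
        # Recover whole-number totals from the per-page averages.
        return (round(self.errors_per_page * pages),
                round(self.missing_per_page * pages))

    def combined_per_page(self) -> float:
        # Assumption: count an error and a missing segment as one problem each.
        return round(self.errors_per_page + self.missing_per_page, 1)


unprocessed = OcrRun("unprocessed TIFFs", 3.4, 1.8)
scan_tailor = OcrRun("Scan Tailor output", 3.8, 1.1)

for run in (unprocessed, scan_tailor):
    errors, missing = run.totals()
    print(f"{run.name}: {errors} errors, {missing} missing segments, "
          f"{run.combined_per_page()} problems per page")
```

Counted this way, the Scan Tailor output comes out slightly ahead overall despite its higher error rate, which matches the observation that the extra recognized text partly offsets the extra errors.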