One of my jobs in recent months has been creating CROSS eBooks for a company called Doxa Digital Press. Pretty much all of these eBooks have to be OCR'ed from scanned images inside PDF files, which means that I have to have an OCR program that handles PDF files reasonably well. The main problem I have run into on this front is cost. I could go out and spend $500 for ABBYY FineReader, but I don't have that kind of cash just sitting around. Unfortunately there are not many other options on the market, especially for a guy on a budget, and especially ones that I can download a trial of.
Enter Readiris Pro 10. This OCR program was really the only other option I could find that fits the bill in the main ways: it handles PDF files and it is only about $100. So I tried out the trial version and noticed that it is not as full-featured as ABBYY, but it is usable. I also noticed that the demo version did not allow me to open more than 50 pages of a PDF document at a time, which was annoying but an understandable restriction for a demo program. So I purchased the full version from Amazon.com.
After installing the full version, I loaded the PDF file for my next project and came to the realization that the 50-page limit is actually a PROGRAM FEATURE. The idiots who designed this program did not consider the idea that customers would want to open documents that are larger than 50 pages. After playing around with the program more, here are all of my observations on this total waste of money:
- The program does not input more than 50 pages at a time: This means that if you have more than 50 pages in the PDF you want to OCR, or if you have more than 50 pages in any other source that you want to OCR, you have to split them up into groups and save the groups individually. Additionally, this issue is not revealed at all on the company's website or in the documentation. As a matter of fact, I found this "Tip" in the Help files related to working with multipage documents: "Tip: when you're loading multipage images and PDF documents, you can define the page range! (Why load an entire book into Readiris when you only need a specific chapter...?)"
- The program does not allow you to save a batch or project: If you want to come back to a scanned document or project you have to do all the work over. If the program or your computer crashes you have to do all of your work over. I guess this could be the reason IRIS decided to limit projects to 50 pages...
- Every time you change pages the tool you are using changes back to the default: This is an annoying function. The default tool is the text block tool, and the programmers decided that you should always use this tool first when you go to a different page.
- The program does not use keyboard shortcuts for tools: I have yet to find any keyboard shortcuts that will let me change between the tools. This is really annoying when you consider the point above. I end up having to use my mouse a lot more than I should.
- The program does not offer a Preferences or Settings list to allow the user more control over their workspace: I guess I can understand setting up a program with a certain user base in mind and making the default functions cater to that crowd. However, not giving users the opportunity to change this default behavior and customize their settings is illogical.
- The program does not export clean HTML: Since I am a web developer and an eBook editor, this is something that I feel very strongly about. If a program creates an HTML document for you it should at least attempt to follow the recognized standards. Readiris Pro 10 saves each page of a project as an individual HTML page (complete with headers and inline styles) even if you are explicitly saving the project as one document. You end up with lots of HTML documents thrown together into one file. In addition, the source code is so messy that I have to go in and make major changes to the document just to get the code clean enough to actually make into a usable eBook. Saving the same project in the three different "Text Format" settings only changes the number of styles added to the style attributes in each tag (yes, they put styles in every tag, not just in the style list in the header...), not the actual layout of the HTML. To beat it all, the DOCTYPE declaration at the top of each document is the XHTML Transitional doctype! I ran Tidy on samples of all three formatting types and got errors on all of them.
- The program does not tell you what errors, if any, it found in the OCR process: Unlike ABBYY, which highlights possible OCR problems in a separate frame and allows you to actually edit the text that you will be exporting, Readiris Pro 10 does the OCR and automatically opens the Save As wizard. There is no possibility of seeing what words it had problems deciphering.
Those are just the most egregious issues I found with this program. How can this be the 10th version release?? I have contacted the company's support staff by e-mail to find out if there is a way to get around the 50-page limit. If there is not a way to open more pages as part of a project, I will most likely demand that the company buy back the CD that I purchased. I can overlook the other functionality problems if necessary, but having to break up my documents into pieces for no reason is a design flaw that should have been remedied before the product was released for sale. At the very least, the company has a responsibility to alert potential buyers of this restriction before they spend their money.
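In the meantime, working out the batches for a big book is at least simple arithmetic. Here is a minimal Python sketch of it; the `page_ranges` function and the 230-page example are made up for illustration and have nothing to do with Readiris itself. It only computes the first and last page of each group, which I would still have to type into the page range option by hand when loading the PDF.

```python
def page_ranges(total_pages, limit=50):
    """Split pages 1..total_pages into consecutive (first, last) groups.

    Each group holds at most `limit` pages; 50 is the cap I ran into.
    """
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [
        (first, min(first + limit - 1, total_pages))
        for first in range(1, total_pages + 1, limit)
    ]


if __name__ == "__main__":
    # Hypothetical 230-page scanned book.
    for first, last in page_ranges(230):
        print(f"pages {first}-{last}")
```

For a 230-page book this gives five groups, the last one covering pages 201-230, so that is five separate load, OCR, and save passes for a single project.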
UPDATE: Well, here is an interesting bit of info: The e-mail I sent IRIS came back to me undeliverable....
----- Transcript of session follows -----
support@samson.irislink.com... Deferred: Connection timed out with
samson.irislink.com.
Warning: message still undelivered after 4 hours
Will keep trying until message is 5 days old
UPDATE: Here are further updates to this situation and its resolution: I, II, III, Final Resolution.