In this lab session we are going to go through the whole process of digitization, correction, and enrichment.
The first step in the digitization process is scanning the two newspaper articles you were supposed to bring to the lab session. As we only have a few scanners available at IVA and it would take a lot of time for everyone to get their articles scanned properly, we have done this for you. If you did your homework and dropped off printed versions of your two articles, then these will have been scanned by us. We have put all of your scanned articles online on this page, ordered by group number.
Download the images belonging to your group and save them in your 'UCHFall2010' folder. If for some reason you did not hand in any images for scanning, we have made scans of two other articles available under group 0, so you can download and use those.
After you’ve downloaded your images, it is time to run OCR on them! First, we need to install OCR software in Ubuntu. We are going to use OCRAD, a free OCR software package. OCRAD is not part of the standard Ubuntu installation, so we need to download the program and install it. To do that, however, we first need to install the g++ compiler by running the following command in the Terminal:
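The install command itself is missing from this page; on a standard Ubuntu installation of that era it would look like this (assuming the default Ubuntu package repositories):

```shell
# Install the g++ compiler from the Ubuntu repositories
sudo apt-get install g++
```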
Now we download the program and install it:
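The download and build commands were lost from this page; a plausible sequence for OCRAD 0.20 is sketched below. The download URL and archive name are assumptions — check the GNU Ocrad page for the exact file name:

```shell
# Download the OCRAD 0.20 source archive (URL is an assumption; verify on ftp.gnu.org/gnu/ocrad/)
wget http://ftp.gnu.org/gnu/ocrad/ocrad-0.20.tar.gz
# Unpack the archive and enter the source directory
tar -xzf ocrad-0.20.tar.gz
cd ocrad-0.20
# Configure and compile the program
./configure
make
```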
sudo make install
rm -rf ocrad-0.20*
OCRAD should now be installed. Let’s make a new directory in our 'UCHFall2010' folder for our OCR experiments:
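A minimal sketch, assuming your 'UCHFall2010' folder lives in your home directory:

```shell
# Create a directory for the OCR experiments inside the course folder
mkdir -p ~/UCHFall2010/ocr
# Change into the new directory
cd ~/UCHFall2010/ocr
```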
If you haven’t done so already, please download the scanned articles for your group. This can be done using the browser and the Ubuntu interface or, for the bold, by figuring out how to use wget from the command line. Save the articles in the 'ocr' directory you just created, or move them there from the Downloads directory. Use the unzip command to unpack the zipped articles:
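The actual download link lives on the course page, so the URL and file name below are hypothetical placeholders:

```shell
# Download the zipped scans for your group (URL and file name are hypothetical; use the link on the course page)
wget http://example.com/scans/groupX-articles.zip
# Unpack the archive in the current ('ocr') directory
unzip groupX-articles.zip
```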
Once unpacked, we can prepare the images for OCR. The images are scanned in the JPG format (grey scale; 600 dpi). OCRAD only works with a few file formats, so we need to convert the images, for instance to the PBM format. This can be done with ImageMagick’s convert tool:
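A sketch of the conversion, with example file names (substitute your own group’s files):

```shell
# Convert a scanned JPG to the PBM format that OCRAD understands
convert group0-newArticle.jpg group0-newArticle.pbm
```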
Use the file browser in Ubuntu to view the image. Many of the images are not rotated correctly, and OCRAD needs the text to run horizontally to have a chance of working. Again, convert can help us: the -rotate 270 option, for instance, rotates the image 270 degrees clockwise. Note that you only have to do this if your image is not rotated correctly!
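For example (file names are illustrative; here the rotated image simply overwrites the original):

```shell
# Rotate the image 270 degrees clockwise -- only needed if the scan is sideways
convert -rotate 270 group0-newArticle.pbm group0-newArticle.pbm
```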
The image is now ready to be OCRed. OCRAD has many options and parameters (study the manual and help files yourself). For a basic OCR run, simply call OCRAD and provide your filename:
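With the example file name used above, the invocation looks like this:

```shell
# Run OCR on the prepared PBM image; the recognized text is printed to the screen
ocrad group0-newArticle.pbm
```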
The output will roll over the screen for you to inspect. If you want to save it, use the > operator to redirect the output into a file called ‘group0-oldArticle-ocr.txt’ or something similar (see the Ubuntu tutorial if you cannot remember how).
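A sketch of the redirect, again with example file names:

```shell
# Redirect the OCR output into a text file instead of the screen
ocrad group0-newArticle.pbm > group0-newArticle-ocr.txt
```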
Now that you have the OCRed text you can compare it to the typed plain text you have prepared in advance.
Now that you have the OCR’ed text of the articles, we are going to see if we can correct some of the spelling errors made by the OCR software. We have implemented a simple version of Kernighan’s Noisy Channel model, which was discussed in class. You can download the software here; save it in your 'UCHFall2010' directory or download it using the wget command in a Terminal window:
In a Terminal window, navigate to your 'UCHFall2010' directory. First, we need to unzip the spelling corrector using the following command:
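The archive name below is an assumption — use the name of the file you actually downloaded:

```shell
# Unpack the spelling corrector in the current directory
unzip spelling-corrector.zip
```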
Let’s take a look at what was in the zip file:
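Listing the directory contents is a single command:

```shell
# List the files that were unpacked from the zip archive
ls
```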
You should see something like this:
clean-up  correct  error-model.txt  language-model.txt
What do these different files represent? The file 'language-model.txt' contains the training data for the language model: in this case, a large number of text files taken from Project Gutenberg that I have pasted together. The 'error-model.txt' file contains the occurrence counts for the different types of errors as they are listed in the back of the Kernighan paper (i.e., the error model). These files are necessary to run the spelling corrector, but you will not have to alter or use them in any way. The file 'correct' is the actual program that will construct the error and language models and then generate corrections. It always needs to be run from a Terminal window (so you cannot simply launch it by double-clicking it in a file browser window!).
The fourth file, 'clean-up', is a program that cleans up the OCR output for us. As you might have noticed, OCRAD outputs a lot of punctuation marks and other garbled characters that are not real text. We need to clean this up, or our spelling corrector will choke on it. The 'clean-up' program will do that for you.
In the Ubuntu tutorial you learned that you can usually ask programs for a help message, and the 'clean-up' program is no different. If you run it with the '-h' (help) option, it will display the following help message, explaining which parameters it takes:
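The invocation itself, assuming you are in the directory that contains the program:

```shell
# Ask the clean-up program for its help message
./clean-up -h
```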
Usage: clean-up [options]

  -h, --help      show this help message and exit
  -t T, --text=T  filename of the file that contains the text to be cleaned
Let’s clean the OCR’ed text of your new article as an example. Specify the filename of that OCR’ed text using the -t parameter:
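A sketch of the command, using the example output file name from the OCR step:

```shell
# Clean the OCR output of the new article; the result is printed to the screen
./clean-up -t group0-newArticle-ocr.txt
```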
As the output rolls over the screen, you can see that some of the weirder punctuation marks and the extra empty lines are removed. Run the command again and redirect the output to a new file (for instance ‘group0-newArticle-clean.txt’) using the > operator:
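For example:

```shell
# Save the cleaned text to a file instead of printing it to the screen
./clean-up -t group0-newArticle-ocr.txt > group0-newArticle-clean.txt
```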
Of course, you need to do the same for the old article as well. Now that the OCR output has been cleaned, we can do spelling correction using the 'correct' program! If you run it with the -h option, you can display its help message:
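As before, the invocation is a single command from the directory containing the program:

```shell
# Ask the correct program for its help message
./correct -h
```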
  -h, --help            show this help message and exit
  -l L, --language-model=L
                        filename that contains the training data for the
                        language models (i.e., a collection of text)
  -t T, --text=T        filename of the file that contains the text to be
                        corrected
It tells us that to correct the spelling of a text file we need to specify (1) the name of that text file and (2) the name of the file containing the training data for the language model. This means that we could also do spelling correction on the same text using different language models. If we selected a smaller collection of texts, for instance, that might influence the quality of the corrections. Notice that you do not need to specify the name of the file containing the error model; this is generated automatically.
Let’s run the spelling corrector on some of the cleaned output of the OCR software, such as the cleaned OCR of the new article, ‘groupX-newArticle-clean.txt’, where X is your group number:
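A sketch of the full command; replace X with your group number (the output file name is a suggestion that matches the naming used later in this lab):

```shell
# Run the spelling corrector with the supplied language model and save the result
./correct -l language-model.txt -t groupX-newArticle-clean.txt > groupX-newArticle-corrected.txt
```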
If this produces an error such as “KeyError”, then make sure you have cleaned the OCR output first before you run spelling correction! If it still does not work, then send me an e-mail and attach the cleaned and the ‘dirty’ text file that caused the error. I have tried to address all possible errors, but nobody is perfect and I might have overlooked something.
If it did work, then do the same thing for the OCR’ed version of the old newspaper article. You have now run spelling correction on your OCR’ed articles!
Let’s examine the results! To view them, use the Ubuntu text editor (called gedit) by going to Applications > Accessories > gedit Text Editor in the menu. Open both the OCR’ed and the corrected versions of your articles. If everything went okay, you will probably notice that OCR has worked much better on the new article than on the old article. Whether or not spelling correction has improved the texts depends on many factors, such as the quality of the scan and the size and layout of the original article.
You will also notice that the spelling corrector has converted everything to lowercase (i.e., without any capital letters). This is because we only have an error model for lowercase characters. This also has consequences for the NER stage, as we will discuss in the next section.
Named Entity Recognition
While many different types of enrichment are possible, we are going to focus on Named Entity Recognition (NER). Instead of installing software, we will use an online demo created by the Cognitive Computation Group at the University of Illinois at Urbana-Champaign. You can find this demo here.
As explained in the lecture, NER uses many different cues to determine whether a term is a named entity or not. One of the most important ones is capitalization. This means that the lack of capitalization in the output of our spelling corrector could negatively influence the NER quality. We will therefore manually restore the capitalization in the output of the spelling corrector. In gedit Text Editor, open the corrected files (such as ‘group0-newArticle-corrected.txt’) and select File > Save As to save the file under a new name (such as ‘group0-newArticle-corrected-capitalized.txt’). Using the original article as a guide, you should then add the capitalization to this new file where it is missing. Save the file and repeat this for the old newspaper article as well.
Now you’re ready to perform NER. For each of the two articles, copy the text and paste it into the text field of the NER web demo. Press Submit and the NER system will identify the named entities. Note that there is a certain limit to how much text you can paste into this Web demo, so if you are waiting and waiting and it does not seem to produce any results, split your text up into several smaller bits, and copy, paste, and submit these bits individually. You can combine the results afterwards. Once again, if something really doesn’t work, let me know and send me the full text you are trying to run through the Web demo.
If it was successful, you can copy the NER’ed text into a new text file. At the end of this step you should have (at the very least) the following files for both the old and the new newspaper article:
- a file that contains the OCR’ed output
- a file that contains the cleaned OCR output
- a file that contains the output of the spelling correction
- a file that contains the output of the spelling correction with manually added capitalization
- a file containing the output of the NER system
- a file containing the original, error-free text of the original article (which you typed out manually or copied from the Web)
For your first assignment, you are supposed to write a 3-page paper about the results of these three stages. At each stage, old errors can be corrected, but new errors can also be introduced. The goal of the paper is to evaluate the performance at each of these stages and then to write a scientific report about your evaluation. This means you are going to have to go through your articles and count the number of errors made at each stage and report the accuracy.
Your report should at least contain information about the accuracy of the OCR, spelling correction and NER stages for both the old and the new newspaper article. Describe and discuss your findings and try to offer explanations for the differences or similarities you see. Why are certain things going wrong?
This does not necessarily mean you have to stop here. Try to think of all the interesting combinations you could evaluate and the comparisons you could make. What could you do, or could you have done, to get better results? Take some initiative!
Your paper is due Tuesday October 26, 2010 at 23:59. See the slides for lecture 4 for more details about how to hand in your assignment!