Lab session ‘Information Retrieval’

Introduction

In this third lab session, we are going to experiment with different information retrieval models. We will install a retrieval toolkit called Lemur and then use it to perform retrieval experiments on the Cystic Fibrosis (CF) collection, a test collection containing a little over 1200 medical documents and 100 queries, including relevance judgments.

The documents in this collection are represented by different fields, such as title, abstract, subject headings, references, etc. Take your time to read a little bit more about what this collection contains on its Web page here.

We have created separate versions of this collection for you to use in your experiments. We have done this for two reasons: (1) the original collection is not in a format that can be indexed by Lemur, and (2) not all of the fields in these CF documents are of equal value in retrieving relevant documents. We have created four separate versions of the CF collection:

  • Titles only; this version only contains the text of the title fields for each document
  • Abstract only; this version only contains the text of the abstract fields for each document
  • Subject headings only; this version only contains the text of the major and minor subject headings for each document
  • Title + abstract + subject headings; this version of the collection contains the text from all three fields combined

These four different versions will enable us to explore how different fields influence retrieval performance: is it better to use only the title text for retrieval than to use the abstracts? Or do we get the best performance when we combine all three fields? These are some of the questions you are going to have to answer in your assignment.

Let’s take a quick look at what a document looks like. Start by downloading these four versions of the CF test collection here. Save the file in your directory of choice (such as your ‘UCHFall2010’ directory) and unzip the file using the ‘unzip’ command. It contains the following six files:

cystic-fibrosis.abstract.sgml
cystic-fibrosis.all.sgml
cystic-fibrosis.qrel
cystic-fibrosis.queries
cystic-fibrosis.subjects.sgml
cystic-fibrosis.title.sgml

The four files ‘cystic-fibrosis.abstract.sgml’, ‘cystic-fibrosis.all.sgml’, ‘cystic-fibrosis.subjects.sgml’, and ‘cystic-fibrosis.title.sgml’ are the four versions of our CF test collection. If you examine one of these files, for example ‘cystic-fibrosis.title.sgml’, using the ‘less’ command, you will find that it looks like this.

<DOC>
<DOCNO> 00001 </DOCNO>
<TEXT>
Pseudomonas aeruginosa infection in cystic fibrosis. Occurrence of
precipitating antibodies against pseudomonas aeruginosa in relation
to the concentration of sixteen serum proteins and the clinical and
radiographical status of the lungs.
</TEXT>
</DOC>
<DOC>
<DOCNO> 00002 </DOCNO>
<TEXT>
Amylase content of mixed saliva in children.
</TEXT>
</DOC>
<DOC>
<DOCNO> 00003 </DOCNO>
<TEXT>
A clinical study of the diagnosis of cystic fibrosis by instrumental
neutron activation analysis of sodium in nail clippings.
</TEXT>
</DOC>

Basically, each document is composed of a document ID (<DOCNO>) and the text of the document. In this case, the <TEXT> element contains the title text of the original document.
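If you want to inspect the collection files programmatically, the structure described above is simple enough to read with a few lines of Python. The following is only a minimal sketch, assuming every record follows the <DOC>/<DOCNO>/<TEXT> layout shown in the excerpt; the function name ‘parse_trec_documents’ is made up for this illustration and is not part of Lemur or of the lab scripts.

```python
import re

# Matches one <DOC> ... </DOC> record in the layout shown in the excerpt above.
DOC_PATTERN = re.compile(
    r"<DOC>\s*<DOCNO>\s*(\S+)\s*</DOCNO>\s*<TEXT>(.*?)</TEXT>\s*</DOC>",
    re.DOTALL,
)


def parse_trec_documents(raw):
    """Return a dict mapping document ID to its text, with whitespace collapsed."""
    documents = {}
    for doc_id, text in DOC_PATTERN.findall(raw):
        documents[doc_id] = " ".join(text.split())
    return documents


# Two records copied from the title-only excerpt.
sample = """<DOC>
<DOCNO> 00002 </DOCNO>
<TEXT>
Amylase content of mixed saliva in children.
</TEXT>
</DOC>
<DOC>
<DOCNO> 00003 </DOCNO>
<TEXT>
A clinical study of the diagnosis of cystic fibrosis by instrumental
neutron activation analysis of sodium in nail clippings.
</TEXT>
</DOC>
"""

docs = parse_trec_documents(sample)
print(len(docs), "documents parsed")
print(docs["00002"])
```

To run this on a real file, read its contents into a string first and pass that string to the function.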

The other two files in our zip archive contain the queries and the relevance judgments. The queries are formatted the same way as the documents are. Query 1, for instance, looks like this:

<DOC>
<DOCNO> 00001 </DOCNO>
<TEXT>
What are the effects of calcium on the physical properties of mucus
from CF patients?
</TEXT>
</DOC>

The last file, ‘cystic-fibrosis.qrel’, contains the relevance judgments for the query-document pairs:

00001 0 00139 2
00001 0 00151 2
00001 0 00370 1
00001 0 00440 1
00001 0 00441 2

Here, the first column contains the query IDs, the third column contains the document IDs, and the fourth column contains the relevance values. Relevance values can be either ‘1’ or ‘2’. A value of ‘2’ means that a document is highly relevant to a specific query, whereas a value of ‘1’ means that it is only marginally relevant. Example: the first line of the ‘cystic-fibrosis.qrel’ file states that for query 00001 the document with ID 00139 is highly relevant. Documents that are not explicitly mentioned in this file are assumed to be irrelevant. The second column is always set to 0 and plays no role in our experiments, so you may ignore it.
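The four-column layout can be read into a nested dictionary in a few lines. This is a rough sketch rather than an official reader for the format; the function name ‘parse_qrels’ is an assumption introduced here, and the sample lines are the five lines shown above.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map each query ID to a dict of {document ID: relevance value}."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _unused, doc_id, relevance = parts
        qrels[query_id][doc_id] = int(relevance)
    return qrels


sample = [
    "00001 0 00139 2",
    "00001 0 00151 2",
    "00001 0 00370 1",
    "00001 0 00440 1",
    "00001 0 00441 2",
]

qrels = parse_qrels(sample)
# Documents judged highly relevant (value 2) for query 00001 in this excerpt.
highly_relevant = sorted(d for d, r in qrels["00001"].items() if r == 2)
print(highly_relevant)
```

A document that is missing from the dictionary for a query is treated as irrelevant, which you can express as qrels["00001"].get(doc_id, 0).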

Installation

As mentioned earlier, we are going to use an information retrieval toolkit for our experiments called Lemur. This section will show you how to install it on your Ubuntu system. However, before we can do this, we need to install some software packages that are necessary to run Lemur. Run the following commands in a Terminal window:

sudo apt-get -y install g++
sudo apt-get -y install python-lxml
sudo apt-get -y install zlib1g-dev

After the first command, Ubuntu may tell you that g++ has already been installed. That is to be expected, since you had to install it to run ocrad in the last lab session. However, for those of you who have had to start over for some reason, we have included the command here anyway. When these three commands have finished executing, we use the following commands to download and install Lemur:

cd ~/UCHFall2010
wget http://itlab.dbit.dk/~toine/files/lemur-4.12.tar.gz
tar -xzf lemur-4.12.tar.gz
cd lemur-4.12
./configure
make
rm app/obj/*.o
sudo mv app/obj/* /usr/local/bin/
cd ..
rm -rf lemur-4.12*

Be patient, as it might take a while (especially the ‘make’ command). After this is done, Lemur should be installed and ready to go! Try typing

RetEval

in a terminal. If that does not give you any errors, you’re good to go. If it does produce a command-not-found type error, ask us for help!

There is one final thing to do though. Since Lemur can be a bit tricky and particular to use, we have made three programs for you that help you interact with the retrieval toolkit. You can download these here. Make a directory for these scripts in your UCHFall2010 directory named, for instance, ‘ir’ and download and unzip this archive in that directory:

mkdir ~/UCHFall2010/ir
cd ~/UCHFall2010/ir
unzip lemur-interface.zip

This archive contains five different files. We will cover what these files do in the following sections. Make sure you also move your collection files to this new directory, so that we have everything in one place.

Configuration

Now we are ready to start our experiments. The first step is configuring our search engine. Among other things, we need to tell Lemur (1) where we want it to store the index, (2) whether or not we want to filter stopwords from our documents, and (3) whether or not we want to perform stemming. We have to tell Lemur about these last two choices now, because they affect both documents and queries: if we choose to stem and filter our documents, then we also need to do this to our queries, and vice versa. If we do not do this, we would risk not being able to match the stemmed version of a word in a document to the unstemmed version of the same word in the query.
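The mismatch problem can be made concrete with a toy example. The sketch below is purely illustrative: ‘toy_stem’ is a deliberately crude stand-in that strips a trailing ‘s’ (it is neither the Porter nor the Krovetz stemmer), the stopword list is a tiny made-up one, and the query is invented for the example.

```python
STOPWORDS = {"the", "of", "in", "on", "a"}  # tiny illustrative list, not stopwords.txt


def toy_stem(word):
    """Crude stand-in for a real stemmer: strip one trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def preprocess(text, stem=True, stop=True):
    """Lowercase, tokenise on whitespace, then optionally filter and stem."""
    tokens = [t.strip(".,?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if stop:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return set(tokens)


document = "Serum proteins in cystic fibrosis patients"
query = "serum proteins in patients"

doc_terms = preprocess(document, stem=True)          # the index is stemmed
same_settings = doc_terms & preprocess(query, stem=True)
different_settings = doc_terms & preprocess(query, stem=False)

print(sorted(same_settings))
print(sorted(different_settings))
```

With identical settings all three query terms match; when the query is left unstemmed, ‘proteins’ and ‘patients’ no longer match their stemmed forms in the index and only ‘serum’ remains.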

To configure the search engine, we use the ‘lemur-configure.py’ program. If you run it with the ‘-h’ option, you will see the help message:

Usage: lemur-configure.py [options]

Options:
  -h, --help            show this help message and exit
  -c C, --config-file=C
                        Specify an alternative configuration filename. If this
                        is not specified, the default file ‘lemur-config.xml’
                        is used.
  -i I, --index-location=I
                        The name and location of the index to be created, i.e.
                        ‘indices/CACM’.
  -f F, --document-format=F
                        The document format of the documents to be indexed.
                        Choose from ‘trec’, ‘xml’.
  -l L, --stopword-list=L
                        Stopword list to use when parsing text. Leave empty
                        for no stopword filtering. Optional parameter.
  -p P, --toolkit-path=P
                        Path to the retrieval toolkit binaries (e.g.
                        ‘/usr/local/bin’ or
                        ‘/data/local/retrieval-engines/terrier/bin’).
  -s S, --stemming=S    The type of stemming to perform. Choose from ‘porter’
                        or ‘krovetz’, or leave empty for no stemming. Optional
                        parameter.

Let’s say that for this sample retrieval experiment, we are interested in

  • using the version of the CF collection that contains only the title text
  • with stopword filtering
  • with stemming (we will use the Krovetz stemmer)
  • and we want to save this particular index as ‘title-index’

To configure the search engine like this, we would then enter:

./lemur-configure.py -i title-index -l stopwords.txt -s krovetz -p /usr/local/bin/ -f trec -c title-config.xml

What are we doing here? The first parameter

-i title-index

specifies the name of the index: ‘title-index’. This is something you yourself decide; in this case we name it ‘title-index’, because it will be an index containing the title-only version of the CF collection, but you are free to name it whatever you want. The second parameter

-l stopwords.txt

specifies the name of the stopword list we are using: ‘stopwords.txt’. This file contains a list of stopwords we wish to filter from our documents. If you at some point decide not to use stopword filtering, just leave this parameter out. The third parameter

-s krovetz

specifies what kind of stemming we are using. In this case we use the Krovetz stemmer; replace this with ‘porter’ if you want to use the Porter stemmer, which is slightly worse than the Krovetz stemmer. Leave this parameter out if you do not want to perform stemming. The fourth parameter

-p /usr/local/bin/

tells the program where the Lemur toolkit is located. This remains the same, so you will never need to change this. The fifth parameter

-f trec

specifies the format the document collection is in. We have created a collection in TREC format, so you do not need to change this either. The sixth parameter

-c title-config.xml

specifies the name of the configuration file we want to save these settings in. In this case, we choose to call it ‘title-config.xml’. Once again, you are free to name it whatever you want, provided you use the same name in the indexing and searching phases. This config file will contain all of the above configuration settings.

Please note that this configuration file describes one specific combination of settings (stemming, stopword filtering, etc.). If you later want to run experiments with a different collection or different settings, we advise you to pick a different name that reflects the particular combination. Something like ‘title-stem-stop-config.xml’ or ‘abstract-nostem-stop-config.xml’ would be useful.
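If you plan to run many combinations, it can help to generate the file names and the configuration command rather than typing them by hand. The sketch below is one possible way to do that, not part of the lab scripts: the helper names ‘config_name’ and ‘configure_command’ and the ‘-index’ naming scheme are assumptions made for this example, while the command-line flags are the ones listed in the help message above.

```python
def config_name(field, stem, stop):
    """Build a descriptive config filename such as 'title-stem-stop-config.xml'."""
    stem_part = "stem" if stem else "nostem"
    stop_part = "stop" if stop else "nostop"
    return f"{field}-{stem_part}-{stop_part}-config.xml"


def configure_command(field, stemmer=None, stopword_list=None):
    """Return the lemur-configure.py call for one combination as an argument list."""
    config = config_name(field, stemmer is not None, stopword_list is not None)
    index = config.replace("-config.xml", "-index")  # hypothetical naming scheme
    command = ["./lemur-configure.py", "-i", index,
               "-p", "/usr/local/bin/", "-f", "trec", "-c", config]
    if stopword_list is not None:
        command += ["-l", stopword_list]  # leave out for no stopword filtering
    if stemmer is not None:
        command += ["-s", stemmer]        # leave out for no stemming
    return command


command = configure_command("abstract", stemmer=None, stopword_list="stopwords.txt")
print(" ".join(command))
```

The printed line can be pasted into the terminal, or the list can be passed to ‘subprocess.run’ if you prefer to drive the experiments from Python.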

Indexing

Once you have run this command, the search engine is properly configured and we can start indexing. For this we use the ‘lemur-index.py’ program, which is fairly straightforward. All we have to specify is (1) the name of the configuration file we just created and (2) the filename of the file containing the document collection we want to index (‘cystic-fibrosis.title.sgml’). To continue with our example, we would do the following:

./lemur-index.py -c title-config.xml -d cystic-fibrosis.title.sgml

If all goes well, this should produce the following output:

Trying to open toc: title-index.key
Couldn’t open toc file for reading
Creating new index
Parsing file: cystic-fibrosis.title.sgml
Writing out main stats table
Trying to open toc: title-index.key
Trying to open doc manager ids file: title-index.dm
Load index complete.

Congratulations, this means you have successfully indexed all 1234 documents! Don’t worry about the “Couldn’t open toc file for reading” message; Lemur always says that the first time it creates an index.

Retrieval

Now we’re ready to do some retrieval. In this step we use the ‘lemur-search.py’ program. If you run it with the ‘-h’ option, you get the following help message:

Usage: lemur-search.py [options]

Options:
  -h, --help            show this help message and exit
  -c C, --config-file=C
                        Specify an alternative configuration filename. If this
                        is not specified, the default file ‘lemur-config.xml’
                        is used.
  -m M, --retrieval-model=M
                        Retrieval model to be used, Select ‘?’ as the model
                        for a list of available options.
  -n N, --result-count=N
                        Number of results to be returned. Default is 1000.
  -o O, --output-file=O
                        Name of the file containing the search results.
  -p M, --model-parameters=M
                        Model-specific parameters. Use the ‘-m ?’ option to
                        see a list of available model and their parameters. To
                        set one or more model parameters, assign them as
                        ‘param1=0.5’ and ‘param2=350’, where param1 and param2
                        would be the actual parameter names. You can set
                        multiple model parameters at the same time by joining
                        them with a ‘:’ like ‘-p param1=0.5:param2=350’.
  -q Q, --query-to-match=Q
                        Query (set) to retrieve documents for.

We need to specify at least four things:

  • What configuration file do we want to use? (the ‘-c’ parameter)
  • Where are our queries? (the ‘-q’ parameter)
  • Where do we want to save our results? (the ‘-o’ parameter)
  • What retrieval model are we using? (the ‘-m’ parameter)

In addition, we can specify how many results we want the search engine to return using the ‘-n’ parameter. The default is 1000.
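The options above can be assembled into a command line programmatically, which is handy once you start varying models and parameters. This is a hedged sketch: the helper names ‘format_model_parameters’ and ‘search_command’ are invented for this example, and ‘param1’ and ‘param2’ are the placeholder names from the help text, not real parameter names.

```python
def format_model_parameters(parameters):
    """Join name=value pairs with ':' as the help text for '-p' describes."""
    return ":".join(f"{name}={value}" for name, value in parameters.items())


def search_command(config_file, model, queries, output_file,
                   result_count=None, parameters=None):
    """Return the lemur-search.py call as an argument list."""
    command = ["./lemur-search.py", "-c", config_file, "-m", str(model),
               "-q", queries, "-o", output_file]
    if result_count is not None:
        command += ["-n", str(result_count)]  # otherwise the default of 1000 applies
    if parameters:
        command += ["-p", format_model_parameters(parameters)]
    return command


# Placeholder parameter names taken from the help text.
print(format_model_parameters({"param1": 0.5, "param2": 350}))

command = search_command("title-config.xml", 1, "cystic-fibrosis.queries",
                         "results-title.txt", result_count=100)
print(" ".join(command))
```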

Now how do we know which retrieval models we can use? We can find out by running the program with the ‘-m’ parameter and ‘?’ as the value. Try it:

./lemur-search.py -m '?'

The program then prints out a list of the available retrieval models.

Available retrieval models:
1: Vector Space model with TFIDF weighting
2: Okapi BM25
3: Language modeling with Jelinek-Mercer smoothing
4: Language modeling with Dirichlet smoothing
5: Language modeling with absolute discounting smoothing

Each retrieval model also has different parameters. To find out which parameters are available, select your desired model and use the ‘-p’ parameter with the ‘?’ value. For instance, for model 3:

./lemur-search.py -m 3 -p '?'

will return

Language modeling with Jelinek-Mercer smoothing
* Name: JelinekMercerLambda
Description: Jelinek-Mercer lambda
Default value: 0.9

Model 3 has only a single parameter: the λ parameter, which specifies how weight is divided between the document language model and the background language model when smoothing. We will not go into this in more detail in this tutorial; from now on we will use the default parameter settings, but feel free to explore this in greater detail at a later point in time (for instance for your assignment).
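To give a feel for what λ does, here is a small sketch of Jelinek-Mercer smoothing as it is usually described in textbooks. It rests on two loudly stated assumptions: that λ is the weight on the background (collection) model, with 1 - λ going to the document model, and that all counts below are invented for illustration. Toolkits differ in which of the two models λ refers to, so check the Lemur documentation before interpreting your own parameter sweeps this way.

```python
from collections import Counter


def jelinek_mercer(term, document_tokens, collection_counts, collection_length, lam=0.9):
    """Smoothed P(term | document), assuming lam weights the background model."""
    doc_counts = Counter(document_tokens)
    p_document = doc_counts[term] / len(document_tokens)
    p_background = collection_counts.get(term, 0) / collection_length
    return (1.0 - lam) * p_document + lam * p_background


# Hypothetical toy data: one four-word document and made-up collection statistics.
document = ["calcium", "mucus", "mucus", "patients"]
collection_counts = {"calcium": 5, "mucus": 10, "patients": 50, "sweat": 20}
collection_length = 1000

p_seen = jelinek_mercer("mucus", document, collection_counts, collection_length)
p_unseen = jelinek_mercer("sweat", document, collection_counts, collection_length)
print(round(p_seen, 4), round(p_unseen, 4))
```

The point of smoothing is visible in the second value: ‘sweat’ does not occur in the document, yet it still receives a non-zero probability from the background model, so a query containing it does not drive the document score to zero.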

Let’s do some retrieval! The following command

./lemur-search.py -c title-config.xml -m 1 -q cystic-fibrosis.queries -o results-title.txt

matches each of the 1234 documents against each of our 100 queries. It uses the simple Vector Space model with TFIDF weighting

-m 1

takes the queries from the file ‘cystic-fibrosis.queries’

-q cystic-fibrosis.queries

and saves the results in ‘results-title.txt’

-o results-title.txt

If you look at the results file now, it contains the following data:

00001 Q0 00593 1 29.1289 Exp
00001 Q0 00950 2 26.2491 Exp
00001 Q0 00139 3 19.7766 Exp
00001 Q0 00128 4 16.621 Exp
00001 Q0 00754 5 15.9524 Exp
00001 Q0 01171 6 15.9222 Exp
00001 Q0 00302 7 15.8093 Exp
00001 Q0 00481 8 15.5401 Exp
00001 Q0 01188 9 14.9552 Exp
00001 Q0 00533 10 14.8648 Exp

Here, the first column contains the query ID, the third column contains the document ID, the fourth column is the rank of that document for this query, and the fifth column contains the retrieval score by which the documents are ranked. The second and sixth columns are always the same and not important for your experiments.
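The six-column result lines can be loaded back into Python for your own analysis. The following is a minimal sketch assuming every line has exactly the layout shown above; the function name ‘parse_results’ is made up for this example.

```python
def parse_results(lines):
    """Map each query ID to a rank-ordered list of (rank, document ID, score)."""
    rankings = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        query_id, _q0, doc_id, rank, score, _run = parts
        rankings.setdefault(query_id, []).append((int(rank), doc_id, float(score)))
    for entries in rankings.values():
        entries.sort()  # make sure entries are ordered by rank
    return rankings


# The first three result lines from the example run.
sample = [
    "00001 Q0 00593 1 29.1289 Exp",
    "00001 Q0 00950 2 26.2491 Exp",
    "00001 Q0 00139 3 19.7766 Exp",
]

rankings = parse_results(sample)
top_rank, top_doc, top_score = rankings["00001"][0]
print(top_doc, top_score)
```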

Evaluation

How well did our search engine perform using the particular combination of settings we selected? To find out, we need to evaluate the retrieved results (‘results-title.txt’) against what we know to be true (‘cystic-fibrosis.qrel’). We will use the standard program for IR evaluation: ‘trec_eval’. Installation is quick and painless. Just enter

wget http://trec.nist.gov/trec_eval/trec_eval_latest.tar.gz
tar -xzf trec_eval_latest.tar.gz
cd trec_eval.9.0/
make
sudo mv trec_eval /usr/local/bin/
cd ..
rm -rf trec_eval*

in your Terminal window and everything should be installed quickly.

Now you are ready to evaluate your results! We need to give ‘trec_eval’ two pieces of information: (1) where the relevance judgments are and (2) where the retrieved results are. If you enter

trec_eval cystic-fibrosis.qrel results-title.txt

then ‘trec_eval’ will calculate the most popular evaluation metrics for you. Its output should look like this:

runid                  all       Exp
num_q                  all       99
num_ret                all       21611
num_rel                all       2528
num_rel_ret            all       1074
map                    all       0.1858
gm_map                 all       0.1131
Rprec                  all       0.2312
bpref                  all       0.4686
recip_rank             all       0.6179
iprec_at_recall_0.00   all       0.6754
iprec_at_recall_0.10   all       0.5333
iprec_at_recall_0.20   all       0.4023
iprec_at_recall_0.30   all       0.2709
iprec_at_recall_0.40   all       0.1701
iprec_at_recall_0.50   all       0.1259
iprec_at_recall_0.60   all       0.0841
iprec_at_recall_0.70   all       0.0395
iprec_at_recall_0.80   all       0.0217
iprec_at_recall_0.90   all       0.0165
iprec_at_recall_1.00   all       0.0129
P_5                    all       0.3374
P_10                   all       0.2636
P_15                   all       0.2290
P_20                   all       0.2101
P_30                   all       0.1704
P_100                  all       0.0813
P_200                  all       0.0474
P_500                  all       0.0205
P_1000                 all       0.0108

Each row contains a separate evaluation metric, the set of queries it was calculated over (‘all’, second column), and the score for that metric. For instance, this retrieval run has achieved an MRR score (‘recip_rank’) of 0.6179, averaged over the 99 queries that ‘trec_eval’ evaluated (see ‘num_q’). Roughly speaking, the user has to scroll to position 1/0.6179 ≈ 1.62 in the ranked list (so between 1 and 2) to find the first relevant document; strictly, this figure is the harmonic mean of the first-relevant ranks rather than their ordinary average.
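The reciprocal-rank calculation is easy to reproduce by hand, which helps when interpreting the number. The sketch below uses two invented queries (‘q1’, ‘q2’) with invented document IDs; it is not the ‘trec_eval’ implementation, only an illustration of the definition.

```python
def reciprocal_rank(ranked_doc_ids, relevant_doc_ids):
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for position, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings, relevant):
    """Average the reciprocal rank over all queries in 'rankings'."""
    scores = [reciprocal_rank(docs, relevant.get(query, set()))
              for query, docs in rankings.items()]
    return sum(scores) / len(scores)


# Hypothetical toy run: first relevant document at rank 1 for q1 and rank 3 for q2.
rankings = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d6"]}
relevant = {"q1": {"d1"}, "q2": {"d6"}}

mrr = mean_reciprocal_rank(rankings, relevant)
print(round(mrr, 4), round(1 / mrr, 2))
```

Here the first relevant documents sit at ranks 1 and 3, so their ordinary average is 2, while 1/MRR comes out at 1.5. This is why 1/MRR should be read as a rough indication of where the first relevant document appears, not as the exact average rank.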

‘trec_eval’ supports many different evaluation metrics. For an explanation of some of the most common metrics, run

trec_eval -h -m all_trec | less

The

-m all_trec

option selects additional evaluation metrics, such as NDCG. If you would like to include these in the output as well, run

trec_eval -m all_trec cystic-fibrosis.qrel results-title.txt

For more information, consult the ‘trec_eval’ help text.
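When you compare many runs, copying numbers out of the ‘trec_eval’ output by hand gets tedious. The sketch below turns the three-column summary output shown earlier into a dictionary; it assumes that layout exactly (metric name, ‘all’, value), and the function name ‘parse_trec_eval’ is an invention for this example.

```python
def parse_trec_eval(output):
    """Map each metric name in trec_eval's summary output to its value."""
    metrics = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[1] != "all":
            continue  # ignore anything that is not a summary row
        name, _scope, value = parts
        try:
            metrics[name] = float(value)
        except ValueError:
            metrics[name] = value  # non-numeric rows such as 'runid'
    return metrics


# A few rows copied from the example output.
sample_output = """runid                  all       Exp
num_q                  all       99
map                    all       0.1858
recip_rank             all       0.6179
P_10                   all       0.2636
"""

metrics = parse_trec_eval(sample_output)
print(metrics["map"], metrics["P_10"])
```

You could capture the output of ‘trec_eval’ with ‘subprocess.run(..., capture_output=True, text=True)’ and feed its ‘stdout’ to this function, then collect the dictionaries of several runs into one table.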

Assignment

For your second assignment, you can choose between an Information Retrieval paper or a recommender systems paper. The IR paper assignment is described here.

For the IR paper you are supposed to write a 3-page paper about how to tune an IR system to perform as well as possible with a given collection. This involves selecting which fields to index, how to index (whether or not to use stopping and stemming), indexing the collection, selecting a number of IR models to test, conducting retrieval runs, selecting evaluation measures, and performing the evaluation.

As we do not have access to a cultural heritage test collection, we will use the Cystic Fibrosis collection. We will use Lemur as our IR system and ‘trec_eval’ to obtain evaluation scores. This facilitates experiments on a number of dimensions:

  • The effect of indexing different fields (titles; abstracts; subject headings; titles + abstracts + subject headings)
  • The effect of stopping and/or stemming
  • The effect of choice of IR model (Vector space with TF*IDF; Okapi BM25; Language modeling with three different types of smoothing)
  • The effect of IR model parameters
  • The effect of the different evaluation measures (what does each of them express?)

The goal of the paper is to evaluate the effect of at least three of the five dimensions and then to write a scientific report about your evaluation. You should briefly justify your selection of dimensions, carry out experiments, and describe and discuss your findings and try to offer explanations for the differences or similarities you see. Why do certain combinations outperform others? Why are some so similar in performance?
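To get a sense of how large the experimental space is, you can enumerate the combinations before running anything. The sketch below covers three of the dimensions listed above (fields, stemming and stopping, retrieval model) and leaves model parameters at their defaults; the dictionary keys and the function name ‘experiment_grid’ are made up for this illustration.

```python
from itertools import product

FIELDS = ["title", "abstract", "subjects", "all"]  # as in the collection filenames
STEMMERS = [None, "porter", "krovetz"]             # None means no stemming
STOPWORD_LISTS = [None, "stopwords.txt"]           # None means no stopword filtering
MODELS = [1, 2, 3, 4, 5]                           # model numbers from './lemur-search.py -m ?'


def experiment_grid():
    """Return one settings dict per combination of field, stemmer, stoplist and model."""
    runs = []
    for field, stemmer, stoplist, model in product(FIELDS, STEMMERS, STOPWORD_LISTS, MODELS):
        runs.append({
            "collection": f"cystic-fibrosis.{field}.sgml",
            "stemmer": stemmer,
            "stopword_list": stoplist,
            "model": model,
        })
    return runs


runs = experiment_grid()
print(len(runs), "possible runs")
print(runs[0])
```

The full grid is larger than a 3-page paper can discuss in detail, so a sensible approach is to pick a subset, for example fixing the best-performing field before varying the retrieval model.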

This does not necessarily mean you have to stop here. Try to think of all the interesting combinations you could evaluate and the comparisons you could make. What could you do, or have done differently, to get better results? Take some initiative!

Your paper is due Tuesday November 16, 2010 at 23:59. See the slides for lecture 4 and the updated overview of the number of pages for more details about how to hand in your assignment!