(COMP231)midterm00S_sol.pdf

=========================preview======================
(COMP231)midterm00S_sol.pdf
Back to COMP231 Login to download
======================================================
COMP 336 Information Retrieval Midterm Examination, Spring 2000 March 20, 2000

Name: Student ID:

1.
Using the basic KMP algorithm, determine the number of character positions to shift in the shift[] array for the pattern tomatoes [10]

shift

t
1

o
1

m
2

a
3

t
4

o
4

e
4

s
7

Explain briefly how the algorithm can be improved by examining the pattern character which causes a mismatch with the text. Show the new values of the shift array when the improvement is applied. [10] If after shifting the pattern, the new pattern character to be compared with the text string is the same as the pattern character that caused the mismatch, we can conclude that a match is not possible and thus shift the pattern further by a number of position as determined by the shift table entry for the new pattern character.

shift

t
1

o
1

m
2

a
3

t
5

o
5

e
4

s
7

Give two major differences between the KMP and BM algorithms. [6] Any two of the following (although the first two characterize BM better):

1)
comparison is from right to left

2)
the text character causing a mismatch is using to determine the shift (delta 1)

3)
repeated substrings in the pattern are obtained from right to left to determine the shift (delta 2)

Explain how we can use KMP pattern matching to support prefix, suffix or infix truncation in the pattern. [9]

Prefix and suffix: nothing need to be done; e.g., *xyz and xyz* are equivalent to searching for xyz
infix: ijk*xyz is equivalent to searching for ijk and upon a match search onwards for xyz

[morale of this question is that infix, prefix, and suffix truncation can be easily handled in full text scanning compared to an inverted file]
2.
Fill in the precision, recall and fallout rates in the following table. It is assumed that there are a total of 100 documents, and there are only 5 relevant documents, which are marked with a in the first column. [30]

Rank
doc ID
Recall
Precision
Fallout Rate

1
1001
0
0
1/95

2
2873
1/5
1/2
1/95

3
3916
2/5
2/3
1/95

4
0983
2/5
2/4
2/95

5
8310
2/5
2/5
3/95

6
7892
3/5
3/6
3/95

7
4562
3/5
3/7
4/95

8
4921
4/5
4/8
4/95

9
7934
1
5/9
4/95

10
9248
1
5/10
5/95

...
. . .
...
...
...
...

100
3861
1
5/100
95/95

Is the following precision/recall graph possible? Justify your answer. [5] precision = number of relevant document retrieved / number of documents retrieved
recall = number of relevant retrieved / number of relevant documents in collection

We can see that when precision = 0, recal