COMP 4321 Search Engine for Web and Enterprise Data Score:

Mid-Term Examination, Fall 2012

October 30, 2012 Time Allowed: 1 hour

Name: Student ID:

Note: Answer all questions in the space provided. Answers must be precise and to the point.

1. [15] Circle True or False in the following questions:

T F When you choose a search engine, you should always choose the one with the highest average precision.

T F When stemming has been applied to document terms, stemming must be applied to the query terms.

T F A large damping factor d in the PageRank formula will result in a larger number of iterations before convergence is reached.

T F In the vector space model, terms are assumed to be independent in the document collection.

T F Similarity between two queries can be defined in the same way as the similarity between a query and a document.

T F Cosine similarity measures the cosine of the angle between the document vector and the origin of the vector space.

T F Search Engine Optimization (SEO) is to optimize the ranking of a site in search engines.

T F A high PageRank means a page is more relevant to the query.

T F A phrase must be broken down into individual words and represented as individual words in the document vector.

T F Precision and recall must add up to 100%.

2. [5] Briefly explain why a search engine (e.g. Google, Bing) can respond to a query (return the relevant results) so fast. (List 3 reasons.)

Ans:

(1) The crawler crawls web pages from time to time, and comprehensive indexing is done in advance.

(2) Smart pattern-matching algorithms are used.

(3) Web pages are stored, and algorithms are run, on distributed systems.

(4) The search engine may have cached the results of the queries.

(5) PageRank values of the pages can be pre-computed, etc.

Note: The first is essential. Other coherent answers will also be accepted. Students who give at least two points get full marks.
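Points (1) and (4) can be illustrated with a minimal Python sketch: an inverted index is built once, ahead of any query, and query results are cached. The function names `build_index` and `search`, the toy documents, and the simple AND-query semantics are assumptions made for illustration only; real search engines are far more elaborate.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document IDs containing it (done offline)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query, cache):
    """Answer an AND query from the prebuilt index, reusing cached results."""
    if query in cache:
        return cache[query]
    terms = query.lower().split()
    # index.get avoids inserting empty entries for unknown terms.
    postings = [index.get(t, set()) for t in terms]
    result = sorted(set.intersection(*postings)) if postings else []
    cache[query] = result
    return result

# Hypothetical toy collection.
docs = {1: "web search engine", 2: "enterprise search", 3: "web crawler"}
index = build_index(docs)
cache = {}
print(search(index, "web search", cache))
```

At query time only the posting sets of the query terms are touched, not the documents themselves, and a repeated query is answered directly from `cache`.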

3. (a) [15] The table below shows the term frequencies of the terms, T1, T2, T3 and T4, in three documents, D1, D2 and D3.

      T1   T2   T3   T4   tfmax

D1     2    1    1    0     2

D2     1    2    0    0     2

D3     0    2    0    4     4

Furthermore, there are a total of 1000 documents in the collection, and the document frequencies for T1 to T4 are:

df(T1) = 20, df(T2) = 30, df(T3) = 10, df(T4) = 20.

Using the (tf/tfmax) * idf weighting strategy, obtain the term weights of each term in each document.

D1:

W(T1) = 2/2 * log2(1000/20) = 5.64;

W(T2) = 1/2 * log2(1000/30) = 2.53;

W(T3) = 1/2 * log2(1000/10) = 3.32;

W(T4) = 0/2 * log2(1000/20) = 0;

D1 = <5.64, 2.53, 3.32, 0>;

D2:

W(T1) = 1/2 * log2(1000/20) = 2.82;

W(T2) = 2/2 * log2(1000/30) = 5.06;

W(T3) = 0/2 * log2(1000/10) = 0;

W(T4) = 0/2 * log2(1000/20) = 0;

D2 = <2.82, 5.06, 0, 0>;

D3:

W(T1) = 0/4 * log2(1000/20) = 0;

W(T2) = 2/4 * log2(1000/30) = 2.53;

W(T3) = 0/4 * log2(1000/10) = 0;

W(T4) = 4/4 * log2(1000/20) = 5.64;
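
The hand calculation can be cross-checked with a short Python sketch. It assumes, as in the worked answer, that idf is taken as log base 2 of N/df; the names `tf`, `df`, `N` and `tfidf_weights` are chosen here for illustration and are not part of the exam paper.

```python
import math

# Term frequencies from the table (rows: documents, columns: T1..T4).
tf = {
    "D1": [2, 1, 1, 0],
    "D2": [1, 2, 0, 0],
    "D3": [0, 2, 0, 4],
}
df = [20, 30, 10, 20]  # document frequencies of T1..T4
N = 1000               # total number of documents in the collection

def tfidf_weights(freqs, df, n_docs):
    """Return (tf / tfmax) * log2(N / df) for each term of one document."""
    tf_max = max(freqs)
    return [f / tf_max * math.log2(n_docs / d) for f, d in zip(freqs, df)]

# Round to two decimals, as in the worked answer.
weights = {
    doc: [round(w, 2) for w in tfidf_weights(freqs, df, N)]
    for doc, freqs in tf.items()
}
for doc, w in weights.items():
    print(doc, w)
```

Each printed row is the weight vector of one document, in the order T1 to T4.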