
B. Vocabulary

4. Give English-Russian equivalents of the following words and expressions:

entity; размер, степень; item; трихотомия (деление на три части, на три элемента); surrogate; совмещение, наложение; inference; детализировать, уточнять; trite; (логический) вывод, умозаключение; iterate; элемент (данных); extent; заменять, замещать; refine; объект, категория; peg; замена; trichotomy; повторять, говорить или делать что-то еще раз; substitute; банальный, избитый, неоригинальный; overlap; стержень.

5. Find the word belonging to the given synonymic group among the words and word combinations from the previous exercise:


  1. size, amount, degree, level, scope;

  2. specify, make more exact/precise/accurate, itemize, work out in detail;

  3. unit, thing, object, matter;

  4. repeat, review, follow;

  5. coincidence, combining, matching, overlay, stacking;

  6. replace with, exchange, use instead;

  7. stale, banal, commonplace, unoriginal;

  8. substitute, replacement, stand-in, deputy;

  9. element, character, cell;

  10. core, pivot, stem, bar;

  11. conclusion, deduction, supposition, assumption, suggestion.

C. Reading and Discussion

6. Translate the words. Read the text and answer the questions: 1) What is needed for evaluating the performance of information retrieval systems? 2) How are documents represented for the efficiency of information retrieval? 3) What are the performance measures?

ill-posed

cutoff

immanent

fraction

recall

transcendent

fallout

set-theoretic

tuple

dimension

fuzzy

scalar value

Performance Measures

Many different measures for evaluating the performance of information retrieval systems have been proposed. The measures require a collection of documents and a query. All common measures described here assume a ground truth notion of relevancy: every document is known to be either relevant or non-relevant to a particular query. In practice queries may be ill-posed and there may be different shades of relevancy.

Precision

Precision is the fraction of the documents retrieved that are relevant to the user's information need.

In binary classification, precision is analogous to positive predictive value. Precision takes all retrieved documents into account. It can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. This measure is called precision at n or P@n.

Note that the meaning and usage of «precision» in the field of Information Retrieval differs from the definition of accuracy and precision within other branches of science and technology.
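To make the definitions above concrete, here is a minimal Python sketch; the document IDs and relevance judgments are invented for illustration:

```python
# Hypothetical ranked list of retrieved documents and a ground-truth
# set of relevant documents (both invented for illustration).
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d5", "d7"}

# Precision: fraction of retrieved documents that are relevant.
precision = sum(1 for d in retrieved if d in relevant) / len(retrieved)
print(precision)  # 3 of 5 retrieved are relevant -> 0.6

# Precision at n (P@n): the same ratio over the top n results only.
def precision_at_n(retrieved, relevant, n):
    return sum(1 for d in retrieved[:n] if d in relevant) / n

print(precision_at_n(retrieved, relevant, 3))  # d1 and d3 are relevant -> 2/3
```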

Recall

Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.

In binary classification, recall is called sensitivity. So it can be looked at as the probability that a relevant document is retrieved by the query.

It is trivial to achieve recall of 100% by returning all documents in response to any query. Therefore recall alone is not enough but one needs to measure the number of non-relevant documents also, for example by computing the precision.
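Continuing the same kind of toy example (again with invented data), recall can be computed as:

```python
# Invented example: 4 documents are relevant overall, the query retrieved 3.
retrieved = {"d1", "d2", "d3"}
relevant = {"d1", "d3", "d5", "d7"}

# Recall: fraction of all relevant documents that were retrieved.
recall = len(retrieved & relevant) / len(relevant)
print(recall)  # 2 of 4 relevant documents retrieved -> 0.5

# Returning every document trivially pushes recall to 1.0, which is why
# recall is reported together with precision.
```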

Fall-Out

In binary classification, fall-out is closely related to specificity. More precisely: fall-out = 1 - specificity. It can be looked at as the probability that a non-relevant document is retrieved by the query.

It is trivial to achieve fall-out of 0 % by returning zero documents in response to any query.
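A sketch under the same invented setup; here the whole collection must be known, since fall-out is computed over the non-relevant documents:

```python
# Invented collection of 10 documents.
collection = {f"d{i}" for i in range(1, 11)}
relevant = {"d1", "d3"}
retrieved = {"d1", "d2", "d4"}

# Fall-out: fraction of non-relevant documents that were retrieved.
non_relevant = collection - relevant  # 8 documents
fallout = len(retrieved - relevant) / len(non_relevant)
print(fallout)  # d2 and d4 are non-relevant hits -> 2/8 = 0.25
```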

Average Precision of Precision and Recall

Precision and recall are based on the whole list of documents returned by the system. Average precision emphasizes returning more relevant documents earlier.
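One common way to compute average precision (sketched here with invented rankings) is to average the precision values at the ranks where relevant documents occur:

```python
def average_precision(ranked, relevant):
    """Average of precision values at each rank holding a relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Placing relevant documents earlier yields a higher score:
early = average_precision(["r1", "x", "r2"], {"r1", "r2"})  # (1/1 + 2/3) / 2
late = average_precision(["x", "r1", "r2"], {"r1", "r2"})   # (1/2 + 2/3) / 2
print(early > late)  # True
```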

For information retrieval to be efficient, the documents are typically transformed into a suitable representation. There are several representations which can be illustrated by the relationship of some common models. The models are categorized according to two dimensions: the mathematical basis and the properties of the model.

First Dimension: Mathematical Basis

Set-theoretic models represent documents as sets of words or phrases. Similarities are usually derived from set-theoretic operations on those sets. Common models are: standard Boolean model, extended Boolean model, fuzzy retrieval.

Algebraic models usually represent documents and queries as vectors, matrices or tuples. The similarity of the query vector and document vector is represented as a scalar value.
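For instance, under a simple vector space model (with term-frequency vectors over an invented three-word vocabulary), the cosine of the angle between the two vectors can serve as that scalar:

```python
import math

# Invented term-frequency vectors over the vocabulary
# ["retrieval", "model", "query"].
doc = [2, 1, 0]
query = [1, 0, 1]

# Cosine similarity: dot product divided by the product of the norms.
dot = sum(d * q for d, q in zip(doc, query))
norms = math.sqrt(sum(d * d for d in doc)) * math.sqrt(sum(q * q for q in query))
similarity = dot / norms  # a single scalar value
print(round(similarity, 4))  # 2 / sqrt(10) -> 0.6325
```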

Probabilistic models treat the process of document retrieval as a probabilistic inference. Similarities are computed as probabilities that a document is relevant for a given query. Probabilistic theorems like Bayes' theorem are often used in these models.
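A toy illustration of the Bayesian step (all probabilities invented): given how often a query term appears in relevant versus non-relevant documents, Bayes' theorem yields the probability that a document containing the term is relevant:

```python
# Invented corpus statistics for a single query term.
p_term_given_rel = 0.8   # P(term | relevant)
p_term_given_non = 0.1   # P(term | non-relevant)
p_rel = 0.05             # prior P(relevant)

# Total probability of observing the term, then Bayes' theorem.
p_term = p_term_given_rel * p_rel + p_term_given_non * (1 - p_rel)
p_rel_given_term = p_term_given_rel * p_rel / p_term
print(round(p_rel_given_term, 3))  # 0.04 / 0.135 -> 0.296
```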

Second Dimension: Properties of the Model

Models without term-interdependencies treat different terms/words as independent. This fact is usually represented in vector space models by the orthogonality assumption of term vectors or in probabilistic models by an independency assumption for term variables.

Models with immanent term interdependencies allow a representation of interdependencies between terms. However, the degree of the interdependency between two terms is defined by the model itself. It is usually directly or indirectly derived (e.g. by dimensional reduction) from the co-occurrence of those terms in the whole set of documents.

Models with transcendent term interdependencies allow a representation of interdependencies between terms, but they do not allege how the interdependency between two terms is defined. They rely on an external source for the degree of interdependency between two terms (for example, a human or sophisticated algorithms).

7. State whether the following statements are true or false. Correct the false ones.

1. Fall-out can be looked at as the probability that a non-relevant document is retrieved by the query and is called sensitivity.

  2. Recall is the fraction of the documents retrieved that are relevant to the user's information need and is not enough alone but one needs to measure the number of non-relevant documents also. It is closely related to specificity.

  3. Precision is the fraction of the documents that are relevant to the query that are successfully retrieved, analogous to positive predictive value that can also be evaluated at a given cut-off rank.

  4. Set-theoretic models treat the process of document retrieval as a probabilistic inference. Similarities are usually derived from set-theoretic operations on those sets.

  5. Algebraic models represent documents and queries usually as vectors, matrices or tuples. The similarity of the query vector and document vector is represented as a scalar value.

  6. Probabilistic models represent documents as sets of words or phrases. Similarities are computed as probabilities that a document is relevant for a given query. Probabilistic theorems like Bayes' theorem are often used in these models.

  7. Models with transcendent term interdependencies allow a representation of interdependencies between terms.

  8. Models with immanent term interdependencies treat different terms/words as independent. However the degree of the interdependency between two terms is defined by the model itself.

  9. Models without term-interdependencies allow a representation of interdependencies between terms, but they do not allege how the interdependency between two terms is defined.

8. Translate the text without a dictionary.