On the Role of Words and Phrases in Automatic Text Analysis

Author: Gerard Salton/Anita Wong
Pages: 291-321

Gerard Salton received a Ph.D. in Applied Mathematics at Harvard in 1958. He remained on the applied mathematics faculty at Harvard until 1963, when he joined the Computer Science Department at Cornell, where he is now Professor of Computer Science. Dr. Salton has worked on a number of non-numeric computer applications, including, in particular, automatic information retrieval. From 1965 to 1968 he was editor-in-chief of ACM Communications, and from 1969 to 1972 editor-in-chief of the ACM Journal. At the present time he is chairman of the Computer Science Department at Cornell. Dr. Salton is the author of several books, the most recent one being Dynamic Information and Library Processing.

    This study was supported in part by the National Science Foundation under grant GJ 43505. Editor's Note: This article is reprinted from "Computers and the Humanities", Vol. 10, 1976.



1. Automatic Content Analysis

In automatic information retrieval, the first and possibly most crucial operation consists in assigning to the stored documents and to incoming user queries appropriate identifiers capable of representing information content. The difficulty of the task is illustrated by the fact that in practically all existing operational retrieval systems this indexing operation is carried out manually by trained indexers or subject experts, rather than automatically. Fully automatic indexing methods in which a computer is used to generate and assign these content identifiers are restricted to special, mostly laboratory-type environments [1], [2].

Quite a few observers claim, in fact, that an indexing operation (that is, a reduction of written texts to individual "units of expression" for the representation of content) is inherently impossible, irrespective of the manner in which it is carried out, and that retrieval or text processing systems based on such content identifiers can never operate satisfactorily:

It is a logical category mistake to assume that a word, or a phrase, contains information in the same sense in which a statement does. In spite of prima facie appeal, the information content of a statement is not the sum, or combination, of the information content of its constituent phrases... [3].


This type of argument leads to the conclusion that index terms and phrases are not substitutes for more complete content identifications, and that term sets cannot therefore function as miniature or condensed documents. The latter assertion is reinforced by showing that the juxtaposition of terms in the language is in no way comparable to the intersection, or union, of the document sets identified by the corresponding terms. Thus, the set of documents related to "fish food" is not identical with the set identified by both "fish" and "food", and neither is a "Newfoundland dog" something characterizable by the term pair "Newfoundland" and "dog" [3].
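The "fish food" argument can be made concrete with a toy collection. The following is a minimal sketch, not part of the original study: the four document texts and the helper names (`tokenize`, `docs_with_all_terms`, `docs_with_phrase`) are assumptions chosen for illustration. It compares the set of documents that contain both terms with the set in which the two terms occur as an adjacent phrase.

```python
# Toy collection; the texts are invented purely for illustration.
DOCS = {
    1: "fish food for tropical aquarium species",
    2: "fish are an important food for coastal communities",
    3: "dog food and fish bones",
    4: "the newfoundland dog is a large working breed",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def docs_with_all_terms(docs, terms):
    """Intersection of the document sets identified by each single term."""
    wanted = set(terms)
    return {doc_id for doc_id, text in docs.items()
            if wanted <= set(tokenize(text))}


def docs_with_phrase(docs, phrase):
    """Documents in which the phrase words occur adjacent and in order."""
    words = tokenize(phrase)
    n = len(words)
    hits = set()
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            hits.add(doc_id)
    return hits


both_terms = docs_with_all_terms(DOCS, ["fish", "food"])
phrase_docs = docs_with_phrase(DOCS, "fish food")
print(sorted(both_terms), sorted(phrase_docs))
```

In this toy collection the term intersection picks up documents 2 and 3, neither of which is about fish food. Note that even adjacency matching is only a surface approximation of the phrase's meaning.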

Instead of reducing the representation of written texts to a set of simple terms only, a full theory of language understanding appears to be needed which would account for the complete stated and implied content of the texts. Such a theory of language understanding should be capable of identifying not only an appropriate set of content indicators, but also two main types of relations between indicators:

  1. the logical-semantic relations between text units which are dependent on the world knowledge and on the social context within which a given area of discourse is placed;

  2. the linguistic-semantic relations which are dependent on the linguistic context and are derivable from a knowledge of the "deep" structure of the texts [4].

    For determining the logical-semantic relations, an encyclopedia or semantic net is often suggested to identify the scope and extent of a given subject area, and the known relationships between the concepts included in the field. The linguistic-semantic relations, on the other hand, are obtainable by using a combined syntactic-semantic analysis to generate a detailed structure of the written texts. When the world (encyclopedic) knowledge is combined with the linguistic analysis, a text should then be representable as a series of "inference chains" representing the line of thought expressed in the text, including both the stated and the unstated assumptions and conditions [5].

    Unfortunately, a large number of unresolved problems interfere at present with the utilization of language understanding systems. There is uncertainty about the primitive entities needed to represent information content, and about the relations which may exist between individual concepts. Furthermore, it is unclear how best to represent the dictionary, encyclopedia, semantic map, etc., which may be used to specify the necessary world knowledge. Finally, no viable prescriptions exist for resolving many ambiguities, discarding useless context, and generally reaching the proper conclusions in the construction of the inference chains.

    Moreover, even if a satisfactory theory of language understanding could be implemented, it would constitute only a beginning, and not the end, for information retrieval purposes. Additional problems would then arise in the determination of appropriate search strategies through the respective inference chains, and in deciding on retrieval, or nonretrieval, of the various documents based on available user relevance specifications.

    In practice, it is then necessary for the foreseeable future to fall back on less sophisticated methodologies. While a language understanding theory may not be available, a good deal is known about the structure of written texts: certain words are used frequently, others are rare; some words are quite short, others are much longer; some words occur in every document in a given collection, others are concentrated in only a few documents. Based on such objective criteria, a theory of term value can be developed and incorporated into an appropriate indexing strategy.

2. Term Value and Indexing Theories

    A great many studies have been made in the recent past of the occurrence characteristics of words in running texts, and of the manner in which single terms can be combined with each other to form larger text units such as phrases. It is not possible in the present context to review in detail all the known facts in this connection. It may suffice to mention only the most important properties first for single terms and then for term combinations.

    Single terms may first be characterized by their overall frequency of occurrence in written texts. The following properties may be cited in this connection [6]:

  3. High-frequency words that occur repeatedly in written texts are shorter on the average (they include fewer phonemes) than low-frequency words; they often exhibit a large number of different meanings; and they enter as components into many different permutations (phrases).

  4. Low-frequency words, on the other hand, are often longer in terms of phonemes; they exhibit fewer distinct meanings, and they are used for a smaller number of different phrases.

    Few conclusions can be drawn about the value of a term in retrieval based on raw word frequency statistics alone. Experience indicates, however, that very high-frequency words are not optimal as content indicators because such words cannot be used to discriminate the texts from each other. Very low-frequency terms tend to be more specific, but since they occur so rarely, they lead to few matches between incoming queries and stored documents, and their effectiveness for retrieval is therefore also questionable.
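The observation about the two frequency extremes can be sketched as a simple band filter over collection frequencies. This is an illustration under stated assumptions, not the authors' procedure: the three sample texts, the function names (`collection_frequencies`, `mid_frequency_terms`) and the cutoffs `low` and `high` are all hypothetical.

```python
from collections import Counter


def collection_frequencies(docs):
    """Total number of occurrences of each word over the whole collection."""
    counts = Counter()
    for text in docs:
        counts.update(text.lower().split())
    return counts


def mid_frequency_terms(counts, low, high):
    """Keep terms whose collection frequency lies in [low, high].

    The cutoffs are arbitrary illustration values, not recommended settings.
    """
    return sorted(term for term, c in counts.items() if low <= c <= high)


# Invented sample texts.
DOCS = [
    "the retrieval of the documents",
    "the indexing of the documents",
    "the retrieval system and the thesaurus",
]

counts = collection_frequencies(DOCS)
candidates = mid_frequency_terms(counts, low=2, high=3)
print(counts.most_common(3), candidates)
```

The filter discards the very frequent "the" and the terms that occur only once, yet the function word "of" survives in the middle band. This is consistent with the caution above that raw frequency statistics alone support few conclusions about term value.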

    To learn more about the usefulness of single terms, it is necessary to look at the frequency distribution of the terms across the documents of a collection, and at the total document frequency of a term (the number of distinct documents in which the term occurs, or to which it is assigned). Once again, extreme distributions are not optimal:

  5. Terms whose frequency distribution is quite flat, in that they occur with approximately equal frequency in many documents of a collection, are not very effective as content indicators because they cannot distinguish the documents in which they occur from each other.

  6. Terms whose frequency distribution is very concentrated and peaked (they occur with high frequency in one or two documents only and are absent from the rest) are also not optimal because their inclusion as part of a user query affects very few items.

    The best evidence obtainable from a frequency characterization of word occurrences in texts indicates that the most useful content indicators are terms whose overall occurrence frequencies are neither too large nor too small, and whose frequency distributions across the documents of a collection are skewed: they occur with high frequencies in some documents and with much lower frequencies in others.
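The distinction between flat, peaked, and skewed distributions can be illustrated by tabulating each term's per-document frequencies. The sketch below is hypothetical: the sample texts, the function names, and the thresholds `flat_tolerance` and `peaked_df` are assumptions introduced here to make the three cases visible, not values taken from the study.

```python
from collections import Counter


def term_profiles(docs):
    """Map each term to its list of per-document occurrence frequencies."""
    per_doc = [Counter(text.lower().split()) for text in docs]
    vocabulary = set().union(*per_doc)
    return {term: [c[term] for c in per_doc] for term in vocabulary}


def document_frequency(profile):
    """Number of distinct documents in which the term occurs."""
    return sum(1 for f in profile if f > 0)


def classify(profile, flat_tolerance=1, peaked_df=1):
    """Crude labelling of a frequency profile (thresholds are assumptions)."""
    df = document_frequency(profile)
    # Flat: present everywhere with approximately equal frequency.
    if df == len(profile) and max(profile) - min(profile) <= flat_tolerance:
        return "flat"
    # Peaked: concentrated in very few documents, absent from the rest.
    if df <= peaked_df:
        return "peaked"
    # Otherwise: high in some documents, lower or absent in others.
    return "skewed"


# Invented sample texts.
DOCS = [
    "system index index index system",
    "system query query system",
    "system zeta zeta zeta zeta system",
    "system query index system",
]

profiles = term_profiles(DOCS)
labels = {term: classify(p) for term, p in profiles.items()}
print(labels)
```

Here "system" occurs twice in every document (flat), "zeta" occurs in one document only (peaked), and "index" and "query" vary across the documents in which they appear (skewed). On the criteria described above, the last two are the most promising content indicators.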

    One additional complicating factor must be mentioned in connection with the analysis of term importance based on occurrence frequencies and frequency distribution characteristics. Obviously, among all the terms possessing a given frequency profile, the most valuable will be those that are concentrated in documents specified by the users as relevant to their respective queries, rather than those prevalent mostly among the non-relevant items. This leads to the concept of term precision as the balance between the relevant and the nonrelevant documents in which a term occurs [7]. Ideally, the term precision should be used as a term value indication in addition to the frequency characteristics mentioned earlier. In practice, the term precision may not be easy to utilize in the absence of a sufficient number of reliable relevance assessments of documents with respect to user queries.
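One simple way to quantify such a balance is the fraction of the documents containing a term that have been judged relevant. The sketch below uses that ratio purely as an assumption for illustration; it is not necessarily the formal definition given in [7]. The document texts, the relevance judgments in `RELEVANT`, and the function name `term_precision` are all hypothetical.

```python
def term_precision(term, docs, relevant_ids):
    """Share of the documents containing `term` that are judged relevant.

    Returns None when the term occurs in no document. This ratio is an
    illustrative stand-in, not the definition from the cited reference.
    """
    containing = {doc_id for doc_id, text in docs.items()
                  if term in text.lower().split()}
    if not containing:
        return None
    return len(containing & relevant_ids) / len(containing)


# Invented collection and invented relevance assessments for one query.
DOCS = {
    1: "automatic indexing of documents",
    2: "manual indexing by experts",
    3: "automatic translation of texts",
    4: "automatic indexing experiments",
}
RELEVANT = {1, 4}

for term in ("experiments", "indexing", "experts"):
    print(term, term_precision(term, DOCS, RELEVANT))
```

The sketch also shows the practical difficulty noted above: the value depends entirely on `RELEVANT`, so without a sufficient number of reliable relevance assessments the measure cannot be computed meaningfully.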

    While formal frequency properties may be adequate to characterize single terms, additional criteria are normally introduced for larger text units, such as phrases and term classes...
