Understanding and modeling document collections was successful long ago, before any computer systems became available: good reference librarians modeled collections of traditional documents in their heads (and notebooks!) and successfully used these models. Their approach is based on abstraction1 and precision - the currently recognized foundation of object information management [Kilov, Ross 94; odp 2]. Indeed, a reference librarian - a taxonomist - abstracts away irrelevant details, such as layout and presentation, when building and using a model of documents in the library. Different contents details may also be considered irrelevant, depending upon the modeled viewpoint. Furthermore, the reference librarian's model is precise, so that it is possible to answer questions like "do you have information about XYZ and if yes, where can I get it?". In other words, a model of this kind is essential to help the user who is not interested in a fancy document, but rather is interested in solving a business problem.
Unfortunately, all too often such models are not presented explicitly. With a drastic increase in the amount of available information, the absence of explicit models becomes a serious problem: a reference librarian is not always available, and information management systems are useful only to the extent they help the user understand and retrieve the information he needs to solve a business problem.
* Copyright © 1994, Bell Communications Research Inc. (Bellcore). Permission to use, copy, modify and distribute this material for any lawful purpose and without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies, and that the name of Bellcore not be used in advertising or publicity pertaining to this material without the specific, prior written permission of an authorized representative of Bellcore. Bellcore makes no representations or warranty, express or implied, about the accuracy, sufficiency, or suitability of this material for any purpose. It is provided "as is", without any express or implied warranties. Bellcore expressly disclaims any liability for any damage or injury incurred by any person arising out of the sufficiency, accuracy, or utility of any information contained herein. Any use of this material is at the sole risk of the user.
Different users may have different viewpoints on a document collection: they are interested in different collections of information fragments. In other words, the contexts within which the users solve their problems are quite different and do not always correspond to the contexts within which the document authors have created their documents. Thus, the document models created - implicitly or explicitly - by document authors may be quite different from document models needed by document users. This phenomenon is well known to, e.g., students who highlight fragments of textbooks needed to pass a test; or to attorneys who highlight those existing document fragments that they need to support their arguments.
Obviously, understanding a document collection should not depend on its presentation (paper or electronic). Electronic document management may provide additional functionality (such as keyword search) to document users, but should be able to provide at least the same functionality as existing traditional document management. This has not always been the case: existing approaches to understanding electronic documents have too often - explicitly or implicitly - been based on existing tools. These tools usually underemphasized document contents and overemphasized layout and representation. As an example, consider highlighting by paper document users: how many electronic document management systems support user-created document models?
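One way to picture a user-created document model, such as highlighting, is as a named selection of content fragments kept separately from any layout. The following is only a minimal sketch under that assumption: the class names `Document` and `Viewpoint`, the fragment identifiers, and the sample texts are hypothetical and are not drawn from this paper or from any existing system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """Content fragments keyed by identifier; layout is deliberately absent."""
    title: str
    fragments: Dict[str, str]


@dataclass
class Viewpoint:
    """A user's model of a document: the fragments relevant to one problem."""
    owner: str
    purpose: str
    selected: List[str] = field(default_factory=list)

    def highlight(self, doc: Document, fragment_id: str) -> None:
        # Reject references to fragments the document does not contain.
        if fragment_id not in doc.fragments:
            raise KeyError(fragment_id)
        if fragment_id not in self.selected:
            self.selected.append(fragment_id)

    def view(self, doc: Document) -> List[str]:
        # Return the selected fragment texts in the order they were highlighted.
        return [doc.fragments[fid] for fid in self.selected]


# Hypothetical sample data: one document, two users with different problems.
textbook = Document(
    title="Hypothetical textbook",
    fragments={
        "f1": "Definition of a disclaimer.",
        "f2": "Historical background.",
        "f3": "Conditions under which a disclaimer applies.",
    },
)

student = Viewpoint(owner="student", purpose="pass a test")
student.highlight(textbook, "f1")

attorney = Viewpoint(owner="attorney", purpose="support an argument")
attorney.highlight(textbook, "f3")
attorney.highlight(textbook, "f1")

print(student.view(textbook))
print(attorney.view(textbook))
```

The author's document is left unchanged, and each user's model is a separate object that refers to it. This mirrors how two readers can highlight the same paper text differently.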
Keyword search looks like a useful content-related electronic document management facility. However, its suitability for contents specification is rather limited. Consider, for example, the same document content element represented differently in different natural languages2; or for different audiences (legal specialists or laymen)3. In all these cases, the same concept is represented by means of very different keywords, and there is no guarantee that the document author's keyword will be the same as the document user's. Other approaches, such as using SGML, succeed in separating a document's physical presentation from its logical layout and contents, but do not clearly separate contents from layout. In addition, they are applicable to restricted classes of documents, mostly hierarchical. Moreover, many database management systems treat documents just as "unstructured data". Document users were less than happy with these approaches and requested a better framework to deal with documents.
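The keyword mismatch described above can be illustrated with a small sketch. It assumes that each fragment has been explicitly annotated with the concept it expresses. The sample texts, the concept labels, and the functions `keyword_search` and `concept_search` are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A document fragment annotated with the concept it expresses."""
    text: str
    concept: str  # hypothetical concept label assigned by a modeler


# The same content element written for different audiences and languages.
FRAGMENTS = [
    Fragment("The publisher expressly disclaims any liability for damage.",
             "liability_disclaimer"),
    Fragment("Use this material at your own risk.",
             "liability_disclaimer"),
    Fragment("Jegliche Haftung wird ausgeschlossen.",
             "liability_disclaimer"),
    Fragment("Permission to copy this material is granted.",
             "permission_grant"),
]


def keyword_search(fragments, keyword):
    """Return fragments whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [f for f in fragments if needle in f.text.lower()]


def concept_search(fragments, concept):
    """Return fragments annotated with the given concept."""
    return [f for f in fragments if f.concept == concept]


by_keyword = keyword_search(FRAGMENTS, "liability")
by_concept = concept_search(FRAGMENTS, "liability_disclaimer")

# Keyword search finds only the fragment that happens to use the user's word;
# concept search finds every fragment that expresses the concept.
print(len(by_keyword), len(by_concept))
```

The sketch does not say how concept annotations are produced. It only shows that, once contents are specified explicitly, retrieval no longer depends on the author and the user choosing the same word.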
Highly qualified document users (attorneys, reference librarians, tax accountants, and so on) successfully use document contents to look for document fragments they need. They use document contents specifications implicitly: in human-to-human communication you can always ask additional questions and you can rely on a possibly implicit context. In computer-based information systems you cannot do that: a system has to be instructed precisely and explicitly. Therefore explicit document contents management is essential for solving electronic document management problems. Fortunately, there is no need to reinvent document management concepts from scratch: document management is no different from managing any large and complex application. Therefore the same information modeling concepts that have been successfully used in specifying other businesses, such as telecommunications, can be successfully used in specifying the business of document management. This paper will provide a short overview of information modeling in document management, and will show an example - a collection of legal documents (disclaimers) that has been successfully understood, modeled, made mutually consistent, and used.
@Let Us Use Information Modeling
Understanding the semantics of a document collection can be accomplished using the concepts and constructs needed to understand the semantics of any other enterprise [Kilov, Ross 94; odp 2]. These concepts and constructs, described in information modeling, are independent of any particular methodology or tool. They may be used to create an abstract and precise model - the specification - both of an application and of documents that describe the application. Such a specification will explicitly describe the semantics of documents (concept maps) and assure interoperability between documents belonging to different collections.
The concepts and approach used in document management are applicable to any collection of documents: be they paper or electronic; "linear" or hypertext ones. The intellectual contents of a collection of documents does not depend upon its layout or the presence of hypertext tools. Therefore we need to clearly separate document management concepts from implementation technology. This will permit us to concentrate on the...