Thursday, September 28, 2006

Converging classification schemes of documents and records - Part 4 - Keywords, thesaurus and tagging

For the last few days I have been posting about some of the approaches to classifying documents and records in Electronic Document and Records Management Systems (EDRMS), with the aim of seeing how the convergence of all of these schemes may lead to a more usable classification mechanism, as we have seen emerge with tagging on the Web. So far I have covered:

In this post I want to look at the thesaurus as an approach to help users in accurately applying metadata, since I believe that this is getting very close to the Web tagging paradigm.

Thesaurus / Controlled Vocabulary

In records management, a thesaurus or controlled vocabulary is used to help records managers apply metadata to records more consistently, using terms that fall within a recognized taxonomy.

In yesterday's post I talked about taxonomies as being a way of classifying documents according to a predefined scheme. This has the advantage of guiding users to pick from available and recognized items when identifying their documents. A fileplan is a specialized form of taxonomy that provides a representation of the business or filing structure that documents relate to, and also provides some extra notation to assist in efficient filing and retrieval (the filecode).

A thesaurus is a specialized way of representing a taxonomy. It is used to add identification metadata to documents along a specific classification dimension. It is limited to a specific domain or topic and is not intended to fully define the record.

A language thesaurus organizes the English language vocabulary and defines relationships between the 'literary' words within it. A records management thesaurus focuses on a specific domain or type of activity (rather than the whole language), laying out the set of acceptable words or keywords that make up the vocabulary and defining the relationships between them. A typical way of doing this is to provide a tree of words, starting with the most general or broader terms within the topic and working towards the most tightly defined or narrower terms. The aim of the thesaurus is to ensure that keywords are used consistently, so additional descriptions and scope notes are provided to elaborate on each term and reduce the chance of different people interpreting words differently. Within this hierarchy, there can also be relationships that cut across branches to show related terms.
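The structure described above (broader and narrower terms, scope notes, and related terms that cut across branches) can be sketched as a small data structure. This is only an illustration in Python: the class names, the methods and the sample terms ('Personnel', 'Recruitment' and so on) are invented for the example and are not taken from any real thesaurus.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Term:
    """One keyword in the controlled vocabulary (hypothetical structure)."""
    name: str
    scope_note: str = ""               # guidance on how the term should be used
    broader: Optional[str] = None      # parent term; None for a top-level term
    narrower: List[str] = field(default_factory=list)
    related: Set[str] = field(default_factory=set)  # links across branches


class Thesaurus:
    """A tree of terms for one domain, with cross-branch 'related' links."""

    def __init__(self) -> None:
        self.terms: Dict[str, Term] = {}

    def add(self, name: str, scope_note: str = "",
            broader: Optional[str] = None) -> None:
        if name in self.terms:
            raise ValueError(f"duplicate term: {name}")
        if broader is not None:
            # The broader term must already exist (KeyError otherwise),
            # so terms are always added from general to specific.
            self.terms[broader].narrower.append(name)
        self.terms[name] = Term(name, scope_note, broader)

    def relate(self, a: str, b: str) -> None:
        """Record a symmetric 'related term' link between two existing terms."""
        self.terms[a].related.add(b)
        self.terms[b].related.add(a)

    def path(self, name: str) -> List[str]:
        """Return the chain from the broadest term down to `name`."""
        chain = []
        current: Optional[str] = name
        while current is not None:
            chain.append(current)
            current = self.terms[current].broader
        return chain[::-1]


# Sample vocabulary, invented for this sketch.
thesaurus = Thesaurus()
thesaurus.add("Personnel", "All activities relating to employees.")
thesaurus.add("Recruitment", "Filling vacancies; excludes training.",
              broader="Personnel")
thesaurus.add("Interviews", broader="Recruitment")
thesaurus.add("Training", broader="Personnel")
thesaurus.relate("Recruitment", "Training")

print(thesaurus.path("Interviews"))   # ['Personnel', 'Recruitment', 'Interviews']
print(thesaurus.terms["Recruitment"].related)   # {'Training'}
```

Storing the broader term by name rather than by object reference keeps the sketch easy to serialize; a real system would probably also need term identifiers and 'use / use for' entries for non-preferred terms.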

The Keyword AAA thesaurus is a well-recognized example from Australia, used in New South Wales government record keeping. NAICS (the North American Industry Classification System) is a scheme that defines standard industries for commercial or employment classification, which many people will be familiar with from classifying themselves. Both of these schemes provide identifications for things within their specific domain and therefore do not usually fully describe the thing they are attached to.

In an EDRMS, a thesaurus is a tool to help users pick the correct keywords to apply to an item of metadata for a record. Depending on the definition of the metadata attribute, one or multiple keywords may be selected to fully identify the meaning. Thesaurus keywords may be used alongside any other metadata to classify records, so metadata from multiple thesauri may represent the classification of a single record in the multiple domains of its use. Alternatively, a thesaurus can be used alongside more straightforward index metadata. The thesaurus really just guides records managers towards the most consistent and exact information for classifying a record in a single item of metadata.
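As a rough illustration of the idea, the sketch below checks the keywords chosen for each metadata attribute against that attribute's own vocabulary, and enforces whether one or several keywords are allowed. The attribute names, the vocabularies and the `classify` function are assumptions made for the example, not the interface of any particular EDRMS.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class KeywordAttribute:
    """A metadata attribute whose values must come from one thesaurus."""
    name: str
    vocabulary: FrozenSet[str]
    multi_valued: bool = False   # whether several keywords may be selected

    def validate(self, keywords: List[str]) -> List[str]:
        if not keywords:
            raise ValueError(f"{self.name}: at least one keyword is required")
        if len(keywords) > 1 and not self.multi_valued:
            raise ValueError(f"{self.name}: only one keyword is allowed")
        unknown = [k for k in keywords if k not in self.vocabulary]
        if unknown:
            raise ValueError(f"{self.name}: not in vocabulary: {unknown}")
        return list(keywords)


def classify(attributes: List[KeywordAttribute],
             chosen: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Validate the chosen keywords for every attribute of a record."""
    return {a.name: a.validate(chosen.get(a.name, [])) for a in attributes}


# Two hypothetical attributes, each backed by a different vocabulary.
function_attr = KeywordAttribute(
    "function", frozenset({"Personnel", "Recruitment", "Training"}),
    multi_valued=True)
industry_attr = KeywordAttribute(
    "industry", frozenset({"Education", "Construction"}))

record = classify(
    [function_attr, industry_attr],
    {"function": ["Recruitment", "Training"], "industry": ["Education"]},
)
print(record)
# {'function': ['Recruitment', 'Training'], 'industry': ['Education']}
```

Rejecting unknown keywords outright is one possible design; a friendlier system might instead suggest the nearest recognized term.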


When used in records classification, a thesaurus enables a specific item of metadata to be set with the most tightly defined term or set of terms available in the vocabulary. It can provide a tight definition of the records within the constraints of the recognized terms of a specific domain's taxonomy.

Typically a thesaurus is not used alone; much like a fileplan, it is just another way of more accurately identifying records for storage and effective retrieval. It enables the document to be 'tagged' with keywords from a well-defined vocabulary. This is similar to the category tags used in WordPress and other blogs, so I feel I'm getting close to my original aim. In the next post I will round up all of the classification schemes I have described, and try to relate them to the use of tagging on the Web.
