Friday, September 29, 2006

Converging classification schemes of documents and records - Part 5 - Tagging and pulling it all together

All this week I have been exploring some of the approaches that enable the classification of documents and records in Electronic Document and Records Management Systems (EDRMS), with the aim of seeing how the convergence of all of these schemes may lead to a more usable classification mechanism, as we have seen emerge with tagging on the Web. The earlier posts covered structured indexes, full text indexing, fileplans and thesaurus-enforced keywords.

Combined classification schemes

During the discussions, I have repeatedly said that none of the classification schemes described is complete on its own. Structured indexes are great for business-process-driven document management, like my insurance claims example; full text indexing is great for unstructured, work-in-progress documents and for documents that are fully described by their content; fileplans, the staple of traditional records management, are great for organizing documents into big buckets representing the business activities around them; thesaurus-enforced keywords provide taxonomic classification for items related to a specific domain.

All four approaches are in fact used together in EDRMSs, to provide a fairly complete classification of the documents for recordkeeping purposes. The structured index metadata captures a lot of information about the description, authorship, ownership and status of a record. A component of this, a single item of metadata, may capture multiple keywords drawn from a controlled thesaurus, enabling consistency in domain classification of the records. In traditional recordkeeping environments, the fileplan will provide a further big-bucket classification and may drive high-level security and retention. The fileplan may also drive the records management for physical documents and assets within the same environment. By far the most information is added by records managers at the point of declaring documents as official records (the definition of what makes a document a record was in the first post of this series).
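As a rough illustration of how the four schemes can sit together on one record, here is a minimal Python sketch. The `Record` class, its field names and the tiny `THESAURUS` set are all hypothetical, invented for this example and not taken from any particular EDRMS.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical controlled vocabulary; a real thesaurus would be far larger.
THESAURUS = {"claims", "motor", "property", "fraud"}


@dataclass
class Record:
    # Structured index metadata: description, authorship and status.
    title: str
    author: str
    status: str
    # Fileplan: the big-bucket classification, shown here as a simple path.
    fileplan_path: str
    # Keywords restricted to the controlled thesaurus.
    keywords: List[str] = field(default_factory=list)
    # Document content, the raw material for full text indexing.
    body: str = ""

    def add_keyword(self, keyword: str) -> None:
        """Add a keyword only if the thesaurus allows it."""
        term = keyword.strip().lower()
        if term not in THESAURUS:
            raise ValueError(f"{keyword!r} is not in the controlled thesaurus")
        if term not in self.keywords:
            self.keywords.append(term)

    def matches_text(self, word: str) -> bool:
        """Naive stand-in for a full text index lookup (substring match)."""
        return word.lower() in self.body.lower()


record = Record(
    title="Motor claim assessment",
    author="J. Smith",
    status="declared",
    fileplan_path="Insurance/Claims/Motor",
    body="Assessment of windscreen damage following a collision.",
)
record.add_keyword("Claims")
record.add_keyword("motor")
```

The point of the sketch is only that each scheme answers a different question about the same record: who and what (structured metadata), which bucket (fileplan), which domain terms (thesaurus keywords) and what it actually says (full text).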

This is a lot of metadata and classification that is captured, based on the original business use, future retrieval requirements and the content of the document. If documents have been electronic throughout their life, or at least prior to being records, often there will be collaborative tools and document management systems that have also captured their own metadata along the way. Much of this remains valuable to the business users to retrieve documents to do their jobs, even after documents have been passed for records management. Using this business information without having to duplicate it or the documents is a challenge. Vignette has some great approaches to this business document/metadata and record/metadata integration problem. They go beyond the standard mechanism of a separate records management system referencing or copying documents in document repositories or filesystems, which risks duplication, damaged data and broken custody.

Business users can't be trusted to classify records

Part of the reason for this week-long discussion of records management has been to get to the point of understanding why business users can't be trusted to classify records. Partly, this is because many business users can't even be trusted to store documents in a secure place without some carrot-and-stick persuasion. The feedback from users has been that even if they store documents, classifying them is a time-consuming process that seems complex - being faced with a long form of data to fill in feels like completing your tax return.

Even if business users were inclined to store records, their business is not to understand the details of records management to ensure effective, consistent classification. So is there a halfway house?

Tagging for user friendly classification

Many Web 2.0 sites use tagging to help random Internet users find documents, blog posts, URL bookmarks, photos, videos, music and whatever else may be put out there by contributors. Contributors have a vested interest in tagging their content effectively, since decent tagging is likely to lead to more viewers.

Although the information a contributor attaches may not meet the Dublin Core standard element set, it is likely to give a fast and concise classification of the content, so that users can find that content when a full text Google search may not help.

Tags are just keywords that describe the content, but they typically don't come from a thesaurus, instead being added by the contributor to meet his current requirements. Taking a look at my Technorati tag cloud, you can see that there is a mass of tags that may have been used only once, and may never be used again. From a classification point of view within the context of this site, these one-off tags probably reduce the value of the cloud for users trying to find relevant posts. From an external viewpoint, an anonymous user searching Technorati for interesting blog posts may find the pointedness of this classification valuable. If the user searches explicitly for tags like taxonomy+indexing, he or she gets back 2 blog posts that are identified as being relevant to those tags. Searching blog content with taxonomy+indexing leads to 452 posts, many of which may just mention each word once in the course of the text.
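The difference between the two kinds of search can be sketched in a few lines of Python. The posts, tags and function names below are invented for illustration and say nothing about how Technorati actually implements its search.

```python
import re

# Hypothetical sample posts, each with contributor tags and body text.
posts = [
    {
        "title": "Building a fileplan",
        "tags": {"taxonomy", "indexing"},
        "text": "A fileplan is a taxonomy that drives indexing of records.",
    },
    {
        "title": "Holiday photos",
        "tags": {"photos"},
        "text": "My indexing of holiday photos follows no taxonomy at all.",
    },
    {
        "title": "Thesaurus basics",
        "tags": {"taxonomy", "thesaurus"},
        "text": "A thesaurus constrains keywords.",
    },
]


def search_by_tags(items, *tags):
    """Return titles of posts whose contributor applied every given tag."""
    wanted = {tag.lower() for tag in tags}
    return [item["title"] for item in items if wanted <= item["tags"]]


def search_full_text(items, *words):
    """Return titles of posts whose text mentions every given word."""
    wanted = {word.lower() for word in words}
    results = []
    for item in items:
        # Split the text into lowercase words, ignoring punctuation.
        found = set(re.findall(r"[a-z]+", item["text"].lower()))
        if wanted <= found:
            results.append(item["title"])
    return results


tag_hits = search_by_tags(posts, "taxonomy", "indexing")
text_hits = search_full_text(posts, "taxonomy", "indexing")
```

Here the tag search returns only the post whose contributor chose both tags, while the full text search also returns a post that merely mentions both words in passing.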

The WordPress blogging tool goes a little further, providing predefined categories for tagging posts. This brings some consistency to the blog categorization, useful for more focused blogs, like my (personal) pet project, Bruncher. Here, restaurants and diners that serve brunch are categorized with a predefined set of tags based on their location (e.g. MA, Boston), style (bar, diner), and most importantly how good a Bloody Mary they serve (0 - No Bloody Mary in sight, up to 4 - The best!). WordPress allows custom tags to be applied to any post, but within this fixed domain of interest the fixed categories seem to work well and further tags may not provide any value.

Tagging for business users

In the business world, users are unlikely to be working on documents that fit a very narrow domain or a fixed set of tags; otherwise this whole discussion would be a non-issue and classification would be fast and easy. Freeform tagging therefore seems a good start for providing some contextual information about the documents they have produced. The accuracy of tagging is not enforceable, since it is completely freeform, but it may provide enough information to enhance document search and retrieval beyond pure full text indexing before a document becomes a record.

How can this be used to enhance record classification? A records manager could choose to use some of these freeform tags as keywords that they further refine at the point of filing. Alternatively, records managers could collaborate with business users to provide a very limited set of fixed keywords for typical documents that translate directly to the records management environment, supplemented by freeform tagging to represent the actual context of the document.
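One way to picture the first option is a lookup that turns a user's freeform tags into suggested thesaurus keywords and sets aside anything it does not recognize for the records manager to review. This is only a sketch under stated assumptions: the `TAG_TO_KEYWORD` mapping and the `suggest_keywords` function are hypothetical, and a real system would need a far richer mapping than a hand-written dictionary.

```python
# Hypothetical mapping from freeform user tags to controlled thesaurus terms.
TAG_TO_KEYWORD = {
    "car": "motor vehicles",
    "auto": "motor vehicles",
    "claim": "insurance claims",
    "claims": "insurance claims",
}


def suggest_keywords(freeform_tags):
    """Split freeform tags into suggested keywords and unmatched tags.

    Suggested keywords are controlled terms a records manager could accept
    or refine at the point of filing; unmatched tags are kept as context.
    """
    suggested = []
    unmatched = []
    for tag in freeform_tags:
        normalized = tag.strip().lower()
        keyword = TAG_TO_KEYWORD.get(normalized)
        if keyword is None:
            if normalized not in unmatched:
                unmatched.append(normalized)
        elif keyword not in suggested:
            suggested.append(keyword)
    return suggested, unmatched


suggested, unmatched = suggest_keywords(["Car", "claims", "q3-review", "auto"])
```

In this example the tags "Car" and "auto" collapse to one controlled term, "claims" maps to another, and "q3-review" is left as a freeform tag that still carries context for search.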


Records management is a highly involved discipline, requiring specialist records managers to keep record classification consistent, so that records are retrievable and are retained according to corporate policies and legal mandates. At the same time, there is a huge amount of electronic information being produced by the business that should be kept as records. Reusing as much of the business metadata as possible is essential to keep records management resources efficient and scalable.

Tagging is a form of classification that seems to be acceptable enough to Web users that it could be applicable to non-threatening corporate document classification by business users as well. A combination of predefined categorization and freeform tagging may not only help users searching for documents find them prior to record declaration, but also assist records managers (and maybe automated systems) in the formal classification.

This is the end of this series of records management classification posts. I hope that some of the reference information will be useful to people over time and maybe some of the newer ideas will trigger some discussion. I have many thoughts for corporate document tagging, none of which, as far as I know, has been proven in practice in a corporate setting. I'd love to hear of any examples.
