Improving It: Converging classification schemes of documents and records

Over the last couple of days I have started my comparison of document classification schemes, with the eventual aim of understanding how the freeform user tagging offered by Internet apps such as del.icio.us, Technorati and Flickr can be applied to corporate document and records management.

So far, I've addressed structured metadata indexing / titling and, at the opposite end of the spectrum, full-text indexing with zero user classification requirements.

Metadata indexing typically requires manual interaction to relate the content of the document to metadata attributes. It provides the most accurate and repeatable approach to identifying documents, but for a very limited scope of use: typically a business unit, and maybe even a single specialized business process.

Full-text indexing, on the other hand, requires no classification effort, allowing users to search on any words that appear in the content of the document. Because it serves such a wide range of uses, full-text indexing may not provide the accuracy required in repeatable business processes (the example I've been using is insurance claims), but it can find information that would have been difficult to find through formal classification. Even for Knowledge Management (KM) use cases, full-text search alone may be too broad to separate documents based on the required context. In this case, search needs to be limited to a broad bucket of documents that are known to share the required context.

To satisfy the 'big bucket' requirement there is a high-level classification scheme that is recognizable to all of us - the hierarchical folder structure or directory tree, and its formal equivalent, the file plan. This provides at least some structure to otherwise unclassified documents, and when supplemented with full-text search it can provide contextual classification that is granular enough to give meaningful search results.

Tags: full text indexing titling

User defined filing hierarchies

Filing hierarchies, even in their simple PC folder structure form, are a powerful way of keeping large amounts of data and documents in context with little classification beyond a filename. Most novice computer users rapidly learn that the Windows desktop alone is no better for storing documents than a pile of paper in a desk drawer, and start to experiment with subfolders to split their documents up.

Collaborative tools that focus on the management and sharing of documents often take this format as well, in the form of workspaces for teams, projects, events, etc. Segregating documents into appropriate workspaces applies some context, so a user searching for a document knows what it relates to. Further subfolders add more context, which helps both browsing and searching.

This approach has become so familiar to users that organizations buying a document management system (DMS) for collaboration will often insist on a Windows Explorer lookalike UI for navigating the DMS. WebDAV is a technology that enables this, and MS Windows has finally started to support it effectively, allowing Windows and Mac users to work in their most familiar 'explorer-type' navigation environment while looking at the same DMS, browsing its hierarchy of workspaces and folders.

Taxonomy, Fileplan

A Fileplan is really a form of Taxonomy - a mechanism for classifying things. Typically the classification scheme is hierarchical, although it doesn't have to be. Within an Electronic Document and Records Management System (EDRMS), the fileplan is almost always hierarchical.

So what makes a taxonomy for EDRMS different from a folder structure on a network filesystem, accessed through Windows Explorer? Really it is that a taxonomy is a formal hierarchy defined ahead of time, rather than a hierarchy that users define as they go. Its rigid classification structure provides a well understood hierarchy for managing documents and records. A familiar type of taxonomy is the Web directory managed by the DMOZ Open Directory Project, and used by Yahoo, Google and others.

The traditional fileplan used by records management systems is really a taxonomy for records in an organization, typically representing the structure of the business or the activities that create the records. It is designed to ensure:

Consistent filing
Efficient retrieval
Effective management of retention
Defined approach to documenting the mechanism

The fileplan may be navigated through its named folders / containers or using a filecode, effectively a shorthand code for addressing each folder and representing its position in the hierarchy. As demonstrated by this EPA example, this is the stuff of traditional filing structures dealing with massive archives of paper and other items.
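
As a rough illustration of how a filecode can stand in for a position in the hierarchy, here is a minimal Python sketch. Everything in it is invented for the example: the `FILE_PLAN` contents, the dash-separated code format and the `resolve_filecode` helper are assumptions, not the EPA scheme or the format used by any particular EDRMS.

```python
# Hypothetical file plan: each level is a dict keyed by one filecode segment.
FILE_PLAN = {
    "100": {"name": "Administration", "children": {
        "10": {"name": "Policies", "children": {}},
        "20": {"name": "Meetings", "children": {}},
    }},
    "200": {"name": "Claims", "children": {
        "10": {"name": "Motor", "children": {}},
    }},
}


def resolve_filecode(plan, filecode, sep="-"):
    """Return the folder names from the root down for a code such as '200-10'."""
    path = []
    level = plan
    for segment in filecode.split(sep):
        node = level[segment]  # raises KeyError if the code is not in the plan
        path.append(node["name"])
        level = node["children"]
    return path


print(" / ".join(resolve_filecode(FILE_PLAN, "200-10")))  # Claims / Motor
```

Each segment of the code selects one level of the tree, so the code is both a shorthand address and a record of where the folder sits.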

In an EDRMS the fileplan may represent folders for holding documents, and as such becomes a high-level point for defining access and usage permissions, business-specific metadata and records retention policies. This is another reason why a fileplan is a formally defined structure, since the enforceable retention and eventual disposition of vast numbers of records is at stake.
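
One way to picture a fileplan folder as the point where a retention policy is defined is sketched below in Python. It assumes that a folder without its own policy inherits from its nearest ancestor, which is a common design but not something every EDRMS does. The `Folder` class, its field names and the retention periods are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Folder:
    """Hypothetical fileplan container."""
    name: str
    parent: Optional["Folder"] = None
    retention_years: Optional[int] = None  # None means "inherit from parent"

    def effective_retention(self) -> Optional[int]:
        """Walk up the hierarchy until a folder defines a retention period."""
        folder = self
        while folder is not None:
            if folder.retention_years is not None:
                return folder.retention_years
            folder = folder.parent
        return None  # no policy defined anywhere above this folder


claims = Folder("Claims", retention_years=7)
motor = Folder("Motor", parent=claims)                           # inherits 7
disputed = Folder("Disputed", parent=motor, retention_years=10)  # overrides

print(motor.effective_retention(), disputed.effective_retention())  # 7 10
```

Setting the policy once on a high-level folder covers everything filed beneath it, which is why the shape of the hierarchy matters so much to records managers.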

Tags: fileplan taxonomy

Using a fileplan

Fileplans provide big buckets for classifying documents in a well understood, consistent manner. This makes them ideal for records management, since they define a structure for filing documents so that they can be found again long after the original creator or owner has left the building. In an EDRMS, coupling a fileplan with full-text indexing provides a powerful mechanism for identifying and finding documents in context.
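
To make the coupling of a fileplan with full-text indexing concrete, here is a toy Python sketch: a word index over a few documents, with search optionally restricted to one fileplan bucket. The documents, folder paths and function names are made up for illustration, and a real EDRMS search engine would do far more (stemming, ranking, permissions).

```python
from collections import defaultdict

# Hypothetical documents: id -> (fileplan folder path, text content)
DOCS = {
    1: ("Claims/Motor", "windscreen damage claim approved"),
    2: ("Claims/Property", "storm damage claim pending"),
    3: ("Marketing/Campaigns", "storm season campaign brief"),
}


def build_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, (_, text) in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, docs, query, bucket=None):
    """Return ids of documents containing every query word, optionally
    limited to a fileplan folder and everything beneath it."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    if bucket is not None:
        hits = {d for d in hits
                if docs[d][0] == bucket or docs[d][0].startswith(bucket + "/")}
    return sorted(hits)


INDEX = build_index(DOCS)
print(search(INDEX, DOCS, "storm"))            # [2, 3]
print(search(INDEX, DOCS, "storm", "Claims"))  # [2]
```

The unscoped search returns the marketing brief alongside the claim. Restricting the search to the Claims bucket leaves only the document in the required context.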

Fileplans, though, are typically just one component of a broad set of record metadata used to describe each document. In an EDRMS the fileplan container for a record is typically just another attribute in the record index, the highly structured set of metadata describing documents that users actively avoid completing. And because records managers require consistency and accuracy when documents are filed for long-term retention, regular business users are often not allowed to classify documents directly.

Much of the metadata found in records management systems comes from government requirements that have shaped formal records management practice over many years (long before EDRMS). This has resulted in standards like Dublin Core, and in the US it is familiar as part of DoD 5015.2.

Summary

A fileplan is different from regular storage folder hierarchies because it is a formally defined structure. It still provides buckets for throwing in large numbers of documents, so when used in conjunction with full-text indexing the context and consistency it offers can be extremely powerful. Despite this, traditional EDRMSs require significant amounts of metadata to be applied to records in addition to the fileplan, to ensure their complete identification. Users often do not complete this metadata effectively, since they see it as cumbersome and inflexible.

The pre-defined nature of a fileplan or other taxonomy can be useful to users who must file documents, since it takes much of the guesswork out of classification. Inexperienced users do not need to work out their own filing systems through trial and error. Based on this experience, other taxonomic approaches are used in records management, such as the controlled vocabulary or thesaurus. These approaches, in combination with Dublin Core-type metadata, are starting to converge on the freeform user tagging and blog categories we see on the Internet. The next post will start to address this convergence.

Wednesday, September 27, 2006

Converging classification schemes of documents and records - Part 3
