Tuesday, September 26, 2006

Converging classification schemes of documents and records - Part 2

In my post yesterday I introduced my aim to compare the formal classification schemes offered by traditional Document and Records Management with the more freeform user tagging technique used by a range of Internet apps, like del.icio.us, Technorati, Flickr and others. I'm doing this because I'm keen to work out if tagging can meet the requirements of corporate Records Managers, while being more acceptable to users than complex and clumsy formal classification schemes.

Yesterday I introduced the structured classification approach of indexing / titling documents, laying out why this still has its place in business process oriented document management, despite being perceived as clumsy. In this post I'm going to talk about Full Text Indexing, and its use in corporate records management as the 'anti-title', requiring no effort from the user.

Full Text Indexing - "anti-titling"

As businesses have shifted to more electronic content, search mechanisms have become able to read the content of a far greater number of documents and index them for full text search. Many people would look at this and suggest that the old structured indexing of documents, which was necessary when using document imaging (scanning), is outdated and unnecessary. After all, a user can now enter a search term that returns all documents containing that text, much as they are used to doing with Google. Why duplicate what is already accessible in the content of a document with attributes in a database?
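To make the idea concrete, here is a minimal sketch of what a full text index does, assuming the simplest possible design: an inverted index that maps each word to the documents containing it. The document ids and contents are invented for illustration, and real search engines add ranking, stemming and much more on top of this.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document ids whose content contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical documents, invented for illustration only.
documents = {
    "email-1": "Minutes from the budget meeting on Tuesday",
    "email-2": "Budget forecast attached for review",
    "memo-1": "Meeting room booking policy",
}

index = build_index(documents)
print(sorted(search(index, "budget")))          # ['email-1', 'email-2']
print(sorted(search(index, "budget meeting")))  # ['email-1']
```

Note that the user supplied no title or attributes at all. Everything searchable came from the content itself, which is the whole appeal of the 'anti-title'.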

Full text indexing of electronic documents (and scanned and ICR documents for that matter) works well where there is a mass of documents that:

  • Do not fall into a formal business indexing structure
  • Have content that is self-describing
  • Are held in context through links they have to one another.

The first two of these are the justification for email archiving, which keeps the mass of emails that goes through employees' inboxes searchable. The emails are all about their content, so full text search is a reasonable approach to finding them. I use Google Desktop all the time for my own emails.

The final type is the current state of corporate websites and intranets, where pages link to one another through hyperlinks, enabling a user to understand the context of a page from where it links to or is linked from. This isn't the stuff around the Steve Gillmor argument that hyperlinks are dead. This is more about the way that an organization structures the relationships between documents (as pages) that inherently provide some classification above and beyond the pure content.

Full Text Indexing and repeatable business processes

Using full-text indexing alone does not work particularly well for structured business files like the insurance claim described in the last post. In the example, all of the documents in the claim must be kept together and have some basic title information, to ensure a full contextual view of the transaction taking place, making a claims manager's work efficient and accurate.

The vague algorithms that control full text search may not allow the same control over the presentation and completeness of the claim file that is possible with structured indexing. The result is the type of filing that equates to taking every document and stuffing it straight into a big box - usually the claims manager will find what she is looking for after rifling through the paperwork, and sometimes a document creeps in that was not in fact related to the claim at all.
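The difference can be sketched in a few lines. This is only an illustration under my own assumptions: the claim numbers, document types and text are hypothetical, and the full text search is reduced to a plain substring match rather than a real search engine.

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    claim_number: str  # structured attribute entered at indexing time (hypothetical)
    doc_type: str
    text: str          # content available to full text search

# Hypothetical claim documents, invented for illustration only.
documents = [
    Document("d1", "CLM-1001", "claim form",
             "Claim CLM-1001 for water damage to kitchen"),
    Document("d2", "CLM-1001", "photo", ""),  # scanned photo with no readable text
    Document("d3", "CLM-1001", "adjuster report",
             "Inspection of the kitchen completed, repair estimate attached"),
    Document("d4", "CLM-2002", "letter",
             "Unlike claim CLM-1001, this claim concerns fire damage"),
]

def claim_file(docs, claim_number):
    """Structured indexing: return exactly the documents filed under the claim."""
    return [d.doc_id for d in docs if d.claim_number == claim_number]

def full_text_search(docs, term):
    """Full text search, simplified here to a case-insensitive substring match."""
    term = term.lower()
    return [d.doc_id for d in docs if term in d.text.lower()]

print(claim_file(documents, "CLM-1001"))        # ['d1', 'd2', 'd3']
print(full_text_search(documents, "CLM-1001"))  # ['d1', 'd4']
```

The structured lookup returns the complete claim file. The text search misses the photo and the adjuster report, which never mention the claim number, and pulls in a letter from a different claim that happens to refer to it.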

Where full text indexing works well is for unstructured business documents, where being roughly classified into the big boxes is not really a disadvantage. In my experience, these documents are not part of a repeatable or structured business process, instead being 'work in progress' documents authored across the organization. Collaborative tools enable users to file documents with minimal identifying attributes in workspaces that may have little or no structure. The tools enable documents to be found through a combination of full text search and the user knowing the context of the collaboration workspace they are looking for. More rapid access to documents is enabled by passing URLs in emails or IMs and creating browser bookmarks.

Federated and enterprise search

Most documents in corporations can be found through full text indexing / search, although typically only within the confines of individual applications. Full federated or enterprise search provides mechanisms that enable a user to search across all of the organization's content stores from a single point. Google's search appliance is one approach, Autonomy provides enterprise search with a range of connectors to third party repositories, and Vignette enhances this with further enterprise search capabilities within the context of the corporate intranet portal.

Full enterprise search has not been implemented in many organizations, and its effectiveness has yet to be proven. It is certainly valuable for supporting legal discovery requirements, where a broad set of documents across the enterprise has to be found quickly, independent of the applications they were created in or the repositories where they are sitting. My feeling on enterprise search is that it may lead to the lowest common denominator of search capabilities being presented to the user (pure full text search), since attributes within specific systems may not be easily extracted and used. But given the range of disparate data that must be searched, that may be the only approach.
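Here is a rough sketch of why I expect that lowest common denominator effect. It assumes three hypothetical content stores, each with its own attributes, and a connector per store that exposes nothing but a text search. None of the names below come from a real product.

```python
# Hypothetical content stores. Each keeps its own attributes and stores
# its text under a different field name.
EMAIL_ARCHIVE = [
    {"id": "em-1", "sender": "claims@example.com",
     "body": "Contract renewal terms attached"},
]
CLAIMS_SYSTEM = [
    {"id": "cl-1", "claim_number": "CLM-1001",
     "text": "Policy contract reviewed by adjuster"},
]
WORKSPACE = [
    {"id": "ws-1", "workspace": "Project X",
     "content": "Draft project plan"},
]

def make_connector(name, records, text_field):
    """Build a connector that runs a plain text search against one store."""
    def connector(term):
        term = term.lower()
        return [(name, r["id"]) for r in records if term in r[text_field].lower()]
    return connector

CONNECTORS = [
    make_connector("email", EMAIL_ARCHIVE, "body"),
    make_connector("claims", CLAIMS_SYSTEM, "text"),
    make_connector("workspace", WORKSPACE, "content"),
]

def federated_search(term):
    """Send the same text query to every store and merge the results."""
    results = []
    for connector in CONNECTORS:
        results.extend(connector(term))
    return results

print(federated_search("contract"))  # [('email', 'em-1'), ('claims', 'cl-1')]
```

The sender, claim number and workspace attributes never reach the user, because the only query every store can answer is a text match.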


Full text indexing / search is essential to organizations requiring certain types of documents to be made searchable. Documents that are not part of repeatable, structured business processes, especially those in email and collaboration tools, are ideal candidates for full text indexing. This is because users typically want to spend little time thinking about how to identify their documents going forward.

As organizations attempt to apply collaborative tools to Knowledge Management (KM) they may find that completely freeform storage and search is not enough to enable them to find their documents within the mass of information that is floating about. Much like Records Management, valuable documents require a little more classification to ensure that they can be found.

In the following post I'm going to look at a high level classification scheme that is recognizable to all of us - the hierarchical folder structure or directory tree, and its formal equivalent, the file plan. This provides at least some structure to otherwise unclassified documents.
