tin_the_fatty weblog

Photography is a road long and lonely.


You Can’t GREP a Dead Tree

October 1st, 2007 · 1 Comment

You can’t grep a dead tree.

A fellow historian, A, is involved in a faculty project on the historic archive of the TWGH, and she mentioned the effort to digitize all the correspondence.

Archiving alone is not a good reason to digitize information. The great effort that went into the BBC Domesday Project was nearly lost when its equipment became obsolete, and it took another great effort to preserve the project and make it available again. Digital data needs to be maintained (by re-creation) regularly. On the other hand, if left in good condition and undisturbed, paper (e.g. the Domesday Book) and stone (e.g. the Rosetta Stone) are hard to beat for archival purposes.

The computer is very powerful for sorting, indexing and organizing data in digital format, provided that the data is stored properly and sensibly. For processing large amounts of textual data, nothing beats plain old text files. However, for some strange reason, in the TWGH project mentioned above all the correspondence is stored in Microsoft Word format. The problems with using the Word file format for archiving are well understood. But what is done is done. So I suggested that A look into DEVONthink, which is great for storing and indexing large amounts of data and understands Word files.
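To show what I mean about plain text, here is a minimal Python sketch of a grep-like search over a folder of text files. The function name `grep_archive`, the `.txt` extension and the UTF-8 encoding are my own assumptions for illustration; nothing here comes from the TWGH project itself.

```python
from pathlib import Path


def grep_archive(root, needle, encoding="utf-8"):
    """Yield (file name, line number, line) for every line containing needle.

    Assumes the archive is a folder tree of .txt files (hypothetical layout).
    """
    for path in sorted(Path(root).rglob("*.txt")):
        # errors="replace" keeps the scan going if a file has stray bytes
        with path.open(encoding=encoding, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    yield path.name, number, line.rstrip("\n")
```

Something like `list(grep_archive("archive", "來函"))` would then list every matching line with its file and line number. This is exactly the dumb substring matching that works on any plain text, Chinese included, and it is not possible on paper or (without extra tooling) on a pile of Word files.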

Then I thought about the problem domain a bit more.

DEVONthink is pretty smart at searching textual data in English (and I think German as well). Such smart, context-aware searching is highly valuable for historians using any archive. I think DEVONthink’s smartness relies on some sort of built-in lexicon, so that the program understands which phrases are important and which phrases have similar meanings. Unfortunately, DEVONthink doesn’t understand Chinese, so it can only do dumb word pattern matching, and is not as useful as it could be for the TWGH project.

Ideally, the “correct” approach to the problem would be for A’s team to store the TWGH archive in plain text files, build a Chinese lexicon from the text, and then construct an index over the archive of every entry in the lexicon. This is a well-understood problem and could be solved by any competent computer scientist. Once the lexicon and the index are built, you get smart searching for the TWGH archive. The lexicon may also be expanded to handle other correspondence archives, so the effort should be worthwhile in the long run.
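The lexicon-plus-index idea can be sketched roughly as follows in Python. Everything in it is hypothetical: the three sample "letters", the hand-made lexicon and the synonym group are made up for illustration, and the hard part (actually building a good Chinese lexicon from the archive) is not shown. It is a toy under those assumptions, not a real system.

```python
from collections import defaultdict

# Hypothetical sample data: document id -> plain text of one letter.
documents = {
    "letter-001": "醫院收到捐款",
    "letter-002": "善款已交司庫",
    "letter-003": "來函查詢病人",
}

# Hypothetical hand-made lexicon; a real one would be built from the archive.
lexicon = {"醫院", "捐款", "善款", "病人"}

# Hypothetical groups of lexicon entries with similar meanings.
synonyms = [{"捐款", "善款"}]


def build_index(documents, lexicon):
    """Map each lexicon entry to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in lexicon:
            if term in text:
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query, synonyms=None):
    """Return ids of documents containing the query or any listed synonym."""
    terms = {query}
    for group in synonyms or []:
        if query in group:
            terms |= set(group)
    hits = set()
    for term in terms:
        hits.update(index.get(term, []))
    return sorted(hits)


index = build_index(documents, lexicon)
```

With the synonym group passed in, `search(index, "捐款", synonyms)` also finds the letter that only uses the other word for a donation, which is the kind of "smart" behaviour that plain pattern matching cannot give you. Scanning every lexicon entry against every document is fine for a sketch; a large archive would want proper word segmentation and a more efficient index.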

I have no idea how difficult it would be for A’s team to engage a computer scientist or programmer to create the system outlined above, but historians may be perfectly happy with dumb word pattern matching over the whole archive in DEVONthink. While I am platform-neutral myself (best tool for the job!), it might be a problem for the historians in A’s team to adopt a new platform. Then I realized Google Desktop may also do the job. It is like having your own Google. I suggested this to A. Our good friend Vincent overheard us, and he also needs some sort of search facility for textual data, so I sent both of them the link to Google Desktop and a few screenshots of it in action.

[Screenshot: Google Desktop search in action (GD-Search-Screenshot-3)]

I shall follow up with both A and Vincent and see how useful (or not) Google Desktop is to their research projects.

Finally, on the infrastructure of file storage and archiving, there is this interesting interview with a couple of ex-Be folks (the BeOS had a fully database-backed file system). It is somewhat related to Microsoft’s effort to create its own database-backed file system in Longhorn (the codename for Windows Vista during development), an effort that was subsequently dropped. Not to mention that Apple already has a similar feature in Spotlight.

Tags: History · Tools for Work

1 response so far ↓

  • 1 You Can’t GREP a Dead Tree II // Oct 22, 2007 at 11:09 pm

    [...] up to my previous post, I have since discovered IBM OmniFind Yahoo! Edition, featuring configurable synonyms and featured [...]