Where are my six honest serving men?

I keep six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.

– Rudyard Kipling

Kipling’s six honest serving men – sometimes known as the five Ws (and an H) – form a well-known maxim covering the basic questions to ask when gathering information about something. Whether you’re a journalist writing an article or a police officer investigating a crime, answering these common questions will get you a long way towards a useful answer.

So if these questions are used for finding information, why don’t they form the basis of querying a computer? In particular, why can’t I easily search for files by answering these questions to reduce the number of possible matches? At the moment computers deal with some of these questions, but certainly not all of them…

  • Who created the file? Most computer file systems keep track of the file’s owner, which is usually (though not always) the person who created it. Who last modified it? In fact, who has ever modified it? For that you need a versioning file system. They do exist, but they’re not something you’ll commonly find on your desktop just yet.
  • What is the file? An audio file? Document? Web page? This information is usually stored as part of the filename, using a short extension – often limited to three characters for historical reasons. Encoding this information in a file extension isn’t great, and has been the source of several exploits on Windows systems. But at least the information is present, and generally easy to search for or sort by.
  • Why did I create/download/copy this file? Unless you add your own metadata by hand, this data is bound to be missing. Making it easy to “tag” files might go some way to addressing this question without creating a huge burden on the user.
  • When did I create/modify/read this file? Like “who”, some of this information is stored, but much of it is not – and the missing part is often the most useful. I’m more likely to remember that I want the file I was looking at on Thursday than to remember that it was the one I created last July.
  • Where was I when I created it? With the cost of GPS devices dropping, I’d love to start seeing laptops with integrated geolocation, and filesystems which store that data. “I remember reading it at home”, “I was working on it at the Oxford office”.
  • How was it created? Macs used to (and probably still do) keep track of the “creator” application in their filesystems – but that was more about being able to reopen the file in the application than about allowing the user to search by that information.
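
To make the gap concrete, here is a rough sketch of how many of the six questions a program can answer from a plain stat() call. The function name six_questions and the dictionary keys are my own invention for illustration; only os.stat and its st_uid, st_mtime and st_atime fields are real.

```python
import os
import tempfile
from datetime import datetime, timezone


def six_questions(path):
    """Answer as many of the six questions as os.stat() allows for `path`."""
    st = os.stat(path)
    return {
        # Who: the owner's numeric user id, which is not necessarily the creator.
        "who": st.st_uid,
        # What: guessed from the file extension alone.
        "what": os.path.splitext(path)[1].lower(),
        # When: last modification and last access times.
        "when_modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        "when_accessed": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
        # Why, where and how are not recorded by a plain stat() call.
        "why": None,
        "where": None,
        "how": None,
    }


# Try it on a throwaway file (the file name is made up for the example).
with tempfile.TemporaryDirectory() as folder:
    example = os.path.join(folder, "Letter.ODT")
    with open(example, "w") as handle:
        handle.write("Dear Sir")
    answers = six_questions(example)

unanswered = sorted(key for key, value in answers.items() if value is None)
print(unanswered)  # prints ['how', 'where', 'why']
```

Even the “when” answers deserve some suspicion: many systems are configured to update access times lazily or not at all, so “the file I was looking at on Thursday” may not be recoverable from st_atime.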

Most file systems – and the tools that we use to access them – still only expose the same metadata that was present in the 1970s. Each file has a name (plus extension), and a little information about who created it and when. You often get shown the size of the file by default as well – but in a next-to-useless format that mixes units (100KB can easily look bigger than 1GB if you’re not an expert). And most of the time file sizes aren’t a concern these days anyway: does it really matter if a file is 100KB or 200KB when your nice new machine has a 500GB hard drive in it?
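
The mixed-units problem, at least, is easy to avoid. As a small sketch (the function name size_in_megabytes is hypothetical, and decimal megabytes are an arbitrary choice of unit), showing every size in the same unit means the longer number is always the bigger file:

```python
def size_in_megabytes(size_in_bytes):
    """Format a size in one fixed unit (decimal megabytes)."""
    return f"{size_in_bytes / 1_000_000:,.1f} MB"


# Right-align the values so the sizes line up in a column.
for size in (100_000, 200_000, 1_000_000_000):
    print(size_in_megabytes(size).rjust(12))
```

With this formatting 100KB appears as 0.1 MB and 1GB as 1,000.0 MB, so nobody has to read the unit suffix to see which is larger.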

Meanwhile the information we really want – those five Ws and an H – is lost. Sure, you could stuff some of that information into metadata fields on some file systems, but if it doesn’t happen by default – or at least very easily – it’s not likely to happen at all. And with no real consensus over the format of that metadata, it’s much harder to search or index.

The creators of ID3 tags for MP3 and other audio files have done a good job of standardising on a few bits of data that are relevant to music files. I can easily search a large corpus of music files by track name, artist name, composer, genre, year, album and so on. Without a similar standard for general file metadata (plus support in archiving, management and searching tools), I’ll never be able to efficiently search for:

  • That file I created last week and looked at a couple of days ago
  • The presentation that I was working on when I was staying at the Travelodge in Nuneaton
  • The letter I wrote in OpenOffice, which I remember tagging as “Personal”, while I was working late at the office a couple of weeks ago
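
If that metadata did exist, searches like these would be almost trivial. The sketch below is purely hypothetical: the FileRecord fields, the search function and the sample records are all invented for illustration, and no current file system stores this information in this form.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    """Hypothetical per-file metadata answering the six questions."""
    name: str
    created: datetime
    last_viewed: datetime
    place: str = ""          # where: a label derived from geolocation
    application: str = ""    # how: the application that created the file
    tags: set = field(default_factory=set)  # why: tags supplied by the user


def search(records, *, created_after=None, viewed_after=None,
           place=None, application=None, tag=None):
    """Return the names of records matching every criterion that was given."""
    results = []
    for record in records:
        if created_after is not None and record.created < created_after:
            continue
        if viewed_after is not None and record.last_viewed < viewed_after:
            continue
        if place is not None and record.place != place:
            continue
        if application is not None and record.application != application:
            continue
        if tag is not None and tag not in record.tags:
            continue
        results.append(record.name)
    return results


# An arbitrary fixed "today" so the example gives the same result every run.
now = datetime(2024, 1, 15, 12, 0)
records = [
    FileRecord("letter.odt", now - timedelta(days=14), now - timedelta(days=14),
               place="office", application="OpenOffice", tags={"Personal"}),
    FileRecord("slides.odp", now - timedelta(days=30), now - timedelta(days=10),
               place="Travelodge Nuneaton", application="OpenOffice"),
    FileRecord("notes.txt", now - timedelta(days=7), now - timedelta(days=2),
               place="home", application="gedit", tags={"Work"}),
]

recent = search(records, created_after=now - timedelta(days=8),
                viewed_after=now - timedelta(days=3))
hotel = search(records, place="Travelodge Nuneaton")
letter = search(records, application="OpenOffice", tag="Personal", place="office")
print(recent, hotel, letter)
```

Each of the three searches corresponds to one of the bullets above, and each narrows the list to a single file using nothing but the answers I can actually remember.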

Our current collection of file managers doesn’t even do a great job with what little metadata it does have. F-Spot, the photo manager in Ubuntu, lets me browse photos using a simple timeline based on their creation dates, while Nautilus (the file manager) offers no such convenience. Meanwhile Nautilus has tags in the form of “emblems” – but no means to search or filter by them, or even to assign them in the standard Ubuntu “Save” dialogue.

I know all about Spotlight, Beagle, Google Desktop and other similar applications, but these are all just after-the-fact additions to paper over the cracks in computer file systems. Adding some new, standardised attributes to files would actually make these applications work even more efficiently, as it would make it easier to restrict the list of files that they even need to consider in their searches.

This is the 21st century: GPS is fast becoming a ubiquitous technology; drive sizes are no longer a limiting factor; we’re storing ever larger numbers of files. Why are we still stuck with a few little bits of technical data about each file instead of a lot of bits of human-centred data?
