For quite a long time I’ve been thinking of having some directories, like home, documents or the kernel sources, indexed so that I can easily search them for strings. For me, searching is the best way to find files: I never remember what I called them or which directory they ended up in.
So, to practice what I’ve been reading in the Python book, I’ve decided to start writing a file indexer. In this post I’ll comment on my first attempt, which is really simple and aimed at learning Python and trying out indexing.
What I’ve done so far is an in-memory indexer for a single directory (not its subdirectories). The indexer first parses the documents and, word by word, adds them to a dictionary (a hash table) where the key is the word and the value is a list containing the names of all the files where the word has been found.
So, once the program has indexed the whole contents of the directory into memory, searching is easy: just look up the word in the dictionary and print the list of files where it appears.
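As a rough sketch, the idea above could look something like this in Python 3. The names `build_index` and `search` are only illustrative, and splitting words with the regular expression `\w+` and lowercasing them are assumptions of this sketch, not necessarily what the first attempt does.

```python
import os
import re
from collections import defaultdict

# Assumption: a "word" is any run of letters, digits or underscores.
WORD_RE = re.compile(r"\w+")

def build_index(directory):
    """Map each lowercased word to the sorted list of file names containing it."""
    index = defaultdict(set)
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # subdirectories are not indexed
        with open(path, encoding="utf-8", errors="ignore") as handle:
            for line in handle:
                for word in WORD_RE.findall(line.lower()):
                    index[word].add(name)
    # Convert the sets to lists so each value is a list of file names.
    return {word: sorted(names) for word, names in index.items()}

def search(index, word):
    """Return the list of files where the word was found (empty if none)."""
    return index.get(word.lower(), [])
```

A set is used while building so that a file is recorded only once per word, even if the word appears in it many times.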
This was a really simple example of what can be done with just a few lines of Python. However, the program is totally useless as it stands: indexing takes a really long time and has to be redone each time you start the program, because the index only lives in RAM.
Below is the list of objectives I want to achieve in the next versions; work on each objective can be done in parallel:
Parser:
Before the files are indexed they need to be parsed, which means extracting all the words that are going to be indexed. I want to split the parsing into two parts. First, the kind of file needs to be determined, since parsing a PDF document is not the same as parsing plain text. From this step a metadata record will be generated, which will include the name of the file, its path, the last modification time, the file type and filetype-specific metadata (for example, the song title and artist name for an MP3).
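To make the metadata step more concrete, here is a small sketch in Python 3. The function name `extract_metadata` and the dictionary keys are hypothetical, and guessing the file type from the extension with the standard `mimetypes` module is just one simple option; a real parser would probably inspect the file contents too.

```python
import mimetypes
import os

def extract_metadata(path):
    """Collect the generic metadata for one file into a plain dictionary."""
    stat = os.stat(path)
    # guess_type only looks at the file name, not at the file contents.
    file_type, _encoding = mimetypes.guess_type(path)
    return {
        "name": os.path.basename(path),
        "path": os.path.abspath(path),
        "mtime": stat.st_mtime,          # last modification time (seconds)
        "type": file_type or "unknown",
        # Placeholder: filetype-specific fields (e.g. MP3 artist) would be
        # filled in by a per-type parser, which this sketch does not include.
        "extra": {},
    }
```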
Once all the metadata has been extracted, the parsing itself can be done. It will depend on the filetype, but basically it means extracting the document word by word and passing each word to the indexer with additional info (like its position in the original text, the following words, …). The additional info is used to speed up the searches. The metadata is also indexed, so you can search your MP3s by artist, genre, year or whatever you want.
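For plain text, the word-by-word extraction could be sketched like this in Python 3. The name `tokenize` is hypothetical, and this sketch assumes that the position is the word's ordinal number in the text and that only the single next word is kept as "following" info.

```python
import re

# Assumption: a "word" is any run of letters, digits or underscores.
WORD_RE = re.compile(r"\w+")

def tokenize(text):
    """Yield (word, position, following_word) for each word in the text."""
    words = [match.group().lower() for match in WORD_RE.finditer(text)]
    for position, word in enumerate(words):
        # The last word has no follower, so None is used instead.
        following = words[position + 1] if position + 1 < len(words) else None
        yield word, position, following
```

For example, `list(tokenize("The quick fox"))` gives `[("the", 0, "quick"), ("quick", 1, "fox"), ("fox", 2, None)]`.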
Indexer:
Right now the indexer puts the words into memory using built-in Python structures. The next step is designing the whole indexer with databases in mind. I’ve thought of using MySQL as the backend, storing the dictionaries and the lists in the database and searching through them using SQL commands. Python would maintain small caches for faster searches. Using a database would make it possible to update only the files that have changed and to always have an up-to-date index for doing the searches.
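Here is a minimal sketch of the database idea in Python 3. To keep it self-contained it uses the standard `sqlite3` module as a stand-in for MySQL, which is a substitution made only for this example. The table names `files` and `postings`, the function names, and the use of the modification time to detect changed files are all assumptions of the sketch.

```python
import os
import re
import sqlite3

WORD_RE = re.compile(r"\w+")  # assumption: same simple word splitting as before

def open_index(db_path=":memory:"):
    """Open (or create) the index database and make sure the tables exist."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, mtime REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS postings (word TEXT, path TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS postings_word ON postings (word)")
    return conn

def index_file(conn, path):
    """Re-index a file only if its modification time changed; return True if it did."""
    mtime = os.path.getmtime(path)
    row = conn.execute("SELECT mtime FROM files WHERE path = ?", (path,)).fetchone()
    if row is not None and row[0] == mtime:
        return False  # unchanged since the last run, nothing to do
    with open(path, encoding="utf-8", errors="ignore") as handle:
        words = set(WORD_RE.findall(handle.read().lower()))
    with conn:  # run the three statements as a single transaction
        conn.execute("DELETE FROM postings WHERE path = ?", (path,))
        conn.executemany("INSERT INTO postings (word, path) VALUES (?, ?)",
                         [(word, path) for word in words])
        conn.execute("INSERT OR REPLACE INTO files (path, mtime) VALUES (?, ?)",
                     (path, mtime))
    return True

def search(conn, word):
    """Return the paths of all files containing the word, sorted by path."""
    rows = conn.execute("SELECT path FROM postings WHERE word = ? ORDER BY path",
                        (word.lower(),))
    return [row[0] for row in rows]
```

Passing a file name instead of `":memory:"` to `open_index` keeps the index on disk, so it survives between runs and only the changed files need to be parsed again.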
Code:
First attempt at Python