Archive for September, 2004

Mysql-Python: Personal File Indexer

Wednesday, September 15th, 2004

Working with a MySQL database from Python was easier than I thought: you just import the CompatMysqldb module, connect to the desired database, and you are ready to execute MySQL commands and fetch and process the results without much trouble.

To learn a little bit of SQL (I’m still not really comfortable with SQL syntax) I changed the Python structures I used in the last indexer to MySQL tables. The goal was to mimic the last indexer, not to get good performance. So I replaced the word dictionary with a SQL table called words, where the primary key is the word and the next field is a unique id for each word; in the previous indexer this value was a list of files. Now, instead of a list for each word, there is just one table called occurs, which has an entry for every word occurrence and associates the filename with the word id. So looking up a word is as easy as getting its id and finding all entries in the occurs table that match that id; the result is all the files that contain the word.
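A minimal sketch of that layout follows. It is not the original indexer: it uses Python's built-in sqlite3 module as a stand-in for MySQL so it runs without a server, and the column names (id, word_id, filename) and the function names are assumptions chosen for illustration. Only the table names words and occurs come from the post.

```python
import sqlite3


def create_tables(conn):
    # "words": the word is the primary key, next to a unique id per word.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, id INTEGER UNIQUE)"
    )
    # "occurs": one row per occurrence, linking a word id to a filename.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS occurs (word_id INTEGER, filename TEXT)"
    )


def add_occurrence(conn, word, filename):
    # Reuse the id of a known word, or hand out the next free id.
    row = conn.execute("SELECT id FROM words WHERE word = ?", (word,)).fetchone()
    if row is None:
        word_id = conn.execute(
            "SELECT COALESCE(MAX(id), 0) + 1 FROM words"
        ).fetchone()[0]
        conn.execute("INSERT INTO words (word, id) VALUES (?, ?)", (word, word_id))
    else:
        word_id = row[0]
    conn.execute(
        "INSERT INTO occurs (word_id, filename) VALUES (?, ?)", (word_id, filename)
    )


def files_containing(conn, word):
    # Step 1: get the word id. Step 2: find every matching row in occurs.
    row = conn.execute("SELECT id FROM words WHERE word = ?", (word,)).fetchone()
    if row is None:
        return []
    rows = conn.execute(
        "SELECT DISTINCT filename FROM occurs WHERE word_id = ? ORDER BY filename",
        (row[0],),
    )
    return [r[0] for r in rows]
```

The DISTINCT is there because a file may have several rows for the same word; the search only needs each filename once.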

This indexer is a little different from the last one. It saves every word, not just the distinct words of each file. For example, in the previous indexer, if the same word was found many times in the same file, only one entry was put in the file list; now there is one entry per occurrence. I made this change on purpose, as information on each word needs to be saved.
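One thing that one-entry-per-occurrence makes possible is counting how often a word appears in each file. The sketch below illustrates this under assumptions: sqlite3 stands in for MySQL, the column names and the sample rows are invented for the example, and occurrence_counts is a hypothetical helper, not part of the posted code.

```python
import sqlite3

# In-memory stand-in database with the two tables described in the post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, id INTEGER UNIQUE)")
conn.execute("CREATE TABLE occurs (word_id INTEGER, filename TEXT)")

# Made-up sample data: "python" appears twice in notes.txt, once in todo.txt.
conn.execute("INSERT INTO words VALUES ('python', 1)")
conn.executemany(
    "INSERT INTO occurs VALUES (?, ?)",
    [(1, "notes.txt"), (1, "notes.txt"), (1, "todo.txt")],
)


def occurrence_counts(conn, word):
    """Return a {filename: number of occurrences} dict for one word."""
    rows = conn.execute(
        "SELECT o.filename, COUNT(*) FROM occurs o "
        "JOIN words w ON w.id = o.word_id "
        "WHERE w.word = ? GROUP BY o.filename",
        (word,),
    ).fetchall()
    return dict(rows)
```

With the previous indexer's one-entry-per-file list, these counts could not be recovered.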

The next step is to design a good database layout that supports all operations, like finding multi-word strings or searching files by name.

The code:
Mysql File Indexer

First attempt to python: Personal File Indexer

Tuesday, September 14th, 2004

For quite a long time I’ve been thinking of having some directories, like home, documents or the kernel sources, indexed for easy searching of strings. Searching is, for me, the best way to find files: I never remember what I called them or which directory they were finally saved in.

So, to practice what I’ve been reading in the Python book, I’ve decided to start writing a file indexer. In this post I’ll comment on my first attempt, which is really simple and aimed at learning Python and trying out indexing.

What I’ve done so far is an in-memory indexer for just one directory (and not its subdirectories). First the indexer parses the documents and, word by word, adds them to a dictionary (hash table) where the key is the word and the value is a list containing the names of all the files where the word has been found.

So, once the program has indexed the whole contents of the directory into memory, searching is easy: just look up the word in the dictionary and print the list of files where the word appears.
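The idea can be sketched in a few lines. This is an illustration written in current Python, not the code linked from this post; the function names, the lowercasing of words and the regular expression used to split words are all assumptions.

```python
import os
import re


def index_directory(directory):
    """Map each word to the list of file names (in this directory only) containing it."""
    index = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # subdirectories are not indexed
        with open(path, encoding="utf-8", errors="ignore") as handle:
            for line in handle:
                for word in re.findall(r"\w+", line.lower()):
                    files = index.setdefault(word, [])
                    if name not in files:
                        files.append(name)  # one entry per file, however often the word appears
    return index


def search(index, word):
    """Return the files where the word was found, or an empty list."""
    return index.get(word.lower(), [])
```

A search is then a single dictionary lookup, for example search(index_directory("."), "python").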

This was a really simple example of what can be done with just a few lines of Python. However, this program is totally useless: the indexing takes really long and has to be done each time you start the program, as the index is stored in RAM.

Following is the list of objectives I want to achieve in the next versions; work on each objective can be done in parallel:

Parser:
Before indexing the files they need to be parsed, which means extracting all the words that are going to be indexed. I want to split the parsing into two parts. First, the kind of file needs to be determined, as parsing a PDF document is not the same as parsing plain text. From this, metadata will be generated: the name of the file, the path to it, the last modification time, the file type and filetype-specific metadata, for example the song and artist name for an MP3.
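As a rough sketch of this first step, the generic part of the metadata can be collected with the standard library alone. The function name and the dictionary keys are assumptions, and mimetypes.guess_type only looks at the file extension, so it is a simplification of real file type detection. Filetype-specific metadata is left as an empty placeholder.

```python
import mimetypes
import os


def extract_metadata(path):
    """Collect the generic metadata listed above for one file."""
    info = os.stat(path)
    # Guess the type from the extension only; the content is not inspected.
    file_type, _encoding = mimetypes.guess_type(path)
    return {
        "name": os.path.basename(path),
        "path": os.path.abspath(path),
        "modified": info.st_mtime,  # last modification time, seconds since the epoch
        "type": file_type or "unknown",
        "extra": {},  # filetype-specific metadata (e.g. MP3 song and artist) would go here
    }
```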

Once all the metadata has been extracted the parsing can be done. The parsing will depend on the filetype, but basically means extracting the document word by word and passing each word to the indexer with additional info (like its position in the original text, the following words, …). The additional info is used to speed up the searches. The metadata is also indexed, so you can search your MP3s by artist, genre, year or whatever you want.
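For plain text, the second step could look something like the sketch below. It assumes that "position" means the index of the word within the document and that only the single following word is passed along; both are choices made for the example, as is the name parse_words.

```python
import re


def parse_words(text):
    """Yield (word, position, following_word) for each word in the text."""
    words = re.findall(r"\w+", text.lower())
    for position, word in enumerate(words):
        # The last word has no successor, so its following word is None.
        following = words[position + 1] if position + 1 < len(words) else None
        yield word, position, following
```

Knowing the following word is what would later let a multi-word search check that two words are adjacent.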

Indexer:
Right now the indexer puts the words into memory using built-in Python structures. The next step is designing the whole indexer with databases in mind. I’ve thought of using MySQL as the backend, building the dictionaries and the lists as database tables and searching through them using SQL commands. Python would maintain small caches for faster searches. Using a database would make it possible to update only the changed files and always have an up-to-date database for searching.
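Updating only the changed files could work by comparing each file's modification time with the one recorded at the last indexing run. The sketch below keeps those recorded times in a plain dictionary; in the planned design they would presumably live in the database. The function name and the dictionary are assumptions for illustration.

```python
import os


def changed_files(directory, known_mtimes):
    """Return the files that are new or modified since the last call.

    known_mtimes maps file name -> modification time seen last time and is
    updated in place.
    """
    changed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        mtime = os.stat(path).st_mtime
        if known_mtimes.get(name) != mtime:
            changed.append(name)  # only these files need to be re-indexed
            known_mtimes[name] = mtime
    return changed
```

Files that were deleted are not handled here; a fuller version would also remove their entries from the index.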

Code:
First attempt to python

Celebrating the 11th of September

Saturday, September 11th, 2004

Though it may seem strange to non-Catalan people, today we celebrate the 11th of September. It’s our National Day, now overshadowed by the NY 9/11 and many other things that happened on this same date…

FREEDOM FOR CATALONIA!!!

Python

Saturday, September 11th, 2004

Two days ago I read on Slashdot a review of a Python book, Dive into Python. I found the review so good that I wanted to learn Python. Dive into Python was released under the GNU Free Documentation License, which makes it even more attractive.

By now I’ve read up to the 5th chapter and have really enjoyed it, but I have not tried much practical programming. I guess the next posts will be about Python and its ease of use.

PowerMac G4 12″ Backlight Module

Wednesday, September 8th, 2004

After thinking a little bit about how ugly it was to patch offb directly with NVidia-specific code, I decided to convert the patch I had found into a kernel module.

It has not been really hard to do: it was just a copy/paste of the patch code into the LKM skeleton. To use it, just compile the module, load it and restart pbbuttons for button support.

PowerMac G4 12″ Backlight Module for kernel 2.6