Continuing with the indexing file system from the last post, I’ve implemented parsing support for both MP3 metadata and PDF contents. Both work the same way: when you put a PDF or an MP3 file in the indexed directory, a parser extracts the contents of the file and indexes them.
For an MP3, it indexes as words all the information contained in the tag; for a PDF, it converts the file to text (using pstotext) and indexes every word. As I mentioned in the last post, I’m not after performance. I know there are better ways of indexing a PDF, but I just want to see what an indexed filesystem would look like.
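To make the "convert to text, then index every word" step concrete, here is a minimal sketch of the idea. It is not the actual indexer code: the function names, the regular expression used to split words, and the in-memory dictionary (instead of MySQL) are all assumptions for illustration. It also assumes pstotext is on the PATH and writes the extracted text to standard output.

```python
import re
import subprocess
from collections import defaultdict

# Assumption: a "word" is any run of letters, digits or underscores.
WORD_RE = re.compile(r"\w+")


def pdf_to_text(path):
    """Convert a PDF to plain text by running pstotext (assumed to be on PATH)."""
    result = subprocess.run(
        ["pstotext", path], capture_output=True, text=True, check=True
    )
    return result.stdout


def tokenize(text):
    """Split text into lowercase words."""
    return [word.lower() for word in WORD_RE.findall(text)]


def index_text(index, path, text):
    """Record, for every word in text, that it appears in the file at path.

    Returns the total number of words seen (not the number of unique words).
    """
    words = tokenize(text)
    for word in words:
        index[word].add(path)
    return len(words)


# Hypothetical in-memory index: word -> set of file paths containing it.
index = defaultdict(set)
total = index_text(index, "example.pdf", "The quick fox. The lazy dog.")
```

For a real PDF the call would be index_text(index, path, pdf_to_text(path)). Keeping the text extraction separate from the indexing means a faster converter could replace pstotext without touching the rest.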
The next thing to do is indexing by document type, so that you can search by artist, song, PDF author and so on. Each found word also needs more information stored with it, such as the line of text where it was found, the words that follow it, and so on.
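A rough sketch of how these two ideas could look, assuming a simple in-memory structure. None of this exists in the indexer yet; the field names, function names and sample files are hypothetical.

```python
from collections import defaultdict


def index_fields(index, path, fields):
    """Index each word under the field it came from, e.g. ("artist", "miles")."""
    for field, value in fields.items():
        for word in value.lower().split():
            index[(field, word)].add(path)


def search(index, field, word):
    """Return the sorted paths whose given field contains the word."""
    return sorted(index.get((field, word.lower()), set()))


def word_positions(text):
    """Map each word to a list of (line number, following word or None)."""
    positions = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        words = line.lower().split()
        for i, word in enumerate(words):
            following = words[i + 1] if i + 1 < len(words) else None
            positions[word].append((line_no, following))
    return positions


# Hypothetical sample data: one MP3 and one PDF.
index = defaultdict(set)
index_fields(index, "a.mp3", {"artist": "Miles Davis", "title": "So What"})
index_fields(index, "b.pdf", {"author": "Miles Smith"})

by_artist = search(index, "artist", "miles")
by_author = search(index, "author", "miles")
```

Keying the index on a (field, word) pair keeps a search for an artist from matching a PDF author with the same name, while a search across all fields is still possible by trying each field in turn.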
Mini Bench:
Indexing a 1.3 MB PDF file on my system takes 55 seconds. After that, MySQL reports searches for a word as taking 0 seconds. The file contained 77241 words, of which 7161 were unique.
From this little test it is obvious that the PDF parser is too slow for real use, but it is fine for testing.
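To put the benchmark figures in perspective, this small calculation derives the indexing rate and the share of unique words from the numbers above. It only restates the measurements; the variable names are mine.

```python
# Figures taken from the mini benchmark.
total_words = 77241
unique_words = 7161
seconds = 55

# Words processed per second during indexing (roughly 1400).
words_per_second = total_words / seconds

# Fraction of the words that are distinct (a little over 9%).
unique_ratio = unique_words / total_words

print(f"{words_per_second:.0f} words/s, {unique_ratio:.1%} unique")
```

Since fewer than one word in ten is unique, most of the 55 seconds goes into handling words the index has already seen.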
Code:
Code for the Indexer with MP3 metadata and PDF support
References:
Magic Python: Determines file type
Pstotext homepage