Andrew Mustun - Private Homepage

PDF Indexer

2004-02-20


Indexer is a simple tool that creates an index over a set of PDF files. The index is a text file that contains keywords from the PDF documents and, for each keyword, a list of the files and page numbers where that word can be found.
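
To make the idea concrete, here is a minimal shell sketch of such an index. It is not Indexer's actual implementation (that lives in indexer.cpp). The function name build_index is hypothetical, and the sketch assumes plain text input in which every page ends with a form feed character, that words consist of ASCII letters only, and that upper and lower case can be treated alike. It has no ignore list, so every word ends up in the output.

```shell
# Hypothetical sketch, not Indexer's real algorithm.
# Usage: build_index file1.txt file2.txt ...
build_index() {
    # Step 1: emit one "word<tab>file<tab>page" line per word occurrence.
    # With RS set to form feed, every awk record is one page, so FNR is
    # the page number within the current file (assumption: pages are
    # separated by form feeds).
    awk '
        BEGIN { RS = "\f" }
        {
            file = FILENAME
            sub(/\.txt$/, "", file)
            n = split(tolower($0), words, /[^a-z]+/)
            for (i = 1; i <= n; i++)
                if (words[i] != "")
                    print words[i] "\t" file "\t" FNR
        }
    ' "$@" |
    # Step 2: sort by word, file and page, dropping duplicate entries.
    LC_ALL=C sort -t "$(printf '\t')" -u -k1,1 -k2,2 -k3,3n |
    # Step 3: merge all entries of a word into one line of the form
    # word<tab>file:page:page, file:page
    awk -F '\t' '
        $1 != word { if (NR > 1) print line; word = $1; file = ""; line = $1 "\t" }
        $2 != file { line = line (file == "" ? "" : ", ") $2; file = $2 }
        { line = line ":" $3 }
        END { if (NR > 0) print line }
    '
}
```

Calling build_index on a few small text files shows quickly why an unfiltered index grows large: every distinct word gets its own line.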

Preparation

Before you start, put all the PDF files you want to index into one directory and rename them so that each name is relatively short and unique, e.g.:
security.pdf
infra.pdf
basics.pdf
...

Converting PDF to Text

Indexer reads plain text files, not PDF files. Before you can create the index, you have to convert the PDF files into plain text. To do that, download the xpdf package from here:
http://www.foolabs.com/xpdf/download.html
There's a Linux package that contains precompiled binaries. One of these binaries is called 'pdftotext'. Copy this binary into the directory with the PDF files and convert all PDF files to plain text. If you're using bash, this can be achieved with the following command (note that ./pdftotext *.pdf won't work, because pdftotext takes a single PDF file per call and treats a second file name as the output file):
for file in *.pdf; do ./pdftotext "$file"; done
Now you should have the plain text files:
security.txt
infra.txt
basics.txt
...
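
As a quick sanity check of the conversion, the pages in a text file can be counted. This sketch relies on pdftotext ending each page with a form feed character, which it normally does unless it is started with -nopgbrk; treat that as an assumption and verify it with your version. The function name count_pages is made up for this example.

```shell
# Hypothetical helper: print the number of form feed characters in a
# text file, which equals the page count if every page ends with one.
# Usage: count_pages security.txt
count_pages() {
    # Delete every byte except form feeds, then count what is left.
    # The arithmetic expansion strips the padding some wc versions add.
    echo $(( $(tr -cd '\f' < "$1" | wc -c) ))
}
```

If the number printed for a file differs clearly from the page count of the PDF, the page numbers in the index for that file are probably unreliable as well.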

Creating the Index

Download the Indexer source archive from here:
archives/indexer-0.1.0-src.tar.gz
The archive contains the source file indexer.cpp, a Makefile and a file with an ignore list. Copy those files to the same directory where the plain text files are.

Compile Indexer:
make
This should create the file 'indexer' which is the tool that will create the index from the plain text files.

Now you can create the index with the following command:
./indexer *.txt 1>myindex 2>log
This creates the two files 'myindex' and 'log'. 'myindex' is the index of your PDF files. A typical line might look like this:
keyword<tab>file1:4:5, file2:6
The tab between the keyword and the references is quite useful if you load the index file into OpenOffice and convert it to a table. After the tab, all files in which the keyword appears are listed, each followed by the colon-separated numbers of all pages on which the word was found.
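
The same format is also easy to query from the shell. The following sketch looks up one keyword and prints a "file<tab>page" line for each reference. The function name lookup is hypothetical, and the sketch assumes the layout shown above exactly: one tab after the keyword, references separated by a comma and a space, and no colons inside file names.

```shell
# Hypothetical helper: list every file/page pair for one keyword.
# Usage: lookup keyword indexfile
lookup() {
    awk -F '\t' -v kw="$1" '
        $1 == kw {
            # Split "file1:4:5, file2:6" into one reference per file ...
            n = split($2, refs, ", ")
            for (i = 1; i <= n; i++) {
                # ... and each reference into the file name and its pages.
                m = split(refs[i], parts, ":")
                for (j = 2; j <= m; j++)
                    print parts[1] "\t" parts[j]
            }
        }
    ' "$2"
}
```

For the example line above, lookup keyword myindex would print file1 with pages 4 and 5 and file2 with page 6, one pair per line.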

Coping with Size

Indexer usually creates a very large index. To prevent commonly used words from appearing in the index, Indexer reads the file 'ignorelist' and ignores all words that are listed in that file. A default 'ignorelist' file comes with Indexer, but you might want to adjust it to your individual needs or to the language used in the PDFs.
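
One way to find candidates for the ignore list is to look at the most frequent words in your own text files. The sketch below prints the N most common words, one per line. The function name frequent_words is hypothetical, words are lowercased and reduced to letters, and the exact format that Indexer expects in 'ignorelist' is an assumption here; check the default file that ships with Indexer before appending anything to it.

```shell
# Hypothetical helper: print the N most frequent words in the given files.
# Usage: frequent_words 50 *.txt
frequent_words() {
    n=$1
    shift
    cat "$@" |
        tr '[:upper:]' '[:lower:]' |   # treat "The" and "the" alike
        tr -cs '[:alpha:]' '\n' |      # one word per line, letters only
        grep -v '^$' |                 # drop empty lines
        sort | uniq -c |               # count each distinct word
        sort -k1,1nr -k2,2 |           # most frequent first, then by word
        head -n "$n" |
        awk '{ print $2 }'             # keep the word, drop the count
}
```

Review the list by hand before using it, since a frequent word can still be a keyword you want to find in the index.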