Indexer is a simple tool that creates an index over a set of PDF
files. An index is a text file that lists keywords from the PDF
documents and, for each keyword, the files and page numbers where
that keyword can be found.
Preparation
Before you start, put all the PDF files you want to index in one
directory and rename them so that each name is short and unique,
e.g.:
security.pdf
infra.pdf
basics.pdf
...
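If the original file names are long or contain spaces and punctuation, the renaming can be scripted. The following is only a sketch: safe_name is a hypothetical helper (not part of Indexer) that lowercases a name and replaces every run of other characters with a single underscore. It does not guarantee uniqueness, so check the result before renaming.

```shell
# Hypothetical helper: derive a short, index-friendly name from a PDF file name.
safe_name() {
    local base clean
    base=$(basename "$1")
    base=${base%.[Pp][Dd][Ff]}          # strip the extension, whatever its case
    # lowercase, then squeeze every run of other characters into one underscore
    clean=$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '_')
    clean=${clean#_}                    # drop a leading underscore, if any
    clean=${clean%_}                    # drop a trailing underscore, if any
    printf '%s.pdf\n' "$clean"
}
```

A possible use is mv -i -- "$f" "$(safe_name "$f")" inside a loop over the PDF files; the -i option makes mv ask before overwriting when two files map to the same name.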
Converting PDF to Text
Indexer reads plain text files, not PDF files. Before you can create
the index, you have to convert the PDF files into plain text.
To do that, please download the xpdf package from here:
http://www.foolabs.com/xpdf/download.html
There's a Linux package that contains precompiled binaries. One of
these binaries is called 'pdftotext'. Copy this binary to the directory
with the PDF files and convert all PDF files to plain text. If you're
using bash, this can be done with the following command (note that
./pdftotext *.pdf won't work, because pdftotext converts only one PDF
file per call):
for file in *.pdf; do ./pdftotext "$file"; done
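For larger collections, a slightly more defensive variant of this loop may be handy. The sketch below is an illustration, not part of Indexer: convert_all and the PDFTOTEXT override are hypothetical names, and the function assumes that the converter writes name.txt next to name.pdf when called with a single argument, as pdftotext does.

```shell
# Sketch: convert every PDF in a directory, skipping files already converted.
# PDFTOTEXT is a hypothetical override; it defaults to the ./pdftotext binary
# copied into the current directory.
convert_all() {
    local dir converter failed file txt
    dir=${1:-.}
    converter=${PDFTOTEXT:-./pdftotext}
    failed=0
    for file in "$dir"/*.pdf; do
        [ -e "$file" ] || continue      # no PDFs: the glob stayed literal
        txt=${file%.pdf}.txt
        # skip when the text file exists and is newer than the PDF
        if [ -e "$txt" ] && [ "$txt" -nt "$file" ]; then
            continue
        fi
        if ! "$converter" "$file"; then
            echo "conversion failed: $file" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Calling convert_all without arguments works on the current directory. The quoting around "$file" keeps names with spaces intact, and a non-zero return value signals that at least one file could not be converted.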
Now you should have the plain text files:
security.txt
infra.txt
basics.txt
...
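By default, pdftotext ends each page with a form feed character, which is what allows page numbers to be recovered from plain text. Whether Indexer relies on exactly these characters is an assumption here, but counting them is a quick sanity check that a conversion kept its page breaks. count_pages is a hypothetical helper name.

```shell
# Count the form feed characters (page breaks) in a pdftotext output file.
count_pages() {
    # keep only form feeds, count the remaining bytes, strip wc's padding
    tr -cd '\f' < "$1" | wc -c | tr -d ' '
}
```

For example, count_pages security.txt should print a number close to the page count of security.pdf. A result of 0 suggests that the page breaks were suppressed during conversion.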
Creating the Index
Download the Indexer source archive from here:
archives/indexer-0.1.0-src.tar.gz
The archive contains the source file indexer.cpp, a Makefile and
a file with an ignore list.
Copy those files to the same directory where the plain text files are.
Compile Indexer using the supplied Makefile (typically by running 'make').
This should create the file 'indexer', which is the tool that creates
the index from the plain text files.
Now you can create the index with the following command:
./indexer *.txt 1>myindex 2>log
This creates the two files 'myindex' and 'log'. 'myindex' is the index
of your PDF files; 'log' receives whatever Indexer writes to standard
error. A typical line of the index might look like this:
keyword<tab>file1:4:5, file2:6
The tab between the keyword and the references is quite useful if
you load the index file into OpenOffice and convert it to a table.
After the tab, all files in which the keyword appears are listed,
each followed by the numbers of all pages on which the word was found.
In the example above, the keyword appears on pages 4 and 5 of file1
and on page 6 of file2.
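The line format described above can also be processed with standard tools. The following sketch expands each index line into one row per occurrence, which is convenient for filtering or for importing into a spreadsheet. index_to_rows is a hypothetical helper, and it assumes that file names contain neither ':' nor ','.

```shell
# Expand index lines of the form "keyword<TAB>file1:4:5, file2:6" into
# one "keyword<TAB>file<TAB>page" row per occurrence.
index_to_rows() {
    awk -F '\t' '
    {
        n = split($2, refs, /, */)              # one entry per file
        for (i = 1; i <= n; i++) {
            m = split(refs[i], parts, ":")      # parts[1] = file, rest = pages
            for (j = 2; j <= m; j++)
                printf "%s\t%s\t%s\n", $1, parts[1], parts[j]
        }
    }' "$@"
}
```

For instance, index_to_rows myindex | awk -F '\t' '$1 == "firewall"' would list every file and page for the (made-up) keyword "firewall".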
Coping with Size
Indexer usually creates a very large index. To keep commonly used
words out of the index, Indexer reads the file 'ignorelist' and
ignores all words listed in it. A default 'ignorelist' file comes
with Indexer, but you might want to adjust it to your needs or to
the language used in the PDFs.
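One way to find candidates for the ignore list is to look for the keywords with the most page references in an existing index. The sketch below is only an illustration: top_keywords is a hypothetical helper, it counts the ':' characters in the reference column, and it therefore assumes that file names contain no ':'.

```shell
# Print the N keywords (default 20) with the most page references in an index.
top_keywords() {
    awk -F '\t' '{
        refs = $2
        n = gsub(/:/, "", refs)     # each page number is preceded by a colon
        print n "\t" $1
    }' "$1" | sort -k1,1nr | head -n "${2:-20}" | cut -f2
}
```

Running top_keywords myindex 50 prints the 50 most frequent keywords. Words from that list that carry no meaning for your documents can then be added to 'ignorelist', following the format of the default file, before the index is created again.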