[Egothor-tech] PDF & DOC indexing, Egothor 1.3.003
Filip Koczorowski
filipk at man.poznan.pl
Wed Oct 5 14:06:57 BST 2005
Leo Galambos wrote:
> | Egothor 1.3.003, a clean download from sourceforge.net, seems to
> | have some trouble with indexing PDF and DOC documents. I prepared a
> | simple page with one line of text "This is a testing page for
> | Egothor" and two links - one in a form of "a href=test.pdf" and
> | another one as "a href=test.doc". Both of these files contain a
> | header "This is a testing page for Egothor" and a single paragraph
> | of text (files created with OpenOffice 2.0beta).
> |
> | After I run Capek, I get a "corpus" folder that contains the HTML
> | page and the PDF & DOC files. However when I run Michelangelo, the
> | resulting index has no information from PDF & DOC. I looked into
> | "doc.dta" file in "index" folder and it contains HTML page content
> | as well as "test.pdf" file name inside, but nothing else (no
> | content of PDF nor any sign of DOC).
> |
> | I would appreciate any suggestions - perhaps I am doing something
> | wrong...
>
>
> could you send the data files to my private email box, please? I will
> have to peek at it closer ;)

Actually, I recently managed to solve my PDF indexing problem. The key
was to upgrade the PDFBox library. The PDF documents I was trying to
parse and index were generated by OpenOffice, and that was the cause.
PDFBox has handled such files since version 0.6.7 ("fix parsing of open
office documents", as seen on
http://www.pdfbox.org/changes.html#version_0.6.7). The most recent
release is 0.7.2, and that is the version I use now.

However, I still have trouble parsing DOC documents. Perhaps the Jakarta
POI library used by Egothor is also outdated. Unfortunately, the
development of POI has nearly stopped: the stable release is 2.5.1,
dated August 2004, and the next version, 3.0, is still in alpha (the
last development release is dated July 2005).
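
For anyone chasing the same symptom, it may help to first rule out a
damaged copy in the "corpus" folder before blaming the parser library.
The sketch below is only an illustration and is not part of Egothor,
PDFBox or POI; the class and method names are made up. It assumes the
usual file signatures: a PDF is expected to begin with the ASCII bytes
"%PDF-", and a binary Word DOC is an OLE2 compound file that begins with
the bytes D0 CF 11 E0 A1 B1 1A E1.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration only; not an Egothor, PDFBox or POI class.
final class CorpusSignatureCheck {
    // ASCII "%PDF-", the expected start of a PDF file.
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F', '-'};

    // Header of an OLE2 compound file, the container used by binary .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if data begins with all bytes of prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Classifies the leading bytes of a file as "pdf", "ole2" or "unknown".
    static String classify(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) {
            return "pdf";
        }
        if (startsWith(head, OLE2_MAGIC)) {
            return "ole2";
        }
        return "unknown";
    }

    // Reads up to the first 8 bytes of the file and classifies them.
    static String classify(Path file) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (filled < head.length) {
                int read = in.read(head, filled, head.length - filled);
                if (read < 0) {
                    break; // file is shorter than 8 bytes
                }
                filled += read;
            }
        }
        return classify(Arrays.copyOf(head, filled));
    }
}
```

Calling CorpusSignatureCheck.classify(path) on each fetched file shows
which signature it carries. A file that comes back as "unknown" was
probably stored incorrectly; a file with the right signature that still
does not reach the index points at the parsing step instead.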
Regards,
Filip