[Egothor-tech] Egothor with PDF parsing: unable to find a word
even though it appears to be in the barrel
SPIELMANN Christophe
cspielmann at europarl.eu.int
Thu Aug 5 16:20:55 BST 2004
We are facing a problem with the results of PDF parsing (we use Egothor 1.2.5rc6 with JDK 1.4.1_02).
Although the word "krav" occurs in a PDF, a basic query for it returns no hits.
The strange thing is that we can find the word using the Dumper or the Expand command.
Any help would be welcome.
The logs are provided below.
When parsing one directory containing the files:
-------------------------------------------------------
- danish.pdf (Danish PDF)
- site.pdf (English PDF)
- index.html (English HTML)
we got the following state file after parsing:
------------------------------------------------
#Tanker state
#Thu Aug 05 16:59:03 CEST 2004
slotter.last=1
egothor.capacity=32
slotter.flat=false
egothor.slot.2=1
The log of the Directory command is:
-----------------------------------------------------
.../..
Aug 5, 2004 4:58:44 PM org.egothor.crusher.Finder scanPackages
INFO: <java.io.InputStream;15;java.io.Reader>
Switching lowercase to true
Switching Snippet support to true
C:\DGPE\egothor_from as http://winold/manual/
danish.pdf
Input
java.lang.String
Flags: <FILENAME><PDF>
Output
org.egothor.data.Document
Flags: <HOME><PUNCTUATION><LOWERCASE><SNIPPET>
Filtering system found:
--$0--> via org.egothor.crusher.IniPath:java.lang.String<PDF><FILENAME>
--$1--> via org.egothor.crusher.connectors.InputStreamPath:java.io.InputStream<BUFFERED><PDF><FILENAME>
--$21--> via org.egothor.crusher.connectors.PDFPath:java.io.Reader<BUFFERED><PDF><FILENAME><NOHTMLTAGS>
--$31--> via org.egothor.crusher.connectors.TokenizerPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><TAGGED><FILENAME><NOHTMLTAGS>
--$36--> via org.egothor.crusher.connectors.PunctPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS>
--$38--> via org.egothor.crusher.connectors.LowerCasePath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS><LOWERCASE>
--$53--> via org.egothor.crusher.connectors.BHTML2Path:org.egothor.data.Document<BUFFERED><PDF><PUNCTUATION><HOME><TAGGED><FILENAME><NOHTMLTAGS><SNIPPET><LOWERCASE>
log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly.
index.html
Input
java.lang.String
Flags: <FILENAME><HTML>
Output
org.egothor.data.Document
Flags: <HOME><PUNCTUATION><LOWERCASE><SNIPPET>
Filtering system found:
--$0--> via org.egothor.crusher.IniPath:java.lang.String<HTML><FILENAME>
--$1--> via org.egothor.crusher.connectors.ReaderPath:java.io.Reader<BUFFERED><HTML><FILENAME>
--$6--> via org.egothor.crusher.connectors.HTML3Path:java.io.Reader<BUFFERED><HTML><SEMANTICS><FILENAME><NOHTMLTAGS>
--$16--> via org.egothor.crusher.connectors.TokenizerPath:org.egothor.parser.Tokenizer<BUFFERED><HTML><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS>
--$21--> via org.egothor.crusher.connectors.PunctPath:org.egothor.parser.Tokenizer<BUFFERED><PUNCTUATION><HTML><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS>
--$23--> via org.egothor.crusher.connectors.LowerCasePath:org.egothor.parser.Tokenizer<BUFFERED><HTML><PUNCTUATION><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS><LOWERCASE>
--$38--> via org.egothor.crusher.connectors.BHTML2Path:org.egothor.data.Document<BUFFERED><PUNCTUATION><HTML><HOME><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS><SNIPPET><LOWERCASE>
site.pdf
Input
java.lang.String
Flags: <FILENAME><PDF>
Output
org.egothor.data.Document
Flags: <HOME><PUNCTUATION><LOWERCASE><SNIPPET>
Filtering system found:
--$0--> via org.egothor.crusher.IniPath:java.lang.String<PDF><FILENAME>
--$1--> via org.egothor.crusher.connectors.InputStreamPath:java.io.InputStream<BUFFERED><PDF><FILENAME>
--$21--> via org.egothor.crusher.connectors.PDFPath:java.io.Reader<BUFFERED><PDF><FILENAME><NOHTMLTAGS>
--$31--> via org.egothor.crusher.connectors.TokenizerPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><TAGGED><FILENAME><NOHTMLTAGS>
--$36--> via org.egothor.crusher.connectors.PunctPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS>
--$38--> via org.egothor.crusher.connectors.LowerCasePath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS><LOWERCASE>
--$53--> via org.egothor.crusher.connectors.BHTML2Path:org.egothor.data.Document<BUFFERED><PDF><PUNCTUATION><HOME><TAGGED><FILENAME><NOHTMLTAGS><SNIPPET><LOWERCASE>
Commit...
...optimize()
...commit()
Done.
Aug 5, 2004 4:58:55 PM org.egothor.dir.TankerImpl commit
INFO: Saving state
The query gives the following result:
---------------------------------------
Aug 5, 2004 4:59:31 PM org.egothor.dir.TankerImpl loadState
INFO: Loading state
Query: krav
Aug 5, 2004 4:59:31 PM org.egothor.query.Executor query
INFO: [null:<WORD>krav r,p true,false]
Aug 5, 2004 4:59:31 PM org.egothor.dir.TankerImpl elements
INFO: Dynamizer is dirty
Aug 5, 2004 4:59:31 PM org.egothor.dir.TankerImpl elements
INFO: Dynamizer is dirty
Aug 5, 2004 4:59:32 PM TermRunner constructor
INFO: setup
0
<?xml version="1.0" encoding="UTF-8"?>
<query><group required="no" prohibited="no" unknown="no" excluded="no"><term required="yes" prohibited="no" unknown="no" excluded="no" value="<WORD>krav" control="no" idf="1.001" boost="1"/></group></query>
The Expand command gives:
---------------------------------------
C:/Dgpe/Egothor_barrel expand of <WORD>kr*
Aug 5, 2004 4:59:03 PM org.egothor.dir.TankerImpl loadState
INFO: Loading state
Aug 5, 2004 4:59:03 PM org.egothor.dir.TankerImpl elements
INFO: Dynamizer is dirty
<WORD>kraft
<WORD>krav
<WORD>kriterier
<WORD>kræver
Aug 5, 2004 4:59:03 PM org.egothor.dir.TankerImpl commit
INFO: Saving state
The Dumper gives:
---------------------------------
0 [PDF/PS] : [http://winold/manual//danish.pdf] :CM\531576DA.doc PE 344.027 Or. EN DA DA EUROPA-PARLAMENTET BUDGETUDVALGET Meddelelse til medlemmerne Om: Håndbog for nye udvalgsmedlemmer GENERALDIREKTORATET FOR INTERNE POLITIKKER 3. juni 2004 PE 344.027 2/9 CM\531576DA.doc DA Indledning Europ
1 Struts for Transforming XML with XSL (stxx) [http://winold/manual//index.html] :the stxx site stxx Home Getting Started About Index License Download Who we are FAQ Changes Todo Site as PDF Getting Involved Contributing...
2 [PDF/PS] : [http://winold/manual//site.pdf] :stxx Documentation Table of contents 1. About.................................................................................................................................... 1 1.1. Struts for Transforming XML with XSL (stxx).........................
<!VOLATILE>depthrank 3 org.egothor.store.disc.RankFileIn
0 w=9 :
1 w=9 :
2 w=9 :
<ACRONYM>e.g. 1 org.egothor.store.disc.IListFileIn
2 w=1 : 3220
<APOSTROPHE>action's 1 org.egothor.store.disc.IListFileIn
2 w=1 : 6005
<APOSTROPHE>apache's 1 org.egothor.store.disc.IListFileIn
2 w=1 : 3934
.../...
<WORD>korrekt 1 org.egothor.store.disc.IListFileIn
0 w=4 : 2168
<WORD>kort 1 org.egothor.store.disc.IListFileIn
0 w=14 : 338 1634 2342
<WORD>kraft 1 org.egothor.store.disc.IListFileIn
0 w=4 : 110
<WORD>krav 1 org.egothor.store.disc.IListFileIn
0 w=4 : 2590
<WORD>kriterier 1 org.egothor.store.disc.IListFileIn
0 w=4 : 668
<WORD>kræver 1 org.egothor.store.disc.IListFileIn
0 w=4 : 401
<WORD>kun 1 org.egothor.store.disc.IListFileIn
0 w=42 : 373 421 694 1115 1369 1891 1932 2193 2426
<WORD>kunne 1 org.egothor.store.disc.IListFileIn
../...
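
To double-check from a saved copy of the Dumper output that a term really has an entry, something like the following could be used. This is only a sketch under assumptions: it assumes the dump was saved as plain text and that each term header line is immediately followed by its entry line, as in the excerpt above. The names DumpLookup and findPostings are made up for illustration and are not part of Egothor. It also needs Java 8 or later, so it would have to run outside the JDK 1.4 setup used for indexing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class DumpLookup {

    /**
     * Hypothetical helper: returns the line that follows the header line
     * of the given term in a saved Dumper output, if the term is present.
     */
    public static Optional<String> findPostings(List<String> dumpLines, String term) {
        for (int i = 0; i + 1 < dumpLines.size(); i++) {
            String line = dumpLines.get(i).trim();
            // Require a space after the term so that "<WORD>krav"
            // does not match longer terms that merely share the prefix.
            if (line.startsWith(term + " ")) {
                return Optional.of(dumpLines.get(i + 1).trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Lines copied from the Dumper excerpt in this message.
        List<String> dump = Arrays.asList(
            "<WORD>kraft 1 org.egothor.store.disc.IListFileIn",
            "0 w=4 : 110",
            "<WORD>krav 1 org.egothor.store.disc.IListFileIn",
            "0 w=4 : 2590");
        // prints "0 w=4 : 2590"
        System.out.println(findPostings(dump, "<WORD>krav").orElse("not found"));
    }
}
```

The check only shows that the term is listed in the dump; it says nothing about why the query itself returns 0 hits.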
Christophe Spielmann