[Egothor-tech] RE: Egothor with Pdf parsing: unable to find out a word
despite it seems to be into the barrel
SPIELMANN Christophe
cspielmann at europarl.eu.int
Fri Aug 6 08:40:01 BST 2004
Hi Leo,
Thanks for the fast reply (you should not work so late :-)).
Here is what we got from TankerQuery:
Aug 6, 2004 9:35:21 AM ~dir.TankerImpl loadState INFO: Loading state C:\Dgpe\Egothor_barrel\state
Aug 6, 2004 9:35:21 AM ~query.Executor query INFO: [null:<WORD>krav r,p true,false]
Aug 6, 2004 9:35:21 AM ~dir.TankerImpl elements INFO: Dynamizer is dirty C:\\Dgpe\\Egothor_barrel\
Aug 6, 2004 9:35:21 AM ThinkBarrel constructor FINER: ENTRY C:\\Dgpe\\Egothor_barrel\\2\ C:\\Dgpe\\Egothor_barrel\\2\
Aug 6, 2004 9:35:21 AM DiscIndexData setLocation FINER: ENTRY C:\\Dgpe\\Egothor_barrel\\2\ C:\\Dgpe\\Egothor_barrel\\2\
Aug 6, 2004 9:35:21 AM ThinkBarrel constructor FINER: ENTRY C:\\Dgpe\\Egothor_barrel\\3\ C:\\Dgpe\\Egothor_barrel\\3\
Aug 6, 2004 9:35:21 AM DiscIndexData setLocation FINER: ENTRY C:\\Dgpe\\Egothor_barrel\\3\ C:\\Dgpe\\Egothor_barrel\\3\
Aug 6, 2004 9:35:21 AM ~store.disc.IListMeta factor FINEST: <WORD>krav flen=0.0 fcap=0.0
Aug 6, 2004 9:35:21 AM DiscKeyIndexData indexOf FINER: ENTRY <WORD>krav <WORD>krav
Aug 6, 2004 9:35:21 AM ~store.disc.IListMeta factor FINEST: <WORD>krav flen=0.0 fcap=0.0
Aug 6, 2004 9:35:21 AM DiscKeyIndexData indexOf FINER: ENTRY <WORD>krav <WORD>krav
Aug 6, 2004 9:35:21 AM ~dir.TankerImpl elements INFO: Dynamizer is dirty C:\\Dgpe\\Egothor_barrel\
Aug 6, 2004 9:35:21 AM ThickBarrel openIList FINER: ENTRY <WORD>krav <WORD>krav
Aug 6, 2004 9:35:21 AM ~store.disc.IListMeta factor FINEST: <WORD>krav flen=0.0 fcap=0.0
Aug 6, 2004 9:35:21 AM ~db.disc.DiscKeyIndexDataCache elementAt FINE: DiscKey cache hit
Aug 6, 2004 9:35:21 AM ~store.disc.IListMeta constructIn FINER: <WORD>krav GIStream (0.6666667%) 2
Aug 6, 2004 9:35:21 AM ~store.disc.ThickBarrel openIList FINEST: clear list
Aug 6, 2004 9:35:21 AM TermRunner constructor INFO: setup <WORD>krav idf=0.6141471927654584 myidf=0.6141471927654584 boost=1 field=null req=true proh=false
Aug 6, 2004 9:35:21 AM ~store.disc.IListFileIn close FINEST: close
Aug 6, 2004 9:35:21 AM ThickBarrel openIList FINER: ENTRY <WORD>krav <WORD>krav
Aug 6, 2004 9:35:21 AM ~store.disc.IListMeta factor FINEST: <WORD>krav flen=0.0 fcap=0.0
Aug 6, 2004 9:35:21 AM ~db.disc.DiscKeyIndexDataCache elementAt FINE: DiscKey cache hit
Aug 6, 2004 9:35:21 AM ~store.disc.IListMeta constructIn FINER: <WORD>krav GIStream (0.33333334%) 2
Aug 6, 2004 9:35:21 AM ~store.disc.ThickBarrel openIList FINEST: clear list
Aug 6, 2004 9:35:21 AM TermRunner constructor INFO: setup <WORD>krav idf=0.6141471927654584 myidf=0.6141471927654584 boost=1 field=null req=true proh=false
Aug 6, 2004 9:35:21 AM ~store.disc.IListFileIn close FINEST: close
I will now do what you asked for in your second mail
and will keep you posted.
Bye
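When comparing runs, the TermRunner "setup" lines in the log above are the ones worth extracting, since they carry the term, its idf and the req/proh flags. Below is a minimal sketch of pulling those key=value pairs out of one log line. It assumes only the line layout visible in this log; the class and method names are hypothetical helpers, not part of Egothor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not an Egothor class): splits a TermRunner
// "setup" log line into its term and key=value fields.
class TermRunnerLine {

    // Returns an empty map when the line has no "setup " payload,
    // as in the earlier failing query where "INFO: setup" was bare.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        int at = line.indexOf("setup ");
        if (at < 0) {
            return fields;
        }
        String payload = line.substring(at + "setup ".length()).trim();
        for (String token : payload.split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                // key=value pair such as idf=0.61... or req=true
                fields.put(token.substring(0, eq), token.substring(eq + 1));
            } else if (!token.isEmpty() && !fields.containsKey("term")) {
                // first bare token is taken to be the term, e.g. <WORD>krav
                fields.put("term", token);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "Aug 6, 2004 9:35:21 AM TermRunner constructor INFO: "
                + "setup <WORD>krav idf=0.6141471927654584 "
                + "myidf=0.6141471927654584 boost=1 field=null req=true proh=false";
        System.out.println(parse(line));
    }
}
```

Running this over each egolog-* file and comparing the idf and req/proh values per barrel is one way to spot where the two TermRunner instances differ. Only the field names come from the log itself.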
-----Original Message-----
From: egothor-tech-bounces at egothor.org [mailto:egothor-tech-bounces at egothor.org] On Behalf Of Leo Galambos
Sent: 05 August 2004 20:31
To: Egothor list
Subject: Re: [Egothor-tech] RE: Egothor with Pdf parsing: unable to find out a word despite it seems to be into the barrel
Hi,
I think that you did everything right, it must be a bug. Could you run
TankerQuery with logging on, please?
java -Djava.util.logging.config.file=log.prop org.egothor.test.TankerQuery c:\Dgpe\Egothor_barrel krav
where "log.prop" is a plain-text file containing:
handlers=java.util.logging.FileHandler,java.util.logging.ConsoleHandler
.level=FINEST
java.util.logging.FileHandler.formatter = org.egothor.util.EgothorFormatter
java.util.logging.FileHandler.encoding = ISO-8859-1
java.util.logging.FileHandler.limit = 1000000
java.util.logging.FileHandler.count = 50
java.util.logging.FileHandler.pattern = egolog-%g
java.util.logging.FileHandler.append = false
java.util.logging.ConsoleHandler.level = SEVERE
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter
TankerQuery will generate egolog-* files which would show us what's broken.
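If passing -D options is awkward (for example when launching from an IDE), java.util.logging can also load this kind of configuration from code through LogManager.readConfiguration. Below is a minimal sketch of that. It assumes a modern JDK (StandardCharsets needs Java 7 or later) and uses a console-only subset of the keys so that it runs without Egothor on the classpath. The class name LoggingSetup is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

// Hypothetical helper: applies a java.util.logging configuration
// from an in-memory properties text instead of a -D system property.
class LoggingSetup {

    // Console-only subset of the log.prop keys; the FileHandler and the
    // Egothor formatter are left out so the sketch has no dependencies.
    static final String CONFIG =
            "handlers=java.util.logging.ConsoleHandler\n"
          + ".level=FINEST\n"
          + "java.util.logging.ConsoleHandler.level=SEVERE\n"
          + "java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter\n";

    // Replaces the current logging configuration with CONFIG.
    static void configure() {
        try {
            LogManager.getLogManager().readConfiguration(
                    new ByteArrayInputStream(CONFIG.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Level of the root logger, which ".level" controls.
    static Level rootLevel() {
        return Logger.getLogger("").getLevel();
    }

    public static void main(String[] args) {
        configure();
        Logger log = Logger.getLogger("demo");
        log.finest("passes the logger level, filtered by the console handler");
        log.severe("passes both the logger and the console handler");
        System.out.println("root level: " + rootLevel());
    }
}
```

With the full log.prop, the FileHandler would additionally write every FINEST record to the egolog-* files, while the console stays limited to SEVERE.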
Thank you.
Leo
SPIELMANN Christophe wrote:
> !!!
> It seems to work with the robot indexer instead of the local indexer.
> This still has to be fully tested; I will provide more information soon.
> Question: why the difference?
>
> -----Original Message-----
> *From:* SPIELMANN Christophe
> *Sent:* 05 August 2004 17:31
> *To:* 'Egothor-tech at egothor.org'
> *Subject:* RE: Egothor with Pdf parsing: unable to find out a word
> despite it seems to be into the barrel
>
> I forgot to specify which commands I used (NT4 with Eclipse):
>
> org.egothor.apps.Directory C:\\Dgpe\\Egothor_barrel -lowercase
> -snippet C:\\Dgpe\\Egothor_from as http://winold/manual/
>
> org.egothor.test.TankerQuery C:\\Dgpe\\Egothor_barrel krav
>
> org.egothor.test.Dumper -DLWP C:\\Dgpe\\Egothor_barrel\\1\\
>
> txs
>
> -----Original Message-----
> *From:* SPIELMANN Christophe
> *Sent:* 05 August 2004 17:21
> *To:* 'Egothor-tech at egothor.org'
> *Cc:* CLAUS Pascal
> *Subject:* Egothor with Pdf parsing: unable to find out a word
> despite it seems to be into the barrel
>
> We are facing a problem with the result of parsing a PDF
> (we use Egothor 1.2.5rc6 / JDK 1.4.1_02).
>
> Although the word "krav" is in a PDF, we are not able to
> find it with a basic query.
> The strange thing is that we are able to find it using the
> Dumper or the Expand command.
> Any help would be welcome.
> We provide the logs below:
>
> When parsing one directory with files:
> -------------------------------------------------------
>
> - danish.pdf ( danish pdf )
> - site.pdf (english pdf )
> - index.html ( english html )
>
> we got after parsing (state file )
> ------------------------------------------------
>
> #Tanker state
> #Thu Aug 05 16:59:03 CEST 2004
> slotter.last=1
> egothor.capacity=32
> slotter.flat=false
> egothor.slot.2=1
>
> the log of the Directory command is:
> -----------------------------------------------------
> .../..
> Aug 5, 2004 4:58:44 PM org.egothor.crusher.Finder scanPackages
> INFO: <java.io.InputStream;15;java.io.Reader>
> Switching lowercase to true
> Switching Snippet support to true
> C:\DGPE\egothor_from as http://winold/manual/
> danish.pdf
> Input
> java.lang.String
> Flags: <FILENAME><PDF>
> Output
> org.egothor.data.Document
> Flags: <HOME><PUNCTUATION><LOWERCASE><SNIPPET>
> Filtering system found:
> --$0--> via
> org.egothor.crusher.IniPath:java.lang.String<PDF><FILENAME>
> --$1--> via
> org.egothor.crusher.connectors.InputStreamPath:java.io.InputStream<BUFFERED><PDF><FILENAME>
> --$21--> via
> org.egothor.crusher.connectors.PDFPath:java.io.Reader<BUFFERED><PDF><FILENAME><NOHTMLTAGS>
> --$31--> via
> org.egothor.crusher.connectors.TokenizerPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><TAGGED><FILENAME><NOHTMLTAGS>
> --$36--> via
> org.egothor.crusher.connectors.PunctPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS>
> --$38--> via
> org.egothor.crusher.connectors.LowerCasePath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS><LOWERCASE>
> --$53--> via
> org.egothor.crusher.connectors.BHTML2Path:org.egothor.data.Document<BUFFERED><PDF><PUNCTUATION><HOME><TAGGED><FILENAME><NOHTMLTAGS><SNIPPET><LOWERCASE>
> log4j:WARN No appenders could be found for logger
> (org.pdfbox.pdfparser.PDFParser).
> log4j:WARN Please initialize the log4j system properly.
> index.html
> Input
> java.lang.String
> Flags: <FILENAME><HTML>
> Output
> org.egothor.data.Document
> Flags: <HOME><PUNCTUATION><LOWERCASE><SNIPPET>
> Filtering system found:
> --$0--> via
> org.egothor.crusher.IniPath:java.lang.String<HTML><FILENAME>
> --$1--> via
> org.egothor.crusher.connectors.ReaderPath:java.io.Reader<BUFFERED><HTML><FILENAME>
> --$6--> via
> org.egothor.crusher.connectors.HTML3Path:java.io.Reader<BUFFERED><HTML><SEMANTICS><FILENAME><NOHTMLTAGS>
> --$16--> via
> org.egothor.crusher.connectors.TokenizerPath:org.egothor.parser.Tokenizer<BUFFERED><HTML><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS>
> --$21--> via
> org.egothor.crusher.connectors.PunctPath:org.egothor.parser.Tokenizer<BUFFERED><PUNCTUATION><HTML><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS>
> --$23--> via
> org.egothor.crusher.connectors.LowerCasePath:org.egothor.parser.Tokenizer<BUFFERED><HTML><PUNCTUATION><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS><LOWERCASE>
> --$38--> via
> org.egothor.crusher.connectors.BHTML2Path:org.egothor.data.Document<BUFFERED><PUNCTUATION><HTML><HOME><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS><SNIPPET><LOWERCASE>
> site.pdf
> Input
> java.lang.String
> Flags: <FILENAME><PDF>
> Output
> org.egothor.data.Document
> Flags: <HOME><PUNCTUATION><LOWERCASE><SNIPPET>
> Filtering system found:
> --$0--> via
> org.egothor.crusher.IniPath:java.lang.String<PDF><FILENAME>
> --$1--> via
> org.egothor.crusher.connectors.InputStreamPath:java.io.InputStream<BUFFERED><PDF><FILENAME>
> --$21--> via
> org.egothor.crusher.connectors.PDFPath:java.io.Reader<BUFFERED><PDF><FILENAME><NOHTMLTAGS>
> --$31--> via
> org.egothor.crusher.connectors.TokenizerPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><TAGGED><FILENAME><NOHTMLTAGS>
> --$36--> via
> org.egothor.crusher.connectors.PunctPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS>
> --$38--> via
> org.egothor.crusher.connectors.LowerCasePath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS><LOWERCASE>
> --$53--> via
> org.egothor.crusher.connectors.BHTML2Path:org.egothor.data.Document<BUFFERED><PDF><PUNCTUATION><HOME><TAGGED><FILENAME><NOHTMLTAGS><SNIPPET><LOWERCASE>
> Commit...
> ...optimize()
> ...commit()
> Done.
> Aug 5, 2004 4:58:55 PM org.egothor.dir.TankerImpl commit
> INFO: Saving state
>
> result of the query gives :
> ---------------------------------------
> Aug 5, 2004 4:59:31 PM org.egothor.dir.TankerImpl loadState
> INFO: Loading state
> Query: krav
> Aug 5, 2004 4:59:31 PM org.egothor.query.Executor query
> INFO: [null:<WORD>krav r,p true,false]
> Aug 5, 2004 4:59:31 PM org.egothor.dir.TankerImpl elements
> INFO: Dynamizer is dirty
> Aug 5, 2004 4:59:31 PM org.egothor.dir.TankerImpl elements
> INFO: Dynamizer is dirty
> Aug 5, 2004 4:59:32 PM TermRunner constructor
> INFO: setup
> 0
> <?xml version="1.0" encoding="UTF-8"?>
> <query><group required="no" prohibited="no" unknown="no"
> excluded="no"><term required="yes" prohibited="no"
> unknown="no" excluded="no" value="<WORD>krav"
> control="no" idf="1.001" boost="1"/></group></query>
>
> result of the Expand gives :
> ---------------------------------------
> C:/Dgpe/Egothor_barrel expand of <WORD>kr*
> Aug 5, 2004 4:59:03 PM org.egothor.dir.TankerImpl loadState
> INFO: Loading state
> Aug 5, 2004 4:59:03 PM org.egothor.dir.TankerImpl elements
> INFO: Dynamizer is dirty
> <WORD>kraft
> <WORD>krav
> <WORD>kriterier
> <WORD>kræver
> Aug 5, 2004 4:59:03 PM org.egothor.dir.TankerImpl commit
> INFO: Saving state
>
> result of Dumper gives
> ---------------------------------
> 0 [PDF/PS] : [http://winold/manual//danish.pdf]
> :CM\531576DA.doc PE 344.027 Or. EN DA DA EUROPA-PARLAMENTET
> BUDGETUDVALGET Meddelelse til medlemmerne Om: Håndbog for nye
> udvalgsmedlemmer GENERALDIREKTORATET FOR INTERNE POLITIKKER 3.
> juni 2004 PE 344.027 2/9 CM\531576DA.doc DA Indledning Europ
> 1 Struts for Transforming XML with XSL (stxx)
> [http://winold/manual//index.html] :the stxx site stxx Home
> Getting Started About Index License Download Who we are FAQ
> Changes Todo Site as PDF Getting Involved Contributing...
> 2 [PDF/PS] : [http://winold/manual//site.pdf] :stxx
> Documentation Table of contents 1.
> About....................................................................................................................................
> 1 1.1. Struts for Transforming XML with XSL
> (stxx).........................
> <!VOLATILE>depthrank 3 org.egothor.store.disc.RankFileIn
> 0 w=9 :
> 1 w=9 :
> 2 w=9 :
> <ACRONYM>e.g. 1 org.egothor.store.disc.IListFileIn
> 2 w=1 : 3220
> <APOSTROPHE>action's 1 org.egothor.store.disc.IListFileIn
> 2 w=1 : 6005
> <APOSTROPHE>apache's 1 org.egothor.store.disc.IListFileIn
> 2 w=1 : 3934
> .../...
> <WORD>korrekt 1 org.egothor.store.disc.IListFileIn
> 0 w=4 : 2168
> <WORD>kort 1 org.egothor.store.disc.IListFileIn
> 0 w=14 : 338 1634 2342
> <WORD>kraft 1 org.egothor.store.disc.IListFileIn
> 0 w=4 : 110
> <WORD>krav 1 org.egothor.store.disc.IListFileIn
> 0 w=4 : 2590
> <WORD>kriterier 1 org.egothor.store.disc.IListFileIn
> 0 w=4 : 668
> <WORD>kræver 1 org.egothor.store.disc.IListFileIn
> 0 w=4 : 401
> <WORD>kun 1 org.egothor.store.disc.IListFileIn
> 0 w=42 : 373 421 694 1115 1369 1891 1932 2193 2426
> <WORD>kunne 1 org.egothor.store.disc.IListFileIn
> ../...
>
>
> Christophe Spielmann
>
>
>
>
>
>------------------------------------------------------------------------
>
>_______________________________________________
>Egothor-tech mailing list
>Egothor-tech at egothor.org
>http://www.egothor.org/mailman/listinfo/egothor-tech
>
>
--
::egothor
http://www.egothor.org/Main/LeoGalambos