Full-text Search Engine and Library which are entirely written in JAVA
:: egothor


[Egothor-tech] RE: Egothor with Pdf parsing: unable to find out a word despite it seems to be into the barrel

SPIELMANN Christophe cspielmann at europarl.eu.int
Fri Aug 6 08:40:01 BST 2004

Hi Leo,

Thanks for the fast reply (you should not work so late :-))

Here is what we got from TankerQuery:

Aug 6, 2004 9:35:21 AM ~dir.TankerImpl loadState INFO: Loading state C:\Dgpe\Egothor_barrel\state
Aug 6, 2004 9:35:21 AM ~query.Executor query INFO: [null:<WORD>krav r,p true,false]
Aug 6, 2004 9:35:21 AM ~dir.TankerImpl elements INFO: Dynamizer is dirty C:\\Dgpe\\Egothor_barrel\
Aug 6, 2004 9:35:21 AM ThinkBarrel constructor FINER: ENTRY C:\\Dgpe\\Egothor_barrel\\2\ C:\\Dgpe\\Egothor_barrel\\2\
Aug 6, 2004 9:35:21 AM DiscIndexData setLocation FINER: ENTRY C:\\Dgpe\\Egothor_barrel\\2\ C:\\Dgpe\\Egothor_barrel\\2\
Aug 6, 2004 9:35:21 AM ThinkBarrel constructor FINER: ENTRY C:\\Dgpe\\Egothor_barrel\\3\ C:\\Dgpe\\Egothor_barrel\\3\
Aug 6, 2004 9:35:21 AM DiscIndexData setLocation FINER: ENTRY C:\\Dgpe\\Egothor_barrel\\3\ C:\\Dgpe\\Egothor_barrel\\3\
Aug 6, 2004 9:35:21 AM ~store.disc.IListMeta factor FINEST: <WORD>krav flen=0.0 fcap=0.0
Aug 6, 2004 9:35:21 AM DiscKeyIndexData indexOf FINER: ENTRY <WORD>krav <WORD>krav
Aug 6, 2004 9:35:21 AM ~store.disc.IListMeta factor FINEST: <WORD>krav flen=0.0 fcap=0.0
Aug 6, 2004 9:35:21 AM DiscKeyIndexData indexOf FINER: ENTRY <WORD>krav <WORD>krav
Aug 6, 2004 9:35:21 AM ~dir.TankerImpl elements INFO: Dynamizer is dirty C:\\Dgpe\\Egothor_barrel\
Aug 6, 2004 9:35:21 AM ThickBarrel openIList FINER: ENTRY <WORD>krav <WORD>krav
Aug 6, 2004 9:35:21 AM ~store.disc.IListMeta factor FINEST: <WORD>krav flen=0.0 fcap=0.0
Aug 6, 2004 9:35:21 AM ~db.disc.DiscKeyIndexDataCache elementAt FINE: DiscKey cache hit
Aug 6, 2004 9:35:21 AM ~store.disc.IListMeta constructIn FINER: <WORD>krav GIStream (0.6666667%) 2
Aug 6, 2004 9:35:21 AM ~store.disc.ThickBarrel openIList FINEST: clear list
Aug 6, 2004 9:35:21 AM TermRunner constructor INFO: setup <WORD>krav idf=0.6141471927654584 myidf=0.6141471927654584 boost=1 field=null req=true proh=false
Aug 6, 2004 9:35:21 AM ~store.disc.IListFileIn close FINEST: close
Aug 6, 2004 9:35:21 AM ThickBarrel openIList FINER: ENTRY <WORD>krav <WORD>krav
Aug 6, 2004 9:35:21 AM ~store.disc.IListMeta factor FINEST: <WORD>krav flen=0.0 fcap=0.0
Aug 6, 2004 9:35:21 AM ~db.disc.DiscKeyIndexDataCache elementAt FINE: DiscKey cache hit
Aug 6, 2004 9:35:21 AM ~store.disc.IListMeta constructIn FINER: <WORD>krav GIStream (0.33333334%) 2
Aug 6, 2004 9:35:21 AM ~store.disc.ThickBarrel openIList FINEST: clear list
Aug 6, 2004 9:35:21 AM TermRunner constructor INFO: setup <WORD>krav idf=0.6141471927654584 myidf=0.6141471927654584 boost=1 field=null req=true proh=false
Aug 6, 2004 9:35:21 AM ~store.disc.IListFileIn close FINEST: close
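
To compare the two TermRunner "setup" lines above field by field, a line can be split into its key=value parts. This is only a rough sketch: the class name SetupLineSketch is made up, the field layout is taken from the log text alone and not from the Egothor sources, and it uses newer Java than the JDK 1.4 mentioned in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Egothor.
final class SetupLineSketch {

    /** Splits the tokens that follow "setup " into term + key=value fields. */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        int at = line.indexOf("setup ");
        if (at < 0) {
            return fields; // not a TermRunner setup line
        }
        String[] tokens = line.substring(at + "setup ".length()).trim().split("\\s+");
        if (tokens.length == 0 || tokens[0].isEmpty()) {
            return fields; // nothing after "setup"
        }
        fields.put("term", tokens[0]); // first token is the query term
        for (int i = 1; i < tokens.length; i++) {
            int eq = tokens[i].indexOf('=');
            if (eq > 0) {
                fields.put(tokens[i].substring(0, eq), tokens[i].substring(eq + 1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "Aug 6, 2004 9:35:21 AM TermRunner constructor INFO: setup <WORD>krav "
                + "idf=0.6141471927654584 myidf=0.6141471927654584 boost=1 "
                + "field=null req=true proh=false";
        System.out.println(parse(line));
    }
}
```

Running parse over both setup lines and comparing the maps would show whether the two barrels report the same values for the term.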


I will now do what you asked for in your second mail.
I will keep you posted.
Bye


-----Original Message-----
From: egothor-tech-bounces at egothor.org [mailto:egothor-tech-bounces at egothor.org] On Behalf Of Leo Galambos
Sent: 05 August 2004 20:31
To: Egothor list
Subject: Re: [Egothor-tech] RE: Egothor with Pdf parsing: unable to find out a word despite it seems to be into the barrel


Hi,

I think that you did everything right, it must be a bug. Could you run 
TankerQuery with logging on, please?

java -Djava.util.logging.config.file=log.prop org.egothor.test.TankerQuery c:\Dgpe\Egothor_barrel krav

where "log.prop" is a plain text file:

handlers=java.util.logging.FileHandler,java.util.logging.ConsoleHandler
.level=FINEST
java.util.logging.FileHandler.formatter = org.egothor.util.EgothorFormatter
java.util.logging.FileHandler.encoding = ISO-8859-1
java.util.logging.FileHandler.limit = 1000000
java.util.logging.FileHandler.count = 50
java.util.logging.FileHandler.pattern = egolog-%g
java.util.logging.FileHandler.append = false
java.util.logging.ConsoleHandler.level = SEVERE
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter

TankerQuery will generate egolog-* files which would show us what's broken.
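
As an aside for anyone reproducing this: a java.util.logging configuration can also be loaded from code instead of through the -D switch. The following is a minimal sketch using only the standard java.util.logging API. The class name LogConfigSketch and the logger name are made up for illustration, and the stock SimpleFormatter with a console handler stands in for org.egothor.util.EgothorFormatter and the file handler, so that the snippet runs without the Egothor jar. It uses newer Java than the JDK 1.4 in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

// Hypothetical helper, not part of Egothor.
final class LogConfigSketch {

    // Trimmed-down stand-in for log.prop (assumption: console only).
    static final String CONFIG =
            "handlers=java.util.logging.ConsoleHandler\n"
          + ".level=FINEST\n"
          + "java.util.logging.ConsoleHandler.level=FINEST\n"
          + "java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter\n";

    /** Replaces the current logging configuration with the given properties text. */
    static void configure(String props) {
        try {
            // Properties streams are read as ISO-8859-1.
            LogManager.getLogManager().readConfiguration(
                    new ByteArrayInputStream(props.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the named logger would emit FINEST records. */
    static boolean finestEnabled(String loggerName) {
        return Logger.getLogger(loggerName).isLoggable(Level.FINEST);
    }

    public static void main(String[] args) {
        configure(CONFIG);
        Logger log = Logger.getLogger("org.egothor.sketch");
        log.finest("FINEST enabled: " + finestEnabled("org.egothor.sketch"));
    }
}
```

If FINEST records still do not show up, checking isLoggable as above is a quick way to tell a configuration problem from a missing log call.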

Thank you.

Leo

SPIELMANN Christophe wrote:

> !!!
> It seems to work with the robot indexer instead of the local indexer.
> Still to be fully tested; I will provide more information soon.
> Question: why is that?
>
>     -----Original Message-----
>     *From:* SPIELMANN Christophe
>     *Sent:* 05 August 2004 17:31
>     *To:* 'Egothor-tech at egothor.org'
>     *Subject:* RE: Egothor with Pdf parsing: unable to find out a word
>     despite it seems to be into the barrel
>
>     I forgot to specify which commands I used (NT4 with Eclipse):
>      
>     org.egothor.apps.Directory C:\\Dgpe\\Egothor_barrel -lowercase
>     -snippet C:\\Dgpe\\Egothor_from as http://winold/manual/
>      
>     org.egothor.test.TankerQuery C:\\Dgpe\\Egothor_barrel krav
>      
>     org.egothor.test.Dumper -DLWP C:\\Dgpe\\Egothor_barrel\\1\\
>      
>     txs
>
>         -----Original Message-----
>         *From:* SPIELMANN Christophe
>         *Sent:* 05 August 2004 17:21
>         *To:* 'Egothor-tech at egothor.org'
>         *Cc:* CLAUS Pascal
>         *Subject:* Egothor with Pdf parsing: unable to find out a word
>         despite it seems to be into the barrel
>
>         We are facing a problem with the result of PDF parsing.
>         Our setup: Egothor 1.2.5rc6 / JDK 1.4.1_02.
>
>         Although a word ("krav") is in a PDF, we are not able to
>         fetch it with a basic query.
>         The strange thing is that we are able to find it using the
>         Dumper or the Expand command.
>         Any help would be welcome.
>         We provide the logs below:
>          
>         When parsing one directory with files:
>         -------------------------------------------------------
>          
>         - danish.pdf ( danish pdf )
>         - site.pdf (english pdf )
>         - index.html ( english html )
>          
>         we got after parsing (state file )
>         ------------------------------------------------
>          
>         #Tanker state
>         #Thu Aug 05 16:59:03 CEST 2004
>         slotter.last=1
>         egothor.capacity=32
>         slotter.flat=false
>         egothor.slot.2=1
>          
>         the log of the Directory command is:
>         -----------------------------------------------------
>         .../..
>         Aug 5, 2004 4:58:44 PM org.egothor.crusher.Finder scanPackages
>         INFO: <java.io.InputStream;15;java.io.Reader>
>         Switching lowercase to true
>         Switching Snippet support to true
>         C:\DGPE\egothor_from as http://winold/manual/
>         danish.pdf
>         Input
>         java.lang.String
>         Flags: <FILENAME><PDF>
>         Output
>         org.egothor.data.Document
>         Flags: <HOME><PUNCTUATION><LOWERCASE><SNIPPET>
>         Filtering system found:
>         --$0--> via
>         org.egothor.crusher.IniPath:java.lang.String<PDF><FILENAME>
>         --$1--> via
>         org.egothor.crusher.connectors.InputStreamPath:java.io.InputStream<BUFFERED><PDF><FILENAME>
>         --$21--> via
>         org.egothor.crusher.connectors.PDFPath:java.io.Reader<BUFFERED><PDF><FILENAME><NOHTMLTAGS>
>         --$31--> via
>         org.egothor.crusher.connectors.TokenizerPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><TAGGED><FILENAME><NOHTMLTAGS>
>         --$36--> via
>         org.egothor.crusher.connectors.PunctPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS>
>         --$38--> via
>         org.egothor.crusher.connectors.LowerCasePath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS><LOWERCASE>
>         --$53--> via
>         org.egothor.crusher.connectors.BHTML2Path:org.egothor.data.Document<BUFFERED><PDF><PUNCTUATION><HOME><TAGGED><FILENAME><NOHTMLTAGS><SNIPPET><LOWERCASE>
>         log4j:WARN No appenders could be found for logger
>         (org.pdfbox.pdfparser.PDFParser).
>         log4j:WARN Please initialize the log4j system properly.
>         index.html
>         Input
>         java.lang.String
>         Flags: <FILENAME><HTML>
>         Output
>         org.egothor.data.Document
>         Flags: <HOME><PUNCTUATION><LOWERCASE><SNIPPET>
>         Filtering system found:
>         --$0--> via
>         org.egothor.crusher.IniPath:java.lang.String<HTML><FILENAME>
>         --$1--> via
>         org.egothor.crusher.connectors.ReaderPath:java.io.Reader<BUFFERED><HTML><FILENAME>
>         --$6--> via
>         org.egothor.crusher.connectors.HTML3Path:java.io.Reader<BUFFERED><HTML><SEMANTICS><FILENAME><NOHTMLTAGS>
>         --$16--> via
>         org.egothor.crusher.connectors.TokenizerPath:org.egothor.parser.Tokenizer<BUFFERED><HTML><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS>
>         --$21--> via
>         org.egothor.crusher.connectors.PunctPath:org.egothor.parser.Tokenizer<BUFFERED><PUNCTUATION><HTML><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS>
>         --$23--> via
>         org.egothor.crusher.connectors.LowerCasePath:org.egothor.parser.Tokenizer<BUFFERED><HTML><PUNCTUATION><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS><LOWERCASE>
>         --$38--> via
>         org.egothor.crusher.connectors.BHTML2Path:org.egothor.data.Document<BUFFERED><PUNCTUATION><HTML><HOME><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS><SNIPPET><LOWERCASE>
>         site.pdf
>         Input
>         java.lang.String
>         Flags: <FILENAME><PDF>
>         Output
>         org.egothor.data.Document
>         Flags: <HOME><PUNCTUATION><LOWERCASE><SNIPPET>
>         Filtering system found:
>         --$0--> via
>         org.egothor.crusher.IniPath:java.lang.String<PDF><FILENAME>
>         --$1--> via
>         org.egothor.crusher.connectors.InputStreamPath:java.io.InputStream<BUFFERED><PDF><FILENAME>
>         --$21--> via
>         org.egothor.crusher.connectors.PDFPath:java.io.Reader<BUFFERED><PDF><FILENAME><NOHTMLTAGS>
>         --$31--> via
>         org.egothor.crusher.connectors.TokenizerPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><TAGGED><FILENAME><NOHTMLTAGS>
>         --$36--> via
>         org.egothor.crusher.connectors.PunctPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS>
>         --$38--> via
>         org.egothor.crusher.connectors.LowerCasePath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS><LOWERCASE>
>         --$53--> via
>         org.egothor.crusher.connectors.BHTML2Path:org.egothor.data.Document<BUFFERED><PDF><PUNCTUATION><HOME><TAGGED><FILENAME><NOHTMLTAGS><SNIPPET><LOWERCASE>
>         Commit...
>         ...optimize()
>         ...commit()
>         Done.
>         Aug 5, 2004 4:58:55 PM org.egothor.dir.TankerImpl commit
>         INFO: Saving state
>          
>         result of the query gives :
>         ---------------------------------------
>         Aug 5, 2004 4:59:31 PM org.egothor.dir.TankerImpl loadState
>         INFO: Loading state
>         Query: krav
>         Aug 5, 2004 4:59:31 PM org.egothor.query.Executor query
>         INFO: [null:<WORD>krav r,p true,false]
>         Aug 5, 2004 4:59:31 PM org.egothor.dir.TankerImpl elements
>         INFO: Dynamizer is dirty
>         Aug 5, 2004 4:59:31 PM org.egothor.dir.TankerImpl elements
>         INFO: Dynamizer is dirty
>         Aug 5, 2004 4:59:32 PM TermRunner constructor
>         INFO: setup
>         0
>         <?xml version="1.0" encoding="UTF-8"?>
>         <query><group required="no" prohibited="no" unknown="no"
>         excluded="no"><term required="yes" prohibited="no"
>         unknown="no" excluded="no" value="&lt;WORD&gt;krav"
>         control="no" idf="1.001" boost="1"/></group></query>
>          
>         result of the Expand gives :
>         ---------------------------------------
>         C:/Dgpe/Egothor_barrel expand of <WORD>kr*
>         Aug 5, 2004 4:59:03 PM org.egothor.dir.TankerImpl loadState
>         INFO: Loading state
>         Aug 5, 2004 4:59:03 PM org.egothor.dir.TankerImpl elements
>         INFO: Dynamizer is dirty
>         <WORD>kraft
>         <WORD>krav
>         <WORD>kriterier
>         <WORD>kræver
>         Aug 5, 2004 4:59:03 PM org.egothor.dir.TankerImpl commit
>         INFO: Saving state
>          
>         result of Dumper gives
>         ---------------------------------
>         0 [PDF/PS] : [http://winold/manual//danish.pdf]
>         :CM\531576DA.doc PE 344.027 Or. EN DA DA EUROPA-PARLAMENTET
>         BUDGETUDVALGET Meddelelse til medlemmerne Om: Håndbog for nye
>         udvalgsmedlemmer GENERALDIREKTORATET FOR INTERNE POLITIKKER 3.
>         juni 2004 PE 344.027 2/9 CM\531576DA.doc DA Indledning Europ
>         1 Struts for Transforming XML with XSL (stxx)
>         [http://winold/manual//index.html] :the stxx site stxx Home
>         Getting Started About Index License Download Who we are FAQ
>         Changes Todo Site as PDF Getting Involved Contributing...
>         2 [PDF/PS] : [http://winold/manual//site.pdf] :stxx
>         Documentation Table of contents 1.
>         About....................................................................................................................................
>         1 1.1. Struts for Transforming XML with XSL
>         (stxx).........................
>         <!VOLATILE>depthrank 3 org.egothor.store.disc.RankFileIn
>         0 w=9 :
>         1 w=9 :
>         2 w=9 :
>         <ACRONYM>e.g. 1 org.egothor.store.disc.IListFileIn
>         2 w=1 : 3220
>         <APOSTROPHE>action's 1 org.egothor.store.disc.IListFileIn
>         2 w=1 : 6005
>         <APOSTROPHE>apache's 1 org.egothor.store.disc.IListFileIn
>         2 w=1 : 3934
>         .../...
>         <WORD>korrekt 1 org.egothor.store.disc.IListFileIn
>         0 w=4 : 2168
>         <WORD>kort 1 org.egothor.store.disc.IListFileIn
>         0 w=14 : 338 1634 2342
>         <WORD>kraft 1 org.egothor.store.disc.IListFileIn
>         0 w=4 : 110
>         <WORD>krav 1 org.egothor.store.disc.IListFileIn
>         0 w=4 : 2590
>         <WORD>kriterier 1 org.egothor.store.disc.IListFileIn
>         0 w=4 : 668
>         <WORD>kræver 1 org.egothor.store.disc.IListFileIn
>         0 w=4 : 401
>         <WORD>kun 1 org.egothor.store.disc.IListFileIn
>         0 w=42 : 373 421 694 1115 1369 1891 1932 2193 2426
>         <WORD>kunne 1 org.egothor.store.disc.IListFileIn
>         ../...
>          
>          
>         Christophe Spielmann
>          
>          
>          
>          
>
>-----------------------------------------------------------------------
>-
>
>_______________________________________________
>Egothor-tech mailing list
>Egothor-tech at egothor.org 
>http://www.egothor.org/mailman/listinfo/egothor-tech
>  
>


-- 
::egothor
http://www.egothor.org/Main/LeoGalambos



© 2004 Egothor Developers