
[Egothor-tech] RE: Egothor with Pdf parsing: unable to find out a word despite it seems to be into the barrel

SPIELMANN Christophe cspielmann at europarl.eu.int
Thu Aug 5 16:43:26 BST 2004

!!!
It seems to work when indexing with the robot indexer instead of indexing locally.
This still needs to be fully tested; I will provide more information soon.
Question: why do the two behave differently?

	-----Original Message-----
	From: SPIELMANN Christophe 
	Sent: 05 August 2004 17:31
	To: 'Egothor-tech at egothor.org'
	Subject: RE: Egothor with Pdf parsing: unable to find out a word despite it seems to be into the barrel
	
	
	I forgot to specify which commands I used (NT4 with Eclipse):
	 
	org.egothor.apps.Directory C:\\Dgpe\\Egothor_barrel -lowercase -snippet C:\\Dgpe\\Egothor_from as http://winold/manual/
	 
	org.egothor.test.TankerQuery C:\\Dgpe\\Egothor_barrel krav
	 
	org.egothor.test.Dumper -DLWP C:\\Dgpe\\Egothor_barrel\\1\\
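
The three commands above can be collected in a small launcher so the index, query, and dump steps always run with the same arguments. This is only a sketch: the class name `EgothorCommands`, the helper `command`, and the classpath `egothor.jar` are hypothetical and do not come from the original message. The paths are Java string literals, so each `\\` stands for a single backslash.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EgothorCommands {
    // Builds the argument list for launching one main class in a child JVM.
    // The classpath value is an assumption; the message does not state it.
    static List<String> command(String classpath, String mainClass, String... args) {
        List<String> cmd = new ArrayList<>(Arrays.asList("java", "-cp", classpath, mainClass));
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    public static void main(String[] argv) {
        String cp = "egothor.jar"; // hypothetical: adjust to the real classpath
        String barrel = "C:\\Dgpe\\Egothor_barrel";
        List<List<String>> steps = Arrays.asList(
            command(cp, "org.egothor.apps.Directory", barrel, "-lowercase", "-snippet",
                    "C:\\Dgpe\\Egothor_from", "as", "http://winold/manual/"),
            command(cp, "org.egothor.test.TankerQuery", barrel, "krav"),
            command(cp, "org.egothor.test.Dumper", "-DLWP", barrel + "\\1\\"));
        for (List<String> step : steps) {
            // Print only; to run a step use:
            // new ProcessBuilder(step).inheritIO().start().waitFor();
            System.out.println(String.join(" ", step));
        }
    }
}
```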
	 
	Thanks

		-----Original Message-----
		From: SPIELMANN Christophe 
		Sent: 05 August 2004 17:21
		To: 'Egothor-tech at egothor.org'
		Cc: CLAUS Pascal
		Subject: Egothor with Pdf parsing: unable to find out a word despite it seems to be into the barrel
		
		
		We are facing a problem with the results of PDF parsing
		(we use Egothor 1.2.5rc6 / JDK 1.4.1_02).
		 
		Although the word "krav" is in a PDF, a basic query does not find it.
		The strange thing is that we can find it using the Dumper or the Expand command.
		Any help would be welcome.
		The logs are provided below.
		 
		We indexed one directory containing these files:
		-------------------------------------------------------
		 
		- danish.pdf (Danish PDF)
		- site.pdf (English PDF)
		- index.html (English HTML)
		 
		After indexing, the state file contains:
		------------------------------------------------
		 
		#Tanker state
		#Thu Aug 05 16:59:03 CEST 2004
		slotter.last=1
		egothor.capacity=32
		slotter.flat=false
		egothor.slot.2=1
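
The state file above has the shape of a `java.util.Properties` file (two `#` comment lines, then `key=value` pairs). Assuming that is what it is, which the message does not confirm, it can be read with the standard parser to check its values before and after an indexing run. The class name `TankerState` is hypothetical, and the sketch does not try to interpret what the keys mean.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.TreeSet;

public class TankerState {
    // Parses state-file text with the standard Properties parser.
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        // Content copied from the state file quoted in the message.
        String state = String.join("\n",
            "#Tanker state",
            "#Thu Aug 05 16:59:03 CEST 2004",
            "slotter.last=1",
            "egothor.capacity=32",
            "slotter.flat=false",
            "egothor.slot.2=1");
        Properties p = parse(state);
        // Sort the keys so the listing is stable between runs.
        for (String key : new TreeSet<>(p.stringPropertyNames())) {
            System.out.println(key + " = " + p.getProperty(key));
        }
    }
}
```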
		 
		The log of the Directory command is:
		-----------------------------------------------------
		.../..
		Aug 5, 2004 4:58:44 PM org.egothor.crusher.Finder scanPackages
		INFO: <java.io.InputStream;15;java.io.Reader>
		Switching lowercase to true
		Switching Snippet support to true
		C:\DGPE\egothor_from as http://winold/manual/
		danish.pdf
		Input
		java.lang.String
		Flags: <FILENAME><PDF>
		Output
		org.egothor.data.Document
		Flags: <HOME><PUNCTUATION><LOWERCASE><SNIPPET>
		Filtering system found:
		--$0--> via org.egothor.crusher.IniPath:java.lang.String<PDF><FILENAME> --$1--> via org.egothor.crusher.connectors.InputStreamPath:java.io.InputStream<BUFFERED><PDF><FILENAME> --$21--> via org.egothor.crusher.connectors.PDFPath:java.io.Reader<BUFFERED><PDF><FILENAME><NOHTMLTAGS> --$31--> via org.egothor.crusher.connectors.TokenizerPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><TAGGED><FILENAME><NOHTMLTAGS> --$36--> via org.egothor.crusher.connectors.PunctPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS> --$38--> via org.egothor.crusher.connectors.LowerCasePath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS><LOWERCASE> --$53--> via org.egothor.crusher.connectors.BHTML2Path:org.egothor.data.Document<BUFFERED><PDF><PUNCTUATION><HOME><TAGGED><FILENAME><NOHTMLTAGS><SNIPPET><LOWERCASE>
		log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
		log4j:WARN Please initialize the log4j system properly.
		index.html
		Input
		java.lang.String
		Flags: <FILENAME><HTML>
		Output
		org.egothor.data.Document
		Flags: <HOME><PUNCTUATION><LOWERCASE><SNIPPET>
		Filtering system found:
		--$0--> via org.egothor.crusher.IniPath:java.lang.String<HTML><FILENAME> --$1--> via org.egothor.crusher.connectors.ReaderPath:java.io.Reader<BUFFERED><HTML><FILENAME> --$6--> via org.egothor.crusher.connectors.HTML3Path:java.io.Reader<BUFFERED><HTML><SEMANTICS><FILENAME><NOHTMLTAGS> --$16--> via org.egothor.crusher.connectors.TokenizerPath:org.egothor.parser.Tokenizer<BUFFERED><HTML><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS> --$21--> via org.egothor.crusher.connectors.PunctPath:org.egothor.parser.Tokenizer<BUFFERED><PUNCTUATION><HTML><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS> --$23--> via org.egothor.crusher.connectors.LowerCasePath:org.egothor.parser.Tokenizer<BUFFERED><HTML><PUNCTUATION><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS><LOWERCASE> --$38--> via org.egothor.crusher.connectors.BHTML2Path:org.egothor.data.Document<BUFFERED><PUNCTUATION><HTML><HOME><SEMANTICS><TAGGED><FILENAME><NOHTMLTAGS><SNIPPET><LOWERCASE>
		site.pdf
		Input
		java.lang.String
		Flags: <FILENAME><PDF>
		Output
		org.egothor.data.Document
		Flags: <HOME><PUNCTUATION><LOWERCASE><SNIPPET>
		Filtering system found:
		--$0--> via org.egothor.crusher.IniPath:java.lang.String<PDF><FILENAME> --$1--> via org.egothor.crusher.connectors.InputStreamPath:java.io.InputStream<BUFFERED><PDF><FILENAME> --$21--> via org.egothor.crusher.connectors.PDFPath:java.io.Reader<BUFFERED><PDF><FILENAME><NOHTMLTAGS> --$31--> via org.egothor.crusher.connectors.TokenizerPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><TAGGED><FILENAME><NOHTMLTAGS> --$36--> via org.egothor.crusher.connectors.PunctPath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS> --$38--> via org.egothor.crusher.connectors.LowerCasePath:org.egothor.parser.Tokenizer<BUFFERED><PDF><PUNCTUATION><TAGGED><FILENAME><NOHTMLTAGS><LOWERCASE> --$53--> via org.egothor.crusher.connectors.BHTML2Path:org.egothor.data.Document<BUFFERED><PDF><PUNCTUATION><HOME><TAGGED><FILENAME><NOHTMLTAGS><SNIPPET><LOWERCASE>
		Commit...
		...optimize()
		...commit()
		Done.
		Aug 5, 2004 4:58:55 PM org.egothor.dir.TankerImpl commit
		INFO: Saving state
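
The "Filtering system found" lines in the log above are hard to compare by eye. A small sketch like the following splits such a line into its steps so the PDF and HTML chains can be lined up connector by connector. The class name `FilterChain` and the regular expression are assumptions based only on the log text shown here, not on any documented Egothor format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FilterChain {
    // One step looks like: --$21--> via <connector class>:<output type><FLAG><FLAG>...
    // This pattern is inferred from the log excerpt, not from a format specification.
    private static final Pattern STEP =
        Pattern.compile("--\\$(\\d+)--> via ([\\w.]+):([\\w.]+)((?:<\\w+>)*)");

    // Returns the connector class name of every step, in log order.
    static List<String> connectors(String logLine) {
        List<String> out = new ArrayList<>();
        Matcher m = STEP.matcher(logLine);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // First three steps of the danish.pdf chain, copied from the log.
        String excerpt =
            "--$0--> via org.egothor.crusher.IniPath:java.lang.String<PDF><FILENAME> "
            + "--$1--> via org.egothor.crusher.connectors.InputStreamPath:"
            + "java.io.InputStream<BUFFERED><PDF><FILENAME> "
            + "--$21--> via org.egothor.crusher.connectors.PDFPath:"
            + "java.io.Reader<BUFFERED><PDF><FILENAME><NOHTMLTAGS>";
        for (String connector : connectors(excerpt)) {
            System.out.println(connector);
        }
    }
}
```

Applied to the full lines, this shows that after `PDFPath` or `HTML3Path` both chains continue through the same `TokenizerPath`, `PunctPath`, `LowerCasePath`, and `BHTML2Path` connectors.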
		 
		The query returns:
		---------------------------------------
		Aug 5, 2004 4:59:31 PM org.egothor.dir.TankerImpl loadState
		INFO: Loading state
		Query: krav
		Aug 5, 2004 4:59:31 PM org.egothor.query.Executor query
		INFO: [null:<WORD>krav r,p true,false]
		Aug 5, 2004 4:59:31 PM org.egothor.dir.TankerImpl elements
		INFO: Dynamizer is dirty
		Aug 5, 2004 4:59:31 PM org.egothor.dir.TankerImpl elements
		INFO: Dynamizer is dirty
		Aug 5, 2004 4:59:32 PM TermRunner constructor
		INFO: setup
		0
		<?xml version="1.0" encoding="UTF-8"?>
		<query><group required="no" prohibited="no" unknown="no" excluded="no"><term required="yes" prohibited="no" unknown="no" excluded="no" value="&lt;WORD&gt;krav" control="no" idf="1.001" boost="1"/></group></query>
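
To see exactly which term the executor searched for, the query XML above can be parsed and its `value` attribute unescaped. The sketch below uses only the JDK's DOM parser; the class name `QueryXml` is hypothetical. It prints `<WORD>krav`, the same key that Expand and Dumper list, which suggests (without proving) that the mismatch is not in how the query term is spelled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class QueryXml {
    // Returns the unescaped "value" attribute of every <term> element.
    static List<String> termValues(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList terms = doc.getElementsByTagName("term");
            List<String> values = new ArrayList<>();
            for (int i = 0; i < terms.getLength(); i++) {
                values.add(((Element) terms.item(i)).getAttribute("value"));
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse query XML", e);
        }
    }

    public static void main(String[] args) {
        // Query XML copied from the TankerQuery output in the message.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<query><group required=\"no\" prohibited=\"no\" unknown=\"no\" excluded=\"no\">"
            + "<term required=\"yes\" prohibited=\"no\" unknown=\"no\" excluded=\"no\" "
            + "value=\"&lt;WORD&gt;krav\" control=\"no\" idf=\"1.001\" boost=\"1\"/>"
            + "</group></query>";
        for (String value : termValues(xml)) {
            System.out.println(value); // prints <WORD>krav
        }
    }
}
```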
		 
		The Expand command returns:
		---------------------------------------
		C:/Dgpe/Egothor_barrel expand of <WORD>kr*
		Aug 5, 2004 4:59:03 PM org.egothor.dir.TankerImpl loadState
		INFO: Loading state
		Aug 5, 2004 4:59:03 PM org.egothor.dir.TankerImpl elements
		INFO: Dynamizer is dirty
		<WORD>kraft
		<WORD>krav
		<WORD>kriterier
		<WORD>kræver
		Aug 5, 2004 4:59:03 PM org.egothor.dir.TankerImpl commit
		INFO: Saving state
		 
		The Dumper output is:
		---------------------------------
		0 [PDF/PS] : [http://winold/manual//danish.pdf] :CM\531576DA.doc PE 344.027 Or. EN DA DA EUROPA-PARLAMENTET BUDGETUDVALGET Meddelelse til medlemmerne Om: Håndbog for nye udvalgsmedlemmer GENERALDIREKTORATET FOR INTERNE POLITIKKER 3. juni 2004 PE 344.027 2/9 CM\531576DA.doc DA Indledning Europ
		1 Struts for Transforming XML with XSL (stxx) [http://winold/manual//index.html] :the stxx site stxx Home Getting Started About Index License Download Who we are FAQ Changes Todo Site as PDF Getting Involved Contributing...
		2 [PDF/PS] : [http://winold/manual//site.pdf] :stxx Documentation Table of contents 1. About.................................................................................................................................... 1 1.1. Struts for Transforming XML with XSL (stxx).........................
		<!VOLATILE>depthrank 3 org.egothor.store.disc.RankFileIn
		0 w=9 : 
		1 w=9 : 
		2 w=9 : 
		<ACRONYM>e.g. 1 org.egothor.store.disc.IListFileIn
		2 w=1 : 3220
		<APOSTROPHE>action's 1 org.egothor.store.disc.IListFileIn
		2 w=1 : 6005
		<APOSTROPHE>apache's 1 org.egothor.store.disc.IListFileIn
		2 w=1 : 3934
		.../...
		<WORD>korrekt 1 org.egothor.store.disc.IListFileIn
		0 w=4 : 2168
		<WORD>kort 1 org.egothor.store.disc.IListFileIn
		0 w=14 : 338 1634 2342
		<WORD>kraft 1 org.egothor.store.disc.IListFileIn
		0 w=4 : 110
		<WORD>krav 1 org.egothor.store.disc.IListFileIn
		0 w=4 : 2590
		<WORD>kriterier 1 org.egothor.store.disc.IListFileIn
		0 w=4 : 668
		<WORD>kræver 1 org.egothor.store.disc.IListFileIn
		0 w=4 : 401
		<WORD>kun 1 org.egothor.store.disc.IListFileIn
		0 w=42 : 373 421 694 1115 1369 1891 1932 2193 2426
		<WORD>kunne 1 org.egothor.store.disc.IListFileIn
		../...
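
When the dump is long, the entry for one term can be pulled out mechanically. The sketch below assumes, from the excerpt alone, that each term has a header line starting with `<` and is followed by lines of the form `N w=M : ...`; reading `N` as a document number, `M` as a weight, and the rest as positions is a guess that the output does not confirm. The class name `DumpLookup` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DumpLookup {
    // Returns the "N w=M : ..." lines that follow the header line of `term`.
    // The line layout is inferred from the dump excerpt, not from documentation.
    static List<String> postings(List<String> dumpLines, String term) {
        List<String> out = new ArrayList<>();
        boolean inTerm = false;
        for (String raw : dumpLines) {
            String line = raw.trim();
            if (line.startsWith("<")) {
                // Header of some term; the trailing space avoids prefix matches.
                inTerm = line.startsWith(term + " ");
            } else if (inTerm && line.matches("\\d+ w=\\d+ :.*")) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Lines copied from the Dumper output in the message.
        List<String> dump = Arrays.asList(
            "<WORD>kraft 1 org.egothor.store.disc.IListFileIn",
            "0 w=4 : 110",
            "<WORD>krav 1 org.egothor.store.disc.IListFileIn",
            "0 w=4 : 2590",
            "<WORD>kriterier 1 org.egothor.store.disc.IListFileIn",
            "0 w=4 : 668");
        for (String line : postings(dump, "<WORD>krav")) {
            System.out.println(line); // prints 0 w=4 : 2590
        }
    }
}
```

Under that reading, `<WORD>krav` is recorded once for entry 0, which the listing at the top of the dump shows to be danish.pdf.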
		 
		 
		Christophe Spielmann
		 
		 
		 
		 
