[Egothor-tech] Questions about Egothor

Marvin Humphrey marvin at rectangular.com
Thu Mar 10 21:46:36 GMT 2005


On Mar 10, 2005, at 11:50 AM, Leo Galambos wrote:

> Marvin Humphrey wrote:
>
>>> The major difference is that Lucene uses a classic algorithm for 
>>> index actualization while Egothor uses something more sophisticated. 
>>> That is why Egothor may operate on huge document collections more 
>>> effectively.
>>
>>
>> Could you explain some of the differences?
>
>
> Egothor counts which parts of an index are too slow and optimizes them 
> preferably, so that queries run quickly and smoothly. Lucene does not 
> know how much the parts are optimized, so that it always optimizes the 
> full index. Obviously, Lucene's strategy costs more time and I/O 
> operations. Moreover, Egothor manages less parts of an index and can 
> query them concurrently, Lucene queries the parts sequentially.
>
> So the summary is: If you had a dynamic collection (=not static) and 
> you wanted to solve queries fast, Lucene would have to optimize the 
> index more frequently - and it would cost time and I/O operations.

Intriguing.  Lucene indexes are made up of segments, with each segment 
a self-contained index.  The number of docs in each segment depends on 
the mergeFactor -- with a mergeFactor of 10, if you index 2999 
documents you end up with:

2 segments with 1000 documents each
9 segments with 100 docs each
9 segments with 10 docs each
1 segment with 9 docs

If you don't optimize those into a single index, then a search query 
has to chug through every one of them, which is obviously less 
efficient that chugging through a consolidated, optimized version.

I suppose this is the "traditional" index model.  Indeed, I don't see 
how you could optimize only popular terms, and the optimization process 
involves a lot of sorting and interleaving. But after perusing the 
Egothor docs, I have yet to grok how it does things differently.  Is 
the extra overhead you speak of in Lucene's optimization process the 
copying of the stored fields?  Or do you have another trick having to 
do with the scoring data?

The problem is that for each term in each new document you add to an 
existing index, you have to modify the term frequency (tf_d) and 
document frequency (idf_t) data.

 From the documentation on this page...

http://www.egothor.org/book/bk01ch02s05.html

...it looks like you break out the scoring data for certain (popular?) 
terms into .sep files.  But you'd still have to modify trm.idx with 
each new document for the terms that aren't in .sep files, because 
otherwise searches on the (unpopular?) terms not in .sep files would 
fail to return the incrementally added docs.

> Other differencies are not important for you unless you develop a 
> distributed search engine or your collection is pretty huge, i.e. the 
> whole web, one half of the web,... or so :).

FYI, I'm working on a Perl search engine library right now, and I'm 
studying various ways of addressing the incremental indexing problem.  
However, my search engine is not intended to scale up infinitely, nor 
will it support distributed searching/indexing in the p2p sense.

PS: All the links at the bottom of this page seem to be broken:

http://www.egothor.org/api/

Best,

-- Marvin Humphrey



More information about the Egothor-tech mailing list