[Egothor-tech] Questions about Egothor

Leo Galambos Leo.Galambos at egothor.org
Fri Mar 11 12:35:14 GMT 2005


Marvin Humphrey wrote:

> Intriguing.  Lucene indexes are made up of segments, with each segment 
> a self-contained index.  The number of docs in each segment depends on 
> the mergeFactor -- with a mergeFactor of 10, if you index 2999 
> documents you end up with:
>
> 2 segments with 1000 documents each
> 9 segments with 100 docs each
> 9 segments with 10 docs each
> 1 segment with 9 docs


I will not describe my algorithm here (I'm sorry, see source code, 
please), but egothor would not generate more than 9 barrels (depends on 
configuration) and as you noted -- the number of segments/barrels can 
play some role in the overall performance.

However, it is not the main point. The problem is elsewhere I guess - 
you have an index and your collection is changing (it implies removals). 
Using the classic algorithm you have two options:
a) you do not optimize the index, so your index degenerates after many 
updates (=delete+insert), because a obsolete document is only denoted as 
"deleted" during the delete operation and still lives in the index. This 
slows down querying process.
b) the optimize routine is called, which implies that all "removed" 
documents are purged and you get just one segment/barrel. Queries run 
fast, but you must build the final segment and it costs extra I/O calls.

Let's say you have to update 1700 obsolete documents in the collection - 
it implies - you will denote them as "deleted" and insert 1700 new docs 
into the index.

Case a: I will assume that all the obsolete documents are in the two 
largest segments (it is possible because they store 66% of all docs) and 
this will simplify the calculation. I guess you will get 2 segments (a 
1000docs), 6 segs (a 100docs), 9 segs (a 10 docs) and 1 seg (a 1 doc), 
moreover, 2 segments (a 1000docs) with 85% out-dated values which slow 
down query processing.

Case b: one final segment is built and it implies about 2x more I/O ops.

And now Egothor: it would build an index without outdated values, using 
9 barrels. Because Egothor is able to run a query on these barrels 
concurrently, and because the number of barrels is kept small by my 
algorithm, it may be even faster than "case b" (time is given by the 
size of the largest barrel which is smaller than the compact index of 
"case b"). In this concrete case, querying with Egothor will be 2x 
faster than "case b", and the update operation will need much less I/Os 
than "case b", but more than "case a". This could be very interesting 
for a search engine processing www/web or any dynamic collection of 
documents.

I do not want to write elaborate papers here, so the text above may 
contain some bugs which could be clarified by 100x longer text...on the 
other hand, I hope the main difference between Egothor and Lucene was 
described correctly.

Cheers,
Leo

-- 
Leo Galambos
Faculty of Mathematics and Physics, DSE
Malostranske namesti 25
Prague 1
CZE

http://kocour.ms.mff.cuni.cz/~galambos/




More information about the Egothor-tech mailing list