[Egothor-tech] Questions about Egothor
Marvin Humphrey
marvin at rectangular.com
Thu Mar 10 21:46:36 GMT 2005
On Mar 10, 2005, at 11:50 AM, Leo Galambos wrote:
> Marvin Humphrey wrote:
>
>>> The major difference is that Lucene uses a classic algorithm for
>>> index actualization while Egothor uses something more sophisticated.
>>> That is why Egothor may operate on huge document collections more
>>> effectively.
>>
>>
>> Could you explain some of the differences?
>
>
> Egothor counts which parts of an index are too slow and optimizes them
> preferably, so that queries run quickly and smoothly. Lucene does not
> know how much the parts are optimized, so that it always optimizes the
> full index. Obviously, Lucene's strategy costs more time and I/O
> operations. Moreover, Egothor manages less parts of an index and can
> query them concurrently, Lucene queries the parts sequentially.
>
> So the summary is: If you had a dynamic collection (=not static) and
> you wanted to solve queries fast, Lucene would have to optimize the
> index more frequently - and it would cost time and I/O operations.
Intriguing. Lucene indexes are made up of segments, with each segment
a self-contained index. The number of docs in each segment depends on
the mergeFactor -- with a mergeFactor of 10, if you index 2999
documents you end up with:
2 segments with 1000 documents each
9 segments with 100 docs each
9 segments with 10 docs each
1 segment with 9 docs
If you don't optimize those into a single index, then a search query
has to chug through every one of them, which is obviously less
efficient that chugging through a consolidated, optimized version.
I suppose this is the "traditional" index model. Indeed, I don't see
how you could optimize only popular terms, and the optimization process
involves a lot of sorting and interleaving. But after perusing the
Egothor docs, I have yet to grok how it does things differently. Is
the extra overhead you speak of in Lucene's optimization process the
copying of the stored fields? Or do you have another trick having to
do with the scoring data?
The problem is that for each term in each new document you add to an
existing index, you have to modify the term frequency (tf_d) and
document frequency (idf_t) data.
From the documentation on this page...
http://www.egothor.org/book/bk01ch02s05.html
...it looks like you break out the scoring data for certain (popular?)
terms into .sep files. But you'd still have to modify trm.idx with
each new document for the terms that aren't in .sep files, because
otherwise searches on the (unpopular?) terms not in .sep files would
fail to return the incrementally added docs.
> Other differencies are not important for you unless you develop a
> distributed search engine or your collection is pretty huge, i.e. the
> whole web, one half of the web,... or so :).
FYI, I'm working on a Perl search engine library right now, and I'm
studying various ways of addressing the incremental indexing problem.
However, my search engine is not intended to scale up infinitely, nor
will it support distributed searching/indexing in the p2p sense.
PS: All the links at the bottom of this page seem to be broken:
http://www.egothor.org/api/
Best,
-- Marvin Humphrey
More information about the Egothor-tech
mailing list