[Egothor-tech] Duplicate default pages

Leo Galambos Leo.Galambos at egothor.org
Tue Apr 19 11:25:20 BST 2005


Stuart David Lewis [sdl] wrote:

>>http://www.badboyz.example.com:80/stupidpage.html = 0 url 
>>http://www.goodboys.example.com:80/greatresource.html + 5
>>    
>>
>
>Sorry to be stupid, but how does this affect the output? Are entries
>with higher scores ranked above those with lower scores? This might
>solve the problem where there are many results, but what about if there
>are only two results? (/ and /index.xyz)
>  
>

Hi!

Yes, hits with higher scores are listed first (unless you change the
scoring formula). The second issue (/ and /index.xyz on the same result
page) cannot be solved with this approach, however. Both URLs return the
same content, so you would have to filter out duplicate hits that share
an identical MD5. The value is available via getMD5: if the hit is
"org.egothor.data.Hit hit", then use
((org.egothor.indexer.html2.HTMLMetadata)hit.getMeta()).getMD5().

See:
http://www.egothor.org/api/kernel/org/egothor/data/Hit.html#getMeta()
http://www.egothor.org/api/kernel/org/egothor/indexer/html2/HTMLMetadata.html#getMD5()
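
The MD5 filtering described above could look roughly like the sketch
below. It does not use the Egothor classes: SimpleHit and dropDuplicates
are hypothetical names, and the digest is assumed to be comparable as a
String (the actual return type of getMD5() should be checked in the API
docs). The input list is assumed to be already ordered by score, so the
first hit seen for each digest is the one that is kept.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Md5Dedup {
    /** Hypothetical stand-in for a hit; in Egothor the digest would come
     *  from HTMLMetadata.getMD5() (assumed here to be usable as a String). */
    static final class SimpleHit {
        final String url;
        final String md5;

        SimpleHit(String url, String md5) {
            this.url = url;
            this.md5 = md5;
        }
    }

    /** Keeps the first hit for each distinct MD5, preserving the ranking order. */
    static List<SimpleHit> dropDuplicates(List<SimpleHit> ranked) {
        Set<String> seen = new HashSet<>();
        List<SimpleHit> unique = new ArrayList<>();
        for (SimpleHit hit : ranked) {
            // Set.add returns false when the digest was already present.
            if (seen.add(hit.md5)) {
                unique.add(hit);
            }
        }
        return unique;
    }
}
```

With / and /index.xyz carrying the same digest, only whichever of the two
is ranked higher would remain on the result page.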


>How easy would it be to write a script to iterate the index file(s) and
>remove entries ending in one of a pre-defined set of default pages (e.g.
>index.* or default.*) if there is a matching entry with just the
>trailing slash? Just a thought.
>  
>

Good point! It could be very easy. Unfortunately, I do not want to play
with the old sources (I do not even have enough time to work on 2.x), but
all you need to do is run the same code as
org.egothor.robot.Michelangelo.Updater::updateShaker -- about 9-12 lines
of Java code. A BarrelShaker is one of the tanker's barrels (available via
http://www.egothor.org/api/kernel/org/egothor/dir/Tanker.html#elements()),
and you run:

        void updateShaker(BarrelShaker b) {
            // Walk over the metadata of every document stored in this barrel.
            IMetaReader imr = b.openDocMeta();
            while (imr.hasMoreElements()) {
                Object o = imr.nextElement();
                long uid = imr.getUid();
                if (o instanceof org.egothor.indexer.html2.HTMLMetadata) {
                    org.egothor.indexer.html2.HTMLMetadata mt =
                            (org.egothor.indexer.html2.HTMLMetadata) o;
                    if (/*test on the metadata is here*/) {
                        // uid is a long, so use Long.toString.
                        logger.log(Level.FINE, "deleted {0}", Long.toString(uid));
                        b.removeDoc(uid);
                    }
                }
            }
            imr.close();
        }
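
For the "test on the metadata" placeholder, one possible predicate for
the default-page case is sketched below. It works on plain URL strings
only; how the URL is read from HTMLMetadata is not shown here and would
have to be looked up in the API docs. DefaultPages, directoryOf and
isRedundantDefaultPage are hypothetical names, and the set of indexed
URLs is assumed to have been collected in an earlier pass over the
barrels. Query strings are not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class DefaultPages {
    /** Base names treated as default pages (index.*, default.*); an assumption. */
    static final Set<String> DEFAULT_BASENAMES =
            new HashSet<>(Arrays.asList("index", "default"));

    /** Returns the directory URL (ending in '/') if url names a default
     *  page such as .../index.html, otherwise null. */
    static String directoryOf(String url) {
        int slash = url.lastIndexOf('/');
        if (slash < 0 || slash == url.length() - 1) {
            return null; // no path component, or already a directory URL
        }
        String file = url.substring(slash + 1);
        int dot = file.indexOf('.');
        if (dot <= 0) {
            return null; // no extension, so not index.* or default.*
        }
        String base = file.substring(0, dot).toLowerCase(Locale.ROOT);
        return DEFAULT_BASENAMES.contains(base) ? url.substring(0, slash + 1) : null;
    }

    /** True if url is a default page and its directory URL is indexed too. */
    static boolean isRedundantDefaultPage(String url, Set<String> indexedUrls) {
        String dir = directoryOf(url);
        return dir != null && indexedUrls.contains(dir);
    }
}
```

A default page is only reported as redundant when the matching
trailing-slash entry is present, so a page that is reachable solely as
/index.xyz would stay in the index.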


Cheers,
Leo

-- 
Leo Galambos
Faculty of Mathematics and Physics, DSE
Malostranske namesti 25
Prague 1
CZE

http://kocour.ms.mff.cuni.cz/~galambos/



