[Egothor-tech] keeping index uptodate

HM hm at hmLyons.com
Fri Nov 19 20:48:12 GMT 2004


Oh, okay, I see. Thanks for the tips, Leo.

-HM

> Date: Fri, 19 Nov 2004 11:25:03 +0100
> From: Leo Galambos <Leo.Galambos at egothor.org>
> Subject: Re: [Egothor-tech] keeping index uptodate
> To: Egothor list <egothor-tech at egothor.org>
> Message-ID: <419DC9FF.2050706 at egothor.org>
> Content-Type: text/plain; charset=us-ascii; format=flowed
>
> Hi,
>
> Capek tries to analyze how often pages change and sets an
> appropriate interval between two visits of each page. That is, every page
> can have a different visit frequency, and you cannot change it,
> because it is computed by the robot.
>
> In your case, you may want to rescan your website weekly or so.
> Therefore, I would suggest using a scheduled script that
> executes:
> # remove everything we have now
> rm -rf corpus linkdb scheduler index delta
> # start the robot, build the index from scratch
> java org.egothor.robot.Capek http://mywebsite.tld/
> java org.egothor.apps.Michelangelo ......
> # move the result to a production directory
> mv -f index /opt/tomcat/webapps/egothor/index.new
> # swap quickly
> mv /opt/tomcat/webapps/egothor/index /opt/tomcat/webapps/egothor/index.old
> mv /opt/tomcat/webapps/egothor/index.new /opt/tomcat/webapps/egothor/index
> # and now we can remove the old index
> rm -rf /opt/tomcat/webapps/egothor/index.old
>
> Obviously, if your site is small and the web documents are stored on your
> own disks, you would do better to use a local indexer -- Directory, see
> http://www.egothor.org/twiki/bin/view/Know/Directory
>
> Cheers,
> Leo
>
>
> HM wrote:
>
>>Hello list,
>>
>>I'm evaluating EgoThor for use at our company. I've been reading the docs and I'm trying to
>>figure out what the recommended approach is for maintaining an up-to-date index of a website.
>>
>>It seems pretty straightforward to run Capek as a daemon:
>>
>>java org.egothor.robot.Capek -daemon [your URLs to crawl]
>>
>>I assume this means that it will crawl the entire site, then when it's done, start over
>>again.
>>I suppose this could be used with the egothor.server.pause argument so that Capek will slowly
>>and continuously crawl the site.
>>
>>And for indexing, Michelangelo could be run at an interval, like every 48 hours or so.
>>
>>Am I understanding everything correctly? Is this the correct approach for keeping an up-to-date
>>index of a website?

