[Egothor-tech] Problem with running capek

Steve Mannin stevemannin at gmail.com
Thu Nov 23 12:47:34 GMT 2006


Hi,

Thanks for your answer. However, when I changed the start point to
http://www.egothor.org:80/ the program still didn't work. I used the
following rules:

user-agent      +http://www.egothor.org/nonexistentpage.html
loop            2

valid           http://www\.egothor\.org(:\d+)?/.*

I still got the same error as follows:

Nov 23, 2006 6:56:09 PM org.egothor.robot.components.Capek inject
INFO: cannot be accepted
Nov 23, 2006 6:56:09 PM org.egothor.robot.components.Capek inject
INFO: cannot be accepted
Listening @ 9713
...

I also tried running

java org.egothor.robot.Config rules <<EOF
> http://www.egothor.org:80/
> EOF

and the output looked OK:

Reading rules
...
http://www.egothor.org:80/ http://www.egothor.org:80/
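
As a cross-check outside Egothor, the valid pattern can be tried directly with java.util.regex. This is only a sketch: the class name ValidRuleCheck is made up, and it assumes that a rule accepts a URL only when the pattern matches the whole URL, which may not be exactly what Capek does.

```java
import java.util.regex.Pattern;

class ValidRuleCheck {
    // The `valid` pattern from the rules above, written as a Java string literal.
    static final Pattern VALID =
        Pattern.compile("http://www\\.egothor\\.org(:\\d+)?/.*");

    // Assumption: the pattern has to match the entire URL, not just a part of it.
    static boolean accepts(String url) {
        return VALID.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.egothor.org:80/",
            "http://www.egothor.org/",
            "http://www.egothor.org"      // no trailing slash
        };
        for (String url : urls) {
            System.out.println(url + " -> " + accepts(url));
        }
    }
}
```

Under that assumption the first two URLs pass and the one without the trailing slash does not, so the pattern itself seems fine for http://www.egothor.org:80/.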

What's wrong with my settings?

Regards,
Steve






On 11/23/06, Leo Galambos <leo.galambos at mff.cuni.cz> wrote:
>
> Steve Mannin wrote:
> > Hi,
> >
> > I've tried to play with egothor, starting with running Capek. However,
> > I found that the etc/rules file I checked out from CVS is in a totally
> > different format from the one described on the web page
> > http://www.egothor.org/book/bk01ch02s02.html#N101F9. Which one is
> > correct?
>
> Hello!
>
> Yes, that's right. The documentation is out of date in that ^^^ section.
>
> Dundee.ac.uk uses:
> ==
> user-agent      +http://www.egothor.org/robot.html
> loop            2
>
> valid           http://.*\.dundee\.ac\.uk(:\d+)?/.*
> #invalid         .*/~somebody/blog/.*
> replace         &amp;                   &
> replace         &\w+=[0-9a-fA-F]{8,}&   &
> replace         &\w+=[0-9a-fA-F]{8,}\z  <EMPTY>
> replace         \?\w+=[0-9a-fA-F]{8,}(&|\z)     \?
> replace         ;\w+=[0-9a-fA-F]{8,}\?  \?
> replace         &&                      <EMPTY>
> replace         \?&                     \?
> replace         \?$                     <EMPTY>
> ==
> It gathers pages from the *.dundee.ac.uk domain and strips some
> cookie-like parameters from URLs.
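
To see what the replace rules quoted above do to a URL, here is a minimal sketch that runs them with java.util.regex. It assumes each rule is applied once, in file order, to every match (replaceAll); Capek's real evaluation order may differ. The class name ReplaceRulesSketch and the sample URLs are invented for illustration.

```java
import java.util.regex.Pattern;

class ReplaceRulesSketch {
    // {pattern, replacement} pairs copied from the quoted rules;
    // "<EMPTY>" is represented by the empty string.
    static final String[][] RULES = {
        {"&amp;", "&"},
        {"&\\w+=[0-9a-fA-F]{8,}&", "&"},
        {"&\\w+=[0-9a-fA-F]{8,}\\z", ""},
        {"\\?\\w+=[0-9a-fA-F]{8,}(&|\\z)", "\\?"},
        {";\\w+=[0-9a-fA-F]{8,}\\?", "\\?"},
        {"&&", ""},
        {"\\?&", "\\?"},
        {"\\?$", ""}
    };

    // Assumption: rules run top to bottom, each one rewriting all of its matches.
    static String rewrite(String url) {
        String result = url;
        for (String[] rule : RULES) {
            result = Pattern.compile(rule[0]).matcher(result).replaceAll(rule[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.dundee.ac.uk/page.php?sid=0123456789abcdef&id=7",
            "http://www.dundee.ac.uk/page.php?id=7&amp;sid=0123456789abcdef",
            "http://www.dundee.ac.uk/page.php?sid=0123456789abcdef"
        };
        for (String url : samples) {
            System.out.println(url + " -> " + rewrite(url));
        }
    }
}
```

With these samples, any parameter whose value is eight or more hex digits disappears, and a dangling "?" left at the end is removed by the last rule.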
>
> The syntax was described in our TWiki, which was closed because of serious
> security problems. Here is a copy of the TWiki text:
> ==
> ---++ File of Capek's rules
>
> Typical rules can look like these:
> <verbatim>
> robot-id                  egobot
> user-agent                +http://www.egothor.org/twiki/bin/view/Know/UnknownRobot
> loop                            2
>
> valid                     http://fw\.my:80/.*
> replace                 PHPSESSION=[^&]*                  <EMPTY>
> replace                 &&                                <EMPTY>
> replace                 \?&                               \?
> invalid                 .*\.mpeg
> </verbatim>
> They specify your agent string, the number of loops you allow in URLs,
> which URLs are (in)valid, and how a URL should be rewritten; the example
> above shows how to strip the PHPSESSION variable.
>
> The syntax is as follows ([] denotes optional parameter):
>        * =robot-id string= DEFAULT: capek
>        * =user-agent string= DEFAULT: +http://www.egothor.org/twiki/bin/view/Know/UnknownRobot
>        * =loop integer= DEFAULT: 0
>        * =valid regexPattern [last]=
>        * =invalid regexPattern [last]=
>        * =replace regexPattern replacement=
>
> The semantics:
>        * Empty lines and lines starting with a hash are ignored.
>        * Delimiters are space(s) and tab(s).
>        * If a line with =(in)valid= ends with the =last= option and the
> rule matches, no further rules are evaluated.
>        * If no rule matches, the input URL is treated as =invalid=.
>        * If =replacement= (rule =replace=) is equal to "<EMPTY>",
> the replacement is set to an empty string.
>        * Note that "." in a regex matches any character; use "\." if you
> mean a literal dot!
>        * The =loop= rule sets the number of repetitions allowed in a URL
> (path and query strings are validated); see the example below.
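
The line-level part of these semantics can be sketched as a small tokenizer. It is not Egothor's parser: the class name RuleLineParser is hypothetical, lines are trimmed before the hash check (an assumption), and the sketch only splits lines into tokens without evaluating any rule.

```java
import java.util.ArrayList;
import java.util.List;

class RuleLineParser {
    // Returns one token array per rule line: [directive, argument, ...].
    static List<String[]> parse(String text) {
        List<String[]> rules = new ArrayList<>();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            // Empty lines and lines starting with a hash are ignored.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            // Delimiters are runs of spaces and tabs.
            String[] tokens = line.split("[ \\t]+");
            // "<EMPTY>" as the replacement of a replace rule means the empty string.
            if (tokens[0].equals("replace") && tokens.length == 3
                    && tokens[2].equals("<EMPTY>")) {
                tokens[2] = "";
            }
            rules.add(tokens);
        }
        return rules;
    }

    public static void main(String[] args) {
        String text = "user-agent\t+http://www.egothor.org/robot.html\n"
                    + "loop  2\n\n# a comment\n"
                    + "valid http://fw\\.my:80/.* last\n"
                    + "replace && <EMPTY>\n";
        for (String[] tokens : parse(text)) {
            System.out.println(String.join(" | ", tokens));
        }
    }
}
```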
>
> Once you have prepared your rules, for instance saved them to a file
> named "rules" in your current directory, you can test them using
> =org.egothor.robot.Config=. On Linux you can then issue:
> <verbatim>
> java org.egothor.robot.Config rules <<EOF
> http://fw.my:80/a/index.html
> http://fw.my:80/a/index.html
> http://fw.my:80/a/a/index.html
> http://fw.my:80/a/a/a/index.html
> http://fw.my:80/a/a/a/a/index.html
> http://fw.my:80/a/index.html?b=/a/a/a
> http://fw.my:80/a/index.html?b=/a/a
> http://fw.my:80/a/index.html?b=/a
> http://fw.my:80/a/index.html?b=/a
> http://fw.my:80/a/index.html?PHPSESSION=1234761243&b=/a
> http://jakarta.apache.org:80/
> EOF
> </verbatim>
> The tool reads and transforms all URLs given on input to their final
> form, or rather, how they are understood by Capek. URLs which are
> excluded are transformed to =null=.
>
> The respective output in our case:
> <verbatim>
> Reading rules
> http://fw.my:80/a/index.html http://fw.my:80/a/index.html
> http://fw.my:80/a/index.html http://fw.my:80/a/index.html
> http://fw.my:80/a/a/index.html http://fw.my:80/a/a/index.html
> http://fw.my:80/a/a/a/index.html null
> http://fw.my:80/a/a/a/a/index.html null
> http://fw.my:80/a/index.html?b=/a/a/a null
> http://fw.my:80/a/index.html?b=/a/a http://fw.my:80/a/index.html?b=/a/a
> http://fw.my:80/a/index.html?b=/a http://fw.my:80/a/index.html?b=/a
> http://fw.my:80/a/index.html?b=/a http://fw.my:80/a/index.html?b=/a
> http://fw.my:80/a/index.html?PHPSESSION=1234761243&b=/a http://fw.my:80/a/index.html?b=/a
> http://jakarta.apache.org:80/ null
> </verbatim>
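
One reading of the loop rule that is consistent with the output above: a URL is rejected when the same "/"-separated segment repeats consecutively more than =loop= times, with the path and the query string checked separately. The sketch below implements only that guess; the class name LoopCheckSketch and its methods are hypothetical, and Capek's real check may work differently.

```java
class LoopCheckSketch {
    // Length of the longest run of identical consecutive "/"-separated segments.
    static int longestRun(String part) {
        int best = 0;
        int run = 0;
        String previous = null;
        for (String segment : part.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty piece before a leading slash
            }
            run = segment.equals(previous) ? run + 1 : 1;
            previous = segment;
            best = Math.max(best, run);
        }
        return best;
    }

    // Assumption: path and query string are validated independently.
    static boolean withinLoopLimit(String pathAndQuery, int loop) {
        int q = pathAndQuery.indexOf('?');
        String path = q < 0 ? pathAndQuery : pathAndQuery.substring(0, q);
        String query = q < 0 ? "" : pathAndQuery.substring(q + 1);
        return longestRun(path) <= loop && longestRun(query) <= loop;
    }

    public static void main(String[] args) {
        String[] samples = {
            "/a/a/index.html",
            "/a/a/a/index.html",
            "/a/index.html?b=/a/a",
            "/a/index.html?b=/a/a/a"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + withinLoopLimit(s, 2));
        }
    }
}
```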
>
> *Note: a URL may also enter the process with a port specification given!*
>
> ---+++ Where Capek looks for the rules file
>
> Capek looks for the rules in:
>        * =./rules=, if nothing else is specified
>        * =working_dir/rules=, if the working directory is given by the
> =-l= option
>        * the location given by =egothor.rules.file=, if it is set
> (default: "./rules")
>
> Unfortunately, if the rules file can't be read, no error message is logged
> unless the FINEST log level is set. In that case no URL will be accepted.
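
Since a missing or unreadable rules file fails silently, it may help to check the candidate locations before starting Capek. The sketch below does that with java.nio.file. It makes two assumptions that the text above does not confirm: that =egothor.rules.file= is a Java system property, and that the first command-line argument stands in for the directory given to =-l=. The class name RulesFileCheck is hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class RulesFileCheck {
    // Describes whether a candidate rules file could be read at all.
    static String describe(Path p) {
        if (!Files.exists(p)) {
            return "missing";
        }
        if (!Files.isRegularFile(p)) {
            return "not a regular file";
        }
        return Files.isReadable(p) ? "readable" : "not readable";
    }

    public static void main(String[] args) {
        // Assumption: args[0] plays the role of the working directory passed via -l.
        String workDir = args.length > 0 ? args[0] : ".";
        // Assumption: egothor.rules.file is read as a system property.
        String fromProperty = System.getProperty("egothor.rules.file", "./rules");
        Path[] candidates = { Paths.get(workDir, "rules"), Paths.get(fromProperty) };
        for (Path p : candidates) {
            System.out.println(p.toAbsolutePath().normalize() + ": " + describe(p));
        }
    }
}
```

If both candidates are reported as missing, the "cannot be accepted" message may simply mean that no rules were loaded.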
>
> (Main.PeterHalacsy - 05 Jul 2004)
>
>
> > I also tried running the following command:
> >
> > java org.egothor.robot.Capek http://www.egothor.org
> >
> > using the default etc/rules. It showed the following messages:
> >
> > Nov 22, 2006 1:12:45 PM org.egothor.robot.memory.ArrayStats initialize
> > WARNING: Cannot find the previous scheduler on a disk
> > Nov 22, 2006 1:12:45 PM org.egothor.robot.memory.ArrayStats initialize
> > WARNING: init from disk
> > java.io.FileNotFoundException: ./scheduler/root.aux.arr (No such file
> > or directory)
>
> This is OK, it is an info message masked as an exception :)
> It appears only once (when the robot DB is empty).
>
> > Nov 22, 2006 1:12:45 PM org.egothor.robot.components.Capek inject
> > INFO: cannot be accepted
>
> The start-point (http://www.egothor.org) did not pass your "rules".
> Check the syntax described above.
>
> Cheers,
> Leo
>
> --
> Leo Galambos
> Faculty of Mathematics and Physics, DSE
> Malostranske namesti 25
> Prague 1
> CZE
>
> http://kocour.ms.mff.cuni.cz/~galambos/
>