[Egothor-tech] Problem with running capek

Leo Galambos leo.galambos at mff.cuni.cz
Wed Nov 22 19:34:35 GMT 2006


Steve Mannin wrote:
> Hi,
>  
> I've tried to play with egothor starting from running Capek. However, 
> I found that the file etc/rules I checked out from CVS is totally in 
> different format from what being described in the web page 
> http://www.egothor.org/book/bk01ch02s02.html#N101F9. Which one is correct?

Hello!

Yes, that's right. The documentation is out of date in that ^^^ section.

Dundee.ac.uk uses:
==
user-agent      +http://www.egothor.org/robot.html
loop            2

valid           http://.*\.dundee\.ac\.uk(:\d+)?/.*
#invalid         .*/~somebody/blog/.*
replace         &amp;               &
replace         &\w+=[0-9a-fA-F]{8,}&   &
replace         &\w+=[0-9a-fA-F]{8,}\z  <EMPTY>
replace         \?\w+=[0-9a-fA-F]{8,}(&|\z)     \?
replace         ;\w+=[0-9a-fA-F]{8,}\?  \?
replace         &&                      <EMPTY>
replace         \?&                     \?
replace         \?$                     <EMPTY>
==
It gathers pages from the *.dundee.ac.uk domain and strips some 
cookie-like parameters from the URLs.
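
To make the effect of the hex-parameter rules concrete, here is a minimal Java sketch. It assumes, without confirmation from the Capek source, that each =replace= rule is applied in file order with the semantics of java.util.regex "replaceAll" (where a replacement of \? yields a literal "?"). The class name, the method name, and the sample URLs are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Egothor: applies the Dundee
// hex-parameter replace rules one after another.
class DundeeReplaceSketch {
    // {regex, replacement}; "" stands for <EMPTY>.
    static final String[][] RULES = {
        {"&\\w+=[0-9a-fA-F]{8,}&", "&"},
        {"&\\w+=[0-9a-fA-F]{8,}\\z", ""},
        {"\\?\\w+=[0-9a-fA-F]{8,}(&|\\z)", "\\?"},
        {";\\w+=[0-9a-fA-F]{8,}\\?", "\\?"},
        {"&&", ""},
        {"\\?&", "\\?"},
        {"\\?$", ""},
    };

    // Assumption: rules run in order, each over the result of the previous one.
    static String rewrite(String url) {
        String out = url;
        for (String[] rule : RULES) {
            out = Pattern.compile(rule[0]).matcher(out).replaceAll(rule[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // A long hex value in the first query parameter is dropped.
        System.out.println(rewrite("http://www.dundee.ac.uk/page?sid=0123456789abcdef&x=1"));
        // If it was the only parameter, the dangling "?" is removed as well.
        System.out.println(rewrite("http://www.dundee.ac.uk/page?sid=DEADBEEF01"));
    }
}
```

Under these assumptions the two sample URLs come out as http://www.dundee.ac.uk/page?x=1 and http://www.dundee.ac.uk/page.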

The syntax was described in our TWiki, which was closed because of serious 
security problems. Here is a copy of the TWiki text:
==
---++ File of Capek's rules

Typical rules can look like these:
<verbatim>
robot-id                egobot
user-agent              +http://www.egothor.org/twiki/bin/view/Know/UnknownRobot
loop                    2

valid                   http://fw\.my:80/.*
replace                 PHPSESSION=[^&]*        <EMPTY>
replace                 &&                      <EMPTY>
replace                 \?&                     \?
invalid                 .*\.mpeg
</verbatim>
They specify your agent string, the number of loops you allow in URLs, which 
URLs are (in)valid, and how a URL should be reformatted - in the example 
above, how to strip the PHPSESSION variable.
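
As a worked trace of the three =replace= rules above, the following Java sketch prints each intermediate step for one URL. It assumes the rules are applied top to bottom with "replaceAll" semantics, which is not confirmed by the Capek source; the class and method names are hypothetical.

```java
// Hypothetical trace helper, not part of Egothor.
class PhpSessionReplaceSketch {
    static String rewrite(String url) {
        // replace PHPSESSION=[^&]*  <EMPTY>
        String s = url.replaceAll("PHPSESSION=[^&]*", "");
        System.out.println("after rule 1: " + s);
        // replace &&  <EMPTY>
        s = s.replaceAll("&&", "");
        System.out.println("after rule 2: " + s);
        // replace \?&  \?   (a "?&" left behind by rule 1 becomes "?")
        s = s.replaceAll("\\?&", "\\?");
        System.out.println("after rule 3: " + s);
        return s;
    }

    public static void main(String[] args) {
        rewrite("http://fw.my:80/a/index.html?PHPSESSION=1234761243&b=/a");
    }
}
```

Rule 1 leaves "?&b=/a" behind, and rule 3 cleans that up to "?b=/a", so the final URL is http://fw.my:80/a/index.html?b=/a.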

The syntax is as follows ([] denotes optional parameter):
        * =robot-id string= DEFAULT: capek
        * =user-agent string= DEFAULT: +http://www.egothor.org/twiki/bin/view/Know/UnknownRobot
        * =loop integer= DEFAULT: 0
        * =valid regexPattern [last]=
        * =invalid regexPattern [last]=
        * =replace regexPattern replacement=

The semantics:
        * Empty lines and lines starting with a hash are ignored.
        * Delimiters are space(s) and tab(s).
        * If a =(in)valid= line ends with the =last= option and the 
rule matches, no further rules are evaluated.
        * If no rule matches, the input URL is treated as =invalid=.
        * If =replacement= (in a =replace= rule) is equal to "&lt;EMPTY&gt;", 
the replacement is set to an empty string.
        * Note that "." in a regex matches any character; use "\." if you 
mean a literal dot!
        * The =loop= rule sets the allowed number of repetitions in a URL 
(path and query strings are validated) - see the example below.
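
One possible reading of these semantics, sketched in Java. This is an interpretation, not Capek's actual code: it assumes that =valid=/=invalid= rules are checked in file order, that a pattern must match the whole URL, that a later matching rule overrides an earlier one, and that a matching rule marked =last= stops the evaluation. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of Egothor.
class RulesSketch {
    // One parsed "valid" or "invalid" line.
    static final class Rule {
        final boolean valid;
        final Pattern pattern;
        final boolean last;

        Rule(boolean valid, Pattern pattern, boolean last) {
            this.valid = valid;
            this.pattern = pattern;
            this.last = last;
        }
    }

    // Parses only valid/invalid lines; other keywords are skipped here.
    static List<Rule> parse(String text) {
        List<Rule> rules = new ArrayList<>();
        for (String line : text.split("\n")) {
            // Empty lines and lines starting with a hash are ignored.
            if (line.trim().isEmpty() || line.startsWith("#")) continue;
            // Delimiters are spaces and tabs.
            String[] f = line.trim().split("[ \t]+");
            if (f.length < 2) continue;
            if (f[0].equals("valid") || f[0].equals("invalid")) {
                boolean last = f.length > 2 && f[2].equals("last");
                rules.add(new Rule(f[0].equals("valid"), Pattern.compile(f[1]), last));
            }
        }
        return rules;
    }

    // Assumed evaluation order: see the lead-in above.
    static boolean accept(List<Rule> rules, String url) {
        boolean accepted = false; // no matching rule means invalid
        for (Rule r : rules) {
            if (r.pattern.matcher(url).matches()) {
                accepted = r.valid;
                if (r.last) break; // "last": stop at this rule
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<Rule> rules = parse("valid http://fw\\.my:80/.*\ninvalid .*\\.mpeg\n");
        System.out.println(accept(rules, "http://fw.my:80/a/index.html"));
        System.out.println(accept(rules, "http://fw.my:80/movie.mpeg"));
        System.out.println(accept(rules, "http://jakarta.apache.org:80/"));
    }
}
```

With these assumptions, the first URL is accepted, the .mpeg URL is rejected by the later =invalid= rule, and the third URL is rejected because no rule matches. Adding =last= to the =valid= line would make the .mpeg URL acceptable, since evaluation would stop there.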

Once you have prepared your rules and, for instance, saved them to a file 
named "rules" in your current directory, you can test them using 
=org.egothor.robot.Config=. On Linux you can then issue:
<verbatim>
java org.egothor.robot.Config rules <<EOF
http://fw.my:80/a/index.html
http://fw.my:80/a/index.html
http://fw.my:80/a/a/index.html
http://fw.my:80/a/a/a/index.html
http://fw.my:80/a/a/a/a/index.html
http://fw.my:80/a/index.html?b=/a/a/a
http://fw.my:80/a/index.html?b=/a/a
http://fw.my:80/a/index.html?b=/a
http://fw.my:80/a/index.html?b=/a
http://fw.my:80/a/index.html?PHPSESSION=1234761243&b=/a
http://jakarta.apache.org:80/
EOF
</verbatim>
The tool reads and transforms all URLs given on input to their final 
form, or rather, how they are understood by Capek. URLs which are 
excluded are transformed to =null=.

The respective output in our case:
<verbatim>
Reading rules
http://fw.my:80/a/index.html http://fw.my:80/a/index.html
http://fw.my:80/a/index.html http://fw.my:80/a/index.html
http://fw.my:80/a/a/index.html http://fw.my:80/a/a/index.html
http://fw.my:80/a/a/a/index.html null
http://fw.my:80/a/a/a/a/index.html null
http://fw.my:80/a/index.html?b=/a/a/a null
http://fw.my:80/a/index.html?b=/a/a http://fw.my:80/a/index.html?b=/a/a
http://fw.my:80/a/index.html?b=/a http://fw.my:80/a/index.html?b=/a
http://fw.my:80/a/index.html?b=/a http://fw.my:80/a/index.html?b=/a
http://fw.my:80/a/index.html?PHPSESSION=1234761243&b=/a http://fw.my:80/a/index.html?b=/a
http://jakarta.apache.org:80/ null
</verbatim>
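
The =loop 2= behaviour in this output is consistent with the following Java sketch, which is only a guess at the check and not taken from the Capek source. It assumes that the path and the query string are examined separately, that each is split on "/", and that a URL is rejected when any segment occurs more than =loop= times. It does not try to model the default =loop 0=. The class and method names are hypothetical.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of Egothor.
class LoopCheckSketch {
    // True if no "/"-separated segment of part occurs more than loop times.
    static boolean withinLoopLimit(String part, int loop) {
        if (part == null) return true; // e.g. a URL without a query string
        Map<String, Integer> counts = new HashMap<>();
        for (String segment : part.split("/")) {
            if (segment.isEmpty()) continue;
            if (counts.merge(segment, 1, Integer::sum) > loop) return false;
        }
        return true;
    }

    // Assumption: path and query are validated independently.
    static boolean passesLoopRule(String url, int loop) {
        URI u = URI.create(url);
        return withinLoopLimit(u.getRawPath(), loop)
            && withinLoopLimit(u.getRawQuery(), loop);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://fw.my:80/a/a/index.html",
            "http://fw.my:80/a/a/a/index.html",
            "http://fw.my:80/a/index.html?b=/a/a",
            "http://fw.my:80/a/index.html?b=/a/a/a",
        };
        for (String url : urls) {
            System.out.println(url + " " + passesLoopRule(url, 2));
        }
    }
}
```

For the four sample URLs this reports true, false, true, false, matching which of them the output above maps to =null=.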

*Note: a URL may also enter the process with an explicit port specification!*

---+++ Where the rule file is searched

Capek looks for the rules file in the following locations:
        * =./rules= if nothing else is specified
        * =working_dir/rules= if the working directory is given by the 
=-l= option
        * the location given by =egothor.rules.file=, if it is set 
(default: "./rules")

Unfortunately, if the rules file cannot be read, no error message is logged 
unless the FINEST log level is set. In this case no URL will be accepted.

(Main.PeterHalacsy - 05 Jul 2004)


> I also tried running the following command:
>  
> java org.egothor.robot.Capek http://www.egothor.org
>  
> using the default etc/rules. It showed the following messages:
>  
> Nov 22, 2006 1:12:45 PM org.egothor.robot.memory.ArrayStats initialize
> WARNING: Cannot find the previous scheduler on a disk
> Nov 22, 2006 1:12:45 PM org.egothor.robot.memory.ArrayStats initialize
> WARNING: init from disk
> java.io.FileNotFoundException: ./scheduler/root.aux.arr (No such file 
> or directory)

This is OK, it is an info message masked as an exception :)
It appears only once (when the robot DB is empty).

> Nov 22, 2006 1:12:45 PM org.egothor.robot.components.Capek inject
> INFO: cannot be accepted

The start-point (http://www.egothor.org) did not pass your "rules". 
Check the syntax described above.

Cheers,
Leo

-- 
Leo Galambos
Faculty of Mathematics and Physics, DSE
Malostranske namesti 25
Prague 1
CZE

http://kocour.ms.mff.cuni.cz/~galambos/



