[Egothor-tech] Problem with running capek
Leo Galambos
leo.galambos at mff.cuni.cz
Wed Nov 22 19:34:35 GMT 2006
Steve Mannin wrote:
> Hi,
>
> I've tried to play with egothor, starting with running Capek. However,
> I found that the etc/rules file I checked out from CVS is in a totally
> different format from the one described on the web page
> http://www.egothor.org/book/bk01ch02s02.html#N101F9. Which one is correct?
Hello!
Yes, that's right. The documentation is out of date in that ^^^ section.
Dundee.ac.uk uses:
==
user-agent +http://www.egothor.org/robot.html
loop 2
valid http://.*\.dundee\.ac\.uk(:\d+)?/.*
#invalid .*/~somebody/blog/.*
replace &amp; &
replace &\w+=[0-9a-fA-F]{8,}& &
replace &\w+=[0-9a-fA-F]{8,}\z <EMPTY>
replace \?\w+=[0-9a-fA-F]{8,}(&|\z) \?
replace ;\w+=[0-9a-fA-F]{8,}\? \?
replace && <EMPTY>
replace \?& \?
replace \?$ <EMPTY>
==
These rules gather pages in the *.dundee.ac.uk domain and strip
cookie-like parameters (values of eight or more hex digits) from URLs.
The syntax was described in our TWiki, which was closed because of serious
security problems. Here is a copy of the TWiki text:
==
---++ File of Capek's rules
Typical rules can look like these:
<verbatim>
robot-id egobot
user-agent +http://www.egothor.org/twiki/bin/view/Know/UnknownRobot
loop 2
valid http://fw\.my:80/.*
replace PHPSESSION=[^&]* <EMPTY>
replace && <EMPTY>
replace \?& \?
invalid .*\.mpeg
</verbatim>
They specify your agent string, the number of loops you allow in URLs,
which URLs are (in)valid, and how a URL should be rewritten; the example
above shows how to strip the PHPSESSION variable.
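To see what the three =replace= rules do to a URL, here is a minimal Python sketch. It is not Capek's code: it assumes the rules are applied in file order, that every occurrence is replaced, and that Python's =re= module behaves like Capek's regex engine for these simple patterns. "<EMPTY>" is written as an empty string and the escaped "\?" replacement as a plain "?".

```python
import re

# The replace rules from the example above, as (pattern, replacement) pairs.
# Assumption: "<EMPTY>" means "" and the "\?" replacement means a literal "?".
REPLACE_RULES = [
    (r"PHPSESSION=[^&]*", ""),
    (r"&&", ""),
    (r"\?&", "?"),
]


def rewrite(url):
    """Apply every replace rule in order (assumed evaluation order)."""
    for pattern, replacement in REPLACE_RULES:
        url = re.sub(pattern, replacement, url)
    return url


print(rewrite("http://fw.my:80/a/index.html?PHPSESSION=1234761243&b=/a"))
# prints: http://fw.my:80/a/index.html?b=/a
```

The first rule leaves "?&b=/a" behind, and the third rule then collapses "?&" into "?".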
The syntax is as follows ([] denotes an optional parameter):
* =robot-id string= DEFAULT: capek
* =user-agent string= DEFAULT: +http://www.egothor.org/twiki/bin/view/Know/UnknownRobot
* =loop integer= DEFAULT: 0
* =valid regexPattern [last]=
* =invalid regexPattern [last]=
* =replace regexPattern replacement=
The semantics:
* Empty lines and lines starting with a hash are ignored.
* Delimiters are spaces and tabs.
* If a =(in)valid= line ends with the =last= option and the rule
matches, no further rules are evaluated.
* If no rule matches, the input URL is taken as =invalid=.
* If =replacement= (in a =replace= rule) is equal to "<EMPTY>",
the replacement pattern is set to an empty string.
* Note that "." in a regex matches any character; use "\." if you
mean a dot!
* The =loop= rule sets the number of repetitions allowed in a URL (both
the path and the query string are validated); see the example below.
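The syntax and semantics listed above can be sketched as a small parser and validity check. This is only an illustration in Python, not Capek's implementation. The names =Rule=, =parse_rules= and =accepts= are made up for the sketch. It assumes that a pattern must match the whole URL and that, when several =(in)valid= rules match, the later one wins unless a =last= rule stops the evaluation.

```python
import re
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    kind: str                          # "valid", "invalid" or "replace"
    pattern: str
    replacement: Optional[str] = None  # used by "replace" only
    last: bool = False                 # used by "valid"/"invalid" only


def parse_rules(text):
    """Parse rules text into (settings, rules); hypothetical helper."""
    # Defaults as documented in the syntax list.
    settings = {
        "robot-id": "capek",
        "user-agent": "+http://www.egothor.org/twiki/bin/view/Know/UnknownRobot",
        "loop": 0,
    }
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                   # empty lines and comments are ignored
        key, *args = line.split()      # delimiters are spaces and tabs
        if key in ("robot-id", "user-agent"):
            settings[key] = " ".join(args)
        elif key == "loop":
            settings["loop"] = int(args[0])
        elif key in ("valid", "invalid"):
            is_last = len(args) > 1 and args[1] == "last"
            rules.append(Rule(key, args[0], last=is_last))
        elif key == "replace":
            repl = "" if args[1] == "<EMPTY>" else args[1]
            rules.append(Rule(key, args[0], replacement=repl))
        else:
            raise ValueError("unknown rule: " + key)
    return settings, rules


def accepts(rules, url):
    """Return True if the URL passes the (in)valid rules."""
    verdict = False                    # no matching rule means invalid
    for rule in rules:
        if rule.kind == "replace":
            continue
        # Assumption: full-match semantics.
        if re.fullmatch(rule.pattern, url):
            verdict = rule.kind == "valid"
            if rule.last:
                break                  # a matching "last" rule ends evaluation
    return verdict


EXAMPLE_RULES = r"""
robot-id egobot
loop 2
valid http://fw\.my:80/.*
replace PHPSESSION=[^&]* <EMPTY>
invalid .*\.mpeg
"""

settings, rules = parse_rules(EXAMPLE_RULES)
print(settings["loop"], accepts(rules, "http://fw.my:80/a/index.html"))
```

Under these assumptions =http://fw.my:80/movie.mpeg= is rejected because the later =invalid= rule also matches, and =http://jakarta.apache.org:80/= is rejected because no rule matches at all.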
Once you have prepared your rules, for instance saved them to a file
named "rules" in your current directory, you can test them with
=org.egothor.robot.Config=. On Linux you can run:
<verbatim>
java org.egothor.robot.Config rules <<EOF
http://fw.my:80/a/index.html
http://fw.my:80/a/index.html
http://fw.my:80/a/a/index.html
http://fw.my:80/a/a/a/index.html
http://fw.my:80/a/a/a/a/index.html
http://fw.my:80/a/index.html?b=/a/a/a
http://fw.my:80/a/index.html?b=/a/a
http://fw.my:80/a/index.html?b=/a
http://fw.my:80/a/index.html?b=/a
http://fw.my:80/a/index.html?PHPSESSION=1234761243&b=/a
http://jakarta.apache.org:80/
EOF
</verbatim>
The tool reads every URL given on input and prints it next to its final
form, that is, the form in which Capek understands it. Excluded URLs
are shown as =null=.
The corresponding output in our case:
<verbatim>
Reading rules
http://fw.my:80/a/index.html http://fw.my:80/a/index.html
http://fw.my:80/a/index.html http://fw.my:80/a/index.html
http://fw.my:80/a/a/index.html http://fw.my:80/a/a/index.html
http://fw.my:80/a/a/a/index.html null
http://fw.my:80/a/a/a/a/index.html null
http://fw.my:80/a/index.html?b=/a/a/a null
http://fw.my:80/a/index.html?b=/a/a http://fw.my:80/a/index.html?b=/a/a
http://fw.my:80/a/index.html?b=/a http://fw.my:80/a/index.html?b=/a
http://fw.my:80/a/index.html?b=/a http://fw.my:80/a/index.html?b=/a
http://fw.my:80/a/index.html?PHPSESSION=1234761243&b=/a http://fw.my:80/a/index.html?b=/a
http://jakarta.apache.org:80/ null
</verbatim>
*Note: a URL may also enter the process with an explicit port specification!*
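The text does not say how the =loop= check is computed. One reading that is consistent with the sample output for =loop 2= is that a URL is rejected when the same path or query segment repeats consecutively more than =loop= times. The Python sketch below implements only that guess; =longest_run= and =exceeds_loop= are hypothetical names, the real algorithm may differ, and the behaviour for the default =loop 0= is not covered.

```python
from itertools import groupby
from urllib.parse import urlsplit


def longest_run(url):
    """Length of the longest run of identical consecutive "/" segments."""
    parts = urlsplit(url)
    # Path and query are checked together, as the semantics list suggests.
    tail = parts.path + ("?" + parts.query if parts.query else "")
    segments = [s for s in tail.split("/") if s]
    return max((len(list(group)) for _, group in groupby(segments)), default=0)


def exceeds_loop(url, loop):
    """Guess: reject when a segment repeats more than `loop` times in a row."""
    return longest_run(url) > loop


for url in (
    "http://fw.my:80/a/a/index.html",
    "http://fw.my:80/a/a/a/index.html",
    "http://fw.my:80/a/index.html?b=/a/a",
    "http://fw.my:80/a/index.html?b=/a/a/a",
):
    print(url, "null" if exceeds_loop(url, 2) else "ok")
```

With =loop= set to 2 this rejects the same four-of-eleven =fw.my= URLs that the sample output maps to =null= because of repetition.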
---+++ Where the rule file is searched
Capek looks for the rules file in one of these locations:
* =./rules= if nothing else is specified
* =working_dir/rules= if the working directory is given by the =-l=
option
* the location given by =egothor.rules.file=, if it is set
(default: "./rules")
Unfortunately, if the rules file can't be read, no error message is logged
unless the FINEST log level is set. In this case no URL will be accepted.
(Main.PeterHalacsy - 05 Jul 2004)
> I also tried running the following command:
>
> java org.egothor.robot.Capek http://www.egothor.org
>
> using the default etc/rules. It showed the following messages:
>
> Nov 22, 2006 1:12:45 PM org.egothor.robot.memory.ArrayStats initialize
> WARNING: Cannot find the previous scheduler on a disk
> Nov 22, 2006 1:12:45 PM org.egothor.robot.memory.ArrayStats initialize
> WARNING: init from disk
> java.io.FileNotFoundException: ./scheduler/root.aux.arr (No such file
> or directory)
This is OK, it is an info message masked as an exception :)
It appears only once (when the robot DB is empty).
> Nov 22, 2006 1:12:45 PM org.egothor.robot.components.Capek inject
> INFO: cannot be accepted
The start-point (http://www.egothor.org) did not pass your "rules".
Check the syntax described above.
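As a quick illustration of why a start-point can fail, here is a small Python sketch that tests a few URLs against the =valid= pattern from the Dundee rules quoted earlier in this message. Your default etc/rules will contain different patterns. The sketch also assumes that a pattern has to match the whole URL, which the text does not state explicitly.

```python
import re

# `valid` pattern copied from the Dundee rules quoted earlier in this message.
VALID = r"http://.*\.dundee\.ac\.uk(:\d+)?/.*"


def passes(url):
    """True if the URL matches the valid pattern (assumed: full match)."""
    return re.fullmatch(VALID, url) is not None


for url in (
    "http://www.egothor.org",
    "http://www.dundee.ac.uk/",
    "http://www.computing.dundee.ac.uk:80/index.html",
):
    print(url, "accepted" if passes(url) else "cannot be accepted")
```

Only the two dundee.ac.uk URLs pass; with no other =valid= rule present, anything else is taken as invalid.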
Cheers,
Leo
--
Leo Galambos
Faculty of Mathematics and Physics, DSE
Malostranske namesti 25
Prague 1
CZE
http://kocour.ms.mff.cuni.cz/~galambos/