File of Capek's rules
Typical rules can look like these:
robot-id egobot
user-agent +http://www.egothor.org/twiki/bin/view/Know/UnknownRobot
loop 2
valid http://fw\.my:80/.*
replace PHPSESSION=[^&]* <EMPTY>
replace && <EMPTY>
replace \?& \?
invalid .*\.mpeg
They specify your robot ID and agent string, the number of loops you allow in URLs, which URLs are valid or invalid, and how URLs should be rewritten. The example above shows how to strip the PHPSESSION variable.
The syntax is as follows ([] denotes optional parameter):
- robot-id string DEFAULT: capek
- user-agent string DEFAULT: +http://www.egothor.org/twiki/bin/view/Know/UnknownRobot
- loop integer DEFAULT: 0
- valid regexPattern [last]
- invalid regexPattern [last]
- replace regexPattern replacement
The semantics:
- Empty lines and lines starting with a hash are ignored.
- Delimiters are spaces and tabs.
- If a valid or invalid line ends with the last option and the rule matches, it is the last rule evaluated; no further rules are checked.
- If no rule matches, the input URL is treated as invalid.
- If the replacement of a replace rule is equal to "<EMPTY>", the replacement is set to an empty string.
- Note that "." in a regex matches any character; use "\." if you mean a literal dot.
- The loop rule sets the number of repetitions allowed in a URL (both the path and the query string are checked); see the test run below.
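The valid, invalid and replace semantics above can be sketched in Java. This is a minimal illustration, not Capek's actual code: the class RuleSketch and its methods are hypothetical, and it assumes that rules are applied top to bottom, that a pattern must match the whole URL, and that a replace rule rewrites the URL seen by all later rules. The robot-id, user-agent and loop lines are skipped here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RuleSketch {
    /** One valid/invalid/replace line; an illustrative type, not part of Capek. */
    static final class Rule {
        final String kind;         // "valid", "invalid" or "replace"
        final Pattern pattern;
        final String replacement;  // used by replace rules only
        final boolean last;        // used by valid/invalid rules only

        Rule(String kind, String regex, String replacement, boolean last) {
            this.kind = kind;
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
            this.last = last;
        }
    }

    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // ignored lines
            String[] f = line.split("[ \\t]+");                   // spaces and tabs delimit
            boolean verdict = f[0].equals("valid") || f[0].equals("invalid");
            if (verdict && f.length >= 2) {
                boolean last = f.length > 2 && f[2].equals("last");
                rules.add(new Rule(f[0], f[1], null, last));
            } else if (f[0].equals("replace") && f.length >= 3) {
                String repl = f[2].equals("<EMPTY>") ? "" : f[2];
                rules.add(new Rule(f[0], f[1], repl, false));
            }
            // robot-id, user-agent and loop are not handled in this sketch
        }
        return rules;
    }

    /** Returns the rewritten URL, or null when the URL is rejected. */
    static String apply(List<Rule> rules, String url) {
        boolean accepted = false; // no matching rule means invalid
        for (Rule r : rules) {
            if (r.kind.equals("replace")) {
                url = r.pattern.matcher(url).replaceAll(r.replacement);
            } else if (r.pattern.matcher(url).matches()) { // assumption: full match
                accepted = r.kind.equals("valid");
                if (r.last) break; // "last" stops evaluation once the rule matches
            }
        }
        return accepted ? url : null;
    }

    public static void main(String[] args) {
        List<Rule> rules = parse(Arrays.asList(
                "valid http://fw\\.my:80/.*",
                "replace PHPSESSION=[^&]* <EMPTY>",
                "replace && <EMPTY>",
                "replace \\?& \\?",
                "invalid .*\\.mpeg"));
        System.out.println(apply(rules, "http://fw.my:80/a/index.html?PHPSESSION=1234761243&b=/a"));
        System.out.println(apply(rules, "http://jakarta.apache.org:80/"));
    }
}
```

Under these assumptions a URL that matches both the valid rule and the later invalid rule (for example one ending in .mpeg) ends up rejected, because the later match overrides the earlier one unless the earlier rule carries the last option.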
Once you have prepared your rules, for instance saved to a file named "rules" in your current directory, you can test them with
org.egothor.robot.Config. On Linux you can then run:
java org.egothor.robot.Config rules <<EOF
http://fw.my:80/a/index.html
http://fw.my:80/a/index.html
http://fw.my:80/a/a/index.html
http://fw.my:80/a/a/a/index.html
http://fw.my:80/a/a/a/a/index.html
http://fw.my:80/a/index.html?b=/a/a/a
http://fw.my:80/a/index.html?b=/a/a
http://fw.my:80/a/index.html?b=/a
http://fw.my:80/a/index.html?b=/a
http://fw.my:80/a/index.html?PHPSESSION=1234761243&b=/a
http://jakarta.apache.org:80/
EOF
The tool reads every URL given on input and transforms it to its final form, that is, the form in which Capek understands it. Excluded URLs are transformed to
null.
The respective output in our case:
Reading rules
http://fw.my:80/a/index.html http://fw.my:80/a/index.html
http://fw.my:80/a/index.html http://fw.my:80/a/index.html
http://fw.my:80/a/a/index.html http://fw.my:80/a/a/index.html
http://fw.my:80/a/a/a/index.html null
http://fw.my:80/a/a/a/a/index.html null
http://fw.my:80/a/index.html?b=/a/a/a null
http://fw.my:80/a/index.html?b=/a/a http://fw.my:80/a/index.html?b=/a/a
http://fw.my:80/a/index.html?b=/a http://fw.my:80/a/index.html?b=/a
http://fw.my:80/a/index.html?b=/a http://fw.my:80/a/index.html?b=/a
http://fw.my:80/a/index.html?PHPSESSION=1234761243&b=/a http://fw.my:80/a/index.html?b=/a
http://jakarta.apache.org:80/ null
Note: a URL may also enter the process with an explicit port specification!
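The null results for the repeated /a segments are consistent with one possible reading of "loop 2": a URL is rejected when the same segment occurs more than 2 times in a row within the path, or within the query string. The Java sketch below implements only that reading. The class LoopSketch is hypothetical, splitting on "/" is an assumption, and how Capek treats the default value 0 is not covered here.

```java
import java.net.URI;

class LoopSketch {
    /** Longest run of identical, consecutive, non-empty "/"-separated segments. */
    static int longestRun(String part) {
        if (part == null) return 0; // e.g. a URL without a query string
        int best = 0;
        int run = 0;
        String prev = null;
        for (String seg : part.split("/")) {
            if (seg.isEmpty()) continue;
            run = seg.equals(prev) ? run + 1 : 1;
            best = Math.max(best, run);
            prev = seg;
        }
        return best;
    }

    /** Assumption: path and query are checked separately against the same limit. */
    static boolean withinLoopLimit(String url, int loop) {
        URI u = URI.create(url);
        return longestRun(u.getRawPath()) <= loop
            && longestRun(u.getRawQuery()) <= loop;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://fw.my:80/a/a/index.html",
            "http://fw.my:80/a/a/a/index.html",
            "http://fw.my:80/a/index.html?b=/a/a",
            "http://fw.my:80/a/index.html?b=/a/a/a"
        };
        for (String url : urls) {
            System.out.println(url + " " + withinLoopLimit(url, 2));
        }
    }
}
```

With a limit of 2 this sketch accepts /a/a/index.html and ?b=/a/a and rejects /a/a/a/index.html and ?b=/a/a/a, which agrees with the output listed above.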
Where the rules file is searched for
Capek looks for the rules in one of the following files:
- ./rules if nothing else is specified
- working_dir/rules if the working directory is given by the -l option
- the file named by egothor.rules.file, if it is set (default: "./rules")
Unfortunately, if the rules file cannot be read, no error message is logged unless the FINEST log level is set. In this case no URL will be accepted.
(PeterHalacsy - 05 Jul 2004)
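Because an unreadable rules file is silently ignored, it can help to check the file before starting the robot. The following Java sketch is a stand-alone pre-flight check, not part of Egothor. The class RulesFileCheck is hypothetical, and it assumes that egothor.rules.file can be read as a JVM system property, which the text above does not confirm. A path given as the first command-line argument takes precedence.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class RulesFileCheck {
    /** Assumption: egothor.rules.file is a JVM system property; "./rules" is the documented default. */
    static Path resolve() {
        return Paths.get(System.getProperty("egothor.rules.file", "./rules"));
    }

    /** True if the path points to an existing regular file that this process can read. */
    static boolean usable(Path p) {
        return Files.isRegularFile(p) && Files.isReadable(p);
    }

    public static void main(String[] args) {
        Path p = args.length > 0 ? Paths.get(args[0]) : resolve();
        if (!usable(p)) {
            System.err.println("Rules file is missing or unreadable: " + p.toAbsolutePath());
            System.exit(1);
        }
        System.out.println("Rules file looks readable: " + p.toAbsolutePath());
    }
}
```

For example, running java RulesFileCheck working_dir/rules before a crawl that uses the -l option would catch a missing file that would otherwise cause every URL to be rejected.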