public class Config extends Object
| Modifier and Type | Class and Description |
|---|---|
| static class | Config.TypeURIRepository |

| Modifier and Type | Field and Description |
|---|---|
| static int | BACKUPRESPONSES: Back up all documents specified in the content table into a gzipped stream? |
| static int | BUCKETSCACHE: Number of cached Table objects. |
| static int | BUCKETSIZE: Size of buckets (in bytes). |
| static int | CAPACITY: How many servers will be kept in memory for processing (implies up to 4 IO handles for each). |
| static boolean | DEBUG: Print more details about robot work into logs. |
| static int | DNSRETRIES: How many times do we restart a DNS query? |
| static String | DNSSERVER: IP address of our DNS server. |
| static int | DNSTIMEOUT: How long we wait for a DNS response. |
| static long | DOCDELAY: How often may we ask for the same URI (msec)? |
| static long | DOCDELAYONFAIL: How often may we ask for the same URI if the request failed (msec)? |
| static long | DOCINITDELAY: How often may we ask for the same URI on start, i.e. for the first two fetches (msec)? |
| static int | FETCHQSIZE: Number of ready servers for gathering (number of pages gathered concurrently). |
| static int | IMGBUCKETSCACHE: Number of cached Table objects when storing IMG-SRC URIs. |
| static int | IMGNODESCACHE: An internal cache which is used in Directory when storing IMG-SRC URIs. |
| static int | IMGSCHUNKLEN: Chunk length of the IMGSLISTCHUNKFILENAME and IMGSSTRUCTCHUNKFILENAME reporters. |
| static String | IMGSLISTCHUNKFILENAME: Filename of the log file with all new IMG URIs accepted. |
| static String | IMGSSTRUCTCHUNKFILENAME: Filename of the log file with the occurrences of images on pages. |
| static int | IMGURISCACHE: An internal cache which is used in Table when storing IMG-SRC URIs. |
| static int | INDEXCHUNK: Size of the chunks created by the indexer. |
| static int | IP_PAUSE: Delay between two connections to the same IP (msec). |
| static int | IP_VIRTUAL: Number of concurrent requests to one IP (in case of virtual hosts). |
| static int | MAXCONNECTIONSINSEQ: A server may stay in the processing queue for up to this many connections; then it must give others a chance. |
| static int | MAXFAILURESINBATCH: How many times do we allow DNS or HTTP requests to be restarted during a batch (sequence)? |
| static int | MAXIMUSTOTALUS: What is the maximum size of a document we accept? |
| static int | MAXITEMSINQUEUE: If a queue holds more than this number of URIs, new URIs are not planned. |
| static long | MAXPAGES: Initial number of URIs we will collect. |
| static long | MAXSERVERS: Initial number of servers the robot will scan. |
| static int | MAXURILEN: Maximum length of a URI. |
| static int | MAXURISFORBLINDHASH: Size of a hash table (bitmap) for all URIs in the system. |
| static int | NEWURISCACHE: An internal cache which is used in LinksCollector. |
| static int | NODESCACHE: An internal cache which is used in Directory. |
| static int | POSTQUEUESIZE: How many Response objects are in post-process queues. |
| static int | PREFSORTED: Number of items sorted preferentially. |
| static int | RESOLVEQSIZE: Number of ready servers for resolving (number of concurrent requests our DNS server can handle). |
| static int | RESPONSESIZE: What is the maximum size of a defragmented response we accept? |
| static String | ROBOTID: Identification string of our robot. |
| static long | ROBOTSTXT: How old can the robots.txt specification be (in milliseconds)? |
| static int | SAVEINTERVAL: New URIs are saved at these intervals (msec) when possible. |
| static boolean | SCRUB: Scrub all URIs to a standard format? |
| static long | SRVEMPTY: How long is a server parked when its queue is empty? |
| static long | SRVIPLOCKED: How long is a server parked when its IP is currently locked? |
| static long | SRVRECYCLE: How long is a server parked when it is exhausted? |
| static long | SRVTCPERROR: How long is a server parked when it is unreachable? |
| static long | SRVUNRESOLVED: How long is a server parked when it cannot be resolved? |
| static String | STATCHUNKFILENAME: Filename of the log file with statistics for successfully processed pages. |
| static int | STATCHUNKLEN: Chunk length of the STATCHUNKFILENAME reporter. |
| static int | TRANSMITTERPORT: UDP SAX events transmitter port. |
| static int | TURNUPWHEEL: How many requests are parsed in synchronous mode. |
| static Config.TypeURIRepository | URIREPOSITORY: Which URI repository is used inside T0 to assign IDs to URIs. |
| static int | URISCACHE: An internal cache which is used in Table or BijectInt2StringAppender. |
| static String | URISCHUNKFILENAME: Filename of the log file with all new URIs accepted. |
| static int | URISCHUNKLEN: Chunk length of the URISCHUNKFILENAME reporter. |
| static String | USERAGENT: User agent string of our robot. |
| static String | VR: Version number of the robot. |
| static int | WWWRETRIES: How many times do we restart an HTTP connection? |
| static int | WWWTIMEOUT: How long we wait for data. |

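
The delay fields above (DOCDELAY, DOCDELAYONFAIL) describe how soon the same URI may be requested again. The following is a minimal sketch of how such a rule could be enforced, not the robot's actual code. The class `RefetchGate`, its methods, and both constant values are hypothetical; this page does not state the real defaults or how the robot applies them.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of the documented API. */
class RefetchGate {
    // Assumed values in milliseconds; the real defaults are not given on this page.
    static final long DOC_DELAY = 60_000L;
    static final long DOC_DELAY_ON_FAIL = 300_000L;

    // For each URI, the earliest time (msec) at which it may be requested again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    /** Returns true if the URI was never fetched or its delay has elapsed. */
    boolean mayFetch(String uri, long nowMillis) {
        Long earliest = nextAllowed.get(uri);
        return earliest == null || nowMillis >= earliest;
    }

    /** Records a fetch attempt and schedules the next allowed request. */
    void recordFetch(String uri, long nowMillis, boolean failed) {
        long delay = failed ? DOC_DELAY_ON_FAIL : DOC_DELAY;
        nextAllowed.put(uri, nowMillis + delay);
    }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic; a caller would typically supply `System.currentTimeMillis()`.
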
| Constructor and Description |
|---|
| Config(): Empty constructor, used by test scripts only. |

| Modifier and Type | Method and Description |
|---|---|
| static Escape | acceptToDownload(String contentType, String contentLength): Tests whether we are interested in downloading a document of a given content type and suggested content length (both are HTTP headers sent to us). |
| static boolean | allowedPage(String name) |
| static boolean | allowedServer(String name) |
| static boolean | allowedToBackup(String contentType) |
| static String | explain(URI uri) |
| void | exportTo(DataOutputStream dos): Saves this configuration into an output stream. |
| boolean | hasSomeBackupConditions(): Returns true if this configuration defines any backup rules. |
| ArrayList<String> | initialize(String filename): Loads a configuration from a file. |
| URI | normalize(URI uri): First normalizes the URI and then tests whether the resulting URI is valid under the configuration specified by this class. |
| void | shutdown(): Closes any resources this object may hold. |

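
The two-step behaviour described for normalize(URI uri), normalization followed by a validity test, can be illustrated with the standard `java.net.URI` class. This is only a sketch under stated assumptions: the class `NormalizeSketch`, the `MAX_URI_LEN` value, the HTTP-only rule, and the choice to return `null` for a rejected URI are all hypothetical, since this page does not say which rules Config applies or what it returns on rejection.

```java
import java.net.URI;

/** Hypothetical illustration, not the Config implementation. */
class NormalizeSketch {
    // Assumed limit, standing in for a setting such as MAXURILEN.
    static final int MAX_URI_LEN = 1024;

    /** Returns the normalized URI, or null if it fails the assumed checks. */
    static URI normalizeAndCheck(URI uri) {
        // Step 1: remove "." and ".." path segments (standard java.net.URI behaviour).
        URI normalized = uri.normalize();

        // Step 2: apply simple validity rules (assumed: HTTP(S) only, host required).
        String scheme = normalized.getScheme();
        boolean isHttp = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        if (!isHttp || normalized.getHost() == null) {
            return null;
        }
        if (normalized.toString().length() > MAX_URI_LEN) {
            return null;
        }
        return normalized;
    }
}
```

For example, `normalizeAndCheck(URI.create("http://example.org/a/./b/../c"))` yields `http://example.org/a/c`, while a `mailto:` URI is rejected by the assumed scheme rule.
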
public static boolean DEBUG
public static String USERAGENT
public static String ROBOTID
public static final String VR
public static String DNSSERVER
public static final int RESPONSESIZE
public static int MAXIMUSTOTALUS
public static int IP_PAUSE
public static int IP_VIRTUAL
public static int MAXURILEN
public static int CAPACITY
public static int RESOLVEQSIZE
public static int FETCHQSIZE
public static int DNSTIMEOUT
public static int WWWTIMEOUT
public static int DNSRETRIES
public static int WWWRETRIES
public static int MAXFAILURESINBATCH
See also: MAXCONNECTIONSINSEQ

public static long ROBOTSTXT
public static long DOCDELAY
public static long DOCINITDELAY
public static long DOCDELAYONFAIL
public static long SRVEMPTY
public static long SRVUNRESOLVED
public static long SRVIPLOCKED
public static long SRVTCPERROR

public static long SRVRECYCLE
See also: MAXCONNECTIONSINSEQ

public static int NEWURISCACHE
An internal cache which is used in LinksCollector.

public static int BUCKETSCACHE
Number of cached Table objects.

public static int URISCACHE
An internal cache which is used in Table or BijectInt2StringAppender.

public static int NODESCACHE
An internal cache which is used in Directory.

public static int IMGBUCKETSCACHE
Number of cached Table objects when storing IMG-SRC URIs.
See also: ImageLinksExtractor

public static int IMGURISCACHE
An internal cache which is used in Table when storing IMG-SRC URIs.
See also: ImageLinksExtractor

public static int IMGNODESCACHE
An internal cache which is used in Directory when storing IMG-SRC URIs.
See also: ImageLinksExtractor

public static int SAVEINTERVAL
public static String URISCHUNKFILENAME

public static int URISCHUNKLEN
Chunk length of the URISCHUNKFILENAME reporter. Valid values are:
See also: T0

public static String STATCHUNKFILENAME

public static int STATCHUNKLEN
Chunk length of the STATCHUNKFILENAME reporter. Valid values are:
See also: T5

public static String IMGSLISTCHUNKFILENAME
See also: ImageLinksExtractor, Reporter

public static int IMGSCHUNKLEN
Chunk length of the IMGSLISTCHUNKFILENAME and IMGSSTRUCTCHUNKFILENAME reporters. Valid values are:
See also: ImageLinksExtractor, Response

public static String IMGSSTRUCTCHUNKFILENAME
See also: ImageLinksExtractor, Reporter

public static long MAXSERVERS
public static long MAXPAGES
public static int MAXCONNECTIONSINSEQ

public static int BUCKETSIZE
See also: Bucket

public static int INDEXCHUNK

public static int POSTQUEUESIZE
How many Response objects are in post-process queues.

public static int TURNUPWHEEL
See also: T5

public static int BACKUPRESPONSES
See also: StorageEscape

public static int MAXURISFORBLINDHASH
See also: FastBlindAppender, T0

public static Config.TypeURIRepository URIREPOSITORY
Which URI repository is used inside T0 to assign IDs to URIs.

public static boolean SCRUB
See also: Scrub

public static int PREFSORTED
See also: SequentialQueue

public static int MAXITEMSINQUEUE
See also: PageSequentialQueue

public static int TRANSMITTERPORT
See also: Transmitter

public void exportTo(DataOutputStream dos) throws IOException
Parameters: dos - the output stream
Throws: IOException - when the data cannot be written

public ArrayList<String> initialize(String filename) throws IOException
Parameters: filename - the filename
Throws: IOException - on I/O error
See also: normalize(java.net.URI)

public URI normalize(URI uri)
Parameters: uri - the entry URI
See also: valid(java.net.URI)

public static Escape acceptToDownload(String contentType, String contentLength)
Parameters: contentType, contentLength

public static boolean allowedServer(String name)
Parameters: name

public static boolean allowedPage(String name)
Parameters: name

public static boolean allowedToBackup(String contentType)
Parameters: contentType

public boolean hasSomeBackupConditions()
public void shutdown()
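
acceptToDownload takes the Content-Type and Content-Length header values as strings. As a rough sketch of what such a test might involve, the example below checks both headers against assumed rules. Everything here is hypothetical: the class `AcceptSketch`, the "text/ only" rule, the size limit, and the boolean return type (the real method returns an `Escape` object whose semantics are not described on this page).

```java
import java.util.Locale;

/** Hypothetical illustration, not the Config implementation. */
class AcceptSketch {
    // Assumed limit in bytes, standing in for a setting such as MAXIMUSTOTALUS.
    static final long MAX_DOCUMENT_SIZE = 1_000_000L;

    /** Returns true if the headers describe a document the sketch would download. */
    static boolean accept(String contentType, String contentLength) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=utf-8" and compare case-insensitively.
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (!type.startsWith("text/")) {
            return false; // assumed rule: only textual documents are of interest
        }
        if (contentLength == null) {
            return true; // Content-Length is optional in HTTP responses
        }
        try {
            return Long.parseLong(contentLength.trim()) <= MAX_DOCUMENT_SIZE;
        } catch (NumberFormatException e) {
            return false; // a malformed length is treated as a rejection in this sketch
        }
    }
}
```

Both arguments stay strings, as in the documented signature, so the caller can pass header values through without parsing them first.
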
Copyright © 2016 Egothor. All Rights Reserved.