Robot
The robot consists of five major components:
- Corpus
- Scheduler
- Link DB
- Manager
- Client
Michelangelo is a delta indexer; it is not part of the robot.
Corpus
This component is able to
- load and save gathered documents,
- allocate a new identifier (ID) for a URI,
- find the ID assigned to a given URI,
- find the URI associated with a given ID.
When a new URI enters the system, it gets a new unique ID. Other components use this ID instead of the long string representation.
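The URI-to-ID bookkeeping described above can be sketched as follows. This is only a minimal in-memory illustration; the class name `UriTable` and its methods are hypothetical, and the real Corpus presumably persists this mapping alongside the documents.

```python
from typing import Dict, List, Optional, Tuple


class UriTable:
    """Hypothetical in-memory URI <-> ID table (not the robot's actual storage)."""

    def __init__(self) -> None:
        self._id_by_uri: Dict[str, int] = {}
        self._uri_by_id: List[str] = []  # the list index doubles as the ID

    def get_or_allocate(self, uri: str) -> Tuple[int, bool]:
        """Return (id, is_new); an unseen URI gets the next free ID."""
        existing = self._id_by_uri.get(uri)
        if existing is not None:
            return existing, False
        new_id = len(self._uri_by_id)
        self._uri_by_id.append(uri)
        self._id_by_uri[uri] = new_id
        return new_id, True

    def find_id(self, uri: str) -> Optional[int]:
        """Return the ID assigned to a URI, or None if the URI is unknown."""
        return self._id_by_uri.get(uri)

    def find_uri(self, doc_id: int) -> str:
        """Return the URI behind an ID (raises IndexError for an unknown ID)."""
        return self._uri_by_id[doc_id]
```

The `is_new` flag is an assumption of this sketch: it lets a caller tell a freshly allocated ID from an existing one in a single call.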
Scheduler
This component is able to
- schedule IDs on a time axis,
- retrieve the ID that should be processed next.
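One common way to place IDs on a time axis is a priority queue keyed by due time. The sketch below assumes that approach; the class `TimeAxisScheduler`, its method names, and the use of plain float timestamps are all hypothetical and say nothing about how the actual Scheduler is implemented.

```python
import heapq
from typing import List, Optional, Tuple


class TimeAxisScheduler:
    """Hypothetical scheduler: a min-heap of (due_time, doc_id) pairs."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int]] = []

    def schedule(self, doc_id: int, due_time: float) -> None:
        """Place an ID on the time axis at due_time."""
        heapq.heappush(self._heap, (due_time, doc_id))

    def pop_due(self, now: float) -> Optional[int]:
        """Return the ID with the earliest due time, or None if nothing is due yet."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None
```

With a heap, both scheduling and retrieval cost O(log n), and the earliest entry is always at the front.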
Link DB
Link DB only keeps track of the links among pages. It is maintained so that various semantic ranks can be computed.
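As a rough illustration, a link store can be modelled as a directed graph over page IDs that is indexed in both directions, since rank computations typically need to look up incoming as well as outgoing links. The class `LinkGraph` and its methods are hypothetical; the actual Link DB layout is not described here.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class LinkGraph:
    """Hypothetical link store: directed edges between page IDs."""

    def __init__(self) -> None:
        self._out: DefaultDict[int, Set[int]] = defaultdict(set)
        self._in: DefaultDict[int, Set[int]] = defaultdict(set)

    def add_link(self, src_id: int, dst_id: int) -> None:
        """Record that page src_id links to page dst_id (duplicates are ignored)."""
        self._out[src_id].add(dst_id)
        self._in[dst_id].add(src_id)

    def outgoing(self, doc_id: int) -> Set[int]:
        """Return a copy of the set of pages that doc_id links to."""
        return set(self._out.get(doc_id, ()))

    def incoming(self, doc_id: int) -> Set[int]:
        """Return a copy of the set of pages that link to doc_id."""
        return set(self._in.get(doc_id, ()))
```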
Manager
This is the central component which communicates with Corpus and Scheduler. It is able to
- accept URIs discovered by crawling processes - incoming URIs are translated to internal IDs in Corpus. If a URI is new, its ID is sent to Scheduler.
- retrieve URIs which should be processed by a crawling process - IDs are popped from Scheduler, translated back to strings via Corpus and sent to the crawling processes.
- save Responses (documents) - they are simply routed to Corpus.
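The three Manager operations above can be sketched in one small class. To keep the example self-contained, Corpus and Scheduler are replaced by plain dictionaries and a FIFO queue, so there is no time axis here. The class `ManagerSketch` and every method name are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


class ManagerSketch:
    """Hypothetical Manager wiring; Corpus and Scheduler are simple stand-ins."""

    def __init__(self) -> None:
        self._id_by_uri: Dict[str, int] = {}    # stand-in for Corpus: URI -> ID
        self._uri_by_id: Dict[int, str] = {}    # stand-in for Corpus: ID -> URI
        self._documents: Dict[int, bytes] = {}  # stand-in for Corpus document storage
        self._queue: Deque[int] = deque()       # stand-in for Scheduler (FIFO only)

    def accept_uris(self, uris: List[str]) -> None:
        """Translate URIs to IDs; only IDs of new URIs go to the scheduler."""
        for uri in uris:
            if uri in self._id_by_uri:
                continue  # already known, so it is not scheduled again
            new_id = len(self._id_by_uri)
            self._id_by_uri[uri] = new_id
            self._uri_by_id[new_id] = uri
            self._queue.append(new_id)

    def next_uris(self, limit: int) -> List[str]:
        """Pop up to `limit` IDs and translate them back to URI strings."""
        batch: List[str] = []
        while self._queue and len(batch) < limit:
            batch.append(self._uri_by_id[self._queue.popleft()])
        return batch

    def save_response(self, uri: str, body: bytes) -> None:
        """Route a gathered document to storage (KeyError for an unknown URI)."""
        self._documents[self._id_by_uri[uri]] = body

    def document_for(self, uri: str) -> bytes:
        """Return the stored document for a URI (helper for inspection only)."""
        return self._documents[self._id_by_uri[uri]]
```

In this sketch a crawling process would call `accept_uris` with the links it discovered, `next_uris` to get work, and `save_response` to hand back each gathered document.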
Client
This component connects to Manager and retrieves the URIs that should be gathered. The Responses are sent back to Manager.
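The Client's fetch-and-report cycle might look roughly like the loop below. The transport to Manager is not specified in this document, so the sketch takes three callables as parameters; `run_client`, `next_uris`, `save_response`, `fetch` and `batch_size` are all hypothetical names.

```python
from typing import Callable, List


def run_client(
    next_uris: Callable[[int], List[str]],
    save_response: Callable[[str, bytes], None],
    fetch: Callable[[str], bytes],
    batch_size: int = 10,
) -> int:
    """Pull URIs until none are left; return the number of documents gathered."""
    gathered = 0
    while True:
        batch = next_uris(batch_size)  # ask the manager for work
        if not batch:
            return gathered  # assumption: an empty batch means "stop"
        for uri in batch:
            save_response(uri, fetch(uri))  # send the Response back
            gathered += 1
```

A long-running client would more likely wait and retry on an empty batch instead of returning; stopping simply keeps the sketch finite.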
Summary
The system can run an unlimited number of Clients. With some rework in Manager, it would also be possible to run an unlimited number of Schedulers, Corpuses and Managers. This would allow the system to scale well on large domains.