A collector is an Enonic application component that fetches data from a specific source and adds it to one (or more) Explorer collections.
Collectors are distributed as regular XP applications, and a single application may contain multiple collectors. The first step is therefore to install the app that provides the collector you need.
When you create a collection, select the collector, and configure it using the form that appears.
Each collector may provide special configuration forms that allow you to tune how it will work for a specific collection.
|A collection may only have a single collector associated with it.|
Collectors run as background jobs, and may either be executed manually, or via a schedule. When running, the collector will (hopefully) fill the associated collection with documents.
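The collect-and-persist cycle described above can be sketched roughly as follows. Note that the function names here (`fetchItems`, `persistDocument`) are hypothetical stand-ins for illustration, not Explorer's actual collector API:

```javascript
// Purely illustrative shape of one collector run.
// fetchItems and persistDocument are hypothetical placeholders.
function runCollector(config, fetchItems, persistDocument) {
  let count = 0;
  for (const item of fetchItems(config)) {
    // Each item fetched from the source becomes a document
    // in the collection associated with this collector.
    persistDocument(config.collectionName, item);
    count++;
  }
  return count; // number of documents written in this run
}
```

A scheduler (or a manual trigger) would invoke such a run repeatedly; each run refreshes the collection's documents.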
Explorer ships with a standard collector called Webcrawler. This is a simple web crawler that can index traditional server-side rendered websites, and it is also useful for testing purposes.
The webcrawler only processes HTML pages. It will extract text from the HTML document, effectively removing any element tags.
|Text within <script>, <nav> and <footer> elements will automatically be removed before the document is stored in the collection.|
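As a rough sketch of this extraction step, the following is a simplified regex-based version (not Explorer's actual implementation) that drops the listed elements and then strips the remaining tags:

```javascript
// Simplified illustration of the webcrawler's text extraction:
// remove <script>, <nav> and <footer> elements entirely, then
// strip remaining tags and collapse whitespace.
function extractText(html) {
  return html
    // drop the listed elements along with their content
    .replace(/<(script|nav|footer)\b[\s\S]*?<\/\1>/gi, ' ')
    // strip all remaining tags, keeping their text content
    .replace(/<[^>]+>/g, ' ')
    // collapse runs of whitespace into single spaces
    .replace(/\s+/g, ' ')
    .trim();
}
```

For example, `extractText('<nav>menu</nav><p>Hello <b>world</b></p><script>var x=1;</script>')` yields `'Hello world'`.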
The webcrawler collector has the following configuration options.
- Start URL
This is the URL at which the collector will start crawling, for instance
https://market.enonic.com. Only pages within the specified domain will be indexed.
- Exclude patterns
To avoid indexing all pages within a domain, you may specify exclude patterns.
Exclude patterns are specified as regular expressions, matched against each link's path relative to the Start URL's domain root.
Example: a pattern like `/admin/.*` would ignore all links that start with `/admin/` (illustrative pattern).
Example: a pattern like `.*\.pdf$` would ignore pages ending with `.pdf` (illustrative pattern).
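The matching step can be sketched as below, assuming patterns are regular expressions tested against a link's path (this is an illustrative check, not Explorer's code):

```javascript
// Illustrative exclude-pattern check: a URL is excluded when any
// configured pattern matches its path relative to the domain root.
function isExcluded(url, excludePatterns) {
  const path = new URL(url).pathname; // e.g. '/docs/page.html'
  return excludePatterns.some((pattern) => new RegExp(pattern).test(path));
}
```

With patterns `['/docs/.*', '.*\.pdf$']`, a link to `/docs/page.html` or `/files/report.pdf` would be skipped, while `/about` would still be crawled.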
- User agent
Some web servers handle user agents differently. Here you can pretend to be a specific browser or robot. If you leave the field empty, a default user agent is used.
- Max pages
Some websites dynamically generate pages. To avoid infinite crawling sessions, you may specify a maximum number of pages to crawl. The default value is 1000.
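The page bound can be pictured as a visited-set crawl loop; `fetchLinks` here is a hypothetical stand-in for fetching a page and extracting its links, not part of Explorer's API:

```javascript
// Sketch of a bounded breadth-first crawl: stop once maxPages
// distinct pages have been visited, even if links remain queued.
function crawl(startUrl, fetchLinks, maxPages = 1000) {
  const visited = new Set();
  const queue = [startUrl];
  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue; // skip already-crawled pages
    visited.add(url);
    for (const link of fetchLinks(url)) {
      if (!visited.has(link)) queue.push(link);
    }
  }
  return [...visited]; // the pages actually crawled
}
```

Without the `maxPages` cap, a site that generates pages dynamically (e.g. calendar pages linking ever forward) could keep the loop running indefinitely.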
An exciting feature of Explorer is that you may also build your own collectors. We have prepared a dedicated starter kit and tutorial for building custom collectors to give you a flying start.