Collectors
Introduction
A collector is essentially an Enonic application that can be configured to fetch data from a specific source and add it to one (or more) Explorer collections.
Usage
Install
Collectors are typically distributed as regular XP applications. A single application may contain multiple collectors. As such, the first thing you need to do is install the app.
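If you prefer the command line, applications can also be installed with the Enonic CLI. The following is a minimal sketch, assuming a running XP instance and the CLI installed; the application URL is hypothetical and should be replaced with the actual download URL of the collector app.

# Install an application from a URL (hypothetical URL; replace with the real one)
enonic app install --url https://example.com/path/to/collector-app.jar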
Configure
When you create a collection, select the collector and configure it using the form that appears.
Each collector may provide special configuration forms that allow you to tune how it will work for a specific collection.
A collection may only have a single collector associated with it.
Run
Collectors run as background jobs and may be executed either manually or on a schedule. When running, the collector will (hopefully) fill the associated collection with documents.
The Webcrawler collector
Explorer ships with a standard collector called Webcrawler. This is a simple webcrawler that can be used to index traditional server-side rendered websites, or simply for testing purposes.
The webcrawler only processes HTML pages. It will extract text from the HTML document, effectively removing any element tags.
Text within <script>, <nav> and <footer> elements will automatically be removed before the document is stored in the collection.
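To illustrate the extraction behaviour described above, here is a minimal TypeScript sketch using the cheerio library. This is not Explorer's actual implementation, only an approximation of the described behaviour: remove <script>, <nav> and <footer> elements, then keep the remaining text.

// Approximation of the described extraction (not Explorer's actual code)
import * as cheerio from 'cheerio';

function extractText(html: string): string {
  const $ = cheerio.load(html);
  // Drop elements whose text should not be indexed
  $('script, nav, footer').remove();
  // Collect the remaining text and collapse whitespace
  return $('body').text().replace(/\s+/g, ' ').trim();
}

// Only "Hello world" survives
console.log(extractText('<body><nav>Menu</nav><p>Hello world</p><script>var x = 1;</script></body>'));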
Configuration
The webcrawler collector has the following configuration options:

- URL
  This is the URL at which the collector will start crawling (for instance https://market.enonic.com). Only pages within the specified domain will be indexed.
- Exclude patterns
  To avoid indexing every page within a domain, you may specify exclude patterns. Exclude patterns are regular expressions, matched against each link's path relative to the domain root of the start URL (see the sketch after this list).
  Example: ignore all links that start with /whatever: ^/whatever.*$
  Example: ignore pages ending with .html: .html$
- User-Agent
  Some web servers treat user agents differently. Here you can pretend to be a specific browser or robot. If you leave the field empty, a default user agent is used.
- Max pages
  Some websites generate pages dynamically. To avoid infinite crawling sessions, you may specify a maximum number of pages to crawl. The default value is 1000.
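To make the exclude patterns concrete, the TypeScript sketch below shows how such patterns could be applied when deciding whether to follow a link. The shouldCrawl function is hypothetical and only illustrates the matching logic: each pattern is tested against the path relative to the domain root.

// Hypothetical illustration of exclude-pattern matching (not Explorer's actual code)
const excludePatterns: RegExp[] = [/^\/whatever.*$/, /\.html$/];

function shouldCrawl(url: URL): boolean {
  // A link is skipped if any pattern matches its path
  return !excludePatterns.some((pattern) => pattern.test(url.pathname));
}

console.log(shouldCrawl(new URL('https://market.enonic.com/vendors')));       // true
console.log(shouldCrawl(new URL('https://market.enonic.com/whatever/page'))); // false, matches ^/whatever.*$
console.log(shouldCrawl(new URL('https://market.enonic.com/page.html')));     // false, matches .html$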
Headless browser
Available from Explorer 4.3.0.
You can set up a headless browser to render pages that require JavaScript to be executed. This is done by setting the browserlessUrl property in the application configuration file:

browserlessUrl = http://localhost:3000/content
You can download and run browserless via Docker.
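As a starting point, browserless can be run locally with Docker along these lines. The image name and port are typical defaults, not taken from this documentation; check the browserless documentation for the current image and options.

# Run browserless locally on port 3000 (typical setup; verify against the browserless docs)
docker run -p 3000:3000 browserless/chrome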
Custom collectors
An exciting feature of Explorer is that you may also build your own collectors. We have prepared a dedicated starter kit and tutorial for building custom collectors to give you a flying start.