Collectors

Introduction

A collector is essentially an Enonic application that can be configured to fetch data from a specific source and add it to one (or more) Explorer collections.

Usage

Install

Collectors are typically distributed as regular XP applications. A single application may contain multiple collectors. As such, the first thing you need to do is install the app.
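
If you use the Enonic CLI, one way to install an application into a running XP instance is the app install command (the URL below is a placeholder; applications can also be installed from the Applications admin tool):

enonic app install --url <URL-to-application-JAR>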

Configure

When you create a collection, select a collector and configure it using the form that appears.

Each collector may provide special configuration forms that allow you to tune how it will work for a specific collection.

A collection may only have a single collector associated with it.

Run

Collectors run as background jobs and may be executed either manually or on a schedule. When running, the collector will (hopefully) fill the associated collection with documents.

The Webcrawler collector

Explorer ships with a standard collector called Webcrawler. This is a simple web crawler that can be used to index traditional server-side rendered websites, or simply for testing purposes.

The webcrawler only processes HTML pages. It extracts the text from each HTML document, stripping all element tags.

Text within <script>, <nav> and <footer> elements will automatically be removed before the document is stored in the collection.
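
The following TypeScript sketch illustrates the effect using the cheerio library. It is only an illustration of the behavior described above, not Explorer's actual implementation:

import * as cheerio from "cheerio";

// Load the HTML, drop the elements that should not be indexed,
// and collapse the remaining markup to plain text.
function extractText(html: string): string {
  const $ = cheerio.load(html);
  $("script, nav, footer").remove();
  return $("body").text().replace(/\s+/g, " ").trim();
}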

Configuration

The webcrawler collector has the following configuration options.

URL

This is the URL at which the collector will start crawling. For instance https://market.enonic.com. Only pages within the specified domain will be indexed.
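
In other words, the crawl is restricted to the start URL's domain, roughly as in this TypeScript sketch (an illustration only; the helper name inScope is made up):

// A link is in scope only if it resolves to the same hostname as the start URL.
function inScope(link: string, startUrl: string): boolean {
  return new URL(link, startUrl).hostname === new URL(startUrl).hostname;
}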

Exclude patterns

To avoid indexing all pages within a domain, you may specify exclude patterns.

Exclude patterns are specified as regular expressions. Each expression is matched against the URL path, relative to the domain root of the start URL.

Example: Ignore all links that start with /whatever

^/whatever.*$

Example: Ignore pages ending with .html

\.html$
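
To make the matching concrete, here is a TypeScript sketch of how exclude patterns could be applied (an illustration only; the helper name isExcluded is hypothetical):

// Test each exclude pattern against the URL path,
// relative to the domain root of the start URL.
const excludePatterns: RegExp[] = [/^\/whatever.*$/, /\.html$/];

function isExcluded(link: string, startUrl: string): boolean {
  const path = new URL(link, startUrl).pathname; // e.g. "/whatever/page"
  return excludePatterns.some((re) => re.test(path));
}

isExcluded("/whatever/page", "https://market.enonic.com"); // true
isExcluded("/docs/index.html", "https://market.enonic.com"); // true
isExcluded("/start", "https://market.enonic.com"); // false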

User-Agent

Some web servers serve different content depending on the user agent. This option lets the crawler identify itself as a specific browser or robot. If you leave the field empty, a default user agent is used.
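
For example, a value identifying the crawler as a custom bot might look like this (ExampleBot is a made-up name):

Mozilla/5.0 (compatible; ExampleBot/1.0)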

Max pages

Some websites dynamically generate pages. To avoid infinite crawling sessions, you may specify a maximum number of pages to crawl. The default value is 1000.

Headless browser

Available from Explorer 4.3.0.

You can set up a headless browser to render pages that require JavaScript to be executed. This is done by setting the browserlessUrl property in the application configuration file.

${XP_HOME}/config/com.enonic.app.explorer.cfg:

browserlessUrl = http://localhost:3000/content

You can download and run browserless via Docker.
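
For instance, something like the following could start it locally (the browserless/chrome image name and port mapping are assumptions; check the browserless documentation for current instructions):

docker run --rm -p 3000:3000 browserless/chrome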

Custom collectors

An exciting feature of Explorer is that you may also build your own collectors. We have prepared a dedicated starter kit and tutorial for building custom collectors to give you a flying start.

