Building a custom collector

Upgrading

If you are upgrading a collector to lib-explorer-4 / app-explorer-2, see upgrade.ahtml.

Introduction

Collectors are implemented as standard XP applications. They essentially consist of two main components:

Form

A React-based user interface that enables search administrators to instruct the collector in its specific activities.

Task

Code that performs the actual retrieval of content for indexing.

This section will guide you through the basic steps of building your own custom collector.

Step 1: Create application

Use Enonic CLI to create an application based on the vanilla starter:

enonic project create -r starter-explorer-collector

The starter ships with the essential setup required for any collector:

Dependencies

gradle.properties

gradle.properties
xpVersion=7.10.0

build.gradle

build.gradle
dependencies {
	include 'com.enonic.lib:lib-explorer:4.0.0'
}
Remember to use the latest version of the library that is compatible with your version of Explorer and XP.

Installation

collectors.json

You can include multiple collectors in a single Enonic XP application. If you only include one, the collectors.json file should still contain an array with a single object entry.

Each collector needs its own unique library name so that React can access it on the window object.

webpack.config.babel.js
const config = {
	entry: {
		'MyCollectorNameA': './CollectorA.jsx',
		'MyCollectorNameB': './CollectorB.jsx',
	},
	output: {
		filename: '[name].esm.js',
		library: 'Lib[name]',
		libraryTarget: 'var',
	}
}
/src/main/resources/collectors.json
[{
	"componentPath": "window.LibMyCollectorNameA.Collector",
	"configAssetPath": "react/MyCollectorNameA.esm.js",
	"displayName": "My collector A",
	"taskName": "collectA"
},{
	"componentPath": "window.LibMyCollectorNameB.Collector",
	"configAssetPath": "react/MyCollectorNameB.esm.js",
	"displayName": "My collector B",
	"taskName": "collectB"
}]
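
The naming relationship between the webpack entry keys and the collectors.json entries can be sketched with a small helper. This is only an illustration: the function toCollectorEntry is hypothetical and not part of the starter or lib-explorer. It assumes the 'Lib[name]' library pattern, the '[name].esm.js' filename pattern and the react/ asset folder used in the listings above.

```typescript
// Hypothetical helper, not part of lib-explorer or the starter.
interface CollectorEntry {
	componentPath: string;
	configAssetPath: string;
	displayName: string;
	taskName: string;
}

function toCollectorEntry(
	entryName: string,
	displayName: string,
	taskName: string
): CollectorEntry {
	return {
		// output.library 'Lib[name]' with libraryTarget 'var' exposes window.Lib<entryName>
		componentPath: `window.Lib${entryName}.Collector`,
		// output.filename '[name].esm.js'; the react/ folder is taken from the example above
		configAssetPath: `react/${entryName}.esm.js`,
		displayName,
		taskName
	};
}

const collectors: CollectorEntry[] = [
	toCollectorEntry('MyCollectorNameA', 'My collector A', 'collectA'),
	toCollectorEntry('MyCollectorNameB', 'My collector B', 'collectB')
];
console.log(JSON.stringify(collectors, null, '\t'));
```

If the webpack entry key and the name used in componentPath drift apart, the component will not be found on the window object, so deriving both from one name can help avoid typos.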

documentTypes.json

In the Explorer admin GUI, when you create or edit a collection and select a collector, the option to choose a default document-type is hidden. This is because a collector is supposed to provide its own document-type(s). You can do this by including a documentTypes.json file in the src/main/resources folder. The JSON file contains an array of objects, so you can provide multiple document-types.

Make sure that the _name property is unique. You may want to prefix the name to avoid name collision with previously installed document-types. Also keep in mind that _name is lowercased and ascii-folded upon installation.

Currently document-types are installed when an app that contains a documentTypes.json file is started. If the document-type _name already exists, it is not overwritten.
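
To get a feel for what lowercasing and ascii-folding can do to a name, here is a rough approximation. It is an assumption about the general technique, not the code Explorer actually runs, and the function name approximateDocumentTypeName is made up for this sketch.

```typescript
// Illustration only: an approximation of "lowercased and ascii-folded".
// The actual normalization performed by Explorer may differ.
function approximateDocumentTypeName(name: string): string {
	return name
		.normalize('NFKD')               // split accented characters into base letter + combining mark
		.replace(/[\u0300-\u036f]/g, '') // drop the combining marks
		.toLowerCase();
}

console.log(approximateDocumentTypeName('MyCaf\u00e9_Menu')); // mycafe_menu
```

Two names that differ only in case or accents would end up identical after such a step, which is one more reason to use a distinctive prefix.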

/src/main/resources/documentTypes.json
[{
	"_name": "mydocumenttypename", (1)
	"addFields": false, (2)
	"properties": [{
		"active": true, (3)
		"enabled": true, (4)
		"fulltext": true, (4)
		"includeInAllText": true, (4)
		"max": 0, (5)
		"min": 0, (6)
		"name": "text", (7)
		"nGram": true, (4)
		"path": false, (4)
		"valueType": "string" (8)
	},{
		"active": true,
		"enabled": true,
		"fulltext": true,
		"includeInAllText": true,
		"max": 0,
		"min": 0,
		"name": "title",
		"nGram": true,
		"path": false,
		"valueType": "string"
	},{
		"active": true,
		"enabled": true,
		"fulltext": true,
		"includeInAllText": false,
		"max": 0,
		"min": 1, (9)
		"name": "url",
		"nGram": false,
		"path": false,
		"valueType": "string"
	}]
}]
1 The documentType name must be unique. It’s automatically lowercased and ascii-folded to match /^[a-z][a-z0-9_]*$/
2 If your collector stores "dynamic" data, that is, fields it doesn’t know about in advance, set addFields to true. persistDocument will then try to figure out the correct valueType for each such field and add it to the installed document-type.
3 Deleting a field can break a GraphQL interface query that uses a "... on DocumentTypeName" fragment. Simply deactivating the field is safe.
4 See Config Options.
5 Setting max to 0 means there is no limit on how many values the field can have.
6 Setting min to 0 means the field is optional. Setting it to anything larger than 0 means it’s a required field.
7 The name of a field must be unique and match the following regexp /^[a-z][a-z0-9_]*$/
8 See Value types.
9 Setting the min property to 1 means the field is a required field.
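
The rules in the callouts above (field name pattern, unique names, min and max semantics) can be checked before you ship a documentTypes.json. The following is a minimal sketch of such a pre-flight check. The names checkProperties and isRequired are hypothetical, and the "min larger than max" rule is an assumption added for illustration. Whatever checks Explorer itself performs are not shown here.

```typescript
interface DocumentTypeProperty {
	name: string;
	min: number;
	max: number;
}

// Pattern quoted in callout 7.
const FIELD_NAME_PATTERN = /^[a-z][a-z0-9_]*$/;

// Hypothetical pre-flight check, returns a list of human readable problems.
function checkProperties(properties: DocumentTypeProperty[]): string[] {
	const problems: string[] = [];
	const seen = new Set<string>();
	for (const {name, min, max} of properties) {
		if (!FIELD_NAME_PATTERN.test(name)) {
			problems.push(`Invalid field name: ${name}`);
		}
		if (seen.has(name)) {
			problems.push(`Duplicate field name: ${name}`);
		}
		seen.add(name);
		// max 0 means unlimited, so only compare when a limit is set (assumed rule)
		if (max !== 0 && min > max) {
			problems.push(`min is larger than max for field: ${name}`);
		}
	}
	return problems;
}

// min 0 means optional, anything larger means required.
function isRequired(property: DocumentTypeProperty): boolean {
	return property.min > 0;
}

console.log(checkProperties([
	{name: 'text', min: 0, max: 0},
	{name: 'url', min: 1, max: 0}
])); // no problems for the fields used in this guide
```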

Deprecations

Register
The register function is deprecated and will throw an error!

Simply remove it from src/main/resources/main.ts

Unregister
The unregister function is deprecated and will log a warning.

Simply remove it from src/main/resources/main.ts

Step 2: Configuration

The starter also provides the essential build system for the React-based user interface.

Some important ingredients that enable this are:

  • node-gradle-plugin

  • webpack

  • babel

  • node_modules

    • @enonic/webpack-esm-assets

    • @enonic/webpack-server-side-js

    • semantic-ui-react

React component

For your collector’s configuration user interface to work in Explorer, you must provide a React component. Any type of React component should be supported, but all examples here are function components (the current norm in React).

ref

React.forwardRef and React.useImperativeHandle are used in order for the (child) Collector component to provide callbacks that the (parent) Collection component can run when appropriate.

These callbacks are named afterReset and validate:

  • When the [Reset] button is clicked: the (parent) Collection component will reset its state, but whatever state is internal to the (child) Collector component needs to be handled inside its afterReset function.

  • When the [Save] button is clicked: the (parent) Collection component will validate its own input fields, but whatever inputs are provided by the (child) Collector component should be handled inside its own validate function.
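
How a parent might use these two callbacks can be sketched without React, using a plain object with the same { current } shape as a React ref. Everything here is a simplified stand-in: the real Collection component is part of Explorer, and the names CollectorHandle, onSave and onReset are assumptions made for this example.

```typescript
type CollectorConfig = { url?: string };

// Simplified stand-in for the handle the Collector exposes via useImperativeHandle.
interface CollectorHandle {
	afterReset: () => void;
	validate: (collectorConfig: CollectorConfig) => boolean;
}

// Same shape as a React ref object.
interface RefObject<T> { current: T | null }

// Hypothetical parent-side handlers.
function onSave(ref: RefObject<CollectorHandle>, collectorConfig: CollectorConfig): boolean {
	// Without a mounted child there is nothing extra to validate.
	if (!ref.current) { return true; }
	return ref.current.validate(collectorConfig);
}

function onReset(ref: RefObject<CollectorHandle>): void {
	if (ref.current) { ref.current.afterReset(); }
}

// Usage with a fake child handle:
let resetCount = 0;
const ref: RefObject<CollectorHandle> = {
	current: {
		afterReset: () => { resetCount += 1; },
		validate: ({url}) => !!url
	}
};
console.log(onSave(ref, {url: ''}));                    // false
console.log(onSave(ref, {url: 'https://example.com'})); // true
onReset(ref);
```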

props

The Collector component receives four props from Explorer:

  1. collectorConfig - Read only object which is changed by calling setCollectorConfig.

  2. explorer - static information like contentTypes, fields and sites

  3. setCollectorConfig - setState function used to change the collectorConfig object.

  4. setCollectorConfigErrorCount - setState function to change how many validation errors exist.

collectorConfig

This object contains whatever configuration options you define in order to control your collector.

explorer

This object contains information from Explorer about the collector context. The information can be used to populate dropdowns in your collector’s configuration.

setCollectorConfig

Call this function whenever you need to change some value inside the collectorConfig. Typically it’s used with onChange events.

setCollectorConfigErrorCount

Call this function whenever a validation error occurs, or a validation error is resolved.
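
The interplay between setCollectorConfig and setCollectorConfigErrorCount can be tried out without React. The sketch below uses a tiny stand-in for a setState function. The createState helper and the depth option are hypothetical and only exist to keep the example self-contained.

```typescript
type CollectorConfig = { url?: string; depth?: number };
type Update<T> = T | ((prev: T) => T);

// Minimal stand-in for a React setState function, so the example runs without React.
function createState<T>(initial: T) {
	let value = initial;
	return {
		get: () => value,
		set: (update: Update<T>) => {
			value = typeof update === 'function'
				? (update as (prev: T) => T)(value)
				: update;
		}
	};
}

const config = createState<CollectorConfig>({depth: 2});
const errorCount = createState<number>(0);

// What an onChange handler would typically do:
function onUrlChange(url: string): void {
	// Spread the previous config so that other options are kept.
	config.set(prev => ({...prev, url}));
	// Report one error while the required url is empty, none otherwise.
	errorCount.set(url ? 0 : 1);
}

onUrlChange('');
console.log(config.get(), errorCount.get());
onUrlChange('https://example.com');
console.log(config.get(), errorCount.get());
```

Passing a function to the setter, rather than a new object built from the prop, avoids overwriting changes made to other options in the meantime.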

Example code

I like separation of concerns, so I’ve split the presentation and state logic into separate files:

  • useUpdateEffect.ts Handy react hook

  • useCollectorState.ts State logic management

  • Collector.tsx Presentation code

useUpdateEffect.ts

Handy React hook that makes it possible to run an effect only when a state has changed, and avoid running the effect when the state is first initialized.

src/resources/assets/js/react/useUpdateEffect.ts
import * as React from 'react';


export function useUpdateEffect(
	effect :React.EffectCallback,
	deps :React.DependencyList = []
) {
	const isInitialMount = React.useRef(true);

	React.useEffect(() => {
		if (isInitialMount.current) {
			isInitialMount.current = false;
		} else {
			return effect();
		}
	}, deps);
}
useCollectorState.ts

State logic management.

src/resources/assets/js/react/useCollectorState.ts
import type {
	CollectorComponentRef,
	CollectorComponentAfterResetFunction,
	CollectorComponentValidateFunction
} from '/lib/explorer/types/index.d';
import type {CollectorConfig} from '../../../index.d';


import * as React from 'react';
import {useUpdateEffect} from './useUpdateEffect';


export function useCollectorState({
  collectorConfig,
  ref,
  setCollectorConfig,
  setCollectorConfigErrorCount
} :{
  collectorConfig :CollectorConfig
  ref :CollectorComponentRef<CollectorConfig>
  setCollectorConfig :(param :CollectorConfig|((prevCollectorConfig :CollectorConfig) => CollectorConfig)) => void
  setCollectorConfigErrorCount :(collectorConfigErrorCount :number) => void
}) {
  //──────────────────────────────────────────────────────────────────────────
  // Avoiding derived state by not using useState, just pointing to where in collectorConfig it can be found:
  //──────────────────────────────────────────────────────────────────────────
  const url = collectorConfig ? (collectorConfig.url || '') : '';

  //──────────────────────────────────────────────────────────────────────────
  // State internal to the (child) Collector component:
  //──────────────────────────────────────────────────────────────────────────
  const [urlError, setUrlError] = React.useState<string|undefined>(undefined);
  const [/*urlVisited*/, setUrlVisited] = React.useState(false);

  //──────────────────────────────────────────────────────────────────────────
  // Callbacks, should only depend on props, not state
  //──────────────────────────────────────────────────────────────────────────
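  // Memoized so that the callbacks depending on it are not recreated on every render.
  const validateUrl = React.useCallback((urlToValidate :string) => {
    const newError = !urlToValidate ? 'Url is required!' : undefined;
    setUrlError(newError);
    return !newError;
  }, []); // setUrlError is stable, so no dependencies are needed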

  const urlOnChange = React.useCallback((
    _event :React.ChangeEvent<HTMLInputElement>,
    {value} : {value :string}
  ) => {
    setCollectorConfig(prevCollectorConfig => ({
      ...prevCollectorConfig,
      url: value
    }));
    validateUrl(value);
  }, [
    setCollectorConfig,
    validateUrl
  ]);

  const urlOnBlur = React.useCallback(() => {
    setUrlVisited(true);
    validateUrl(url);
  }, [
    validateUrl,
    url // derived from collectorConfig, so collectorConfig itself is not needed here
  ]);

  //──────────────────────────────────────────────────────────────────────────
  // Updates (changes, not init)
  //──────────────────────────────────────────────────────────────────────────
  useUpdateEffect(() => {
    setCollectorConfigErrorCount(urlError ? 1 : 0);
  }, [
    urlError
  ]);

  //──────────────────────────────────────────────────────────────────────────
  // Callback to be called by the (parent) Collection component
  //──────────────────────────────────────────────────────────────────────────
  const afterReset :CollectorComponentAfterResetFunction = () => {
    setUrlVisited(false);
    setUrlError(undefined);
  };

  const validate = React.useCallback<CollectorComponentValidateFunction<CollectorConfig>>(({
    url: urlToValidate
  } :CollectorConfig) => {
    const newCollectorConfigErrorCount = validateUrl(urlToValidate) ? 0 : 1;
    return !newCollectorConfigErrorCount;
  }, [
    validateUrl
  ]);

  //──────────────────────────────────────────────────────────────────────────
  // Make it possible for parent to call these functions
  //──────────────────────────────────────────────────────────────────────────
  React.useImperativeHandle(ref, () => ({
    afterReset,
    validate
  }));

  return {
    url,
    urlError,
    urlOnBlur,
    urlOnChange
  };
}
Collector.tsx

Presentation code.

src/resources/assets/js/react/Collector.tsx
import type {
	CollectorComponentRef,
	CollectorProps
} from '/lib/explorer/types/index.d';
import type {CollectorConfig} from '../../../index.d';


import * as React from 'react';
import {Form} from 'semantic-ui-react';
import {useCollectorState} from './useCollectorState';


export const Collector = React.forwardRef(
	(
		{
			collectorConfig,
			//explorer,
			setCollectorConfig,
			setCollectorConfigErrorCount
		} :CollectorProps<CollectorConfig>,
		ref :CollectorComponentRef<CollectorConfig>
	) => {
		const {
			url,
			urlError,
			urlOnBlur,
			urlOnChange
		} = useCollectorState({
			collectorConfig,
			ref,
			setCollectorConfig,
			setCollectorConfigErrorCount
		});
		return <Form>
			<Form.Input
				error={urlError}
				fluid
				label='Url'
				onBlur={urlOnBlur}
				onChange={urlOnChange}
				required
				value={url}
			/>
		</Form>;
	} // component
); // forwardRef

Step 3: Task

The actual code to retrieve and return content for indexing is implemented using named tasks.

The most important parts of a collector are:

Progress reporting

The Explorer app has a page that displays collector status. For this page to show useful, up-to-date information, the collector task needs to report its progress. When your collector task calls

collector.start();

a collector.taskProgressObj is created automatically. It looks something like this:

collector.taskProgressObj = {
	current: 0,
	info: {
		name: 'Example',
		message: 'Initializing...',
		startTime: '2020...'
	},
	total: 1 // So it appears there is something to do.
}

A collector task may have a fixed or a changing number of operations to perform. Keep the progress updated along these lines:

collector.start();
collector.taskProgressObj.total = initialNumberOfOperations;
while(somethingToDo) {
	collector.taskProgressObj.info.url = currentUrl;
	collector.taskProgressObj.info.message = 'Some useful information';
	collector.progress(); // This will update task progress. So it can be seen.

	// ... do stuff ...

	collector.taskProgressObj.total += foundSomeMoreOperationsToPerform;

	collector.taskProgressObj.current += 1;
}
collector.stop();

Finally, when your collector task calls

collector.stop();

it sets current = total and info.message to `Finished with ${x} errors.`
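
The progress lifecycle described above can be tried out with a small stand-in class. FakeCollector is not the real Collector class from lib-explorer. It only mimics the behaviour described in this section (start, progress, stop) so that the bookkeeping can be run in isolation, and its reported and errorCount members are inventions of this sketch.

```typescript
interface TaskProgress {
	current: number;
	total: number;
	info: { name: string; message: string; startTime?: string; url?: string };
}

// Stand-in for the real Collector class, reduced to the progress behaviour described above.
class FakeCollector {
	taskProgressObj: TaskProgress = {
		current: 0,
		total: 1, // so it appears there is something to do
		info: { name: 'Example', message: 'Initializing...' }
	};
	reported: string[] = []; // what a status page would have seen
	errorCount = 0;

	start(): void {
		this.taskProgressObj.info.startTime = new Date().toISOString();
	}
	progress(): void {
		const {current, total, info} = this.taskProgressObj;
		this.reported.push(`${current}/${total} ${info.message}`);
	}
	stop(): void {
		this.taskProgressObj.current = this.taskProgressObj.total;
		this.taskProgressObj.info.message = `Finished with ${this.errorCount} errors.`;
		this.progress();
	}
}

const urls = ['https://example.com/a', 'https://example.com/b'];
const collector = new FakeCollector();
collector.start();
collector.taskProgressObj.total = urls.length;
for (const url of urls) {
	collector.taskProgressObj.info.url = url;
	collector.taskProgressObj.info.message = `Collecting ${url}`;
	collector.progress(); // report before doing the work
	collector.taskProgressObj.current += 1;
}
collector.stop();
console.log(collector.reported.join('\n'));
```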

Journal

When a collector task finishes, a journal is persisted. The journal contains information about what went well and any errors that occurred. Write to the journal by using addSuccess or addError like this:

try {
	// ... do some stuff that could fail ...
	collector.addSuccess({message: currentUri});
} catch (e) {
	collector.addError({message: `uri:${currentUri} error:${e.message}`});
}
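
If the same try/catch pattern is repeated for every item, it can be wrapped in a helper. The sketch below assumes only the two calls shown above, addSuccess and addError with a message property. The helper name journaled and the in-memory fake journal are hypothetical.

```typescript
interface JournalWriter {
	addSuccess: (entry: { message: string }) => void;
	addError: (entry: { message: string }) => void;
}

// Hypothetical helper: runs one unit of work and records the outcome.
function journaled<T>(collector: JournalWriter, uri: string, work: () => T): T | undefined {
	try {
		const result = work();
		collector.addSuccess({message: uri});
		return result;
	} catch (e) {
		const reason = e instanceof Error ? e.message : String(e);
		collector.addError({message: `uri:${uri} error:${reason}`});
		return undefined;
	}
}

// Usage with an in-memory journal instead of a real collector:
const successes: string[] = [];
const errors: string[] = [];
const fake: JournalWriter = {
	addSuccess: ({message}) => { successes.push(message); },
	addError: ({message}) => { errors.push(message); }
};
journaled(fake, 'https://example.com/ok', () => 'content');
journaled(fake, 'https://example.com/broken', () => { throw new Error('HTTP 500'); });
console.log(successes, errors);
```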

CRUD

When you have collected some information you want to make available for later search you have to persist it. This can be done by calling persistDocument.

In order to validate and index the information you must specify which documentTypeName the information should match.

Create

const documentToPersist = {
	aField: 'aTag', // perhaps used in aggregation and filtering.
	anotherField: 'anotherTag', // perhaps used in aggregation and filtering.
	text,
	title,
	url, // Since this field is supposed to be unique, it's also required, thus its min property is set to 1 in the document-type
	whatever: 'perhapsAnImageUrl' // perhaps used when displaying search results.
};

collector.persistDocument(
	documentToPersist, {
		// Must be identical to a _name in src/main/resources/documentTypes.json
		documentTypeName: 'mydocumenttypename'
	}
);

Update

If you want to update a document, rather than creating endless new ones, you have to look up and provide the document _id.

Let’s say you have provided a documentType in which a field named 'url' is unique.

const documentsRes = collector.queryDocuments({
	count: 1,
	query: {
		boolean: {
			must: {
				term: {
					field: 'url',
					value: documentToPersist.url
				}
			}
		}
	}
});

if (documentsRes.total > 1) {
	throw new Error(`Multiple documents found with url:${documentToPersist.url}! url is supposed to be unique!`);
} else if (documentsRes.total === 1) {
	// Provide which document node to update (rather than creating a new document node)
	documentToPersist._id = documentsRes.hits[0].id;
}

collector.persistDocument(
	documentToPersist, {
		// Must be identical to a _name in src/main/resources/documentTypes.json
		documentTypeName: 'mydocumenttypename'
	}
);
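
The create-or-update decision can be isolated in a small function, which makes it easy to test without a running Explorer. The sketch assumes a query result with the total and hits[0].id properties used above. The name findExistingId and the sample id are made up.

```typescript
interface QueryResult {
	total: number;
	hits: Array<{ id: string }>;
}

// Hypothetical helper: returns the id of the single matching document, if any.
function findExistingId(res: QueryResult, url: string): string | undefined {
	if (res.total > 1) {
		throw new Error(`Multiple documents found with url:${url}! url is supposed to be unique!`);
	}
	return res.total === 1 ? res.hits[0].id : undefined;
}

const doc: { url: string; _id?: string } = { url: 'https://example.com/' };
const existingId = findExistingId({ total: 1, hits: [{ id: 'abc-123' }] }, doc.url);
if (existingId) {
	doc._id = existingId; // update the existing node instead of creating a new one
}
console.log(doc);
```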

Read/Delete

The Explorer library Collection class currently does not provide any API for reading and deleting documents. You may connect to the collection repositories via standard Enonic APIs or via other currently undocumented Explorer library functions.

Example code

collect.xml

The complexity of a collector may vary, but to give a basic idea, the starter includes a simple example:

src/resources/tasks/collect.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<task>
	<description>Collect</description>
	<form>
		<input name="collectionId" type="TextLine">
			<label>Collection ID</label>
			<occurrences minimum="1" maximum="1"/>
		</input>
		<input name="collectorId" type="TextLine">
			<label>Collector ID</label>
			<occurrences minimum="1" maximum="1"/>
		</input>
		<input name="configJson" type="TextLine">
			<label>Config JSON</label>
			<occurrences minimum="1" maximum="1"/>
		</input>
		<input name="language" type="TextLine">
			<label>Language</label>
			<occurrences minimum="0" maximum="1"/>
		</input>
	</form>
</task>
src/resources/tasks/collect.ts
import '@enonic/nashorn-polyfills'; (1)
import {Collector} from '/lib/explorer/collector'; (2)

export function run({collectionId, collectorId, configJson, language}) { (3)
	const collector = new Collector({collectionId, collectorId, configJson, language}); (4)

	if (!collector.config.url) { (5)
		throw new Error('Config is missing required parameter url!');
	}

	collector.start(); (6)

	const {
		url,
		object: {
			someNestedProperty
		} = {} // default, so a missing "object" doesn't throw
	} = collector.config; (7)

	while(somethingToDo) { // placeholder for your own loop condition
		if (collector.shouldStop()) { break; } (8)

		try {
			const {text, title} = doSomethingThatMayFail(); (9) // placeholder for your own retrieval code

			collector.persistDocument({
				text,
				title,
				url
			}, {
				// Must be identical to a _name in src/main/resources/documentTypes.json
				documentTypeName: 'mydocumenttypename'
			}); (10)

			collector.addSuccess({message: url}); (11)

		} catch (e) {

			collector.addError({message: `url:${url} error:${e.message}`}); (12)

		}
	} // while somethingToDo

	// Perhaps delete documents that are no longer found...

	collector.stop(); (13)

} // export function run
1 Perhaps import polyfills.
2 Import the Collector class.
3 The collect task is passed four named parameters, matching the inputs declared in collect.xml.
4 Construct a Collector instance.
5 Validate the configuration object.
6 Start the collector. Sets startTime and more.
7 Fetch the configuration properties you need from the collector.config object.
8 Check if someone has clicked the STOP button.
9 This is where you collect the data you want to persist.
10 Persist the collected data, stating which document-type it should match.
11 Make a journal entry stating that collecting data from url was a success.
12 Make a journal entry stating that an error prevented collecting data from url.
13 Stop the collector. Sets endTime and more.
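
The task receives its configuration as a JSON string in the configJson parameter, and the example reads the result from collector.config. If you want to check such a string yourself, for example in a unit test, a minimal sketch could look like this. The parseConfig helper and its requiredKeys parameter are assumptions for illustration and not part of lib-explorer.

```typescript
// Hypothetical helper: parses configJson and verifies that required keys have a value.
function parseConfig(configJson: string, requiredKeys: string[]): Record<string, unknown> {
	let parsed: unknown;
	try {
		parsed = JSON.parse(configJson);
	} catch {
		throw new Error('configJson is not valid JSON!');
	}
	if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
		throw new Error('configJson must contain a JSON object!');
	}
	const config = parsed as Record<string, unknown>;
	for (const key of requiredKeys) {
		const value = config[key];
		if (value === undefined || value === null || value === '') {
			throw new Error(`Config is missing required parameter ${key}!`);
		}
	}
	return config;
}

const parsedConfig = parseConfig('{"url":"https://example.com"}', ['url']);
console.log(parsedConfig.url); // https://example.com
```

Failing early, before collector.start() is called, keeps a misconfigured collection from producing a journal full of identical errors.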

Polyfills

Depending on what your Enonic XP server-side code contains, or which node modules you import, you may have to polyfill some JavaScript functionality that the JavaScript engine (Nashorn) doesn’t support.

src/main/resources/lib/nashorn/index.js
require('./global');
require('./Array');
require('./Number');
webpack.config.babel.js
import path from 'path';
const WEBPACK_CONFIG = {
	resolve: {
		alias: {
			// Lets "import '@enonic/nashorn-polyfills'" resolve to the local polyfill index
			'@enonic/nashorn-polyfills': path.resolve(__dirname, 'src/main/resources/lib/nashorn/index.js')
		}
	}
};
export { WEBPACK_CONFIG as default };

Global

src/main/resources/lib/nashorn/global.js
// https://stackoverflow.com/questions/9107240/1-evalthis-vs-evalthis-in-javascript
const global = (1, eval)('this'); // eslint-disable-line no-eval
global.global = global;
global.globalThis = global;
global.frames = global;
global.self = global;
global.window = global;
module.exports = global;

Array.flat

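src/main/resources/lib/nashorn/Array.js
if (!Array.prototype.flat) {
	Object.defineProperty(Array.prototype, 'flat', {
		value: function(depth = 1) {
			// A depth below 1 means no flattening: return a shallow copy.
			if (depth < 1) { return this.slice(); }
			return this.reduce(function (flat, toFlatten) {
				// concat itself flattens one level, so only recurse for deeper levels.
				return flat.concat((Array.isArray(toFlatten) && (depth>1)) ? toFlatten.flat(depth-1) : toFlatten);
			}, []);
		}
	});
}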

Number.isInteger

src/main/resources/lib/nashorn/Number.js
Number.isInteger = Number.isInteger || function(value) {
	return typeof value === 'number' &&
	isFinite(value) &&
	Math.floor(value) === value;
};

Step 4: Install and test

When you have built your collector application, install the jar file on the Enonic XP server where you have Explorer installed. Then create a collection using your collector, and click collect to see what happens. It is a good idea to run locally first and keep an eye on the Enonic XP server log.

