Media indexing
When a file is uploaded to XP, its textual content and embedded metadata are extracted and indexed automatically using Apache Tika. The contents of PDFs, Word documents, spreadsheets, presentations, and other text-bearing files become searchable alongside regular editorial content — no extra configuration required.
What gets indexed
- Readable text content: Tika extracts the readable text from text-bearing formats (the words inside a PDF, Word document, spreadsheet, etc.) and stores it in the attachment's text field. That text is added to the _alltext index, the default target for full-text search queries.
- Standard fields: MIME type, file size, and original file name are stored in the attachment fields and are searchable for every media type.
- Additional fields: media:image content has several standard fields, as well as mixins under x.media.* (EXIF, IPTC, and XMP mapped to imageInfo, cameraInfo, gpsInfo), making camera settings, GPS coordinates, and dimensions directly queryable. See Image content type for the full field catalogue.
Apart from images, built-in media types do not produce structured metadata. For these types Tika extracts only the readable content, not file metadata, so properties such as document page counts or video durations are not captured and cannot be searched.
Full-text search example
A full-text query matches the extracted content of every media item alongside editorial text fields. The query below finds all content, including PDFs, Word documents, and any regular content type, whose indexed text contains the word "quarterly":
{
  guillotine {
    queryDsl(
      query: { fulltext: { fields: ["_alltext"], query: "quarterly" } },
      first: 20
    ) {
      _id
      displayName
      type
    }
  }
}
Response:
{
  "data": {
    "guillotine": {
      "queryDsl": [
        {
          "_id": "a1b2c3d4-...",
          "displayName": "Q1 report.pdf",
          "type": "media:document"
        },
        {
          "_id": "e5f6g7h8-...",
          "displayName": "Quarterly review",
          "type": "com.example.myapp:article"
        }
      ]
    }
  }
}
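Because a single full-text query returns media and editorial content side by side, a client will often want to tell the two apart. The TypeScript sketch below is one way to do that, assuming (as in the examples on this page) that media content types are prefixed with "media:". The Hit interface and the partitionByMedia helper are hypothetical names introduced for illustration and are not part of any XP or Guillotine API.

```typescript
// Shape of one hit, matching the fields selected in the query above.
interface Hit {
  _id: string;
  displayName: string;
  type: string;
}

// Split hits into media items and everything else.
// Assumption: media content types start with the "media:" prefix.
function partitionByMedia(hits: Hit[]): { media: Hit[]; editorial: Hit[] } {
  const media: Hit[] = [];
  const editorial: Hit[] = [];
  for (const hit of hits) {
    (hit.type.indexOf("media:") === 0 ? media : editorial).push(hit);
  }
  return { media, editorial };
}

// Sample data copied from the response above (the IDs are elided in the source).
const sampleHits: Hit[] = [
  { _id: "a1b2c3d4-...", displayName: "Q1 report.pdf", type: "media:document" },
  { _id: "e5f6g7h8-...", displayName: "Quarterly review", type: "com.example.myapp:article" },
];

const groups = partitionByMedia(sampleHits);
console.log(groups.media.map((h) => h.displayName));     // [ 'Q1 report.pdf' ]
console.log(groups.editorial.map((h) => h.displayName)); // [ 'Quarterly review' ]
```

Grouping on the client keeps the query itself simple; when only media results are needed, filtering on the server avoids transferring hits that would be discarded.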
To restrict a search to media, filter on content type (see Querying for the full DSL reference). The query below limits results to documents, spreadsheets, and presentations:
{
  guillotine {
    queryDsl(
      query: {
        boolean: {
          must: [
            { fulltext: { fields: ["_alltext"], query: "quarterly" } },
            { matchAll: {} }
          ]
        }
      },
      contentTypes: ["media:document", "media:spreadsheet", "media:presentation"],
      first: 20
    ) {
      _id
      displayName
      type
    }
  }
}
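In practice the search term and the list of content types usually come from user input rather than being hard-coded. The TypeScript sketch below shows one possible way to assemble the media-only query above from parameters. buildMediaSearch is a hypothetical helper, not part of Guillotine. It relies on the assumption that JSON.stringify output for ordinary strings and string arrays is also a valid GraphQL literal, and it wraps the result in a standard GraphQL-over-HTTP request body. The endpoint URL is site-specific and is left out.

```typescript
// Build a query string that searches _alltext within the given content types.
// Mirrors the media-only query shown above; only the term, types, and page size vary.
function buildMediaSearch(term: string, contentTypes: string[], first: number = 20): string {
  // Assumption: JSON-escaped strings and arrays are accepted as GraphQL literals.
  const termLiteral = JSON.stringify(term);
  const typesLiteral = JSON.stringify(contentTypes);
  return `{
  guillotine {
    queryDsl(
      query: {
        boolean: {
          must: [
            { fulltext: { fields: ["_alltext"], query: ${termLiteral} } },
            { matchAll: {} }
          ]
        }
      },
      contentTypes: ${typesLiteral},
      first: ${first}
    ) {
      _id
      displayName
      type
    }
  }
}`;
}

const mediaQuery = buildMediaSearch("quarterly", [
  "media:document",
  "media:spreadsheet",
  "media:presentation",
]);

// Body for an HTTP POST to the site's Guillotine endpoint.
const requestBody = JSON.stringify({ query: mediaQuery });
console.log(requestBody.length > 0); // true
```

Escaping the term with JSON.stringify rather than concatenating it directly keeps quotes in user input from breaking the query. Passing the term as a GraphQL variable would be an alternative, at the cost of having to know the exact input type names in the schema.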