Aggregations

Introduction

An aggregation is a function that is executed on a collection of search results. The search results are defined by the query and query filter of the search request.

For instance, consider a query returning all nodes that have a property "price" less than, say, $100. Now, we want to divide the result nodes into ranges, say 0-$25, $25-$50 and so on. We also would like to know the average price for each category. This could be done by doing multiple separate queries and calculating the average manually, but this would be very inefficient and cumbersome. Luckily, aggregations solve these types of problems easily.

Some API functions accept an aggregations expression object. This object is expressed either in Java or as JSON like the following:

Basic aggregation DSL
{
  "aggregations": {
    "[name]": {
      "[type]": {
        ... body ...
      },
      "aggregations": {
        ... sub-aggregations ...
      }
    }
  }
}
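
The nesting above can also be assembled programmatically before it is sent. The following is a minimal Python sketch, not part of any official API: the helper name `aggregation` is hypothetical, and it only builds a dictionary shaped like the DSL and serializes it to JSON.

```python
import json


def aggregation(name, agg_type, body, sub_aggregations=None):
    """Build one named aggregation in the shape of the DSL above.

    Hypothetical helper for illustration only; it is not part of any API.
    """
    entry = {agg_type: body}
    if sub_aggregations:
        # Sub-aggregations sit next to the type key, under "aggregations".
        entry["aggregations"] = sub_aggregations
    return {name: entry}


# A terms aggregation with a nested stats aggregation.
price_stats = aggregation("priceStats", "stats", {"field": "data.product.price"})
request = {
    "aggregations": aggregation(
        "products",
        "terms",
        {"field": "data.product.category", "size": 10},
        sub_aggregations=price_stats,
    )
}

print(json.dumps(request, indent=2))
```

Keeping the name, type and body as separate arguments mirrors the three placeholders in the DSL and makes it harder to misplace the nested "aggregations" key.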

There are two different types of aggregations:

Bucket aggregations

A bucket aggregation places documents matching the query in a collection - a bucket. Each bucket has a key.

Metrics aggregations

A metrics aggregation computes metrics over a set of documents.

Typically, you will divide data into buckets and then, if necessary, use metrics aggregations to calculate e.g. average values, sums, etc. for each bucket.

terms

The terms aggregation places documents into buckets based on property values. Each unique value of a property gets its own bucket. Here’s a list of properties:

field (string)

The property path.

size (int)

The number of buckets to return, ordered by the given order type and direction. Defaults to 10.

order (string)

How to order the results, given as a type and a direction. Defaults to _term ASC.

Types:

  • _term: Alphabetic ordering of bucket keys.

  • _count: Numeric ordering by the number of documents in each bucket.

Sample terms aggregation
{
  "aggregations": {
    "categories": {
      "terms": {
        "field": "myCategory",
        "order": "_count desc",
        "size": 10
      }
    }
  }
}
Sample result from the above agg
{
  "aggregations": {
    "categories": {
      "buckets": [
        {
          "docCount": 132,
          "key": "articles"
        },
        {
          "docCount": 101,
          "key": "documents"
        },
        {
          "docCount": 43,
          "key": "case-studies"
        }
      ]
    }
  }
}
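
To make the bucketing and ordering rules concrete, here is a small in-memory Python sketch of what a terms aggregation computes. It is an illustration only, not the engine's implementation: the function name `terms_aggregation` is hypothetical, and documents are assumed to be flat dictionaries rather than nodes with property paths.

```python
from collections import Counter


def terms_aggregation(documents, field, order="_term ASC", size=10):
    """Count documents per unique value of `field` and return ordered buckets."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    order_type, direction = order.split()
    descending = direction.lower() == "desc"
    if order_type == "_count":
        key_fn = lambda item: item[1]  # order by number of documents
    else:
        key_fn = lambda item: item[0]  # "_term": order by bucket key
    ordered = sorted(counts.items(), key=key_fn, reverse=descending)
    # Only the first `size` buckets are returned.
    return [{"key": key, "docCount": count} for key, count in ordered[:size]]


documents = [
    {"myCategory": "articles"},
    {"myCategory": "documents"},
    {"myCategory": "articles"},
    {"myCategory": "case-studies"},
    {"myCategory": "articles"},
    {"myCategory": "documents"},
]

buckets = terms_aggregation(documents, "myCategory", order="_count desc", size=10)
for bucket in buckets:
    print(bucket["key"], bucket["docCount"])
```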

stats

The stats aggregation calculates the following statistics for the buckets of the parent aggregation:

avg, min, max, count, and sum

Here’s a list of properties:

field (string)

The property path.

Sample stats aggregation
{
  "start": 0,
  "count": 0,
  "aggregations": {
    "products": {
      "terms": {
        "field": "data.product.category",
        "order": "_count desc",
        "size": 10
      },
      "aggregations": {
        "priceStats": {
          "stats": {
            "field": "data.product.price"
          }
        }
      }
    }
  }
}
Sample result from the above agg
{
  "products": {
    "buckets": [
      {
        "key": "tv",
        "docCount": 123,
        "priceStats": {
          "count": 123,
          "min": 2599,
          "max": 87944,
          "avg": 4700,
          "sum": 578100
        }
      },
      {
        "key": "blu-ray player",
        "docCount": 42,
        "priceStats": {
          "count": 42,
          "min": 699,
          "max": 5999,
          "avg": 1548,
          "sum": 65016
        }
      },
      {
        "key": "receiver",
        "docCount": 12,
        "priceStats": {
          "count": 12,
          "min": 2999,
          "max": 26950,
          "avg": 5548,
          "sum": 66576
        }
      }
    ]
  }
}
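
The five statistics are related in a simple way: for every bucket, avg equals sum divided by count. The Python sketch below computes them per bucket for a handful of made-up products. The function names and the flat document layout are assumptions made for illustration, not the engine's implementation.

```python
from collections import defaultdict


def stats(values):
    """Return count, min, max, avg and sum for a non-empty list of numbers."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


def stats_per_bucket(documents, bucket_field, value_field):
    """Group documents by `bucket_field`, then compute stats of `value_field`."""
    grouped = defaultdict(list)
    for doc in documents:
        grouped[doc[bucket_field]].append(doc[value_field])
    return {key: stats(values) for key, values in grouped.items()}


# Made-up sample data.
products = [
    {"category": "tv", "price": 2599},
    {"category": "tv", "price": 7401},
    {"category": "blu-ray player", "price": 699},
]

result = stats_per_bucket(products, "category", "price")
print(result["tv"])
```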

range

The range aggregation defines a set of ranges, each of which represents a bucket. Here’s a list of properties:

field (string)

The property path.

ranges (range[])

The range-buckets to create.

range (from: number, to: number)

Defines a range to create a bucket for. The from value is included in the bucket; the to value is excluded.

Sample range aggregation
{
  "price_ranges": {
    "range": {
      "field": "price",
      "ranges": [
        {
          "to": 50
        },
        {
          "from": 50,
          "to": 100
        },
        {
          "from": 100
        }
      ]
    }
  }
}
Sample result from the above agg
{
  "price_ranges": {
    "buckets": [
      {
        "docCount": 2,
        "key": "a",
        "to": 50
      },
      {
        "docCount": 4,
        "from": 50,
        "key": "b",
        "to": 100
      },
      {
        "docCount": 4,
        "from": 100,
        "key": "c"
      }
    ]
  }
}
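
Because from is inclusive and to is exclusive, a value that sits exactly on a boundary lands in the upper bucket, and a missing from or to leaves that side open. The following Python sketch illustrates the rule on a plain list of numbers; the function name `range_aggregation` is hypothetical and the code is not the engine's implementation.

```python
def range_aggregation(values, ranges):
    """Count values per half-open range [from, to)."""
    buckets = []
    for r in ranges:
        lower = r.get("from")  # None means unbounded below
        upper = r.get("to")    # None means unbounded above
        count = sum(
            1
            for v in values
            if (lower is None or v >= lower) and (upper is None or v < upper)
        )
        bucket = {"docCount": count}
        if lower is not None:
            bucket["from"] = lower
        if upper is not None:
            bucket["to"] = upper
        buckets.append(bucket)
    return buckets


prices = [10, 49.99, 50, 75, 99, 100, 250]
ranges = [{"to": 50}, {"from": 50, "to": 100}, {"from": 100}]

buckets = range_aggregation(prices, ranges)
for bucket in buckets:
    print(bucket)
```

Here the price 50 is counted in the second bucket and 100 in the third, so no value is counted twice when one range's to equals the next range's from.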

geoDistance

The geoDistance aggregation needs a defined range to split the documents into buckets. Only documents with properties of type 'GeoPoint' will be considered in the geoDistance aggregation buckets.

Here’s a list of properties:

field (string)

The property path.

ranges (range[])

The range-buckets to create.

range (from: number, to: number)

Defines a range to create a bucket for. The from value is included in the bucket; the to value is excluded.

unit (string)

The measurement unit to use for the ranges. Legal values are either the full name or the abbreviation of the following: km (kilometers), m (meters), cm (centimeters), mm (millimeters), mi (miles), yd (yards), ft (feet) or nmi (nautical miles).

origin (lat: number, lon: number)

The GeoPoint from which the distance is measured.

Sample geoDistance aggregation
{
  "aggregations": {
    "distance": {
      "geoDistance": {
        "field": "data.cityLocation",
        "unit": "km",
        "origin": {
          "lat": "90.0",
          "lon": "0.0"
        },
        "ranges": [
          {
            "from": 0,
            "to": 1200
          },
          {
            "from": 1200,
            "to": 4000
          },
          {
            "from": 4000,
            "to": 12000
          },
          {
            "from": 12000
          }
        ]
      }
    }
  }
}
Sample result from the above agg
{
  "aggregations": {
    "distance": {
      "buckets": [
        {
          "key": "*-1200.0",
          "doc_count": 3
        },
        {
          "key": "1200.0-4000.0",
          "doc_count": 4
        },
        {
          "key": "4000.0-12000.0",
          "doc_count": 5
        },
        {
          "key": "12000.0-*",
          "doc_count": 1
        }
      ]
    }
  }
}

At the time of writing, there is only one way to find out which result belongs to which bucket: by also sorting the results on geoDistance and matching the order to the document count of each bucket. A future version will offer easier ways of doing this.
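
As a rough illustration of how points fall into distance buckets, the Python sketch below computes great-circle distances with the haversine formula and counts them per range. It makes several assumptions that are not taken from this documentation: a spherical Earth with a mean radius of 6371 km, kilometers as the only unit, and hypothetical function names. The engine's own distance calculation may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(origin, point):
    """Great-circle distance in kilometers between two lat/lon points."""
    lat1, lon1 = math.radians(origin["lat"]), math.radians(origin["lon"])
    lat2, lon2 = math.radians(point["lat"]), math.radians(point["lon"])
    a = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def geo_distance_counts(points, origin, ranges):
    """Count points per half-open distance range [from, to)."""
    distances = [haversine_km(origin, p) for p in points]
    counts = []
    for r in ranges:
        lower = r.get("from", 0)
        upper = r.get("to", math.inf)
        counts.append(sum(1 for d in distances if lower <= d < upper))
    return counts


origin = {"lat": 90.0, "lon": 0.0}  # the North Pole, as in the sample above
cities = [
    {"lat": 82.5, "lon": -62.35},    # Alert
    {"lat": 78.22, "lon": 15.65},    # Longyearbyen
    {"lat": 59.91, "lon": 10.75},    # Oslo
    {"lat": 41.9, "lon": 12.5},      # Rome
    {"lat": -33.87, "lon": 151.21},  # Sydney
]
ranges = [
    {"from": 0, "to": 1200},
    {"from": 1200, "to": 4000},
    {"from": 4000, "to": 12000},
    {"from": 12000},
]

counts = geo_distance_counts(cities, origin, ranges)
print(counts)
```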

dateRange

The dateRange aggregation defines a set of date ranges, each of which represents a bucket. Only documents with properties of type 'DateTime' will be considered in the dateRange aggregation buckets. Here’s a list of properties:

field (string)

The property path.

format (string)

The date format in which the bucket keys are returned. Defaults to yyyy-MM-dd'T'HH:mm:ss.SSSZ.

ranges (range[])

The range-buckets to create.

range (from: <number>, to: <number>)

Defines a range to create a bucket for. The from value is included in the bucket; the to value is excluded. The from and to values follow a special date-math syntax, explained below.

Sample dateRange aggregation
{
  "my_date_range": {
    "dateRange": {
      "field": "date",
      "format": "MM-yyy",
      "ranges": [
        {
          "to": "now-10M"
        },
        {
          "from": "now-10M"
        }
      ]
    }
  }
}
Sample result from the above agg
{
  "my_date_range": {
    "buckets": [
      {
        "key": "*-12-2017",
        "docCount": 2,
        "to": "2017-12-01T00:00:00Z"
      },
      {
        "key": "12-2017-*",
        "docCount": 4,
        "from": "2017-12-01T00:00:00Z"
      }
    ]
  }
}

Date-math expression

The range fields accept a date-math expression to calculate the time spans.

  • Now minus a day: now-1d

  • The given date minus three hours plus one minute: 2014-12-10T10:00:00Z||-3h+1m

  • Range describing now plus one day and thirty minutes, rounded to minutes: now+1d+30m/m
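
To show how such expressions resolve to concrete timestamps, here is a small Python evaluator for a subset of the syntax. It is a sketch under stated assumptions, not the engine's parser: it handles only the units w, d, h, m and s (month and year arithmetic is left out), accepts rounding only as a trailing suffix, and assumes that rounding truncates to the start of the unit. The function name `evaluate` is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

UNITS = {"w": "weeks", "d": "days", "h": "hours", "m": "minutes", "s": "seconds"}
# Fields to reset when truncating to the start of a unit (assumed behavior).
TRUNCATE = {
    "d": {"hour": 0, "minute": 0, "second": 0, "microsecond": 0},
    "h": {"minute": 0, "second": 0, "microsecond": 0},
    "m": {"second": 0, "microsecond": 0},
    "s": {"microsecond": 0},
}
STEP = re.compile(r"([+-])(\d+)([wdhms])")


def evaluate(expression, now):
    """Resolve a date-math expression (supported subset) to a datetime."""
    body, _, rounding = expression.partition("/")
    if body.startswith("now"):
        anchor, steps = now, body[len("now"):]
    else:
        # An explicit anchor date is separated from the steps by "||".
        anchor_text, _, steps = body.partition("||")
        anchor = datetime.strptime(anchor_text, "%Y-%m-%dT%H:%M:%SZ")
        anchor = anchor.replace(tzinfo=timezone.utc)
    if not re.fullmatch(r"(?:[+-]\d+[wdhms])*", steps):
        raise ValueError(f"unsupported date-math steps: {steps!r}")
    for sign, amount, unit in STEP.findall(steps):
        delta = timedelta(**{UNITS[unit]: int(amount)})
        anchor = anchor + delta if sign == "+" else anchor - delta
    if rounding:
        if rounding not in TRUNCATE:
            raise ValueError(f"unsupported rounding unit: {rounding!r}")
        anchor = anchor.replace(**TRUNCATE[rounding])
    return anchor


now = datetime(2014, 12, 10, 10, 0, 45, tzinfo=timezone.utc)
print(evaluate("now-1d", now))                        # 2014-12-09 10:00:45+00:00
print(evaluate("2014-12-10T10:00:00Z||-3h+1m", now))  # 2014-12-10 07:01:00+00:00
print(evaluate("now+1d+30m/m", now))                  # 2014-12-11 10:30:00+00:00
```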

dateHistogram

The dateHistogram aggregation defines a set of buckets based on a given time unit. For instance, if querying a set of log events, a dateHistogram aggregation with interval h (hour) will place each log event into a bucket for its hour within the time span of the matching events. Here’s a list of properties:

field (string)

The property path.

interval (string)

The time-unit interval for creating buckets. Supported time-unit notations:

  • y = Year

  • M = Month

  • w = Week

  • d = Day

  • h = Hour

  • m = Minute

  • s = Second

format (string)

Output format of date string.

minDocCount (int)

Only include a bucket in the result if its number of hits is greater than or equal to minDocCount.

Sample dateHistogram aggregation
{
  "by_month": {
    "dateHistogram": {
      "field": "init_date",
      "interval": "1M",
      "minDocCount": 0,
      "format": "yyyy-MM"
    }
  }
}
Sample result from the above agg
{
  "by_month": {
    "buckets": [
      {
        "docCount": 8,
        "key": "2014-01"
      },
      {
        "docCount": 10,
        "key": "2014-02"
      },
      {
        "docCount": 12,
        "key": "2014-03"
      }
    ]
  }
}
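
The bucketing can be pictured with a short in-memory Python sketch for a monthly interval. It is an illustration only: the function name is hypothetical, keys are always formatted as year-month, minDocCount is treated as a lower bound on bucket size, and the sketch assumes that a value of 0 also returns the empty months between the first and last hit.

```python
from collections import Counter
from datetime import datetime


def date_histogram_by_month(timestamps, min_doc_count=0):
    """Count timestamps per calendar month, including gaps when allowed."""
    if not timestamps:
        return []
    counts = Counter((ts.year, ts.month) for ts in timestamps)
    year, month = min(counts)
    last = max(counts)
    buckets = []
    # Walk every month from the first to the last hit so that empty
    # months can be reported when min_doc_count is 0.
    while (year, month) <= last:
        count = counts.get((year, month), 0)
        if count >= min_doc_count:
            buckets.append({"key": f"{year:04d}-{month:02d}", "docCount": count})
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


# Made-up sample data.
events = [
    datetime(2014, 1, 3),
    datetime(2014, 1, 20),
    datetime(2014, 3, 7),
]

print(date_histogram_by_month(events, min_doc_count=0))
print(date_histogram_by_month(events, min_doc_count=1))
```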
