Aggregate functions - Splunk Documentation (2024)

The SPL2 aggregate functions summarize the values from each event to create a single, meaningful value. Common aggregate functions include Average, Count, Minimum, Maximum, Standard Deviation, Sum, and Variance.

Most aggregate functions are used with numeric fields. However, there are some functions that you can use with either alphabetic string fields or numeric fields. The function descriptions indicate which functions you can use with alphabetic strings.

For an overview, see Overview of SPL2 stats functions.

avg(<value>)

This function returns the average, or mean, of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

The following example returns the average of the values in the size field for each distinct value in the host field.

... | stats avg(size) BY host


The following example returns the average thruput of each host for each 5 minute time span.

... | bin _time span=5m | stats avg(thruput) BY host


The following example displays a timechart of the average of the cpu_seconds field by processor, rounded to 2 decimal points.

... | timechart eval(round(avg(cpu_seconds),2)) BY processor

When you use a eval expression with the timechart command, you must also use BY clause.

count(<value>) or c(<value>)

This function returns the number of occurrences in a field.

Usage

To use this function, you can specify count(<value>), or the abbreviation c(<value>).

This function processes field values as strings.

To indicate a specific field value to match, use the format <field>=<value>.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

Several of these examples use an eval expression with the count function. See Using eval expressions in stats functions.


The following example returns the count of events where the status field has the value "404".

...| stats count(eval(status="404")) AS count_status BY sourcetype


The following example separates the search results into 10 bins and counts the values in the _raw field for each bin.

... | bin size bins=10 | stats count(_raw) BY size


You can use search literals in functions that accept predicate expressions. Search literals are enclosed in backtick characters ( ` ). The following search uses a search literal to count the occurrences of the value 500 in your events. The results are organized by the host field:

... | stats count(`500`) by host

For more information, see Search literals in expressions.


The following example uses the timechart command to count the events where the action field contains the value purchase.

| from my_dataset where sourcetype="access_*" | timechart count(eval(action="purchase")) BY productName usenull=f useother=f

distinct_count(<value>) or dc(<value>)

This function returns the count of distinct values in a field.

Usage

To use this function, you can specify distinct_count(), or the abbreviation dc().

This function processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Basic examples

The following example removes duplicate results with the same host value and returns the total count of the remaining results.

... | stats dc(host)


The following example generates the distinct count of the values in the devices field. The results are returned in a field called numdevices.

...| stats dc(device) AS numdevices


The following example counts the distinct client IP addresses for each host and category, and then bins the count for each result into five minute spans.

... | stats dc(clientip) BY host, categoryId | bin _time span=5m

Extended example

The following search counts the number of different customers who purchased something from the Buttercup Games online store yesterday. The search organizes the count by the type of product , such as accessories, t-shirts, and type of games, that customers purchased.

| from my_dataset where sourcetype="access_*" action=purchase | stats dc(clientip) BY categoryId

  • This example first searches for purchase events, action=purchase.
  • These results are piped into the stats command and the dc() function counts the number of distinct users who made purchases.
  • The BY clause is used to organize the distinct count based on the different category of products, the categoryId.


The results look like this:

categoryIddc(clientip)
ACCESSORIES37
ARCADE58
NULL8
SHOOTER31
SIMULATION34
SPORTS13
STRATEGY74
TEE38

estdc(<value>)

This function returns the estimated count of the distinct values in a field.

Usage

This function processes field values as strings.

The string values 1.0 and 1 are considered distinct values and counted separately.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

The following example removes duplicate results with the same host value in a field, and returns the estimated total count of the remaining results.

... | stats estdc(host)

The results look like this:

estdc(host)
6

The following example generates the estimated distinct count of the devices field and renames the results field, numdevices.

...| stats estdc(devices) AS numdevices


The following example estimates the distinct count for the values in the source field for each sourcetype. The results are organized into 5 minute time spans.

... | stats estdc(source) BY sourcetype | bin 5m

estdc_error(<value>)

This function returns the theoretical error of the estimated count of the distinct values in a field.

The error represents this ratio:

absolute_value(estimate_distinct_count - real_distinct_count) / real_distinct_count

Usage

This function processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Example

The following example determines the error ratio for the estimated distinct count of the values in the host field.

... | stats estdc_error(host)

exactperc(<value>,<percentile>)

This function returns an exact percentile based on the values in a numeric field.

The exactperc function provides the exact value, but is very resource expensive for high cardinality fields. The exactperc function can consume a large amount of memory, which might impact how long it takes for a search to complete.

Usage

There are three percentile functions: exactperc, perc, and upperperc. For details about the differences, see the perc function.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

See the perc function.

max(<value>)

This function returns the maximum value in a field.

Usage

This function processes field values as numbers if possible, otherwise processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

If the values in the field are non-numeric, the maximum value is found using lexicographical ordering.

Lexicographical order sorts items based on the values used to encode the items in computer memory. In Splunk software, this is almost always UTF-8 encoding, which is a superset of ASCII.

  • Numbers are sorted before letters. Numbers are sorted based on the first digit. For example, the numbers 10, 9, 70, 100 are sorted lexicographically as 10, 100, 70, 9.
  • Uppercase letters are sorted before lowercase letters.
  • Symbols are not standard. Some symbols are sorted before numeric values. Other symbols are sorted before or after letters.

Basic example

This example returns the maximum value of the size field.

... | stats max(size)

Extended example

Calculate aggregate statistics for the magnitudes of earthquakes in an area

This search uses recent earthquake data downloaded from the USGS Earthquakes website. The data is a comma separated ASCII text file that contains magnitude (mag), coordinates (latitude, longitude), region (place), etc., for each earthquake recorded.

Search for earthquakes in and around California. Calculate the number of earthquakes that were recorded. Use statistical functions to calculate the minimum, maximum, range (the difference between the min and max), and average magnitudes of the recent earthquakes. List the values by magnitude type.

source=all_month.csv place=*California* | stats count, max(mag), min(mag), range(mag), avg(mag) BY magType

The results look like this:

magTypecountmax(mag)min(mag)range(mag)avg(mag)
H1232.80.02.80.549593
MbLg10000.0000000
Md15653.20.13.11.056486
Me22.01.6.041.800000
Ml12024.3-0.44.71.226622
Mw64.93.01.93.650000
ml101.560.191.370.934000

mean(<value>)

This function returns the arithmetic mean of the values in a field.

The mean values should be exactly the same as the values calculated using the avg() function.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Basic example

The following example returns the mean of the values in the kbps field.

...| stats mean(kbps)

Extended example

This search uses recent earthquake data downloaded from the USGS Earthquakes website. The data is a comma separated ASCII text file that contains magnitude (mag), coordinates (latitude, longitude), region (place), etc., for each earthquake recorded.

Run the following search to find the mean, standard deviation, and variance of the magnitudes of recent quakes by magnitude type.

source=usgs place=*California* | stats count mean(mag), stdev(mag), var(mag) BY magType

The results look like this:

magTypecountmean(mag)std(mag)var(mag)
H1230.5495930.3569850.127438
MbLg10.0000000.0000000.000000
Md15651.0564860.5800420.336449
Me21.8000000.3464100.120000
Ml12021.2266220.6296640.396476
Mw63.6500000.7162400.513000
ml100.9340000.5604010.314049

median(<value>)

This function returns the middle-most value of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

If you have an even number of search results, the median calculation is the average of the two middle-most values.

Basic example

Consider the following list of values, which counts the number of different customers who purchased something from the Buttercup Games online store yesterday. The values are organized by the type of product (accessories, t-shirts, and type of games) that customers purchased.

categoryIdcount
ACCESSORIES37
ARCADE58
NULL8
SIMULATION34
SPORTS13
STRATEGY74
TEE38

When the list is sorted the the middle-most value is 37.

categoryIdcount
NULL8
SPORTS13
SIMULATION34
ACCESSORIES37
TEE38
ARCADE58
STRATEGY74

The middle-most value is returned when there are an odd number of results. When there are an even number of results, the average of the two middle-most numbers is returned.

min(<value>)

This function returns the minimum value in a field.

Usage

This function processes field values as numbers if possible, otherwise processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

If the values in the field are non-numeric, the minimum value is found using lexicographical ordering.

Lexicographical order sorts items based on the values used to encode the items in computer memory. In Splunk software, this is almost always UTF-8 encoding, which is a superset of ASCII.

  • Numbers are sorted before letters. Numbers are sorted based on the first digit. For example, the numbers 10, 9, 70, 100 are sorted lexicographically as 10, 100, 70, 9.
  • Uppercase letters are sorted before lowercase letters.
  • Symbols are not standard. Some symbols are sorted before numeric values. Other symbols are sorted before or after letters.

Basic examples

The following example returns the minimum size and maximum size of the HotBucketRoller component in the _internal index.

index=_internal component=HotBucketRoller | stats min(size), max(size)


The following example returns a list of processors and calculates the minimum cpu_seconds and the maximum cpu_seconds.

index=_internal | chart min(cpu_seconds), max(cpu_seconds) BY processor

Extended example

See the Extended example for the max() function. That example includes the min() function.

mode(<value>)

This function returns the most frequent value in a field.

Usage

This function processes field values as strings.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

The mode returns the most frequent value. Consider the following data:

firstnamesurnameage
ClaudiaGarcia32
DavidMayer45
AlexGarcia29
WeiZhang45
JavierGarcia37

When you search for the mode in the age field, the value 45 is returned.

...| stats mode(age)


You can also use mode with fields that contain string values. When you search for the mode in the surname field, the value Garcia is returned.

...| stats mode(surname)


Here's another set of sample data:

_timehostsourcetype
04-06-2020 17:06:23.000 PMwww1access_combined
04-06-2020 10:34:19.000 AMwww1access_combined
04-03-2020 13:52:18.000 PMwww2access_combined
04-02-2020 07:39:59.000 AMwww3access_combined
04-01-2020 19:35:58.000 PMwww1access_combined

If you run a search that looks for the mode in the host field, the value www1 is returned because it is the most common value in the host field. For example:

... |stats mode(host)

The results look something like this:

mode(host)
www1

perc(<value>,<percentile>)

This function returns an approximate percentile based on the values in a numeric field.

There are three percentile functions. These functions return the nth percentile of the values in a numeric field. You can think of this as an estimate of where the top N% starts. For example, a 95th percentile says that 95% of the values in the field are below the estimate and 5% of the values in the field are above the estimate.

Valid percentiles are floating point numbers between 0 and 100, such as 99.95.

Differences between the percentile functions:

FunctionDescription
perc(<value>,<percentile>)
Use the perc function to calculate an approximate threshold, such that the values in a field fall below the percentile threshold you specify. The perc function returns a single number that represents the lower end of the approximate values for the percentile requested.
upperperc(<value>,<percentile>)When there are more than 1000 values, the upperperc function gives the approximate upper bound for the percentile requested. Otherwise the upperperc function returns the same percentile as the perc function.
exactperc(<value>,<percentile>)The exactperc function provides the exact value, but is very resource expensive for high cardinality fields. The exactperc function can consume a large amount of memory, which might impact how long it takes for a search to complete.

The percentile functions process field values as strings.

The perc and upperperc functions give approximate values for the integer percentile requested. The approximation algorithm that is used, which is based on dynamic compression of a radix tree, provides a strict bound of the actual value for any percentile.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

In the search results, the field names use this format: perc<percentage>(<value>) regardless of the

Valid syntax

To use this function, you can specify perc() using one of the following syntaxes:

SyntaxExample
perc(<value>,<percentile>)

... | stats perc(host, 95)

perc<percentile>(<value>)

... | stats perc95(host)

Differences between Splunk and Excel percentile algorithms

If there are less than 1000 distinct values, the Splunk percentile functions use the nearest rank algorithm. See http://en.wikipedia.org/wiki/Percentile#Nearest_rank. Excel uses the NIST interpolated algorithm, which basically means you can get a value for a percentile that does not exist in the actual data, which is not possible for the nearest rank approach.

Splunk algorithm with more than 1000 distinct values

If there are more than 1000 distinct values for the field, the percentiles are approximated using a custom radix-tree digest-based algorithm. This algorithm is much faster and uses much less memory, a constant amount, than an exact computation, which uses memory in linear relation to the number of distinct values. By default this approach limits the approximation error to < 1% of rank error. That means if you ask for 95th percentile, the number you get back is between the 94th and 96th percentile.

You always get the exact percentiles even for more than 1000 distinct values by using the exactperc function instead of the perc function.

Basic examples

Consider this list of values {10,9,8,7,6,5,4,3,2,1} in the myfield field.

The following example returns the 90th percentage for the values:

...| stats perc(myfield, 90)

The results look like this:

perc90(myfield)
9.1

The following example returns several percentages:

...| stats perc(score,95), perc(score,50), perc(score,25)

The results look like this:

perc95(myfield)perc50(myfield)perc55(myfield)
9.555.53.25

Extended example

Consider the following set of data, which shows the number of visitors for each hour a store is open:

hourvisitors
08000
0900212
1000367
1100489
1200624
1300609
1400492
1500513
1600376
1700337

Let's use a dataset literal to create a temporary dataset from this data and then run the streamstats command to create a cumulative total for the visitors.

| FROM [{hour: 0800, visitors: 0}, {hour: 0900, visitors: 212}, {hour: 1000, visitors: 367}, {hour: 1100, visitors: 489}, {hour: 1200, visitors: 624}, {hour: 1300, visitors: 609}, {hour: 1400, visitors: 492}, {hour: 1500, visitors: 513}, {hour: 1600, visitors: 367}, {hour: 1700, visitors: 337}] | streamstats sum(visitors) as 'visitors total'

The results from this search look like this:

hourvisitorsvisitors total
080000
0900212212
1000367579
11004891068
12006241692
13006092301
14004922793
15005133306
16003763673
17003374010

Let's add the stats command with the perc function to determine the 50th and 95th percentiles.

| FROM [{hour: 0800, visitors: 0}, {hour: 0900, visitors: 212}, {hour: 1000, visitors: 367}, {hour: 1100, visitors: 489}, {hour: 1200, visitors: 624}, {hour: 1300, visitors: 609}, {hour: 1400, visitors: 492}, {hour: 1500, visitors: 513}, {hour: 1600, visitors: 367}, {hour: 1700, visitors: 337}] | streamstats sum(visitors) as 'visitors total' | stats perc50('visitors total') perc95('visitors total')

The results from this search look like this:

perc50(visitors total)perc95(visitors total)
1996.53858.35

The perc50 estimates the 50th percentile, when 50% of the visitors had arrived. You can see from the data that the 50th percentile was reached between visitor number 1996 and 1997, which was sometime between 1200 and 1300 hours. The perc95 estimates the 95th percentile, when 95% of the visitors had arrived. The 95th percentile was reached with visitor 3858, which occurred between 1600 and 1700 hours.

range(<value>)

Returns the difference between the maximum and minimum values in a field.

Usage

The values in the field must be numeric.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Basic example

This example uses events that list the numeric sales for each product and quarter, for example:

productsquartersalesquota
ProductAQTR112001000
ProductBQTR114001550
ProductCQTR116501275
ProductAQTR214251300
ProductBQTR211751425
ProductCQTR215501450
ProductAQTR313001400
ProductBQTR312501125
ProductCQTR313751475
ProductAQTR415501300
ProductBQTR417001225
ProductCQTR416251350

It is easiest to understand the range if you also determine the min and max values. To determine the range of sales by product, run this search:

... | stats sum(sales), min(sales), max(sales), range(sales) BY products

The results look something like this:

quartersum(sales)min(sales)max(sales)range(sales)
QTR1425012001650450
QTR2415011751550375
QTR3392512501375125
QTR4487515501700150

The range(sales) is max(sales) minus min(sales).

Extended example

See the Extended example for the max() function. That example includes the range() function.

stdev(<value>)

This function returns the sample standard deviation of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

A sample standard deviation is different than a population standard deviation. A sample standard deviation is calculated from a larger population by dividing the number of data points by 1 less than the total data points.

For information about creating a population standard deviation, see the stdevp function.

Basic example

This example returns the standard deviation of fields that end in the word delay which can apply to both, delay and xdelay. Because a wildcard is used, you must use single quotation marks to enclose the field name.

... | stats stdev('*delay')

Extended example

This search uses recent earthquake data downloaded from the USGS Earthquakes website. The data is a comma separated ASCII text file that contains magnitude (mag), coordinates (latitude, longitude), region (place), etc., for each earthquake recorded.

Run the following search to find the mean, standard deviation, and variance of the magnitudes of recent quakes by magnitude type.

source=usgs place=*California* | stats count(), mean(mag), stdev(mag), var(mag) BY magType

The results look like this:

magTypecountmean(mag)std(mag)var(mag)
H1230.5495930.3569850.127438
MbLg10.0000000.0000000.000000
Md15651.0564860.5800420.336449
Me21.8000000.3464100.120000
Ml12021.2266220.6296640.396476
Mw63.6500000.7162400.513000
ml100.9340000.5604010.314049

stdevp(<value>)

This function returns the population standard deviation of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

A sample population deviation is different than a sample standard deviation. A population standard deviation is calculated from the entire population.

For information about creating a sample standard deviation, see the stdev function.

Basic example

The following example calculates the population standard deviation of the field lag.

... | stats stdevp(lag)

sum(<value>)

This function returns the sum of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

You can create totals for any numeric field. For example:

...| stats sum(bytes)

The results look something like this:

sum(bytes)
21502


You can rename the column using the AS keyword:

...| stats sum(bytes) AS "total bytes"

The results look something like this:

total bytes
21502


You can organize the results using a BY clause:

...| stats sum(bytes) AS "total bytes" by date_hour

The results look something like this:

date_hourtotal bytes
076509
113726
156569
234698

sumsq(<value>)

This function returns the sum of the squares of the values in a field.

The sum of the squares is used to evaluate the variance of a dataset from the dataset mean. A large sum of the squares indicates a large variance, which tells you that individual values fluctuate widely from the mean.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Example

The following table contains the temperatures taken every day at 8 AM for a week.

You calculate the mean of the these temperatures and get 48.9 degrees. To calculate the deviation from the mean for each day, take the temperature and subtract the mean. If you square each temperature, you get results like this:

daytemperaturemeandeviationsquare of temperatures
sunday6548.916.1260.6
monday4248.9-6.947.0
tuesday4048.9-8.978.4
wednesday3148.9-17.9318.9
thursday4748.9-1.93.4
friday5348.94.117.2
saturday6448.915.1229.3

Take the total of the squares, 954.9, and divide by 6 which is the number of days minus 1. This gets you the sum of squares for this series of temperatures. The standard deviation is the square root of the sum of the squares. The larger the standard deviation the larger the fluctuation in temperatures during the week.

You can calculate the mean, sum of the squares, and standard deviation with a few statistical functions. Let's use a dataset literal so that you can try this on your own:

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] | stats mean(temp), sumsq(temp), stdev(temp)

This search returns these results:

mean(temp)sumsq(temp)stdev(temp)
48.8571428571428541766412.615183595289349

upperperc(<value>,<percentile>)

This function returns an approximate upper bound percentile value of a numeric field that contains more than 1000 values.

The upperperc function returns an estimate of where the top N% starts. For example, a 95th percentile says that 95% of the data points are below the estimate and 5% of the data points are above the estimate.

Usage

There are three percentile functions: exactperc, perc, and upperperc. For details about the differences, see the perc function.

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

See the perc function.

var(<value>)

Returns the sample variance of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Examples

Variance is often used to determine the standard deviation of a set of values.

The following table contains the temperatures taken every day at 8 AM for a week.

daytemperature
sunday65
monday42
tuesday40
wednesday31
thursday47
friday53
saturday64

Let's use a dataset literal so that you can try this on your own:

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] |stats mean(temp)

You calculate the mean, which is the average, of the these temperatures and get 48.9 degrees.

mean(temp)
48.857142857142854

To calculate the deviation from the mean for each day, take the temperature and subtract the mean.

daytempmeandeviation
sunday6548.916.1
monday4248.9-6.9
tuesday4048.9-8.9
wednesday3148.9-17.9
thursday4748.9-1.9
friday5348.94.1
saturday6448.915.1

To calculate the variance, square each deviation and average the result. Or just run this search:

FROM [{day: "sun", temp: 65}, {day: "mon", temp: 42}, {day: "tue", temp: 40}, {day: "wed", temp: 31}, {day: "thu", temp: 47}, {day: "fri", temp: 53}, {day: "sat", temp: 64}] |stats mean(temp), var(temp), stdev(temp)


This search returns these results:

mean(temp)var(temp)stdev(temp)
48.857142857142854159.1428571428574812.615183595289349

The larger the standard deviation the larger the fluctuation in temperatures during the week.

Extended example

See the Extended example for the mean() function. That example includes the var() function.

varp(<value>)

Returns the population variance of the values in a field.

Usage

You can use this function with the stats, eventstats, streamstats, and timechart commands.

Basic example

See also

Function information
Overview of SPL2 stats and chart functions
Quick Reference for SPL2 Stats and Charting Functions
Naming function arguments in the SPL2 Search Manual
Aggregate functions - Splunk Documentation (2024)

References

Top Articles
Latest Posts
Article information

Author: Margart Wisoky

Last Updated:

Views: 5984

Rating: 4.8 / 5 (58 voted)

Reviews: 81% of readers found this page helpful

Author information

Name: Margart Wisoky

Birthday: 1993-05-13

Address: 2113 Abernathy Knoll, New Tamerafurt, CT 66893-2169

Phone: +25815234346805

Job: Central Developer

Hobby: Machining, Pottery, Rafting, Cosplaying, Jogging, Taekwondo, Scouting

Introduction: My name is Margart Wisoky, I am a gorgeous, shiny, successful, beautiful, adventurous, excited, pleasant person who loves writing and wants to share my knowledge and understanding with you.