Statistical analysis


Inherent in GIS data is information on the attributes of features as well as their locations. This information is used to create maps that can be visually analyzed. Statistical analysis helps you extract additional information from your GIS data that might not be obvious simply by looking at a map—information such as how attribute values are distributed, whether there are spatial trends in the data, or whether the features form spatial patterns. Unlike query functions—such as identify or selection, which provide information about individual features—statistical analysis reveals the characteristics of a set of features as a whole.

Some of the statistical analysis techniques described in this document are best suited to interactive applications, such as ArcMap, that allow you to select and visualize data in an ad hoc and fluid environment. Some of the methods described here are found in ArcMap's menus and toolbars and don't have a geoprocessing tool counterpart. Other methods, such as the spatial statistics tools, are only implemented as geoprocessing tools.

Uses of statistical analysis

Statistical analysis is often used to explore your data—for example, to examine the distribution of values for a particular attribute or to spot outliers (extreme high or low values). Having this information is useful when defining classes and ranges on a map, when reclassifying data, or when looking for data errors.

In the example below, statistics have been calculated for the distribution of senior citizens by census tract in this region (percentage age 65 and over in each tract), including the mean and standard deviation, as well as a histogram showing the distribution of values. Most tracts have a lower percentage of seniors than the mean, but a few tracts have a very high percentage.

Summary statistics and histogram complement symbology
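As a rough illustration of the kind of summary described above, the following Python sketch computes a mean, a standard deviation, and histogram counts for a handful of tracts. The percentages are made-up values, not the data behind the figure, and `histogram` is a hypothetical helper rather than an ArcGIS function.

```python
import statistics

# Hypothetical percentages of residents aged 65 and over for ten census tracts.
pct_seniors = [4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 21.0, 29.5]

mean = statistics.mean(pct_seniors)
std_dev = statistics.pstdev(pct_seniors)  # population standard deviation
below_mean = sum(1 for v in pct_seniors if v < mean)


def histogram(values, bin_width):
    """Count values per bin; a bin starting at s covers [s, s + bin_width)."""
    counts = {}
    for v in values:
        start = int(v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


bins = histogram(pct_seniors, 5)

print(f"mean={mean:.1f} std_dev={std_dev:.2f} tracts below mean={below_mean}")
print(bins)
```

With these sample values, most tracts fall below the mean because two tracts with very high percentages pull the mean upward, which is the same kind of skew the histogram in the example reveals.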

Another use of statistical analysis is to summarize data. Often this is done for categories, such as calculating the total area in each land use category. You can also create spatial summaries, such as calculating the average elevation for each watershed. Summary data is useful for gaining a better understanding of conditions in a study area.

In the example below, summary statistics have been calculated for each land use class, showing the number of parcels in that class, the size of the smallest and largest parcels, the average parcel size, and the total area in the class.

Parcel feature size may vary with land use class; statistics can show the pattern

Summary statistics can reveal patterns in data
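A per-category summary of this kind can be sketched in plain Python. The parcel records and the `summarize_by_class` helper below are hypothetical and only illustrate the grouping logic; they are not how ArcGIS implements it.

```python
from collections import defaultdict

# Hypothetical parcels: (land use class, area in hectares).
parcels = [
    ("Residential", 0.5), ("Residential", 1.5), ("Residential", 1.0),
    ("Commercial", 2.0), ("Commercial", 4.0),
    ("Agricultural", 40.0),
]


def summarize_by_class(rows):
    """Group areas by land use class, then summarize each group."""
    groups = defaultdict(list)
    for land_use, area in rows:
        groups[land_use].append(area)
    return {
        land_use: {
            "count": len(areas),
            "min": min(areas),
            "max": max(areas),
            "mean": sum(areas) / len(areas),
            "sum": sum(areas),
        }
        for land_use, areas in groups.items()
    }


summary = summarize_by_class(parcels)
for land_use, stats in summary.items():
    print(land_use, stats)
```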

Statistical analysis is also used to identify and confirm spatial patterns, such as the center of a group of features, the directional trend, or whether features form clusters. While patterns may be apparent on a map, trying to draw conclusions from a map can be difficult: how you classify and symbolize the data can obscure or overemphasize patterns. Statistical functions analyze the underlying data and give you a measure that can be used to confirm the existence and strength of the pattern.

Below are examples of analyses that show the mean center of a set of burglaries and the standard deviational ellipse for a set of moose sightings (showing the directional trend).

Spatial statistics can show geographic patterns or trends
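One simple way to put a number on a directional trend is to compute the orientation of the principal axis of the point coordinates. The sketch below does this from the coordinate variances and covariance; it is an illustration only, not the Directional Distribution tool's algorithm, and it reports the angle counterclockwise from the x-axis, which may differ from the convention ArcGIS uses. The function name and sample points are hypothetical.

```python
import math


def directional_trend_degrees(points):
    """Orientation of the main axis of a point set, in degrees within [0, 180),
    measured counterclockwise from the x-axis."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    var_x = sum((x - cx) ** 2 for x, _ in points) / n
    var_y = sum((y - cy) ** 2 for _, y in points) / n
    cov_xy = sum((x - cx) * (y - cy) for x, y in points) / n
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    return math.degrees(angle) % 180.0


# Hypothetical sightings strung out along a southwest-to-northeast line.
sightings = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
trend = directional_trend_degrees(sightings)
print(f"trend: {trend:.1f} degrees")
```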

Below is an example of an analysis that shows statistically significant clusters of census tracts with many senior citizens (orange) or few (blue).

Spatial statistics can show geographic patterns or trends

Types of statistical analysis

Statistical analysis functions in ArcGIS Desktop are either nonspatial (tabular) or spatial (containing location).

Nonspatial statistics are used to analyze attribute values associated with features. The values are accessed directly from a layer's feature attribute table. Examples of nonspatial statistics include the mean and standard deviation.

In this example, the Summary Statistics tool was used to calculate the number of vacant parcels for a set of census tracts, including the total, the mean, and the standard deviation.

Summary statistics

Charts and graphs, such as a histogram or Q-Q plots, are another way of analyzing nonspatial data. In all cases, only the values are analyzed. The locations of the features with which the values are associated—and any spatial relationships between the features—are not considered.

In this example, the histogram shows the distribution of vacant parcels (the number of vacant parcels along the x-axis and the number of tracts in each range along the y-axis).

Histograms show the distribution of data values

A Normal Q-Q plot is used to assess the similarity of the distribution of a set of values to that of a standard normal distribution (the typical bell curve, when shown on a histogram). The line on the Normal Q-Q plot shows expected values for a normal distribution: the closer the values are to the line, the closer the distribution is to normal. In this example, the concentration of the element phosphorus for a set of soil samples is close to normally distributed.

A Normal Q-Q plot compares data value distributions to a normal distribution

The Normal QQ Plot tool is one of the data exploration tools available with the Geostatistical Analyst extension.
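The points on a Normal Q-Q plot can be approximated with the Python standard library. This sketch pairs each sorted value with the standard normal quantile expected at its rank, using the plotting position (i + 0.5) / n for zero-based rank i. That plotting position is an assumption; the Geostatistical Analyst tool may use a different one. The sample values are hypothetical.

```python
from statistics import NormalDist


def normal_qq_points(values):
    """Return (expected standard normal quantile, observed value) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (std_normal.inv_cdf((i + 0.5) / n), value)
        for i, value in enumerate(ordered)
    ]


# Hypothetical soil sample concentrations.
points = normal_qq_points([3.0, 1.0, 2.0])
for expected, observed in points:
    print(f"{expected:+.3f}  {observed}")
```

If the observed values are roughly normal, plotting these pairs gives points that fall close to a straight line.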

Spatial statistics, on the other hand, focus on the spatial relationships between features—how compact or dispersed the features are, whether they're oriented in a particular direction, and whether they form clusters. The spatial relationship is usually defined as distance (how far apart features are) but can also be other forms of interaction between features.

In the example below, the output of the Standard Distance tool (displayed graphically as a circle) is calculated using the distance of each wildlife sighting from the calculated center of the sightings.

Standard distance and mean center of a group of points
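The calculation behind this example can be sketched as follows, taking the mean center as the average of the coordinates and the standard distance as the root mean squared distance of the points from that center. The sighting coordinates are hypothetical, and the sketch assumes planar (projected) coordinates.

```python
import math


def mean_center(points):
    """Average x and average y of a set of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def standard_distance(points):
    """Root mean squared distance of the points from their mean center."""
    cx, cy = mean_center(points)
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / len(points)
    return math.sqrt(mean_sq)


# Hypothetical wildlife sightings at the corners of a 4 x 4 square.
sightings = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
center = mean_center(sightings)
radius = standard_distance(sightings)
print(center, round(radius, 3))
```

The resulting `radius` is what would be drawn as a circle around `center`.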

Some spatial statistics consider both the spatial relationships of features and the values of an attribute associated with the features. These are known as weighted statistics—the spatial relationship is influenced by the values. Weighted spatial statistics are used to find out if features having similar values occur together—if, for example, schools with similarly high or low test scores form clusters.

In the example below, the center of parks is weighted by the number of visitors at each park (represented by the size of the green circles).

Weighted mean center of points
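A weighted mean center can be sketched by multiplying each coordinate by its weight before averaging. The park locations and visitor counts below are hypothetical.

```python
def weighted_mean_center(points, weights):
    """Mean center in which each point counts in proportion to its weight."""
    total = sum(weights)
    x = sum(w * px for (px, _), w in zip(points, weights)) / total
    y = sum(w * py for (_, py), w in zip(points, weights)) / total
    return (x, y)


# Hypothetical parks and annual visitor counts.
parks = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
visitors = [100, 300, 100]

center = weighted_mean_center(parks, visitors)
print(center)
```

Here the center is pulled toward the park at (10, 0) because it has the most visitors; with equal weights, the center would sit at about (3.33, 3.33).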

Statistical functions can also be classified by whether they're descriptive or inferential. Descriptive statistics summarize some characteristic of the values or features you're analyzing—the mean value, the frequency distribution of values, or the directional trend of a group of features. Descriptive statistics are often useful for comparing two sets of features for the same area.

The example below compares the distribution of senior citizens (top) to that of children under 5 (bottom) for the same set of census tracts.

Histograms and summary statistics are a way to compare populations

In the example below, the standard distance circles for the American Indian and African American population show that the distribution of the African American population in this area is much more compact.

Standard distance and mean centers are a way to compare populations

Inferential statistics use probability theory either to predict the likely occurrence of values (using a set of known values) or to assess the likelihood that any pattern or trend you see in the data is not due to chance. The function provides a measure of the pattern or relationship. You then perform a statistical test on this measure to determine whether it is significant at some level of confidence. If the statistical analysis indicates burglaries occur in clusters, you'd then run a test to find out how likely it is that the clusters arose by chance. You might find, for example, that there's a 90% likelihood that the clusters didn't occur by chance, indicating the burglaries may be linked in some way. Essentially, to determine the probability, the test compares the measure you get for the existing features to the measure you'd expect to get for the same number of features spread over the same area but distributed randomly.

In the example below, the map on the left shows clusters of census tracts having a high number of senior citizens (orange) or a low number (blue), at a 90% level of probability; the right map shows clusters at a 99% level of probability.

Compare the detected clustering at different levels of probability
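The idea of comparing an observed measure with what random placement would produce can be illustrated with a small Monte Carlo simulation. This is a swapped-in technique for illustration: the ArcGIS tools derive z-scores and p-values analytically rather than by simulation. The sketch assumes a rectangular study area, and the burglary coordinates, function names, and parameters are hypothetical.

```python
import math
import random


def mean_nearest_neighbor_distance(points):
    """Average distance from each point to its closest other point."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        total += min(
            math.hypot(x1 - x2, y1 - y2)
            for j, (x2, y2) in enumerate(points)
            if i != j
        )
    return total / len(points)


def clustering_p_value(points, width, height, simulations=199, seed=0):
    """Estimate how often random placement is at least as tightly packed
    as the observed points (smaller values suggest real clustering)."""
    rng = random.Random(seed)
    observed = mean_nearest_neighbor_distance(points)
    as_extreme = 0
    for _ in range(simulations):
        simulated = [
            (rng.uniform(0, width), rng.uniform(0, height)) for _ in points
        ]
        if mean_nearest_neighbor_distance(simulated) <= observed:
            as_extreme += 1
    return (as_extreme + 1) / (simulations + 1)


# Hypothetical burglaries packed into one corner of a 100 x 100 study area.
burglaries = [
    (50.0, 50.0), (50.5, 50.2), (51.0, 50.1), (50.2, 50.8), (50.7, 50.9),
    (51.2, 50.6), (50.1, 51.3), (50.6, 51.5), (51.1, 51.2), (50.9, 50.4),
]
p_value = clustering_p_value(burglaries, 100.0, 100.0)
print(f"estimated p-value: {p_value:.3f}")
```

A small p-value means few or none of the random layouts were as tightly packed as the observed points, so the clustering is unlikely to be due to chance.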

Statistical analysis functions

The statistical functions in ArcGIS Desktop are located in ArcMap, ArcCatalog, and ArcToolbox, as well as within two extensions: Spatial Analyst and Geostatistical Analyst.

Table statistics

A core set of descriptive statistics that summarize the values for a single field is available from several locations in ArcGIS Desktop: the table window in ArcMap, the table preview tab in ArcCatalog, and the Statistics toolset (within the Analysis toolbox) in ArcToolbox.


Function | Location | Statistics | Output
Statistics menu option | ArcMap table window or ArcCatalog table preview tab | Count, Minimum, Maximum, Sum, Mean, Standard Deviation, Frequency histogram | Results are displayed in a window
Summary Statistics tool | Analysis toolbox / Statistics toolset | Minimum, Maximum, Sum, Mean, Standard Deviation, Range, First, Last | Results are written to a new table

To summarize a field by one or more other fields (for example, to count the number of parcels in each land use class, sum the area in each land use class, or find the average parcel size in each class), use the Summarize option on the ArcMap table window or the Frequency tool in the Statistics toolset (within the Analysis toolbox) in ArcToolbox.


Function | Location | Statistics | Output
Summarize menu option | ArcMap table window (right-click field name) | Minimum, Maximum, Average (mean), Sum, Standard Deviation, Variance | Results are written to a new table
Frequency tool | Analysis toolbox / Statistics toolset | Count, Sum | Results are written to a new table
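The count-and-sum output of a frequency summary can be sketched as counting rows and summing a numeric field for each unique combination of grouping fields. The field names, rows, and `frequency` helper below are hypothetical and only mimic the general idea, not the tool's actual interface.

```python
# Hypothetical parcel rows: (land use class, zoning code, area in hectares).
rows = [
    ("Residential", "R1", 0.5),
    ("Residential", "R1", 1.5),
    ("Residential", "R2", 1.0),
    ("Commercial", "C1", 2.0),
    ("Commercial", "C1", 4.0),
]


def frequency(rows):
    """Count rows and sum area for each unique (land use, zoning) pair."""
    table = {}
    for land_use, zoning, area in rows:
        entry = table.setdefault((land_use, zoning), {"count": 0, "sum_area": 0.0})
        entry["count"] += 1
        entry["sum_area"] += area
    return table


result = frequency(rows)
for key, entry in result.items():
    print(key, entry)
```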

Spatial statistics

The Spatial Statistics toolbox in ArcToolbox contains a number of statistical routines for analyzing the distribution of a set of features, analyzing patterns, and identifying clusters.


Functional Area | Toolset | Tools
Geographic distribution measurements | Measuring Geographic Distributions | Mean Center, Central Feature, Standard Distance, Directional Distribution (Standard Deviational Ellipse), Linear Directional Mean
Geographic pattern analysis | Analyzing Patterns | Average Nearest Neighbor, Spatial Autocorrelation (Moran's I), High/Low Clustering (Getis-Ord General G)
Geographic cluster analysis | Mapping Clusters | Cluster and Outlier Analysis (Anselin Local Moran's I), Hot Spot Analysis (Getis-Ord Gi*)
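To give a sense of what a pattern-analysis measure computes, here is a minimal sketch of the global Moran's I index using its standard textbook formula. It assumes binary weights between adjacent features arranged in a row; the Spatial Autocorrelation tool offers other ways to define spatial relationships and also reports a z-score and p-value, which are not shown here. The function names and sample values are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for values indexed 0..n-1 and weights {(i, j): w_ij}."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    s0 = sum(weights.values())  # sum of all spatial weights
    cross = sum(w * deviations[i] * deviations[j] for (i, j), w in weights.items())
    return (n / s0) * cross / sum(d * d for d in deviations)


def chain_weights(n):
    """Binary weights for n features in a row: each neighbors the next one."""
    weights = {}
    for i in range(n - 1):
        weights[(i, i + 1)] = 1.0
        weights[(i + 1, i)] = 1.0
    return weights


smooth = morans_i([1.0, 2.0, 3.0, 4.0], chain_weights(4))       # similar neighbors
alternating = morans_i([1.0, 4.0, 1.0, 4.0], chain_weights(4))  # dissimilar neighbors
print(round(smooth, 3), round(alternating, 3))
```

Positive values indicate that similar values tend to be neighbors, and negative values indicate that dissimilar values tend to be neighbors.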

Raster statistics

The Spatial Analyst extension includes several statistical functions that can be used to analyze rasters, primarily to summarize attribute values and assign the summary statistics to cells in a new raster layer. These are located in several different toolsets within the Spatial Analyst toolbox.


Tool | Location | Input | Output | What it does
Cell Statistics | Local Toolset | Multiple rasters | Raster | Calculates the specified statistic for each cell based on multiple inputs
Focal Statistics | Neighborhood Toolset | Raster | Raster | Summarizes the values for a raster within a defined neighborhood around each cell, and assigns the value to that cell in the output raster
Point Statistics | Neighborhood Toolset | Point features | Raster | Summarizes values for point feature attributes within a defined neighborhood, and assigns values to cells in the output raster
Line Statistics | Neighborhood Toolset | Line features | Raster | Summarizes values for line feature attributes within a defined neighborhood, and assigns values to cells in the output raster
Zonal Statistics | Zonal Toolset | Raster, or polygon features | Raster or summary table | Summarizes values of a raster surface by categories or classes (zones) of the input raster or polygon dataset
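The difference between a focal (neighborhood) summary and a zonal summary can be sketched with small rasters stored as nested Python lists. The 3 x 3 window, the mean statistic, the treatment of edge cells, and the sample rasters are all assumptions made for illustration; the Spatial Analyst tools support other neighborhoods and statistics.

```python
def focal_mean(raster):
    """3 x 3 moving-window mean; edge cells average only the cells that exist."""
    rows, cols = len(raster), len(raster[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                raster[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            ]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


def zonal_mean(values, zones):
    """Mean of the value raster within each zone of a same-shaped zone raster."""
    totals, counts = {}, {}
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            totals[zone] = totals.get(zone, 0.0) + value
            counts[zone] = counts.get(zone, 0) + 1
    return {zone: totals[zone] / counts[zone] for zone in totals}


# Hypothetical elevation raster and watershed (zone) raster.
elevation = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
watersheds = [[1, 1, 2], [1, 1, 2], [3, 3, 3]]

smoothed = focal_mean(elevation)
by_watershed = zonal_mean(elevation, watersheds)
print(smoothed[1][1], smoothed[0][0], by_watershed)
```

The focal result is a new raster with one value per cell, while the zonal result has one value per zone.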

Data exploration tools

The Geostatistical Analyst extension, while focusing on the creation of a surface from a set of sample points, also contains a set of tools for visual exploration of data values using charts and graphs. These are often used prior to surface creation to decide which parameters to use for a specific set of data, but can also be used generally to explore your dataset. The tools allow you to explore the distribution of values, whether there is a directional trend in the data, and whether there are relationships between two attributes (for example, to see if the values vary together or inversely). The tools are available from the Explore Data option on the Geostatistical Analyst toolbar.
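As a numeric complement to exploring a relationship between two attributes graphically, a Pearson correlation coefficient indicates whether values tend to vary together (close to 1) or inversely (close to -1). The sketch below is a generic calculation with hypothetical sample values; it is not a description of what the Explore Data tools compute.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross / (spread_x * spread_y)


# Hypothetical attribute values measured at four sample points.
together = pearson_correlation([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
inverse = pearson_correlation([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
print(round(together, 3), round(inverse, 3))
```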