Tiled processing of large datasets


To improve the performance and scalability of feature overlay tools such as Union and Intersect, operational logic called adaptive subdivision processing was added. This logic is triggered when the data cannot be processed within the available amount of physical memory. To stay within the bounds of physical memory, which greatly improves performance, processing is done incrementally on subdivisions of the original extent. Features that straddle the edges of these subdivisions (also called tiles) are split at the tile edge and then reassembled into a single feature during the last stage of processing. The vertices introduced at these tile edges will remain in the output features.

Why subdivide the data?

The overlay analysis tools perform best when processing can be done within your machine's physical memory (or RAM). This may not always be possible when working with datasets that contain either a large number of features or very complex features with hundreds of thousands or millions of vertices. Previously, when physical memory was exhausted, virtual memory was used, and when that was exhausted, an internal paging system was used. Each successive mode of memory management is orders of magnitude slower than the previous one.

How do I know when the process was tiled?

Review the messages returned by a tool during or after execution to determine whether the input data was tiled. If adaptive subdivision processing occurred, the third line of messages will state "Processing Tiles..." Otherwise, the input data was not subdivided, and the third line will state "Cracking Features..."

Example of the messages from a process that was not subdivided.

Executing (Identity_1): Identity c:\gp\fgdb.gdb\rivers c:\gp\fgdb.gdb\pf_watersheds c:\gp\fgdb.gdb\rivers_ws
Reading Features...
Cracking Features...
Assembling Features...
Executed (Identity_1) successfully.

Example of the messages from a process that was subdivided.

Executing (Identity_1): Identity c:\gp\fgdb.gdb\rivers c:\gp\fgdb.gdb\pf_watersheds c:\gp\fgdb.gdb\rivers_ws
Reading Features...
Processing Tiles...
Assembling Features...
Executed (Identity_1) successfully.
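When running tools from a script, the same check can be automated by scanning the returned messages (for example, the lines from arcpy.GetMessages()). The helper below is a minimal sketch, not part of the tools themselves; the sample message lists mirror the two examples above, with the tool arguments elided.

```python
# Hypothetical helper: given a tool's message lines, report whether
# adaptive subdivision (tiling) occurred. The marker strings are the
# ones shown in the message examples above.
def was_tiled(messages):
    """Return True if the messages indicate tiled processing."""
    return any("Processing Tiles..." in line for line in messages)

tiled_run = [
    "Executing (Identity_1): Identity ...",
    "Reading Features...",
    "Processing Tiles...",
    "Assembling Features...",
    "Executed (Identity_1) successfully.",
]
untiled_run = [
    "Executing (Identity_1): Identity ...",
    "Reading Features...",
    "Cracking Features...",
    "Assembling Features...",
    "Executed (Identity_1) successfully.",
]

print(was_tiled(tiled_run))    # True
print(was_tiled(untiled_run))  # False
```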

What do the tiles look like?

Every process starts with a single tile that spans the entire extent of the data. If the data in the single tile is too large to be processed in physical memory, it is subdivided into four equal tiles (using a quadtree approach). Processing then begins on a sub-tile, which is further subdivided if the data in this second level of tiles is again too large. This continues until the data within each tile can be processed within physical memory. See the example below.

Extent of input datasets

The footprint of all the input features.

GP tile level 1

The process begins with a tile that spans the entire extent of all datasets. For reference, this is called tile level 1.

GP tile level 2

If the data is too large to process in memory, the level 1 tile is subdivided into four equal tiles. These 4 sub-tiles are called level 2 tiles.

GP tiles adaptive

Based on the size of data in each tile, some tiles are further subdivided, while others are not.
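The quadtree walkthrough above can be sketched in a few lines of code. This is only an illustration of the subdivision pattern, not the tools' actual implementation: the real decision to split is based on whether a tile's features fit in physical memory, which is stood in for here by a caller-supplied fits_in_memory(extent) predicate (an assumption for illustration).

```python
# Adaptive quadtree subdivision sketch. An extent is (xmin, ymin, xmax, ymax).
def subdivide(extent, fits_in_memory, level=1, max_level=10):
    """Return the list of leaf tiles covering the extent."""
    if fits_in_memory(extent) or level == max_level:
        return [extent]
    xmin, ymin, xmax, ymax = extent
    xmid, ymid = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    quads = [  # the four equal level-(n+1) tiles
        (xmin, ymin, xmid, ymid), (xmid, ymin, xmax, ymid),
        (xmin, ymid, xmid, ymax), (xmid, ymid, xmax, ymax),
    ]
    tiles = []
    for q in quads:
        tiles.extend(subdivide(q, fits_in_memory, level + 1, max_level))
    return tiles

# Toy predicate: pretend tiles narrower than 50 units always fit, and the
# right half of the extent is sparse enough to fit at level 2. The left
# half is subdivided again, the right half is not - the adaptive pattern
# described above.
fits = lambda e: (e[2] - e[0]) < 50 or e[0] >= 50
tiles = subdivide((0, 0, 100, 100), fits)
print(len(tiles))  # 10: two level-2 tiles plus eight level-3 tiles
```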

Upon completion of a process that required tiling, the tiles are written to the following shapefile: c:\Documents and Settings\UserName\Local Settings\Temp\OverlayTile.shp.

Which tools use subdivisions?

The following tools in the Analysis Tools toolbox use subdivision logic when dealing with large data:

-Clip

-Erase

-Identity

-Intersect

-Union

-Split

-Symmetrical Difference

Process fails with an "Out of memory" error

The subdivision approach will not help when processing extremely large features, that is, features with many millions of vertices. Splitting and reassembling extremely large features multiple times across tile boundaries is very costly in terms of memory and may cause "Out of memory" errors if the feature is too large. It is recommended that these features be broken up into smaller features. Road casing for an entire city or a polygon representing a river estuary are examples of very large features with many vertices.
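In ArcGIS, breaking up oversized features is typically done with a tool such as Dice. The sketch below just shows the underlying idea on a bare vertex list (the function name and vertex threshold are illustrative, not part of any tool's API): one long polyline is cut into pieces of at most max_vertices, repeating the shared endpoint so the pieces still connect.

```python
# Split a polyline (a list of (x, y) vertices) into connected pieces
# of at most max_vertices vertices each. Consecutive pieces share an
# endpoint so no geometry is lost.
def split_polyline(vertices, max_vertices):
    parts = []
    start = 0
    while start < len(vertices) - 1:
        end = min(start + max_vertices - 1, len(vertices) - 1)
        parts.append(vertices[start:end + 1])
        start = end  # reuse the last vertex as the next piece's first
    return parts

line = [(float(i), 0.0) for i in range(10)]  # a 10-vertex polyline
parts = split_polyline(line, max_vertices=4)
print([len(p) for p in parts])  # [4, 4, 4]
```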

The "Out of memory" error can also occur if a second application is started while a tool is processing. The second application can reduce the amount of available physical memory, invalidating the memory calculation for the tile currently being processed and causing the tool to demand more physical memory than is available. It is recommended that no other operations be performed on the machine while overlay processing large datasets.

What data format is recommended when working with large data?

Shapefiles and personal geodatabases (which consist of a single .mdb file) have a 2 GB file size limit. If the result of a process is going to be a large feature class, writing that output to a shapefile or personal geodatabase could exceed this 2 GB limit. Enterprise and file geodatabases do not have this limitation, so they are recommended as the output workspace when working with very large datasets.

See Also