Integrating Spark into spatial data analysis and processing

By Fei Xiao


This project realizes a novel Apache Spark based computing framework for spatial data. It leverages Spark as the underlying engine to achieve better computing performance than Hadoop. Spark RDDs can outperform Hadoop MapReduce by up to 20 times on iterative jobs and can be used interactively to search a 1 TB dataset with latencies of 5 to 7 seconds.
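To make the source of that speedup concrete, here is a minimal sketch of an iterative Spark job that caches its input RDD so that only the first pass reads from HDFS. The file path and the computation itself are illustrative examples, not part of this project.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch"))

    // Load once and cache in memory; later iterations reuse the in-memory
    // partitions instead of re-reading from HDFS on every pass, which is
    // where the advantage over MapReduce-style iteration comes from.
    val points = sc.textFile("hdfs:///data/points.csv") // hypothetical path
      .map(_.split(","))
      .map(a => (a(0).toDouble, a(1).toDouble))
      .cache()

    var radius = 100.0
    for (_ <- 1 to 10) {
      // Each pass is a full scan, but only the first one touches disk.
      val inside = points
        .filter { case (x, y) => x * x + y * y < radius * radius }
        .count()
      println(s"radius=$radius inside=$inside")
      radius *= 0.9
    }
    sc.stop()
  }
}
```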

Figure 1. System Architecture

A four-layer architecture is proposed, from low to high: spatial data storage, spatial RDDs, spatial operations, and a spatial query language. All management of spatial data is organized around the Apache Hadoop and Spark ecosystem. (1) The spatial data storage layer uses HDFS to store large volumes of spatial data, vector or raster, across the distributed cluster. (2) Spatial RDDs are abstract logical datasets of spatial data types; they are shipped to the Spark cluster, where Spark transformations and actions run on them. (3) The spatial operations layer provides a series of operations on spatial RDDs, such as range query, k-nearest-neighbor query, and spatial join. (4) The spatial query language is a user-friendly interface that gives people without a computer science background a comfortable way to invoke the spatial operations.
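As an illustration of layers (2) and (3), the sketch below expresses a range query as a distributed filter over an RDD of points. The Point and Envelope types and the rangeQuery helper are hypothetical names used for illustration; the project's actual spatial RDD API may differ.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Illustrative spatial types; the project's real spatial RDDs may differ.
case class Point(x: Double, y: Double)
case class Envelope(minX: Double, minY: Double, maxX: Double, maxY: Double) {
  def contains(p: Point): Boolean =
    p.x >= minX && p.x <= maxX && p.y >= minY && p.y <= maxY
}

object RangeQuerySketch {
  // A range query is just a distributed filter over the spatial RDD.
  def rangeQuery(points: RDD[Point], window: Envelope): RDD[Point] =
    points.filter(window.contains)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("range-query-sketch"))
    val points: RDD[Point] = sc
      .textFile("hdfs:///spatial/points.csv") // one "x,y" pair per line
      .map(_.split(","))
      .map(a => Point(a(0).toDouble, a(1).toDouble))

    val window = Envelope(-80.0, 35.0, -75.0, 40.0)
    println(s"points in window: ${rangeQuery(points, window).count()}")
    sc.stop()
  }
}
```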

Spatial indexes are used to store the spatial data; Grid and R-tree indexes are implemented for efficient querying. Index building is also part of the system: users can build the different index types from the command line. Indexing proceeds in three phases. (1) Partitioning: the large input file is spatially split into n partitions, and the n rectangles representing the boundaries of those partitions are computed. Each partition should fit in one HDFS block, so an overhead ratio is reserved to account for replicating records that overlap partition boundaries and for storing the local indexes. (2) Local indexing: the requested index structure (e.g., Grid or R-tree) is built in each partition block, followed by the spatial data, so that the index structure and the spatial data reside in the same partition block file. (3) Global indexing: all local index files are concatenated into one file that represents the final indexed file. The global index is constructed by bulk loading and is stored on the name node of the Hadoop cluster.
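The partition count in phase (1) follows from the input size, the HDFS block size, and the overhead ratio. The sketch below shows one plausible way to compute it; the default values and helper names are assumptions, not the project's actual code.

```scala
object PartitionSketch {
  // Choose n so that each partition's data, plus replicated boundary records
  // and its local index, still fits in one HDFS block. The 20% overhead
  // ratio and 128 MB block size are assumed example values.
  def numPartitions(inputBytes: Long,
                    hdfsBlockBytes: Long = 128L * 1024 * 1024,
                    overheadRatio: Double = 0.2): Int =
    math.ceil(inputBytes / (hdfsBlockBytes * (1.0 - overheadRatio))).toInt

  def main(args: Array[String]): Unit = {
    val tenGiB = 10L * 1024 * 1024 * 1024
    // 10 GiB over 128 MiB blocks at 20% overhead yields ~100 partitions.
    println(s"partitions for a 10 GiB input: ${numPartitions(tenGiB)}")
  }
}
```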

Figure 2. R-tree Local Index Structure
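The text above does not name the exact bulk-loading algorithm, so the sketch below uses Sort-Tile-Recursive (STR) packing, a common choice for bulk loading R-trees like the one in Figure 2: entries are sorted by x, cut into vertical slabs, sorted by y within each slab, and packed into full nodes level by level. The types and fan-out are illustrative assumptions.

```scala
case class Rect(minX: Double, minY: Double, maxX: Double, maxY: Double)

sealed trait RNode { def mbr: Rect }
case class Leaf(entries: Seq[Rect], mbr: Rect) extends RNode
case class Inner(children: Seq[RNode], mbr: Rect) extends RNode

object StrBulkLoad {
  val Capacity = 4 // node fan-out; a real index would use a larger value

  private def mbrOf(rs: Seq[Rect]): Rect =
    Rect(rs.map(_.minX).min, rs.map(_.minY).min,
         rs.map(_.maxX).max, rs.map(_.maxY).max)

  // Pack one level: sort by x, cut into vertical slabs, sort each slab by y,
  // then group Capacity entries per node. Assumes a non-empty input.
  private def pack[T](items: Seq[T], x: T => Double, y: T => Double): Seq[Seq[T]] = {
    val slabCount = math.ceil(math.sqrt(items.size.toDouble / Capacity)).toInt.max(1)
    val slabSize  = math.ceil(items.size.toDouble / slabCount).toInt
    items.sortBy(x).grouped(slabSize).toSeq
      .flatMap(_.sortBy(y).grouped(Capacity).toSeq)
  }

  def build(rects: Seq[Rect]): RNode = {
    // Pack the leaf level first, then repeat upward until one root remains.
    var level: Seq[RNode] =
      pack[Rect](rects, r => (r.minX + r.maxX) / 2, r => (r.minY + r.maxY) / 2)
        .map(g => Leaf(g, mbrOf(g)))
    while (level.size > 1) {
      level = pack[RNode](level,
                          n => (n.mbr.minX + n.mbr.maxX) / 2,
                          n => (n.mbr.minY + n.mbr.maxY) / 2)
        .map(g => Inner(g, mbrOf(g.map(_.mbr))))
    }
    level.head
  }
}
```

STR builds each level with a single sort-and-pack pass, which keeps node occupancy near 100% and suits one-shot construction of indexes over static partitions.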

The system is deployed on the Spatiotemporal Hybrid Cloud Platform (STHCP) on a cluster of 8 virtual machines. Each virtual machine has 16 GB of memory and a 120 GB hard drive. We deployed the Spark environment with Cloudera CDH 5 on the cluster. The performance of experiment execution and big data processing on this cluster is impressive.
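As a rough illustration of how such a cluster's resources might be handed to Spark, the sketch below sets executor memory and cores via SparkConf. The specific values are assumptions derived from the 16 GB VMs, not the project's recorded configuration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClusterConfSketch {
  def main(args: Array[String]): Unit = {
    // Assumed settings: leave headroom on each 16 GB VM for the OS and
    // HDFS daemons, so give each executor roughly 12 GB.
    val conf = new SparkConf()
      .setAppName("spatial-framework")
      .set("spark.executor.memory", "12g")
      .set("spark.executor.cores", "4")
    val sc = new SparkContext(conf)
    println(s"default parallelism: ${sc.defaultParallelism}")
    sc.stop()
  }
}
```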

Apply for Cloud Resources

Requesting access to Spatiotemporal Hybrid Cloud Platform resources is fairly easy. Simply fill out the online application and submit a one-page proposal (use your project name as the file name) describing the project objectives. Apply now