Mining and Utilizing Dataset Relevancy from Oceanographic Dataset (MUDROD) Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access

By Chaowei (Phil) Yang | Edward M Armstrong | Thomas Huang | David Moroni | Yongyao Jiang | Yun Li


Background and goal: A massive number of geospatial datasets are archived and made available through online web discovery and access methods. However, finding the right data for scientific research and application development is still a challenge. We propose to mine and utilize the combination of Earth Science dataset metadata, usage metrics, and user feedback to objectively extract relevance for improved data discovery and access across a NASA Distributed Active Archive Center (DAAC) and other data centers. As a point of reference, the Physical Oceanographic Distributed Active Archive Center (PO.DAAC) aims to provide datasets that help scientists select the Earth observation data that best fit their needs in various aspects of Physical Oceanography.

Specific objectives: This project will focus on the following objectives and activities: 1) Analyzing data access logs to find implicit relations between datasets and keywords; constructing a knowledge base by combining semantics and a profile analyzer; improving data discovery by providing better-ranked results, recommendations, and ontology navigation; 2) Leveraging the PO.DAAC data science expertise and user communities to a) capture the ocean science data context and record dataset relevance metrics as triple stores, b) analyze and mine user search and download patterns, c) test the developed system in an experimental environment, d) integrate the system into the PO.DAAC testbed and test the feasibility of integration for open usage and feedback; 3) Laying the groundwork for an objective mining and extraction service for data relevance that works with other data search and discovery systems, such as ECHO, GEOSS clearinghouse, and Data.gov, for data sharing across NASA and non-NASA data systems. The proposed technology has the potential to enhance the NASA Earth Science data discovery experience by enabling scientists to discover and select the datasets most relevant to their scope of interest more efficiently and objectively.


Status: So far, we have reconstructed user sessions from raw data access logs. Session reconstruction consists of four steps: user identification, log synchronization, crawler detection, and session identification. The following figure shows an example of the set of sessions generated by the algorithms we developed, and Table 1 shows the keywords searched in sample sessions. Based on the extracted keywords, future work will integrate the search patterns with existing oceanographic ontologies by building a structured knowledge base.
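The session reconstruction steps named above can be sketched roughly as follows. This is a simplified illustration, not the algorithms developed in the project. It assumes that users are identified by IP address, that log synchronization means merging several logs (for example HTTP and FTP) into one timeline, that crawlers can be spotted from user-agent substrings, and that a session ends after 30 minutes of inactivity. The `LogEntry` type, the thresholds, and all names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby

SESSION_GAP = timedelta(minutes=30)            # assumed inactivity threshold
CRAWLER_HINTS = ("bot", "crawler", "spider")   # assumed user-agent markers


@dataclass
class LogEntry:
    ip: str
    user_agent: str
    timestamp: datetime
    request: str


def is_crawler(entry):
    """Crawler detection: a naive check on the user-agent string."""
    agent = entry.user_agent.lower()
    return any(hint in agent for hint in CRAWLER_HINTS)


def reconstruct_sessions(*logs):
    """Merge several logs and split each user's requests into sessions."""
    # Log synchronization: pool all entries and drop crawler traffic.
    entries = [e for log in logs for e in log if not is_crawler(e)]
    # User identification: group by IP address, ordered by time.
    entries.sort(key=lambda e: (e.ip, e.timestamp))
    sessions = []
    for _, user_entries in groupby(entries, key=lambda e: e.ip):
        current = []
        for entry in user_entries:
            # Session identification: a long pause starts a new session.
            if current and entry.timestamp - current[-1].timestamp > SESSION_GAP:
                sessions.append(current)
                current = []
            current.append(entry)
        if current:
            sessions.append(current)
    return sessions


t0 = datetime(2015, 1, 1, 12, 0)
http_log = [
    LogEntry("1.1.1.1", "Mozilla", t0, "search?q=sst"),
    LogEntry("1.1.1.1", "Mozilla", t0 + timedelta(minutes=50), "search?q=wind"),
    LogEntry("2.2.2.2", "Googlebot", t0, "/"),
]
ftp_log = [LogEntry("1.1.1.1", "", t0 + timedelta(minutes=5), "GET file.nc")]
sessions = reconstruct_sessions(http_log, ftp_log)
```

In this toy input, the search and the download five minutes later fall into one session, the search 50 minutes after the first request starts a second session, and the crawler request is discarded.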

The Spatiotemporal Hybrid Cloud Platform supports MUDROD in several ways: 1) data storage and processing (one VM with 20 GB of memory, 250 GB of disk, and 24 CPU cores); 2) result publishing; 3) hosting of the collaboration website (Confluence).

To address the challenges of computing intensity and to give users high-performance access to resources, the entire MUDROD system is deployed on the OpenStack-based Spatiotemporal Hybrid Cloud Platform. Our project benefits from the platform in two ways. First, it automatically distributes end-user requests across multiple instances, which reduces response time, supports a larger number of concurrent requests, and improves fault tolerance. Second, it enables high-performance computing by providing VM clusters to process the large volume of data access logs in our project.

Apply for Cloud Resources

Requesting access to Spatiotemporal Hybrid Cloud Platform resources is straightforward. Fill out the online application and submit a one-page proposal (use your project name as the file name) that describes the project objectives.