Prospective and Retrospective Provenance Queries Using YesWorkflow, RDF, and SPARQL

Hui Lyu

Hui Lyu recently obtained a master's degree in Library and Information Science from the School of Information Sciences at UIUC. She will pursue a Ph.D. in Computer and Information Science at the University of Pennsylvania starting in Fall 2017. Hui received her bachelor's degree in Electronic Engineering from Beihang University (BUAA) in China. Her research interests include data provenance in databases, scientific workflows, and distributed systems. In her spare time, Hui enjoys listening to emotional and soul songs, dancing, and watching drama.

Project Description: 

YesWorkflow (YW) defines a set of annotations for declaring the expected dataflow patterns in scripts written in any text-based programming language. The YW toolkit extracts these annotations from comments in source code and builds a ProvONE-compatible workflow model of the script, which can then be rendered graphically. YW also lets the user export its representation of the workflow model as a set of Prolog or Datalog facts, which can then be queried and used to create ad hoc visualizations of all or part of the model. Further, YW can reconstruct key runtime events, and even data values, that occurred during a run of the script by joining the YW model (prospective provenance) with observations made during or after the script run, for example the values of metadata embedded in the names of files and directories created by the script. YW can export this reconstructed retrospective provenance as Prolog facts as well. Finally, the prospective and retrospective provenance facts can be queried together, enabling hybrid provenance queries and visualizations that are of immediate use to a researcher reviewing the results of a script run or reporting those results to others.
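To make the extraction step concrete, here is a minimal Python sketch of the idea, not the YW toolkit itself. It scans comments for the `@begin`, `@in`, `@out`, and `@end` annotations and emits Prolog-style facts. The sample script, the predicate names (`program`, `has_in_port`, `has_out_port`), and the flat, non-nested handling of blocks are all assumptions made for illustration; the facts that YW actually exports follow its own schema.

```python
import re

# Hypothetical annotated script; only the comment annotations matter here.
SCRIPT = """\
# @begin clean_data
# @in raw_file
# @out clean_file
def clean_data(): ...
# @end clean_data
# @begin plot_data
# @in clean_file
# @out figure
def plot_data(): ...
# @end plot_data
"""

# Matches a comment annotation such as "# @in raw_file".
ANNOTATION = re.compile(r"#\s*@(begin|in|out|end)\s+(\w+)")


def extract_facts(source):
    """Return Prolog-style fact strings for each annotated block.

    The predicate names are invented for this sketch and nested blocks
    are not handled.
    """
    facts = []
    current = None  # name of the block we are currently inside, if any
    for line in source.splitlines():
        match = ANNOTATION.search(line)
        if not match:
            continue
        tag, name = match.groups()
        if tag == "begin":
            current = name
            facts.append(f"program({name}).")
        elif tag == "end":
            current = None
        elif current is not None:
            # tag is "in" or "out": record a port on the enclosing block
            facts.append(f"has_{tag}_port({current}, {name}).")
    return facts


facts = extract_facts(SCRIPT)
for fact in facts:
    print(fact)
```

Because `clean_file` appears as an output of one block and an input of the next, the resulting facts already encode the dataflow edge between the two steps, which is the kind of relationship that prospective provenance queries rely on.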

The goal of the current project is to enable all of the provenance information that YesWorkflow can collect and export as Prolog facts to be exported, alternatively, as an RDF representation. The resulting RDF documents should be both easy to read directly and easy to query using SPARQL. We hypothesize that all of the queries we have previously demonstrated with Prolog/Datalog can also be implemented in SPARQL 1.1. The challenge will be finding an intuitive way of representing prospective and retrospective provenance in RDF that also facilitates scientifically meaningful queries about how particular script products are derived via the computational steps in the script and the dataflows between them. This work, to be carried out by the intern, will not entail any modification of the YW tool itself. Rather, the intern will speculatively author RDF documents representing YW workflow models and provenance, query these documents with SPARQL, and iteratively improve both the RDF representations and the SPARQL queries until as many of the desired queries as possible are supported. YesWorkflow will subsequently be updated to automatically generate the final version of the RDF representation designed in this project.
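As an illustration of the kind of derivation query discussed above, the following Python sketch stores a few retrospective provenance statements as plain (subject, predicate, object) triples and computes everything a script product was derived from. It is an assumption-laden sketch, not the representation designed in this project: the `ex:` resource names are hypothetical, the PROV-O properties `prov:wasGeneratedBy` and `prov:used` are just one possible vocabulary, and the traversal is done in plain Python rather than by a SPARQL engine. The `LINEAGE_QUERY` string shows roughly how the same question could be phrased with a SPARQL 1.1 property path; it is included for comparison and is not executed.

```python
# Triples as (subject, predicate, object). All ex: names are hypothetical.
TRIPLES = {
    ("ex:figure", "prov:wasGeneratedBy", "ex:plot_data_run"),
    ("ex:plot_data_run", "prov:used", "ex:clean_file"),
    ("ex:clean_file", "prov:wasGeneratedBy", "ex:clean_data_run"),
    ("ex:clean_data_run", "prov:used", "ex:raw_file"),
}

# Roughly the SPARQL 1.1 property-path query that the function below emulates.
LINEAGE_QUERY = """
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ex: <http://example.org/>
SELECT ?upstream WHERE {
  ex:figure (prov:wasGeneratedBy/prov:used)+ ?upstream .
}
"""


def objects(subject, predicate):
    """Return all objects o such that (subject, predicate, o) is a triple."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def upstream_data(product):
    """Follow wasGeneratedBy then used, one or more times, from a product."""
    found = set()
    frontier = [product]
    while frontier:
        node = frontier.pop()
        for run in objects(node, "prov:wasGeneratedBy"):
            for used in objects(run, "prov:used"):
                if used not in found:
                    found.add(used)
                    frontier.append(used)
    return found


lineage = upstream_data("ex:figure")
print(sorted(lineage))
```

The transitive step is the part that recursive Prolog/Datalog rules express directly; in SPARQL 1.1 the `+` path operator plays a comparable role, which is one reason the hypothesis above targets version 1.1 specifically.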

Primary Mentor: 
Bertram Ludäscher
Secondary Mentor: 
Timothy McPhillips