I want to search


We've launched a new website!

You're currently accessing the archived version of the DataONE website. To see our new design and keep up to date with the latest DataONE news, visit our new website at https://dataone.org

Prospective and Retrospective Provenance Queries Using YesWorkflow, RDF, and SPARQL 2

Linh Hoang
Linh Hoang

Linh Hoang is a first year PhD student at School of Information Sciences at University of Illinois Urbana - Champaign. Her research interests include information management, knowledge discovery, and data analytics. She is motivated in building smarter information systems that can help people get insights from data and make important decisions, without the hassle of going through the laborious work of collecting and disambiguating knowledge. Outside of work and research, Linh enjoys spending time with family, cooking and baking.

Project Description: 

YesWorkflow (YW) defines a set of annotations for declaring the expected dataflow patterns in scripts written in any text-based programming language. The YW toolkit extracts these annotations from comments in source code and builds a ProvONE-compatible workflow model of the script which can then be rendered graphically. YW also enables the user to export its representation of the workflow model as a set of Prolog or Datalog facts which can then easily be queried and used to create ad hoc visualizations of all or part of the model. Further, YW can reconstruct key runtime events and even data values that occurred during a run of the script by joining the YW model (prospective provenance) with observations made either during or after the completion of the script run, e.g. the values of metadata embedded in file names and directories created by the script. YW can export this retrospective and reconstructed provenance information as Prolog facts as well. Finally, the prospective and retrospective provenance facts can be queried together, enabling even more useful, hybrid provenance queries and visualizations that are of immediate use to the researcher reviewing the results of a script run or reporting their results to others.

The goal of the current project is to enable all of the provenance information that can be collected by YesWorkflow and exported to Prolog facts, to be exported alternatively to an RDF representation. The goal is to produce RDF documents that are both easy to read directly and also easy to query using SPARQL. We hypothesize that all of the queries that we have previously demonstrated with Prolog/Datalog can also be implemented in SPARQL 1.1. The challenge will be finding an intuitive way of representing prospective and retrospective provenance in RDF that also facilitates scientifically meaningful queries about the derivation of particular script products via the computational steps in the script and the dataflows between them. The above work, to be carried out by the intern, will not entail any modification of the YW tool itself. Rather, the intern will speculatively, author RDF documents representing YW workflow models and provenance, query these documents with SPARQL, and iteratively improve both the RDF representations and the SPARQL queries until as many as possible of the desired queries are supported. YesWorkflow will subsequently be updated to automatically generate the final version of the RDF representation designed in this project.

This internship is supported by the Ludäscher Lab at UIUC.

Primary Mentor: 
Bertram Ludäscher
Secondary Mentor: 
Timothy McPhillips