Andreas: Biogeochemical modeler


Photo credit: http://bit.ly/1WiVgBC
(CC BY-NC-ND 2.0)
The person represented here is not affiliated with DataONE and use of their image does not reflect endorsement of DataONE services.

Name, age, and education: 

Andreas is a biogeochemical modeler at Michigan State University with a PhD from the Max Planck Research School for Global Biogeochemical Cycles, received 16 years ago.

Life or career goals, fears, hopes, and attitudes: 

Andreas has been a modeler throughout his career. For Andreas, the coming flood of data and the growing number of analytical and visualization tools are extremely exciting, and he seeks ways to stay at the forefront of this rapidly moving field.

A day in the life: 

Andreas has written several models of varying complexity. Right now he is in the final stages of developing a model that predicts plankton dynamics in Tolo Harbour, Hong Kong, using nitrogen units. He stores his model runs on a server at his institution. Each run is saved in a folder (named with the date and a runID) as a NetCDF file, sometimes with a text file of notes on the run. However, some of his older output files are CSV. To visualize the model output, he loads the NetCDF or CSV file into MATLAB. Andreas assesses his model by comparing model output and real-world monitoring data on two axes: time (day of the year) and location. In addition to graphical examination, he performs a Model Skill Assessment (MSA) to assess the accuracy of his model statistically. It is the MSA that others will use to judge the quality of his model.
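To make the comparison concrete, a skill assessment of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the persona does not say which statistics Andreas computes, so the metrics here (bias, root-mean-square error, Pearson correlation, and modeling efficiency) are common choices rather than his actual method, and the function name, station labels, and numbers are all hypothetical.

```python
import math


def skill_metrics(model, observed):
    """Compare model output with observations matched on (day_of_year, station).

    Both arguments are dicts keyed by (day_of_year, station) tuples.
    Only keys present in both dicts are compared.
    """
    keys = sorted(set(model) & set(observed))
    if len(keys) < 2:
        raise ValueError("need at least two matched points")

    m = [model[k] for k in keys]
    o = [observed[k] for k in keys]
    n = len(keys)
    mean_m = sum(m) / n
    mean_o = sum(o) / n

    # Sum of squared model-minus-observation errors.
    sse = sum((mi - oi) ** 2 for mi, oi in zip(m, o))
    # Sums of squared deviations from each mean.
    ss_m = sum((mi - mean_m) ** 2 for mi in m)
    ss_o = sum((oi - mean_o) ** 2 for oi in o)
    cov = sum((mi - mean_m) * (oi - mean_o) for mi, oi in zip(m, o))

    nan = float("nan")
    return {
        "n": n,
        "bias": mean_m - mean_o,
        "rmse": math.sqrt(sse / n),
        "correlation": cov / math.sqrt(ss_m * ss_o) if ss_m > 0 and ss_o > 0 else nan,
        # Modeling efficiency: 1 is a perfect fit, 0 is no better than the observed mean.
        "efficiency": 1 - sse / ss_o if ss_o > 0 else nan,
    }


# Hypothetical values, for illustration only (not real monitoring data).
model_run = {(60, "S1"): 2.0, (60, "S2"): 6.0, (61, "S1"): 4.0}
monitoring = {(60, "S1"): 1.0, (60, "S2"): 5.0, (61, "S1"): 3.0, (62, "S2"): 9.0}
print(skill_metrics(model_run, monitoring))
```

Matching on a (day, station) key mirrors the two axes described above, and observations with no corresponding model value are simply left out of the comparison.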

Reasons for using DataONE to share and to reuse data
Needs and expectations of DataONE tools: 

To assess his latest model, Andreas wants to be able to compare his prediction of maximum spring bloom biomass with what was actually observed. Monitoring data are gathered by the government and are available online, but significant work must be put into the dataset before it can be used. If DataONE could provide the data in a more usable format, it would save considerable work.

Like most modelers, Andreas hopes that his models can both elucidate the specific biogeochemical dynamics of the area in question and be applicable to comparable systems elsewhere. Currently, too much of his time goes to acquiring and managing data in the specific context of his research, and he cannot afford to test the applicability of his model to other systems. DataONE could potentially solve that problem, expanding Andreas' research capabilities and revealing his work to a broader base of researchers.

Intellectual and physical skills that can be applied: 

Andreas has significant programming knowledge in MATLAB and Fortran. He is likely to be able to overcome functional deficiencies in DataONE tools as long as he doesn't have to spend too much time cleaning up the datasets themselves. Andreas' work is likely to illustrate some of the more powerful analytical capabilities gained from sharing datasets via DataONE, but only if he uses the data referencing protocols so that users can track those links to his models.

Technical support available: 

Andreas is part of a highly sophisticated technical community, with whom he can work both formally and informally. However, he has no additional technical support for his own work beyond himself.

Personal biases about data sharing and reuse (and data management more generally): 

Andreas needs real world data about his area of interest, both to calibrate and to assess his model.
Some of Andreas' model code is publicly available and open source, and he has already published on most of it. However, he is unsure what he should do with his model output. Normally it just sits on his servers and is used for a couple of years by him alone. Other researchers have asked to look at his code, but no one has asked about his model output. Andreas doesn't think the older model output is very useful, in part because it is difficult for anyone other than him to understand. He would like to delete it to free up server space for new output, but worries that he might be losing something of value.

Comparison of current and DataONE-enabled practices:
Current data collection: 


DataONE enabled data collection: 

No change.

Current data assurance: 

Andreas does not deposit data, so he has no assurance steps for his own data. He does worry about how far he can trust the data he uses to validate his model, but he has few tools to check those data himself and instead relies on the checks that the repositories he uses apply to data uploads.

DataONE enabled assurance: 
  • Andreas may find it helpful to use DataONE tools to double-check the validity of interesting data in the system.
  • He may also use templates provided by DataONE as guides for assuring the quality of any aggregated datasets he generates and republishes, and also perhaps for any model outputs he wants to contribute.
Current data description: 

Andreas does not deposit or retain any data once he has completed a model run. The only information he needs is a reference back to the publication of the original data.

DataONE enabled description: 
  • To the extent that Andreas publishes aggregated datasets he creates, he will likely benefit from some guidelines or tools to manage the process of describing his data. This will be especially important for Andreas because it is otherwise unclear how to apply useful metadata to a dataset which is itself composed of heterogeneous subsets of other datasets.
  • Andreas already understands how important it is to describe data well since his work relies on being able to discover data relevant to his research questions. This implies that he may be likely to devote more time than most scientists to data description activities as part of his reciprocity to the DataONE community.
Current data preservation: 
  • Andreas does not deposit data currently. He wishes everyone else would, though.
  • Andreas does not preserve his data. However, his work depends on finding and analyzing existing data which remain accessible and identifiable with a stable identifier (URI or DOI) so that peers can evaluate his models correctly.
DataONE enabled preservation: 
  • To the extent that Andreas can and wants to deposit his aggregated datasets, DataONE may be a key facilitator. Andreas' greatest motivation is likely to be a sense of obligation to “give back” to the community of data depositors upon which he depends for his work.
  • Andreas is not so interested in data preservation except to the extent that others preserve their data. Nonetheless, he would be happier contributing his data to a repository if he knew that preservation services were sound and the data useful.
Current data discovery: 
  • Andreas spends an enormous amount of time searching for and discovering data that are relevant for his models. This process is so difficult that he usually limits himself to government data repositories where at least the data are likely to achieve a known standard, are presented in some standard form, and have no restrictions on use.
  • Because Andreas must rely on relatively “raw” data due to limited availability of data in other forms, he also spends a lot of time transforming the data to meet his analytical needs and the limitations of his models. These processing steps are captured in his methods but are not otherwise recorded or automated in any manner.
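One way such processing steps could be recorded automatically is sketched below. It is a minimal illustration, not a description of any DataONE tool or of Andreas' actual workflow: the step functions, field names, and sample rows are hypothetical, and the log format is an assumption made for this example.

```python
import hashlib
import json
from datetime import date


def fingerprint(rows):
    """Return a SHA-256 digest identifying the exact content of a dataset."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_pipeline(rows, steps):
    """Apply each (name, function) step in order and log what it did."""
    log = []
    for name, func in steps:
        rows_in = len(rows)
        input_digest = fingerprint(rows)
        rows = func(rows)
        log.append({
            "step": name,
            "rows_in": rows_in,
            "rows_out": len(rows),
            "input_sha256": input_digest,
            "output_sha256": fingerprint(rows),
        })
    return rows, log


def drop_missing(rows):
    """Hypothetical step: discard records with no measured value."""
    return [r for r in rows if r["value"] is not None]


def add_day_of_year(rows):
    """Hypothetical step: derive day of the year from an ISO date string."""
    return [
        dict(r, day_of_year=date.fromisoformat(r["date"]).timetuple().tm_yday)
        for r in rows
    ]


# Made-up records standing in for raw monitoring data.
raw = [
    {"date": "2001-03-01", "station": "S1", "value": 3.2},
    {"date": "2001-03-02", "station": "S1", "value": None},
    {"date": "2001-03-02", "station": "S2", "value": 4.1},
]
cleaned, log = run_pipeline(
    raw, [("drop_missing", drop_missing), ("add_day_of_year", add_day_of_year)]
)
print(json.dumps(log, indent=2))
```

Because each step returns new records rather than modifying its input, the output digest of one step equals the input digest of the next, which gives a simple chain linking the raw data to the transformed dataset.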
DataONE enabled discovery: 
  • If DataONE can save Andreas time in discovering useful data, as well as reveal additional datasets of interest, he will be a power user very quickly.
  • Andreas believes that the processing steps he takes to render raw data suitable for subsequent analysis are valuable, but he has never had any place to deposit such datasets, and there are no guidelines regarding norms of attribution and other variables. If DataONE can accept these datasets and automate the provenance and attribution aspects, this is likely to convert Andreas from a strict consumer of data to a data contributor (or, more precisely, an enhancer).
  • Andreas is also curious to know who might find his datasets and analyses to be of interest, as he is interested in collaboration but has lacked clear pathways to identifying possible collaborators outside of his tight circle of peers. DataONE may open up those possibilities for him.
Current data integration: 
  • As with data discovery, Andreas currently spends huge quantities of time integrating datasets, where possible, for his analyses.
  • Andreas also limits the potential of his work by intentionally steering clear of problems that require complicated integrations. It is simply too difficult to manage the process and takes too much of his time for the payoff.
DataONE enabled integration: 
  • If DataONE provides a toolset for integrating disparate datasets, Andreas is likely to become a power user.
  • Andreas sees integrated (aggregate) datasets as potentially valuable contributions of his work. If DataONE can accommodate deposition of such datasets, with as little fuss as possible, Andreas will probably become a major contributor.
Current data analyses: 
  • Andreas' analyses consist of model runs, using the data he discovered and integrated. Different models produce different outputs, and it is not clear that the outputs are of interest to anyone outside of their utility in validating the model.
  • Andreas labors under the presumption that his models may be useful for answering empirical questions, but he isn't personally involved in such efforts, though he often wishes he could be.
DataONE enabled analysis: 
  • It is likely that Andreas would still export his integrated datasets in order to test his models outside of DataONE. However, DataONE may provide the means for him to refer to different model runs and the associated data more easily, and also to tweak the variables and rerun models more easily, tracking those changes each time.
  • For some models, the outputs might actually be of interest, and reference to the original data, as well as the model itself, would make such outputs easier for other people to interpret and build on. DataONE would need to provide functionality along these lines in order for Andreas to consider depositing the model runs in this manner.
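The idea of referring to model runs together with their input data, and of tracking what changed between runs, can be sketched as a small manifest file stored in each run folder. This is a hypothetical illustration only: it does not use any DataONE API, and the manifest fields, folder-name format, parameter names, and placeholder identifiers are all assumptions made for the example.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def build_manifest(run_date, run_id, parameters, input_identifiers):
    """Describe one model run: when it ran, its settings, and the data it used."""
    return {
        "run_id": run_id,
        "run_date": run_date.isoformat(),
        "parameters": dict(parameters),
        "input_identifiers": list(input_identifiers),
    }


def write_manifest(base_dir, manifest):
    """Write manifest.json into a folder named with the run date and run ID."""
    folder = Path(base_dir) / f"{manifest['run_date']}_{manifest['run_id']}"
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")
    return path


def changed_parameters(old, new):
    """Return {name: (old_value, new_value)} for parameters that differ."""
    names = sorted(set(old["parameters"]) | set(new["parameters"]))
    return {
        name: (old["parameters"].get(name), new["parameters"].get(name))
        for name in names
        if old["parameters"].get(name) != new["parameters"].get(name)
    }


# Hypothetical parameter names and a placeholder identifier, for illustration only.
inputs = ["doi:placeholder-for-monitoring-dataset"]
run_a = build_manifest(date(2015, 4, 1), "run001",
                       {"max_growth_rate": 1.0, "grazing_rate": 0.2}, inputs)
run_b = build_manifest(date(2015, 4, 2), "run002",
                       {"max_growth_rate": 1.2, "grazing_rate": 0.2}, inputs)

with tempfile.TemporaryDirectory() as base:
    print(write_manifest(base, run_a).name)
    print(changed_parameters(run_a, run_b))
```

Keeping the identifiers of the input data next to the parameters is what would let someone else trace a given output back to both the model settings and the original datasets.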

Data Conservancy Zoe persona by Anne Thessen, interview with David Keller and comments from Raleigh Hood and Sam (DataONE scenario). Revised by Kevin Crowston.