Case study:

How do we help people apply different licenses to research data in ways that other people and systems can understand?

My role

Conducted and presented quantitative data analysis that tested assumptions about the problem and the effectiveness of certain solutions
Guided participatory design with mockups
Reviewed solutions with users, collected and reviewed feedback
Presented research and proposal during the software community’s annual conference

Methods used

Data analysis
Participatory design
Evaluative research

Results

My analysis of the licenses and terms of use of already published datasets helped us test our assumptions about how depositors were and should be specifying terms of use. This helped us identify users with needs we should learn more about and led to designing a more robust solution.

Overview

Because the Dataverse Project’s mission has centered around the free and open use of research data, the community designed the repository software to encourage and facilitate the waiving of any rights that depositors may have to the data they’re publishing.

As a result, the software applied a CC0 public domain waiver to deposits by default, but nothing in the deposit workflow indicated this.

During the deposit workflow, depositors weren't told that the public domain CC0 waiver was applied to their data

And to change the license, depositors would need to type a license or terms of use in free text fields.

Depositors needed to click the "Terms" tab to find out which license applied and to change it by typing information in one or more free text fields

Over the years we heard from depositors that the default application of CC0 was not obvious, so some depositors published datasets without knowing that the software applied CC0, and had to take extra steps to apply more appropriate terms of use. How could we avoid this?

The project’s growing community also expressed interest in making it easier for depositors to apply different terms of use and in making those choices more human and machine readable. How could we help people, search engines and discovery systems determine more easily how research data with restrictions can be used and improve searching by licenses and terms of use?

Specifically, the community recommended:

Making it easier for depositors to recognize throughout the deposit workflow which terms are set for the data they’re depositing
Shortening the amount of time it takes for data depositors to apply terms of use to data
Making it easier for machines to index and expose in search results the licenses and terms of use that depositors choose
Letting repository administrators control which licenses and terms of use their depositors can apply to their data

What I did

The team I’m on, which leads design and development of the open source Dataverse software, consulted with a group in the community working on solutions that followed these recommendations. This was our team’s first attempt at working with the open source community in a more formal arrangement. We agreed to ship solutions within four months of that meeting.

The contract required that the solution add a dropdown list of predefined licenses and terms of use to the software's deposit form, something other data repositories were doing and members of the community had recommended.

Screenshots of four types of license selectors that other data repositories use

The group we consulted for wrote a high-level proposal of changes to the software that would let repository managers control which options appeared in the dropdown list and make it clearer to depositors, earlier in the data deposit workflow, which licenses and terms of use were applied to the data they were depositing.

Data analysis

The proposal sought to make each data deposit's terms of use more machine readable by letting depositors choose from a list of predefined licenses and terms. But we also needed to consider how repositories had been using the software for years: For many datasets, depositors had typed standard licenses or terms of use into one or more free text fields. We needed to learn how prevalent this was and what text existed in these fields. What could we do to help repositories make the terms in those free-text fields more machine-readable?

To explore more thoroughly how repositories and their depositors had been applying terms to the data they published, I wrote Python scripts that used the software’s APIs to collect information about the datasets published in most of the repositories we knew were using the Dataverse software. In a Jupyter notebook, I then analyzed the information to answer questions such as:

How many datasets were published with standard licenses, such as Creative Commons licenses, and how many were published with custom terms of use?
How many datasets were published with a mix of standard licenses and custom terms of use?
How often did the terms of use that depositors typed into the free text fields conflict with the standard licenses they chose and how did they conflict?
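
The first two questions come down to sorting each dataset into a category and counting. The sketch below is only an illustration of that idea, not the code from the notebook: the record keys ("license", "terms"), the STANDARD_LICENSES set, and the sample records are all assumptions made up for this example.

```python
from collections import Counter

# Hypothetical set of license names treated as "standard"; the real analysis
# may have used a different list.
STANDARD_LICENSES = {"CC0", "CC BY", "CC BY-SA", "CC BY-NC"}


def classify(record):
    """Return a category for one dataset record.

    `record` is a dict with two assumed keys:
      - "license": the license name stored for the dataset, or None
      - "terms": text typed into the free text terms of use fields, or ""
    """
    has_standard = record.get("license") in STANDARD_LICENSES
    has_custom_text = bool((record.get("terms") or "").strip())
    if has_standard and has_custom_text:
        return "standard license + custom text"
    if has_standard:
        return "standard license only"
    if has_custom_text:
        return "custom terms only"
    return "no terms recorded"


def summarize(records):
    """Count how many records fall into each category."""
    return Counter(classify(r) for r in records)


# Made-up records for illustration only; not data from the analysis.
sample = [
    {"license": "CC0", "terms": ""},
    {"license": "CC0", "terms": "Please cite the related publication."},
    {"license": None, "terms": "Contact the author for access."},
    {"license": "CC BY", "terms": "   "},
]
print(summarize(sample))
```

The third question, whether typed terms conflict with a chosen license, is not something a count like this can answer. Those datasets would still need to be read and judged individually.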

Screenshot of Jupyter notebook analyzing the licenses and terms applied to most datasets published in Dataverse repositories. The notebook is published in my GitHub repo

With the analysis we tested assumptions about how depositors were and should be specifying terms of use.

For example, the group designing the solution hypothesized that any text entered in the free text terms of use fields would invalidate a CC0 waiver, so the design restricted depositors from entering text in those fields when they chose from the dropdown list's predefined terms of use, such as a CC0 waiver.

From earlier explorations of the data I had found that this wasn't always the case: Some depositors applied a standard license or waiver to their datasets, such as the CC0 waiver, and entered terms in other free text fields that did not conflict with the standard license. But from the analysis we learned that this wasn't prevalent. Datasets with a standard license and additional text accounted for less than 0.5 percent of datasets across all known repositories, and less than 1 percent of the datasets within most of those repositories.
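
The two figures above are different statistics: one pools every dataset across repositories, the other looks at each repository on its own. A minimal sketch of how both could be computed follows. It assumes per-repository counts are already available; the function name, repository names, and numbers are hypothetical and are not the results of the analysis.

```python
def share_with_extra_text(counts):
    """Percentage of datasets that have a standard license plus extra text.

    `counts` maps a repository name to a (mixed, total) pair, where `mixed`
    is the number of datasets with a standard license and additional text
    and `total` is the number of all datasets in that repository.
    Returns (per-repository percentages, pooled percentage).
    """
    per_repo = {
        name: 100 * mixed / total
        for name, (mixed, total) in counts.items()
        if total > 0  # skip empty repositories to avoid dividing by zero
    }
    all_mixed = sum(mixed for mixed, _ in counts.values())
    all_total = sum(total for _, total in counts.values())
    overall = 100 * all_mixed / all_total if all_total else 0.0
    return per_repo, overall


# Made-up counts for illustration only.
example_counts = {
    "repo_a": (2, 1000),
    "repo_b": (1, 200),
    "repo_c": (0, 50),
}
per_repo, overall = share_with_extra_text(example_counts)
for name, pct in sorted(per_repo.items()):
    print(f"{name}: {pct:.2f}%")
print(f"all repositories: {overall:.2f}%")
```

Reporting both guards against a large repository hiding a smaller one where the pattern is more common.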

We were also able to identify community members with use cases we should learn more about, which helped us realize that the solution had to retain more of the current deposit workflow’s flexibility.

Participatory design

I helped organize and participated in several design sessions with the group contracted to do the design and development work, where we helped create and iterated on mockups and design write-ups.

Screenshot of part of a Miro board used to communicate the data deposit (uploader), search, and administration workflows

Collecting and analyzing community feedback

To solicit feedback from the community, the group shared details of the proposal in mailing lists, Slack, and other communication channels. I organized the feedback in a document accessible to the entire community so that we could review and address the feedback transparently.

We also reviewed several iterations of the design with specific community members identified through the data analysis.

Key team and personal insights

Being prepared to evaluate success

Earlier in this project it would have been helpful to discuss how we might continue to test our understanding of depositors' goals and how well the solutions that shipped met those goals, such as “shortening the amount of time it takes for data depositors to apply terms of use to data”. We might have done this with follow-up interviews and usability tests, and by benchmarking the average deposit time before and after we changed the deposit workflow.

The project made me more certain of the importance of planning for evaluation as early as possible, even before solutions are considered and certainly before one is implemented. Otherwise, we may not become aware of problems until they have affected many users.
