Participated in stakeholder interviews to learn more about the problem and define goals
Presented findings back to stakeholders to verify our goals, what we knew, what we needed to learn, and how we could learn more
Hosted virtual meetings to guide participatory design
Conducted and presented quantitative data analysis
Stakeholder interviews
Participatory design
Data analysis
We learned how much work the proposed solutions would create for the curation team and that the team didn't have the resources for them at the time. This helped inform the project manager's decision to go with a different solution.
We learned how much design and development resources we'd need to extend certain functionality in order to minimize hurdles for the software's users.
I gained more insight into how to share the results of data analysis with different types of audiences.
On the Harvard Dataverse Repository, which uses the Dataverse software to provide free hosting of research data, anyone can create an account and publish research data. But for much of 2021, repository staff felt that they were spending more time reviewing and removing published spam than performing other curation tasks. Curation staff also worried that the spam encouraged other spammers to publish even more and that the problem would grow if left unaddressed.
We investigated how to stop people from publishing spam while making sure that publishing data remained as easy as possible.
In the repository, anyone can either publish data in the repository's main container or create their own container to store and organize their data. We learned during loosely structured discussions with repository staff that spammers were creating containers in the main container and adding text descriptions filled with URLs and keywords meant to improve search engine rankings of their websites.
Left image is an example of published spam, with URLs and keywords used to improve search engine ranking. Right image is a container with actual research data
In the Dataverse software, containers are like folders in an operating system, where folders can contain folders. People creating containers can configure them to allow everyone or only certain people to create containers within their container.
Summary of how users can use the Dataverse software's permission settings to manage their data
Aware of the constraints of a tight design and development budget, the team asked how we could use the Dataverse software’s existing permission functionality to prevent spammers from publishing in the main container.
Could we change the settings so that people can create, but not publish, containers in the main container? Curation staff would then review each new container, publishing the legitimate data and deleting the spam.
We found that spammers created repository accounts using Gmail addresses and suspected that accounts created with academic email addresses or through GitHub accounts would not publish spam. Could we change the repository's settings so that people with those types of accounts could publish their containers without review? This might lessen the number of containers that curation staff would need to review each day and prevent an unknown number of data depositors from being inconvenienced by the new review process.
With the project manager, UX design lead, developers and curation staff, I used a Miro board to brainstorm and document a shared understanding of the various deposit approval workflows that the team had discussed during stakeholder interviews.
This helped align the team on how we might use certain settings, how different configurations would change the publishing experience for the curation team and for depositors, and how we might predict the effects of those changes.
I conducted and presented the results of data analysis to help predict the effects of a handful of proposed changes. The Dataverse software's database collects information about user accounts and deposits, so I extracted this information and analyzed it in a Jupyter notebook to share answers to questions such as:
How many containers and data deposits are created in the main container each day?
How many user accounts create those containers and deposits each day and what are the account types (such as accounts tied to academic institutions or GitHub accounts)?
Each day, how many containers and deposits are created by first-time users of the repository versus returning users?
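To illustrate the kind of analysis described above, here is a minimal pandas sketch of the daily counts. The table and column names are assumptions for illustration, not the actual Dataverse database schema, and the data is made up:

```python
import pandas as pd

# Illustrative records standing in for an export of the repository's
# container ("dataverse") and data deposit ("dataset") tables.
# Column names are assumptions, not the actual Dataverse schema.
deposits = pd.DataFrame({
    "created": pd.to_datetime([
        "2021-06-01", "2021-06-01", "2021-06-02", "2021-06-02", "2021-06-02",
    ]),
    "kind": ["dataverse", "dataset", "dataverse", "dataset", "dataset"],
    "account_type": ["gmail", "institutional", "gmail", "github", "institutional"],
})

# Daily counts of containers and data deposits created in the main container
daily = (deposits
         .groupby([deposits["created"].dt.date, "kind"])
         .size()
         .unstack(fill_value=0))
print(daily)

# Daily breakdown of the account types creating those containers and deposits
by_account = (deposits
              .groupby([deposits["created"].dt.date, "account_type"])
              .size()
              .unstack(fill_value=0))
print(by_account)
```

In a notebook, each of these tables can be plotted directly (for example with `daily.plot()`) to show trends over time.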
Screenshots of different parts of a Jupyter notebook I used to learn more about the number and type of accounts used to create containers ("dataverses") and data deposits ("datasets") and how many containers and data deposits had been created each day in the main container ("HDV Root")
The analysis helped staff consider the daily number of new and returning data depositors who would be affected if we implemented an approval workflow where each container would need to be reviewed before being published. In turn, this helped us estimate the amount of work that the various solutions would create for the curation staff and the number of returning depositors who would be inconvenienced each time they needed to publish their data.
The analysis can also be adapted as a benchmark for the volume of deposits and the types of users (such as returning versus first-time depositors) over time as changes are made to the repository's deposit workflow and other parts of the application.
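A benchmark like this can be sketched by classifying each deposit as coming from a first-time or returning depositor. The data and column names below are illustrative assumptions, not the actual Dataverse schema:

```python
import pandas as pd

# Illustrative deposit log; user IDs and dates are made up.
deposits = pd.DataFrame({
    "user_id": [101, 102, 101, 103, 102, 101],
    "created": pd.to_datetime([
        "2021-06-01", "2021-06-01", "2021-06-02",
        "2021-06-02", "2021-06-03", "2021-06-03",
    ]),
})

# A deposit counts as "first-time" if it is the user's earliest deposit.
first_seen = deposits.groupby("user_id")["created"].transform("min")
deposits["depositor"] = (deposits["created"] == first_seen).map(
    {True: "first-time", False: "returning"})

# Daily counts of deposits by first-time vs returning depositors --
# a benchmark that can be re-run as the deposit workflow changes.
benchmark = (deposits
             .groupby([deposits["created"].dt.date, "depositor"])
             .size()
             .unstack(fill_value=0))
print(benchmark)
```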
We learned that the deposit approval workflows we considered would create too much additional curation work, and that the team couldn't commit the design and development resources needed to extend the permissions functionality to reduce that work.
These insights informed the project manager's decision to prioritize a different type of solution: while we were considering different deposit workflows, the development team had been experimenting with an algorithm to recognize spam deposits, so that the repository software could flag those deposits and prevent their publication until curation staff reviewed them. This algorithm and workflow were implemented instead.
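The team's actual spam-recognition algorithm is not described here, so the following is only a hypothetical sketch of the general idea: flagging deposits whose descriptions are dominated by URLs, as the spam we observed was.

```python
import re

# Hypothetical heuristic in the spirit of the spam-flagging approach;
# the actual algorithm the development team built may differ entirely.
URL_PATTERN = re.compile(r"https?://\S+")

def looks_like_spam(description: str, max_urls: int = 2) -> bool:
    """Flag a deposit description that is dominated by URLs."""
    urls = URL_PATTERN.findall(description)
    if len(urls) > max_urls:
        return True
    # Also flag descriptions that are mostly links by character count.
    url_chars = sum(len(u) for u in urls)
    return len(description) > 0 and url_chars / len(description) > 0.5

print(looks_like_spam(
    "Buy now http://a.example http://b.example http://c.example"))
print(looks_like_spam(
    "Replication data for a 2021 survey of researchers"))
```

Flagged deposits would be held from publication for curation staff to review, mirroring the workflow described above.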
One of the repository's goals is encouraging repeat use. The analysis helped us learn about the number and behaviors of new and returning users, and it can continue to serve as a method for measuring how changes to the repository help us retain users.
I gained insight into how to share the results of data analysis, catering both to colleagues who needed only the results and actionable insights and to those who also wanted to see how the analysis was done.