Week 6

Curation of our Dataset

Part of the research project we did this summer was data curation. The dataset we are working with contains information from the ‘awful-ai’ dataset as well as the ‘incidentdatabase.ai’ dataset. Combined, these are a large dataset and while some of the categories we used were already populated, we needed to manually fill out much of the data.

I spent a lot of time this summer adding to the ‘date of instance’ column in our dataset. Working on this helped me see how fuzzy data curation can be sometimes, and all the choices that go into curating a dataset. We got our date the public found out and information about the AI instance from online articles. Most often the date the public found out was listed as the date the article was published. To determine the date of the instance, sometimes that meant finding the date of occurrence in the article for a very specific instance, and sometimes it meant googling what date a software was released.

It is worth noting that the date of occurrence and date the public found out are approximations. The date of occurrence may be different for people depending on when and how they interact with the AI system. In this case we just used the date the software was released. Similarly, the date the public found out will be different depending on if the AI immediately impacts them, and what is meant by ‘public’. For example, youtube kids was a problem for years, and some viewers probably noticed the issue right away, which eventually led to articles being written about it. However, our ‘date the public found out’is just the dates the articles were published.

Another difficult piece of information to parce was when the issues actually started and who is responsible. For example, if a company releases a software timetabling AI service that works at a small scale, but causes problems when scaled to a large company, should the date of occurrence be when the software was released or when this compmany started using it at a large scale?

It has been tremendously eye opening to see all the decisions that go into curating a dataset, and then to zoom out to the larger picture with the Irresponsible AI Atlas we are trying to build, and recognize that even with all the tradeoffs in our data model this is still an important dataset worth having.

Another part of the dataset we worked on curating was the taxonomy of class and subclass. This simply consisted of categorizing all the AI instances. We are then able to group our AI instances by what field they are in.

You can explore our data below.

Written on June 27, 2022