Day 12

For next time

Toolbox 1 due Monday
MP3 due Tuesday - don’t forget slide for in-class presentation

Ethics in text mining and analysis

We’ll discuss the reading from the previous RJ as a large group. Some of the key points are summarized here.

Personally Identifiable Information protection principles

Principle	PII protection rationale	Incompatibility with “Big Data” analysis
Collection Limitation	There should be limits to the collection of PII; it should be obtained lawfully and fairly and, ideally, with the knowledge/consent of the data subject	The larger the data collection, the better the potential for identifying interesting correlations
Data Quality	PII should be relevant to the purposes for which it is to be used, and should be accurate, complete and up-to-date enough for those purposes	“Messy data” is fine; it’s not clear what is relevant until it’s analysed, and even inaccurate or incomplete data can be useful
Purpose Specification	The purposes for which PII is collected should be specified at the time of data collection. Subsequent use should be limited to those purposes or such others compatible with those purposes and specified on each change of purpose.	Data may have been collected for a particular purpose, but analysis may indicate further unrelated and previously unknown, but valuable, purposes. Data as collected may not be obviously PII, but analysis of it may identify individuals
Use Limitation	PII should not be disclosed, made available or otherwise used for unspecified purposes except with data subject consent or by authority of law.	There may be value in sharing and aggregating data that may not be apparent at the time of collection
Security Safeguards	PII should be protected by reasonable security safeguards against such risks as loss or unauthorised access, destruction, use, modification or disclosure of data.	It may be unclear what security issues, if any, arise from a particular collection of data or its analysis
Openness	Data subjects should be able to establish the existence and nature of PII, and the main purposes of its use, and the identity and location of the data controller.	Where data is collected and analysed, it may not be obvious that it is PII, and even in circumstances where it is, the researcher may have no way of informing the data subject of its use
Individual Participation	An individual should be able to be informed by a data controller whether it holds PII relating to him or her; to have the PII communicated to him or her in meaningful form and reasonable time and at reasonable cost; to be informed if the PII will not be communicated and to be able to challenge that denial; and, where the PII is not lawfully held, to have it erased, rectified, completed or amended.	Data that is anonymous may still be utilised in ways that can cause risk/harm to an individual
Accountability	A data controller should be accountable for complying with measures that give effect to the other principles	How and when might a researcher be held accountable, and for what?
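
The Purpose Specification row notes that data which is not obviously PII may still identify individuals once it is analysed. As a minimal sketch of that idea, the Python below counts how many records share each combination of seemingly harmless attributes. The records, the field names, and the `group_sizes` and `unique_rows` helpers are all invented for illustration; they do not come from the reading.

```python
from collections import Counter

# Made-up records with no names or ID numbers (illustrative data only).
records = [
    {"zip": "02492", "birth_year": 1999, "major": "ME"},
    {"zip": "02492", "birth_year": 1999, "major": "ME"},
    {"zip": "02492", "birth_year": 2000, "major": "ECE"},
    {"zip": "02481", "birth_year": 1999, "major": "ME"},
    {"zip": "02481", "birth_year": 1999, "major": "ME"},
]


def group_sizes(rows, fields):
    """Count how many rows share each combination of the given fields."""
    return Counter(tuple(row[f] for f in fields) for row in rows)


def unique_rows(rows, fields):
    """Return the field combinations that occur exactly once."""
    return [combo for combo, n in group_sizes(rows, fields).items() if n == 1]


# One attribute alone singles nobody out; three attributes together do.
print(unique_rows(records, ["zip"]))
print(unique_rows(records, ["zip", "birth_year", "major"]))
```

In this toy data set no single zip code is unique, but one combination of zip code, birth year, and major appears exactly once, so anyone who knows those three facts about a person could pick out that person's row.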

Failure modes of data analysis

overfitting
selection bias
cherry-picking
confirmation bias
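
To make one of these failure modes concrete, here is a small sketch of cherry-picking, written under stated assumptions: every "feature" is pure random noise, and the `pearson` and `best_of_many` helpers are hypothetical names made up for this example. It compares the correlation of the first feature we happen to look at with the strongest correlation found after searching through many features.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_of_many(n_features, n_samples, seed):
    """Correlate many random features with a random target.

    Returns (|r| of the first feature, largest |r| over all features).
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    correlations = []
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        correlations.append(abs(pearson(feature, target)))
    return correlations[0], max(correlations)


first, best = best_of_many(n_features=200, n_samples=20, seed=0)
print(f"first feature |r| = {first:.2f}, best of 200 |r| = {best:.2f}")
```

Because nothing here is actually related to anything else, any strong correlation that turns up is an artifact of searching through many candidates and reporting only the most striking one.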

Alignment and bias

Let’s examine a reflection write-up from a previous semester’s MP3, adapted to serve as a prompt for building up numerous “ask-analyze-assess” alignment possibilities. We will do so today with an eye toward biases that could have played a role in the data and algorithms (or the people who generated them) used by the systems this project incorporated.

Work with one or two people near you as you read the reflection below and do exercises 1 through 5 related to alignment. Examining an already-completed MP3 can give you practice considering limitations and biases in complex systems before you reflect upon your own assignment. The exercise also enables instructors to introduce some considerations that might be less obvious.

What the creator was asking:
- For this project, I created a program that asks the user for a music artist, scrapes the lyrics of their songs, and performs a sentiment analysis on them.
Which types of data were analyzed:
- I used Google and lyricsfreak.com as data sources. Once I parsed through the HTML and gathered the lyrics of each song, I used a sentiment analyzer to gauge the overall mood of their songs. I hoped to get more comfortable using Python to interact with information on the web and use an appropriate computational analysis on it. Additionally, I wanted to create a visual representation of artists’ lyrics to compare and contrast in hopes of finding trends. Once the user enters an artist’s name, my program performs a Google search with those terms and finds the artist’s song list on the LyricsFreak website. I looked at different lyric websites, but I chose this one because the body of the lyrics was easy to identify and easily accessible using Beautiful Soup. From the artist’s page, I extracted the URLs of each song and went to each individual page to scrape the lyrics. Keeping the lyrics from each song separate, I used the NLTK sentiment intensity analyzer to find the polarity scores of each song. Finally, using matplotlib, I plotted each song by the selected artist on a scatter plot using the negative and positive scores from the analysis.
How the tools selected helped answer the question:
- Graphs for artists with a lot of songs were difficult to read. I tried for a long time to figure out how to get the labels to appear when they were hovered over, but I was unsuccessful and could not find a way to interact with the figures I generated. Instead, I randomly labeled a few of the songs so that the viewer could identify specific songs to make sense of the plot but still be able to observe any trends. The final representation of my information is a scatter plot with every song’s positive and negative sentiment values of a particular artist. One thing that surprised me was that the songs turned out to be a lot more neutral than I had expected. I tend to think songs written by an artist have similar tones, but judging by the analysis I performed, I don’t think I can fully support that hypothesis. However, there were slight trends; for example, Adele’s songs were more negative than positive for the most part, but generally the points were pretty scattered. Also, I thought it was interesting to look at the songs that were very close to either axis. This suggests that they were almost entirely negative or positive and more skewed to one side or the other.
- How the student felt that the questions they asked aligned with the data sources they chose and the tools they used to analyze them and present their work:
  - [Do the exercises below before viewing the author’s remarks in this section.]
- How the creator felt after completing Mini-Project 3:
  - After doing the project, I am very pleased with the skill set I learned from this project. This was one of the first projects that allowed for some freedom in what topics and concepts we wanted to dive deeper into. I was definitely challenged by this project. One of the biggest changes I can pinpoint was visualizing the flow of the program and going from what was in my head to lines of code. I had to make decisions about which functions would handle individual song lyrics and which ones would be performing a function on a list of scraped information. It helped me to write out the different functions I thought I needed in plain English along with the parameters and return objects and to create a flow diagram. I felt as though this was well scoped and that I gained a significant amount of independence during this project. Although I might not use the specific results from this project, I now have a handful of new tools that I look forward to incorporating in my future projects.
- Exercises: put yourselves in this student’s shoes and consider the tools mentioned above or any others that you can imagine helping to achieve the stated goal.
1. Take the potential sources of lyrics, lyricsfreak.com and Google search results, and propose at least one way that each source may be limited.
  - 1a. A potential lyricsfreak.com limitation…
  - 1b. A potential Google search limitation…
2. Take the tool to analyze the language of lyrics, NLTK, and imagine ways that the tool might have been trained to classify a set of words as positive or negative. Think of times when such a toolkit may struggle to classify a particular set of lyrics based on how you think it was trained.
  - 2a. NLTK might produce satisfactory results given lyrical data that has the following characteristics, based on how we think it was trained…
  - 2b. NLTK might struggle to produce satisfactory results given lyrical data that has the following characteristics, based on how we think it was trained…
3. Consider the tool for visualizing the results of the analyses, matplotlib, or something similar (e.g., printing text to the terminal). What are conditions (e.g., with a lot of lyrical data available for an artist or only a little) that might make it difficult for a user to confirm or challenge assumptions they made about the musician they searched for?
4. Once you have responses to the three prompts above, read the alignment excerpt written by the original project creator.
5. Contemplate whether you went further in your consideration of potential bias than the project’s creator. [Optional] See texts such as Algorithms of Oppression that introduce ways in which Google searches and the algorithms that produce the results can be problematic for many users.
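
If it helps to ground exercise 2, here is a toy word-list scorer. It is not NLTK's implementation: the sentiment intensity analyzer in NLTK is based on VADER, a lexicon-and-rule approach that handles more cases (including some negation) than this sketch does. The `TOY_LEXICON` word list and the `toy_scores` function are assumptions invented for this example.

```python
import re

# Tiny hand-made word list (an assumption for illustration; real
# sentiment lexicons contain thousands of scored entries).
TOY_LEXICON = {
    "love": 1, "happy": 1, "good": 1,
    "sad": -1, "cry": -1, "bad": -1, "sick": -1,
}


def toy_scores(lyrics):
    """Return the fraction of words that are positive, negative, or neutral."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    if not words:
        return {"pos": 0.0, "neg": 0.0, "neu": 0.0}
    pos = sum(1 for w in words if TOY_LEXICON.get(w, 0) > 0)
    neg = sum(1 for w in words if TOY_LEXICON.get(w, 0) < 0)
    total = len(words)
    return {"pos": pos / total, "neg": neg / total, "neu": (total - pos - neg) / total}


print(toy_scores("I love this happy song"))
print(toy_scores("I am not happy"))     # negation is ignored by this toy
print(toy_scores("that beat is sick"))  # slang praise is counted as negative
```

Two things to notice as you discuss: words that are missing from the list count as neutral, which may be one reason a scorer reports lyrics as mostly neutral, and a word list built from one kind of writing can misread slang, negation, or sarcasm in another.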