The #Cuéntalo movement started early in the morning on April 2018, inviting women to share on Twitter their personal experiences of sexual aggression. In a few days, it generated more than 2.5 million tweets and retweets with stories told by their protagonists.
Archivists Vicenç Ruiz and Aniol María gathered the tweets in real time, and together with data journalist Karma Peiró they came to BSC to talk about how we could study and visualise this dataset.
The tweets go from the uncomfortable to the unbearable, stories in first person occasionally mixed with a woman speaking in the name of another one because she doesn’t have a computer, or because she doesn’t dare to tell her story, or because she was murdered.
“My mom’s boyfriend raped me when she was drunk because ‘I was very much alike and it was wrong to make love in her state’ and ‘so that you learn for when you get older’, since I was 9 until I was 12 years old”
There are many tweets, and each of the is important. Our goal was to study them with statistics and visualise the results to convey the shocking magnitude of the movement, while trying to respect the unique identity of each narration. Cristina Fallarás, the journalist that started and pushed the viral movement, gave us one more goal: She wanted the #Cuentalo visualization to be shocking, but also to remain a safe space, a place without fear or shame where victims could tell in first person their brutally honest testimonies (many of the telling them for the first time in their lives). A place where they can tell what’s hidden in plain sight, because even if it happens every day, what is not said out loud does not exist.
In the next few posts we will explain our analysis and design process for this project: From data gathering, cleaning, and treatment, to the conceptualisation and implementation of the final result.
The original dataset contained 2.1 million tweets in JSON format, created between April 27th and May 12th, 2018. The collection was missing two days that we were able to recover partially, reaching 2.75 million tweets in total. Each tweet has a lot of properties that we analised (and we will tell about this on the next post); here we will focus on what we used for the visualization.
One important distinction is that there were 160 thousand tweets with content written by users, while the rest (2.6 million) were participations in the form of retweets. These are indeed crucial for spreading the movement (and we assume they are supportive of it), but the data we want to visualize is in the content of the tweets we call “original” (i.e. not retweets).
We want to visualize who tweeted: These mostly anonymous 160 thousand tweets can be split between those that gave some testimony in first person, the ones that tell something on behalf of another woman who doesn’t dare or can’t do it (for example, because they don’t have internet access or they were murdered), and those that express words of support. We also found a few unclassifiable tweets (advertising, images), and a small group of trolls and jokers.
And on the other hand, we wanted to visualize what people wrote about: What testimonies were told? Because of the volume of the data, this was a complex task, but we knew it would be valuable, as we know that above 70% or 80% of sexual aggressions are never reported. This is one of the key points of #Cuéntalo: Bring out to light those things that happen everyday and are not computed in any registry, either by fear, or worst, because society does not believe them.
“I’ve been raped twice; part of my family doesn’t speak to me anymore for saying this, and for saying I wouldn’t shut up about it. I’ve even withdrawn a rape report because the police made me afraid. I won’t shut up anymore #Cuentalo”
In order to categorise the 160 thousand original tweets, 16 people in our team manually classified the content of over 10600 tweets (randomly chosen). The goal was to use this sample to create an algorithm to classify the rest of the tweets automatically, and for this we used the maximum number of categories that we could process correctly.
The first categorization was about who wrote the tweet: In the first case, we have testimonies in first or second person, an expression of support, someone that opposes the movement, and random tweets (for example, tweets in other languages, people that used the popularity of the hashtag to do advertising, or tweets containing just a screenshot). This last example did contain people that told stories that were too long for the 280 characters of Twitter, but we cannot read them correctly.
“The stories of #Cuentalo give me the shudders. They pain. They pain a lot. But it is necessary to read them to understand they are not isolated cases: they are everyday situations.”
After this we categorized the content of the tweets. We found accounts that go from the sensation of fear and insecurity that women feel everyday, to murders with torture, with all types of aggressions in between (physical, verbal, or virtual), including beatings and rape. To ease the training process (more details to come in the next post), we opted for the simplest possible categorization, sadly at the expense of the precision we would like to have (or that we could if we read by hand all tweets). Aspects like frequency or duration, degree of cruelty, age, or victim-aggressor relation were not worked on, not because we don’t think they are important but rather because of technical reasons.
Our categorization is incomplete and improbable in many ways (for example from legal and social points of view). We argued a lot about this issue, specially worried about not minimizing or simplifying the seriousness of the fact.
Initially, the categories we used were: Murder, rape, sexual agression, assault, harassment, fear (the explicit mention of), and emotions like disgust, rage, and indignation. Like we said, this categorization is imperfect because of its simplicity, and for the visualization we aggregated it even more to just three categories: assault or any kind of physical agression (including murder, rape, and so on), non-physical agression (harrassment, fear), and an emotional reaction (like rage, etc). The previous and finer categorization work was not in vain as we were able to use it to estimate the percentage of tweets in each category in the full dataset. Of the 10632 tweets we labelled by hand, 31.03% are written in first person, 8.91% are told by someone else, 40.18% are support tweets, 3.12% are tweets agains the movement, and 16.69% are unrelated tweets. Extrapolating these percentages to the total, we would have error brackets around 1.5% for the tweets against #cuentalo, 3% error in the estimation of testimonial tweets, and almost 6% for the support tweets.
Within the testimonial tweets (first and second person, totaling almost 40%), 3.92% mention murder, 5.59% mention rape, 11.18% mention sexual assault, 6.27% mention physical assault, 14.19% talk about harassment, 11.78% mention fear, and 19.23% some reaction like disgust/rage/sadness (the percentages don’t add to 100 because in the same tweet many things can be mentioned). Again, if we extrapolate these numbers to the global dataset, we find error bars of 1% for rapes and murders, 3% error in the estimation of assault and harassment, and 6% for reaction tweets.
It’s often told that people understand frequencies better than percentages, so we wrote these results in the webpage as such:
Let’s conclude this section by commenting that our algorithm was able to label tweets from the dataset with a precision of 80% for who is writing the tweets, and with a precision of 70% for the topics mentioned. In general, not exact but quite well due to the reduced size of the input data (better results are obtained with millions of sentences). We expect that there will be many mis-labeled tweets, and the predictions is more a suggestion than a final conclusion. Our take away from this is that the visualization cannot depend way too much on the precision of this clasification until we can manually label them all.
We start by discussing images and themes that we used as inspiration, including some of our own sketches as we started producing them.
Initially we started with a preconcept: We were going to find many conversation threads, and we tried to represent them with trees like this:
However, we found that there was overwhelmingly more single tweets and that the conversations were not long, so this technique did not work.
We then though about a lineal time narrative, showing the virality of the phenomen and its magnitude. This charts shows the number of tweets per minute (the highest point is about one thousand), from 27th of April to the 13th of may:
We played with the idea of this chart looking like a sound wave, representing the movement as a giant scream heard around the world:
But in the end we were not convinced by the scream metaphor. Also, the lineal time representation would prevent us from adding data in the future. It felt like freezing the event in time, not allowing it to continue. This visualization did allow us to represent tweets by country, though:
We ended up dropping this option also because we had lost the individuality of each tweet, which was important for us.
In order to be able to let people add more tweets, we started exploring circular representations and periodicity.
In our first attempts, we started plotting hours of the day around the circle, and setting tweets from the inside out in a first created order. These sketches were made thinking of an eclipse, or a human eye:
This representation is quite versatile and allows us to incorporate more dimensions like country (e.g. in color), or number of retweets, AND it allows to explore interactively each tweet one by one.
In order to escape the ink blot shape and take better advantage of the radial position for each point, we tried with some structure, for example by ordering the tweets inside-out in a ring for each day, with the size of the ring proportional to the total volume of tweets for each day:
The results are interesting by they also have the problem of being difficult to expand with new tweets in the future.
At this point, we met with Cristina Fallarás who sent us back to the origins, and focused on where we should put the main message: #Cuentalo, aside from being an event with a large social impact, was a safe space where women could tell their story. We then decided to use our radial coordinate to somehow represent this. We needed to put the women who were telling their stories in a space at the center of focus, and surround them by the people who support them but also isolate them from the randoms and trolls outside in the world.
With this in mind, we created the first sketches that approach the final solution:Di
Our classification algorithm (of who wrote the tweet) allowed us to make a final improvement: We removed those tweets where we were at least 90% sure that they were random (jokes, advertising, and unfortunately also tweets with only an image). In the final visualization we end up with only 100 thousand original tweets, and retweets encoded in the size (subtly).
SO, before getting to the final representation, let’s remember our goals in the design, what it should say:
– Magnitude: These things happen, and more than we think, These are numbers that should set off our alarms, specially because behind these tweets there are so many more women who haven’t dared say anything.
– Empathy: This maybe happened to you too, or could happen to anyone around you. It is empathy that helps us understand someone else´s suffering, their fear, like a close feeling that makes us intervene and speak up.
– The diversity and atrocity of the crimes. Murder, rape, torture, crimes against minors, and crimes commited by families, friends.
– Hope that the victims will find a safe environment where they are not judged and questioned, helping others come forward. This safe space is reinforced by the messages of those that denounce the situation without having experienced it personally, facing those that do not believe how serious the problem is and try victim shaming and blaming. Only by legitimizing the suffering of many we can discuss justice reform and make law reflect better what happens in reality, and thus help change and improve our world.
The final visualization #cuéntalo
The final visualization (here), ended up using the circular representation to evoke a safe and protected space. Our estimation of who wrote the tweets let’s us represent testimonials in the center ring, somehow surrounded and protected by the tweets of support in the second circle. The rest, far or against the cause, are very outside the inner circle. Each tweet is represented by a point in in space, forming a large and filled cloud that gives us an idea of the magnitude and repercusion of the phenomenon. The individuality of each tweet is preserved thanks to exploration, allowing us to see the content of each story by hovering with the mouse. The time of day when the tweet was written is encoded in the angular variable, reminding us that this is an issue that happens at all times of the day, everywhere. The bright colors, over dark background, represent a light in the darkness. The color palete (from red to white) evokes the violence of the topic, and by using it to encode the probability that a tweet is talking about a physical aggression, we discover something interesting: Most of the tweets that talk about assault are located near the center. The inner ring, where women tell their stories, is tainted with deep red where the most piercing tweets.
The visualization is complex, and the legend helps us interpret it:
In alphabetical order: Sol Bucalo, Luz Calvo, Carlos Carrasco, Fernando Cucchietti, Artur García Saez, Carlos García Calatrava, David García Povedano, Juan Felipe Gómez, Camilo Arcadio González, Guillermo Marín, Irene Meta, Patricio Reyes, Feliu Serra and Diana Fernanda Vélez. Also from BSC for the classification we had the collaboration of María Coto and Laura Gutierrez.
Epilogue: Unexplored options
In the selection process of themes we wanted to explore, we dropped some interesting options, like for example talking about the age of the victims: more than three thousand tweets report victims less than 18 years old. Many of them are today adults and telling their story for the first time.
We also thought about focalizing in those tweets in second person that mention the woman who was murdered (10% of all testimonies), usually ending with a phrase “I’m telling this because…can’t”. This is an example visualization of all the name mentioned in this kind of phrase, in a homage to the Iraq’s bloody toll famous visualiazation: