Social network data: Twitter vs FB vs Google vs everyone else

[First published on Towards Data Science]

This is the third post in a series of three looking at how technology is shaping our social connections. The first post tried to convince you that our online and offline social networks are incredibly important. The second outlined some strategies designers can use to design better social networks.

If I am friends with you, who owns that information? Facebook says either of us can reveal our friendship to the world. But that surprises many Facebook users, so they constantly report it as a security bug. Information about our social interactions poses unique challenges.

This post looks at sources of social network data, and rates them for their quality, availability, and ethical legitimacy. My PhD research used Twitter, but I’ve done work with Facebook data, with citation networks and with offline networks. I also have a suspicion that Google is sitting on the most valuable social network information ever gathered, even if it doesn’t know what to do with it.

Twitter

Twitter is wonderful for research: it’s public by default, and the platform is happy to share a limited amount of data with users. It also has a ‘town square’ atmosphere: it’s a place for discussing the big issues of the day.

In my PhD, I built a tool that uses Twitter data to help local government improve and target their services. There is a problem, as Blank says:

‘British Twitter users are younger, wealthier, and better educated than other Internet users, who in turn are younger, wealthier, and better educated than the off-line British population’

The same pattern holds outside the UK. I’ve been aware of this issue throughout the research, and had two lines of argument in response: 1) some data is better than no data, and all data has biases; 2) young people are disengaged from local politics, so they are a good group to have access to.

Even so, Twitter’s over-representation of the better-off and more empowered always made it feel inadequate. Some parts of society are ‘hard to reach’, so difficult to engage that almost any system that tries to increase participation will fail them. The chronically unemployed, non-English speakers, rough sleepers — these groups will need different, probably highly resource intensive, approaches to increase their democratic engagement. A system that fails them isn’t necessarily failing completely, if it can serve other less acutely disadvantaged groups. Despite this logic, Twitter really is about the most advantaged most of the time. There are a few exceptions, which may point the way for future work. For example, Twitter is a means of political inclusion for those with limited mobility, or for those who prefer not to engage face to face.

Another problem with Twitter is getting a computer to process the data. Intuitively, tweets ought to be highly legible to software: short bits of text with very rich metadata (time, location, who wrote it, hashtags, @ mentions). This is not quite the whole story. Perhaps in part because of the brevity of tweets, meaningful Twitter activity is fragmented over multiple messages, ironic retweets, contextual information like user bios, text in images, and subtweeting. These are much harder for a computer to understand.
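To make the gap concrete, here is a minimal Python sketch that pulls the easy surface features out of a tweet's text. The function name and the example tweet are invented for illustration, and the regular expressions are a simplification, not how Twitter itself tokenises mentions and hashtags.

```python
import re

# Simplified patterns (an assumption): a handle or tag is "@" or "#"
# followed by letters, digits or underscores.
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")


def surface_features(text):
    """Return the mentions and hashtags that appear literally in the text."""
    return {
        "mentions": MENTION.findall(text),
        "hashtags": HASHTAG.findall(text),
    }


# A made-up, sarcastic tweet.
features = surface_features(
    "Great work by @council on the new #cyclelane... said no one ever"
)
print(features)
```

The mention and the hashtag come out cleanly, but nothing in the result records that the tweet is sarcastic, which is exactly the kind of meaning that is hard for software to recover.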

Despite these shortcomings, you’d guess that Twitter ought to be an amazing platform for research. But how often do you see truly startling insights from Twitter data? There is a feeling it’s getting a little tired.

Data politics: Twitter may have all kinds of political problems, but in terms of data it’s been very transparent: (essentially) everything is public. Of course, some people will say things in public and wish they hadn’t. I’d argue that any system that protects people from this danger is going to be infantilising and paternalistic. Twitter is so clear that your tweets are public (it’s such a core part of Twitter’s offer) that I don’t see an ethical problem in this. (Though it has been mentioned to me many times.)

Data quality: Fragmentary and not demographically representative. My guess is that combining Twitter data with other sources will be its most fertile application in the future.

Data availability: Twitter has a lovely API (a computer-readable interface for getting data).

Facebook

Facebook is the largest social network, so you might guess that it must have the best data. I think you might be wrong. Many gestures on Facebook, perhaps especially ‘friending’, are noted for their disposable vapidity. Facebook’s raw friendship data might be little more than a list of who went to school with whom. (Mostly. I’m aware that there are niche communities on Facebook that encode different relationships.) Facebook Messenger interactions are likely to be more indicative of meaningful social relations, but then again, they may just tell you about a relatively short list of each user’s closest friends.

Facebook provides a page where you can see what they know about you. I recently asked some students to use the tool and tell the class how accurate they felt it was. From what you hear in the press, you would expect everyone to be sent reeling by Facebook’s window into their subconscious. In fact, about half the students felt it had reasonably accurate information, while the other half felt it was mostly wrong.

Data quality: Unknown. A lot of people, including marketing companies and Facebook itself, have incentives to give the impression that Facebook’s data is better than it is. WhoTargets.Me was a project monitoring political adverts on Facebook in the run-up to the UK 2017 general election, and we saw many poorly targeted adverts. This doesn’t necessarily reflect FB’s social network data directly, but it could be indicative.

Data politics: The overwhelming majority of Facebook users have no idea what data Facebook has, or how to navigate its privacy settings. As is well known, Facebook has also become the subject of political suspicion around its relationship with Russia. Legitimacy is in short supply.

Data availability: Facebook is a walled garden. It’s virtually impossible to build anything on top of it. You can ask users to install a Chrome plugin, which is what we did on the WhoTargets.me project. Mostly though, researchers are locked out of Facebook.

Google

My Gmail network, visualised. Whose data is this for me to publish? (I’ve obscured uncommon names)

If Facebook doesn’t have great social network data, who does? Gmail. Emails are long, time-consuming ways of communicating: they are costly signals. They represent network connections of all kinds, not just your friends: professional, social and even customer relationships with banks, energy suppliers, etc. Gmail (get ready for this) has an amazing amount of network data about people who don’t hold Gmail accounts. Even if you are not using Gmail, half your email (and the associated network data) still ends up with Google. If one person on an email has a Gmail account, Gmail can see the full list of recipients. One freelancer using Gmail can leak the social structure of a whole company to Gmail (assuming your company doesn’t already use Google Apps…).
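Here is a minimal Python sketch of how a single mailbox exposes other people's connections. It assumes you have raw messages with standard From, To and Cc headers; the function name and the example addresses are invented for illustration, and this is not a description of anything Google actually does.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses
from itertools import combinations


def co_recipient_edges(raw_messages):
    """Count how often each pair of addresses appears on the same email."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        headers = (
            msg.get_all("From", [])
            + msg.get_all("To", [])
            + msg.get_all("Cc", [])
        )
        # Deduplicate and sort so each pair has one canonical ordering.
        people = sorted({addr.lower() for _name, addr in getaddresses(headers) if addr})
        for a, b in combinations(people, 2):
            edges[(a, b)] += 1
    return edges


# One hypothetical message, as seen from the freelancer's mailbox.
raw = (
    "From: freelancer@gmail.com\n"
    "To: alice@corp.example, bob@corp.example\n"
    "Cc: carol@corp.example\n"
    "Subject: Draft\n"
    "\n"
    "Hi all"
)
edges = co_recipient_edges([raw])
for pair, count in sorted(edges.items()):
    print(pair, count)
```

Three of the six edges connect pairs of company addresses that never touched Gmail themselves, which is the leak described above.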

Data quality: probably mind-boggling.

Data politics: Google has somewhat of a clean slate. It isn’t accused of having elected Trump. Lots of people have had the experience of sharing more than they intended on Facebook, fewer with Google. Google has even tried to use its social network data with the Google+ platform, but failed, perhaps mostly because of the implementation rather than privacy politics.

Data availability: Unlike Facebook, Gmail’s social network data is not completely beyond reach. I generated the above visualisation using MIT’s Immersion tool. Users do have to give explicit permission, exactly as they should.

All the rest

I wanted to highlight a couple of the less obvious sources of social network data. One interesting example is co-traveler data — social network data derived from location data.

Co-traveler analysis carried out by the NSA. (via the Washington Post)

Even if you have really good location data for a group of people, it’s going to be hard to know if they are interacting or just standing close to each other. Unless you take a statistical perspective: even with very low-resolution location data, you can start to make strong inferences about social interactions, as long as you have enough location data points to work from. There’s lots of nice Bayesian analysis to do here, but the intuition is simple: if two individuals are often near each other in a diverse range of places, they probably know each other.
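The intuition can be sketched in a few lines of Python. This is a crude stand-in for a proper Bayesian treatment: it simply counts the number of distinct places where two people were seen in the same time slot. The data format, the names and the function are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def co_location_scores(sightings):
    """Score each pair by the number of distinct places they shared.

    sightings: iterable of (person, place, time_slot) tuples.
    """
    # Who was present at each (place, time_slot)?
    present = defaultdict(set)
    for person, place, slot in sightings:
        present[(place, slot)].add(person)

    # For each pair, collect the distinct places where they coincided.
    shared_places = defaultdict(set)
    for (place, _slot), people in present.items():
        for a, b in combinations(sorted(people), 2):
            shared_places[(a, b)].add(place)

    return {pair: len(places) for pair, places in shared_places.items()}


# Invented, low-resolution sightings.
sightings = [
    ("ana", "cafe", 1), ("ben", "cafe", 1), ("cat", "cafe", 1),
    ("ana", "gym", 2), ("ben", "gym", 2),
    ("ana", "park", 3), ("ben", "park", 3),
    ("cat", "station", 3),
]
scores = co_location_scores(sightings)
print(scores)
```

Ana and Ben coincide in three different places, while Cat only overlaps with either of them once, in a busy cafe. Counting distinct places rather than raw co-occurrences is what stops two strangers who share a commute from looking like friends.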

The above infographic was made to explain how the NSA uses co-traveler analysis to covertly track potential targets. But you could also use it with people’s permission if you built a platform that they could trust and which made a suitably persuasive offer.

Finally, I wanted to mention academic citation networks, which I worked on with my Who Cites? project. Academic citations occupy a particularly interesting ethical position. Academic research is almost always paid for in part by the state, so it seems reasonable that data about it should be public. It also has the potential to reveal useful information about who should work with whom, and whose work is most influential. This aligns with wider ideas of non-market cooperation (common-pool resources), nicely outlined by Cory Doctorow.

Summary

Across three posts (first and second here), I’ve tried to explain some of what I’ve learned during my PhD. Writing it out in non-thesis language has been incredibly helpful in clarifying what I think I’ve learned, and in thinking about where to go next. I already feel another post, this time about ethics, coming on…