Introduction
Another week, another leak. We seem to live in dangerous times: it wasn't long ago that the popular dating website Ashley Madison suffered a data leak (which we have talked about here), and now news has come out that yet another set of internal data was leaked to the public.
This time it was the crowdfunding website Patreon that suffered a major leak, and they have already confirmed that several gigabytes of user data were dumped online.
As always, being big fans of privacy and data, we at BinaryEdge felt that we should give our own take on this data, in order to better grasp what kind of information really gets exposed in situations like this.
As such, this will be a two-part article: we will start with a quick overview of the content of the leak, and at a later time we will show a data science analysis of the most interesting parts of this content. This part of the article corresponds to the Preliminary Report of our Data Science Workflow.
The Content
The Patreon data dump consists of a compressed file of 3.7 GB, which contains 4 other files:
- patreon (18 KB)
- patreon.sql (14 GB)
- patreon.tar.gz (121 MB)
- README (1.1 KB)
At first glance, the patreon file appears to contain mostly security certificates, RSA private keys, database configuration parameters (hosts, ports, passwords), access parameters for services (secret keys for AWS, PayPal, etc.) and SSH public keys.
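A first glance of this kind can be approximated by counting well-known markers in a text file rather than reading the secrets themselves. The sketch below is only an illustration and not the procedure used for this report: the function name, the sample text and the choice of markers (PEM `BEGIN` headers and common SSH public key prefixes) are our assumptions.

```python
import re
from collections import Counter

# PEM blocks start with a header such as "-----BEGIN CERTIFICATE-----".
PEM_HEADER = re.compile(r"-----BEGIN ([A-Z0-9 ]+)-----")
# OpenSSH public keys are single lines starting with the key type.
SSH_PUBLIC_KEY = re.compile(r"^(ssh-rsa|ssh-ed25519|ssh-dss) ", re.MULTILINE)


def summarize_secrets(text):
    """Count PEM block types and SSH public key lines found in text."""
    counts = Counter(PEM_HEADER.findall(text))
    ssh_keys = len(SSH_PUBLIC_KEY.findall(text))
    if ssh_keys:
        counts["SSH PUBLIC KEY"] = ssh_keys
    return counts


# Hypothetical sample: placeholders only, no real key material.
sample = "\n".join([
    "-----BEGIN CERTIFICATE-----",
    "(placeholder)",
    "-----END CERTIFICATE-----",
    "-----BEGIN RSA PRIVATE KEY-----",
    "(placeholder)",
    "-----END RSA PRIVATE KEY-----",
    "ssh-rsa PLACEHOLDER user@example",
])
summary = summarize_secrets(sample)
for kind, count in sorted(summary.items()):
    print(kind, count)
```

Counting markers gives a summary of what kinds of secrets are present without ever printing them, which fits the goal of describing the leak without re-exposing it.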
The patreon.tar.gz file, when uncompressed, contains the following files and folders:
One can quickly observe that the content that was retrieved here corresponds to source code, which may or may not be from the website itself (or at least a version of it).
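A listing of files and folders like this can be obtained without extracting anything to disk. The following is a minimal sketch using Python's standard `tarfile` module; the archive built here is a tiny in-memory stand-in with hypothetical names, not the leaked file.

```python
import io
import tarfile


def list_archive(fileobj):
    """Return (name, kind, size_in_bytes) per member, without extracting."""
    entries = []
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive.getmembers():
            kind = "dir" if member.isdir() else "file"
            entries.append((member.name, kind, member.size))
    return entries


# Build a small gzip-compressed tar archive in memory (hypothetical content).
payload = b"print('hello')\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    folder = tarfile.TarInfo("example_app")
    folder.type = tarfile.DIRTYPE
    archive.addfile(folder)
    info = tarfile.TarInfo("example_app/main.py")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

buffer.seek(0)
entries = list_archive(buffer)
for name, kind, size in entries:
    print(kind, size, name)
```

Reading only the member headers is also the safer option for untrusted archives, since nothing is written to the file system.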
Finally, the patreon.sql file, which is a MySQL data dump, shows the following 99 tables once imported appropriately:
These tables contain information about practically everything regarding the Patreon website activity. We can find users' personal information, activities, campaigns, goals, monetary transactions, credit card information, etc. There is also information which is most likely used for website maintenance. In order to better understand what content is present, we have split it into a few topics which we describe below. Note that empty tables were immediately removed from observation and not included in this split.
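The step of discarding empty tables can be sketched as a row count per table. The example below is an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for the imported MySQL dump (on MySQL the table names would come from `information_schema.tables` instead of `sqlite_master`), and the table names are hypothetical.

```python
import sqlite3


def non_empty_tables(conn):
    """Return {table_name: row_count} for tables holding at least one row."""
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    names = [row[0] for row in cur.fetchall()]
    counts = {}
    for name in names:
        # Names come from the database catalog, so quoting them is enough here.
        cur.execute(f'SELECT COUNT(*) FROM "{name}"')
        row_count = cur.fetchone()[0]
        if row_count > 0:
            counts[name] = row_count
    return counts


# Stand-in database with one populated and one empty table (hypothetical names).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE example_users (id INTEGER PRIMARY KEY);
    CREATE TABLE example_unused (id INTEGER PRIMARY KEY);
    INSERT INTO example_users (id) VALUES (1), (2);
""")
result = non_empty_tables(conn)
print(result)
conn.close()
```

Only row counts are read, so a check like this narrows down which tables deserve a closer look without touching their contents.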
Users' private information
A large part of the dump contains personal and activity information about the users. This includes information about accounts, settings, sessions, messages, emails, follows, shipping addresses, location, etc. The amount of private information contained here is alarming and should be viewed as a cautionary tale for both the users and the company itself.
Needless to say, at BinaryEdge we value privacy more than anything and, as such, we will not reveal any information regarding personal data, even though it is public at this point.
Activity information
Another big component of the data dump concerns the actual content of the website. This includes information about activities, campaigns, goals, comments, etc. Most of these tables include links to content such as images and videos. This appears to be the information that is actually browsable on the website, which is OK to analyse and publish since it was public to begin with, as long as the identity of the users involved is always protected.
Monetary transactions information
One of the most worrying groups of data concerns monetary transactions. Here we can access credit card information, charge details, transaction history, tax forms, payments, external payments, bills, pledges, etc. Once again, most of this data shall remain private on our part, in order to protect the users.
Website maintenance information
Some tables appear to contain information that is used just for website maintenance purposes. This includes featured campaigns, popular creations, schema versions, etc.
Authentication information
Finally, there are also some tables containing what appears to be authentication data. This includes OAuth tokens, PayPal tokens, two-factor codes, etc.
What Is Next?
As mentioned, this part of the article has the sole intent of providing an overview of the data available. With this, we gained a clear understanding of what data is available and which part of it can actually be used.
As we have seen, within the information that can be considered relevant, there is a lot that pertains to users and endangers their privacy.
For the next part of this article, we will focus on data that can produce interesting knowledge without exposing users, and we will analyse it from a data science perspective.