How are users choosing their passwords on the internet?

BinaryEdge collects in its platform around 40 different types of OSINT (Open Source Intelligence) information. One of the types of data that we collect is dataleaks that we find on the internet, which we clean up and import into our platform. Our clients are then able to see these dataleaks reflected in their ratings and check which emails from their organization have been found in leaks.

Even though we don't store passwords in our platform, for this blogpost we analysed the passwords leaked in the raw dataleaks of two lists, and we present the most significant statistics here. Note that no correlation was made between emails/usernames and passwords.

The data used in this blogpost come from the AntiPublic Combo List and the Exploit.in List, which were made public in late 2016. These databases combine email addresses/usernames and passwords from various online systems.

The danger of these types of dataleaks is that most people use the same password over and over again on different websites. So, by getting access to this combination of emails and passwords, a hacker could gain access to more than one of the victim's accounts.

In this blogpost, we focused on the leaked passwords, as we wanted to find out what people's habits are when it comes to password reuse and password strength in general. We also did a brief analysis of the email addresses.

We will explain the steps taken for each part of the analysis as well as the tools used. The exact same analysis was performed on both lists.

Throughout this blogpost, the two dataleaks are compared in order to see whether their results differ or are similar. In all our visualizations we present the AntiPublic in red and the Exploit.in in blue, to make it easy to distinguish between the two.

Overview

The first step we took, in order to get a sense of the size of the data we were dealing with, was to count the passwords and email addresses/usernames present in the databases.

Process:

  1. Count the number of combinations of email addresses/usernames and passwords, taking into consideration the separators used between emails and passwords
  2. Count the number of unique combinations of email addresses/usernames and passwords: repeat the process in 1. but count only unique lines
  3. Create a list with emails/usernames only and another list with passwords: split emails/usernames from passwords, considering all the separators present in the data
  4. Distinguish email addresses from usernames: use an email regex on the list of extracted email addresses and usernames to select only the lines containing emails
  5. Count the total number of email addresses and the unique ones
  6. Count the total number of passwords and the number of unique passwords
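
The steps above can be sketched in Python roughly as follows. This is a minimal sketch and not necessarily the tooling used for this blogpost: the separator set (`:` and `;`), the simplified email regex and the function name `summarise_combo_lines` are all assumptions made for illustration.

```python
import re

# Assumed separators between identifier and password; real dumps may use others.
SEPARATORS = re.compile(r"[:;]")
# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def summarise_combo_lines(lines):
    """Count combinations, emails and passwords in an iterable of raw lines."""
    combos, identifiers, passwords = [], [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        combos.append(line)
        # Split on the first separator only, since a password may contain one.
        parts = SEPARATORS.split(line, maxsplit=1)
        identifiers.append(parts[0])
        # Some entries carry no password at all.
        if len(parts) == 2 and parts[1]:
            passwords.append(parts[1])
    # Keep only the identifiers that look like email addresses.
    emails = [ident for ident in identifiers if EMAIL_RE.match(ident)]
    return {
        "combinations": len(combos),
        "unique_combinations": len(set(combos)),
        "emails": len(emails),
        "unique_emails": len(set(emails)),
        "passwords": len(passwords),
        "unique_passwords": len(set(passwords)),
    }
```

For files of this size, holding every line in memory is unlikely to be practical; a streaming approach or external sorting would be needed, which this sketch does not attempt.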

In the image below you can see the results we found.

Although most lines of these files were combinations of usernames/emails and passwords, we noticed that some of the entries didn't have any password associated: they were just usernames or email addresses. Additionally, when focusing only on email addresses, note that in the AntiPublic Combo List the number of passwords is higher than the number of emails, since many lines of these files contained usernames instead of emails.

As a preliminary analysis, we can already notice that the number of unique passwords is much smaller than the total number of passwords, which indicates that different people are using the same passwords.

Email Analysis

Starting from the list of email addresses extracted from the raw data (as described before), we proceeded to analyse the domains present in the database.

First, we separated the domain names from the usernames in all the email addresses and counted them, ignoring the casing so we wouldn't miss results (ex: Hotmail was grouped with hotmail). We found that yahoo.com accounted for 38.7% of all the emails in the AntiPublic Combo List, followed by aol.com (6.8%). As for the Exploit.in List, yahoo.com accounted for 21.4% of the emails, followed by hotmail.com (16.3%).
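
As an illustration, the case-insensitive domain count could look like the sketch below. The function name `domain_shares` is hypothetical, and the input is assumed to be the list of email addresses extracted earlier.

```python
from collections import Counter


def domain_shares(emails):
    """Return (domain, percentage) pairs, most common first, ignoring case."""
    domains = Counter(
        # Everything after the last "@" is the domain; lowercase it so that
        # "Hotmail.com" and "hotmail.com" are counted together.
        email.rsplit("@", 1)[1].lower()
        for email in emails
        if "@" in email
    )
    total = sum(domains.values())
    return [(domain, 100 * count / total) for domain, count in domains.most_common()]
```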

But what if we excluded the TLDs (top level domains - ex: .com, .co.uk, .net) and analysed only the labels (ex: gmail)? This way we would get the big picture on the presence of the labels in these lists. The image below represents the results found.

Comparing these results with the ones mentioned before, we didn't see many differences, as the most common domain labels still accounted for almost half of the databases.

When analysing the Top Level Domains, we found that .com, a generic Top Level Domain (gTLD), accounted for almost 62% of all the emails in both lists.
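
Separating a domain into its label and its TLD is less trivial than it looks, because some suffixes have more than one part (ex: .co.uk). The sketch below shows one possible approach; the function name `split_domain` and the tiny suffix list are assumptions for illustration, and a real analysis would more likely rely on the full Public Suffix List.

```python
# Hypothetical, deliberately tiny list of multi-part suffixes.
MULTI_PART_SUFFIXES = ("co.uk", "com.br", "com.au", "co.jp")


def split_domain(domain):
    """Split a domain into (label, suffix), ignoring case and subdomains."""
    domain = domain.lower().strip(".")
    # Check the known multi-part suffixes first.
    for suffix in MULTI_PART_SUFFIXES:
        if domain.endswith("." + suffix):
            rest = domain[: -len(suffix) - 1]
            return rest.rsplit(".", 1)[-1], suffix
    if "." not in domain:
        return domain, ""
    # Otherwise the suffix is whatever follows the last dot.
    rest, suffix = domain.rsplit(".", 1)
    return rest.rsplit(".", 1)[-1], suffix
```

With this split, `split_domain("Yahoo.co.uk")` and `split_domain("yahoo.com")` both yield the label `yahoo`, so the two are grouped together when counting labels.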

We also wanted to find out which countries had an outstanding presence in the database. So, by looking at the Top Level Domains (TLD) and selecting the ones that were Country Code Top Level Domains (ccTLD), we produced the maps below, representing the top 10 countries present in both lists.

As shown in the maps above, domains ending in .de, which belongs to Germany, are the most common ones in the AntiPublic List, accounting for almost 9% of the email addresses. In the Exploit.in list, emails belonging to Russia were the most common, making up 11% of the list.
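
One way to select the ccTLDs is to rely on the fact that two-letter ASCII TLDs are reserved for country codes. The sketch below uses that rule; the function name `cctld_shares` is hypothetical, and the percentages are computed over all email addresses, which is an assumption about how the figures above were obtained.

```python
from collections import Counter


def cctld_shares(emails):
    """Return {ccTLD: percentage of all emails}, most common first."""
    total = 0
    counts = Counter()
    for email in emails:
        if "@" not in email:
            continue
        total += 1
        # The TLD is whatever follows the last dot of the address.
        tld = email.rsplit(".", 1)[-1].lower()
        # Two ASCII letters means a country code TLD.
        if len(tld) == 2 and tld.isascii() and tld.isalpha():
            counts[tld] += 1
    return {tld: 100 * count / total for tld, count in counts.most_common()}
```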

Password Analysis

The passwords present in these lists were the main focus of our analysis and we had many ideas of what we wanted to analyse. We used a couple of tools (that we will mention later on) to assist us with the analysis.

First of all, and starting from the list of passwords previously created (as described), we wanted to check if multiple users were using the same password. As expected, they were. Out of all the passwords present in the dataset, we found that 65.6% were repeated passwords. Usually, these passwords are combinations of sequential numbers, letters or even keyboard patterns, which makes them easy to guess.
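
A minimal sketch of this check is shown below. It assumes that "repeated" means a password entry whose value occurs more than once in the list, which is only one possible reading of the figure above; the function names are hypothetical. The same counter also gives the most used passwords.

```python
from collections import Counter


def repeated_share(passwords):
    """Percentage of password entries whose value occurs more than once."""
    counts = Counter(passwords)
    total = sum(counts.values())
    repeated = sum(n for n in counts.values() if n > 1)
    return 100 * repeated / total if total else 0.0


def top_passwords(passwords, n=10):
    """The n most frequent passwords together with their counts."""
    return Counter(passwords).most_common(n)
```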

So, which are the most used passwords in these lists? In the image below you can see what we found (top 10 results).

When comparing these passwords with the Worst Passwords of 2016, one can see that most of the passwords we found are part of that list. These passwords put the users at great risk. Some of the passwords do have numbers mixed with letters but, even so, they are still made up of simple patterns, giving a false sense of security - they are still weak passwords.

For the next part of the analysis we used PACK (Password Analysis and Cracking Toolkit). This tool is great for extracting statistics such as password length, character sets and masks, amongst others.

Although this tool is perfect for the analysis we needed, we found a small encoding bug. When analysing password length, we noticed that the tool was counting, for example, the character ä as 2 characters instead of one. Therefore, we wrote a small patch that fixes this bug and submitted an issue on GitHub with a solution for this problem.
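
This kind of miscount typically happens when length is measured on the raw bytes rather than on the decoded text. The sketch below illustrates the difference; it is not the actual patch submitted to PACK, and the function name `password_length` and the assumption that the data is UTF-8 encoded are ours.

```python
def password_length(raw: bytes, encoding: str = "utf-8") -> int:
    """Length in characters (code points) rather than in bytes."""
    # Undecodable bytes are replaced instead of raising an error.
    return len(raw.decode(encoding, errors="replace"))


raw = "p\u00e4ssword".encode("utf-8")  # "pässword" as it would sit in a file
print(len(raw))              # 9: the ä takes two bytes in UTF-8
print(password_length(raw))  # 8: one character per letter
```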

The image below represents the results of the analysis of the length of the passwords present in the AntiPublic Combo List and the Exploit.in List.

As described in the image above, 64.8% and 66.3% of the passwords of the AntiPublic List and the Exploit.in List, respectively, have lengths between 1 and 8 characters.
According to the United States National Institute of Standards and Technology (NIST), a password should have a minimum of 8 characters, although this minimum should be higher for more sensitive accounts. By looking at both lists, we found that 62.7% of the passwords in AntiPublic and 61.9% of the passwords in Exploit.in followed this guideline (had at least 8 characters). However, a long password isn't necessarily a strong password.

When analysing password strength, besides considering password length, most systems now require that passwords include numbers, mixed case letters and special characters. However, the new NIST guidelines on password management say that a mix of characters in passwords should not be imposed on users.

The analysis of the types of characters used in the passwords present in the AntiPublic Combo and the Exploit.in Lists revealed the results below. Please note that this represents the characters contained in the passwords, which doesn't mean that these passwords have only a certain type of character.

The great majority of the passwords contain letters or numbers and only 4% and 5% of the passwords in the AntiPublic Combo List and the Exploit.in List, respectively, contain special characters. Although it is not shown in the image above, it is worth mentioning that only 0.2% and 0.3% of the passwords present in the AntiPublic Combo List and Exploit.in List, respectively, contain all types of characters - capital and small letters, numbers and special characters all in one password.

Besides knowing the number of passwords containing a certain type of character, we also wanted to understand how many passwords were composed exclusively of one type of character. Below are the results found.

As can be seen in the image above, most of the passwords present in both lists are composed exclusively of numbers and letters.
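
The difference between "contains" and "composed exclusively of" can be made concrete with a small sketch. The class names and the function `char_classes` are assumptions for illustration, and PACK's own character set definitions may differ (for instance in how non-Latin letters are treated).

```python
def char_classes(password):
    """Return the set of character classes that appear in a password."""
    classes = set()
    for ch in password:
        if ch.islower():
            classes.add("lower")
        elif ch.isupper():
            classes.add("upper")
        elif ch.isdigit():
            classes.add("digit")
        else:
            # Anything else, including uncased letters, counts as special here.
            classes.add("special")
    return classes


def contains(password, cls):
    """True if the password has at least one character of the class."""
    return cls in char_classes(password)


def exclusively(password, cls):
    """True if every character of the password belongs to the class."""
    return char_classes(password) == {cls}
```

A password "contains all types of characters" when `char_classes` returns all four classes.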

However, the exclusivity of a type of character in a password isn't necessarily correlated with the strength of a password. Take a look at the image below, from XKCD, to better understand this idea.

Sometimes it is better to have a passphrase as a password than a mix of characters that make simple and easily guessable patterns.

Just out of curiosity, we searched both lists for the password shown in the image, correcthorsebatterystaple. We found it 18 times in the AntiPublic List and 7 times in the Exploit.in List. Using a password that's used as an example is as bad as using the password "password".

The next statistic we looked at was the distribution of characters in passwords, that is, the order in which each type of character appears. For example, in 'password123', the sequence of characters would be 'string-digit'. Take a look at the image below to see what we found.

In both the AntiPublic Combo List and the Exploit.in List, the majority of the passwords are composed of numbers only, letters only or a sequence of letters followed by numbers. Combining this information with the fact that many of the passwords appear in the list of worst passwords and that many are repeated, it can be concluded that password best practices aren't really being followed by most people. Of course we have to consider that these combo lists could be old, but at the same time people should have been more careful with the type of passwords they use for a long time now.
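
The character sequence mentioned above ('string-digit' for 'password123') can be derived by collapsing consecutive characters of the same type into a single token. The sketch below is one way to do it and is not PACK's implementation; the function names and the three type labels are assumptions.

```python
from itertools import groupby


def char_type(ch):
    """Classify a single character as 'string', 'digit' or 'special'."""
    if ch.isalpha():
        return "string"
    if ch.isdigit():
        return "digit"
    return "special"


def char_sequence(password):
    """Collapse runs of the same character type into one token each."""
    return "-".join(kind for kind, _ in groupby(password, key=char_type))


print(char_sequence("password123"))  # string-digit
```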

Since character complexity isn't by itself a measure of a password's strength, a further analysis that considered both password length and character complexity together was necessary to better understand the strength of the passwords. For this, we used 'zxcvbn-python', a Python library that estimates password strength through pattern matching and conservative estimation. As can be read in the Dropbox blog, this estimator takes into consideration keyboard patterns (ex: qwerty), repetitions (ex: aaa), sequences (ex: 123) and substitutions between letters and numbers (ex: 3 for e), amongst others. This tool gives us the strength of a password by calculating its entropy (in bits). It also gives us a password cracking time, which is an easily understandable estimate of how easy it is to crack a password.

Take a look at the blogpost mentioned above as it explains very clearly how this tool works: "score calculates an entropy for each matched pattern, independent of the rest of the password, assuming the attacker knows the pattern. A simple example: rrrrr. In this case, the attacker needs to iterate over all repeats from length 1 to 5 that start with a lowercase letter: entropy = lg(26x5) # about 7 bits".
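
The arithmetic in the quoted example can be checked directly. The sketch below only reproduces that one calculation (the base-2 logarithm of alphabet size times maximum repeat length); it is not zxcvbn's code, and the function name `repeat_entropy_bits` is hypothetical.

```python
import math


def repeat_entropy_bits(alphabet_size, max_length):
    """Entropy in bits of a single repeated character, as in the quoted example."""
    # One of `alphabet_size` letters, repeated 1 to `max_length` times.
    return math.log2(alphabet_size * max_length)


print(round(repeat_entropy_bits(26, 5), 2))  # 7.02, i.e. "about 7 bits"
```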

We analysed the most common passwords found in the AntiPublic Combo List and Exploit.in List (the top 20 passwords) and found the following results.

Most passwords have low entropy values, which means that they are easy to crack. In the big picture, anyone trying to break into an account protected by one of these passwords could easily do so. The situation is even more dangerous when people use these passwords for more than one account, which we know happens.

When analysing the entropy of the unique passwords in both lists, we found that the average entropy was 12.53 bits for the AntiPublic Combo List and 13.29 bits for the Exploit.in List.
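
To put these averages in perspective, an entropy of H bits corresponds to a search space of roughly 2^H guesses. The sketch below applies that conversion to the two averages; it takes the entropy estimates at face value, and the function name `guesses_for_entropy` is hypothetical.

```python
def guesses_for_entropy(bits):
    """Size of the search space implied by an entropy estimate in bits."""
    return 2 ** bits


for name, bits in [("AntiPublic", 12.53), ("Exploit.in", 13.29)]:
    print(f"{name}: about {guesses_for_entropy(bits):,.0f} guesses")
```

Under these assumptions the averages correspond to roughly 5,900 and 10,000 guesses respectively, which automated tools can try almost instantly.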

Final remarks

As mentioned in the beginning, the purpose of analysing these two lists was to understand users' password habits. From what we gathered, there is still a great number of people using weak passwords or simple masks.

It's fundamental to let users know the risks they face when using these kinds of passwords, since they are easy targets for account takeover and could become victims of identity theft.

Here are a few tips to help increase password security:

  • use different passwords for different websites;

  • strengthen the passwords in general by following some guidelines (ex: NIST);

  • don't use personal information (date of birth, names);

  • use a random password generator;

  • don't use the usernames in the password;

  • use a password manager to store passwords - take a look at this article "Best password managers of 2017: Reviews of the top products";

  • take a look at Have I Been Pwned to check if any of your emails and passwords have been compromised;

  • enable two-factor authentication whenever possible

Also, if you are looking to improve your security, check out the security ratings tool that we've made available for free: securityrating.io