Thursday, July 7, 2016

Myspace hashes, length 10 and beyond


RecoveredPercent
Total360,213,049
Usable data359,005,905355,886,68699.13%
Unique116,822,086113,830,17697%
Salted hashes68,494,253
Salted pairs66,099,05947,120,45371.29%
non-user pass14,412,2995,8310.04%
meaningful passes51,686,76047,114,62291.15%

When we obtained the Myspace data, we didn’t think too much of it for several reasons. In addition to being a fairly old data-set, the passwords were also truncated to length ten and converted to lowercase prior to being hashed with the SHA-1 algorithm. This means that some of the passwords recovered would be ambiguous and incomplete. This is no longer the case for roughly 68M of the hashes.

The total data-set of roughly 360,213,049 lines contained 359,005,905 usable hashes. This data was de-duplicated to 116,822,086 SHA-1 hashes. Roughly 97% of these hashes were recovered by our group, totaling to 113M hashes. As the passwords were all pre-processed before hashing, the plain-texts which we recovered did not exceed length ten and were all lower-cased.

Since the plain-text passwords aren’t in their original form, they are not as interesting as it does not allow us to gather that much useful information from them. Being truncated, they do give us a glimpse of some longer passwords we may have previously not been able to recover.

Interestingly, user ‘frekvent’ over at the hashes.org forum made an amazing discovery. It appears that for some users there exists an additional salted SHA-1 hash that contains the password in it’s original form, without being truncated or lower-cased. This hash is generated by salting the password with the userid prior to being hashed with SHA-1.

Rather than directly recover the salted SHA-1 hashes, we can take a shortcut. This means for all those users who contain this secondary salted SHA-1 hash, we can now case correct it against the plain-text we previously recovered. It also means we can derive the actual password  for these users prior to length ten truncation.

A generated example

UserID: 65535
Password: Cynosureprime082!

First hash:
(password is truncated to length 10 and lower cased)
cynosurepr->SHA1->6fba0c905ded07590fdbc4b0fa6eb17e565dd814

Second hash:
(userid is applied as a salt to the unmodified password)
Cynosureprime082!->SHA1($salt.$pass)->20c25cbb791bc0b7fcce739f42b682376057eb9e:65535

Stored as:
65535:email:0x6FBA0C905DED07590FDBC4B0FA6EB17E565DD814:0x20C25CBB791BC0B7FCCE739F42B682376057EB9E

Step 1: Recover 6fba0c905ded07590fdbc4b0fa6eb17e565dd814 as cynosurepr
Step 2: Perform case toggling and length extension cynosureprA, cYnosureprBB, cyNosureprZZ etc etc and test against 20c25cbb791bc0b7fcce739f42b682376057eb9e:65535

Out of the entire data-set, about 68M users contain the secondary salted SHA-1 password hash.  Of these 68M users, we were able to pair 66M up with the recovered password. This 66M list was then divided into two groups, ‘non-user pass’ which are users containing system generated passwords (14M) and ‘meaningful passes’, those which belong to users (51.6M). We were only able to pair 66M of the total 68M hashes as we have not fully recovered all the SHA1 hashes, but only 97% of them.

Using our tools we performed either a case toggle and/or length extension attack for each of the salted hash pairs. We have successfully verified over 45M plain-texts against their salted SHA-1 counterpart. The case toggle refers to toggling all passes length ten or less against the salted SHA-1. The length extension attack involves cycling through all possible characters and appending them to the plain-text derived from the recovered normal SHA1 and checking this against the salted SHA-1 hash.

Having both variations of the password hashes has made cracking the longer passwords quite easy since we can first recover the length 10 representation and use this in length extension attacks to obtain the full length password. It would appear that the Myspace data may have some usefulness after all.

Note: The salted hashes can be paired up with their corresponding plaintext data and arranged such that they can be recovered using off the shelf software. However, this won't work for case correction, you will also need to reparse the final output.