Cracking with the Webtionary: Using Google and Yahoo! to Light-Force an (Almost) Infinite Dictionary

by Acrobatic

Attacks on cryptographic schemes have been around for years.  Generally, the most successful attacks rely on time, powerful processors, and a large pool of data from which to test cracking attempts.

One way of alleviating the time problem and thus the processor problem is to have more than one cracker working on the problem simultaneously.  We see the effects of this in contests like distributed.net: Project RC5, which used distributed computing to crack previously uncrackable ciphers, having hundreds of thousands of people employ their computers towards the goal of testing every possible key until the correct one is found.

Many attacks on encrypted passwords rely on dictionary attacks, in which weak passwords are guessed by testing them against millions of entries of plaintext words in a file or database.  Often, these repositories can be found split into themes, such as huge lists of personal names, places, or commonly used passwords.  The larger your pool of data, the better your chances of success - but the longer it will take to test every possibility.

It was recently pointed out that cryptographic hashes such as MD5 can often be reversed using search engines such as Google.  For example, searching for 5f4dcc3b5aa765d61d8327deb882cf99, the MD5 hash of the word password, takes less than a quarter of a second to return over 500 pages containing both the hash and the word password in close proximity to each other.  (It is no coincidence that password is one of the top 10 most frequently used passwords.)

$ echo -n password | md5sum
5f4dcc3b5aa765d61d8327deb882cf99  -

Remember our three criteria for increasing success at cracking?  We've just used one computer, a search engine, and less than a quarter of a second to crack an "uncrackable" hash.  By using search engines, we use Google and Yahoo!'s immense catalogs of indexed pages and their thousands of server processors to search for a hash on the same page as its plaintext equivalent.

Imagine the possibilities: millions of pages with millions of hashes and their respective plaintexts, indexed by Google and Yahoo! for us to peruse.  The Internet has essentially become both a distributed computing ring and a huge dictionary for us to brute-force from - a webtionary.

Granted, this isn't nearly as easy for more secure passwords or for passwords that have been salted and then hashed.  (Salting is adding extra text to a password before hashing it; the same text is stored alongside the hash and added again whenever a password needs to be checked.)  However, the vast majority of people use passwords that are very easy to recover if you know the hash.  Getting the hash is a different problem in itself, and I'll get to that in a second.
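
To see why a salt defeats this kind of lookup, here is a minimal PHP sketch.  The helper name salted_md5, the salt value, and the choice to prepend the salt are all assumptions made for illustration, not a scheme taken from any particular site.

```php
<?php
// Hypothetical helper: hash a password with a salt prepended to it.
function salted_md5($salt, $password) {
    return md5($salt . $password);
}

$plain  = md5('password');                  // the widely indexed hash from the example above
$salted = salted_md5('x9!Tq', 'password');  // same password, made-up salt

echo $plain . "\n";
echo $salted . "\n";
```

The two lines printed are different 32-character hashes, so a search for the salted one will not land on pages that list the plain hash next to the word password.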

Using PHP, I wrote a program that takes care of the dirty work for you.  It does a Google search for a hash, scans the results, sorts them by word frequency, and uses that relatively small subset as a cracking dictionary to find a match.  If it finds a match, it returns the plaintext to you, so you don't have to search all the pages manually.  If the Google search is unsuccessful, the program does a search with Yahoo!; it scans the URL title, the Yahoo! summary of the page, and finally, if that fails, the page itself, and performs similar analysis as we did with the Google results.
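
The core of that idea fits in a few lines.  The sketch below is a simplified illustration rather than the listing at the end of this article: find_plaintext is a hypothetical name, the sample snippet is invented, and the tokenizing rule (split on anything that is not a letter or digit) is an assumption.

```php
<?php
// Hypothetical helper: return the word in $text whose MD5 matches $hash, or false.
function find_plaintext($hash, $text) {
    $hash = strtolower(trim($hash));
    // Split on anything that is not a letter or digit, dropping empty pieces.
    $words = preg_split('/[^A-Za-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    // Count occurrences and try the most frequent words first.
    $counts = array_count_values($words);
    arsort($counts);
    foreach (array_keys($counts) as $word) {
        // Numeric words such as "2600" come back from array_keys() as integers.
        $word = (string) $word;
        if (md5($word) === $hash) {
            return $word;
        }
    }
    return false;
}

// Invented stand-in for the text of a search result.
$snippet = 'md5 lookup: 5f4dcc3b5aa765d61d8327deb882cf99 = password (the password is weak)';
var_dump(find_plaintext('5f4dcc3b5aa765d61d8327deb882cf99', $snippet));
```

Sorting by frequency is only a heuristic: on pages that pair hashes with plaintexts, the plaintext tends to repeat, so it is usually tested early.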

I originally thought about creating a huge database full of deciphered hashes as a backup when the webtionary search failed, but the point of the project is not to become a cracking database, but rather to show the power of using the web and search engines to do all the hard work.  Besides, you can find scores of these databases across the web; for example, GDataOnline.com alone has almost 900,000 solved hashes.

As you'll see in the source code, I did build in the ability to use a database, but only for storing passwords that the script has already deciphered.  This is because the search engine APIs I use allow only a limited number of lookups per day.  I'll leave the database write method turned off until the search engines start blocking access because I've used up my limit.
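
A rough sketch of that caching idea follows.  It uses a plain PHP array where the real thing would use a database table, and cached_lookup and $fake_search are hypothetical names standing in for the script's own functions; it assumes PHP 5.3 or later for the closure.

```php
<?php
// Hypothetical wrapper: $lookup is any callable that takes a hash and
// returns the plaintext, or false when nothing was found.
function cached_lookup($hash, array &$cache, $lookup) {
    $hash = strtolower($hash);
    if (isset($cache[$hash])) {
        return $cache[$hash];        // answered locally, no API call spent
    }
    $plaintext = call_user_func($lookup, $hash);
    if ($plaintext !== false) {
        $cache[$hash] = $plaintext;  // store successful finds only
    }
    return $plaintext;
}

// Stand-in for a search engine query that counts how often it is called.
$api_calls = 0;
$fake_search = function ($hash) use (&$api_calls) {
    $api_calls++;
    return $hash === md5('password') ? 'password' : false;
};

$cache = array();
cached_lookup(md5('password'), $cache, $fake_search);
cached_lookup(md5('password'), $cache, $fake_search);
echo $api_calls . "\n";  // the second lookup is served from $cache
```

Only successful finds are stored, so a hash that is not on the web today can still be found later once a search engine has indexed it.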

Using this script, I've been able to find the matches for hundreds of hashes in less than a few seconds each.  It's important to remember that this is not a cracker - it's a finder.  Instead of brute-force, I like to call it "light-force."  If the hash and plaintext haven't been posted to the web and indexed by the search engines, this script won't help.

Just for fun, I used the script to search for this hash: 32b991e5d77ad140559ffb95522992d0

Yahoo! found and returned the plaintext 2600 to me in 1.074 seconds.  This means that somewhere out there, someone has used and deciphered 2600 as a password and posted it on the Internet.

While writing this program I investigated and inspected many pages of results from search engines.  I was shocked by the number of pages I found that were database dumps of user information, including contact information, security questions and answers, private message logs, and more, tucked away along with the MD5 hashes of their passwords in various websites across the world, where their owners probably thought they were safe.

A more nefarious programmer could write a script to search each of these hashes and easily compromise websites and user accounts.

This should once again be a reminder to programmers to always secure their data.  At least salt your users' passwords before storing them on the web.  And it's always a good idea to test the strength of your own password.
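
As one way to follow that advice, the sketch below uses PHP's built-in password_hash() and password_verify(), which generate a random salt and store it inside the resulting string.  It assumes PHP 5.5 or later, and the passwords shown are examples only.

```php
<?php
// password_hash() picks a random salt and embeds it in the value it returns.
$stored = password_hash('password', PASSWORD_DEFAULT);

var_dump($stored !== md5('password'));          // the stored value is not a bare MD5
var_dump(password_verify('password', $stored)); // the correct password checks out
var_dump(password_verify('letmein', $stored));  // a wrong password does not
```

Because the salt is random, hashing the same password twice gives two different stored values, so a leaked value cannot simply be pasted into a search engine.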

You can create an MD5 hash of a plaintext word in Linux by typing echo -n plaintext | md5sum (on OS X, where md5sum may not be installed, echo -n plaintext | md5 does the same job), or find one of the many MD5 generators on the web.  Then, see if the program can decipher your hash...

My working model can be found at www.bigtrapeze.com/md5/.

The source code can be found at www.bigtrapeze.com/md5/source/ or in the 2600 code repository.

index.php:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html>
<head>
<title>Let's Crack Some MD5</title>
</head>
<body style="margin: 0px; padding: 0px; font-family: arial; font-size: 12px;">

<div style="border-bottom: 1px solid #aaa; background-color:#e6e6e6; padding: 5px; text-align: center;">
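<?php
	// Assumption: md5-source.php sits next to this file; it sets $query, $found, $found_location, and $time.
	require_once 'md5-source.php';

	if (!empty($query)) {
		if ($found) {
			// Escape user-supplied and web-supplied text before printing it
			echo htmlspecialchars($query) . " => <strong>" . htmlspecialchars($found) . "</strong>";
		} else {
			echo "Unable to crack via search engines";
		}
	} else {
		echo "Light-force MD5 crack using search engines";
	}
?>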
</div>
		
<div id="hash_form" style="margin:5px; padding:5px;">
<form action="<?= htmlspecialchars($_SERVER["PHP_SELF"]) ?>" method="post">Please enter an MD5 hash you wish to crack:
<input type="text" name="query" size="32" maxlength="32">
<input type="submit" value=" Crack ">
</form>
</div>
		
<div id="hash_response" style="margin:5px; padding:5px;">
<?php if (!empty($query)) { ?>
<strong>Result:</strong><br>
<?php
	if ($found) {
		// Matched the key; escape both values before printing them
		$safe_query = htmlspecialchars($query);
		$safe_found = htmlspecialchars($found);
		echo "<ul>Congratulations!";
		echo "<li>The hash <strong>$safe_query</strong> has been deciphered to: <strong>$safe_found</strong></li>";
		switch ($found_location) {
			case "YAHOO_URL":
				echo "<li>The plain-text was found in a Yahoo URL</li>";
				break;
			case "YAHOO_SUMMARY":
				echo "<li>The plain-text was found in Yahoo's summary of a page</li>";
				break;
			case "YAHOO_PAGE":
				echo "<li>The plain-text was found in a page via Yahoo</li>";
				break;
			case "GOOGLE_SUMMARY":
				echo "<li>The plain-text was found in Google's summary of a page</li>";
				break;
			case "DATABASE":
				echo "<li>This plain-text had already been found via search engine</li>";
				break;
		}
		echo "<li>Found in " . $time . " seconds</li>";
		echo "</ul>";
	} else {
		echo "<ul>Sorry!";
		echo "<li>We couldn't crack the hash with a search engine.</li>";
		echo "<li>Hacked for " . $time . " seconds.</li>";
		echo "</ul>";
	}
}
?>
</div>
</body>
</html>

md5-source.php:

<?php
/*****************************************************************************************************
	Crack MD5 v.03
	by Acrobatic (jbnunn@gmail.com)
	GNU General Public License, have fun with it
	
	Light-force crack of exposed MD5 hashes
	
	-	"Light-force" because we're not brute-forcing through a huge database of decrypted keys. 
		You can do that all over the internet (shout out to gdataonline.com). We're just showing 
		proof-of-concept by using search engines to find keys and their plaintext.
		
	-	We do use our own database to store successful deciphers made from this application. Mainly
		because we're limited to a certain amount of API calls per day on Yahoo and Google.
		
	-	Theoretically, if enough pages of hash / plaintext combinations were cataloged by search 
		engines across the internet, we could use the processing power of Google and Yahoo to search 
		much more efficiently than searching millions of rows in a database
		
	-	We search Google first, because if there are search results returned, the plaintext is almost
		always found in the page summary. With Yahoo we have to do a bit more checking.
			
	Changelog
	
	# version .02:
	- added database support for caching found hashes
	- changed regexes a little (still need some voodoo to make them more efficient)
	- organized the code to be more modular
	- added memory limits because it crashed when a large html page was loaded 
		(instead of skipping this page we might need to figure out how to better parse it into manageable chunks)
	- now searches Yahoo's summary field for each page (as well as the URL name and page content)
	- added debug content
	- made user agent show $_SERVER['HTTP_USER_AGENT'] instead of "spider"
	
	# version .03:
	- added upper/lower case hash searches, and searches with punctuation removed
	- added Google search (without API)
	- fixed sorting bug where keys were being lost
	- made use of php's parse_url instead of homegrown version
	- made regular expressions more efficient
	- removed dependency on external filter for cleaning input
	- removed references to database (add your own if you want to use a database as a backup for stored hashes)
	
*****************************************************************************************************/

$time_start = microtime(true);	// Time script execution (PHP 5 only)
ini_set("memory_limit","10M");	// Increase memory size for large arrays
ini_set("max_execution_time","120");	// 2 minutes to run through the results

// Setup variables and constants
$yahoo_appid = "yourownappid";	// Use your own APP ID from Yahoo
$url_list = array();
$found = FALSE;
$found_location = FALSE;	// Found in google / yahoo

function searchYahoo($query) {
	/* Search Yahoo for the query. Searches both the URL address and page content */
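	global $yahoo_appid;
	global $found_location;
	
	// Start with empty lists so the loops below are safe when nothing comes back
	$url_list = array();
	$summary_list = array();
	
	// Get the Yahoo REST output (the hash is URL-encoded before it goes into the request)
	$request =  "http://api.search.yahoo.com/WebSearchService/V1/webSearch?appid=$yahoo_appid&query=" . urlencode($query) . "&results=10&output=php";
	$response = file_get_contents($request);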

	if ($response === false) {
		// Request for page failed... error check or graceful exit here
		die('Oops, we messed something up... try another hash.');
	}

	$phpobj = unserialize($response);
	
	// First we need to search the page URL (title) itself. 
	foreach($phpobj['ResultSet'] as $result) {
		if(is_array($result)) {
			foreach($result as $child) {
				$url_list[] = $child['Url'];
			}
		}
	} 
	
	if(!$url_list || count($url_list) == 0) {
		return false;
	}
	
		$url_check = array();
		$url_breakdown_array = array();

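		foreach($url_list as $url) {
			$parsed_data = parse_url($url);

			// Breakdown the path for plaintext searching (not every URL has one)
			$path = isset($parsed_data['path']) ? $parsed_data['path'] : '';
			$url_array = str_replace(array('/','.','_'),' ',$path);
			$url_check[] = explode(' ',$url_array);	// explode() replaces the removed split()
			
			// Breakdown the query for plaintext searching (not every URL has one)
			$query_string = isset($parsed_data['query']) ? $parsed_data['query'] : '';
			$url_array = str_replace(array('=','+','amp;','&','quot;'),' ',$query_string);
			$url_check[] = explode(' ',$url_array);
		}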
		
		foreach($url_check as $url_set) {
			foreach($url_set as $url) {
				if(trim($url) != '') {
					$url_breakdown_array[] = $url;
				}
			}
		}
	
	// Sort the array in order of most frequent word
		$url_breakdown_array = array_count_values($url_breakdown_array);
		arsort($url_breakdown_array);	// keeps keys; array_multisort would renumber numeric words such as "2600"
		$url_breakdown_array = array_keys($url_breakdown_array);
		
	foreach($url_breakdown_array as $plaintext) {
		$found = checkHash($query, $plaintext);
		if($found) {
			$found_location = "YAHOO_URL";
			return $found;
		} 
	}
	
	// URL didn't contain hash? Loop through results and search within the Yahoo "Summary" field
	foreach($phpobj['ResultSet'] as $result) {
		if(is_array($result)) {
			foreach($result as $child) {
				$summary_list[] = $child['Summary'];
			}
		}
	} 
	
	$summary_array = array();
	$summary_breakdown_array = array();
	
	foreach($summary_list as $summary) {
		// Split on URL separators, HTML entities, spaces, commas, and quotes
		// (preg_split() replaces the removed split(); we might need more separators too...)
		$summary_array[] = preg_split('~&amp;|&quot;|[/?=\-._ ,"\']~', $summary);
	}
	
	foreach($summary_array as $summary_set) {
		foreach($summary_set as $summary) {
			if(trim($summary) != '') {
				$summary_breakdown_array[] = $summary;
			}
		}
	}
	
	// Sort the array in order of most frequent word
	$summary_breakdown_array = array_count_values($summary_breakdown_array);
	arsort($summary_breakdown_array);
	$summary_breakdown_array = array_keys($summary_breakdown_array);
			
	foreach($summary_breakdown_array as $plaintext) {
		$found = checkHash($query, $plaintext);
		if($found) {
			$found_location = "YAHOO_SUMMARY";
			return $found;
		} 
	}
	
	// Still no match? Pull in the entire page and check the content of the text:
	foreach($url_list as $url) {
		// Get the web page
		$html = get_web_page($url);
		
		// No content, or content too big for the filter--skip this page and try the next one
		if(!isset($html['content']) || strlen($html['content']) > 100000) {
			continue;
		}
	
		// Remove HTML tags
		$filtered_content = strip_tags($html['content']);
		
		// Remove punctuation
		$filtered_content = strip_punctuation($filtered_content);
		
		// Convert to array
		$single_word_array = explode(" ",$filtered_content);
		
		// Filter out strings that probably aren't valid for hashing
		$filtered_word_array = array();	// start fresh for every page
		foreach($single_word_array as $single_word) {
			if(strlen($single_word) < 32) {
				$filtered_word_array[] = $single_word; 
			}
		}
	
		// Sort the array in order of most frequent word
		$filtered_word_array = array_count_values($filtered_word_array);
		arsort($filtered_word_array);
		$filtered_word_array = array_keys($filtered_word_array);
		
		// Run through the array and check the user-requested MD5 hash vs the words in the array
		foreach($filtered_word_array as $plaintext) {
			$found = checkHash($query, $plaintext);
			if($found) {
				$found_location = "YAHOO_PAGE";
				return $found;
			} 
		}
	}
	
}

function searchGoogle($query) {
	global $found_location;

	$html = get_web_page("http://www.google.com/search?q=" . urlencode($query));
	
	// No content, or content too big for the filter--give up on Google
	// (a bare "break" is not allowed here because we are not inside a loop)
	if(!isset($html['content']) || strlen($html['content']) > 100000) {
		return false;
	}

	// Remove HTML tags
	$filtered_content = strip_tags($html['content']);
	
	// Remove punctuation
	$filtered_content = strip_punctuation($filtered_content);
	
	// Convert to array
	$single_word_array = explode(" ",$filtered_content);
	
	// Filter out strings that probably aren't valid for hashing
	$filtered_word_array = array();
	foreach($single_word_array as $single_word) {
		if(strlen($single_word) < 32) {
			$filtered_word_array[] = $single_word; 
		}
	}

	// Sort the array in order of most frequent word
	$filtered_word_array = array_count_values($filtered_word_array);
	arsort($filtered_word_array);
	$filtered_word_array = array_keys($filtered_word_array);
		
	// Run through the array and check the user-requested MD5 hash vs the words in the array
	foreach($filtered_word_array as $plaintext) {
		$found = checkHash($query, $plaintext);
		if($found) {
			$found_location = "GOOGLE_SUMMARY";
			return $found;
		} 
	}
	
	// Nothing on the results page matched
	return false;
}
	
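function checkHash($query, $plaintext) {
	/* Check the hash against the plaintext as found, plus its lower- and upper-case versions */
	$query = strtolower(trim($query));	// md5() returns lower-case hex
	$plaintext = (string) $plaintext;	// numeric words come back from array_keys() as integers

	// Use === so that two different "0e..." hashes are never compared as numbers
	if(md5($plaintext) === $query) {
		return $plaintext;
	} elseif(md5(strtolower($plaintext)) === $query) {
		return strtolower($plaintext);
	} elseif(md5(strtoupper($plaintext)) === $query) {
		return strtoupper($plaintext);
	} else {
		return false;
	}
}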
	
function get_web_page($url) {
	/* Gets a webpage and saves it into an array for further processing */
	$options = array( 'http' => array(
		'user_agent'    => $_SERVER['HTTP_USER_AGENT'],    // run as a web browser rather than a spider
		'max_redirects' => 10,          // stop after 10 redirects
		'timeout'       => 120,         // timeout on response
	) );
	$context = stream_context_create( $options );
	$page    = @file_get_contents( $url, false, $context );
 
	$result  = array( );
	if ( $page !== false )
		$result['content'] = $page;
	else if ( !isset( $http_response_header ) )
		return null;    // Bad url, timeout

	// Save the header
	$result['header'] = $http_response_header;

	// Get the *last* HTTP status code
	$nLines = count( $http_response_header );
	for ( $i = $nLines-1; $i >= 0; $i-- )
	{
		$line = $http_response_header[$i];
		if ( strncasecmp( "HTTP", $line, 4 ) == 0 )
		{
			$response = explode( ' ', $line );
			$result['http_code'] = $response[1];
			break;
		}
	}
 
	return $result;
}

function strip_punctuation($text)
{
	/* strips punctuation from text, 
		based on http://nadeausoftware.com/articles/2007/9/php_tip_how_strip_punctuation_characters_web_page 
	*/
    $urlbrackets    = '\[\]\(\)';
    $urlspacebefore = ':;\'_\*%@&?!' . $urlbrackets;
    $urlspaceafter  = '\.,:;\'\-_\*@&\/\\\\\?!#' . $urlbrackets;
    $urlall         = '\.,:;\'\-_\*%@&\/\\\\\?!#' . $urlbrackets;
 
    $specialquotes  = '\'"\*<>';
 
    $fullstop       = '\x{002E}\x{FE52}\x{FF0E}';
    $comma          = '\x{002C}\x{FE50}\x{FF0C}';
    $arabsep        = '\x{066B}\x{066C}';
    $numseparators  = $fullstop . $comma . $arabsep;
 
    $numbersign     = '\x{0023}\x{FE5F}\x{FF03}';
    $percent        = '\x{066A}\x{0025}\x{066A}\x{FE6A}\x{FF05}\x{2030}\x{2031}';
    $prime          = '\x{2032}\x{2033}\x{2034}\x{2057}';
    $nummodifiers   = $numbersign . $percent . $prime;
 
    return preg_replace(
        array(
        // Remove separator, control, formatting, surrogate, open/close quotes.
            '/[\p{Z}\p{Cc}\p{Cf}\p{Cs}\p{Pi}\p{Pf}]/u',
        // Remove other punctuation except special cases
            '/\p{Po}(?<![' . $specialquotes .
                $numseparators . $urlall . $nummodifiers . '])/u',
        // Remove non-URL open/close brackets, except URL brackets.
            '/[\p{Ps}\p{Pe}](?<![' . $urlbrackets . '])/u',
        // Remove special quotes, dashes, connectors, number separators, and URL characters followed by a space
            '/[' . $specialquotes . $numseparators . $urlspaceafter .
                '\p{Pd}\p{Pc}]+((?= )|$)/u',
        // Remove special quotes, connectors, and URL characters preceded by a space
            '/((?<= )|^)[' . $specialquotes . $urlspacebefore . '\p{Pc}]+/u',
        // Remove dashes preceded by a space, but not followed by a number
            '/((?<= )|^)\p{Pd}+(?![\p{N}\p{Sc}])/u',
        // Remove consecutive spaces
            '/ +/',
        ),
        ' ',
        $text );
}


// This is where the magic happens. First check for form data...
if(isset($_REQUEST['query']) && trim($_REQUEST['query']) != '') {

		#$query = filter_input(INPUT_POST, 'query'); 	// Clean the request for more security... 
		#                                               // doesn't work in all PHP setups so we disable it here
	$query = strtolower(trim($_REQUEST['query']));	// md5() produces lower-case hex, so normalize the hash

	// Search Google
	if(!$found) {
		$found = searchGoogle($query);
	}

	// Search Yahoo
	if(!$found) {
		$found = searchYahoo($query);
	}
	
	$time_end = microtime(true);
	$time = $time_end - $time_start;
}
?>
