You need a use case where a lot of computation is applied to a little data. How about any of the various distributed computing projects out there? Although the SETI@home use case seems like a cool example, I doubt you want to reimplement its client.

It might be far simpler to reimplement a search for Mersenne primes or optimal Golomb rulers or something. Although you're not going to get great speed through the JVM, it may be just fine as an example. And it stands some remote chance of getting useful work done.
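If anyone wants a concrete starting point for the Mersenne-prime idea, below is a minimal PySpark sketch of what such a job could look like. It is only an illustration: the exponent range, partition count, and the helper functions are placeholders made up for the example, and pure-Python big-integer math will be nowhere near the speed of GIMPS's optimized code.

from pyspark import SparkContext

def is_prime(n):
    """Trial division; fine for the small exponents tried here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def lucas_lehmer(p):
    """Return True if 2**p - 1 is a Mersenne prime (p must itself be prime)."""
    if p == 2:
        return True
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

sc = SparkContext(appName="mersenne-search")

# Only prime exponents p can yield a Mersenne prime 2**p - 1, so filter first;
# each Lucas-Lehmer test is independent and parallelizes trivially.
exponents = sc.parallelize(list(range(2, 5000)), numSlices=64).filter(is_prime)
mersenne = exponents.filter(lucas_lehmer).collect()
print(sorted(mersenne))   # 2, 3, 5, 7, 13, 17, 19, 31, 61, 89, 107, ...

The RDD here carries nothing but small integers, so it is exactly the "lots of computation on a little data" shape described above.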
On Thu, Jun 12, 2014 at 11:11 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Indeed, rainbow tables are helpful for working on unsalted hashes. They
> turn a large amount of computational work into a bit of computational work
> and a bit of lookup work. The rainbow tables could easily be captured as
> RDDs.
>
> I guess I derailed my own discussion by focusing on password cracking,
> since my intention was to explore how Spark applications are written for
> compute-intensive workloads as opposed to data-intensive ones. And for
> certain types of password cracking, the best approach is to turn compute
> work into data work. :)
>
> On Thu, Jun 12, 2014 at 5:32 AM, Marek Wiewiorka <marek.wiewio...@gmail.com> wrote:
>
>> This is actually what I've already mentioned - with rainbow tables kept
>> in memory it could be really fast!
>>
>> Marek
>>
>> 2014-06-12 9:25 GMT+02:00 Michael Cutler <mich...@tumra.com>:
>>
>>> Hi Nick,
>>>
>>> The great thing about any *unsalted* hashes is you can precompute them
>>> ahead of time; then it is just a lookup to find the password which matches
>>> the hash in seconds -- always makes for a more exciting demo than "come
>>> back in a few hours".
>>>
>>> It is a no-brainer to write a generator function to create all possible
>>> passwords from a charset like
>>> "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", hash
>>> them and store them to look up later. It is, however, incredibly wasteful
>>> of storage space:
>>>
>>> - all passwords from 1 to 9 characters long
>>> - using the charset above = 13,759,005,997,841,642 passwords
>>> - assuming 20 bytes to store the SHA-1 and up to 9 bytes to store the
>>>   password, that comes to approximately 375.4 petabytes
>>>
>>> Thankfully there is a more efficient/compact mechanism to achieve this:
>>> Rainbow Tables <http://en.wikipedia.org/wiki/Rainbow_table>. Better still,
>>> there is an active community of people who have already precomputed many
>>> of these datasets. The dataset above is readily available to download and
>>> is just 864 GB -- much more feasible.
>>>
>>> All you need to do then is write a rainbow-table lookup function in
>>> Spark and leverage the precomputed files stored in HDFS. Done right, you
>>> should be able to achieve interactive (few-second) lookups.
>>>
>>> Have fun!
>>>
>>> MC
>>>
>>> *Michael Cutler*
>>> Founder, CTO
>>> Web: tumra.com <http://tumra.com>
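To make the precompute-and-look-up idea above concrete, here is a rough PySpark sketch. Note that it builds a plain exhaustive (hash, password) table rather than a true rainbow table (which stores only the end points of hash/reduce chains to save space), and the charset, password length, and target hashes are placeholder assumptions chosen so the example stays small enough to actually run.

import hashlib
from itertools import product
from pyspark import SparkContext

CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"   # placeholder charset
MAX_LEN = 4                                         # keep the demo small

# For the full 62-character set and lengths 1 to 9 the search space is
# sum(62**k for k in range(1, 10)) = 13,759,005,997,841,642 passwords,
# which is why storing every (hash, password) pair is so wasteful.

def candidates_for_prefix(prefix):
    """Generate every candidate password starting with the given character."""
    for length in range(1, MAX_LEN + 1):
        for tail in product(CHARSET, repeat=length - 1):
            yield prefix + "".join(tail)

def sha1_hex(pw):
    return hashlib.sha1(pw.encode("utf-8")).hexdigest()

sc = SparkContext(appName="sha1-lookup-table")

# One partition per first character keeps the work roughly balanced.
prefixes = sc.parallelize(list(CHARSET), numSlices=len(CHARSET))
table = prefixes.flatMap(candidates_for_prefix).map(lambda pw: (sha1_hex(pw), pw))

# Hashes we want to reverse (placeholder values we can verify).
targets = sc.parallelize([sha1_hex("abc1"), sha1_hex("zz99")]).map(lambda h: (h, None))

cracked = targets.join(table).map(lambda kv: (kv[0], kv[1][1])).collect()
print(cracked)   # [(<hash>, 'abc1'), (<hash>, 'zz99')]

With a precomputed table persisted to HDFS you would read it back with something like sc.textFile and parse it into (hash, password) pairs instead of regenerating it on every run.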
>>> On 12 June 2014 01:24, Nick Chammas <nicholas.cham...@gmail.com> wrote:
>>>
>>>> Spark is obviously well-suited to crunching massive amounts of data.
>>>> How about crunching massive amounts of numbers?
>>>>
>>>> A few years ago I put together a little demo for some co-workers to
>>>> demonstrate the dangers of using SHA1
>>>> <http://codahale.com/how-to-safely-store-a-password/> to hash and store
>>>> passwords. Part of the demo included live brute-forcing of hashes to
>>>> show how SHA1's speed made it unsuitable for hashing passwords.
>>>>
>>>> I think it would be cool to redo the demo, but utilize the power of a
>>>> cluster managed by Spark to crunch through hashes even faster.
>>>>
>>>> But how would you do that with Spark (if at all)?
>>>>
>>>> I'm guessing you would create an RDD that somehow defines the search
>>>> space you're going to go through, and then partition it to divide the
>>>> work up equally amongst the cluster's cores. Does that sound right?
>>>>
>>>> I wonder if others have already used Spark for computationally-intensive
>>>> workloads like this, as opposed to just data-intensive ones.
>>>>
>>>> Nick
>>>>
>>>> ------------------------------
>>>> View this message in context: Using Spark to crack passwords
>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-to-crack-passwords-tp7437.html>
>>>> Sent from the Apache Spark User List mailing list archive
>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
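As for the original question -- an RDD that defines the search space, partitioned so each core grinds through its own slice -- here is a hedged PySpark sketch of a direct brute force against a single SHA-1 hash. The charset, length cap, target hash, and helper functions are all illustrative assumptions, not anything taken from the thread.

import hashlib
from pyspark import SparkContext

CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"   # placeholder charset
MAX_LEN = 5                                         # placeholder length cap
TARGET = hashlib.sha1(b"zab12").hexdigest()         # stand-in for the hash to crack

def index_to_password(i):
    """Map a candidate index to a password (bijective base-len(CHARSET))."""
    chars = []
    i += 1                      # shift so index 0 is the first 1-char password
    while i > 0:
        i -= 1
        chars.append(CHARSET[i % len(CHARSET)])
        i //= len(CHARSET)
    return "".join(reversed(chars))

def crack_range(bounds):
    """Enumerate one slice of the search space and yield any matches."""
    start, end = bounds
    for i in range(start, end):
        pw = index_to_password(i)
        if hashlib.sha1(pw.encode("utf-8")).hexdigest() == TARGET:
            yield pw

sc = SparkContext(appName="sha1-brute-force")

space = sum(len(CHARSET) ** k for k in range(1, MAX_LEN + 1))   # ~62 million candidates
chunk = 1_000_000
bounds = [(lo, min(lo + chunk, space)) for lo in range(0, space, chunk)]

# The RDD holds only (start, end) index pairs -- the search-space definition --
# so the job ships almost no data and every partition is pure computation.
hits = sc.parallelize(bounds, numSlices=len(bounds)).flatMap(crack_range).collect()
print(hits)   # ['zab12'] when the target lies inside the search space

Because the driver only distributes index ranges, the work divides evenly across the cluster's cores in exactly the way the question suggests.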