whoa...missed the google spellcheckers' warning on: paralleizm ... although that may be the proper lolkidde spelling :-)
On Fri, Aug 14, 2009 at 12:10 PM, bradford cross <bradford.n.cr...@gmail.com> wrote:
> We have just released flightcaster.com which uses statistical inference
> and machine learning to predict flight delays in advance of airlines
> (initial results appear to do so with 85-90% accuracy.)
>
> The webserver and webapp are all rails running on the Heroku platform,
> which also serves our blackberry and iphone apps.
>
> The research and data heavy lifting is all in Clojure.
>
> Distributed data mining is done via a custom layer on top of cascading
> (which is a layer on top of hadoop.) All run on EC2 and S3 using the very
> nice cloudera AMIs and deployment scripts.
>
> In addition to the machine learning, the layer atop cascading performs all
> the complex data filtering and transformation operations, including
> distributed joins from heterogeneous data sources and transformations into a
> time series view that is fed to the machine learning computations that are
> rolled into mappers and reducers. Remember, this is data from airlines and
> the FAA, it is not pretty. Web data is messy but we have lots of good
> frameworks, libs and sanitizers for web data.
>
> We wrapped cascading in a thin layer that we use to wrap clojure functions
> in the cascading function objects and inject those into individual steps in
> the workflows. This gets us very close to normal function composition for
> the client code. Ultimately, we want to be able to do normal function
> composition to compose cascading workflows in the same way as we would
> do vanilla function composition for small test runs on our local
> machines. This is an execution agnostic programming model; client code
> doesn't bear the signs of distributed execution.
>
> As a beneficial side effect, we found that this model forces us to have
> more fine grained abstractions, because each operation must ultimately
> be injectable into a map-reduce phase, otherwise your paralleizm will be
> unnecessarily coarse grained. This steers us clear of monolithic
> uber-expressions.
>
> Another aspect of the design that allows us to do this is that the data
> transformations write out clojure data structure literals, so we are
> entirely insulated from the normal hadoop input/output formats...the
> wrapper layer just uses the normal clojure reader to read in the strings
> from hadoop and apply the vanilla clojure functions to the data
> structures. But we are not limited to only clojure data structure
> literals. We also inject other readers that can read other strings to
> clojure data structures; for example, we use Dan Larkin's wonderful jsonlib
> for the initial reads of the raw json data we store.
>
> All the analytical code is custom, so we don't use many 3rd party libs
> outside of cascading, hadoop, and the invaluable jets3t for working with s3.
> Oh, and of course, since we do so much with temporal analysis, joda-time
> is the only way to work with dates in a sane way on the jvm. :-)
>
> If you travel a lot, check us out: flightcaster.com ... we have iphone and
> blackberry apps. Unfortunately this is domestic US air travel only at the
> moment due to the difficulty of obtaining data for international carriers
> and aviation agencies.
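
For readers who want to see the "execution agnostic" idea from the quoted message in concrete terms, here is a minimal sketch. It is not FlightCaster's code and does not touch the cascading or hadoop APIs. The record shape, the 15 minute threshold, and every function name below are hypothetical. It only illustrates two points from the post: each step is a small plain function over Clojure data, and records travel between steps as printed Clojure literals that a reader turns back into data. It assumes a modern Clojure (1.7 or later) and uses clojure.edn as the reader, which postdates the original message.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical step: read one line of text holding a Clojure data literal.
(defn parse-record [line]
  (edn/read-string line))

;; Hypothetical step: keep flights delayed by more than 15 minutes.
(defn delayed? [flight]
  (> (:delay-min flight) 15))

;; Hypothetical step: emit a [key value] pair, as a mapper would.
(defn carrier-count [flight]
  [(:carrier flight) 1])

;; "Map" side: ordinary composition of the small steps above.
(defn map-phase [lines]
  (->> lines
       (map parse-record)
       (filter delayed?)
       (map carrier-count)))

;; "Reduce" side: sum the values per key.
(defn reduce-phase [pairs]
  (reduce (fn [totals [carrier n]]
            (update totals carrier (fnil + 0) n))
          {}
          pairs))

;; Local test run: plain function composition, no cluster involved.
(def local-run (comp reduce-phase map-phase))

(def sample-lines
  ["{:carrier \"AA\" :delay-min 42}"
   "{:carrier \"UA\" :delay-min 5}"
   "{:carrier \"AA\" :delay-min 20}"])

(local-run sample-lines)
;; => {"AA" 2}

;; The result can be written back out as a literal for the next step.
(pr-str (local-run sample-lines))
;; => "{\"AA\" 2}"
```

In a distributed run, a wrapper layer like the one described would hand map-phase and reduce-phase to the workflow engine instead of composing them directly; the functions themselves would not need to change.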