Thanks Matei. I will take a look at SchemaRDDs.
On Thu, Sep 4, 2014 at 11:24 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Hi Mohit,
>
> This looks pretty interesting, but just a note on the implementation -- it
> might be worthwhile to try doing this on top of Spark SQL SchemaRDDs. The
> reason is that SchemaRDDs already have an efficient in-memory
> representation (columnar storage) and can be read from a variety of data
> sources (JSON, Hive, and soon things like CSV as well). Using the operators
> in Spark SQL, you can also get really efficient code-generated operations
> on them. I know that stuff like zipping two data frames might become
> harder, but the overall benefit in performance could be substantial.
>
> Matei
>
> On September 4, 2014 at 9:28:12 AM, Mohit Jaggi (mohitja...@gmail.com) wrote:
>
> Folks,
>
> I have been working on a pandas-like dataframe DSL on top of Spark. It is
> written in Scala and can be used from the spark-shell. The APIs have the
> look and feel of pandas, which is a wildly popular piece of software among
> data scientists. The goal is to let people familiar with pandas scale
> their efforts to larger datasets by using Spark, without having to go
> through a steep learning curve for Spark and Scala. It is open sourced
> under the Apache License and can be found here:
> https://github.com/AyasdiOpenSource/df
>
> I welcome your comments, suggestions, and feedback. Any help in developing
> it further is much appreciated. I have the following items on the roadmap
> (and am happy to change this based on your comments):
>
> - Python wrappers, most likely in the same way as MLlib
> - Sliding window aggregations
> - Row indexing
> - Graphing/charting
> - Efficient row-based operations
> - Pretty printing of output in the spark-shell
> - Unit test completeness and automated nightly runs
>
> Mohit
>
> P.S.: Thanks to my awesome employer, Ayasdi <http://www.ayasdi.com>, for
> open sourcing this software.
>
> P.P.S.: I need some design advice on making row operations efficient;
> I'll start a new thread for that.
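
For reference, a minimal sketch of the SchemaRDD approach Matei suggests, as it
looked in the Spark 1.1-era API. This is not the df library's API; the
people.json file and its name/age fields are hypothetical placeholders:

    // Run in the spark-shell, where `sc` is the shell's SparkContext.
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // SchemaRDDs can be loaded directly from data sources such as JSON;
    // jsonFile infers the schema from the input.
    val people = sqlContext.jsonFile("people.json")

    // Registering the SchemaRDD as a table makes it queryable with SQL.
    // Caching the table stores it in the columnar in-memory format
    // Matei mentions.
    people.registerTempTable("people")
    sqlContext.cacheTable("people")

    // Queries run through the Spark SQL optimizer, which can use
    // code-generated operators for expression evaluation.
    val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.collect().foreach(println)

A pandas-style DSL layered on top of this would presumably translate its column
operations into Spark SQL expressions rather than plain RDD transformations, so
it could inherit the columnar storage and optimizer for free.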