Thanks, Matei. I will take a look at SchemaRDDs.

On Thu, Sep 4, 2014 at 11:24 AM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> Hi Mohit,
>
> This looks pretty interesting, but just a note on the implementation -- it
> might be worthwhile to try building this on top of Spark SQL SchemaRDDs. The
> reason is that SchemaRDDs already have an efficient in-memory
> representation (columnar storage) and can be read from a variety of data
> sources (JSON, Hive, and soon things like CSV as well). Using the operators
> in Spark SQL, you can also get really efficient code-generated operations on
> them. I know that things like zipping two data frames might become harder,
> but the overall performance benefit could be substantial.
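>
> For example, something along these lines (a rough sketch against the Spark
> 1.1 SQLContext API; the file name and columns are made up):
>
>   import org.apache.spark.sql.SQLContext
>
>   val sqlContext = new SQLContext(sc)  // sc is the SparkContext in spark-shell
>   val people = sqlContext.jsonFile("people.json")  // SchemaRDD, schema inferred
>   people.registerTempTable("people")
>   sqlContext.cacheTable("people")      // caches in the in-memory columnar format
>   val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
>   adults.collect().foreach(println)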
>
> Matei
>
> On September 4, 2014 at 9:28:12 AM, Mohit Jaggi (mohitja...@gmail.com)
> wrote:
>
> Folks,
> I have been working on a pandas-like dataframe DSL on top of Spark. It is
> written in Scala and can be used from the spark-shell. The APIs have the
> look and feel of pandas, a wildly popular tool among data scientists. The
> goal is to let people familiar with pandas scale their work to larger
> datasets using Spark, without having to climb a steep Spark and Scala
> learning curve.
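>
> For context, here is the kind of thing we want to make one line long. In
> raw Spark, a pandas-style groupby/mean looks roughly like this (a sketch
> with made-up data; the pandas equivalent would be something like
> df.groupby("region")["amount"].mean()):
>
>   import org.apache.spark.SparkContext._  // pair-RDD implicits
>
>   val rows = sc.parallelize(Seq(("east", 10.0), ("west", 20.0), ("east", 30.0)))
>   val means = rows
>     .mapValues(v => (v, 1L))                                       // (sum, count)
>     .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
>     .mapValues { case (sum, n) => sum / n }
>   means.collect()  // Array((east,20.0), (west,20.0))
>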
> It is open source under the Apache License and can be found here:
> https://github.com/AyasdiOpenSource/df
>
> I welcome your comments, suggestions, and feedback. Any help in developing
> it further is much appreciated. I have the following items on the roadmap
> (and am happy to change them based on your comments):
> - Python wrappers, most likely done the same way as in MLlib
> - Sliding window aggregations (see the sketch after this list)
> - Row indexing
> - Graphing/charting
> - Efficient row-based operations
> - Pretty printing of output in the spark-shell
> - Unit test completeness and automated nightly runs
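>
> For the sliding window item, one possible starting point (just a sketch;
> this uses the developer-level sliding() from MLlib's RDDFunctions, which
> may or may not be the right fit here):
>
>   import org.apache.spark.mllib.rdd.RDDFunctions._
>
>   val values = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))
>   // 3-element windows: [1,2,3], [2,3,4], [3,4,5]
>   val movingAvg = values.sliding(3).map(w => w.sum / w.length)
>   movingAvg.collect()  // Array(2.0, 3.0, 4.0)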
>
> Mohit.
>
> P.S.: Thanks to my awesome employer, Ayasdi <http://www.ayasdi.com>, for
> open-sourcing this software.
>
> P.P.S.: I need some design advice on making row operations efficient; I'll
> start a separate thread for that.
>
>
