Hi Micah,

Thank you very much for raising these questions.

We are still analyzing the reasons for Cylon's performance improvement.
We believe the main reason is the Arrow columnar format, which suits
our shuffleByIndex-compute-recreateData approach (closer to a BSP model). We
also get native hardware performance instead of running on a JVM. At
the moment we use MPI for the communication/transport layer (we are
currently implementing a UCX comms layer).
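
To make the shuffleByIndex-compute-recreateData idea concrete, here is a
minimal single-process sketch in plain Python. It is only an illustration
under stated assumptions, not Cylon's actual C++ code: tables are dicts of
column lists standing in for Arrow columns, the list of partitions stands in
for MPI workers, and all function names (partition_indices, take, shuffle,
local_join, distributed_join) are hypothetical. It also assumes the two
tables share only the key column name.

```python
from collections import defaultdict


def partition_indices(keys, num_workers):
    """shuffleByIndex: for each worker, the row indices whose key hashes to it."""
    buckets = [[] for _ in range(num_workers)]
    for row, key in enumerate(keys):
        buckets[hash(key) % num_workers].append(row)
    return buckets


def take(table, rows):
    """recreateData: rebuild a columnar table from the selected row indices."""
    return {name: [col[r] for r in rows] for name, col in table.items()}


def shuffle(table, key, num_workers):
    """Split a table into one partition per (simulated) worker."""
    return [take(table, rows)
            for rows in partition_indices(table[key], num_workers)]


def local_join(left, right, key):
    """compute: inner hash join of two co-partitioned tables on `key`."""
    index = defaultdict(list)
    for r, k in enumerate(right[key]):
        index[k].append(r)
    left_rows, right_rows = [], []
    for l, k in enumerate(left[key]):
        for r in index.get(k, ()):
            left_rows.append(l)
            right_rows.append(r)
    out = take(left, left_rows)
    for name, col in right.items():
        if name != key:  # key column already comes from the left side
            out[name] = [col[r] for r in right_rows]
    return out


def distributed_join(left, right, key, num_workers=4):
    """One superstep: shuffle both sides, join locally, concatenate results."""
    left_parts = shuffle(left, key, num_workers)
    right_parts = shuffle(right, key, num_workers)
    results = [local_join(l, r, key) for l, r in zip(left_parts, right_parts)]
    return {name: [v for part in results for v in part[name]]
            for name in results[0]}
```

Because both sides are partitioned with the same hash, matching keys always
land on the same worker, so each local join needs no further communication.
The row order of the result depends on the partitioning.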

We used the Spark Scala API for the examples, with HDFS as the data
storage for the Spark tests. Did you mean that it would have been better to
keep the data as Parquet for Spark, or to test on Parquet data itself? On a
different note, we are now evaluating Parquet for disk + memory
computations.

To be honest, we are still searching for real-world datasets. Specifically,
we are trying to find a DNN/ML use case that requires relational algebra for
data preprocessing.

On Mon, Jul 27, 2020 at 1:08 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Niranda,
> Interesting results.  Did you do any analysis to understand what was the
> main contributor to the performance differences?  Along these lines, did
> you try joins on any real world datasets?  Are you using Spark SQL for
> comparisons?  Also why not use parquet as a starting point?
>
> Thanks,
> Micah
>
> On Wed, Jul 22, 2020 at 7:45 AM Uwe L. Korn <uw...@xhochy.com> wrote:
>
> > Hello Niranda,
> >
> > cool to see this. Feel free to open a PR to add it to the Powered By list
> > on https://arrow.apache.org/powered_by/
> >
> > Cheers
> > Uwe
> >
> > On Tue, Jul 21, 2020, at 8:03 PM, Niranda Perera wrote:
> > > Hi all,
> > >
> > > We would like to introduce Cylon to the Arrow community. It is an
> > > open-source, lean distributed data processing library using the Arrow
> > > data format underneath. It is developed in C++ with bindings to Java, and
> > > Python. It has an in-memory Table API that integrates with PyArrow Table
> > > API. Cylon enables distributed data operations (ex: join (all variants),
> > > union, intersection, difference, etc). It can be imported as a library to
> > > existing applications or operate as a standalone framework. At the moment
> > > it is using OpenMPI to distribute and communicate. It is released with
> > > Apache License.
> > >
> > > We are developing a distributed data-frame API on top of Cylon table API.
> > > It would be similar to the Dask/ Modin data-frame. Our initial experiments
> > > show promising performance. Cylon language bindings are also very
> > > lightweight. We just had the very first release of Cylon. We would like to
> > > hear from the Arrow community... Any comments, ideas are most appreciated!
> > >
> > > Web visit - https://cylondata.org/
> > > Github - https://github.com/cylondata/cylon
> > > Paper - https://arxiv.org/abs/2007.09589
> > >
> > > Best
> > > --
> > > Niranda Perera
> > > @n1r44 <https://twitter.com/N1R44>
> > > +1 812 558 8884 / +94 71 554 8430
> > > https://www.linkedin.com/in/niranda
> > >
> >
>


-- 
Niranda Perera
@n1r44 <https://twitter.com/N1R44>
+1 812 558 8884 / +94 71 554 8430
https://www.linkedin.com/in/niranda
