I think having pandas support inside of Spark makes sense. One of my
questions is: who are the major contributors to this effort? Is the
community developing the pandas API layer for Spark interested in being
part of Spark, or do they prefer having their own release cycle?

On Sat, Mar 13, 2021 at 5:57 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Hi all,
>
> I would like to start the discussion on supporting pandas API layer on
> Spark.
>
> If we have a general consensus on having it in PySpark, I will initiate
> and drive an SPIP with a detailed explanation of the implementation’s
> overview and structure.
>
> I would appreciate it if I could know whether you support this or not
> before starting the SPIP.
>
> What do you want to propose?
>
> I have been working on the Koalas <https://github.com/databricks/koalas>
> project, which is essentially pandas API support on Spark, and I would
> like to propose embracing Koalas in PySpark.
>
> More specifically, I am thinking about adding a separate package to
> PySpark for pandas APIs on PySpark; therefore, it wouldn’t break anything
> in the existing code. The overview would look as below:
>
> pyspark_dataframe.[... PySpark APIs ...]
> pandas_dataframe.[... pandas APIs (local) ...]
>
> # The package names will change in the final proposal and during review.
> koalas_dataframe = koalas.from_pandas(pandas_dataframe)
> koalas_dataframe = koalas.from_spark(pyspark_dataframe)
> koalas_dataframe.[... pandas APIs on Spark ...]
>
> pyspark_dataframe = koalas_dataframe.to_spark()
> pandas_dataframe = koalas_dataframe.to_pandas()
>
> Koalas provides a pandas API layer on PySpark. It supports almost the same
> API usages, so users can leverage their existing Spark clusters to scale
> their pandas workloads. It works interchangeably with PySpark by exposing
> both the pandas and PySpark APIs to users.
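>
> For instance, a minimal sketch with the standalone Koalas package as it
> exists today (the databricks.koalas import below is the current package
> name, which would change inside PySpark; the data and column names are
> made up for illustration):
>
> import pandas as pd
> import databricks.koalas as ks
>
> pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
> kdf = ks.from_pandas(pdf)           # distribute the pandas data on Spark
> kdf["ratio"] = kdf["a"] / kdf["b"]  # pandas-style API, executed by Spark
> sdf = kdf.to_spark()                # drop down to a plain PySpark DataFrame
> kdf2 = sdf.to_koalas()              # and convert back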
>
> The project has grown separately for more than two years and has been
> going successfully. As of version 1.7.0, Koalas has greatly improved in
> maturity and stability. Its usability has been proven by numerous users’
> adoption and by reaching more than 75% API coverage of pandas’ Index,
> Series and DataFrame.
>
> I strongly believe this is the direction we should take for Apache Spark,
> and it is a win-win strategy for the growth of both Apache Spark and
> pandas. Please see the reasons below.
>
> Why do we need it?
>
>    - Python has grown dramatically in the last few years and has become
>      one of the most popular languages; see also the Stack Overflow trend
>      <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>      for the Python, Java, R and Scala languages.
>
>    - pandas has become almost the standard library of data science. Please
>      also see the Stack Overflow trend
>      <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>      for pandas, Apache Spark and PySpark.
>
>    - PySpark is not Pythonic enough; at least I myself hear a lot of
>      complaints. That initiated Project Zen
>      <https://issues.apache.org/jira/browse/SPARK-32082>, and we have
>      greatly improved PySpark usability and made it more Pythonic.
>
> Nevertheless, data scientists tend to prefer the pandas APIs according to
> these trends, but APIs are hard to change in PySpark: we would have to
> redesign and improve them all from scratch, which is very difficult.
>
> One straightforward and fast approach is to follow a proven, successful
> case, and pandas is that case; however, pandas does not support
> distributed execution. Once PySpark supports pandas-like APIs, it becomes
> a good option for pandas users to scale their workloads easily, as
> sketched below. I do believe this is a win-win strategy for the growth of
> both pandas and PySpark.
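>
> To give a concrete feel for this, for many simple workloads the change
> could be as small as swapping an import, since Koalas mirrors the pandas
> API (an illustrative sketch; the file and column names are made up, and
> only pandas APIs within the ~75% coverage mentioned above would work):
>
> # import pandas as pd
> import databricks.koalas as pd  # the code below is unchanged, now on Spark
>
> df = pd.read_csv("my_large_dataset.csv")
> df.groupby("key").mean()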
>
> In fact, there are already similar attempts such as Dask <https://dask.org/>
> and Modin <https://modin.readthedocs.io/en/latest/> (other than Koalas
> <https://github.com/databricks/koalas>). They are all growing fast and
> successfully, and I find that people compare them to PySpark from time to
> time; for example, see Beyond Pandas: Spark, Dask, Vaex and other big
> data technologies battling head to head
> <https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13>.
>
>    - There are many important features missing that are very common in
>      data science. One of the most important is plotting and drawing
>      charts: almost every data scientist plots charts to understand their
>      data quickly and visually in their daily work, but this is missing in
>      PySpark. Please see one example in pandas below.
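>
> For instance, a quick chart in pandas (an illustrative sketch; pandas
> plotting delegates to matplotlib, and the data here is made up):
>
> import pandas as pd
>
> df = pd.DataFrame({"x": range(10), "y": [v ** 2 for v in range(10)]})
> df.plot.line(x="x", y="y")  # one line to visually inspect the data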
>
> I do recommend taking a quick look at the blog posts and talks made about
> pandas on Spark:
> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
> They explain why we need this far better.
>
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
