Yeah, that's a good point, Georg. I think we will port it as is first and discuss the indexing system further. We should probably either add a non-index mode or switch to a distributed default index type that minimizes the side effects on the query plan. We still have some months left; I will very likely raise another discussion about it in a PR or on the dev mailing list after finishing the initial porting.
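For reference, a minimal sketch of how the default index type can be switched today via the existing Koalas option (assuming the option name carries over unchanged after the port):

    # Sketch only: 'compute.default_index_type' is the current Koalas option
    # name; it may be renamed during the port.
    import databricks.koalas as ks

    # 'distributed' assigns monotonically-increasing (but non-sequential) IDs
    # per partition, so no window functions are chained into the query plan.
    ks.set_option("compute.default_index_type", "distributed")

    kdf = ks.DataFrame({"a": [1, 2, 3]})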
On Wed, Mar 17, 2021 at 8:33 PM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:

> Would you plan to keep the existing indexing mechanism then?
>
> https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#use-distributed-or-distributed-sequence-default-index
>
> For me, even when trying to use the distributed version, it always resulted in various window functions being chained, a query plan different from the default one, and slower execution of the job due to this overhead.
>
> Especially since some people here are thinking about making it the default / replacing the regular API, I would strongly suggest defaulting to an indexing mechanism that does not change the query plan.
>
> Best,
> Georg
>
> On Wed, Mar 17, 2021 at 12:13 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> > Just out of curiosity, does Koalas pretty much implement all of the Pandas APIs now? If there are some that are yet to be implemented or others that have differences, are these documented so users won't be caught off-guard?
>>
>> It's roughly 75% done so far (in Series, DataFrame and Index).
>>
>> Yes, and it properly throws an exception that says an API is not implemented yet (or is intentionally not implemented, e.g., Series.__iter__, which would otherwise easily let users shoot themselves in the foot via, for example, a for loop).
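>>
>> For example, a rough sketch of that guard (exact error wording may differ):
>>
>>     import databricks.koalas as ks
>>
>>     ser = ks.Series([1, 2, 3])
>>
>>     # In pandas this would silently iterate; Koalas raises immediately,
>>     # because iterating would otherwise collect the distributed data row
>>     # by row on the driver.
>>     for v in ser:
>>         print(v)
>>     # PandasNotImplementedError: The method `pd.Series.__iter__()` is not
>>     # implemented. Collect explicitly with to_numpy() or to_pandas() instead.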
>>
>> On Wed, Mar 17, 2021 at 2:17 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>>
>>> +1, the proposal sounds good to me. Having a familiar API built in will really help new users who might only have Pandas experience get into using Spark. It sounds like the maintenance costs should be manageable once the hurdle of setting up tests is cleared. Just out of curiosity, does Koalas pretty much implement all of the Pandas APIs now? If there are some that are yet to be implemented or others that have differences, are these documented so users won't be caught off-guard?
>>>
>>> On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Integrating Koalas with pyspark might help enable a richer integration between the two. Something that would be useful with a tighter integration is support for custom column array types. Currently, Spark takes dataframes, converts them to Arrow buffers, and then transmits them over the socket to Python. On the other side, pyspark takes the Arrow buffer and converts it to a Pandas dataframe. Unfortunately, the default Pandas representation of a list-type column turns what were contiguous value/offset arrays in Arrow into deserialized Python objects for each row. Obviously, this kills performance.
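>>>>
>>>> A minimal standalone sketch of the effect (pyarrow only, assuming it is installed):
>>>>
>>>>     import pyarrow as pa
>>>>
>>>>     # In Arrow, a list column is stored as contiguous values plus offsets.
>>>>     arr = pa.array([[1, 2], [3], [4, 5, 6]])
>>>>
>>>>     # Converting to pandas materializes one Python list object per row.
>>>>     s = arr.to_pandas()
>>>>     print(s.dtype)  # object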
>>>>
>>>> A PR to extend the pyspark API to elide the pandas conversion (https://github.com/apache/spark/pull/26783) was submitted and rejected, which is unfortunate, but perhaps this proposed integration would provide the hooks, via Pandas' ExtensionArray interface, to allow Spark to performantly interchange jagged/ragged lists to/from Python UDFs.
>>>>
>>>> Cheers,
>>>> Andrew
>>>>
>>>> On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>
>>>> > Thank you guys for all your feedback. I will start working on the SPIP with the Koalas team.
>>>> > I would expect the SPIP can be sent late this week or early next week.
>>>> >
>>>> > I inlined and answered the unanswered questions below:
>>>> >
>>>> > Is the community developing the pandas API layer for Spark interested in being part of Spark, or do they prefer having their own release cycle?
>>>> >
>>>> > Yeah, the Koalas team used to have its own release cycle so it could develop and move quickly. Now that it has become pretty mature, reaching 1.7.0, the team thinks it's fine to have less frequent releases, and they are happy to work together with Spark by contributing to it. The active contributors in the Koalas community will continue to make their contributions in Spark.
>>>> >
>>>> > How about test code? Does it fit into the PySpark test framework?
>>>> >
>>>> > Yes, this will be one of the places that needs some effort. Koalas currently uses pytest with various dependency version combinations (e.g., Python version, conda vs pip), whereas PySpark uses plain unittest with fewer dependency version combinations.
>>>> >
>>>> > For pytest in Koalas <> unittest in PySpark: I am currently thinking we will have to convert the Koalas tests to use unittest to match PySpark for now. Migrating PySpark to pytest is a feasible option too, but it would need extra effort to make it work seamlessly with our own PySpark testing framework. The Koalas team (presumably, and most likely I) will take a look in any event.
>>>> >
>>>> > For the combinations of dependency versions: due to the lack of resources in GitHub Actions, I currently plan to just add the Koalas tests into the matrix PySpark is currently using.
>>>> >
>>>> > one question I have; what's the initial goal of the proposal? Is it to port all the pandas interfaces that Koalas has already implemented? Or the basic set of them?
>>>> >
>>>> > The goal of the proposal is to port all of the Koalas project into PySpark. For example,
>>>> >
>>>> > import koalas
>>>> >
>>>> > will be equivalent to
>>>> >
>>>> > # Names, etc. might change in the final proposal or during the review
>>>> > from pyspark.sql import pandas
>>>> >
>>>> > Koalas supports the pandas APIs with a separate layer that covers some differences between the DataFrame structures in pandas and PySpark, e.g., other types as column names (labels), an index (something like a row number in DBMSs), and so on. So I think it would make more sense to port the whole layer instead of a subset of the APIs.
>>>> >
>>>> > On Wed, Mar 17, 2021 at 12:32 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>> >>
>>>> >> +1, it's great to have Pandas support in Spark out of the box.
>>>> >>
>>>> >> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>> >>>
>>>> >>> +1; the pandas interfaces are pretty popular, and supporting them in pyspark looks promising, I think.
>>>> >>> One question I have: what's the initial goal of the proposal? Is it to port all the pandas interfaces that Koalas has already implemented? Or the basic set of them?
>>>> >>>
>>>> >>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>> >>>>
>>>> >>>> +1
>>>> >>>>
>>>> >>>> Bringing a Pandas API for pyspark to upstream Spark will only bring benefits for everyone (more eyes to use/see/fix/improve the API) as well as better alignment with core Spark improvements; the extra weight looks manageable.
>>>> >>>>
>>>> >>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>> >>>> >
>>>> >>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <r...@databricks.com> wrote:
>>>> >>>> >>
>>>> >>>> >> I don't think we should deprecate existing APIs.
>>>> >>>> >
>>>> >>>> > +1
>>>> >>>> >
>>>> >>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could be wrong, but I wager most people who have worked with both Spark and Pandas feel the same way.
>>>> >>>> >
>>>> >>>> > For the large community of current PySpark users, or users switching to PySpark from another Spark language API, it doesn't make sense to deprecate the current API, even by convention.
>>>> >>>
>>>> >>> --
>>>> >>> ---
>>>> >>> Takeshi Yamamuro