Would you plan to keep the existing indexing mechanism then?
https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#use-distributed-or-distributed-sequence-default-index

In my experience, even when trying to use the distributed variant, it always resulted in various window functions being chained, a query plan different from the default one, and slower job execution due to this overhead.
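For reference, a minimal sketch of pinning the default index type and checking its effect on the plan (assuming databricks-koalas 1.x; the option is the one described on the best-practices page above, and the spark_df/"id" names in the commented-out lines are only illustrative):

    import databricks.koalas as ks

    # "sequence" (default), "distributed-sequence", or "distributed"
    ks.set_option("compute.default_index_type", "distributed")

    kdf = ks.range(10)    # a default index is attached using the configured mechanism
    kdf.spark.explain()   # compare the plans produced by the three index types

    # Alternatively, avoid a generated index altogether by promoting an existing
    # column to the index when converting a Spark DataFrame:
    # kdf = spark_df.to_koalas(index_col="id")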
Especially since some people here are thinking about making it the default and replacing the regular API, I would strongly suggest defaulting to an indexing mechanism that does not change the query plan.

Best,
Georg

On Wed, Mar 17, 2021 at 12:13 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> > Just out of curiosity, does Koalas pretty much implement all of the Pandas APIs now? If there are some that are yet to be implemented or others that have differences, are these documented so users won't be caught off-guard?
>
> It's roughly 75% done so far (in Series, DataFrame and Index). Yeah, and it properly throws an exception saying the API is not implemented yet (or is intentionally not implemented, e.g. Series.__iter__, which would otherwise easily let users shoot themselves in the foot with, for example, a for loop ...).
>
> On Wed, Mar 17, 2021 at 2:17 PM Bryan Cutler <cutl...@gmail.com> wrote:
>
>> +1, the proposal sounds good to me. Having a familiar API built in will really help new users who might only have Pandas experience get into using Spark. It sounds like the maintenance costs should be manageable once the hurdle of setting up tests is cleared. Just out of curiosity, does Koalas pretty much implement all of the Pandas APIs now? If there are some that are yet to be implemented or others that have differences, are these documented so users won't be caught off-guard?
>>
>> On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Integrating Koalas with pyspark might help enable a richer integration between the two. Something that would be useful with a tighter integration is support for custom column array types. Currently, Spark takes dataframes, converts them to arrow buffers, then transmits them over the socket to Python. On the other side, pyspark takes the arrow buffer and converts it to a Pandas dataframe. Unfortunately, the default Pandas representation of a list-type column turns what were contiguous value/offset arrays in Arrow into deserialized Python objects for each row. Obviously, this kills performance.
>>>
>>> A PR to extend the pyspark API to elide the pandas conversion (https://github.com/apache/spark/pull/26783) was submitted and rejected, which is unfortunate, but perhaps this proposed integration would provide the hooks, via Pandas' ExtensionArray interface, to allow Spark to performantly interchange jagged/ragged lists to/from python UDFs.
>>>
>>> Cheers
>>> Andrew
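For reference, a minimal sketch of the conversion described above (assuming PySpark 3.0+ with pyarrow installed; the column and function names are only illustrative): an array column arrives in a pandas UDF as one deserialized object per row instead of contiguous Arrow value/offset buffers.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([([1.0, 2.0, 3.0],), ([4.0],)], ["vals"])

    @pandas_udf("double")
    def sum_vals(v: pd.Series) -> pd.Series:
        # Each element of `v` is a separate numpy array (one object per row),
        # i.e. the contiguous Arrow list representation has already been unpacked.
        return v.map(lambda arr: float(sum(arr)))

    df.select(sum_vals("vals")).show()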
>>> On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>> >
>>> > Thank you guys for all your feedback. I will start working on the SPIP with the Koalas team. I would expect the SPIP can be sent late this week or early next week.
>>> >
>>> > I inlined and answered the unanswered questions below:
>>> >
>>> > Is the community developing the pandas API layer for Spark interested in being part of Spark, or do they prefer having their own release cycle?
>>> >
>>> > Yeah, the Koalas team used to have its own release cycle so it could develop and move quickly. It has now become pretty mature, having reached 1.7.0, and the team thinks it is fine to have less frequent releases; they are happy to work together with Spark by contributing to it. The active contributors in the Koalas community will continue to make their contributions in Spark.
>>> >
>>> > How about test code? Does it fit into the PySpark test framework?
>>> >
>>> > Yes, this is one of the places where some effort is needed. Koalas currently uses pytest with various dependency version combinations (e.g., Python version, conda vs pip), whereas PySpark uses plain unittest with fewer dependency version combinations.
>>> >
>>> > For pytest in Koalas <> unittest in PySpark: I am currently thinking we will have to convert the Koalas tests to unittest to match PySpark for now. Migrating PySpark to pytest is also a feasible option, but it would need extra effort to make it work seamlessly with our own PySpark testing framework. The Koalas team (presumably and likely I) will take a look in any event.
>>> >
>>> > For the combinations of dependency versions: due to the lack of resources in GitHub Actions, I currently plan to just add the Koalas tests into the matrix PySpark is already using.
>>> >
>>> > One question I have: what is the initial goal of the proposal? Is it to port all the pandas interfaces that Koalas has already implemented, or just a basic set of them?
>>> >
>>> > The goal of the proposal is to port the whole Koalas project into PySpark. For example,
>>> >
>>> >     import koalas
>>> >
>>> > will be equivalent to
>>> >
>>> >     # Names, etc. might change in the final proposal or during the review
>>> >     from pyspark.sql import pandas
>>> >
>>> > Koalas supports the pandas APIs with a separate layer to cover the differences between DataFrame structures in pandas and PySpark, e.g. other types as column names (labels), an index (something like a row number in DBMSs), and so on. So I think it would make more sense to port the whole layer instead of a subset of the APIs.
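For illustration, a minimal sketch of that extra layer (assuming databricks-koalas 1.x; the ported import path mentioned above is tentative, so the current package name is used here): an index and non-string column labels carried on top of a Spark DataFrame.

    import pandas as pd
    import databricks.koalas as ks

    # Non-string column labels and an explicit index, handled by the Koalas layer
    kdf = ks.from_pandas(pd.DataFrame({0: [1, 2], 1: [3, 4]}, index=["r1", "r2"]))

    print(kdf)                    # printed with its index (row labels), like pandas
    print(kdf.columns)            # integer column labels preserved by the layer
    kdf.to_spark().printSchema()  # the plain Spark DataFrame: no index, string column names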
>>> > On Wed, Mar 17, 2021 at 12:32 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>> >>
>>> >> +1, it's great to have Pandas support in Spark out of the box.
>>> >>
>>> >> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>> >>>
>>> >>> +1; the pandas interfaces are pretty popular, and supporting them in pyspark looks promising, I think. One question I have: what is the initial goal of the proposal? Is it to port all the pandas interfaces that Koalas has already implemented, or just a basic set of them?
>>> >>>
>>> >>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>> >>>>
>>> >>>> +1
>>> >>>>
>>> >>>> Bringing a Pandas API for pyspark to upstream Spark will only bring benefits for everyone (more eyes to use/see/fix/improve the API) as well as better alignment with core Spark improvements; the extra weight looks manageable.
>>> >>>>
>>> >>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>> >>>> >
>>> >>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin <r...@databricks.com> wrote:
>>> >>>> >>
>>> >>>> >> I don't think we should deprecate existing APIs.
>>> >>>> >
>>> >>>> > +1
>>> >>>> >
>>> >>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas API. I could be wrong, but I wager most people who have worked with both Spark and Pandas feel the same way.
>>> >>>> >
>>> >>>> > For the large community of current PySpark users, or users switching to PySpark from another Spark language API, it doesn't make sense to deprecate the current API, even by convention.