+1, the proposal sounds good to me. Having a familiar API built in will
really help new users who might only have pandas experience get into
using Spark. It sounds like maintenance costs should be manageable once
the hurdle of setting up tests is cleared. Just out of curiosity, does
Koalas pretty much…
Hi,
Integrating Koalas with PySpark might enable a richer integration
between the two. One thing that would be useful with a tighter
integration is support for custom column array types. Currently, Spark
takes dataframes, converts them to Arrow buffers, then transmits them
over the socket to Python…
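(For context, a rough sketch of the Arrow path described above, using the
pandas UDF API that already exists in PySpark; the DataFrame and function
names here are just illustrative.)

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10)

    # The "id" column is serialized into Arrow record batches, sent over a
    # socket to the Python worker, and materialized there as a pandas Series;
    # the returned Series travels back to the JVM the same way.
    @pandas_udf("long")
    def plus_one(s: pd.Series) -> pd.Series:
        return s + 1

    df.select(plus_one("id")).show()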
Thank you all for the feedback. I will start working on the SPIP with the
Koalas team; I expect it can be sent out late this week or early next week.
I have inlined and answered the remaining questions below:
Is the community developing the pandas API layer for Spark interested in
being part…
There was a similar question (though about another approach), and I
explained the current status a bit there:
https://lists.apache.org/thread.html/r89a61a10df71ccac132ce5d50b8fe405635753db7fa2aeb79f82fb77%40%3Cuser.spark.apache.org%3E
I guess this would answer your question as well. At least for now,
Spark…
Hi Spark Developers,
Is it possible to reliably determine the current global watermark used by
a streaming query, e.g. via the eventTime "watermark" String in the
StreamingQueryProgress delivered to StreamingQueryListener.onQueryProgress?
https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/streaming/StreamingQueryProgress.
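(A minimal sketch of reading that field from Python, assuming a PySpark
version, 3.4 or later, where StreamingQueryListener is exposed; the same
eventTime map also shows up in StreamingQuery.lastProgress.)

    from pyspark.sql import SparkSession
    from pyspark.sql.streaming import StreamingQueryListener

    class WatermarkListener(StreamingQueryListener):
        def onQueryStarted(self, event):
            pass

        def onQueryProgress(self, event):
            # eventTime is a map of strings; its "watermark" entry holds the
            # current global watermark as an ISO-8601 timestamp, and is only
            # present when the query defines an event-time watermark.
            watermark = event.progress.eventTime.get("watermark")
            print(f"current watermark: {watermark}")

        def onQueryTerminated(self, event):
            pass

    spark = SparkSession.builder.getOrCreate()
    spark.streams.addListener(WatermarkListener())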
+1, it's great to have pandas support in Spark out of the box.
On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro wrote:
> +1; the pandas interfaces are pretty popular and supporting them in
> PySpark looks promising, I think.
> One question I have: what is the initial goal of the proposal?
> Is it to port all the pandas interfaces that Koalas has already
> implemented, or only a basic set of them?
+1; the pandas interfaces are pretty popular and supporting them in PySpark
looks promising, I think.
One question I have: what is the initial goal of the proposal?
Is it to port all the pandas interfaces that Koalas has already
implemented, or only a basic set of them?
On Tue, Mar 16, 2021 at 1:44…
Please follow up on the discussion in the original PR:
https://github.com/apache/spark/pull/26127
Dataset.observe() relies on the query execution listener for batch queries,
which is an "unstable" API; that's why we decided not to add an example for
the batch query. For streaming queries, it relies on the streaming…
I am focusing on batch mode, not streaming mode. I would argue that
Dataset.observe() is equally useful for large batch jobs. If you need
some motivating use cases, please let me know.
Anyhow, the documentation of observe states that it works for both batch
and streaming. And in batch mode…
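(As a side note on the batch case: newer PySpark releases, 3.3 and up, ship
an Observation helper that hides the listener plumbing, so observe() can be
used in batch mode without touching the unstable listener API directly. A
minimal sketch:)

    from pyspark.sql import Observation, SparkSession
    from pyspark.sql.functions import count, lit, max as max_

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100)

    # Attach named metrics to the plan; they are collected as a side effect
    # of the action below, without a second pass over the data.
    obs = Observation("my_metrics")
    observed = df.observe(obs, count(lit(1)).alias("rows"),
                          max_("id").alias("max_id"))
    observed.collect()

    # Blocks until the observed action finishes, then returns the metrics.
    print(obs.get)  # e.g. {'rows': 100, 'max_id': 99}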