Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-14 Thread Reynold Xin
I don't think we should deprecate existing APIs. Spark's own Python API is relatively stable and not difficult to support. It has a pretty large number of users and existing code. It is also pretty easy for data engineers to learn. The pandas API is great for data science, but isn't that great for some

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-14 Thread Dongjoon Hyun
Thank you for the proposal. It looks like a good addition. BTW, what is the future plan for the existing APIs? Are we going to deprecate them eventually in favor of Koalas (because we don't remove the existing APIs in general)? > Fourthly, PySpark is still not Pythonic enough. For example, I hear co

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-14 Thread Hyukjin Kwon
Firstly, my biggest reason is that I would like to promote this more as built-in support, because it is simply important to have it given the impact on the large user group, and the needs are increasing, as the charts indicate. I usually think that features or add-ons stay as third parties when it’s

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-14 Thread Sean Owen
I like Koalas a lot. Playing devil's advocate, why not just let it continue to live as an add-on? Usually the argument is that it'll be maintained better in Spark, but it's already well maintained. Conversely, it adds some overhead to maintaining Spark. On the upside, it makes it a little more discoverable. Are th

Re: minikube and kubernetes cluster versions for integration testing

2021-03-14 Thread Attila Zsolt Piros
Thanks Shane! As I promised: - the PR about documenting the change - my Spark PR with checking Minikube versions and using a simpler way to configure the Kubernetes client for integration testing - the Jira