Re: Shutdown cleanup of disk-based resources that Spark creates

2021-04-06 Thread Steve Loughran
On Thu, 11 Mar 2021 at 19:58, Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> I agree with you that we should extend the documentation around this.
> Moreover, I support having specific unit tests for this.
>
> > There is clearly some demand for Spark to automatically clean up
> checkpoints on shutdown
>
> What about what I suggested on the PR? To clean up the checkpoint
> directory at shutdown, one can register the directory to be deleted at
> exit:
>
>  // mark the path for deletion when its FileSystem is closed (JVM shutdown)
>  FileSystem fs = FileSystem.get(conf);
>  fs.deleteOnExit(checkpointPath);
>
I wouldn't recommend that. It's really for testing, and it should probably
get tagged as deprecated. Better for your own cleanup code to have some
atomic bool which makes the decision. A few problems with deleteOnExit:


   1. It does the delete sequentially: the more paths, the longer it takes.
   2. It doesn't notice/skip if a file has changed since it was added.
   3. It doesn't distinguish between files and dirs. So if you have a file
   /temp/1 and then replace it with a dir /temp/1, the entire tree gets
   deleted on shutdown. Is that what you wanted?

I've played with some optimisation of the s3a case
(https://github.com/apache/hadoop/pull/1924), but really it should do some
of the following:

- store the checksum/timestamp/size on submit, plus the dir/file status
- only delete on a match
- do this in a thread pool, though you can't always create them on
  shutdown, can you?

But of course, do that and something, somewhere, will break.

Safer to roll your own.
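
To make "roll your own" concrete, here is a minimal sketch, assuming the
application (not Spark) owns the checkpoint directories. The
CheckpointCleaner class and its method names are purely illustrative and
not part of Spark or Hadoop; the point is the atomic bool gating the
delete and the delete-only-on-a-timestamp-match check from the list above.

  import java.io.IOException;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.atomic.AtomicBoolean;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Hypothetical application-side cleaner; nothing with this name exists
  // in Spark or Hadoop.
  public class CheckpointCleaner {
    // the atomic bool that makes the delete/keep decision
    private final AtomicBoolean shouldClean = new AtomicBoolean(false);
    // modification time of each path as it looked when registered
    private final Map<Path, Long> registered = new ConcurrentHashMap<>();
    private final Configuration conf;

    public CheckpointCleaner(Configuration conf) {
      this.conf = conf;
      Runtime.getRuntime().addShutdownHook(new Thread(this::cleanup));
    }

    // remember the path and what it looked like at registration time
    public void register(Path path) throws IOException {
      FileStatus status = path.getFileSystem(conf).getFileStatus(path);
      registered.put(path, status.getModificationTime());
    }

    // call once the checkpoints are known to be disposable
    public void markForCleanup() {
      shouldClean.set(true);
    }

    private void cleanup() {
      if (!shouldClean.get()) {
        return;
      }
      registered.forEach((path, mtime) -> {
        try {
          FileSystem fs = path.getFileSystem(conf);
          // only delete if the path still looks like what was registered
          if (fs.getFileStatus(path).getModificationTime() == mtime) {
            fs.delete(path, true);
          }
        } catch (IOException e) {
          // best effort on shutdown: ignore and carry on
        }
      });
    }
  }

The deletes here are still sequential; a thread pool could parallelise
them, with the caveat above that creating threads inside a shutdown hook
isn't always reliable.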


Re: Support User Defined Types in pandas_udf for Spark's own Python API

2021-04-06 Thread Hyukjin Kwon
Yeah, we should still improve PySpark APIs together. I am currently tied
up with some work and with porting Koalas at the moment, so I haven't had
a chance to take a very close look (but I do drop some comments and skim).

On Tue, 6 Apr 2021 at 17:31, Darcy Shen wrote:

> was: [DISCUSS] Support pandas API layer on PySpark
>
>
> I'm working on [SPARK-34600] Support user defined types in Pandas UDF
> (ASF JIRA).
>
> I'm wondering if we are still working on improving Spark's own Python API.
>
> SPARK-34600 is a relatively big feature for PySpark. I split it into
> several small tickets and submitted the first small PR:
>
> [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support
> Enabled by sadhen · Pull Request #32026 · apache/spark (github.com)
> 
>
> I'm afraid that the Spark community is busy working on the pandas API
> layer on PySpark and that improvements to Spark's own Python API will be
> postponed again and again.
>
> As Dongjoon Hyun said:
> > BTW, what is the future plan for the existing APIs?
>
> If we are keeping these existing APIs, will we add new features for
> Spark's own Python API?
>
> Or will we fix bugs for Spark's own Python API?
>
> Specifically, will we add support for User Defined Types in pandas_udf for
> Spark's own Python API?
>
>
> On Mon, 2021-03-15 14:12:28, Reynold Xin wrote:
>
> I don't think we should deprecate existing APIs.
>
> Spark's own Python API is relatively stable and not difficult to support.
> It has a pretty large number of users and a lot of existing code. It's
> also pretty easy for data engineers to learn.
>
> The pandas API is great for data science, but isn't that great for some
> other tasks. It's super wide. Great for data scientists who have learned
> it, or great for copy-paste from Stack Overflow.
>
> On Sun, Mar 14, 2021 at 11:08 PM, Dongjoon Hyun wrote:
>
> Thank you for the proposal. It looks like a good addition.
> BTW, what is the future plan for the existing APIs?
> Are we going to deprecate it eventually in favor of Koalas (because we
> don't remove the existing APIs in general)?
>
> > Fourthly, PySpark is still not Pythonic enough. For example, I hear
> complaints such as "why does
> > PySpark follow pascalCase?" or "PySpark APIs are difficult to learn",
> and APIs are very difficult to change
> > in Spark (as I emphasized above).
>
>
> On Sun, Mar 14, 2021 at 4:03 AM Hyukjin Kwon  wrote:
>
> Firstly, my biggest reason is that I would like to promote this more as
> built-in support, because it is simply important to have it given the
> impact on a large user group, and the needs are increasing as the charts
> indicate. I usually think that features or add-ons stay as third parties
> when they are for a smaller set of users, address a corner case of needs,
> etc. I think this is similar to the datasources we have added: Spark
> ported CSV and Avro because more and more people used them, and it became
> important to have them as built-in support.
>
> Secondly, Koalas needs more help from Spark, PySpark, Python and pandas
> experts in the bigger community. The Koalas team isn't expert in all
> these areas, and there are many missing corner cases to fix; some require
> deep expertise in specific areas.
>
> One example is type hints. Koalas uses type hints for schema inference.
> Because Python's type hinting doesn't support this directly, Koalas added
> its own (hacky) way.
> Fortunately, the approach Koalas implemented has now been partially
> proposed into Python officially (PEP 646).
> But Koalas could have done better by interacting with the Python
> community more and actively joining in the design discussions together,
> to lead to the best output that benefits both projects and more.
>
> Thirdly, I would like to contribute to the growth of PySpark. The growth
> of Koalas is very fast given the internal and external stats: the number
> of users has roughly doubled almost every 4 ~ 6 months. I think Koalas
> will provide good momentum to keep Spark growing.
>
> Fourthly, PySpark is still not Pythonic enough. For example, I hear
> complaints such as "why does
> PySpark follow pascalCase?" or "PySpark APIs are difficult to learn", and
> APIs are very difficult to change
> in Spark (as I emphasized above). This set of Koalas APIs will be able to
> address these concerns
> in PySpark.
>
> Lastly, I really think PySpark needs its native plotting features. As I
> emphasized before with
> elaboration, I do think this is an important feature missing in PySpark
> that users need.
> I do think Koalas completes what PySpark is currently missing.
>
>
>
> On Sun, 14 Mar 2021 at 19:12, Sean Owen wrote:
>
> I like Koalas a lot. Playing devil's advocate, why not just let it
> continue to live as an add-on? Usually the argument is it'll be
> maintained better in Spark but it's wel

Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-06 Thread Hyukjin Kwon
Hi all,

I am an Apache Spark PMC member, and would like to know the future plan
for GitHub Actions in the ASF.
Please also see the INFRA ticket I filed:
https://issues.apache.org/jira/browse/INFRA-21646.

I am aware that GitHub Actions resources are limited and shared across
all projects in the ASF, and that many projects suffer from it. This issue
significantly slows down the development cycle of those projects, at least
Apache Spark.

How do we plan to increase the resources in GitHub Actions, and what are
the blockers? I would appreciate any input and thoughts on this.

Thank you so much.

CC'ing Spark @dev  for more visibility. Please take
it out if considered inappropriate.