Re: Exposing Spark parallelized directory listing & non-locality listing in core

Felix Cheung Wed, 22 Jul 2020 20:44:13 -0700

+1

________________________________
From: Holden Karau <[email protected]>
Sent: Wednesday, July 22, 2020 10:49:49 AM
To: Steve Loughran <[email protected]>
Cc: dev <[email protected]>
Subject: Re: Exposing Spark parallelized directory listing & non-locality 
listing in core


Wonderful. To be clear, the patch is meant more to start the discussion about 
how we want to do it than to show what I think is the right way.

On Wed, Jul 22, 2020 at 10:47 AM Steve Loughran <[email protected]> wrote:


On Wed, 22 Jul 2020 at 00:51, Holden Karau <[email protected]> wrote:
Hi Folks,

In Spark SQL there is the ability to have Spark do its partition 
discovery/file listing in parallel on the worker nodes and also avoid locality 
lookups. I'd like to expose this in core, but given the Hadoop APIs it's a bit 
more complicated to do right. I

That's ultimately fixable, if we can sort out what's good from the app side and 
reconcile that with "what is not pathologically bad across both HDFS and object 
stores".

Bad: globStatus; anything which returns an array rather than a remote iterator; 
anything which encourages a treewalk.
Good: deep recursive listings; remote iterator results, which allow 
incremental/async fetch of the next page of the listing. Soon: an option for 
the iterator, if cast to IOStatisticsSource, to actually serve up stats on IO 
performance during the listing (e.g. number of list calls, mean time to get a 
list response back, store throttle events).
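
To illustrate the array-versus-iterator point, here is a minimal, 
self-contained sketch. It does not use the Hadoop API: PagedIterator and 
FakeStore are hypothetical names invented for this example, and the "store" is 
just an in-memory list served in pages. PagedIterator has roughly the shape of 
Hadoop's RemoteIterator, except that the real hasNext()/next() declare 
IOException. The sketch only shows why an iterator lets the caller start work 
after the first page, while an array-style call pays for every page up front.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in shaped like Hadoop's RemoteIterator
// (the real interface's methods also declare IOException).
interface PagedIterator<T> {
    boolean hasNext();
    T next();
}

// Hypothetical in-memory "store" that serves a flat listing one page at a time.
final class FakeStore {
    private final List<String> paths;
    private final int pageSize;
    int listCalls = 0; // number of page requests issued so far

    FakeStore(List<String> paths, int pageSize) {
        this.paths = paths;
        this.pageSize = pageSize;
    }

    // Returns one page of results starting at offset (empty when past the end).
    List<String> listPage(int offset) {
        listCalls++;
        int end = Math.min(offset + pageSize, paths.size());
        if (offset >= end) {
            return new ArrayList<>();
        }
        return new ArrayList<>(paths.subList(offset, end));
    }

    // Array-style listing: all pages are fetched before the caller sees anything.
    List<String> listAll() {
        List<String> all = new ArrayList<>();
        int offset = 0;
        while (true) {
            List<String> page = listPage(offset);
            all.addAll(page);
            offset += page.size();
            if (page.size() < pageSize) {
                return all; // a short page means the listing is complete
            }
        }
    }

    // Iterator-style listing: a page is fetched only when the caller runs out.
    PagedIterator<String> listIncremental() {
        return new PagedIterator<String>() {
            private List<String> page = new ArrayList<>();
            private int posInPage = 0;
            private int offset = 0;
            private boolean exhausted = false;

            public boolean hasNext() {
                if (posInPage < page.size()) {
                    return true;
                }
                if (exhausted) {
                    return false;
                }
                page = listPage(offset);
                offset += page.size();
                posInPage = 0;
                if (page.size() < pageSize) {
                    exhausted = true; // short page: nothing more to fetch
                }
                return !page.isEmpty();
            }

            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return page.get(posInPage++);
            }
        };
    }
}
```

With five paths and a page size of two, listAll() issues three page requests 
before returning, whereas a caller that only consumes the first two entries 
from listIncremental() issues one.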

Also look at LocatedFileStatus to see how it parallelises its work. It's not 
perfect, because wildcards are supported, which means globStatus gets used.
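
As a generic illustration of parallelising a listing (not how Hadoop or Spark 
actually implement it), the sketch below submits one listing task per 
directory to a thread pool and merges the results in input order. 
ParallelListingSketch, listInParallel and the in-memory "tree" map are all 
hypothetical stand-ins made up for this example; a real version would call the 
filesystem inside each task.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelListingSketch {
    // Lists each directory in dirs on a thread pool and returns full file
    // paths, grouped in the same order as dirs.
    // tree is a hypothetical in-memory filesystem: directory -> file names.
    static List<String> listInParallel(Map<String, List<String>> tree,
                                       List<String> dirs,
                                       int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one listing task per directory.
            List<Future<List<String>>> pending = new ArrayList<>();
            for (String dir : dirs) {
                pending.add(pool.submit(() -> {
                    List<String> out = new ArrayList<>();
                    List<String> names =
                        tree.getOrDefault(dir, Collections.<String>emptyList());
                    for (String name : names) {
                        out.add(dir + "/" + name);
                    }
                    return out;
                }));
            }
            // Collect results in submission order so output is deterministic.
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : pending) {
                try {
                    all.addAll(f.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that the task list here is an explicit set of directories. Expanding a 
wildcard into that set is the step that would pull globStatus back in.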

Happy to talk about this some more, and I'll review the patch.

-steve

made a quick POC with two potential implementation paths and wanted to see if 
anyone had thoughts: https://github.com/apache/spark/pull/29179.

Cheers,

Holden

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

