Wonderful. To be clear, the patch is meant more to start the discussion about
how we want to do it than to show what I think is the right way.

On Wed, Jul 22, 2020 at 10:47 AM Steve Loughran <ste...@cloudera.com> wrote:

>
>
> On Wed, 22 Jul 2020 at 00:51, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> Hi Folks,
>>
>> In Spark SQL there is the ability to have Spark do its partition
>> discovery/file listing in parallel on the worker nodes and also avoid
>> locality lookups. I'd like to expose this in core, but given the Hadoop
>> APIs it's a bit more complicated to do right. I
>>
>
> That's ultimately fixable, if we can sort out what's good from the app
> side and reconcile that with "what is not pathologically bad across both
> HDFS and object stores".
>
> Bad: globStatus, anything which returns an array rather than a remote
> iterator, anything which encourages a treewalk.
> Good: deep recursive listings, and remote iterator results, which allow
> incremental/async fetch of the next page of a listing. Soon: an option for
> the iterator, if cast to IOStatisticsSource, to actually serve up stats on
> IO performance during the listing (e.g. number of list calls, mean time to
> get a list response back, store throttle events).
>
> Also look at LocatedFileStatus to see how it parallelises its work. It's
> not perfect because wildcards are supported, which means globStatus gets
> used.
>
> Happy to talk about this some more, and I'll review the patch.
>
> -steve
>
>
>> made a quick POC and two potential different paths we could do for
>> implementation and wanted to see if anyone had thoughts -
>> https://github.com/apache/spark/pull/29179.
>>
>> Cheers,
>>
>> Holden
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
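
As a rough illustration of the iterator-versus-array point in Steve's reply
above, here is a minimal sketch of why a lazily paged listing can be cheaper
than fetching everything up front. Everything in it is hypothetical and
in-memory: FakeStore, PagedIterator and findFirst are made-up names, not
Hadoop or Spark APIs. PagedIterator only mirrors the general shape of Hadoop's
RemoteIterator (hasNext/next that may throw IOException), and the listCalls
counter stands in for the kind of per-listing statistic mentioned above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

final class PagedListingSketch {

    // Hypothetical stand-in for the shape of a remote iterator:
    // both calls may do remote I/O, so both may throw IOException.
    interface PagedIterator<T> {
        boolean hasNext() throws IOException;
        T next() throws IOException;
    }

    // Hypothetical in-memory "store" that serves a flat listing one page
    // at a time and counts how many list calls were issued.
    static final class FakeStore {
        private final List<String> paths;
        private final int pageSize;
        int listCalls = 0;

        FakeStore(List<String> paths, int pageSize) {
            this.paths = paths;
            this.pageSize = pageSize;
        }

        // One simulated remote call: up to pageSize entries from offset.
        private List<String> listPage(int offset) {
            listCalls++;
            int end = Math.min(offset + pageSize, paths.size());
            return new ArrayList<>(paths.subList(offset, end));
        }

        // Lazy listing: a page is fetched only when the caller has
        // consumed every entry of the previous page.
        PagedIterator<String> listFiles() {
            return new PagedIterator<String>() {
                private List<String> page = new ArrayList<>();
                private int posInPage = 0;
                private int nextOffset = 0;

                @Override
                public boolean hasNext() {
                    if (posInPage < page.size()) return true;
                    if (nextOffset >= paths.size()) return false;
                    page = listPage(nextOffset);
                    nextOffset += page.size();
                    posInPage = 0;
                    return !page.isEmpty();
                }

                @Override
                public String next() {
                    if (!hasNext()) throw new NoSuchElementException();
                    return page.get(posInPage++);
                }
            };
        }
    }

    // Returns the first path ending in suffix, or null if there is none.
    // Stops iterating (and so stops issuing list calls) at the first match.
    static String findFirst(FakeStore store, String suffix) {
        try {
            PagedIterator<String> it = store.listFiles();
            while (it.hasNext()) {
                String p = it.next();
                if (p.endsWith(suffix)) return p;
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 10; i++) paths.add("/data/part-" + i + ".parquet");
        FakeStore store = new FakeStore(paths, 3);
        String hit = findFirst(store, "part-1.parquet");
        System.out.println(hit + " found after " + store.listCalls + " list call(s)");
    }
}
```

With ten entries and a page size of three, a caller that stops at the second
entry pays for one list call, while an array-returning API would have to
issue all four before returning anything.
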

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
