Wonderful. To be clear, the patch is meant more to start the discussion about how we want to do it than to present what I think is the right way.

On Wed, Jul 22, 2020 at 10:47 AM Steve Loughran <ste...@cloudera.com> wrote:

> On Wed, 22 Jul 2020 at 00:51, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> Hi Folks,
>>
>> In Spark SQL there is the ability to have Spark do its partition
>> discovery/file listing in parallel on the worker nodes and also avoid
>> locality lookups. I'd like to expose this in core, but given the Hadoop
>> APIs it's a bit more complicated to do right. I
>
> That's ultimately fixable, if we can sort out what's good from the app
> side and reconcile that with "what is not pathologically bad across both
> HDFS and object stores".
>
> Bad: globStatus, anything which returns an array rather than a remote
> iterator, encourages treewalk
> Good: deep recursive listings, remote iterator results for:
> incremental/async fetch of next page of listing, soon: option for iterator,
> if cast to IOStatisticsSource, actually serve up stats on IO performance
> during the listing (e.g. # of list calls, mean time to get a list
> response back, store throttle events)
>
> Also look at LocatedFileStatus to see how it parallelises its work. It's
> not perfect because wildcards are supported, which means globStatus gets
> used.
>
> Happy to talk about this some more, and I'll review the patch.
>
> -steve
>
>> made a quick POC and two potential different paths we could do for
>> implementation and wanted to see if anyone had thoughts -
>> https://github.com/apache/spark/pull/29179.
>>
>> Cheers,
>>
>> Holden
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau