+1

________________________________
From: Holden Karau <hol...@pigscanfly.ca>
Sent: Wednesday, July 22, 2020 10:49:49 AM
To: Steve Loughran <ste...@cloudera.com>
Cc: dev <dev@spark.apache.org>
Subject: Re: Exposing Spark parallelized directory listing & non-locality listing in core

Wonderful. To be clear, the patch is more to start the discussion about how we want to do it and less what I think is the right way.

On Wed, Jul 22, 2020 at 10:47 AM Steve Loughran <ste...@cloudera.com> wrote:

> On Wed, 22 Jul 2020 at 00:51, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> Hi Folks,
>>
>> In Spark SQL there is the ability to have Spark do its partition discovery/file listing in parallel on the worker nodes and also avoid locality lookups. I'd like to expose this in core, but given the Hadoop APIs it's a bit more complicated to do right. I
>
> That's ultimately fixable, if we can sort out what's good from the app side and reconcile that with "what is not pathologically bad across both HDFS and object stores".
>
> Bad: globStatus, anything which returns an array rather than a remote iterator, encourages treewalk
>
> Good: deep recursive listings, remote iterator results for: incremental/async fetch of next page of listing; soon: option for iterator, if cast to IOStatisticsSource, to actually serve up stats on IO performance during the listing (e.g. # of list calls, mean time to get a list response back, store throttle events)
>
> Also look at LocatedFileStatus to see how it parallelises its work. It's not perfect because wildcards are supported, which means globStatus gets used.
>
> Happy to talk about this some more, and I'll review the patch.
>
> -steve
>
>> made a quick POC and two potential different paths we could do for implementation and wanted to see if anyone had thoughts - https://github.com/apache/spark/pull/29179.

Cheers,
Holden

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau