The default implementation is a recursive treewalk, though HDFS and ADL both push the work out to the remote system for performance.
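For illustration only, here is a minimal sketch of what a client-side treewalk amounts to: list each directory, sum the file lengths, and descend into subdirectories. It assumes the local filesystem and uses plain java.nio rather than the Hadoop FileSystem API, so it is not the Hadoop implementation; the class and method names (TreeWalkSize, totalSize) are made up for this example.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper: not part of Hadoop or Spark.
final class TreeWalkSize {

    /** Sums the sizes of all regular files under root by walking the tree. */
    static long totalSize(Path root) throws IOException {
        final long[] total = {0L};
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                // Only regular files contribute; directories are just descended into.
                if (attrs.isRegularFile()) {
                    total[0] += attrs.size();
                }
                return FileVisitResult.CONTINUE;
            }
        });
        return total[0];
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no path is given.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(totalSize(root) + " bytes under " + root);
    }
}
```

Against a remote store, every directory visited this way would be at least one round trip, which is why a deep or wide tree gets expensive.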
If odd numbers are coming back from getContentSummary() against HDFS, then it's a bug there. If it's the Jenkins test runs against the local FS, though, then the bug is in the client-side treewalk.

Reimplementing the treewalk in Spark works, but it is very inefficient on a deep/wide tree compared to one RPC call to HDFS, which can then lock the directory once & recurse down.

And, if needed, the blobstore clients can do a flat listing, which is much more efficient than the recursion, in time and $. Only ADL does, though. If getContentSummary() does get used on a path where performance matters, the other stores could be uprated fairly easily.

-steve

On 2 Jan 2018, at 09:45, Jacek Laskowski <ja...@japila.pl> wrote:

Hi,

I was wondering what's wrong with FileSystem.getContentSummary in CommandUtils.calculateLocationSize as "expressed" in the comment [1]:

    // This method is mainly based on
    // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
    // in Hive 0.13 (except that we do not use fs.getContentSummary).
    // TODO: Generalize statistics collection.
    // TODO: Why fs.getContentSummary returns wrong size on Jenkins?
    // Can we use fs.getContentSummary in future?
    // Seems fs.getContentSummary returns wrong table size on Jenkins. So we use
    // countFileSize to count the table size.

until I found out that there seems to be no issue whatsoever since DetermineTableStats uses it just fine [2]. Why does CommandUtils.calculateLocationSize *not* use what DetermineTableStats does successfully?

[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala#L66-L73
[2] https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala?utf8=%E2%9C%93#L126

Regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski