FileSystem.getContentSummary for total size stats in DetermineTableStats VS CommandUtils?
Hi,

I was wondering what's wrong with FileSystem.getContentSummary in CommandUtils.calculateLocationSize, as described in the comments there [1]:

    // This method is mainly based on
    // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
    // in Hive 0.13 (except that we do not use fs.getContentSummary).
    // TODO: Generalize statistics collection.
    // TODO: Why fs.getContentSummary returns wrong size on Jenkins?
    // Can we use fs.getContentSummary in future?
    // Seems fs.getContentSummary returns wrong table size on Jenkins. So we use
    // countFileSize to count the table size.

Then I found that there seems to be no issue at all, since DetermineTableStats uses fs.getContentSummary just fine [2].

Why does CommandUtils.calculateLocationSize *not* use what DetermineTableStats uses successfully?

[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala#L66-L73
[2] https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala?utf8=%E2%9C%93#L126

Regards,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
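For readers comparing the two approaches mentioned in the message above, here is a minimal sketch that computes a location's size both ways with the Hadoop FileSystem API. It is not the code from CommandUtils or DetermineTableStats. The object name LocationSizeSketch, both method names, and the skipPrefix parameter (with its ".hive-staging" default) are assumptions made for illustration only.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only: names below are hypothetical, not Spark's.
object LocationSizeSketch {

  // Approach 1: ask the file system for the aggregate length of everything under `path`.
  def sizeByContentSummary(fs: FileSystem, path: Path): Long =
    fs.getContentSummary(path).getLength

  // Approach 2: walk the tree and add up file lengths by hand.
  // `skipPrefix` is an assumed stand-in for a staging-directory filter.
  def sizeByListing(fs: FileSystem, path: Path, skipPrefix: String = ".hive-staging"): Long = {
    val status = fs.getFileStatus(path)
    if (!status.isDirectory) {
      status.getLen
    } else {
      fs.listStatus(path)
        .filterNot(_.getPath.getName.startsWith(skipPrefix))
        .map(child => sizeByListing(fs, child.getPath, skipPrefix))
        .sum
    }
  }

  def main(args: Array[String]): Unit = {
    // Defaults to the current directory when no location is given.
    val path = new Path(args.headOption.getOrElse("."))
    val fs = path.getFileSystem(new Configuration())
    println(s"getContentSummary: ${sizeByContentSummary(fs, path)} bytes")
    println(s"manual listing:    ${sizeByListing(fs, path)} bytes")
  }
}
```

Running both against the same table location is one way to check whether the two numbers actually disagree in a given environment. Note that the manual walk can skip directories by name, while getContentSummary counts everything under the path, so the results may legitimately differ when such directories exist.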
SQL Visualization for cached Dataset
Hi,

Recently I had to optimize a few Apache Spark SQL queries. Some of the Datasets were reused, so they were cached. However, after caching I no longer see the SQL visualization for the cached Dataset in the Spark UI; I see only an InMemoryRelation node. The explain output at the bottom of the page still shows the full plan.

Is this expected behaviour? In such cases we have far fewer options for debugging performance in Spark.

My suggestion is either to show the full diagram on the first action after cache, or to show a separate SQL query for the cache. The second option is probably not possible, though, because cache does not trigger any computation, so there are no metrics to show.

The workaround is to temporarily disable caching, but that takes a lot of time, especially on large datasets.

Best regards,
Tomek
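To make the situation above easier to reproduce, here is a minimal sketch assuming a local SparkSession. The object name CachedPlanDemo, the maybeCache helper, and the demo.cache system property are hypothetical and not part of Spark. The helper is just one way to make the "temporarily disable caching" workaround a single switch instead of an edit at every call site.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative only: names below are hypothetical, not part of Spark.
object CachedPlanDemo {

  // Cache only when the flag is on, so caching can be turned off in one place
  // while inspecting the full plan in the SQL tab.
  def maybeCache[T](ds: Dataset[T], enabled: Boolean): Dataset[T] =
    if (enabled) ds.cache() else ds

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cached-plan-demo")
      .getOrCreate()
    import spark.implicits._

    // Assumed switch: run with -Ddemo.cache=false to skip caching.
    val cachingEnabled = sys.props.getOrElse("demo.cache", "true").toBoolean

    val base = spark.range(0, 1000000).withColumn("bucket", $"id" % 10)
    val reused = maybeCache(base.groupBy("bucket").count(), cachingEnabled)

    // With caching enabled, the first action is the one that populates the cache.
    reused.count()

    // Print the plans of a follow-up query; with caching enabled, the part of the
    // plan that was cached is expected to appear as an in-memory relation.
    reused.filter($"count" > 0).explain(true)

    spark.stop()
  }
}
```

Comparing the SQL tab for runs with the flag on and off shows how much of the diagram is hidden behind the cached node.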