Hi, I've continued my investigation and found that this is an issue with PySpark's explain() not showing cost information: prefixing the queries with EXPLAIN COST does show rowCount information once the CBO is turned on.
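For anyone hitting the same thing, a minimal sketch of the workaround (the helper and table/column names are mine, not from Spark; in Spark 2.4 the SQL EXPLAIN COST command prints Statistics, including rowCount when spark.sql.cbo.enabled is true, whereas df.explain() omits them):

```python
# Hypothetical helper: wrap a query in EXPLAIN COST so the optimized
# logical plan is printed with Statistics(sizeInBytes=..., rowCount=...).
def explain_cost(sql_text):
    return "EXPLAIN COST " + sql_text.strip()

query = """
select * from mytable a, mytable b
where a.col = b.col and a.anothercol = 3462
"""
stmt = explain_cost(query)
print(stmt)
# With a live session:
#   spark.sql(stmt).show(truncate=False)
# prints the plan annotated with cost statistics.
```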
Thanks

On Thu, Feb 20, 2020 at 1:36 PM Aelur Sadgod <aelur...@gmail.com> wrote:
> Hi,
>
> I'm using Spark 2.4.4 (on EMR) to try and test the CBO on a partitioned
> external Hive table; files are saved as Parquet.
>
> I've set up the required configuration:
>
> .config("spark.sql.cbo.enabled", "true")\
> .config("spark.sql.cbo.joinReorder.enabled", "true")\
>
> After running the ANALYZE command for all columns:
>
> spark.sql(f"""
>     ANALYZE TABLE mytable PARTITION(year, month, date)
>     COMPUTE STATISTICS FOR COLUMNS {','.join(spark.table('mytable').columns)}
> """)
>
> I run DESCRIBE EXTENDED:
>
> spark.sql("DESCRIBE EXTENDED mytable").show(truncate=False, n=400)
>
> and can only see table-level stats, not column-level:
>
> col_name     data_type                        comment
> Statistics   698171162 bytes, 19001020 rows
>
> Running a sample query shows no usage of statistics whatsoever:
>
> spark.sql("""
>     select * from mytable a, mytable b
>     where a.col = b.col and a.anothercol = 3462
> """).explain(True)
>
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter (('a.col = 'b.col) && ('a.anothercol = 3462))
>    +- 'Join Inner
>       :- 'SubqueryAlias `a`
>       :  +- 'UnresolvedRelation `mytable`
>       +- 'SubqueryAlias `b`
>          +- 'UnresolvedRelation `mytable`
>
> == Analyzed Logical Plan ==
> col: bigint, anothercol: bigint, ... 37 more fields
> +- Filter ((col#477L = col#497L) && (anothercol#479L = cast(3462 as bigint)))
>    +- Join Inner
>       :- SubqueryAlias `a`
>       :  +- SubqueryAlias `mydb`.`mytable`
>       :     +- Relation[col#477L,anothercol#479L,...] parquet
>       +- SubqueryAlias `b`
>          +- SubqueryAlias `mydb`.`mytable`
>             +- Relation[col#497L,anothercol#499L,...] parquet
>
> == Optimized Logical Plan ==
> Join Inner, (col#477L = col#497L)
> :- Filter ((isnotnull(anothercol#479L) && (anothercol#479L = 3462)) && isnotnull(col#477L))
> :  +- Relation[col#477L,anothercol#479L,...] parquet
> +- Filter isnotnull(col#497L)
>    +- Relation[col#497L,anothercol#499L,...] parquet
>
> == Physical Plan ==
> *(5) SortMergeJoin [col#477L], [col#497L], Inner
> :- *(2) Sort [col#477L ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(col#477L, 1000)
> :     +- *(1) Project [col#477L, anothercol#479L, ...]
> :        +- *(1) Filter ((isnotnull(anothercol#479L) && (anothercol#479L = 3462)) && isnotnull(col#477L))
> :           +- *(1) FileScan parquet mydb.mytable[col#477L,anothercol#479L,...] Batched: true, Format: Parquet, Location: CatalogFileIndex[s3://bucket/prefix..., PartitionCount: 91, PartitionFilters: [], PushedFilters: [IsNotNull(anothercol), EqualTo(anothercol,3462), IsNotNull(col)], ReadSchema: struct<col:bigint,anothercol:bigint,...
> +- *(4) Sort [col#497L ASC NULLS FIRST], false, 0
>    +- Exchange hashpartitioning(col#497L, 1000)
>       +- *(3) Project [col#497L, anothercol#499L, ...]
>          +- *(3) Filter isnotnull(col#497L)
>             +- *(3) FileScan parquet mydb.mytable[col#497L,anothercol#499L,...] Batched: true, Format: Parquet, Location: CatalogFileIndex[s3://bucket/prefix..., PartitionCount: 91, PartitionFilters: [], PushedFilters: [IsNotNull(col)], ReadSchema: struct<col:bigint,anothercol:bigint,...
>
> I see the following tickets on JIRA, but no helpful information on whether they're bugs or not:
>
> https://issues.apache.org/jira/browse/SPARK-29335
> https://issues.apache.org/jira/browse/SPARK-25185
>
> Is this a bug or is there something I'm missing?
>
> Thanks!
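One related note on the "only table-level stats" symptom: Spark reports the statistics collected by ANALYZE ... FOR COLUMNS per column, via `DESCRIBE EXTENDED <table> <column>`, not in the table-level DESCRIBE EXTENDED output. A minimal sketch (the helper function and the table/column names are illustrative, taken from the thread):

```python
# Hypothetical helper: build the per-column DESCRIBE statement. For a
# column with statistics, Spark shows rows such as min, max, num_nulls
# and distinct_count in the result.
def describe_column(table, column):
    return f"DESCRIBE EXTENDED {table} {column}"

stmt = describe_column("mytable", "anothercol")
print(stmt)  # DESCRIBE EXTENDED mytable anothercol
# With a live session:
#   spark.sql(stmt).show(truncate=False)
```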