Hi Spark SQL query seems to be doing a table scan instead of utilizing partitions in Iceberg.
I have created a partition using the spec as follows: public PartitionSpec getPartitionSpec() { PartitionSpec.Builder icebergBuilder = PartitionSpec.builderFor(getSchema()); icebergBuilder.hour(FIELD_NAME.CREATED_AT); return icebergBuilder.build(); } My expectation was that iceberg would implement hidden partitioning on CREATED_AT field which is of type Timestamp. When I look at S3, it seems to have created hourly partitions ( great!) While running the Query I load the Table as follows- dlS3Connector.getSparkSession().read().format("iceberg") .load(getTableLocation()) // S3 Bucket .where(new Column(TweetItem.FIELD_NAME.CREATED_AT).$greater$eq(new Timestamp(startDate)) .and(new Column(TweetItem.FIELD_NAME.CREATED_AT).lt(new Timestamp(endDate))) .and(new Column(TweetItem.FIELD_NAME.TEXT).rlike(regExp))) .as(Tweet.getEncoder()); But on read it is doing a table scan as per 2019-10-26 00:04:27.731 ^[[32m[INFO ]^[[m [main] o.a.s.s.e.d.v.DataSourceV2Strategy (Logging.scala:54) - Pushing operators to class org.apache.iceberg.spark.source.IcebergSource Pushed Filters: isnotnull(created_at#5), isnotnull(text#4) Post-Scan Filters: (cast(created_at#5 as string) > 2019-04-01 04:35:06.0),(cast(created_at#5 as string) < 2019-04-01 04:40:06.0),text#4 RLIKE hackathon|understand|Trump,isnotnull(created_at#5),isnotnull(text#4) Output: mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, lang#6, created_at_ms#7L 2019-10-26 00:04:27.764 ^[[32m[INFO ]^[[m [main] o.a.i.TableScan (BaseTableScan.java:178) - Scanning table s3a://hackathon-hour/ snapshot 5785804775998605063 created at 2019-10-24 09:30:26.550 with filter (not_null(ref(name="created_at")) and not_null(ref(name="text"))) The physical plan is displayed as - == Physical Plan == *(1) Project [mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, lang#6, created_at_ms#7L] +- *(1) Filter (((((cast(created_at#5 as string) > 2019-04-01 04:35:06.0) && (cast(created_at#5 as string) < 2019-04-01 04:40:06.0)) && text#4 RLIKE hackathon|understand|Trump) && isnotnull(created_at#5)) && isnotnull(text#4)) +- *(1) ScanV2 iceberg[mwId#0, mwVersion#1L, id#2L, id_str#3, text#4, created_at#5, lang#6, created_at_ms#7L] (Filters: [isnotnull(created_at#5), isnotnull(text#4)], Options: [path=s3a://hackathon-hour/,paths=[]]) Please let me know where I am going wrong here. Iceberg version used is - 57b1099.dirty Thanks Sandeep -- The information contained in this email may be confidential. It has been sent for the sole use of the intended recipient(s). If the reader of this email is not an intended recipient, you are hereby notified that any unauthorized review, use, disclosure, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this email in error, please notify the sender immediately and destroy all copies of the message.