I guess it has to do with indexing and partitioning data across nodes. Have a look at the data partitioning system design concept <https://www.enjoyalgorithms.com/blog/data-partitioning-system-design-concept> and key range partitions <https://martinfowler.com/articles/patterns-of-distributed-systems/key-range-partitions.html>
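To make the key-range idea from those links concrete, here is a minimal, illustrative Python sketch (the range boundaries and node names are made up; a real system derives boundaries from the data distribution):

from bisect import bisect_right

# Illustrative boundaries: node-0 owns keys < "g", node-1 owns ["g", "n"),
# node-2 owns ["n", "t"), node-3 owns the rest.
range_bounds = ["g", "n", "t"]
nodes = ["node-0", "node-1", "node-2", "node-3"]

def node_for_key(key: str) -> str:
    # bisect_right finds the first range whose upper bound exceeds the key,
    # i.e. the node that owns the range this key falls into
    return nodes[bisect_right(range_bounds, key)]

print(node_for_key("apple"))  # node-0
print(node_for_key("grape"))  # node-1
print(node_for_key("zebra"))  # node-3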
You can work around this by creating a temp view where the date is cast to a string. Note that I had to test this a few times, so I'm using .mode("overwrite") when writing the file.

from pyspark.sql import SparkSession, Row
from datetime import date

spark = SparkSession.builder.getOrCreate()

mock_data = [
    Row(id=1, name="John", partition_col=date(2023, 4, 11)),
    Row(id=2, name="Jane", partition_col=date(2023, 4, 11)),
    Row(id=3, name="Alice", partition_col=date(2023, 4, 12)),
    Row(id=4, name="Bob", partition_col=date(2023, 4, 12)),
]
mock_df = spark.createDataFrame(mock_data)

parquet_data_path = "test_date"
mock_df.write.partitionBy("partition_col").mode("overwrite").parquet(parquet_data_path)

create_table_sql = f"""
    CREATE TABLE IF NOT EXISTS my_table (
        id INT,
        name STRING)
    USING parquet
    PARTITIONED BY (partition_col DATE)
    OPTIONS ('path' = '{parquet_data_path}')
"""
spark.sql(create_table_sql)

# temp view with the string partition column
create_view_sql = """
    CREATE OR REPLACE TEMPORARY VIEW my_table_with_string_partition AS
    SELECT *, CAST(partition_col AS STRING) AS partition_col_str
    FROM my_table
"""
spark.sql(create_view_sql)

query = "SELECT * FROM my_table_with_string_partition WHERE partition_col_str = '2023-04-11'"
result = spark.sql(query)
result.show()

+---+----+-------------+-----------------+
| id|name|partition_col|partition_col_str|
+---+----+-------------+-----------------+
|  1|John|   2023-04-11|       2023-04-11|
|  2|Jane|   2023-04-11|       2023-04-11|
+---+----+-------------+-----------------+

On Sat, 15 Apr 2023 at 21:41, Charles vinodh <mig.flan...@gmail.com> wrote:

>
> Bumping this up again for suggestions. Is the official recommendation to
> not have *int* or *date* typed partition columns?
>
> On Wed, 12 Apr 2023 at 10:44, Charles vinodh <mig.flan...@gmail.com>
> wrote:
>
>> There are other distributed execution engines (like Hive, Trino) that do
>> support non-string data types for partition columns, such as date and
>> integer. Any idea why this restriction exists in Spark?
>>
>> On Tue, 11 Apr 2023 at 20:34, Chitral Verma <chitralve...@gmail.com>
>> wrote:
>>
>>> Because the name of the directory cannot be an object; it has to be a
>>> string to create partitioned dirs like "date=2023-04-10".
>>>
>>> On Tue, 11 Apr 2023, 8:27 pm Charles vinodh, <mig.flan...@gmail.com>
>>> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> We are running into the below error when we try to run a simple query
>>>> against a partitioned table in Spark.
>>>>
>>>> *MetaException(message:Filtering is supported only on partition keys of
>>>> type string)*
>>>>
>>>> Our partition column is of type *date* instead of string, and the
>>>> query is a very simple SQL, as shown below:
>>>>
>>>> *SELECT * FROM my_table WHERE partition_col = date '2023-04-11'*
>>>>
>>>> Any idea why Spark mandates partition columns to be of type string?
>>>> Is there a recommended workaround for this issue?
>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge
+47 480 94 297
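To see why the partition value effectively becomes a string (the point about directory names above), you can list the directories that the partitionBy write in the example created. A minimal sketch, reusing the test_date path from the example and assuming the standard Hive-style key=value directory layout:

import os

# Hive-style layout: each partition value is serialized into the directory
# name, e.g. "partition_col=2023-04-11", so on disk it is text, not a DATE.
for entry in sorted(os.listdir("test_date")):
    if entry.startswith("partition_col="):
        print(entry)

# Expected output (values stored as plain strings):
# partition_col=2023-04-11
# partition_col=2023-04-12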