I guess it has to do with indexing and partitioning data across nodes. Have a look at the data partitioning system design concept <https://www.enjoyalgorithms.com/blog/data-partitioning-system-design-concept> and key range partitions <https://martinfowler.com/articles/patterns-of-distributed-systems/key-range-partitions.html>
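To make the key-range idea from those links concrete, here is a minimal, illustrative Python sketch (the range boundaries and node names are made up; a real system derives boundaries from the data distribution):

from bisect import bisect_right

# Illustrative boundaries: node-0 owns keys < "g", node-1 owns ["g", "n"),
# node-2 owns ["n", "t"), node-3 owns the rest.
range_bounds = ["g", "n", "t"]
nodes = ["node-0", "node-1", "node-2", "node-3"]

def node_for_key(key: str) -> str:
    # bisect_right finds the first range whose upper bound exceeds the key,
    # i.e. the node that owns the range this key falls into
    return nodes[bisect_right(range_bounds, key)]

print(node_for_key("apple"))  # node-0
print(node_for_key("grape"))  # node-1
print(node_for_key("zebra"))  # node-3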
You can work around this by creating a temp view where the date is cast to a string. Note that I had to test this a few times, so I'm using .mode("overwrite") when writing the file.

from pyspark.sql import SparkSession, Row
from datetime import date

spark = SparkSession.builder.getOrCreate()

mock_data = [
    Row(id=1, name="John", partition_col=date(2023, 4, 11)),
    Row(id=2, name="Jane", partition_col=date(2023, 4, 11)),
    Row(id=3, name="Alice", partition_col=date(2023, 4, 12)),
    Row(id=4, name="Bob", partition_col=date(2023, 4, 12)),
]
mock_df = spark.createDataFrame(mock_data)

parquet_data_path = "test_date"
mock_df.write.partitionBy("partition_col").mode("overwrite").parquet(parquet_data_path)

create_table_sql = f"""
    CREATE TABLE IF NOT EXISTS my_table (
        id INT,
        name STRING)
    USING parquet
    PARTITIONED BY (partition_col DATE)
    OPTIONS ('path' = '{parquet_data_path}')
"""
spark.sql(create_table_sql)

# temp view with the string partition column
create_view_sql = """
    CREATE OR REPLACE TEMPORARY VIEW my_table_with_string_partition AS
    SELECT *, CAST(partition_col AS STRING) AS partition_col_str
    FROM my_table
"""
spark.sql(create_view_sql)

query = "SELECT * FROM my_table_with_string_partition WHERE partition_col_str = '2023-04-11'"
result = spark.sql(query)
result.show()

+---+----+-------------+-----------------+
| id|name|partition_col|partition_col_str|
+---+----+-------------+-----------------+
|  1|John|   2023-04-11|       2023-04-11|
|  2|Jane|   2023-04-11|       2023-04-11|
+---+----+-------------+-----------------+

On Sat, 15 Apr 2023 at 21:41, Charles vinodh <mig.flan...@gmail.com> wrote:

>
> Bumping this up again for suggestions. Is the official recommendation to
> not have *int* or *date* typed partition columns?
>
> On Wed, 12 Apr 2023 at 10:44, Charles vinodh <mig.flan...@gmail.com>
> wrote:
>
>> There are other distributed execution engines (like Hive, Trino) that do
>> support non-string data types for partition columns, such as date and
>> integer. Any idea why this restriction exists in Spark?
>>
>> On Tue, 11 Apr 2023 at 20:34, Chitral Verma <chitralve...@gmail.com>
>> wrote:
>>
>>> Because the name of the directory cannot be an object; it has to be a
>>> string to create partitioned dirs like "date=2023-04-10".
>>>
>>> On Tue, 11 Apr 2023, 8:27 pm Charles vinodh, <mig.flan...@gmail.com>
>>> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> We are running into the below error when we try to run a simple query
>>>> against a partitioned table in Spark.
>>>>
>>>> *MetaException(message:Filtering is supported only on partition keys of
>>>> type string)*
>>>>
>>>> Our partition column is of type *date* instead of string, and the
>>>> query is a very simple SQL, as shown below:
>>>>
>>>> *SELECT * FROM my_table WHERE partition_col = date '2023-04-11'*
>>>>
>>>> Any idea why Spark mandates partition columns to be of type string?
>>>> Is there a recommended workaround for this issue?
>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge
+47 480 94 297
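To see why the partition value effectively becomes a string (the point about directory names above), you can list the directories that the partitionBy write in the example created. A minimal sketch, reusing the test_date path from the example and assuming the standard Hive-style key=value directory layout:

import os

# Hive-style layout: each partition value is serialized into the directory
# name, e.g. "partition_col=2023-04-11", so on disk it is text, not a DATE.
for entry in sorted(os.listdir("test_date")):
    if entry.startswith("partition_col="):
        print(entry)

# Expected output (values stored as plain strings):
# partition_col=2023-04-11
# partition_col=2023-04-12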