Well, let us do a test in PySpark.

Take this code and create a default Parquet file. My Spark version is 3.4.

cat parquet_check.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetVersionExample").getOrCreate()

data = [("London", 8974432), ("New York City", 8804348), ("Beijing",
21893000)]
df = spark.createDataFrame(data, ["city", "population"])

df.write.mode("overwrite").parquet("parquet_example")  # writes the files to an HDFS directory

Use a tool called parquet-tools (installable with pip; see
https://pypi.org/project/parquet-tools/).

Copy the Parquet files from HDFS to the current directory:

hdfs dfs -get /user/hduser/parquet_example .
cd ./parquet_example
Do an ls, pick one of the part files (here file 3), and inspect it:

parquet-tools inspect part-00003-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet

This is the output:

############ file meta data ############
created_by: parquet-mr version 1.12.3 (build
f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
num_columns: 2
num_rows: 1
num_row_groups: 1
format_version: 1.0
serialized_size: 563


############ Columns ############
city
population

############ Column(city) ############
name: city
path: city
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: -5%)

############ Column(population) ############
name: population
path: population
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -5%)

File Information:

   - format_version: 1.0: This line explicitly states that the format
   version of the Parquet file is 1.0, which corresponds to Parquet version 1.
   - created_by: parquet-mr version 1.12.3: While this does not directly
   specify the format version, it is accepted that versions of parquet-mr
   such as 1.12.3 write Parquet version 1 files by default.

Since Spark 3.4 is capable of reading both versions (1 and 2), you do not
necessarily need to modify your Spark code to access this file. However, if
you want to create Parquet files in version 2 from Spark, you might need to
consider additional changes, such as excluding parquet-mr or upgrading the
Parquet libraries and doing a custom build of Spark. Given the law of
diminishing returns, I would not advise that either. You can of course use
gzip compression if that is more suitable for your needs.

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Tue, 16 Apr 2024 at 15:00, Prem Sahoo <prem.re...@gmail.com> wrote:

> Hello Community,
> Could any of you shed some light on below questions please ?
> Sent from my iPhone
>
> On Apr 15, 2024, at 9:02 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
>
> 
> Any specific reason spark does not support or community doesn't want to go
> to Parquet V2 , which is more optimized and read and write is too much
> faster (form other component which I am using)
>
> On Mon, Apr 15, 2024 at 7:55 PM Ryan Blue <b...@tabular.io> wrote:
>
>> Spark will read data written with v2 encodings just fine. You just don't
>> need to worry about making Spark produce v2. And you should probably also
>> not produce v2 encodings from other systems.
>>
>> On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo <prem.re...@gmail.com> wrote:
>>
>>> oops but so spark does not support parquet V2  atm ?, as We have a use
>>> case where we need parquet V2 as  one of our components uses Parquet V2 .
>>>
>>> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Hi Prem,
>>>>
>>>> Parquet v1 is the default because v2 has not been finalized and adopted
>>>> by the community. I highly recommend not using v2 encodings at this time.
>>>>
>>>> Ryan
>>>>
>>>> On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo <prem.re...@gmail.com>
>>>> wrote:
>>>>
>>>>> I am using spark 3.2.0 . but my spark package comes with parquet-mr
>>>>> 1.2.1 which writes in parquet version 1 not version version 2:(. so I was
>>>>> looking how to write in Parquet version2 ?
>>>>>
>>>>> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Sorry you have a point there. It was released in version 3.00. What
>>>>>> version of spark are you using?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 15 Apr 2024 at 21:33, Prem Sahoo <prem.re...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thank you so much for the info! But do we have any release notes
>>>>>>> where it says spark2.4.0 onwards supports parquet version 2. I was under
>>>>>>> the impression Spark3.0 onwards it started supporting .
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Well if I am correct, Parquet version 2 support was introduced in
>>>>>>>> Spark version 2.4.0. Therefore, any version of Spark starting from 
>>>>>>>> 2.4.0
>>>>>>>> supports Parquet version 2. Assuming that you are using Spark version
>>>>>>>> 2.4.0 or later, you should be able to take advantage of Parquet 
>>>>>>>> version 2
>>>>>>>> features.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Mich Talebzadeh,
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, 15 Apr 2024 at 20:53, Prem Sahoo <prem.re...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thank you for the information!
>>>>>>>>> I can use any version of parquet-mr to produce parquet file.
>>>>>>>>>
>>>>>>>>> regarding 2nd question .
>>>>>>>>> Which version of spark is supporting parquet version 2?
>>>>>>>>> May I get the release notes where parquet versions are mentioned ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh <
>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Parquet-mr is a Java library that provides functionality for
>>>>>>>>>> working with Parquet files with hadoop. It is therefore  more geared
>>>>>>>>>> towards working with Parquet files within the Hadoop ecosystem,
>>>>>>>>>> particularly using MapReduce jobs. There is no definitive way to 
>>>>>>>>>> check
>>>>>>>>>> exact compatible versions within the library itself. However, you 
>>>>>>>>>> can have
>>>>>>>>>> a look at this
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/CHANGES.md
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, 15 Apr 2024 at 18:59, Prem Sahoo <prem.re...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello Team,
>>>>>>>>>>> May I know how to check which version of parquet is supported by
>>>>>>>>>>> parquet-mr 1.2.1 ?
>>>>>>>>>>>
>>>>>>>>>>> Which version of parquet-mr is supporting parquet version 2 (V2)
>>>>>>>>>>> ?
>>>>>>>>>>>
>>>>>>>>>>> Which version of spark is supporting parquet version 2?
>>>>>>>>>>> May I get the release notes where parquet versions are mentioned
>>>>>>>>>>> ?
>>>>>>>>>>>
>>>>>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
