Avcu opened a new issue, #3149: URL: https://github.com/apache/parquet-java/issues/3149
### Describe the bug, including details regarding any error messages, version, and platform.

### Issue

I am saving a Parquet file with Spark where one of the columns is a decimal. The physical type of this column becomes INT32 or INT64 depending on its precision. When I then read the Parquet file with AvroParquetReader, I see the logical type being `long`, with the wrong value. For example, if the original value is 23.4, the read value is 234.

#### Spark side

If I enable `spark.sql.parquet.writeLegacyFormat` in Spark (ex Jira: [SPARK-20297](https://issues.apache.org/jira/browse/SPARK-20297)), Spark does not use INT32/INT64 as the physical type, and I can then read the Parquet file successfully. However, this is not the default option, and according to the [decimal documentation of this repo](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal), INT32/INT64 should be viable options.

### How to reproduce

1. #### Writing with Spark (version: 3.3.0)

   ```
   df_temp = spark.createDataFrame(
       [(120.321, "Alex"), (24.45, "John")],
       schema=["salary", "name"],
   )
   df_temp.createOrReplaceTempView("companyTable")
   df = spark.sql("SELECT *, CAST(salary as DECIMAL(10,1)) as decimal_salary FROM companyTable")
   df.show()
   df.write.parquet("my_path")
   ```

   ```
   +-------+----+--------------+
   | salary|name|decimal_salary|
   +-------+----+--------------+
   |120.321|Alex|         120.3|
   |  24.45|John|          24.5|
   +-------+----+--------------+
   ```

2.
#### Confirming the schema

Running parquet-tools: `parquet-tools inspect github_example.parquet`

   ```
   ############ file meta data ############
   created_by: parquet-mr version 1.12.2 (build ${buildNumber})
   num_columns: 3
   num_rows: 1
   num_row_groups: 1
   format_version: 1.0
   serialized_size: 757

   ############ Columns ############
   salary
   name
   decimal_salary

   ############ Column(salary) ############
   name: salary
   path: salary
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: DOUBLE
   logical_type: None
   converted_type (legacy): NONE
   compression: SNAPPY (space_saved: -5%)

   ############ Column(name) ############
   name: name
   path: name
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   compression: SNAPPY (space_saved: -5%)

   ############ Column(decimal_salary) ############
   name: decimal_salary
   path: decimal_salary
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: INT64
   logical_type: Decimal(precision=10, scale=1)
   converted_type (legacy): DECIMAL
   compression: SNAPPY (space_saved: -5%)
   ```

3.
#### Reading with AvroParquetReader (version: 1.15.0)

   ```
   import java.io.File;
   import java.io.IOException;

   import org.apache.avro.Conversions;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.parquet.avro.AvroParquetReader;
   import org.apache.parquet.hadoop.ParquetReader;
   import org.apache.parquet.hadoop.util.HadoopInputFile;

   public class ReadParquet {

       public static void main(String[] args) {
           String filePath = "my_path";

           // Check if the file exists
           File file = new File(filePath);
           if (!file.exists() || file.isDirectory()) {
               System.err.println("Invalid file path");
               return;
           }

           GenericData genericData = new GenericData();
           genericData.addLogicalTypeConversion(new Conversions.DecimalConversion());

           Path path = new Path(filePath);
           try (ParquetReader<GenericRecord> reader = AvroParquetReader
                   .<GenericRecord>builder(HadoopInputFile.fromPath(path, new Configuration()))
                   .withDataModel(genericData)
                   .build()) {
               GenericRecord record;
               while ((record = reader.read()) != null) {
                   // Process the record
                   System.out.println(record.toString());
                   System.out.println(record.getSchema());
               }
           } catch (IOException e) {
               e.printStackTrace();
           }
       }
   }
   ```

   ```
   {"salary": 120.321, "name": "Alex", "decimal_salary": 1203}
   {"type":"record","name":"spark_schema","fields":[{"name":"salary","type":["null","double"],"default":null},{"name":"name","type":["null","string"],"default":null},{"name":"decimal_salary","type":["null","long"],"default":null}]}
   ```

##### Dependencies

```
<dependencies>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-common</artifactId>
    <version>1.15.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-encoding</artifactId>
    <version>1.15.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-column</artifactId>
    <version>1.15.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.15.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>1.15.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.4.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>3.4.1</version>
  </dependency>
</dependencies>
```

### Artifacts

[github_example.parquet.zip](https://github.com/user-attachments/files/18717468/github_example.parquet.zip)

### Component(s)

Avro
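The wrong value is consistent with the reader returning the raw unscaled integer: per the parquet-format decimal spec, a DECIMAL backed by INT32/INT64 stores the unscaled value, and the scale lives only in the `Decimal(precision, scale)` annotation, so 120.3 at scale 1 is stored as 1203. A minimal stdlib sketch of that relationship, using the 1203 / scale=1 values from the output above (class name and variable names are illustrative only); the last two lines show a possible caller-side workaround of re-applying the scale taken from the file footer:

```java
import java.math.BigDecimal;

public class DecimalScaleCheck {
    public static void main(String[] args) {
        // What the writer stores for DECIMAL(10, 1): the unscaled integer.
        BigDecimal written = new BigDecimal("120.3");
        long unscaled = written.unscaledValue().longValueExact();
        System.out.println(unscaled); // 1203 -- matches the long read above

        // Possible workaround until the logical type is honored:
        // re-apply the scale from the footer (Decimal(precision=10, scale=1)).
        int scale = 1;
        BigDecimal restored = BigDecimal.valueOf(unscaled, scale);
        System.out.println(restored); // 120.3
    }
}
```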