Re: [PR] PARQUET-2416: Use 'mapreduce.outputcommitter.factory.class' in ParquetOutputFormat [parquet-java]

2024-11-23 Thread via GitHub
Arnaud-Nauwynck closed pull request #1244: PARQUET-2416: Use 'mapreduce.outputcommitter.factory.class' in ParquetOutputFormat URL: https://github.com/apache/parquet-java/pull/1244

[I] read footer using 1 call readFully(byte[8]) instead of 5 calls ( 4 x read() for footer length + 1 x read(byte[4]) for magic marker ) [parquet-java]

2024-11-23 Thread via GitHub
Arnaud-Nauwynck opened a new issue, #3074: URL: https://github.com/apache/parquet-java/issues/3074 ### Describe the enhancement requested This is a minor performance improvement, but worthwhile when reading many files: read the footer using 1 call to readFully(byte[8]) instead of 5 calls (4 x read() for the footer length + 1 x read(byte[4]) for the magic marker).
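
Below is a minimal sketch of the idea, not the actual parquet-java patch: the last 8 bytes of a Parquet file hold the 4-byte little-endian footer length followed by the 4-byte "PAR1" magic, so both can be fetched with a single readFully(byte[8]) instead of four read() calls plus one read(byte[4]). The class and method names here are illustrative only.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class FooterTail {
    private static final byte[] MAGIC = {'P', 'A', 'R', '1'};
    private static final int TAIL_LENGTH = 8; // 4-byte footer length + 4-byte magic

    /** Reads the footer length using one readFully call on the 8-byte file tail. */
    public static int readFooterLength(RandomAccessFile file) throws IOException {
        byte[] tail = new byte[TAIL_LENGTH];
        file.seek(file.length() - TAIL_LENGTH);
        file.readFully(tail); // single I/O call instead of five
        if (!Arrays.equals(Arrays.copyOfRange(tail, 4, 8), MAGIC)) {
            throw new IOException("Not a Parquet file: missing PAR1 magic");
        }
        // The footer length is stored as a little-endian 32-bit integer.
        return ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }
}
```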

[PR] GH-3074: read footer using 1 call readFully(byte[8]) instead of 5 calls [parquet-java]

2024-11-23 Thread via GitHub
Arnaud-Nauwynck opened a new pull request, #3075: URL: https://github.com/apache/parquet-java/pull/3075 GH-3074: read footer using 1 call readFully(byte[8]) instead of 5 calls ### Rationale for this change performance ### What changes are included in this PR? only method F

Re: [I] should not use seek() for skipping very small column chunks. better to read and ignore data. [parquet-java]

2024-11-23 Thread via GitHub
Arnaud-Nauwynck commented on issue #3076: URL: https://github.com/apache/parquet-java/issues/3076#issuecomment-2495541296 See also the related issue [https://github.com/apache/parquet-java/issues/3077](https://github.com/apache/parquet-java/issues/3077): AzureBlobFileSystem.open() should return a sub-class of FSDataInputStream that overrides readVectored() much more efficiently for small reads.

Re: [PR] GH-463: Add more types - time, nano timestamps, UUID to Variant spec [parquet-format]

2024-11-23 Thread via GitHub
emkornfield commented on code in PR #464: URL: https://github.com/apache/parquet-format/pull/464#discussion_r1855229728 ## VariantEncoding.md: ## @@ -386,11 +386,15 @@ The Decimal type contains a scale, but no precision. The implied precision of a | Exact Numeric| deci

[I] should not use seek() for skipping very small column chunks. better to read and ignore data. [parquet-java]

2024-11-23 Thread via GitHub
Arnaud-Nauwynck opened a new issue, #3076: URL: https://github.com/apache/parquet-java/issues/3076 ### Describe the enhancement requested When reading only some of the column chunks, Parquet builds a list of "ConsecutivePartList" entries, then tries to call the Hadoop API for vectorized
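
A hypothetical sketch of the technique the issue title describes (the threshold and helper names are illustrative, not from parquet-java): when the gap between two wanted column chunks is small, reading and discarding the gap bytes on the already-open stream avoids a seek(), which on remote stores may abort and reopen the underlying connection.

```java
import java.io.IOException;
import java.io.InputStream;

public final class SmallGapSkipper {
    // Illustrative threshold, not a value taken from parquet-java.
    private static final long MAX_GAP_TO_READ_THROUGH = 64 * 1024; // 64 KiB

    public static void skipGap(InputStream in, long gap, Seeker seeker) throws IOException {
        if (gap <= MAX_GAP_TO_READ_THROUGH) {
            // Small gap: read and ignore the bytes instead of seeking.
            byte[] scratch = new byte[8 * 1024];
            long remaining = gap;
            while (remaining > 0) {
                int n = in.read(scratch, 0, (int) Math.min(scratch.length, remaining));
                if (n < 0) {
                    throw new IOException("Unexpected end of stream while skipping gap");
                }
                remaining -= n; // bytes are discarded
            }
        } else {
            seeker.seekBy(gap); // large gap: a real seek is still worthwhile
        }
    }

    /** Hypothetical seek callback, standing in for a positioned stream's seek(). */
    public interface Seeker {
        void seekBy(long bytes) throws IOException;
    }
}
```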

Re: [I] AzureBlobFileSystem.open() should return a sub-class of FSDataInputStream that overrides readVectored() much more efficiently for small reads [parquet-java]

2024-11-23 Thread via GitHub
Arnaud-Nauwynck closed issue #3077: AzureBlobFileSystem.open() should return a sub-class of FSDataInputStream that overrides readVectored() much more efficiently for small reads URL: https://github.com/apache/parquet-java/issues/3077

Re: [I] AzureBlobFileSystem.open() should return a sub-class of FSDataInputStream that overrides readVectored() much more efficiently for small reads [parquet-java]

2024-11-23 Thread via GitHub
Arnaud-Nauwynck commented on issue #3077: URL: https://github.com/apache/parquet-java/issues/3077#issuecomment-2495543888 Sorry, I misclicked and filed this in the parquet project instead of hadoop. I have recreated it as [HADOOP-19345](https://issues.apache.org/jira/browse/HADOOP-19345), as I did not see

Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]

2024-11-23 Thread via GitHub
emkornfield commented on code in PR #466: URL: https://github.com/apache/parquet-format/pull/466#discussion_r1855230005 ## LogicalTypes.md: ## @@ -670,6 +681,13 @@ optional group array_of_arrays (LIST) { Backward-compatibility rules +Modern writers should always produc
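
For context, here is a sketch of the modern 3-level LIST layout that the backward-compatibility rules distinguish from the older 1- and 2-level forms, expressed with parquet-java's MessageTypeParser. The outer field name "my_list" is illustrative; the inner "list"/"element" names are the ones the spec expects new writers to emit.

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ThreeLevelListExample {
    /** Returns an example schema containing one list column in the 3-level layout. */
    public static MessageType threeLevelList() {
        return MessageTypeParser.parseMessageType(
            "message doc {\n" +
            "  optional group my_list (LIST) {\n" +      // the list field itself
            "    repeated group list {\n" +               // one repeated middle level
            "      optional binary element (UTF8);\n" +   // the element field
            "    }\n" +
            "  }\n" +
            "}");
    }
}
```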

[I] AzureBlobFileSystem.open() should return a sub-class of FSDataInputStream that overrides readVectored() much more efficiently for small reads [parquet-java]

2024-11-23 Thread via GitHub
Arnaud-Nauwynck opened a new issue, #3077: URL: https://github.com/apache/parquet-java/issues/3077 ### Describe the enhancement requested In hadoop-azure, there are huge performance problems when reading a file in a too-fragmented way: by reading many small file fragments even with the
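
For reference, a hedged sketch of what a caller-side vectored read of several small ranges looks like, assuming Hadoop's vectored I/O API (FileRange plus readVectored on FSDataInputStream, available since Hadoop 3.3.5). The offsets and sizes are made up, and the coalescing the issue asks for would live inside the ABFS input-stream implementation, not in this caller code.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VectoredSmallReads {
    public static void readSmallRanges(FileSystem fs, Path path) throws Exception {
        // Two small, nearby ranges, e.g. a page header plus a dictionary page.
        List<FileRange> ranges = Arrays.asList(
            FileRange.createFileRange(1_000, 4 * 1024),
            FileRange.createFileRange(10_000, 8 * 1024));
        try (FSDataInputStream in = fs.open(path)) {
            // A specialised ABFS stream could coalesce these into one remote request;
            // the default implementation issues one positioned read per range.
            in.readVectored(ranges, ByteBuffer::allocate);
            for (FileRange range : ranges) {
                ByteBuffer data = range.getData().get(); // completes when the range is read
                // ... decode the fragment from 'data' ...
            }
        }
    }
}
```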

Re: [PR] Add map_no_value.parquet [parquet-testing]

2024-11-23 Thread via GitHub
alamb commented on PR #63: URL: https://github.com/apache/parquet-testing/pull/63#issuecomment-2495459421 > Probably... I've never touched that behavior because I don't know if it is intentional or not. I vaguely remember the original rationale being that it was impossible to decode ranges