Re: [PR] PARQUET-2416: Use 'mapreduce.outputcommitter.factory.class' in ParquetOutpuFormat [parquet-java]
Arnaud-Nauwynck closed pull request #1244: PARQUET-2416: Use 'mapreduce.outputcommitter.factory.class' in ParquetOutpuFormat
URL: https://github.com/apache/parquet-java/pull/1244
[I] read footer using 1 call readFully(byte[8]) instead of 5 calls ( 4 x read() for footer length + 1 x read(byte[4]) for magic marker ) [parquet-java]
Arnaud-Nauwynck opened a new issue, #3074: URL: https://github.com/apache/parquet-java/issues/3074

### Describe the enhancement requested

This is a minor performance improvement, but worthwhile when reading many files: read the footer using 1 call to readFully(byte[8]) instead of 5 calls (4 x read() for the footer length + 1 x read(byte[4]) for the magic marker).

In summary, the patch applies to ParquetFileReader.java, method readFooter():

> -byte[] magic = new byte[MAGIC.length];
> -long fileMetadataLengthIndex = fileLen - magic.length - FOOTER_LENGTH_SIZE;
> +long fileMetadataLengthIndex = fileLen - MAGIC.length - FOOTER_LENGTH_SIZE;
> LOG.debug("reading footer index at {}", fileMetadataLengthIndex);
> f.seek(fileMetadataLengthIndex);
> -int fileMetadataLength = readIntLittleEndian(f);
> -f.readFully(magic);
> +byte[] magicAndLengthBytes = new byte[FOOTER_LENGTH_SIZE + MAGIC.length];
> +f.readFully(magicAndLengthBytes);
> +int fileMetadataLength = readIntLittleEndian(magicAndLengthBytes, 0);

### Component(s)

Core
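For reference, a self-contained sketch of the same single-read idea, using plain java.io.RandomAccessFile rather than Parquet's SeekableInputStream; the class and method names below are illustrative, not the actual ParquetFileReader code. It relies only on the documented footer layout: a 4-byte little-endian footer length followed by the 4-byte "PAR1" magic at the very end of the file.

```
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative only: reads the 8-byte footer tail (length + magic) in a single readFully call.
final class FooterTailSketch {

  private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);
  private static final int FOOTER_LENGTH_SIZE = 4;

  /** Returns the footer (file metadata) length stored in the last 8 bytes of the file. */
  static int readFooterLength(RandomAccessFile f) throws IOException {
    long fileLen = f.length();
    long tailIndex = fileLen - FOOTER_LENGTH_SIZE - MAGIC.length;

    byte[] tail = new byte[FOOTER_LENGTH_SIZE + MAGIC.length];
    f.seek(tailIndex);
    f.readFully(tail); // one read instead of five

    // the last 4 bytes must be the "PAR1" magic trailer
    for (int i = 0; i < MAGIC.length; i++) {
      if (tail[FOOTER_LENGTH_SIZE + i] != MAGIC[i]) {
        throw new IOException("not a Parquet file: magic trailer mismatch");
      }
    }
    // the first 4 bytes of the tail are the little-endian footer length
    return ByteBuffer.wrap(tail, 0, FOOTER_LENGTH_SIZE).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }
}
```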
[PR] GH-3074: read footer using 1 call readFully(byte[8]) instead of 5 calls [parquet-java]
Arnaud-Nauwynck opened a new pull request, #3075: URL: https://github.com/apache/parquet-java/pull/3075

GH-3074: read footer using 1 call readFully(byte[8]) instead of 5 calls

### Rationale for this change

Performance.

### What changes are included in this PR?

Only the method ParquetFileReader.readFooter().

### Are these changes tested?

### Are there any user-facing changes?

No
Re: [I] should not use seek() for skipping very small column chunks. better to read and ignore data. [parquet-java]
Arnaud-Nauwynck commented on issue #3076: URL: https://github.com/apache/parquet-java/issues/3076#issuecomment-2495541296

See also the related issue https://github.com/apache/parquet-java/issues/3077: AzureBlobFileSystem.open() should return a sub-class fsDataInputStream that override readVectored() much more efficiently for small reads
Re: [PR] GH-463: Add more types - time, nano timestamps, UUID to Variant spec [parquet-format]
emkornfield commented on code in PR #464: URL: https://github.com/apache/parquet-format/pull/464#discussion_r1855229728

## VariantEncoding.md:
@@ -386,11 +386,15 @@ The Decimal type contains a scale, but no precision. The implied precision of a

| Exact Numeric | decimal8 | `9` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
| Date | date | `11` | DATE | 4 byte little-endian |
-| Timestamp | timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian |
-| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian |
+| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |

Review Comment:

I wonder if we should just eliminate the first column (and annotate logical type with additional information like exact numeric).
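As a small aside on the rows being edited: both timestamp variants store their value as an 8-byte little-endian count of microseconds since the Unix epoch (only the isAdjustedToUTC semantics differ). A hedged sketch of just that payload encoding, with illustrative names and ignoring the surrounding Variant header/type bytes:

```
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Illustrative only: the 8-byte little-endian value portion of a Variant timestamp(MICROS);
// the Variant header and type bytes are out of scope here.
final class VariantTimestampPayloadSketch {

  /** Encodes an instant as microseconds since the Unix epoch, little-endian. */
  static byte[] timestampMicrosPayload(Instant ts) {
    long micros = ChronoUnit.MICROS.between(Instant.EPOCH, ts);
    return ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN).putLong(micros).array();
  }
}
```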
[I] should not use seek() for skipping very small column chunks. better to read and ignore data. [parquet-java]
Arnaud-Nauwynck opened a new issue, #3076: URL: https://github.com/apache/parquet-java/issues/3076

### Describe the enhancement requested

When reading some column chunks but not all, parquet builds a list of "ConsecutivePartList", then tries to call the Hadoop API for the vectorized reader, FSDataInputStream#readVectored(List ...).

Unfortunately, many implementations of "FSDataInputStream" do not override the readVectored() method, which triggers many distinct read calls. For example on hadoop-azure, Azure Datalake Storage is much slower at establishing a new HTTPS connection (using the infamous HttpURLConnection from JDK 1.0, then doing the TLS handshake) than at fetching a few more megabytes of data on an existing socket!

The case of small holes we try to avoid reading is very frequent when a parquet file has columns that are not read and are highly compressed thanks to RLE encoding: typically, a very sparse column with only a few values, or even always null within a page. Such a column can be encoded in only a few hundred bytes by parquet, so reading 100 extra bytes is NOT a problem.

Parquet should at least honor the following method from the Hadoop class FileSystem, which says that a seek of less than 4096 bytes is NOT reasonable:

```
/**
 * What is the smallest reasonable seek?
 * @return the minimum number of bytes
 */
default int minSeekForVectorReads() {
  return 4 * 1024;
}
```

The logic for building this list for a list of column chunks is in org.apache.parquet.hadoop.ParquetFileReader#internalReadRowGroup:

```
private ColumnChunkPageReadStore internalReadRowGroup(int blockIndex) throws IOException {
  ...
  for (ColumnChunkMetaData mc : block.getColumns()) {
    ...
    // first part or not consecutive => new list
    if (currentParts == null || currentParts.endPos() != startingPos) {  // <= SHOULD honor minSeekForVectorReads()
      currentParts = new ConsecutivePartList(startingPos);
      allParts.add(currentParts);
    }
    currentParts.addChunk(new ChunkDescriptor(columnDescriptor, mc, startingPos, mc.getTotalSize()));
  }

  // actually read all the chunks
  ChunkListBuilder builder = new ChunkListBuilder(block.getRowCount());
  readAllPartsVectoredOrNormal(allParts, builder);
  rowGroup.setReleaser(builder.releaser);
  for (Chunk chunk : builder.build()) {
    readChunkPages(chunk, block, rowGroup);
  }
  return rowGroup;
}
```

Maybe a possible implementation could be to add fictitious "ConsecutivePartList" entries whose data is ignored when received, but which avoid leaving holes in the ranges to read.

### Component(s)

_No response_
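A minimal sketch of the range-merging idea proposed above, assuming sorted, non-overlapping byte ranges; ByteRange and mergeSmallGaps are hypothetical names, not Parquet or Hadoop classes. Ranges separated by a hole smaller than the minimum-seek threshold are coalesced, so the hole is read and discarded instead of seeked over:

```
import java.util.ArrayList;
import java.util.List;

// Illustrative only: coalesce sorted, non-overlapping byte ranges whose holes are
// smaller than the minimum reasonable seek, so the holes are read and ignored.
final class RangeMergerSketch {

  /** Minimal stand-in for an (offset, length) byte range; not a real Parquet class. */
  record ByteRange(long offset, long length) {
    long end() {
      return offset + length;
    }
  }

  static List<ByteRange> mergeSmallGaps(List<ByteRange> sortedRanges, long minSeekBytes) {
    List<ByteRange> merged = new ArrayList<>();
    for (ByteRange r : sortedRanges) {
      if (!merged.isEmpty()) {
        ByteRange last = merged.get(merged.size() - 1);
        long gap = r.offset() - last.end();
        if (gap >= 0 && gap < minSeekBytes) {
          // absorb the small hole: one larger read instead of a separate seek + read
          merged.set(merged.size() - 1, new ByteRange(last.offset(), r.end() - last.offset()));
          continue;
        }
      }
      merged.add(r);
    }
    return merged;
  }
}
```

In ParquetFileReader terms, the threshold would come from minSeekForVectorReads() (or a larger configured value), and the coalesced ranges would translate into fewer ConsecutivePartList entries.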
Re: [I] AzureBlobFileSystem.open() should return a sub-class fsDataInputStream that override readVectored() much more efficiently for small reads [parquet-java]
Arnaud-Nauwynck closed issue #3077: AzureBlobFileSystem.open() should return a sub-class fsDataInputStream that override readVectored() much more efficiently for small reads
URL: https://github.com/apache/parquet-java/issues/3077
Re: [I] AzureBlobFileSystem.open() should return a sub-class fsDataInputStream that override readVectored() much more efficiently for small reads [parquet-java]
Arnaud-Nauwynck commented on issue #3077: URL: https://github.com/apache/parquet-java/issues/3077#issuecomment-2495543888

Sorry, I misclicked and filed this in the parquet GitHub project instead of hadoop. I have recreated it as [HADOOP-19345](https://issues.apache.org/jira/browse/HADOOP-19345), as I had not seen your comment. I will mark it as a duplicate in Jira.
Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]
emkornfield commented on code in PR #466: URL: https://github.com/apache/parquet-format/pull/466#discussion_r1855230005

## LogicalTypes.md:
@@ -670,6 +681,13 @@ optional group array_of_arrays (LIST) {

Backward-compatibility rules

+Modern writers should always produce the 3-level LIST structure shown above.

Review Comment:

```suggestion
New writer implementations should always produce the 3-level LIST structure shown above.
```
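For readers without LogicalTypes.md at hand, the 3-level LIST structure the suggestion refers to nests a repeated `list` group holding an `element` field inside the LIST-annotated group. A hedged sketch, parsing an example schema with parquet-java's MessageTypeParser (the `my_list` field name and the int32 element type are arbitrary choices for illustration):

```
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// Illustrative: the canonical 3-level LIST layout (LIST-annotated group > repeated "list" group > "element" field).
final class ThreeLevelListExample {
  static MessageType threeLevelList() {
    return MessageTypeParser.parseMessageType(
        "message example_schema {\n"
            + "  optional group my_list (LIST) {\n"
            + "    repeated group list {\n"
            + "      optional int32 element;\n"
            + "    }\n"
            + "  }\n"
            + "}\n");
  }
}
```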
[I] AzureBlobFileSystem.open() should return a sub-class fsDataInputStream that override readVectored() much more efficiently for small reads [parquet-java]
Arnaud-Nauwynck opened a new issue, #3077: URL: https://github.com/apache/parquet-java/issues/3077

### Describe the enhancement requested

In hadoop-azure, there are huge performance problems when a file is read in a too fragmented way: reading many small file fragments, even through the readVectored() Hadoop API, results in distinct HTTPS requests (TCP connection established + TLS handshake + request). Internally, at the lowest level, hadoop-azure uses the HttpURLConnection class from JDK 1.0, and the read-ahead threads do not sufficiently solve all the problems.

The hadoop-azure implementation of readVectored() should make a compromise between reading extra ignored data (holes) and establishing too many HTTPS connections.

Currently, AzureBlobFileSystem#open() returns a default, inefficient implementation of readVectored():

```
private FSDataInputStream open(final Path path, final Optional parameters) throws IOException {
  ...
  InputStream inputStream = getAbfsStore().openFileForRead(qualifiedPath, parameters, statistics, tracingContext);
  return new FSDataInputStream(inputStream); // <== FSDataInputStream is not efficiently overriding readVectored() !
}
```

See the default implementation of FSDataInputStream.readVectored():

```
public void readVectored(List ranges, IntFunction allocate) throws IOException {
  ((PositionedReadable) this.in).readVectored(ranges, allocate);
}
```

It calls the underlying method from the class AbfsInputStream, which is not overridden:

```
default void readVectored(List ranges, IntFunction allocate) throws IOException {
  VectoredReadUtils.readVectored(this, ranges, allocate);
}
```

AbfsInputStream should override this method and internally accept making fewer HTTPS calls with merged ranges, ignoring some of the returned data (the holes in the ranges). This amounts to honouring the following parameter of the Hadoop FSDataInputStream (which implements PositionedReadable):

```
/**
 * What is the smallest reasonable seek?
 * @return the minimum number of bytes
 */
default int minSeekForVectorReads() {
  return 4 * 1024;
}
```

Even this 4096 value is very conservative; it should be redefined by the Abfs file system to 4 MB or even 8 MB. Asking ChatGPT "on Azure Storage, what is the speed of getting 8 MB of a page blob, compared to the time to establish an HTTPS TLS handshake?" gives (untrusted) figures of ~100-300 ms for the HTTPS/TLS handshake, which is generally slower than downloading 8 MB from a page blob: ~100-200 ms on the Standard tier, ~30-50 ms on the Premium tier.

The Azure ABFS client already sets up a lot of read-ahead prefetch threads by default, prefetching 4 MB of data, but that is NOT sufficient, and it is less efficient than simply implementing correctly what is already in the Hadoop API: readVectored(). It also has the drawback of reading tons of useless data (past the parquet blocks) that is never used.

### Component(s)

_No response_
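To complement the issue above, a hedged sketch (not the real AbfsInputStream) of the "merge and discard the holes" behaviour an override of readVectored() could adopt: serve several small, sorted requested ranges from one positioned read that covers them all, ignoring the bytes in between. Range, PositionedReader, and readCoalesced are illustrative names only.

```
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: serve many small requested ranges from one large positioned read,
// discarding the bytes that fall in the holes between them.
final class CoalescedVectoredReadSketch {

  /** Minimal stand-in for a requested (offset, length) byte range. */
  record Range(long offset, int length) {}

  /** Minimal stand-in for PositionedReadable-style access to the remote blob. */
  interface PositionedReader {
    void readFully(long position, byte[] buffer, int bufOffset, int length) throws IOException;
  }

  /**
   * Reads the span covering all requested ranges with a single call (one HTTPS request
   * in the ABFS case), then slices out each requested range.
   * Assumes {@code requested} is non-empty, sorted by offset, and non-overlapping.
   */
  static List<ByteBuffer> readCoalesced(PositionedReader in, List<Range> requested) throws IOException {
    long start = requested.get(0).offset();
    Range last = requested.get(requested.size() - 1);
    int spanLen = (int) (last.offset() + last.length() - start);

    byte[] span = new byte[spanLen];
    in.readFully(start, span, 0, spanLen); // one request instead of requested.size()

    List<ByteBuffer> results = new ArrayList<>(requested.size());
    for (Range r : requested) {
      // wrap the sub-range of the shared buffer; hole bytes are simply never exposed
      results.add(ByteBuffer.wrap(span, (int) (r.offset() - start), r.length()).slice());
    }
    return results;
  }
}
```

A production implementation would also cap the merged span size and fall back to separate requests when the holes exceed a threshold in the spirit of minSeekForVectorReads().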
Re: [PR] Add map_no_value.parquet [parquet-testing]
alamb commented on PR #63: URL: https://github.com/apache/parquet-testing/pull/63#issuecomment-2495459421

> Probably...I've never touched that behavior because I don't know if it is intentional or not.

I vaguely remember the original rationale being it was impossible to decode ranges (e.g. rows 100-200) if you didn't have the offset index, and since it was small it was ok to always write.