Re: [PR] PARQUET-2416: Use 'mapreduce.outputcommitter.factory.class' in ParquetOutpuFormat [parquet-java]

2024-11-23 Thread via GitHub


Arnaud-Nauwynck closed pull request #1244: PARQUET-2416: Use 
'mapreduce.outputcommitter.factory.class' in ParquetOutpuFormat
URL: https://github.com/apache/parquet-java/pull/1244





[I] read footer using 1 call readFully(byte[8]) instead of 5 calls ( 4 x read() for footer length + 1 x read(byte[4]) for magic marker ) [parquet-java]

2024-11-23 Thread via GitHub


Arnaud-Nauwynck opened a new issue, #3074:
URL: https://github.com/apache/parquet-java/issues/3074

   ### Describe the enhancement requested
   
   This is a minor performance improvement, but worthwhile when reading many files:
   read the footer using 1 call to readFully(byte[8]) instead of 5 calls (4 x read()
   for the footer length + 1 x read(byte[4]) for the magic marker).
   
   In summary, the patch is in ParquetFileReader.java, method readFooter():
   
   > -byte[] magic = new byte[MAGIC.length];
   > -long fileMetadataLengthIndex = fileLen - magic.length - FOOTER_LENGTH_SIZE;
   > +long fileMetadataLengthIndex = fileLen - MAGIC.length - FOOTER_LENGTH_SIZE;
   >  LOG.debug("reading footer index at {}", fileMetadataLengthIndex);
   >  f.seek(fileMetadataLengthIndex);
   > -int fileMetadataLength = readIntLittleEndian(f);
   > -f.readFully(magic);
   > +byte[] magicAndLengthBytes = new byte[FOOTER_LENGTH_SIZE + MAGIC.length];
   > +f.readFully(magicAndLengthBytes);
   > +int fileMetadataLength = readIntLittleEndian(magicAndLengthBytes, 0);
   > 
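   
   A minimal standalone sketch of the combined read (the FooterTailExample class,
   the readFooterLength() helper, the RandomAccessFile usage and the inline
   little-endian decoding are illustrative, not the actual ParquetFileReader code):
   
   ```
   import java.io.IOException;
   import java.io.RandomAccessFile;
   import java.nio.charset.StandardCharsets;
   import java.util.Arrays;
   
   // Hypothetical illustration of reading the Parquet footer tail in one call.
   final class FooterTailExample {
     private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);
     private static final int FOOTER_LENGTH_SIZE = 4;
   
     /** Reads the footer length with a single 8-byte readFully() at the tail of the file. */
     static int readFooterLength(RandomAccessFile f) throws IOException {
       long fileLen = f.length();
       f.seek(fileLen - MAGIC.length - FOOTER_LENGTH_SIZE);
   
       // one readFully(byte[8]) instead of 4 x read() + 1 x read(byte[4])
       byte[] lengthAndMagic = new byte[FOOTER_LENGTH_SIZE + MAGIC.length];
       f.readFully(lengthAndMagic);
   
       // verify the magic marker carried in the last 4 bytes
       byte[] magic = Arrays.copyOfRange(lengthAndMagic, FOOTER_LENGTH_SIZE, lengthAndMagic.length);
       if (!Arrays.equals(MAGIC, magic)) {
         throw new IOException("not a Parquet file: magic marker missing at tail");
       }
   
       // decode the 4-byte little-endian footer length from the first 4 bytes
       return (lengthAndMagic[0] & 0xFF)
           | (lengthAndMagic[1] & 0xFF) << 8
           | (lengthAndMagic[2] & 0xFF) << 16
           | (lengthAndMagic[3] & 0xFF) << 24;
     }
   }
   ```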
   
   ### Component(s)
   
   Core





[PR] GH-3074: read footer using 1 call readFully(byte[8]) instead of 5 calls [parquet-java]

2024-11-23 Thread via GitHub


Arnaud-Nauwynck opened a new pull request, #3075:
URL: https://github.com/apache/parquet-java/pull/3075

   GH-3074: read footer using 1 call readFully(byte[8]) instead of 5 calls
   
   ### Rationale for this change
   Performance.
   
   ### What changes are included in this PR?
   Only the method ParquetFileReader.readFooter().
   
   ### Are these changes tested?
   
   
   ### Are there any user-facing changes?
   no
   
   
   
   
   





Re: [I] should not use seek() for skipping very small column chunks. better to read and ignore data. [parquet-java]

2024-11-23 Thread via GitHub


Arnaud-Nauwynck commented on issue #3076:
URL: https://github.com/apache/parquet-java/issues/3076#issuecomment-2495541296

   See also the related issue:
   [https://github.com/apache/parquet-java/issues/3077](https://github.com/apache/parquet-java/issues/3077):
   AzureBlobFileSystem.open() should return a sub-class fsDataInputStream that
   override readVectored() much more efficiently for small reads





Re: [PR] GH-463: Add more types - time, nano timestamps, UUID to Variant spec [parquet-format]

2024-11-23 Thread via GitHub


emkornfield commented on code in PR #464:
URL: https://github.com/apache/parquet-format/pull/464#discussion_r1855229728


##
VariantEncoding.md:
##
@@ -386,11 +386,15 @@ The Decimal type contains a scale, but no precision. The implied precision of a
 | Exact Numeric | decimal8  | `9`  | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
 | Exact Numeric | decimal16 | `10` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
 | Date          | date      | `11` | DATE | 4 byte little-endian |
-| Timestamp     | timestamp | `12` | TIMESTAMP(true, MICROS)  | 8-byte little-endian |
-| TimestampNTZ  | timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian |
+| Timestamp     | timestamp with time zone    | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |

Review Comment:
   I wonder if we should just eliminate the first column (and annotate logical 
type with additional information like exact numeric).






[I] should not use seek() for skipping very small column chunks. better to read and ignore data. [parquet-java]

2024-11-23 Thread via GitHub


Arnaud-Nauwynck opened a new issue, #3076:
URL: https://github.com/apache/parquet-java/issues/3076

   ### Describe the enhancement requested
   
   When reading some column chunks but not all of them, Parquet builds a list of
   "ConsecutivePartList", then tries to call the Hadoop vectored-read API
   FSDataInputStream#readVectored(List<? extends FileRange>, IntFunction<ByteBuffer>).
   
   Unfortunately, many implementations of "FSDataInputStream" do not override the
   readVectored() method, which triggers many distinct calls to read().
   
   For example with hadoop-azure, Azure Data Lake Storage is much slower at
   establishing a new HTTPS connection (using the infamous HttpURLConnection class
   from JDK 1.0, then doing the TLS handshake) than at fetching a few more megabytes
   of data on an existing socket!
   
   The case of small holes to skip is very frequent when a Parquet file has columns
   that are not read and that compress extremely well thanks to RLE encoding:
   typically a very sparse column with only a few values, or even always null within
   a page. Such a column can be encoded by Parquet in only a few hundred bytes, so
   reading an extra 100 bytes is NOT a problem.
   
   
   Parquet should at least honor the following method from Hadoop's
   PositionedReadable interface (implemented by FSDataInputStream), which says that
   a seek of less than 4096 bytes is NOT reasonable:
   ```
 /**
  * What is the smallest reasonable seek?
  * @return the minimum number of bytes
  */
 default int minSeekForVectorReads() {
   return 4 * 1024;
 }
   ```
   
   The logic for building this List<ConsecutivePartList> for a list of column
   chunks is in org.apache.parquet.hadoop.ParquetFileReader#internalReadRowGroup:
   ```
   private ColumnChunkPageReadStore internalReadRowGroup(int blockIndex) throws IOException {
     ...
     for (ColumnChunkMetaData mc : block.getColumns()) {
       ...
       // first part or not consecutive => new list
       if (currentParts == null || currentParts.endPos() != startingPos) {  // <= SHOULD honor minSeekForVectorReads()
         currentParts = new ConsecutivePartList(startingPos);
         allParts.add(currentParts);
       }
       currentParts.addChunk(new ChunkDescriptor(columnDescriptor, mc, startingPos, mc.getTotalSize()));
     }
     // actually read all the chunks
     ChunkListBuilder builder = new ChunkListBuilder(block.getRowCount());
     readAllPartsVectoredOrNormal(allParts, builder);
     rowGroup.setReleaser(builder.releaser);
     for (Chunk chunk : builder.build()) {
       readChunkPages(chunk, block, rowGroup);
     }
   
     return rowGroup;
   }
   ```
   
   Maybe a possible implementation could be to add fictive "ConsecutivePartList"
   entries whose data is ignored on receipt, which would avoid leaving small holes
   in the ranges to read.
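   
   A minimal, hypothetical sketch of that grouping rule (the ChunkRange record, the
   group() method and the threshold are illustrative, not the actual
   ParquetFileReader types): chunks separated by a gap smaller than
   minSeekForVectorReads() stay in the same part, and the gap bytes are simply read
   and discarded.
   ```
   import java.util.ArrayList;
   import java.util.List;
   
   // Hypothetical illustration of grouping column chunks into read ranges
   // while tolerating small gaps (read-and-ignore) instead of extra seeks.
   final class GapTolerantGrouping {
   
     record ChunkRange(long offset, long length) {}
   
     /** Groups sorted chunk ranges; a gap smaller than minSeek is absorbed into the current group. */
     static List<List<ChunkRange>> group(List<ChunkRange> sortedChunks, long minSeek) {
       List<List<ChunkRange>> groups = new ArrayList<>();
       List<ChunkRange> current = null;
       long currentEnd = -1;
       for (ChunkRange c : sortedChunks) {
         // start a new group only when the hole is large enough to justify a seek
         if (current == null || c.offset() - currentEnd > minSeek) {
           current = new ArrayList<>();
           groups.add(current);
         }
         current.add(c);
         currentEnd = c.offset() + c.length();
       }
       return groups;
     }
   }
   ```
   With a 4096-byte threshold, two chunks separated by a 300-byte hole of
   RLE-encoded nulls would land in the same group and be fetched in one read.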
   
   
   
   
   
   
   ### Component(s)
   
   _No response_





Re: [I] AzureBlobFileSystem.open() should return a sub-class fsDataInputStream that override readVectored() much more efficiently for small reads [parquet-java]

2024-11-23 Thread via GitHub


Arnaud-Nauwynck closed issue #3077: AzureBlobFileSystem.open() should return a 
sub-class fsDataInputStream that override readVectored() much more efficiently 
for small reads
URL: https://github.com/apache/parquet-java/issues/3077





Re: [I] AzureBlobFileSystem.open() should return a sub-class fsDataInputStream that override readVectored() much more efficiently for small reads [parquet-java]

2024-11-23 Thread via GitHub


Arnaud-Nauwynck commented on issue #3077:
URL: https://github.com/apache/parquet-java/issues/3077#issuecomment-2495543888

   Sorry, I misclicked and opened this in the parquet GitHub project instead of
   hadoop. I have recreated it as
   [HADOOP-19345](https://issues.apache.org/jira/browse/HADOOP-19345), as I had not
   seen your comment.
   Will mark it as a duplicate in Jira.





Re: [PR] GH-465: Clarify backward-compatibility rules on LIST type [parquet-format]

2024-11-23 Thread via GitHub


emkornfield commented on code in PR #466:
URL: https://github.com/apache/parquet-format/pull/466#discussion_r1855230005


##
LogicalTypes.md:
##
@@ -670,6 +681,13 @@ optional group array_of_arrays (LIST) {
 
  Backward-compatibility rules
 
+Modern writers should always produce the 3-level LIST structure shown above.

Review Comment:
   ```suggestion
   New writer implementations should always produce the 3-level LIST structure shown above.
   ```
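   
   For reference, a concrete instance of the 3-level structure being referred to
   (an illustrative LIST of int32, following the canonical form documented earlier
   in LogicalTypes.md):
   ```
   optional group my_list (LIST) {
     repeated group list {
       optional int32 element;
     }
   }
   ```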






[I] AzureBlobFileSystem.open() should return a sub-class fsDataInputStream that override readVectored() much more efficiently for small reads [parquet-java]

2024-11-23 Thread via GitHub


Arnaud-Nauwynck opened a new issue, #3077:
URL: https://github.com/apache/parquet-java/issues/3077

   ### Describe the enhancement requested
   
   In hadoop-azure, there are huge performance problems when a file is read in a
   too-fragmented way: many small file fragments are fetched, even with the
   readVectored() Hadoop API, each resulting in a distinct HTTPS request (TCP
   connection establishment + TLS handshake + request).
   Internally, at the lowest level, hadoop-azure uses the HttpURLConnection class
   from JDK 1.0, and the read-ahead threads do not sufficiently solve the problem.
   The hadoop-azure implementation of readVectored() should make a compromise
   between reading extra, ignored data in the holes and establishing too many HTTPS
   connections.
   
   Currently, AzureBlobFileSystem#open() returns a stream with the default,
   inefficient implementation of readVectored():
   ```
   private FSDataInputStream open(final Path path,
       final Optional<OpenFileParameters> parameters) throws IOException {
     ...
     InputStream inputStream =
         getAbfsStore().openFileForRead(qualifiedPath, parameters, statistics, tracingContext);
     return new FSDataInputStream(inputStream);  // <== FSDataInputStream does not efficiently override readVectored()!
   }
   ```
   
   See the default implementation of FSDataInputStream.readVectored():
   ```
   public void readVectored(List<? extends FileRange> ranges,
       IntFunction<ByteBuffer> allocate) throws IOException {
     ((PositionedReadable) this.in).readVectored(ranges, allocate);
   }
   ```
   
   It calls the underlying method on the AbfsInputStream class, which does not
   override the PositionedReadable default:
   ```
   default void readVectored(List<? extends FileRange> ranges,
       IntFunction<ByteBuffer> allocate) throws IOException {
     VectoredReadUtils.readVectored(this, ranges, allocate);
   }
   ```
   
   AbfsInputStream should override this method and internally accept doing fewer
   HTTPS calls, with merged ranges, ignoring some of the returned data (the holes
   between ranges).
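   
   A minimal, hypothetical sketch of that merge-then-slice idea, written against
   plain offsets and lengths rather than the real Hadoop FileRange/AbfsInputStream
   APIs (the Range/MergedRange types and the gap threshold are illustrative):
   nearby ranges are coalesced into one larger GET, and each requested slice is then
   cut out of the merged buffer.
   ```
   import java.util.ArrayList;
   import java.util.List;
   
   // Hypothetical illustration: coalesce nearby read ranges into one larger request,
   // then slice each requested range out of the merged buffer (holes are read and dropped).
   final class MergedVectoredRead {
   
     record Range(long offset, int length) {}
   
     record MergedRange(long offset, long length, List<Range> children) {}
   
     /** Merges sorted, non-overlapping ranges whose gap is below maxGapBytes (e.g. a few MB on ABFS). */
     static List<MergedRange> merge(List<Range> sortedRanges, long maxGapBytes) {
       List<MergedRange> merged = new ArrayList<>();
       long start = -1, end = -1;
       List<Range> children = new ArrayList<>();
       for (Range r : sortedRanges) {
         if (start >= 0 && r.offset() - end > maxGapBytes) {
           // the hole is too large: flush the current merged range and start a new one
           merged.add(new MergedRange(start, end - start, children));
           children = new ArrayList<>();
           start = -1;
         }
         if (start < 0) {
           start = r.offset();
         }
         children.add(r);
         end = r.offset() + r.length();
       }
       if (start >= 0) {
         merged.add(new MergedRange(start, end - start, children));
       }
       return merged;
     }
   }
   ```
   With a gap threshold of a few megabytes, the extra bytes read in the holes cost
   far less than the extra HTTPS handshakes they avoid, which is exactly the
   trade-off described above.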
   
   This amounts to honouring the following parameter of Hadoop's PositionedReadable
   (implemented by FSDataInputStream):
   ```
 /**
  * What is the smallest reasonable seek?
  * @return the minimum number of bytes
  */
 default int minSeekForVectorReads() {
   return 4 * 1024;
 }
   ```
   Even this 4096-byte value is very conservative, and should be redefined for ABFS
   to be 4 MB or even 8 MB.
   
   Asking ChatGPT: "on Azure Storage, what is the speed of getting 8 MB of a page
   blob, compared to the time to establish an HTTPS TLS handshake?"
   The (untrusted) response says: an HTTPS/TLS handshake (~100–300 ms) is generally
   slower than downloading 8 MB from a page blob (~100–200 ms on Standard tier,
   ~30–50 ms on Premium tier).
   
   The Azure ABFS client already sets up, by default, a pool of read-ahead threads
   to prefetch 4 MB of data, but this is NOT sufficient, and it is less efficient
   than simply implementing correctly what is already in the Hadoop API:
   readVectored(). It also has the drawback of reading lots of useless data (beyond
   the requested Parquet blocks) that is never used.
   
   
   ### Component(s)
   
   _No response_





Re: [PR] Add map_no_value.parquet [parquet-testing]

2024-11-23 Thread via GitHub


alamb commented on PR #63:
URL: https://github.com/apache/parquet-testing/pull/63#issuecomment-2495459421

   > Probably...I've never touched that behavior because I don't know if it is 
intentional or not.
   
   I vaguely remember the original rationale being it was impossible to decode 
ranges (e.g. rows 100-200) if you didn't have the offset index, and since it 
was small it was ok to always write.
   
   

