Kimahriman commented on issue #3261:
URL: https://github.com/apache/parquet-java/issues/3261#issuecomment-3581707300

   We recently hit this as well (with a slightly different stack trace). I
   think it's simply a bug in how `CapacityByteArrayOutputStream` tries to
   catch this overflow.
   
   The check for overflow is based on `bytesUsed` 
https://github.com/apache/parquet-java/blob/master/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java#L171-L177:
   ```java
     private void addSlab(int minimumSize) {
       int nextSlabSize;
   
       // check for overflow
       try {
         Math.addExact(bytesUsed, minimumSize);
       } catch (ArithmeticException e) {
         // This is interpreted as a request for a value greater than Integer.MAX_VALUE
         // We throw OOM because that is what java.io.ByteArrayOutputStream also does
         throw new OutOfMemoryError("Size of data exceeded Integer.MAX_VALUE (" + e.getMessage() + ")");
       }
   ```
   
   But the error is actually thrown at the end of `addSlab`, when `bytesAllocated` is updated 
https://github.com/apache/parquet-java/blob/master/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java#L198C5-L198C76:
   ```java
     this.bytesAllocated = Math.addExact(this.bytesAllocated, nextSlabSize);
   ```
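   
   To illustrate why the stack trace differs from the intended OOM, here is a
   minimal sketch of what an unguarded `Math.addExact` does on overflow. The
   class `UnguardedAddExactDemo` and the helper `thrownBy` are hypothetical
   names made up for this example; only `Math.addExact` is a real API, and the
   numbers are arbitrary.
   
   ```java
   public class UnguardedAddExactDemo {
       // Hypothetical helper: reports which exception type, if any, an
       // unguarded Math.addExact throws for the given operands.
       static String thrownBy(int bytesAllocated, int nextSlabSize) {
           try {
               Math.addExact(bytesAllocated, nextSlabSize);
               return "none";
           } catch (Throwable t) {
               return t.getClass().getSimpleName();
           }
       }

       public static void main(String[] args) {
           // Small values: the sum fits in an int, nothing is thrown.
           System.out.println(thrownBy(1024, 1024));
           // Sum exceeds Integer.MAX_VALUE: addExact throws ArithmeticException,
           // not OutOfMemoryError, because nothing here translates it.
           System.out.println(thrownBy(Integer.MAX_VALUE - 10, 64));
       }
   }
   ```
   
   So if the earlier `bytesUsed` check lets a request through, an overflow at
   this line would presumably surface as a raw `ArithmeticException`.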
   
   And if you look at the `write` method 
https://github.com/apache/parquet-java/blob/master/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java#L211-L225
   
   ```java
     public void write(byte b[], int off, int len) {
       if ((off < 0) || (off > b.length) || (len < 0) || ((off + len) - b.length > 0)) {
         throw new IndexOutOfBoundsException(String.format(
             "Given byte array of size %d, with requested length(%d) and offset(%d)", b.length, len, off));
       }
       if (len > currentSlab.remaining()) {
         final int length1 = currentSlab.remaining();
         currentSlab.put(b, off, length1);
         final int length2 = len - length1;
         addSlab(length2);
         currentSlab.put(b, off + length1, length2);
       } else {
         currentSlab.put(b, off, len);
       }
       bytesUsed = Math.addExact(bytesUsed, len);
     }
   ```
   
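   The current slab is filled, a new slab is added, and the rest of the data
   goes into the new slab, but `bytesUsed` isn't updated until the very end. So
   when `addSlab` runs its overflow check, `bytesUsed` doesn't yet include the
   bytes just written to the old slab. In certain edge cases the check
   therefore passes, and the error surfaces later at the `bytesAllocated`
   update instead of as the intended OOM.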
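   
   The stale counter can be sketched in isolation. This is not the real
   class: `StaleCounterDemo` and `guardPasses` are hypothetical names, the
   variables only mimic the fields and locals in `write`/`addSlab`, and the
   numbers are chosen to hit the edge case. It shows only that the guard
   gives a different answer with the stale versus the up-to-date counter; it
   does not reproduce the `bytesAllocated` failure itself.
   
   ```java
   public class StaleCounterDemo {
       // Mirrors the guard at the top of addSlab: true if
       // bytesUsed + minimumSize still fits in an int.
       static boolean guardPasses(int bytesUsed, int minimumSize) {
           try {
               Math.addExact(bytesUsed, minimumSize);
               return true;
           } catch (ArithmeticException e) {
               return false;
           }
       }

       public static void main(String[] args) {
           int bytesUsed = Integer.MAX_VALUE - 100; // counter when write() starts
           int remaining = 60;                      // free space in the current slab
           int len = 120;                           // bytes passed to write()

           int length1 = remaining;     // copied into the current slab first
           int length2 = len - length1; // what write() passes to addSlab

           // What addSlab sees today: bytesUsed does not include length1 yet.
           System.out.println(guardPasses(bytesUsed, length2));           // true
           // What it would see if the counter were current.
           System.out.println(guardPasses(bytesUsed + length1, length2)); // false
       }
   }
   ```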
   
   As for why a page is getting this big in the first place, I assume it's
   just a weird data problem? We're still trying to figure that part out for
   our issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

