Kimahriman commented on issue #3261: URL: https://github.com/apache/parquet-java/issues/3261#issuecomment-3581707300
We recently hit this as well (with a slightly different stack trace). I think it's simply a bug in how `CapacityByteArrayOutputStream` tries to catch this error. The check for overflow is based on `bytesUsed`:

https://github.com/apache/parquet-java/blob/master/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java#L171-L177

```java
private void addSlab(int minimumSize) {
  int nextSlabSize;

  // check for overflow
  try {
    Math.addExact(bytesUsed, minimumSize);
  } catch (ArithmeticException e) {
    // This is interpreted as a request for a value greater than Integer.MAX_VALUE
    // We throw OOM because that is what java.io.ByteArrayOutputStream also does
    throw new OutOfMemoryError("Size of data exceeded Integer.MAX_VALUE (" + e.getMessage() + ")");
  }
  // ...
```

But this error is happening at the end of `addSlab`, when `bytesAllocated` is updated:

https://github.com/apache/parquet-java/blob/master/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java#L198C5-L198C76

```java
this.bytesAllocated = Math.addExact(this.bytesAllocated, nextSlabSize);
```

And if you look at the `write` method:

https://github.com/apache/parquet-java/blob/master/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java#L211-L225

```java
public void write(byte b[], int off, int len) {
  if ((off < 0) || (off > b.length) || (len < 0) || ((off + len) - b.length > 0)) {
    throw new IndexOutOfBoundsException(String.format(
        "Given byte array of size %d, with requested length(%d) and offset(%d)", b.length, len, off));
  }
  if (len > currentSlab.remaining()) {
    final int length1 = currentSlab.remaining();
    currentSlab.put(b, off, length1);
    final int length2 = len - length1;
    addSlab(length2);
    currentSlab.put(b, off + length1, length2);
  } else {
    currentSlab.put(b, off, len);
  }
  bytesUsed = Math.addExact(bytesUsed, len);
}
```

The current slab is filled, a new slab is added, the rest of the data goes into the new slab, and `bytesUsed` isn't updated until the very end. So when `addSlab(length2)` runs, `bytesUsed` does not yet include the `length1` bytes that were just copied into the old slab. The overflow check therefore works from a stale, too-small value, and in certain edge cases it passes even though the total is about to exceed `Integer.MAX_VALUE`. The unguarded `Math.addExact` on `bytesAllocated` then throws an `ArithmeticException` instead of the intended `OutOfMemoryError`.

Now, why a page is getting this big in the first place, I assume, is just a weird data problem? We're still trying to figure that part out for our issue.
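To make the edge case concrete, here is a minimal counters-only sketch of the accounting described above. It is not the parquet-java code: the class `SlabAccountingSketch`, the `remaining` field, the `countEagerly` flag, and the assumption that a new slab is exactly `minimumSize` bytes are all made up for illustration (the real class may well allocate a larger slab). No buffers are allocated, so it runs instantly.

```java
// Counters-only model of the accounting discussed above; nothing is allocated.
// All names are hypothetical and only mirror the fields quoted from the real class.
class SlabAccountingSketch {
    int bytesUsed;      // bytes written so far
    int bytesAllocated; // total capacity of all slabs
    int remaining;      // free space left in the current slab

    SlabAccountingSketch(int bytesUsed, int remaining) {
        this.bytesUsed = bytesUsed;
        this.remaining = remaining;
        this.bytesAllocated = Math.addExact(bytesUsed, remaining);
    }

    private void addSlab(int minimumSize) {
        // Same guard as in the quoted excerpt: it only looks at bytesUsed.
        try {
            Math.addExact(bytesUsed, minimumSize);
        } catch (ArithmeticException e) {
            throw new OutOfMemoryError(
                "Size of data exceeded Integer.MAX_VALUE (" + e.getMessage() + ")");
        }
        // Assumption: the new slab is exactly minimumSize bytes.
        int nextSlabSize = minimumSize;
        // Unguarded addition, as in the quoted line; may throw ArithmeticException.
        bytesAllocated = Math.addExact(bytesAllocated, nextSlabSize);
        remaining = nextSlabSize;
    }

    // countEagerly = false mimics the quoted write(): bytesUsed is updated once, at the end.
    // countEagerly = true is one hypothetical alternative: count length1 before addSlab.
    void write(int len, boolean countEagerly) {
        if (len > remaining) {
            int length1 = remaining; // fills the current slab
            remaining = 0;
            int length2 = len - length1;
            if (countEagerly) {
                bytesUsed = Math.addExact(bytesUsed, length1);
            }
            addSlab(length2);
            remaining -= length2;
            bytesUsed = Math.addExact(bytesUsed, countEagerly ? length2 : len);
        } else {
            remaining -= len;
            bytesUsed = Math.addExact(bytesUsed, len);
        }
    }

    // Starts 100 bytes below Integer.MAX_VALUE with 50 bytes free, then writes 120 bytes.
    static String outcome(boolean countEagerly) {
        SlabAccountingSketch s = new SlabAccountingSketch(Integer.MAX_VALUE - 100, 50);
        try {
            s.write(120, countEagerly);
            return "ok";
        } catch (OutOfMemoryError | ArithmeticException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println("late update:  " + outcome(false));
        System.out.println("eager update: " + outcome(true));
    }
}
```

With the late update, the guard compares `(MAX_VALUE - 100) + 70`, which fits, and the overflow only surfaces in the `bytesAllocated` addition as an `ArithmeticException`. With the eager update, the guard sees `(MAX_VALUE - 50) + 70` and raises the intended `OutOfMemoryError`. This only illustrates the stale-counter effect; whether counting eagerly is the right fix for the real class, where the next slab size can differ from `minimumSize`, is a separate question.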
