Jack, thanks for the follow-up.
The issue was with multipart upload (MPU) support in our internal S3 backend: it
currently doesn't accept content-type 'binary/octet-stream'. That also explains
why only the larger writes failed, since S3FileIO only initiates a multipart
upload once the data written crosses the multipart size threshold. The upload
worked when we changed the request to the following:
CreateMultipartUploadRequest.builder().bucket(bucket).key(key).contentType("application/octet-stream").build();
We are adding support for 'binary/octet-stream' to our backend, which will fix
the issue without any code change to Iceberg.
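In case it is useful to others, a minimal standalone check along these lines
(the endpoint, region, bucket, and class name below are placeholders, not our
real setup) should show whether a backend accepts the content type that
S3FileIO sends when it initiates a multipart upload:

import java.net.URI;

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.AbortMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.CreateMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.CreateMultipartUploadResponse;

public class MpuContentTypeCheck {
  public static void main(String[] args) {
    // Client pointed at the internal S3 backend; endpoint and region are
    // placeholders, and credentials come from the default provider chain.
    S3Client s3 = S3Client.builder()
        .endpointOverride(URI.create("https://internal-s3.example.com"))
        .region(Region.US_EAST_1)
        .build();

    // Initiate an MPU with the content type S3FileIO sends. A 415 "unsupported
    // media type" here reproduces the failure; switching the content type to
    // "application/octet-stream" should succeed.
    CreateMultipartUploadResponse resp = s3.createMultipartUpload(
        CreateMultipartUploadRequest.builder()
            .bucket("test-bucket")
            .key("mpu-content-type-check")
            .contentType("binary/octet-stream")
            .build());

    // Abort the upload so no orphaned parts are left behind.
    s3.abortMultipartUpload(AbortMultipartUploadRequest.builder()
        .bucket("test-bucket")
        .key("mpu-content-type-check")
        .uploadId(resp.uploadId())
        .build());
  }
}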
Thanks,
Mayur
From: Jack Ye <[email protected]>
Sent: Wednesday, October 6, 2021 1:34 AM
To: Iceberg Dev List <[email protected]>
Subject: Re: Error when writing large number of rows with S3FileIO
Hi Mayur, sorry I did not follow up on this. Were you able to fix the issue
with the AWS SDK upgrade?
-Jack Ye
On Thu, Sep 23, 2021 at 1:13 PM Mayur Srivastava
<[email protected]> wrote:
I’ll try to upgrade the version and retry.
Thanks,
Mayur
From: Jack Ye <[email protected]>
Sent: Thursday, September 23, 2021 2:35 PM
To: Iceberg Dev List <[email protected]>
Subject: Re: Error when writing large number of rows with S3FileIO
Thanks. While I am looking into this, this seems to be a very old version; is
there any reason to use that version specifically? Have you tried a newer one?
I know there have been quite a few updates to the S3 package related to
uploading since then, so upgrading may solve the problem.
-Jack
On Thu, Sep 23, 2021 at 11:02 AM Mayur Srivastava
<[email protected]> wrote:
No problem, Jack.
I’m using https://mvnrepository.com/artifact/software.amazon.awssdk/s3/2.10.53
Thanks,
Mayur
From: Jack Ye <[email protected]>
Sent: Thursday, September 23, 2021 1:24 PM
To: Iceberg Dev List <[email protected]>
Subject: Re: Error when writing large number of rows with S3FileIO
Hi Mayur,
Thanks for reporting this issue. Could you share which version of the AWS SDK
v2 you are using?
Best,
Jack Ye
On Thu, Sep 23, 2021 at 8:39 AM Mayur Srivastava
<[email protected]> wrote:
Hi,
I have an Iceberg table with 400+ columns and >100k rows, partitioned monthly
by a single "time" column. I'm using Parquet files and
PartitionedWriter<Record> + S3FileIO to write the data. Writing fewer than ~50k
rows works, but the writer fails with the exception below when I write more
than ~50k rows. It does, however, handle the full >100k rows if I use
HadoopFileIO.
Has anyone seen this error before, and is there a known way to fix it?
The writer code is as follows:
AppendFiles append = table.newAppend();
for (GenericRecord record : records) {
  writer.write(record);
}
// Close the task writer and append its completed data files in one commit.
Arrays.stream(writer.complete().dataFiles()).forEach(append::appendFile);
append.commit();
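For context, the S3FileIO is constructed over our internal S3 backend roughly
as in the sketch below (the class name and endpoint are placeholders, and the
client settings are simplified):

import java.net.URI;

import org.apache.iceberg.aws.s3.S3FileIO;
import org.apache.iceberg.io.FileIO;
import software.amazon.awssdk.services.s3.S3Client;

public class InternalS3FileIO {
  // S3FileIO accepts a serializable supplier, so the client is only created
  // on first use; the endpoint is a placeholder for our internal backend.
  static FileIO create() {
    return new S3FileIO(() -> S3Client.builder()
        .endpointOverride(URI.create("https://internal-s3.example.com"))
        .build());
  }
}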
Thanks,
Mayur
software.amazon.awssdk.services.s3.model.S3Exception: The specified media type is unsupported. Content type binary/octet-stream is not legal. (Service: S3, Status Code: 415, Request ID: xxxxxx)
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:158)
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108)
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:86)
    at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:44)
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:94)
    at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$4(BaseClientHandler.java:215)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:40)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:30)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:74)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:43)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage$RetryExecutor.doExecute(RetryableStage.java:114)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage$RetryExecutor.execute(RetryableStage.java:87)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:63)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:43)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:57)
    at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:37)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:81)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:61)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:43)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
    at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:198)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:122)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:148)
    at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:102)
    at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
    at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:55)
    at software.amazon.awssdk.services.s3.DefaultS3Client.createMultipartUpload(DefaultS3Client.java:1410)
    at org.apache.iceberg.aws.s3.S3OutputStream.initializeMultiPartUpload(S3OutputStream.java:209)
    at org.apache.iceberg.aws.s3.S3OutputStream.write(S3OutputStream.java:168)
    at java.io.OutputStream.write(OutputStream.java:122)
    at org.apache.parquet.io.DelegatingPositionOutputStream.write(DelegatingPositionOutputStream.java:56)
    at org.apache.parquet.bytes.ConcatenatingByteArrayCollector.writeAllTo(ConcatenatingByteArrayCollector.java:46)
    at org.apache.parquet.hadoop.ParquetFileWriter.writeColumnChunk(ParquetFileWriter.java:620)
    at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:241)
    at org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:319)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.iceberg.common.DynMethods$UnboundMethod.invokeChecked(DynMethods.java:65)
    at org.apache.iceberg.common.DynMethods$UnboundMethod.invoke(DynMethods.java:77)
    at org.apache.iceberg.common.DynMethods$BoundMethod.invoke(DynMethods.java:180)
    at org.apache.iceberg.parquet.ParquetWriter.flushRowGroup(ParquetWriter.java:176)
    at org.apache.iceberg.parquet.ParquetWriter.close(ParquetWriter.java:211)
    at org.apache.iceberg.io.DataWriter.close(DataWriter.java:71)
    at org.apache.iceberg.io.BaseTaskWriter$BaseRollingWriter.closeCurrent(BaseTaskWriter.java:282)
    at org.apache.iceberg.io.BaseTaskWriter$BaseRollingWriter.close(BaseTaskWriter.java:298)
    at org.apache.iceberg.io.PartitionedWriter.close(PartitionedWriter.java:82)
    at org.apache.iceberg.io.BaseTaskWriter.complete(BaseTaskWriter.java:83)