[ https://issues.apache.org/jira/browse/FLINK-35536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Juliusz Nadberezny updated FLINK-35536:
---------------------------------------
    Description:
Compaction on the FileSystem sink on S3 uses the multipart upload process.
When compaction is turned on, everything works as expected and the sink produces correct files.
The problem appears when you disable compaction for a sink that previously had it enabled. In this case, files that were being kept by multipart upload and now are "released" with CompleteMultipartUpload will be broken.
Broken Avro files seem to have the Avro schema duplicated at the beginning of the file.

Attached please find:
1. Implementation of RecordWiseFileCompactor.Reader.Factory that we are using.
2. FileSink definition

Steps to reproduce:
1. Deploy a job with a FileSystem sink with compaction enabled, writing to S3/MinIO.
2. Wait for the job to produce some output.
3. Redeploy the job with compaction disabled.
4. Wait for the multipart upload to complete and verify the released files.

  was:
Compaction on the FileSystem sink on S3 uses the multipart upload process.
When compaction is turned on, everything works as expected and the sink produces correct files.
The problem appears when you disable compaction for a sink that previously had it enabled. In this case, files that were being kept by multipart upload and then are "released" with CompleteMultipartUpload will be broken.
Broken Avro files seem to have the Avro schema duplicated at the beginning of the file.

Attached please find:
1. Implementation of RecordWiseFileCompactor.Reader.Factory that we are using.
2. FileSink definition

Steps to reproduce:
1. Deploy a job with a FileSystem sink with compaction enabled, writing to S3/MinIO.
2. Wait for the job to produce some output.
3. Redeploy the job with compaction disabled.
4. Wait for the multipart upload to complete and verify the released files.
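
The verification in step 4 could be sketched roughly as follows. This is only a heuristic and is not taken from the issue or its attachments: it relies on the Avro object container layout, where a file starts with the 4-byte magic "Obj" followed by the byte 0x01 and then a metadata map containing the "avro.schema" key. The function names and the 1 MiB scan window are assumptions made for illustration. Because the same byte patterns can legitimately occur inside record data, a positive result is a hint to inspect the file further, not proof of corruption.

```python
# Heuristic check for a duplicated Avro container header at the start of a file.
# Assumption: a "broken" file repeats the header, so both the magic bytes and the
# "avro.schema" metadata key show up more than once near the beginning.

AVRO_MAGIC = b"Obj\x01"        # Avro object container file magic
SCHEMA_KEY = b"avro.schema"    # metadata key that holds the writer schema
DEFAULT_WINDOW = 1024 * 1024   # hypothetical scan window: first 1 MiB


def looks_duplicated(data: bytes, window: int = DEFAULT_WINDOW) -> bool:
    """Return True if the leading bytes appear to contain more than one header."""
    head = data[:window]
    if not head.startswith(AVRO_MAGIC):
        # Not an Avro container file at all; this check does not apply.
        return False
    return head.count(AVRO_MAGIC) > 1 and head.count(SCHEMA_KEY) > 1


def check_file(path: str, window: int = DEFAULT_WINDOW) -> bool:
    """Read the first `window` bytes of a local copy of the file and run the check."""
    with open(path, "rb") as f:
        return looks_duplicated(f.read(window), window)
```

A released object would first be downloaded from S3/MinIO and then passed to check_file("local-copy.avro"), where the file name is a placeholder. If the schema is larger than the scan window, the window would need to be increased for the second header to be seen.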
> FileSystem sink on S3 produces invalid Avros when compaction is turned off
> --------------------------------------------------------------------------
>
>                 Key: FLINK-35536
>                 URL: https://issues.apache.org/jira/browse/FLINK-35536
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / FileSystem
>   Affects Versions: 1.19.0
>            Reporter: Juliusz Nadberezny
>            Priority: Major
>         Attachments: FileSink.java, RecordWiseFileCompactorSpecificAvroReaderFactory.java
>
>
> Compaction on the FileSystem sink on S3 uses the multipart upload process.
> When compaction is turned on, everything works as expected and the sink produces correct files.
> The problem appears when you disable compaction for a sink that previously had it enabled. In this case, files that were being kept by multipart upload and now are "released" with CompleteMultipartUpload will be broken.
> Broken Avro files seem to have the Avro schema duplicated at the beginning of the file.
>
> Attached please find:
> 1. Implementation of RecordWiseFileCompactor.Reader.Factory that we are using.
> 2. FileSink definition
>
> Steps to reproduce:
> 1. Deploy a job with a FileSystem sink with compaction enabled, writing to S3/MinIO.
> 2. Wait for the job to produce some output.
> 3. Redeploy the job with compaction disabled.
> 4. Wait for the multipart upload to complete and verify the released files.

--
This message was sent by Atlassian Jira (v8.20.10#820010)