Hi there,

Using Flink 1.9.1, I'm trying to write .tgz files with the StreamingFileSink using a BulkWriter. It seems like flushing the output stream doesn't flush all the data written. I've verified that I can create valid files using the same APIs and data on their own, so I'm thinking it must be something I'm doing wrong with the bulk format. I'm writing to the local filesystem, with the `file://` protocol.
For tar/gzipping, I'm using the Apache Commons Compress library, version 1.20. Here's a runnable example of the issue:

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;
import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

class Scratch {

  public static class Record implements Serializable {
    private static final long serialVersionUID = 1L;

    String id;

    public Record() {}

    public Record(String id) {
      this.id = id;
    }

    public String getId() {
      return id;
    }

    public void setId(String id) {
      this.id = id;
    }
  }

  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Standalone control case: the same APIs and data produce a valid archive.
    TarArchiveOutputStream taos = new TarArchiveOutputStream(
        new GzipCompressorOutputStream(new FileOutputStream("/home/austin/Downloads/test.tgz")));
    TarArchiveEntry fileEntry = new TarArchiveEntry(String.format("%s.txt", "test"));
    String fullText = "hey\nyou\nwork";
    byte[] fullTextData = fullText.getBytes();
    fileEntry.setSize(fullTextData.length);
    taos.putArchiveEntry(fileEntry);
    taos.write(fullTextData, 0, fullTextData.length);
    taos.closeArchiveEntry();
    taos.flush();
    taos.close();

    StreamingFileSink<Record> textSink = StreamingFileSink
        .forBulkFormat(new Path("file:///home/austin/Downloads/text-output"),
            new BulkWriter.Factory<Record>() {
              @Override
              public BulkWriter<Record> create(FSDataOutputStream out) throws IOException {
                final TarArchiveOutputStream compressedOutputStream =
                    new TarArchiveOutputStream(new GzipCompressorOutputStream(out));

                return new BulkWriter<Record>() {
                  @Override
                  public void addElement(Record record) throws IOException {
                    TarArchiveEntry fileEntry = new TarArchiveEntry(String.format("%s.txt", record.id));
                    byte[] fullTextData = "hey\nyou\nplease\nwork".getBytes(StandardCharsets.UTF_8);
                    fileEntry.setSize(fullTextData.length);
                    compressedOutputStream.putArchiveEntry(fileEntry);
                    compressedOutputStream.write(fullTextData, 0, fullTextData.length);
                    compressedOutputStream.closeArchiveEntry();
                  }

                  @Override
                  public void flush() throws IOException {
                    compressedOutputStream.flush();
                  }

                  @Override
                  public void finish() throws IOException {
                    this.flush();
                  }
                };
              }
            })
        .withBucketCheckInterval(1000)
        .build();

    env
        .fromElements(new Record("1"), new Record("2"))
        .addSink(textSink)
        .name("Streaming File Sink")
        .uid("streaming-file-sink");

    env.execute("streaming file sink test");
  }
}

From the stat/hex dumps below, you can see that the standalone file is a valid 114-byte archive, while the sink's part file only gets the first bytes — the gzip header — before being cut off:

~/Downloads » stat test.tgz
  File: test.tgz
  Size: 114        Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d Inode: 30041077    Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  austin)   Gid: ( 1000/  austin)
Access: 2020-02-21 19:30:06.009028283 -0500
Modify: 2020-02-21 19:30:44.509424406 -0500
Change: 2020-02-21 19:30:44.509424406 -0500
 Birth: -

~/Downloads » tar -tvf test.tgz
-rw-r--r-- 0/0              12 2020-02-21 19:35 test.txt

~/Downloads » hd test.tgz
00000000  1f 8b 08 00 00 00 00 00 00 ff ed cf 31 0e 80 20  |............1.. |
00000010  0c 85 61 66 4f c1 09 cc 2b 14 3c 8f 83 89 89 03  |..afO...+.<.....|
00000020  09 94 a8 b7 77 30 2e ae 8a 2e fd 96 37 f6 af 4c  |....w0......7..L|
00000030  45 7a d9 c4 34 04 02 22 b3 c5 e9 be 00 b1 25 1f  |Ez..4.."......%.|
00000040  1d 63 f0 81 82 05 91 77 d1 58 b4 8c ba d4 22 63  |.c.....w.X...."c|
00000050  36 78 7c eb fe dc 0b 69 5f 98 a7 bd db 53 ed d6  |6x|....i_....S..|
00000060  94 97 bf 5b 94 52 4a 7d e7 00 4d ce eb e7 00 08  |...[.RJ}..M.....|
00000070  00 00                                            |..|
00000072

text-output/37 » tar -xzf part-0-0
gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now

text-output/37 » stat part-0-0
  File: part-0-0
  Size: 10         Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d Inode: 4590487     Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  austin)   Gid: ( 1000/  austin)
Access: 2020-02-21 19:33:06.258888702 -0500
Modify: 2020-02-21 19:33:04.466870139 -0500
Change: 2020-02-21 19:33:05.294878716 -0500
 Birth: -

text-output/37 » hd part-0-0
00000000  1f 8b 08 00 00 00 00 00 00 ff                    |..........|
0000000a

Is there anything simple I'm missing?

Best,
Austin
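P.S. To narrow down where the bytes go, I put together a minimal repro using only java.util.zip — no Flink, no Commons Compress; the class and method names below are just mine for the sketch. It shows that flush() on a gzip stream doesn't drain the deflater, so for a small input only the 10-byte gzip header reaches the underlying stream (the same 10 bytes my part-0-0 file ends up with), while finish() writes out the compressed data and trailer. So I'm wondering whether my BulkWriter#finish should be calling finish() on the compressor streams rather than just flush().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipFlushRepro {

  // flush() only flushes the wrapped stream; the deflater keeps buffering,
  // so for a small input nothing but the 10-byte gzip header gets through.
  static int bytesAfterFlush(byte[] data) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(baos);
    gzip.write(data);
    gzip.flush();
    return baos.size();
  }

  // finish() drains the deflater and writes the gzip trailer, yielding a
  // complete, decodable stream (without closing the wrapped stream).
  static byte[] bytesAfterFinish(byte[] data) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(baos);
    gzip.write(data);
    gzip.finish();
    return baos.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "hey\nyou\nwork".getBytes(StandardCharsets.UTF_8);
    System.out.println("after flush():  " + bytesAfterFlush(data) + " bytes");

    byte[] full = bytesAfterFinish(data);
    System.out.println("after finish(): " + full.length + " bytes");

    // Round-trip to confirm the finished stream is a valid gzip file.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(full))) {
      int b;
      while ((b = in.read()) != -1) {
        out.write(b);
      }
    }
    System.out.println("round-trip:     " + new String(out.toByteArray(), StandardCharsets.UTF_8));
  }
}
```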