Hey everyone, I apologize if this has been asked before, but I wasn't able to find a similar problem in the archives. I have successfully configured Flume to write to s3n. However, when I turn on gzip compression, the files that end up in s3n are malformed gzip files: their "packed" (compressed) size is larger than their extracted size.
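In case it helps, this is roughly how I'm checking whether a downloaded object is a valid gzip stream. It's plain Python and nothing Flume-specific, and the file name is just an example of what the sink writes for me:

import gzip
import zlib

def is_valid_gzip(path):
    # Decompress the downloaded object end to end; a truncated or corrupt
    # stream raises EOFError, gzip.BadGzipFile (an OSError), or zlib.error.
    try:
        with gzip.open(path, "rb") as f:
            while f.read(1024 * 1024):  # read 1 MB chunks until EOF
                pass
        return True
    except (OSError, EOFError, zlib.error) as e:
        print("not a valid gzip file: %s" % e)
        return False

# example file name only -- matches my filePrefix/fileSuffix pattern below
print(is_valid_gzip("flume.myhost.1234567890.txt.gz"))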
My hypothesis is that this is due to the s3n "driver" not implementing isFileClosed(), and that the HDFS sink is not properly closing and reopening the compressed stream somewhere under the hood. This seems like it would be a common configuration scenario, though, so I'm wondering if anyone has some insight. Not sure if it matters, but I'm running on Windows Server 2008. Here is a copy of the agent configuration I'm using:

# Agent
agent.sources = http
agent.channels = s3
agent.sinks = s3

# source
agent.sources.http.type = http
agent.sources.http.bind = localhost
agent.sources.http.port = 6162
agent.sources.http.channels = s3

# route events based on event type header
agent.sources.http.selector.type = multiplexing
agent.sources.http.selector.header = event-type
#...
agent.sources.http.selector.default = s3

# s3 ###########################################################

# channel
agent.channels.s3.type = file
agent.channels.s3.checkpointDir = D:\\flume-data\\flume-file-channel\\s3\\checkpoint
agent.channels.s3.dataDirs = D:\\flume-data\\flume-file-channel\\s3\\data
agent.channels.s3.maxFileSize = 10485760

# sink
agent.sinks.s3.type = hdfs
agent.sinks.s3.channel = s3
agent.sinks.s3.hdfs.path = s3n://XXXXX:XXXX@mybucket/%{event-type}/y=%Y/m=%m/d=%d/h=%H
agent.sinks.s3.hdfs.fileType = DataStream
agent.sinks.s3.hdfs.writeFormat = Text
agent.sinks.s3.hdfs.batchSize = 10000
agent.sinks.s3.hdfs.rollCount = 10000
agent.sinks.s3.hdfs.rollInterval = 300
agent.sinks.s3.hdfs.rollSize = 0
agent.sinks.s3.hdfs.filePrefix = flume.%{host}
agent.sinks.s3.hdfs.fileSuffix = .txt
agent.sinks.s3.hdfs.timeZone = UTC
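To be clear about what I mean by "turning on gzip compression": I switch the sink to a compressed stream using the standard HDFS sink properties, roughly like this (the .gz suffix is just my choice):

agent.sinks.s3.hdfs.fileType = CompressedStream
agent.sinks.s3.hdfs.codeC = gzip
agent.sinks.s3.hdfs.fileSuffix = .txt.gz

With fileType left as DataStream everything arrives fine; it's only with CompressedStream/gzip that the objects come out malformed.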