The issue is that your max file size is really small.

This parameter:

flume_test.channels.ch1.maxFileSize = 10000000

sets the size of each data file in bytes, so you want it to be much bigger. With a cap that small, the channel rolls over to a new log-N file very quickly, which is why so many of them pile up. Why not simply leave it at the default, which keeps each file at around 1.6 GB? These are not the sizes of your HDFS files; they are the sizes of the file channel's internal data files.
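As a minimal sketch of the fix (the 1 GB value below is just an illustration, not the built-in default), you could either delete the override so the default applies, or keep an explicit cap that is much larger:

# Option 1: remove the override so the default (~1.6 GB per data file) applies
# flume_test.channels.ch1.maxFileSize = 10000000

# Option 2: keep an explicit cap, just a much larger one, e.g. 1 GB
flume_test.channels.ch1.maxFileSize = 1073741824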
Thanks,
Hari

On Tue, Jan 13, 2015 at 7:32 AM, Needham, Guy <guy.need...@virginmedia.co.uk> wrote:
> I'm running Flume 1.5.0 with this configuration:
>
> flume_test.sources = sr1
> flume_test.channels = ch1
> flume_test.sinks = sk1
>
> # avro source
> flume_test.sources.sr1.type = avro
> flume_test.sources.sr1.channels = ch1
> flume_test.sources.sr1.bind = 10.92.211.22
> flume_test.sources.sr1.port = 55000
> flume_test.sources.sr1.ssl = true
> flume_test.sources.sr1.keystore = /nas/used_by_hadoop/hadoop-kn-p2/rdd/hadoop_keystore.jks
> flume_test.sources.sr1.keystore-password = *****
> flume_test.sources.sr1.compression-type = gzip
>
> # custom interceptor
> flume_test.sources.sr1.interceptors = i1
> flume_test.sources.sr1.interceptors.i1.type = com.vm.rdd.TimeBodyInterceptor$Builder
>
> # file channel
> flume_test.channels.ch1.type = file
> flume_test.channels.ch1.checkpointDir = /hadoop/user/flume/channels/flumeTest/checkpoint
> flume_test.channels.ch1.dataDirs = /hadoop/user/flume/channels/flumeTest/data
> flume_test.channels.ch1.capacity = 100000000
> flume_test.channels.ch1.transactionCapacity = 10000
> flume_test.channels.ch1.maxFileSize = 10000000
>
> # HDFS sink
> flume_test.sinks.sk1.channel = ch1
> flume_test.sinks.sk1.type = hdfs
> # dynamic path
> flume_test.sinks.sk1.hdfs.path = hdfs:///landing/data/flumeTest/%Y-%m-%d
> flume_test.sinks.sk1.hdfs.inUsePrefix = _
> flume_test.sinks.sk1.hdfs.codeC = gzip
> # roll the file once it reaches this size
> flume_test.sinks.sk1.hdfs.rollSize = 2560000000
> # roll the file every 2 minutes if it has not filled a block
> flume_test.sinks.sk1.hdfs.rollInterval = 120
> # if the file has been idle for 15s, close it
> flume_test.sinks.sk1.hdfs.idleTimeout = 15
> flume_test.sinks.sk1.hdfs.rollCount = 0
> flume_test.sinks.sk1.hdfs.batchSize = 100
>
> When I look at the data dir, there are many files:
>
> -rw-r--r-- 1 rdd rdd    0 Jan 13 12:06 in_use.lock
> -rw-r--r-- 1 rdd rdd 1.0M Jan 13 12:09 log-1
> -rw-r--r-- 1 rdd rdd 1.0M Jan 13 12:09 log-2
> -rw-r--r-- 1 rdd rdd 1.0M Jan 13 12:09 log-3
> -rw-r--r-- 1 rdd rdd 1.0M Jan 13 12:09 log-4
> -rw-r--r-- 1 rdd rdd 1.0M Jan 13 12:09 log-5
> -rw-r--r-- 1 rdd rdd 1.0M Jan 13 12:09 log-6
> -rw-r--r-- 1 rdd rdd 1.0M Jan 13 12:09 log-7
> -rw-r--r-- 1 rdd rdd 1.0M Jan 13 12:09 log-8
> -rw-r--r-- 1 rdd rdd 1.0M Jan 13 12:09 log-9
> -rw-r--r-- 1 rdd rdd 1.0M Jan 13 12:09 log-10
> -rw-r--r-- 1 rdd rdd 1.0M Jan 13 12:09 log-11
> -rw-r--r-- 1 rdd rdd 1.0M Jan 13 12:09 log-12
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:09 log-13.meta
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:09 log-1.meta
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:09 log-2.meta
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:09 log-3.meta
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:09 log-4.meta
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:09 log-5.meta
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:09 log-6.meta
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:09 log-7.meta
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:09 log-8.meta
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:09 log-9.meta
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:09 log-10.meta
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:09 log-11.meta
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:09 log-12.meta
> -rw-r--r-- 1 rdd rdd 1.0M Jan 13 12:09 log-13
> -rw-r--r-- 1 rdd rdd 1.0M Jan 13 12:09 log-14
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:09 log-14.meta
> -rw-r--r-- 1 rdd rdd    0 Jan 13 12:15 log-15
> -rw-r--r-- 1 rdd rdd   47 Jan 13 12:15 log-15.meta
>
> This list can grow to hundreds of log files, all accessed at around the same time. Looking in the agent's log, the log files appear to be duplicates of one another, as the agent is writing to each of them at the same time, for example:
>
> 2015-01-13 12:09:52,451 (Log-BackgroundWorker-ch1) [INFO - org.apache.flume.channel.file.Log.writeCheckpoint(Log.java:1005)] Updated checkpoint for file: /hadoop/user/flume/channels/flumeTest/data/log-14 position: 120431 logWriteOrderID: 1421150977639
> 2015-01-13 12:09:52,451 (Log-BackgroundWorker-ch1) [DEBUG - org.apache.flume.channel.file.Log.removeOldLogs(Log.java:1067)] Files currently in use: [14]
> 2015-01-13 12:09:52,451 (Log-BackgroundWorker-ch1) [INFO - org.apache.flume.channel.file.LogFile$RandomReader.close(LogFile.java:504)] Closing RandomReader /hadoop/user/flume/channels/flumeTest/data/log-1
> 2015-01-13 12:09:52,456 (Log-BackgroundWorker-ch1) [INFO - org.apache.flume.channel.file.LogFile$RandomReader.close(LogFile.java:504)] Closing RandomReader /hadoop/user/flume/channels/flumeTest/data/log-2
> 2015-01-13 12:09:52,461 (Log-BackgroundWorker-ch1) [INFO - org.apache.flume.channel.file.LogFile$RandomReader.close(LogFile.java:504)] Closing RandomReader /hadoop/user/flume/channels/flumeTest/data/log-3
> 2015-01-13 12:09:52,467 (Log-BackgroundWorker-ch1) [INFO - org.apache.flume.channel.file.LogFile$RandomReader.close(LogFile.java:504)] Closing RandomReader /hadoop/user/flume/channels/flumeTest/data/log-4
> 2015-01-13 12:09:52,472 (Log-BackgroundWorker-ch1) [INFO - org.apache.flume.channel.file.LogFile$RandomReader.close(LogFile.java:504)] Closing RandomReader /hadoop/user/flume/channels/flumeTest/data/log-5
> 2015-01-13 12:09:52,477 (Log-BackgroundWorker-ch1) [INFO - org.apache.flume.channel.file.LogFile$RandomReader.close(LogFile.java:504)] Closing RandomReader /hadoop/user/flume/channels/flumeTest/data/log-6
> 2015-01-13 12:09:52,482 (Log-BackgroundWorker-ch1) [INFO - org.apache.flume.channel.file.LogFile$RandomReader.close(LogFile.java:504)] Closing RandomReader /hadoop/user/flume/channels/flumeTest/data/log-7
> 2015-01-13 12:09:52,487 (Log-BackgroundWorker-ch1) [INFO - org.apache.flume.channel.file.LogFile$RandomReader.close(LogFile.java:504)] Closing RandomReader /hadoop/user/flume/channels/flumeTest/data/log-8
> 2015-01-13 12:09:52,492 (Log-BackgroundWorker-ch1) [INFO - org.apache.flume.channel.file.LogFile$RandomReader.close(LogFile.java:504)] Closing RandomReader /hadoop/user/flume/channels/flumeTest/data/log-9
> 2015-01-13 12:09:52,497 (Log-BackgroundWorker-ch1) [INFO - org.apache.flume.channel.file.LogFile$RandomReader.close(LogFile.java:504)] Closing RandomReader /hadoop/user/flume/channels/flumeTest/data/log-10
> 2015-01-13 12:09:52,503 (Log-BackgroundWorker-ch1) [INFO - org.apache.flume.channel.file.LogFile$RandomReader.close(LogFile.java:504)] Closing RandomReader /hadoop/user/flume/channels/flumeTest/data/log-11
> 2015-01-13 12:09:52,508 (Log-BackgroundWorker-ch1) [INFO - org.apache.flume.channel.file.LogFile$RandomReader.close(LogFile.java:504)] Closing RandomReader /hadoop/user/flume/channels/flumeTest/data/log-12
>
> Does anyone know why there are so many active files at one time? Is this expected behaviour?
> Regards,
> Guy Needham | Data Discovery
> Virgin Media | Technology and Transformation | Data