How to override jars in the Java classpath

2015-01-15 Thread Buntu Dev
Hi -- I'm passing jars that contain patches and would like the patched jars to be used instead of the ones already on the Java classpath. When I pass them via 'flume-ng agent .. --classpath ..', I still see the old jars ahead of the patched jars. How can I override the default jars? Thanks!

Re: How to override jars in the Java classpath

2015-01-15 Thread Buntu Dev
You can use a Maven target like mvn dependency:copy-dependencies to copy the dependencies to ${FLUME_HOME}/lib/. -Mike
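For reference, a minimal sketch of the suggested approach, assuming the patched artifacts are declared as dependencies in a local pom.xml:

~~~
# Copy the project's dependency jars (including the patched ones) into
# Flume's lib directory, so they are picked up by the default classpath.
mvn dependency:copy-dependencies -DoutputDirectory=${FLUME_HOME}/lib/
~~~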

Flume to handle spam detection and rate limiting

2015-01-23 Thread Buntu Dev
Hi -- I'm ingesting data into HDFS via Flume and wanted to know if there are any built-in features for spam detection and rate limiting, to avoid possible flooding of data. Please let me know. Thanks!

Real-time events sessionization and more

2015-03-17 Thread Buntu Dev
We have Kafka -> Flume -> Kite Dataset sink configured to write to a Hive-backed dataset. One of our main requirements is to do some sessionization on the data and run funnel analysis. We currently handle this with Impala/Hive, but it's quite slow, and given that we want the reports to be …

De-duping events during ingestion

2015-04-17 Thread Buntu Dev
Are there any known strategies for handling duplicate events during ingestion? I use Flume to ingest Apache logs and parse the requests using Morphlines, and there are some duplicate requests that differ only in certain query params. I would like to handle these once I parse and split the query params into …

Re: De-duping events during ingestion

2015-04-17 Thread Buntu Dev
That would have to be done outside Flume, perhaps using something like Spark Streaming, or Storm. Thanks, Hari

Fetching from specific Kafka offset

2015-07-10 Thread Buntu Dev
I'm using the Kafka source and need to replay some events from the past 3 or 4 days. I do notice there is an "auto.offset.reset" option, but it seems to take only values like 'largest' or 'smallest'. How do I go about setting the offset to a specific timestamp or a specific offset? Thanks!
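A minimal sketch of one workaround with the Flume 1.6-era Kafka source: start a fresh consumer group (which has no committed offsets) so that auto.offset.reset takes effect. This replays from the earliest retained offset rather than an exact timestamp; the group id, topic, and connection strings below are illustrative assumptions.

~~~
a1.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.kafka-src.zookeeperConnect = zk1:2181
a1.sources.kafka-src.topic = events
# A brand-new groupId has no committed offset, so auto.offset.reset applies.
a1.sources.kafka-src.groupId = replay-group-1
# Properties prefixed with "kafka." are passed through to the consumer.
a1.sources.kafka-src.kafka.auto.offset.reset = smallest
~~~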

Sinks are likely not keeping up with sources, or the buffer size is too tight error

2015-07-11 Thread Buntu Dev
I'm using a Memory channel along with the Kite Dataset sink and keep running into this error: ERROR kafka.KafkaSource: KafkaSource EXCEPTION, {} org.apache.flume.ChannelException: Unable to put batch on required channel: org.apache.flume.channel.MemoryChannel{name: tracksChannel} at org.apache.flu…
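A common mitigation for this error is to give the memory channel more headroom while the sink catches up; a hedged sketch, with values that are illustrative rather than tuned:

~~~
a1.channels.tracksChannel.type = memory
# Total events the channel can buffer before puts start failing.
a1.channels.tracksChannel.capacity = 100000
# Must be at least as large as the source and sink batch sizes.
a1.channels.tracksChannel.transactionCapacity = 1000
~~~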

Estimating the event loss

2015-07-16 Thread Buntu Dev
I'm using a Memory channel with capacity set to 100000. Does this mean that when the Flume agent restarts, it's possible that I lose about 100k events? For a durable channel such as the File channel, I noticed .tmp files being created and written to, but when I restart the agent these .tmp files are left as-is …
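For comparison, a minimal file-channel sketch with illustrative paths; unlike the memory channel, events staged here are persisted in the channel's checkpoint and data files and survive an agent restart:

~~~
a1.channels.fc.type = file
a1.channels.fc.checkpointDir = /var/lib/flume/checkpoint
a1.channels.fc.dataDirs = /var/lib/flume/data
a1.channels.fc.capacity = 1000000
a1.channels.fc.transactionCapacity = 1000
~~~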

Flume agent with only Source and Channel

2015-09-01 Thread Buntu Dev
I'm planning to implement a tiered Flume setup with a master Flume agent that uses a Kafka Source, transforms the events via Morphlines, and writes to a Kafka Channel without any Sink. The Kafka Channel of the master Flume agent will then be used as a source for other downstream Flume agents to w…
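A hedged sketch of such a master agent: a Kafka source, a Morphline interceptor, and a Kafka channel with no sink configured, so downstream agents consume the channel's topic directly. Topic, broker, and file names are assumptions.

~~~
f1.sources = kafka-src
f1.channels = kafka-ch
# No sinks declared: the Kafka channel's topic is the hand-off point.
f1.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
f1.sources.kafka-src.channels = kafka-ch
f1.sources.kafka-src.zookeeperConnect = zk1:2181
f1.sources.kafka-src.topic = raw-events
f1.sources.kafka-src.interceptors = morphline
f1.sources.kafka-src.interceptors.morphline.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
f1.sources.kafka-src.interceptors.morphline.morphlineFile = /etc/flume/morphlines.conf
f1.channels.kafka-ch.type = org.apache.flume.channel.kafka.KafkaChannel
f1.channels.kafka-ch.brokerList = broker1:9092
f1.channels.kafka-ch.zookeeperConnect = zk1:2181
f1.channels.kafka-ch.topic = transformed-events
~~~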

Kafka source with Avro throwing "Could not find schema for event" exception

2015-09-08 Thread Buntu Dev
I got Flume configured to read Avro events from a Kafka source and I'm also attaching the schema like this:

~~~
f1.sources.kafka-source.interceptors.attach-f1-schema.type = static
f1.sources.kafka-source.interceptors.attach-f1-schema.key = flume.avro.schema.url
f1.sources.kafka-source.interceptors.a…
~~~
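For context, the complete shape of that interceptor stanza; since the original message is truncated, the schema URL value below is a hypothetical placeholder:

~~~
f1.sources.kafka-source.interceptors = attach-f1-schema
f1.sources.kafka-source.interceptors.attach-f1-schema.type = static
f1.sources.kafka-source.interceptors.attach-f1-schema.key = flume.avro.schema.url
# Hypothetical location; it must be readable by whatever serializer consumes it.
f1.sources.kafka-source.interceptors.attach-f1-schema.value = hdfs://namenode/schemas/f1.avsc
~~~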

Cannot instantiate deserializer exception when using Avro deserializer

2015-09-09 Thread Buntu Dev
I need to read Avro files using spooldir and here is how I've configured the source:

~~~
f1.sources.src1.type = spooldir
f1.sources.src1.spoolDir = /path/to/avro/files
f1.sources.src1.deserializer = avro
~~~

But when I run the Flume agent, I keep running into these exceptions:

~~~
org.apache.flume.Flume…
~~~

Re: I am missing something basic

2015-09-14 Thread Buntu Dev
I don't mean to hijack this thread, but I have an issue along the same lines: does hdfs.fileType need to be set to DataStream even if the source data is Avro (Kafka in my case instead of spooldir)? On Mon, Sep 14, 2015 at 12:35 PM, Robin Jain wrote: > Hi Darshan, > > Define hdfs.fileType para…
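A hedged sketch of the HDFS sink settings usually paired with Avro payloads regardless of source type: DataStream (rather than SequenceFile) hands control of the on-disk format to the serializer. Channel name and path are illustrative.

~~~
f1.sinks.hdfs-sink.type = hdfs
f1.sinks.hdfs-sink.channel = ch1
f1.sinks.hdfs-sink.hdfs.path = /user/flume/avro-events
# DataStream lets the Avro serializer own the file format.
f1.sinks.hdfs-sink.hdfs.fileType = DataStream
f1.sinks.hdfs-sink.hdfs.fileSuffix = .avro
# This serializer reads the flume.avro.schema.url/literal event headers.
f1.sinks.hdfs-sink.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
~~~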

Avro source and sink

2015-09-15 Thread Buntu Dev
Currently I have a single Flume agent that converts Apache logs into Avro and writes to an HDFS sink. I'm looking for ways to create a tiered topology and want to have the Avro records available to other Flume agents. I used a Kafka channel/sink to write these Avro records but ran into this error …

Re: Avro source and sink

2015-09-15 Thread Buntu Dev
You can write from the source to a topic that is used by a channel and not by a Kafka sink. Regards, Gonzalo
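A hedged sketch of the pattern Gonzalo describes, seen from the consuming side: a downstream agent whose Kafka channel is fed by the upstream topic, with a sink but no source. All names are assumptions.

~~~
f2.channels = kafka-ch
f2.sinks = hdfs-sink
# No source: the channel is populated by the upstream agent's topic.
f2.channels.kafka-ch.type = org.apache.flume.channel.kafka.KafkaChannel
f2.channels.kafka-ch.brokerList = broker1:9092
f2.channels.kafka-ch.zookeeperConnect = zk1:2181
f2.channels.kafka-ch.topic = transformed-events
# Upstream events were written as Flume events, so parse them back as such.
f2.channels.kafka-ch.parseAsFlumeEvent = true
f2.sinks.hdfs-sink.type = hdfs
f2.sinks.hdfs-sink.channel = kafka-ch
f2.sinks.hdfs-sink.hdfs.path = /user/flume/avro-events
~~~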

How to calculate the throughput?

2015-09-23 Thread Buntu Dev
I have enabled JSON reporting and was able to get the metrics from the /metrics page. But based on the metrics reported, how do I go about generating some sort of throughput summary to benchmark the Source, Channel and Sink? Here is a sample of the JSON metrics: { "CHANNEL.my-file-channel": { …
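One way to turn the cumulative counters into a rate is to sample the endpoint twice and divide the delta by the interval. A minimal sketch, assuming the default monitoring port and the channel name from the sample JSON, and requiring curl and jq:

~~~
URL=http://localhost:41414/metrics
T1=$(curl -s $URL | jq -r '."CHANNEL.my-file-channel".EventTakeSuccessCount')
sleep 60
T2=$(curl -s $URL | jq -r '."CHANNEL.my-file-channel".EventTakeSuccessCount')
# Take rate approximates sink throughput; use EventPutSuccessCount for sources.
echo "events/sec: $(( (T2 - T1) / 60 ))"
~~~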

Determine the backlog in File channel

2015-11-02 Thread Buntu Dev
I have a File channel with an HDFS sink. In the case where the sink slows down and event takes from the channel fall behind while event puts continue at the same pace, how would one go about finding the size of the backlog, or the time it would take to clear it? Thanks!

Re: Determine the backlog in File channel

2015-11-02 Thread Buntu Dev
… com.sun.management.jmxremote.port=12346, or whatever port you choose.
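Concretely, remote JMX is usually enabled through the agent's JVM options, e.g. in conf/flume-env.sh. A hedged sketch of an unauthenticated setup for a trusted network only; do not use as-is on an exposed host:

~~~
# Expose the agent's MBeans (channel size, fill percentage, etc.) over JMX.
export JAVA_OPTS="$JAVA_OPTS -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=12346 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
~~~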

Re: Determine the backlog in File channel

2015-11-03 Thread Buntu Dev
… minutes and push the results to a TSDB (http://opentsdb.net/) database. TSDB is great for visualizing your data rates. Depending on your Flume configuration, you will get greatly varying rates. If you are using spinning disks with a file channel, you'll want to m…