Thanks for your replies, Steve and Chris.

Steve,

I am creating a real-time pipeline, so I am not looking to dump data to
HDFS right now. Also, since the log sources would be Nginx, Mongo and
application events, it might not always be possible to route events
directly from the source to Flume. Therefore, I thought the "tail -f"
strategy used by Fluentd, Logstash and others might be the only unifying
way to collect the logs.

Chris,

Can you please elaborate on the source-to-Kafka part? Do all event sources
have integration with Kafka? E.g. if you need to send server logs
(Apache/Nginx/Mongo etc.) to Kafka, what would be the ideal strategy?
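
For context, the kind of unified "tail" setup I had in mind is roughly the
following Logstash config, tailing the server logs and sinking to both
Kafka and Elasticsearch (just an untested sketch; option names may vary by
Logstash/plugin version):

    input {
      file {
        path => ["/var/log/nginx/access.log", "/var/log/mongodb/mongod.log"]
      }
    }
    output {
      kafka {
        bootstrap_servers => "localhost:9092"
        topic_id => "raw-logs"
      }
      elasticsearch {
        hosts => ["localhost:9200"]
      }
    }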

Regards,
Ashish

On Thu, Mar 31, 2016 at 5:16 PM, Chris Fregly <ch...@fregly.com> wrote:

> oh, and I forgot to mention Kafka Streams, which has been talked about
> heavily over the last few days at Strata here in San Jose.
>
> Streams can simplify a lot of this architecture by performing some
> light-to-medium-complex transformations in Kafka itself.
>
> I'm waiting anxiously for Kafka 0.10 with production-ready Kafka Streams,
> so I can try this out myself - and hopefully remove a lot of extra plumbing.
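>
> for example, the high-vs-normal priority routing could look roughly like
> this with the Streams DSL (untested sketch against the 0.10 preview API;
> topic names are made up):
>
>     import java.util.Properties
>     import org.apache.kafka.common.serialization.Serdes
>     import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
>     import org.apache.kafka.streams.kstream.{KStream, KStreamBuilder, Predicate}
>
>     val props = new Properties()
>     props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-router")
>     props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
>     props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)
>     props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)
>
>     val builder = new KStreamBuilder()
>     val logs: KStream[String, String] = builder.stream("raw-logs")
>
>     // send error events to a separate high-priority topic
>     logs.filter(new Predicate[String, String] {
>       override def test(key: String, value: String): Boolean = value.contains("ERROR")
>     }).to("high-priority-logs")
>
>     new KafkaStreams(builder, props).start()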
>
> On Thu, Mar 31, 2016 at 4:42 AM, Chris Fregly <ch...@fregly.com> wrote:
>
>> this is a very common pattern, yes.
>>
>> note that in Netflix's case, they're currently pushing all of their logs
>> to a Fronting Kafka + Samza Router which can route to S3 (or HDFS),
>> ElasticSearch, and/or another Kafka Topic for further consumption by
>> internal apps using other technologies like Spark Streaming (instead of
>> Samza).
>>
>> this Fronting Kafka + Samza Router also helps to differentiate between
>> high-priority events (Errors or High Latencies) and normal-priority events
>> (normal User Play or Stop events).
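>>
>> on the Spark Streaming side, consuming from one of those routed topics is
>> just the direct Kafka stream - a rough sketch against Spark 1.6 (broker
>> and topic names are placeholders):
>>
>>     import kafka.serializer.StringDecoder
>>     import org.apache.spark.SparkConf
>>     import org.apache.spark.streaming.{Seconds, StreamingContext}
>>     import org.apache.spark.streaming.kafka.KafkaUtils
>>
>>     val conf = new SparkConf().setAppName("log-analytics")
>>     val ssc = new StreamingContext(conf, Seconds(10))
>>
>>     val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
>>     val topics = Set("normal-priority-logs")
>>
>>     // direct (receiver-less) stream: one RDD partition per Kafka partition
>>     val logs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
>>       ssc, kafkaParams, topics).map(_._2)
>>
>>     // e.g. count error lines per batch
>>     logs.filter(_.contains("ERROR")).count().print()
>>
>>     ssc.start()
>>     ssc.awaitTermination()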
>>
>> here's a recent presentation I did which details this configuration
>> starting at slide 104:
>> http://www.slideshare.net/cfregly/dc-spark-users-group-march-15-2016-spark-and-netflix-recommendations
>> .
>>
>> btw, Confluent's distribution of Kafka does have a direct HTTP/REST API
>> (the Kafka REST Proxy), which is not recommended for production use but
>> has worked well for me in the past.
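>>
>> posting an event through the REST Proxy looks roughly like this (from
>> memory, so double-check against the Confluent docs; topic name is made up):
>>
>>     curl -X POST http://localhost:8082/topics/raw-logs \
>>       -H "Content-Type: application/vnd.kafka.json.v1+json" \
>>       --data '{"records":[{"value":{"level":"ERROR","msg":"timeout"}}]}'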
>>
>> these are some additional options to think about, anyway.
>>
>>
>> On Thu, Mar 31, 2016 at 4:26 AM, Steve Loughran <ste...@hortonworks.com>
>> wrote:
>>
>>>
>>> On 31 Mar 2016, at 09:37, ashish rawat <dceash...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I have been evaluating Spark for analysing application and server logs.
>>> I believe there are some downsides to doing this:
>>>
>>> 1. No direct mechanism for collecting logs, so other tools like Flume
>>> need to be introduced into the pipeline.
>>>
>>>
>>> you need something to collect logs no matter what you run. Flume isn't
>>> so bad; if you bring it up on the same host as the app then you can even
>>> collect logs while the network is playing up.
>>>
>>> Or you can just copy log4j files to HDFS and process them later
>>>
>>> 2. Need to write lots of code for parsing different patterns from logs,
>>> while log analysis tools like Logstash or Loggly provide this out of the
>>> box.
>>>
>>>
>>>
>>> Log parsing is essentially an ETL problem, especially if you don't try
>>> to lock down the log event format.
>>>
>>> You can also configure Log4J to save stuff in an easy-to-parse format
>>> and/or forward directly to your application.
>>>
>>> There's a log4j to flume connector to do that for you,
>>>
>>>
>>> http://www.thecloudavenue.com/2013/11/using-log4jflume-to-log-application.html
>>>
>>> or you can output in, say, JSON (
>>> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/log/Log4Json.java
>>>  )
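>>>
>>> a minimal log4j.properties for the Flume appender would look something
>>> like this (host/port are placeholders; the flume-ng-sdk jar needs to be
>>> on the application's classpath):
>>>
>>>     log4j.rootLogger = INFO, flume
>>>     log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
>>>     log4j.appender.flume.Hostname = flume-agent.example.com
>>>     log4j.appender.flume.Port = 41414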
>>>
>>> I'd go with Flume unless you need to save the logs locally and copy them
>>> to HDFS later.
>>>
>>>
>>>
>>> On the benefits side, I believe Spark might be more performant (although
>>> I am yet to benchmark it) and, being a generic processing engine, might
>>> work for complex use cases where the out-of-the-box functionality of log
>>> analysis tools is not sufficient (although I don't have any such use case
>>> right now).
>>>
>>> One option I was considering was to use Logstash for collection and
>>> basic processing, and then sink the processed logs to both Elasticsearch
>>> and Kafka, so that Spark Streaming can pick up data from Kafka for the
>>> complex use cases, while Logstash filters handle the simpler ones.
>>>
>>> I was wondering if someone has already done this evaluation and could
>>> provide some pointers on how/whether to create this pipeline with Spark.
>>>
>>> Regards,
>>> Ashish
>>>
>>>
>>>
>>>
>>
>>
>> --
>>
>> *Chris Fregly*
>> Principal Data Solutions Engineer
>> IBM Spark Technology Center, San Francisco, CA
>> http://spark.tc | http://advancedspark.com
>>
>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>
