Thank you for your answer. I'm not sure I phrased the question
clearly, but your answer helps.
Let me restate the question to check that you understood me.
I have this topology:
DataSource1, .... , DataSourceN --> Kafka --> SparkStreaming --> HDFS
                                    Kafka --> HDFS (raw data)
DataSource1, .... , DataSourceN --> Flume --> SparkStreaming --> HDFS
                                    Flume --> HDFS (raw data)
All data will be processed and stored in HDFS, both as raw and as
processed data. I don't know if it makes sense to use Kafka in this
case if the data is only going to HDFS. Given this, my guess is that
the Flume Spark sink makes more sense for feeding Spark Streaming with
a real-time flow of data. It doesn't seem to make much sense to have
Spark Streaming read the data from HDFS.
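
For reference, a minimal sketch of the "SparkStreaming --> HDFS" leg of
both topologies. It assumes the Spark 1.x spark-streaming-flume and
spark-streaming-kafka artifacts are on the classpath, and that the Flume
agent runs the Spark sink from the integration guide linked below. The
host names, ports, topic, consumer group, output path and the trivial
"processing" step are all placeholders, not part of the actual setup.
The raw-data leg (Kafka/Flume --> HDFS) is not covered here.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.kafka.KafkaUtils

object ProcessedToHdfs {

  // Option A: pull events from the Spark sink running inside a Flume agent
  // (the polling approach; host and port are whatever the sink is bound to).
  def fromFlume(ssc: StreamingContext, host: String, port: Int): DStream[String] =
    FlumeUtils.createPollingStream(ssc, host, port)
      .map(e => new String(e.event.getBody.array(), "UTF-8"))

  // Option B: consume a Kafka topic through the receiver-based API.
  def fromKafka(ssc: StreamingContext, zkQuorum: String,
                group: String, topic: String): DStream[String] =
    KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> 1))
      .map(_._2) // keep the message value, drop the key

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ProcessedToHdfs")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batches

    // Placeholder endpoints: replace with the real Flume agent or ZooKeeper quorum.
    val lines =
      if (args.headOption == Some("kafka"))
        fromKafka(ssc, "zk-host:2181", "spark-group", "events")
      else
        fromFlume(ssc, "flume-host", 9988)

    // Stand-in for the real processing: trim lines and drop empty ones.
    val processed = lines.map(_.trim).filter(_.nonEmpty)

    // Writes one output directory per batch under the given HDFS prefix.
    processed.saveAsTextFiles("hdfs:///data/processed/events")

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Either input yields the same DStream[String], so the processing and the
HDFS write stay identical whichever ingest layer is chosen.
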
2014-11-19 22:55 GMT+01:00 Guillermo Ortiz <[email protected]>:
>
> 2014-11-19 21:50 GMT+01:00 Hari Shreedharan <[email protected]>:
>> Btw, if you want to write to Spark Streaming from Flume -- there is a sink
>> (it is a part of Spark, not Flume). See Approach 2 here:
>> http://spark.apache.org/docs/latest/streaming-flume-integration.html
>>
>>
>>
>> On Wed, Nov 19, 2014 at 12:41 PM, Hari Shreedharan
>> <[email protected]> wrote:
>>>
>>> As of now, you can feed Spark Streaming from both Kafka and Flume.
>>> Currently, though, there is no API to write data back to either of the two
>>> directly.
>>>
>>> I sent a PR which should eventually add something like this:
>>> https://github.com/harishreedharan/spark/blob/Kafka-output/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaOutputWriter.scala
>>> that would allow Spark Streaming to write back to Kafka. This will likely be
>>> reviewed and committed after 1.2.
>>>
>>> I would consider writing something similar to push data to Flume as well,
>>> if there is a sufficient use-case for it. I have seen people talk about
>>> writing back to Kafka quite a bit, hence the above patch.
>>>
>>> Which one is better is up to your use-case, existing infrastructure, and
>>> preference. Both would work as is. Writing back to Flume would usually
>>> make sense if you want to write to HDFS/HBase/Solr etc., but you could write
>>> to those directly from Spark Streaming. There are benefits to writing back
>>> through Flume, such as the additional buffering Flume gives, but it is still
>>> possible to do so from Spark Streaming itself.
>>>
>>> But for Kafka, the usual use-case is a variety of custom applications
>>> reading the same data -- for which it makes a whole lot of sense to write
>>> back to Kafka. An example is to sanitize incoming data in Spark Streaming
>>> (from Flume or Kafka or something else) and make it available for a variety
>>> of apps via Kafka.
>>>
>>> Hope this helps!
>>>
>>> Hari
>>>
>>>
>>> On Wed, Nov 19, 2014 at 8:10 AM, Guillermo Ortiz <[email protected]>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm starting with Spark and I'm trying to understand: if I want to
>>>> use Spark Streaming, should I feed it from Flume or Kafka? I think
>>>> there is no official sink from Flume to Spark Streaming, and it seems
>>>> that Kafka fits better since it gives you reliability.
>>>>
>>>> Could someone give a good scenario for each alternative? When would it
>>>> make sense to use Kafka and when Flume for Spark Streaming?
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>
>>