I don't think we can print an integer value in a Spark Streaming process, as opposed to a Spark job. I think I can print the content of an RDD, but not debug messages. Am I wrong?
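For context, here is roughly what I am trying (a minimal sketch only; the app name, topic name, broker address and batch interval are placeholders I made up). My understanding is that a print() inside the function passed to foreachRDD runs on the driver, so it should show up in the driver's output, whereas a print() inside a transformation runs on the executors:

    from __future__ import print_function

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    # assumes the spark-streaming-kafka package is available on the classpath
    sc = SparkContext(appName="debug-prints")   # placeholder app name
    ssc = StreamingContext(sc, 10)              # placeholder 10s batch interval

    # placeholder topic and broker list
    stream = KafkaUtils.createDirectStream(
        ssc, ["my_topic"], {"metadata.broker.list": "kafka-host:9092"})

    def dump(time, rdd):
        # runs on the driver, so these messages go to the driver's stdout/logs
        print("batch at %s contains %d records" % (time, rdd.count()))

    stream.foreachRDD(dump)

    ssc.start()
    ssc.awaitTermination()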
Cyril Scetbon

> On Feb 17, 2016, at 12:51 AM, ayan guha <[email protected]> wrote:
>
> Hi
>
> You can always use RDD properties, which already have the partition information.
>
> https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
>
>> On Wed, Feb 17, 2016 at 2:36 PM, Cyril Scetbon <[email protected]> wrote:
>> Your understanding is the right one (having re-read the documentation).
>> I'm still wondering how I can verify that 5 partitions have been created. My job
>> reads from a Kafka topic that has 5 partitions and sends the data to E/S. I can
>> see that when there is one task reading from Kafka, there are 5 tasks writing to
>> E/S. So I'm supposing that the task reading from Kafka does it in parallel using
>> 5 partitions, and that's why there are then 5 tasks writing to E/S. But I'm only
>> supposing ...
>>
>>> On Feb 16, 2016, at 21:12, ayan guha <[email protected]> wrote:
>>>
>>> I have a slightly different understanding.
>>>
>>> A direct stream generates 1 RDD per batch; however, the number of partitions in
>>> that RDD = the number of partitions in the Kafka topic.
>>>
>>>> On Wed, Feb 17, 2016 at 12:18 PM, Cyril Scetbon <[email protected]> wrote:
>>>> Hi guys,
>>>>
>>>> I'm running some tests with Spark and Kafka using a Python script. I use
>>>> the second method, which doesn't need any receiver (the Direct Approach). It
>>>> should adapt the number of RDDs to the number of partitions in the topic, and
>>>> I'm trying to verify that. What's the easiest way to verify it? I also
>>>> tried to co-locate Yarn, Spark and Kafka to check whether RDDs are created
>>>> depending on the leaders of the partitions in a topic, and they are not. Can
>>>> you confirm that RDDs are not created depending on the location of the
>>>> partitions, and that co-locating Kafka with Spark is not a must-have, or
>>>> that Spark does not take advantage of it?
>>>>
>>>> As the parallelism is simplified (by creating as many RDDs as there are
>>>> partitions), I suppose that the biggest part of the tuning is playing with
>>>> Kafka partitions (not talking about network configuration or management of
>>>> Spark resources)?
>>>>
>>>> Thank you
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>
>
> --
> Best Regards,
> Ayan Guha
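P.S. Following up on the getNumPartitions() pointer in the link above, this is the kind of check I had in mind for the partition count (again just a sketch with placeholder topic and broker values, not my actual job), comparing each batch RDD's partition count with the topic's 5 partitions:

    from __future__ import print_function

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="partition-check")  # placeholder app name
    ssc = StreamingContext(sc, 10)                # placeholder batch interval

    # placeholder topic and broker list
    stream = KafkaUtils.createDirectStream(
        ssc, ["my_topic"], {"metadata.broker.list": "kafka-host:9092"})

    def check(rdd):
        # with the direct approach there is one RDD per batch; its partition
        # count should match the topic's partition count (5 in this discussion)
        print("this batch has %d partitions" % rdd.getNumPartitions())

    stream.foreachRDD(check)

    ssc.start()
    ssc.awaitTermination()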
