I don't think we can print an integer value in a Spark Streaming process, as opposed to a Spark job. I think I can print the content of an RDD, but not debug messages. Am I wrong?
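For context, here is roughly what I am trying (a minimal sketch only; the app name, topic name, broker address and batch interval are placeholders I made up). My understanding is that a print() inside the function passed to foreachRDD runs on the driver, so it should show up in the driver's output, whereas a print() inside a transformation runs on the executors:

    from __future__ import print_function

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    # assumes the spark-streaming-kafka package is available on the classpath
    sc = SparkContext(appName="debug-prints")   # placeholder app name
    ssc = StreamingContext(sc, 10)              # placeholder 10s batch interval

    # placeholder topic and broker list
    stream = KafkaUtils.createDirectStream(
        ssc, ["my_topic"], {"metadata.broker.list": "kafka-host:9092"})

    def dump(time, rdd):
        # runs on the driver, so these messages go to the driver's stdout/logs
        print("batch at %s contains %d records" % (time, rdd.count()))

    stream.foreachRDD(dump)

    ssc.start()
    ssc.awaitTermination()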
Cyril Scetbon

> On Feb 17, 2016, at 12:51 AM, ayan guha <[email protected]> wrote:
>
> Hi
>
> You can always use RDD properties, which already have the partition information.
>
> https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
>
>> On Wed, Feb 17, 2016 at 2:36 PM, Cyril Scetbon <[email protected]> wrote:
>> Your understanding is the right one (having re-read the documentation).
>> I'm still wondering how I can verify that 5 partitions have been created. My job
>> reads from a Kafka topic that has 5 partitions and sends the data to E/S. I can
>> see that when there is one task reading from Kafka, there are 5 tasks writing to
>> E/S. So I'm supposing that the task reading from Kafka does it in parallel using
>> 5 partitions, and that's why there are then 5 tasks writing to E/S. But I'm only
>> supposing ...
>>
>>> On Feb 16, 2016, at 21:12, ayan guha <[email protected]> wrote:
>>>
>>> I have a slightly different understanding.
>>>
>>> A direct stream generates 1 RDD per batch; however, the number of partitions in
>>> that RDD = the number of partitions in the Kafka topic.
>>>
>>>> On Wed, Feb 17, 2016 at 12:18 PM, Cyril Scetbon <[email protected]> wrote:
>>>> Hi guys,
>>>>
>>>> I'm running some tests with Spark and Kafka using a Python script. I use
>>>> the second method, which doesn't need any receiver (the Direct Approach). It
>>>> should adapt the number of RDDs to the number of partitions in the topic, and
>>>> I'm trying to verify that. What's the easiest way to verify it? I also
>>>> tried to co-locate Yarn, Spark and Kafka to check whether RDDs are created
>>>> depending on the leaders of the partitions in a topic, and they are not. Can
>>>> you confirm that RDDs are not created depending on the location of the
>>>> partitions, and that co-locating Kafka with Spark is not a must-have, or
>>>> that Spark does not take advantage of it?
>>>>
>>>> As the parallelism is simplified (by creating as many RDDs as there are
>>>> partitions), I suppose that the biggest part of the tuning is playing with
>>>> Kafka partitions (not talking about network configuration or management of
>>>> Spark resources)?
>>>>
>>>> Thank you
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>
>
> --
> Best Regards,
> Ayan Guha
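P.S. Following up on the getNumPartitions() pointer in the link above, this is the kind of check I had in mind for the partition count (again just a sketch with placeholder topic and broker values, not my actual job), comparing each batch RDD's partition count with the topic's 5 partitions:

    from __future__ import print_function

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="partition-check")  # placeholder app name
    ssc = StreamingContext(sc, 10)                # placeholder batch interval

    # placeholder topic and broker list
    stream = KafkaUtils.createDirectStream(
        ssc, ["my_topic"], {"metadata.broker.list": "kafka-host:9092"})

    def check(rdd):
        # with the direct approach there is one RDD per batch; its partition
        # count should match the topic's partition count (5 in this discussion)
        print("this batch has %d partitions" % rdd.getNumPartitions())

    stream.foreachRDD(check)

    ssc.start()
    ssc.awaitTermination()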
