Re: Optimize the performance of inserting data to Cassandra with Kafka and Spark Streaming

Jerry Wed, 17 Feb 2016 10:08:09 -0800

Rado,

Yes. you are correct. A lots of messages are created almost in the same
time (even use milliseconds). I changed to use "UUID.randomUUID()" with
which all messages can be inserted in the Cassandra table without time lag.


Thank you very much!
Jerry Wong

On Wed, Feb 17, 2016 at 1:50 AM, radoburansky [via Apache Spark User List] <
ml-node+s1001560n26246...@n3.nabble.com> wrote:

> Hi Jerry,
>
> How do you know that only 100 messages are inserted? What is the primary
> key of the "tableOfTopicA" Cassandra table? Isn't it possible that you
> map more messages to the same primamary key and therefore they overwrite
> each other in Cassandra?
>
> Regards
>
> Rado
>
> On Tue, Feb 16, 2016 at 10:29 PM, Jerry [via Apache Spark User List] <[hidden
> email] <http:///user/SendEmail.jtp?type=node&node=26246&i=0>> wrote:
>
>> Hello,
>>
>> I have questions using Spark streaming to consume data from Kafka and
>> insert to Cassandra database.
>>
>> 5 AWS instances (each one does have 8 cores, 30GB memory) for Spark,
>> Hadoop, Cassandra
>> Scala: 2.10.5
>> Spark: 1.2.2
>> Hadoop: 1.2.1
>> Cassandra 2.0.18
>>
>> 3 AWS instances for Kafka cluster (each one does have 8 cores, 30GB
>> memory)
>> Kafka: 0.8.2.1
>> Zookeeper: 3.4.6
>>
>> Other configurations:
>> batchInterval = 6 Seconds
>> blockInterval = 1500 millis
>> spark.locality.wait = 500 millis
>> #Consumers = 10
>>
>> There are two columns in the cassandra table
>> keySpaceOfTopicA.tableOfTopicA, "createdtime" and "log".
>>
>> Here is a piece of codes,
>>
>> @transient val kstreams = (1 to numConsumers.toInt).map { _ =>
>> KafkaUtils.createStream(ssc, zkeeper, groupId, Map("topicA"->1),
>>  StorageLevel.MEMORY_AND_DISK_SER)
>>         .map(_._2.toString).map(Tuple1(_))
>>         .map{case(log) => (System.currentTimeMillis(), log)}
>> }
>> @transient val unifiedMessage = ssc.union(kstreams)
>>
>> unifiedMessage.saveToCassandra("keySpaceOfTopicA", "tableOfTopicA",
>> SomeColumns("createdtime", "log"))
>>
>> I created a producer and send messages to Brokers (1000 messages/per
>> time)
>>
>> But the Cassandra can only be inserted about 100 messages in each round
>> of test.
>> Can anybody give me advices why the other messages (about 900 message)
>> can't be consumed?
>> How do I configure and tune the parameters in order to improve the
>> throughput of consumers?
>>
>> Thank you very much for your reading and suggestions in advances.
>>
>> Jerry Wong
>>
>> ------------------------------
>> If you reply to this email, your message will be added to the discussion
>> below:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Optimize-the-performance-of-inserting-data-to-Cassandra-with-Kafka-and-Spark-Streaming-tp26244.html
>> To start a new topic under Apache Spark User List, email [hidden email]
>> <http:///user/SendEmail.jtp?type=node&node=26246&i=1>
>> To unsubscribe from Apache Spark User List, click here.
>> NAML
>> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>>
>
>
>
> ------------------------------
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Optimize-the-performance-of-inserting-data-to-Cassandra-with-Kafka-and-Spark-Streaming-tp26244p26246.html
> To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Optimize the performance of inserting data to
> Cassandra with Kafka and Spark Streaming, click here
> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=26244&code=amVycnkua2luZzIud29uZ0BnbWFpbC5jb218MjYyNDR8MTYwMzcyMjg3MQ==>
> .
> NAML
> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Optimize-the-performance-of-inserting-data-to-Cassandra-with-Kafka-and-Spark-Streaming-tp26244p26252.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Optimize the performance of inserting data to Cassandra with Kafka and Spark Streaming

Reply via email to