Monitoring Spark Jobs

2015-06-07 Thread SamyaMaiti
Hi All, I have a Spark SQL application that fetches data from Hive; on top of it I have an Akka layer to run multiple queries in parallel. *Please suggest a mechanism to figure out the number of Spark jobs running in the cluster at a given instant in time.* I need to do the above as I see the ave…
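For what it's worth, one per-driver approach (a sketch, not a definitive mechanism) is a SparkListener that maintains a running count of active jobs; the counter below is plain Scala, and the listener wiring in the comments assumes Spark's onJobStart/onJobEnd callbacks:

```scala
import java.util.concurrent.atomic.AtomicInteger

// Plain-Scala counter that a SparkListener can drive.
class JobCounter {
  private val running = new AtomicInteger(0)
  def jobStarted(): Int = running.incrementAndGet() // call from onJobStart
  def jobEnded(): Int   = running.decrementAndGet() // call from onJobEnd
  def current: Int      = running.get()
}

// Hypothetical driver-side wiring (requires spark-core on the classpath):
// import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
// val counter = new JobCounter
// sc.addSparkListener(new SparkListener {
//   override def onJobStart(js: SparkListenerJobStart): Unit = { counter.jobStarted() }
//   override def onJobEnd(je: SparkListenerJobEnd): Unit     = { counter.jobEnded() }
// })
```

Note this only counts jobs of one application; a cluster-wide number would need each driver to publish its count, or polling of the master/application web UIs.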

Spark streaming Kafka Direct API + Multiple consumers

2016-07-07 Thread SamyaMaiti
Hi Team, Is there a way we can consume from Kafka using the Spark Streaming direct API with multiple consumers (belonging to the same consumer group)? Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-Kafka-Direct-API-Multiple-consumers-…
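For context, the direct API does not use Kafka's consumer-group rebalancing at all: each Kafka partition maps 1:1 to a Spark partition, so parallelism comes from the topic's partition count rather than from extra consumer instances. A sketch against the Spark 1.x spark-streaming-kafka API (broker addresses and topic name are placeholders):

```scala
// Sketch only: Spark 1.x API, requires spark-streaming-kafka_2.10 on the classpath.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("direct-sketch"), Seconds(10))

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092,broker2:9092", // placeholder brokers
  "group.id"             -> "my-group"                    // bookkeeping only; no rebalancing
)

// One Spark partition per Kafka partition of "events": scale by adding
// partitions to the topic, not by adding consumer instances.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))
```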

Spark logging

2016-07-10 Thread SamyaMaiti
Hi Team, I have a Spark application up & running on a 10-node standalone cluster. When I launch the application in cluster mode I am able to create a separate log file for the driver & executors (one common file for all executors). But my requirement is to create a separate log file for each executor. Is it fe…

spark.driver.extraJavaOptions

2016-07-21 Thread SamyaMaiti
Hi Team, I am using *CDH 5.7.1* with Spark *1.6.0*. I have a Spark Streaming application that reads from Kafka & does some processing. The issue is that while starting the application in CLUSTER mode, I want to pass a custom log4j.properties file to both driver & executor. *I have the below command:-*…
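A commonly used shape for this (a sketch; the jar, main class, and local path are placeholders) is to ship the file with --files so it lands in every container's working directory, then reference it by bare file name from both JVMs:

```shell
# Sketch: my-app.jar, com.example.Main and the local log4j path are placeholders.
spark-submit \
  --master yarn --deploy-mode cluster \
  --files /local/path/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --class com.example.Main my-app.jar
```

In cluster mode the driver also runs in a container, which is why the bare file name (not a file:// path to the submitting machine) is what the JVM options should point at.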

Re: spark.driver.extraJavaOptions

2016-07-21 Thread SamyaMaiti
Thanks for the reply, RK. Using the first option, my application doesn't recognize spark.driver.extraJavaOptions. With the second option, the issue remains the same: 2016-07-21 12:59:41 ERROR SparkContext:95 - Error initializing SparkContext. org.apache.spark.SparkException: Found both spark.exe…

Re: Spark Streaming and JMS

2015-12-02 Thread SamyaMaiti
Hi All, Is there any Pub-Sub receiver for JMS provided by Spark out of the box, like the one for Kafka? Thanks. Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-JMS-tp5371p25548.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
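There is no built-in JMS source in the Spark distribution (Kafka, Flume, etc. ship as external modules); the usual route is a custom receiver. A sketch assuming the javax.jms API and Spark Streaming's Receiver class — connection-factory construction and error handling are elided:

```scala
// Sketch: requires spark-streaming plus a JMS provider's client jar.
import javax.jms.{ConnectionFactory, Message, MessageListener, Session, TextMessage}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class JmsReceiver(factory: ConnectionFactory, topicName: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    val conn = factory.createConnection()
    val session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE)
    val consumer = session.createConsumer(session.createTopic(topicName))
    consumer.setMessageListener(new MessageListener {
      override def onMessage(m: Message): Unit = m match {
        case t: TextMessage => store(t.getText) // hand the payload to Spark
        case _              =>                  // ignore non-text messages in this sketch
      }
    })
    conn.start()
  }

  override def onStop(): Unit = () // close session/connection in a real implementation
}

// Usage sketch: ssc.receiverStream(new JmsReceiver(factory, "my.topic"))
```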

LIVY VS Spark Job Server

2016-09-14 Thread SamyaMaiti
Hi Team, I am evaluating different ways to submit & monitor Spark jobs using REST interfaces. When should I use Livy vs Spark Job Server? Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/LIVY-VS-Spark-Job-Server-tp27722.html

Writing to a single file from multiple executors

2015-03-11 Thread SamyaMaiti
Hi Experts, I have a scenario wherein I want to write to an Avro file from a streaming job that reads data from Kafka. But the issue is, as there are multiple executors, when all try to write to a given file I get a concurrent exception. One way to mitigate the issue is to repartition & have a…
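A hedged sketch of that mitigation: collapse each batch to a single partition so only one task writes per batch (shown with saveAsTextFile for brevity; the same shape applies to an Avro output format, and the output directory is a placeholder):

```scala
import org.apache.spark.streaming.dstream.DStream

// Sketch: serialize writes by collapsing each batch to one partition.
def writeSingleFilePerBatch(lines: DStream[String], outDir: String): Unit =
  lines.foreachRDD { (rdd, time) =>
    rdd.coalesce(1) // single partition => single writer task (no shuffle)
       .saveAsTextFile(s"$outDir/batch-${time.milliseconds}")
  }
```

Note this still produces one output directory per batch; a true single append-only file generally means writing from the driver or using HDFS append outside Spark.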

Parquet file + increase read parallelism

2015-03-23 Thread SamyaMaiti
Hi All, Suppose I have a Parquet file of 100 MB in HDFS & my HDFS block size is 64 MB, so I have 2 blocks of data. When I do *sqlContext.parquetFile("path")* followed by an action, two tasks are started on two partitions. My intent is to read these 2 blocks in more partitions to fully utilize my cluste…
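A sketch of the arithmetic plus the usual workaround: compute how many partitions you want for a target split size, then repartition after the read (the helper is plain Scala; the Spark-specific lines are commented and assume the 1.x sqlContext API):

```scala
// How many partitions to target for a given file size and desired split size.
def desiredPartitions(fileSizeMB: Int, targetSplitMB: Int): Int =
  math.ceil(fileSizeMB.toDouble / targetSplitMB).toInt

// For a 100 MB file and ~32 MB per task: 4 partitions instead of 2.
// val df = sqlContext.parquetFile("path").repartition(desiredPartitions(100, 32))
// repartition incurs a shuffle; alternatively some input formats honor a smaller
// split size via sc.hadoopConfiguration.set("mapred.max.split.size",
// (32 * 1024 * 1024).toString) -- an assumption worth verifying per format.
```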

persist(MEMORY_ONLY) takes lot of time

2015-04-01 Thread SamyaMaiti
Hi Experts, I have a Parquet dataset of 550 MB (9 blocks) in HDFS. I want to run SQL queries repetitively. A few questions: 1. When I do the below (persist to memory after reading from disk), it takes a lot of time to persist to memory; any suggestions on how to tune this? val inputP…
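A common tuning direction (a sketch; whether it helps depends on the data) is to cut serialization cost and cache footprint: Kryo plus a serialized storage level, persisting only after pruning to the columns you query:

```scala
import org.apache.spark.SparkConf

// Kryo usually shrinks cached data and speeds up (de)serialization.
val conf = new SparkConf()
  .setAppName("persist-tuning-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// Sketch (placeholder path; `sqlContext` built from a SparkContext over `conf`):
// import org.apache.spark.storage.StorageLevel
// val inputP = sqlContext.parquetFile("hdfs:///data/input.parquet")
// inputP.persist(StorageLevel.MEMORY_ONLY_SER) // smaller but CPU-heavier than MEMORY_ONLY
// inputP.count()                               // pay the materialization cost once, up front
```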

Spark Vs MR

2015-04-04 Thread SamyaMaiti
How is Spark faster than MR when data is on disk in both cases? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Vs-MR-tp22373.html

Re: 4 seconds to count 13M lines. Does it make sense?

2015-04-04 Thread SamyaMaiti
Reduce *spark.sql.shuffle.partitions* from the default of 200 to the total number of cores. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/4-seconds-to-count-13M-lines-Does-it-make-sense-tp22360p22374.html
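The arithmetic is simple; computing the value from the cluster shape and applying it might look like this (the helper is plain Scala; the commented line assumes Spark SQL's spark.sql.shuffle.partitions setting):

```scala
// Total cores available to the application.
def totalCores(numExecutors: Int, coresPerExecutor: Int): Int =
  numExecutors * coresPerExecutor

// e.g. 10 executors x 4 cores = 40 shuffle partitions instead of the default 200:
// sqlContext.setConf("spark.sql.shuffle.partitions", totalCores(10, 4).toString)
```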

Hive partition table + read using hiveContext + spark 1.3.1

2015-05-14 Thread SamyaMaiti
Hi Team, I have a Hive partitioned table with a partition column containing spaces. When I try to run any query, say a simple "Select * from table_name", it fails. *Please note the same was working in Spark 1.2.0; now I have upgraded to 1.3.1. Also there is no change in my application code base.* If I…

Re: Spark Job execution time

2015-05-15 Thread SamyaMaiti
It does depend on the network I/O within your cluster & CPU usage. That said, the difference in run time should not be huge (assuming you are not running any other job in the cluster in parallel). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Jo…

ReliableDeliverySupervisor: Association with remote system

2014-12-25 Thread SamyaMaiti
…/** Created by samyamaiti on 12/25/14. */ object Driver { def main(args: Array[String]) { // Checkpoint dir in HDFS val checkpointDirectory = "hdfs://localhost:8020/user/samyamaiti/SparkCheckpoint1" // functionToCreateContext def functionToCreateContext(): StreamingContext = {…

Re: ReliableDeliverySupervisor: Association with remote system

2014-12-25 Thread SamyaMaiti
Sorry for the typo. The Apache Hadoop version is 2.6.0. Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ReliableDeliverySupervisor-Association-with-remote-system-tp20859p20860.html

Can we say 1 RDD is generated every batch interval?

2014-12-29 Thread SamyaMaiti
Hi All, Please clarify: can we say 1 RDD is generated every batch interval? If the above is true, then is the foreachRDD() operator executed once & only once for each batch? Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-we…
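For reference, the usual mental model (as the DStream API is generally described): each batch interval produces one RDD per DStream, and the function passed to foreachRDD runs once per batch on the driver — though the RDD it receives may be empty. A small sketch:

```scala
import org.apache.spark.streaming.dstream.DStream

// Sketch: the body of foreachRDD runs once per batch interval, on the driver.
def logBatchSizes(stream: DStream[String]): Unit = {
  var batches = 0L // driver-side counter, illustration only
  stream.foreachRDD { rdd =>
    batches += 1                 // incremented exactly once per batch
    if (!rdd.isEmpty())          // the per-batch RDD can be empty
      println(s"batch $batches: ${rdd.count()} records")
  }
}
```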

Re: ReliableDeliverySupervisor: Association with remote system

2014-12-29 Thread SamyaMaiti
Resolved. I changed to the Apache Hadoop 2.4.0 & Apache Spark 1.2.0 combination and all works fine. It must be because the 1.2.0 version of Spark was compiled against Hadoop 2.4.0. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ReliableDeliverySupervisor-Association-wi…

Kafka + Spark streaming

2014-12-30 Thread SamyaMaiti
Hi Experts, A few general queries: 1. Can a single block/partition in an RDD have more than 1 Kafka message, or will there be one & only one Kafka message per block? More broadly, is the message count related to the block in any way, or is it just that any message received within a particular b…

save rdd to ORC file

2015-01-03 Thread SamyaMaiti
Hi Experts, Like saveAsParquetFile on SchemaRDD, is there an equivalent to store in an ORC file? I am using Spark 1.2.0. As per the link below, it looks like it's not part of 1.2.0, so any latest update would be great. https://issues.apache.org/jira/browse/SPARK-2883 Till the next release, is there a w…
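For readers on later releases: per SPARK-2883 linked above, native ORC support arrived with Spark SQL's HiveContext in the 1.4 line, after which a DataFrame can be written directly; on 1.2.0 the workaround was to go through a Hive table. A sketch (paths are placeholders):

```scala
// Sketch for Spark 1.4+ (not 1.2.0): requires spark-hive on the classpath.
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext()                                   // configured elsewhere
val hiveContext = new HiveContext(sc)
val df = hiveContext.read.parquet("hdfs:///data/in.parquet")  // placeholder path
df.write.format("orc").save("hdfs:///data/out.orc")           // placeholder path

// On 1.2.0 itself: register the SchemaRDD as a temp table and
// INSERT OVERWRITE into a Hive table declared STORED AS ORC.
```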