Re: Configuring Spark for reduceByKey on massive data sets

2015-10-11 Thread hotdog
Hi Daniel, did you solve your problem? I met the same problem when running reduceByKey on massive data on YARN. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p25023.html Sent from the Apach

Re: Spark handling parallel requests

2015-10-11 Thread Akhil Das
Instead of pushing your requests to the socket, why don't you push them to Kafka or another message queue and use Spark Streaming to process them? Thanks Best Regards On Mon, Oct 5, 2015 at 6:46 PM, wrote: > Hi, > I am using Scala, doing a socket program to catch multiple requests at > sa
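
A minimal sketch of that suggestion, assuming Kafka as the queue; the broker address, topic name and processing step are placeholders, and spark-streaming-kafka needs to be on the classpath:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object RequestConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RequestConsumer").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Read requests from a Kafka topic instead of a raw socket.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val requests = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("requests"))

    // Each record is (key, value); handle the request payloads here.
    requests.map(_._2).foreachRDD { rdd =>
      rdd.foreach(payload => println(s"processing request: $payload"))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}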

Re: yarn-cluster mode throwing NullPointerException

2015-10-11 Thread Venkatakrishnan Sowrirajan
Hi Rachana, are you by any chance saying something like this in your code? "sparkConf.setMaster("yarn-cluster");" Setting yarn-cluster as the master on the SparkContext is not supported. I think you are hitting this bug: https://issues.apache.org/jira/browse/SPARK-7504. This got fixed in Spark 1.4.0,
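
In other words (a sketch only; the app name and jar are illustrative), leave the master out of the code and let spark-submit supply it:

import org.apache.spark.{SparkConf, SparkContext}

// Do NOT call setMaster("yarn-cluster") here; pass the master on the command line instead,
// e.g.: spark-submit --master yarn-cluster --class com.example.MyStreamingApp myapp.jar
val conf = new SparkConf().setAppName("MyStreamingApp")
val sc = new SparkContext(conf)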

Spark retrying task indefinitely

2015-10-11 Thread Amit Singh Hora
I am running Spark locally to understand how countByValueAndWindow works.

val Array(brokers, topics) = Array("192.XX.X.XX:9092", "test1")
// Create context with 2 second batch interval
val sparkConf = new SparkConf().setAppName("ReduceByWindowExample").setMaster("local[1,1]
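
For reference, a self-contained sketch of countByValueAndWindow; it substitutes a socket source for Kafka to avoid extra dependencies, and the checkpoint directory, host and port are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CountByValueAndWindowSketch {
  def main(args: Array[String]): Unit = {
    // local[2,1]: two worker threads (the receiver occupies one); the second number
    // is the allowed task failures (1 = no retries), mirroring local[1,1] above.
    val sparkConf = new SparkConf()
      .setAppName("ReduceByWindowExample")
      .setMaster("local[2,1]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint("/tmp/cbvw-checkpoint") // required by countByValueAndWindow

    val lines = ssc.socketTextStream("localhost", 9999)
    // Count occurrences of each distinct line over a 10s window, sliding every 2s.
    lines.countByValueAndWindow(Seconds(10), Seconds(2)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}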

RE: Join Order Optimization

2015-10-11 Thread Cheng, Hao
You probably have to read the source code; I am not sure if there are any .ppt or slides. Hao From: VJ Anand [mailto:vjan...@sankia.com] Sent: Monday, October 12, 2015 11:43 AM To: Cheng, Hao Cc: Raajay; user@spark.apache.org Subject: Re: Join Order Optimization Hi - Is there a design document

RE: Best practices to call small Spark jobs as part of a REST API

2015-10-11 Thread Nuthan Kumar
If the data is also on-demand, Spark as a back end is also a good option. Sent from Outlook Mail for Windows 10 phone From: Akhil Das Sent: Sunday, October 11, 2015 1:32 AM To: unk1102 Cc: user@spark.apache.org Subject: Re: Best practices to call small Spark jobs as part of a REST API One approach

yarn-cluster mode throwing NullPointerException

2015-10-11 Thread Rachana Srivastava
I am trying to submit a job in yarn-cluster mode using the spark-submit command. My code works fine when I use yarn-client mode. Cloudera version: CDH-5.4.7-1.cdh5.4.7.p0.3. Command submitted:

spark-submit --class "com.markmonitor.antifraud.ce.KafkaURLStreaming" \
  --driver-java-options "-Dlog4j
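
For comparison, a sketch of a complete yarn-cluster invocation; the log4j path and jar name are illustrative, not the poster's actual (truncated) values:

spark-submit \
  --class "com.markmonitor.antifraud.ce.KafkaURLStreaming" \
  --master yarn-cluster \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties" \
  --files /path/to/log4j.properties \
  /path/to/KafkaURLStreaming.jar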

Re: Join Order Optimization

2015-10-11 Thread VJ Anand
Hi - Is there a design document for those operations that have been implemented in 1.4.0? If so, where can I find them? -VJ On Sun, Oct 11, 2015 at 7:27 PM, Cheng, Hao wrote: > Yes, I think the SPARK-2211 should be the right place to follow the CBO > stuff, but probably that will not happen right

RE: Join Order Optimization

2015-10-11 Thread Cheng, Hao
Yes, I think the SPARK-2211 should be the right place to follow the CBO stuff, but probably that will not happen right away. The JIRA issue that introduced the statistics info can be found at: https://issues.apache.org/jira/browse/SPARK-2393 Hao From: Raajay [mailto:raaja...@gmail.com] Sent: Monday, O

Re: Join Order Optimization

2015-10-11 Thread Raajay
Hi Cheng, could you point me to the JIRA that introduced this change? Also, is SPARK-2211 the right issue to follow for cost-based optimization? Thanks Raajay On Sun, Oct 11, 2015 at 7:57 PM, Cheng, Hao wrote: > Spark SQL supports very basic join reordering optimization, based on the

RE: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Cheng, Hao
Thank you Ted, that's very informative. From the DB optimization point of view, cost-based join re-ordering and multi-way joins do provide better performance; but from the API design point of view, 2 arguments (relations) for JOIN in the DF API are probably enough for the multiple table

RE: Join Order Optimization

2015-10-11 Thread Cheng, Hao
Spark SQL supports very basic join reordering optimization, based on the raw table data size; this was added a couple of major releases back. And the "EXPLAIN EXTENDED query" command is a very informative tool to verify whether the optimization is taking effect. From: Raajay [mailto:raaja...@gmail.com]
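
For example, a quick way to check the plans from Scala (the table names are hypothetical):

// Prints the parsed, analyzed, optimized and physical plans for the query.
sqlContext.sql(
  "EXPLAIN EXTENDED SELECT * FROM a JOIN b ON a.id = b.id JOIN c ON b.id = c.id")
  .collect()
  .foreach(println)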

Re: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Ted Yu
Some weekend reading: http://stackoverflow.com/questions/20022196/are-left-outer-joins-associative Cheers On Sun, Oct 11, 2015 at 5:32 PM, Cheng, Hao wrote: > A join B join C === (A join B) join C > > Semantically they are equivalent, right? > > > > *From:* Richard Eggert [mailto:richard.egg...

RE: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Cheng, Hao
A join B join C === (A join B) join C. Semantically they are equivalent, right? From: Richard Eggert [mailto:richard.egg...@gmail.com] Sent: Monday, October 12, 2015 5:12 AM To: Subhajit Purkayastha Cc: User Subject: Re: Spark 1.5 - How to join 3 RDDs in a SQL DF? It's the same as joining 2. Join

RE: Hive with Apache Spark

2015-10-11 Thread Cheng, Hao
One option is to read the data via JDBC; however, it's probably the worst option, as you will probably need some hacky work to enable parallel reading in Spark SQL. Another option is to copy the hive-site.xml of your Hive server to $SPARK_HOME/conf; then Spark SQL will see everything that Hive
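
A minimal sketch of the second option, assuming a Spark build with Hive support and that hive-site.xml has already been copied to $SPARK_HOME/conf; the database and table names are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("HiveExample"))
val hiveContext = new HiveContext(sc)

// Spark SQL now sees the Hive metastore, so existing tables can be queried directly.
val df = hiveContext.sql("SELECT * FROM my_db.my_table LIMIT 10")
df.show()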

Handling expiring state in UDF

2015-10-11 Thread brightsparc
Hi, I have created a Python UDF to make an API call that requires an expiring OAuth token; the token needs refreshing every 600 seconds, which is longer than any given stage. Due to the nature of threads and local state, if I use a global variable, the variable regularly goes out of scope. I look int

Re: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Richard Eggert
It's the same as joining 2. Join two together, and then join the third one to the result of that. On Oct 11, 2015 2:57 PM, "Subhajit Purkayastha" wrote: > Can I join 3 different RDDs together in a Spark SQL DF? I can find > examples for 2 RDDs but not 3. > > > > Thanks > > >
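
A small sketch of that approach, using made-up RDDs that share an "id" column:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc is an existing SparkContext
import sqlContext.implicits._

val df1 = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "x")
val df2 = sc.parallelize(Seq((1, 10), (2, 20))).toDF("id", "y")
val df3 = sc.parallelize(Seq((1, true), (2, false))).toDF("id", "z")

// Join two, then join the third to the result.
val joined = df1.join(df2, "id").join(df3, "id")
joined.show()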

Re: SQLContext changing String field to Long

2015-10-11 Thread Yana Kadiyska
In our case, we do not actually need partition inference, so the workaround was easy -- instead of using the path as rootpath/batch_id=333/... we changed the paths to rootpath/333/. This works for us because we compute the set of HDFS paths manually and then register a DataFrame into a SQLContex
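
A sketch of that workaround, assuming Parquet data; the batch IDs and root path are placeholders:

// Build the HDFS paths by hand instead of relying on partition inference.
val batchIds = Seq("333", "334", "335")
val paths = batchIds.map(id => s"hdfs:///rootpath/$id")

// Read the explicit paths and register the result for SQL queries.
val df = sqlContext.read.parquet(paths: _*)
df.registerTempTable("batches")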

Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Subhajit Purkayastha
Can I join 3 different RDDs together in a Spark SQL DF? I can find examples for 2 RDDs but not 3. Thanks

Re: Compute Real-time Visualizations using Spark Streaming

2015-10-11 Thread Akhil Das
The simplest approach would be to push the streaming data (after the computations) to a SQL-like DB and then let your visualization piece pull it from the DB. Another approach would be to make your visualization piece a web socket (if you are using D3JS etc.) and then from your streaming application you
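
A rough sketch of the first approach (pushing computed results to a DB); the JDBC URL, credentials and table are placeholders, and the matching JDBC driver has to be on the classpath:

import java.sql.DriverManager
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamToDb").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Example computation: word counts per batch.
val counts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1L))
  .reduceByKey(_ + _)

// Write each batch out so the visualization layer can poll the table.
counts.foreachRDD { rdd =>
  rdd.foreachPartition { part =>
    val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/metrics", "user", "pass")
    try {
      val stmt = conn.prepareStatement("INSERT INTO word_counts (word, cnt) VALUES (?, ?)")
      part.foreach { case (word, cnt) =>
        stmt.setString(1, word); stmt.setLong(2, cnt); stmt.executeUpdate()
      }
      stmt.close()
    } finally {
      conn.close()
    }
  }
}

ssc.start()
ssc.awaitTermination()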

Hive with Apache Spark

2015-10-11 Thread Hafiz Mujadid
Hi, how can we read data from an external Hive server? The Hive server is running and I want to read the data remotely using Spark. Is there any example? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Hive-with-apache-spark-tp25020.html Sent from the Apache

Re: Spark cluster - use machine name in WorkerID, not IP address

2015-10-11 Thread Akhil Das
Did you try setting SPARK_LOCAL_IP in the conf/spark-env.sh file on each node? Thanks Best Regards On Fri, Oct 2, 2015 at 4:18 AM, markluk wrote: > I'm running a standalone Spark cluster of 1 master and 2 slaves. > > My slaves file under /conf lists the fully qualified domain names of the 2 >
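
That would look roughly like this in conf/spark-env.sh on each node (the address is a placeholder; whether the worker ID then shows the hostname or the IP can depend on the Spark version and DNS setup):

# conf/spark-env.sh
export SPARK_LOCAL_IP=192.168.1.10   # or this node's resolvable hostname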

Why does the Spark Streaming job stop producing outputs after a while?

2015-10-11 Thread Uthayan Suthakar
Hello all, I have a Spark Streaming job that runs and produces results successfully. However, after a few days the job stops producing any output. I can see that the job is still running (polling data from Flume, completing jobs and their subtasks); however, it is failing to produce any output. I have to r

Re: Best practices to call small Spark jobs as part of a REST API

2015-10-11 Thread Akhil Das
One approach would be to make your Spark job push the computed results (JSON) to a database, and your REST server can pull it from there and power the UI. Thanks Best Regards On Wed, Sep 30, 2015 at 12:26 AM, unk1102 wrote: > Hi I would like to know any best practices to call spark jobs in rest

Re: "Too many open files" exception on reduceByKey

2015-10-11 Thread Tian Zhang
It turns out that Mesos can override the OS ulimit -n setting, so we have increased the Mesos slave's ulimit -n setting. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p25019.html Sent from the Apache Spark U