Re: MLlib Naive Bayes classifier confidence

2014-11-10 Thread Sean Owen
Not directly. If you could access brzPi and brzTheta in the
NaiveBayesModel, you could repeat the same computation it does in predict() and
exponentiate the result to get back class probabilities, since the input and
internal values are in log space.

Hm, I wonder how people feel about exposing those fields, or adding a separate
method that exposes class probabilities? It seems useful, since the information
is conceptually directly available.
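For illustration, here is a minimal sketch of that computation (not the MLlib API;
logPi, logTheta and x are placeholder names standing in for the model's log class
priors, per-class log feature weights and a test vector):

// Hypothetical helper: turn Naive Bayes log-space scores into normalized probabilities.
def classProbabilities(logPi: Array[Double],
                       logTheta: Array[Array[Double]],
                       x: Array[Double]): Array[Double] = {
  // unnormalized log posterior per class: log pi(c) + sum_j logTheta(c)(j) * x(j)
  val logPosterior = logPi.indices.map { c =>
    logPi(c) + logTheta(c).zip(x).map { case (t, v) => t * v }.sum
  }
  // exponentiate (shifting by the max for numerical stability) and normalize
  val maxLog = logPosterior.max
  val unnormalized = logPosterior.map(lp => math.exp(lp - maxLog))
  val total = unnormalized.sum
  unnormalized.map(_ / total).toArray
}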
On Nov 10, 2014 5:46 AM, "jatinpreet"  wrote:

> Hi,
>
> Is there a way to get the confidence value of a prediction with MLlib's
> implementation of Naive Bayes classification? I wish to eliminate the
> samples that were classified with low confidence.
>
> Thanks,
> Jatin
>
>
>
> -
> Novice Big Data Programmer
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Naive-Bayes-classifier-confidence-tp18456.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


"-Error stopping receiver" in running Spark+Flume sample code "FlumeEventCount.scala"

2014-11-10 Thread Ping Tang
Hi,

Can somebody help me to understand why this error occurred?


2014-11-10 00:17:44,512 INFO  [Executor task launch worker-0] 
receiver.BlockGenerator (Logging.scala:logInfo(59)) - Started BlockGenerator

2014-11-10 00:17:44,513 INFO  [Executor task launch worker-0] 
receiver.ReceiverSupervisorImpl (Logging.scala:logInfo(59)) - Starting receiver

2014-11-10 00:17:44,513 INFO  [Thread-31] receiver.BlockGenerator 
(Logging.scala:logInfo(59)) - Started block pushing thread

2014-11-10 00:17:44,789 INFO  [Executor task launch worker-0] 
receiver.ReceiverSupervisorImpl (Logging.scala:logInfo(59)) - Stopping receiver 
with message: Error starting receiver 0: java.lang.AbstractMethodError

2014-11-10 00:17:44,796 ERROR [Executor task launch worker-0] 
receiver.ReceiverSupervisorImpl (Logging.scala:logError(75)) - Error stopping 
receiver 0
org.apache.spark.Logging$class.log(Logging.scala:52)

org.apache.spark.streaming.flume.FlumeReceiver.log(FlumeInputDStream.scala:134)

org.apache.spark.Logging$class.logInfo(Logging.scala:59)

org.apache.spark.streaming.flume.FlumeReceiver.logInfo(FlumeInputDStream.scala:134)

org.apache.spark.streaming.flume.FlumeReceiver.onStop(FlumeInputDStream.scala:151)

org.apache.spark.streaming.receiver.ReceiverSupervisor.stopReceiver(ReceiverSupervisor.scala:136)

org.apache.spark.streaming.receiver.ReceiverSupervisor.stop(ReceiverSupervisor.scala:112)

org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:127)

org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)

org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)

org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)

org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)

org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)

org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

java.lang.Thread.run(Thread.java:744)


2014-11-10 00:17:44,797 INFO  [Executor task launch worker-0] 
receiver.BlockGenerator (Logging.scala:logInfo(59)) - Stopping BlockGenerator

2014-11-10 00:17:44,800 INFO  [Executor task launch worker-0] 
util.RecurringTimer (Logging.scala:logInfo(59)) - Stopped timer for 
BlockGenerator after time 1415607464800

2014-11-10 00:17:44,801 INFO  [Executor task launch worker-0] 
receiver.BlockGenerator (Logging.scala:logInfo(59)) - Waiting for block pushing 
thread

2014-11-10 00:17:44,815 INFO  [Thread-31] receiver.BlockGenerator 
(Logging.scala:logInfo(59)) - Pushing out the last 0 blocks

2014-11-10 00:17:44,816 INFO  [Thread-31] receiver.BlockGenerator 
(Logging.scala:logInfo(59)) - Stopped block pushing thread

2014-11-10 00:17:44,816 INFO  [Executor task launch worker-0] 
receiver.BlockGenerator (Logging.scala:logInfo(59)) - Stopped BlockGenerator

2014-11-10 00:17:44,817 INFO  [Executor task launch worker-0] 
receiver.ReceiverSupervisorImpl (Logging.scala:logInfo(59)) - Waiting for 
executor stop is over

2014-11-10 00:17:44,818 ERROR [Executor task launch worker-0] 
receiver.ReceiverSupervisorImpl (Logging.scala:logError(75)) - Stopped executor 
with error: java.lang.AbstractMethodError

2014-11-10 00:17:44,820 ERROR [Executor task launch worker-0] executor.Executor 
(Logging.scala:logError(96)) - Exception in task 0.0 in stage 0.0 (TID 0)

java.lang.AbstractMethodError

at org.apache.spark.Logging$class.log(Logging.scala:52)

at 
org.apache.spark.streaming.flume.FlumeReceiver.log(FlumeInputDStream.scala:134)

at org.apache.spark.Logging$class.logInfo(Logging.scala:59)

at 
org.apache.spark.streaming.flume.FlumeReceiver.logInfo(FlumeInputDStream.scala:134)

at 
org.apache.spark.streaming.flume.FlumeReceiver.onStart(FlumeInputDStream.scala:146)

at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)

at 
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)

at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)

at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)

at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)

at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

at org.apache.spark.scheduler.Task.run(Task.scala:54)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)

at 
java.util.concurrent.ThreadPoolExecu

dealing with large values in kv pairs

2014-11-10 Thread YANG Fan
Hi,

I've got a huge list of key-value pairs, where the key is an integer and
the value is a long string (around 1 KB). I want to concatenate the strings
that share the same key.

Initially I did something like: pairs.reduceByKey((a, b) => a + " " + b)

Then I tried to save the result to HDFS. But it was extremely slow, and I had to
kill the job in the end.

I guess it's because the value part is too big and it slows down the
shuffling phase. So I tried to use sortByKey before doing reduceByKey.
sortByKey is very fast, and it's also fast when writing the result back to
HDFS. But when I did reduceByKey, it was as slow as before.

How can I make this simple operation faster?

Thanks,
Fan


Re: dealing with large values in kv pairs

2014-11-10 Thread Sean Owen
You are suggesting that the String concatenation is slow? It probably is,
because of all the allocation.

Consider foldByKey instead, starting with an empty StringBuilder as its
zero value. This will build up the result far more efficiently.
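As a rough, untested sketch of that idea (with a mapValues first so foldByKey's
value type is StringBuilder rather than String):

// pairs: RDD[(Int, String)]
val concatenated = pairs
  .mapValues(new StringBuilder(_))                                  // one builder per value
  .foldByKey(new StringBuilder)((a, b) => a.append(' ').append(b))  // mutate, don't copy
  .mapValues(_.toString.trim)                                       // drop the leading separator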
On Nov 10, 2014 8:37 AM, "YANG Fan"  wrote:

> Hi,
>
> I've got a huge list of key-value pairs, where the key is an integer and
> the value is a long string(around 1Kb). I want to concatenate the strings
> with the same keys.
>
> Initially I did something like: pairs.reduceByKey((a, b) => a+" "+b)
>
> Then tried to save the result to HDFS. But it was extremely slow. I had to
> kill the job at last.
>
> I guess it's because the value part is too big and it slows down the
> shuffling phase. So I tried to use sortByKey before doing reduceByKey.
> sortByKey is very fast, and it's also fast when writing the result back to
> HDFS. But when I did reduceByKey, it was as slow as before.
>
> How can I make this simple operation faster?
>
> Thanks,
> Fan
>


canopy clustering

2014-11-10 Thread amin mohebbi
I want to run the k-means of MLlib on a big dataset. It seems that for big datasets we 
need to perform pre-clustering methods such as canopy clustering: by starting 
with an initial clustering, the number of more expensive distance measurements 
can be significantly reduced by ignoring points outside of the initial 
canopies. 

If I am not mistaken, in the k-means of MLlib there are three initialization 
options: k-means++, k-means|| and random. 

So, can anyone explain whether we can use k-means|| instead of canopy 
clustering, or do these two methods act completely differently?

 
 

Best Regards 

... 

Amin Mohebbi 

PhD candidate in Software Engineering  
 at university of Malaysia   

Tel : +60 18 2040 017 



E-Mail : tp025...@ex.apiit.edu.my 

  amin_...@me.com

closure serialization behavior driving me crazy

2014-11-10 Thread Sandy Ryza
I'm experiencing some strange behavior with closure serialization that is
totally mind-boggling to me.  It appears that two arrays of equal size take
up vastly different amount of space inside closures if they're generated in
different ways.

The basic flow of my app is to run a bunch of tiny regressions using
Commons Math's OLSMultipleLinearRegression and then reference a 2D array of
the results from a transformation.  I was running into OOME's and
NotSerializableExceptions and tried to get closer to the root issue by
calling the closure serializer directly.
  scala> val arr = models.map(_.estimateRegressionParameters()).toArray

The resulting array is 1867 x 5. Serialized, it is about 80K bytes, which seems
about right:
  scala> SparkEnv.get.closureSerializer.newInstance().serialize(arr)
  res17: java.nio.ByteBuffer = java.nio.HeapByteBuffer[pos=0 lim=80027
cap=80027]

If I reference it from a simple function:
  scala> def func(x: Long) = arr.length
  scala> SparkEnv.get.closureSerializer.newInstance().serialize(func)
I get a NotSerializableException.

If I take pains to create the array using a loop:
  scala> val arr = Array.ofDim[Double](1867, 5)
  scala> for (s <- 0 until models.length) {
  | arr(s) = models(s).estimateRegressionParameters()
  | }
Serialization works, but the serialized closure for the function is a
whopping 400MB.

If I pass in an array of the same length that was created in a different
way, the size of the serialized closure is only about 90K, which seems
about right.

Naively, it seems like somehow the history of how the array was created is
having an effect on what happens to it inside a closure.

Is this expected behavior?  Can anybody explain what's going on?

any insight very appreciated,
Sandy


index File create by mapFile can't read

2014-11-10 Thread buring
Hi
Recently I wanted to save a big RDD[(k,v)] in the form of index and data files, so I
decided to use Hadoop MapFile. I tried some examples like this one:
https://gist.github.com/airawat/6538748
The code runs well and generates an index file and a data file. I can use the command
"hadoop fs -text /spark/out2/mapFile/data" to open the data file. But when I run the
command "hadoop fs -text /spark/out2/mapFile/index", I can't see the index
content; there is only some information in the console:
14/11/10 16:11:04 INFO zlib.ZlibFactory: Successfully loaded & 
initialized
native-zlib library
14/11/10 16:11:04 INFO compress.CodecPool: Got brand-new decompressor
[.deflate]
14/11/10 16:11:04 INFO compress.CodecPool: Got brand-new decompressor
[.deflate]
14/11/10 16:11:04 INFO compress.CodecPool: Got brand-new decompressor
[.deflate]
14/11/10 16:11:04 INFO compress.CodecPool: Got brand-new decompressor
[.deflate]

and the command "hadoop fs -ls /spark/out2/mapFile/" shows the following:
-rw-r--r--   3 spark hdfs  24002 2014-11-10 15:19 /spark/out2/mapFile/data
-rw-r--r--   3 spark hdfs    136 2014-11-10 15:19 /spark/out2/mapFile/index

I think "INFO compress.CodecPool: Got brand-new decompressor [.deflate]"
should't prohibit show the data in index. It'e really confused me. My code
was as follows:
import java.net.URI
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.io.{IntWritable, IOUtils, MapFile, Text}
import org.apache.spark.{SparkConf, SparkContext}

def try_Map_File(writePath: String) = {
  val uri = writePath + "/mapFile"
  val data = Array(
    "One, two, buckle my shoe", "Three, four, shut the door",
    "Five, six, pick up sticks", "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen")

  val con = new SparkConf()
  con.set("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
  val sc = new SparkContext(con)

  val conf = sc.hadoopConfiguration
  val fs = FileSystem.get(URI.create(uri), conf)
  val key = new IntWritable()
  val value = new Text()
  var writer: MapFile.Writer = null
  try {
    // assign to the outer var (the original shadowed it with a new val,
    // so the finally block closed a null reference and the writer was never closed)
    writer = new MapFile.Writer(conf, fs, uri, key.getClass, value.getClass)
    writer.setIndexInterval(64)
    for (i <- Range(0, 512)) {
      key.set(i + 1)
      value.set(data(i % data.length))
      writer.append(key, value)
    }
  } finally {
    IOUtils.closeStream(writer)
  }
}
Can anyone give me some ideas, or suggest another method to use instead of MapFile?
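As a rough, untested sketch (reusing the fs, uri and conf values from the code
above), the MapFile could also be read back directly with Hadoop's MapFile.Reader
instead of "hadoop fs -text":

val reader = new MapFile.Reader(fs, uri, conf)
try {
  val k = new IntWritable()
  val v = new Text()
  // next() walks the data file in key order, using the index to seek
  while (reader.next(k, v)) {
    println(k + " -> " + v)
  }
} finally {
  reader.close()
}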




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/index-File-create-by-mapFile-can-t-read-tp18471.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MLlib Naive Bayes classifier confidence

2014-11-10 Thread jatinpreet
Thanks for the answer. The variables brzPi and brzTheta are declared private.
I am writing my code in Java; otherwise I could have replicated the Scala
class and performed the desired computation, which, as I observed, is a
multiplication of brzTheta with the test vector, adding this value to brzPi.

Any suggestions for a way out, other than replicating the whole functionality
of the Naive Bayes model in Java? That would be a time-consuming process.



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Naive-Bayes-classifier-confidence-tp18456p18472.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Is there a step-by-step instruction on how to build Spark App with IntelliJ IDEA?

2014-11-10 Thread MEETHU MATHEW
Hi,
This question was asked earlier and I did it in the way specified, but I am 
getting java.lang.ClassNotFoundException.
Can somebody explain all the steps required to build a Spark app using IntelliJ 
(latest version), starting from creating the project to running it? I searched a 
lot but couldn't find appropriate documentation.
[Quoted from the earlier reply, "Re: Is there a step-by-step instruction on how to 
build Spark App with IntelliJ IDEA?" (viewable on mail-archives.apache.org):]
Don't try to use spark-core as an archetype. Instead just create a plain Scala 
project (no archetype) and add a Maven dependency on spark-core. That should be 
all you need.
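For example, with an sbt-based project in IntelliJ, the equivalent of that Maven
dependency is a single line in build.sbt (an illustrative sketch; the version shown
is just an example, use whichever Spark release you target):

// build.sbt: depend on spark-core instead of using it as an archetype
scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"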

   Thanks & Regards,
Meethu M

Solidifying Understanding of Standalone Mode

2014-11-10 Thread Ashic Mahtab
Hello,
I'm hoping to understand exactly what happens when a compiled Spark app is 
submitted to a Spark standalone cluster master. Say our master is A, and the 
workers are W1 and W2. Client machine C submits an app to the master 
using spark-submit. Here's what I think happens:

* C submits jar (possibly uber jar) to A. A starts execution and sends 
partitions to W1 and W2 to carry out work. Results are sent back to A. Results 
are stored in output files / tables according to the application. W1 and W2 may 
also be reading and writing data to and from sources. The submission from C is 
fire and forget, and the final results aren't sent back to C.

Is this correct?

I noticed something about the submitting process working as the driver 
application in Spark standalone mode. That would mean the above is wrong. Is there 
some information about exactly what happens when I submit an app to the Spark 
master in a standalone cluster?

Thanks,
Ashic.
  

Mysql retrieval and storage using JdbcRDD

2014-11-10 Thread akshayhazari
So far I have tried this and I am able to compile it successfully. There
isn't enough documentation on Spark's usage with databases. I am using
AbstractFunction0 and AbstractFunction1 here. I am unable to access the
database; the jar just runs without doing anything when submitted. I want to
know how it is supposed to be done and what I have done wrong here. Any
help is appreciated.

import java.io.Serializable;
import scala.*;
import scala.runtime.AbstractFunction0;
import scala.runtime.AbstractFunction1;
import scala.runtime.*;
import scala.collection.mutable.LinkedHashMap;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.rdd.*;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;
import org.apache.spark.sql.api.java.Row;
import org.apache.spark.sql.api.java.DataType;
import org.apache.spark.sql.api.java.StructType;
import org.apache.spark.sql.api.java.StructField;
import org.apache.spark.sql.api.java.Row;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import java.sql.*;
import java.util.*;
import java.io.*;
public class Spark_Mysql {
@SuppressWarnings("serial")
static class Z extends AbstractFunction0<Connection>
{
Connection con;
public Connection apply()
{

try {
Class.forName("com.mysql.jdbc.Driver");

con=DriverManager.getConnection("jdbc:mysql://localhost:3306/?user=azkaban&password=password");
}
catch(Exception e)
{
}
return con;
}

}
static public class Z1 extends AbstractFunction1<ResultSet, Integer>
{
int ret;
public Integer apply(ResultSet i) {

try{
ret=i.getInt(1);
}
catch(Exception e)
{}
return ret;
}
}

@SuppressWarnings("serial")
public static void main(String[] args) throws Exception {

String arr[]=new String[1];

arr[0]="/home/hduser/Documents/Credentials/Newest_Credentials_AX/spark-1.1.0-bin-hadoop1/lib/mysql-connector-java-5.1.33-bin.jar";

JavaSparkContext ctx = new JavaSparkContext(new
SparkConf().setAppName("JavaSparkSQL").setJars(arr));
SparkContext sctx = new SparkContext(new
SparkConf().setAppName("JavaSparkSQL").setJars(arr));
JavaSQLContext sqlCtx = new JavaSQLContext(ctx);

  try 
{
Class.forName("com.mysql.jdbc.Driver");
}
catch(Exception ex) 
{
System.exit(1);
}
  Connection zconn =
DriverManager.getConnection("jdbc:mysql://localhost:3306/?user=azkaban&password=password");

  JdbcRDD rdd = new JdbcRDD(sctx, new Z(),
      "SELECT * FROM spark WHERE ? <= id AND id <= ?",
      0L, 1000L, 10, new Z1(), scala.reflect.ClassTag$.MODULE$.AnyRef());
  
  rdd.saveAsTextFile("hdfs://127.0.0.1:9000/user/hduser/mysqlrdd"); 
  rdd.saveAsTextFile("/home/hduser/mysqlrdd"); 
}
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Mysql-retrieval-and-storage-using-JdbcRDD-tp18479.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Backporting spark 1.1.0 to CDH 5.1.3

2014-11-10 Thread Zalzberg, Idan (Agoda)
Hello,
I have a big cluster running CDH 5.1.3 which I can't upgrade to 5.2.0 at the 
current time.
I would like to run Spark-On-Yarn in that cluster.

I tried to compile spark with CDH-5.1.3 and I got HDFS to work but I am having 
problems with the connection to hive:

java.sql.SQLException: Could not establish connection to 
jdbc:hive2://localhost.localdomain:1/: Required field 
'serverProtocolVersion' is unset! 
Struct:TOpenSessionResp(status:TStatus(statusCode:SUCCESS_STATUS), 
serverProtocolVersion:null, 
sessionHandle:TSessionHandle(sessionId:THandleIdentifier(guid:C7 86 85 3D 38 91 
41 A1 AF 02 83 DA 80 74 A5 B1, secret:62 80 00
99 D6 73 48 9B 81 13 FB D7 DB 32 32 26)), configuration:{})
[info]   at 
org.apache.hive.jdbc.HiveConnection.openSession(HiveConnection.java:246)
[info]   at org.apache.hive.jdbc.HiveConnection.(HiveConnection.java:132)
[info]   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
[info]   at java.sql.DriverManager.getConnection(DriverManager.java:571)
[info]   at java.sql.DriverManager.getConnection(DriverManager.java:215)
[info]   at 
com.agoda.mse.hadooputils.HiveTools$.getHiveConnection(HiveTools.scala:135)
[info]   at 
com.agoda.mse.hadooputils.HiveTools$.withConnection(HiveTools.scala:19)
[info]   at 
com.agoda.mse.hadooputils.HiveTools$.withStatement(HiveTools.scala:30)
[info]   at 
com.agoda.mse.hadooputils.HiveTools$.copyFileToHdfsThenRunQuery(HiveTools.scala:110)
[info]   at 
SparkAssemblyTest$$anonfun$4.apply$mcV$sp(SparkAssemblyTest.scala:41)

This happens when I try to create a Hive connection myself, using the 
hive-jdbc-cdh5.1.3 package (I can connect if I don't have Spark in the 
classpath).

How can I get spark jar to be consistent with hive-jdbc for CDH5.1.3?

Thanks



Removing INFO logs

2014-11-10 Thread Ritesh Kumar Singh
How can I remove all the INFO logs that appear on the console when I submit
an application using spark-submit?


Re: Removing INFO logs

2014-11-10 Thread YANG Fan
Hi,

In conf/log4j.properties, change the following

log4j.rootCategory=INFO, console

to
 log4j.rootCategory=WARN, console

This works for me.

Best,
Fan

On Mon, Nov 10, 2014 at 8:21 PM, Ritesh Kumar Singh <
riteshoneinamill...@gmail.com> wrote:

> How can I remove all the INFO logs that appear on the console when I
> submit an application using spark-submit?
>


Re: Removing INFO logs

2014-11-10 Thread Ritesh Kumar Singh
It works.

Thanks

On Mon, Nov 10, 2014 at 6:32 PM, YANG Fan  wrote:

> Hi,
>
> In conf/log4j.properties, change the following
>
> log4j.rootCategory=INFO, console
>
> to
>  log4j.rootCategory=WARN, console
>
> This works for me.
>
> Best,
> Fan
>
> On Mon, Nov 10, 2014 at 8:21 PM, Ritesh Kumar Singh <
> riteshoneinamill...@gmail.com> wrote:
>
>> How can I remove all the INFO logs that appear on the console when I
>> submit an application using spark-submit?
>>
>
>


Re: MLlib Naive Bayes classifier confidence

2014-11-10 Thread Sean Owen
It's hacky, but you could access these fields via reflection. It'd be
better to propose opening them up in a PR.
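A rough sketch of the reflection approach (untested; note the Scala compiler may
mangle private field names, so check the model class's declared fields first):

// model: a trained NaiveBayesModel
// model.getClass.getDeclaredFields.foreach(f => println(f.getName))  // find the exact names
val field = model.getClass.getDeclaredField("brzPi")   // may need the mangled name
field.setAccessible(true)
val brzPi = field.get(model)                           // returns AnyRef; cast as needed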

On Mon, Nov 10, 2014 at 9:25 AM, jatinpreet  wrote:
> Thanks for the answer. The variables brzPi and brzTheta are declared private.
> I am writing my code with Java otherwise I could have replicated the scala
> class and performed desired computation, which is as I observed  a
> multiplication of brzTheta  with test vector and adding this value to brzPi.
>
> Any suggestions of a way out other than replicating the whole functionality
> of Naive Baye's model in Java? That would be a time consuming process.
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Executor Lost Failure

2014-11-10 Thread Ritesh Kumar Singh
Hi,

I am trying to submit my application using spark-submit, using the following
spark-defaults.conf params:

spark.master spark://:7077
spark.eventLog.enabled   true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value
-Dnumbers="one two three"

===
But every time I am getting this error:

14/11/10 18:39:17 ERROR TaskSchedulerImpl: Lost executor 1 on aa.local:
remote Akka client disassociated
14/11/10 18:39:17 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1,
aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:17 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:20 ERROR TaskSchedulerImpl: Lost executor 2 on aa.local:
remote Akka client disassociated
14/11/10 18:39:20 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 2,
aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:20 WARN TaskSetManager: Lost task 1.1 in stage 0.0 (TID 3,
aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:26 ERROR TaskSchedulerImpl: Lost executor 4 on aa.local:
remote Akka client disassociated
14/11/10 18:39:26 WARN TaskSetManager: Lost task 0.2 in stage 0.0 (TID 5,
aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:26 WARN TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4,
aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:29 ERROR TaskSchedulerImpl: Lost executor 5 on aa.local:
remote Akka client disassociated
14/11/10 18:39:29 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 7,
aa.local): ExecutorLostFailure (executor lost)
14/11/10 18:39:29 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times;
aborting job
14/11/10 18:39:29 WARN TaskSetManager: Lost task 1.3 in stage 0.0 (TID 6,
aa.local): ExecutorLostFailure (executor lost)
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 0.0 (TID 7, gonephishing.local): ExecutorLostFailure
(executor lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

=
Any fixes?


Running Spark on SPARC64 X+

2014-11-10 Thread Greg Jennings
Hello all, I'm hoping someone can help me with this hardware question. We have 
an upcoming need to run our machine learning application on physical hardware. 
Up until now, we've just rented a cloud-based high performance cluster, so my 
understanding of the real relative performance tradeoffs between different 
processor architectures is not great. 

Assuming the same memory levels, which do you think is preferable for running a 
stand alone Spark deployment: 

Configuration option 1: 
- Intel Xeon® processor E7-8893v2 (6C/12T, 3.4 GHz, TLC:37.5 MB, Turbo: Yes, 
8.0GT/s, Mem bus: 1600 MHz, 155W), 24 cores 

Configuration option 2: 
- 3.2 GHz (SPARC64™ X+), 4 sockets 4U server with 16 cores/socket (total 64 
cores) 

I guess one concern is that I couldn't find any examples of anyone who has run 
Apache Spark on the Fujitsu SPARC64 X+ architecture. In theory there shouldn't 
be any issues simply running it because of the JVM, but we do link back to some 
numerical computation libraries under the hood built with C++ that get called 
as an external process on the nodes via numpy during the iterations. Does 
anyone think that might cause any issue? 

Any insights are welcome! 

Thanks in advance! 
Greg 

Increase Executor Memory on YARN

2014-11-10 Thread Mudassar Sarwar
Hi,

How can we increase the executor memory of a running spark cluster on YARN?
We want to increase the executor memory on the addition of new nodes in the
cluster. We are running spark version 1.0.2.

Thanks
Mudassar



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Increase-Executor-Memory-on-YARN-tp18489.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



To generate IndexedRowMatrix from an RowMatrix

2014-11-10 Thread Lijun Wang
Hi,
  I need a matrix with each row having an index, e.g., index = 0 for the first
row, index = 1 for the second row. Could someone tell me how to generate such an
IndexedRowMatrix from a RowMatrix?

 Besides, does anyone have experience doing multiplication of two
distributed matrices, e.g., two RowMatrix instances?

thank you!
Lijun Wang



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/To-generate-IndexedRowMatrix-from-an-RowMatrix-tp18491.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: org/apache/commons/math3/random/RandomGenerator issue

2014-11-10 Thread lev
I see, thanks.

I'm not running on EC2, and I wouldn't like to start copying jars to all the
servers in the cluster.
Any ideas on how I can add this jar in a simple way?

Here are my failed attempts so far:
 - adding the math3 jar to the lib folder in my project root. The math3 classes
did appear in the compiled jar, but the error is still there. It's weird
that the class is not found even when it's in the jar.

 - adding the math3 jar to a dir that is in oozie.libpath. I'm running the
spark jar with oozie, but that also didn't solve it.

Thanks,
Lev. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-tp15748p18492.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: To generate IndexedRowMatrix from an RowMatrix

2014-11-10 Thread Cheng Lian

You may use RDD.zipWithIndex.
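A minimal sketch, assuming mat is an existing RowMatrix:

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// pair each row with its position, then wrap the result as an IndexedRowMatrix
val indexed = new IndexedRowMatrix(
  mat.rows.zipWithIndex.map { case (row, idx) => IndexedRow(idx, row) })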

On 11/10/14 10:03 PM, Lijun Wang wrote:


Hi,
   I need a matrix with each row having a index, e.g., index = 0 for first
row, index = 1 for second row. Could someone tell me how to generate such
IndexedRowMatrix from an RowMatrix?

  Besides, is there anyone having the experience to do multiplication of two
distributed matrix, e.g., two RowMatrix?

thank you!
Lijun Wang



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/To-generate-IndexedRowMatrix-from-an-RowMatrix-tp18491.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: Executor Lost Failure

2014-11-10 Thread Akhil Das
Try adding the following configurations also; they might work.

 spark.rdd.compress true

  spark.storage.memoryFraction 1
  spark.core.connection.ack.wait.timeout 600
  spark.akka.frameSize 50

Thanks
Best Regards

On Mon, Nov 10, 2014 at 6:51 PM, Ritesh Kumar Singh <
riteshoneinamill...@gmail.com> wrote:

> Hi,
>
> I am trying to submit my application using spark-submit, using following
> spark-default.conf params:
>
> spark.master spark://:7077
> spark.eventLog.enabled   true
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value
> -Dnumbers="one two three"
>
> ===
> But every time I am getting this error:
>
> 14/11/10 18:39:17 ERROR TaskSchedulerImpl: Lost executor 1 on aa.local:
> remote Akka client disassociated
> 14/11/10 18:39:17 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1,
> aa.local): ExecutorLostFailure (executor lost)
> 14/11/10 18:39:17 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> aa.local): ExecutorLostFailure (executor lost)
> 14/11/10 18:39:20 ERROR TaskSchedulerImpl: Lost executor 2 on aa.local:
> remote Akka client disassociated
> 14/11/10 18:39:20 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 2,
> aa.local): ExecutorLostFailure (executor lost)
> 14/11/10 18:39:20 WARN TaskSetManager: Lost task 1.1 in stage 0.0 (TID 3,
> aa.local): ExecutorLostFailure (executor lost)
> 14/11/10 18:39:26 ERROR TaskSchedulerImpl: Lost executor 4 on aa.local:
> remote Akka client disassociated
> 14/11/10 18:39:26 WARN TaskSetManager: Lost task 0.2 in stage 0.0 (TID 5,
> aa.local): ExecutorLostFailure (executor lost)
> 14/11/10 18:39:26 WARN TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4,
> aa.local): ExecutorLostFailure (executor lost)
> 14/11/10 18:39:29 ERROR TaskSchedulerImpl: Lost executor 5 on aa.local:
> remote Akka client disassociated
> 14/11/10 18:39:29 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 7,
> aa.local): ExecutorLostFailure (executor lost)
> 14/11/10 18:39:29 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4
> times; aborting job
> 14/11/10 18:39:29 WARN TaskSetManager: Lost task 1.3 in stage 0.0 (TID 6,
> aa.local): ExecutorLostFailure (executor lost)
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent
> failure: Lost task 0.3 in stage 0.0 (TID 7, gonephishing.local):
> ExecutorLostFailure (executor lost)
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> =
> Any fixes?
>


Re: Increase Executor Memory on YARN

2014-11-10 Thread Arun Ahuja
If you are using spark-submit with --master yarn, you can also pass the executor
memory as a flag: --executor-memory <>

On Mon, Nov 10, 2014 at 8:58 AM, Mudassar Sarwar <
mudassar.sar...@northbaysolutions.net> wrote:

> Hi,
>
> How can we increase the executor memory of a running spark cluster on YARN?
> We want to increase the executor memory on the addition of new nodes in the
> cluster. We are running spark version 1.0.2.
>
> Thanks
> Mudassar
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Increase-Executor-Memory-on-YARN-tp18489.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Understanding spark operation pipeline and block storage

2014-11-10 Thread Hao Ren

Hey, guys

Feel free to ask for more details if my questions are not clear.

Any insight here ?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-spark-operation-pipeline-and-block-storage-tp18201p18496.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MLlib Naive Bayes classifier confidence

2014-11-10 Thread jatinpreet
Thanks, I will try it out and raise a request for making the variables
accessible.

An unrelated question, do you think the probability value thus calculated
will be a good measure of confidence in prediction? I have been reading
mixed opinions about the same.

Jatin



-
Novice Big Data Programmer
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Naive-Bayes-classifier-confidence-tp18456p18497.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Understanding spark operation pipeline and block storage

2014-11-10 Thread Cheng Lian

On 11/6/14 1:39 AM, Hao Ren wrote:


Hi,

I would like to understand the pipeline of spark's operation(transformation
and action) and some details on block storage.

Let's consider the following code:

val rdd1 = SparkContext.textFile("hdfs://...")
rdd1.map(func1).map(func2).count

For example, we have a file in hdfs about 80Gb, already split in 32 files,
each 2.5Gb.

q1) How many partitions will rdd1 have ?
rule 1) Maybe 32, since there are 32 split files ? Because, most of the
case, this rule is true if the file is not big in size.
rule 2) Maybe more, I am not sure whether spark's block store can contain a
2.5Gb partition. Is there some parameter specify the block store size ?
AFAIK, hdfs block size is used to read data from hdfs by spark. So there
will be (80Gb/hdfsBlockSize) partitions in rdd1, right ? Usually, the hdfs
block size is 64Mb, then we will have 80g / 64m = 1280 partitions ? Too many
?

Which criterion will it take ? the number of split files or hdfs block size.


Rule 2 applies here. textFile delegates to hadoopFile, which 
respects the HDFS block size. You may use RDD.coalesce(n) to reduce 
the partition number if that's too large for you.
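For example (an illustrative sketch only; the numbers are placeholders):

// with 64 MB HDFS blocks, an 80 GB file yields roughly 1280 partitions
val rdd1 = sc.textFile("hdfs://...")
// merge them down without a shuffle if that is too fine-grained
val coarser = rdd1.coalesce(128)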



q2) Here, func1 and func2 are sequentially added into DAG. What's the
workflow on the partition level ?
option1: Given a partition, func1 and func2 will be applied to each element
in this partition sequentially. After everything is done, we count the # of
line in the partition and send count result to drive. Then, we take the next
partition and do the same thing?
option2: Or else, we apply func1 to all the partitions first, then apply
func2 to all partitions which have applied func1, count # of line in each
partition and send result to driver ?

I have do some tests, it seems that option1 is correct. Can anyone confirm
this ?
So in option 1, we have 1 job "count" which contains 3 stages: map(func1),
map(func2), count.


Option 1 is correct; however, here we only have 1 stage. A stage is only 
introduced when transformations with wide dependencies are used (e.g. 
reduceByKey).



q3) What if we run out of memory ?

Suppose we have 12 cores, 15Gb memory in cluster.

Case1 :
For example, the func1 will take one line in file, and create an big object
for each line, then the partition applied func1 will become a large
partition. If we have 12 cores in clusters, that means we may have 12 large
partitions in memory. What if these partitions are much bigger than memory ?
What will happen ? an exception OOM / heap size, etc ?

Case2 :
Suppose the input is 80 GB, but we force RDD to be repartitioned into 6
partitions which is small than the number of core. Normally, each partition
will be send to a core, then all the input will be in memory. However, we
have 15G memory in Cluster. What will happen ? OOM Exception ?
Then, could we just split the RDD into more partitions so that 80GB /
#partition *12(which is # of cores) < 15Gb(memory size) ? Meanwhile, we can
not split too many, which leads to some overhead on task distribution.

If we read data from hdfs using hdfs block size 64MB as partition size, we
will have a formula like:
64Mb * # of cores < Memory
which in most case is true. Could this explain why we reading hdfs using
block size will not leads to OOM like case 2, even if the data is very big
in size.


I wrote an article about this a few months ago: 
http://www.zhihu.com/question/23079001/answer/23569986

The article is in Chinese. I guess you're Chinese from your name; 
apologies if I'm wrong :)



Sorry for making this post a bit long. Hope I make myself clear.
Any help on any question will be appreciated.

Thank you.

Hao.













--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-spark-operation-pipeline-and-block-storage-tp18201.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Fwd: Executor Lost Failure

2014-11-10 Thread Ritesh Kumar Singh
-- Forwarded message --
From: Ritesh Kumar Singh 
Date: Mon, Nov 10, 2014 at 10:52 PM
Subject: Re: Executor Lost Failure
To: Akhil Das 


Tasks are now getting submitted, but many tasks don't happen.
Like, after opening the spark-shell, I load a text file from disk and try
printing its contents as:

>sc.textFile("/path/to/file").foreach(println)

It does not give me any output. While running this:

>sc.textFile("/path/to/file").count

gives me the right number of lines in the text file.
Not sure what the error is. But here is the output on the console for print
case:

14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(215230) called with
curMem=709528, maxMem=463837593
14/11/10 22:48:02 INFO MemoryStore: Block broadcast_6 stored as values in
memory (estimated size 210.2 KB, free 441.5 MB)
14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(17239) called with
curMem=924758, maxMem=463837593
14/11/10 22:48:02 INFO MemoryStore: Block broadcast_6_piece0 stored as
bytes in memory (estimated size 16.8 KB, free 441.5 MB)
14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory
on gonephishing.local:42648 (size: 16.8 KB, free: 442.3 MB)
14/11/10 22:48:02 INFO BlockManagerMaster: Updated info of block
broadcast_6_piece0
14/11/10 22:48:02 INFO FileInputFormat: Total input paths to process : 1
14/11/10 22:48:02 INFO SparkContext: Starting job: foreach at :13
14/11/10 22:48:02 INFO DAGScheduler: Got job 3 (foreach at :13)
with 2 output partitions (allowLocal=false)
14/11/10 22:48:02 INFO DAGScheduler: Final stage: Stage 3(foreach at
:13)
14/11/10 22:48:02 INFO DAGScheduler: Parents of final stage: List()
14/11/10 22:48:02 INFO DAGScheduler: Missing parents: List()
14/11/10 22:48:02 INFO DAGScheduler: Submitting Stage 3 (Desktop/mnd.txt
MappedRDD[7] at textFile at :13), which has no missing parents
14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(2504) called with
curMem=941997, maxMem=463837593
14/11/10 22:48:02 INFO MemoryStore: Block broadcast_7 stored as values in
memory (estimated size 2.4 KB, free 441.4 MB)
14/11/10 22:48:02 INFO MemoryStore: ensureFreeSpace(1602) called with
curMem=944501, maxMem=463837593
14/11/10 22:48:02 INFO MemoryStore: Block broadcast_7_piece0 stored as
bytes in memory (estimated size 1602.0 B, free 441.4 MB)
14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory
on gonephishing.local:42648 (size: 1602.0 B, free: 442.3 MB)
14/11/10 22:48:02 INFO BlockManagerMaster: Updated info of block
broadcast_7_piece0
14/11/10 22:48:02 INFO DAGScheduler: Submitting 2 missing tasks from Stage
3 (Desktop/mnd.txt MappedRDD[7] at textFile at :13)
14/11/10 22:48:02 INFO TaskSchedulerImpl: Adding task set 3.0 with 2 tasks
14/11/10 22:48:02 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
6, gonephishing.local, PROCESS_LOCAL, 1216 bytes)
14/11/10 22:48:02 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID
7, gonephishing.local, PROCESS_LOCAL, 1216 bytes)
14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory
on gonephishing.local:48857 (size: 1602.0 B, free: 442.3 MB)
14/11/10 22:48:02 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory
on gonephishing.local:48857 (size: 16.8 KB, free: 442.3 MB)
14/11/10 22:48:02 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID
6) in 308 ms on gonephishing.local (1/2)
14/11/10 22:48:02 INFO DAGScheduler: Stage 3 (foreach at :13)
finished in 0.321 s
14/11/10 22:48:02 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID
7) in 315 ms on gonephishing.local (2/2)
14/11/10 22:48:02 INFO SparkContext: Job finished: foreach at :13,
took 0.376602079 s
14/11/10 22:48:02 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks
have all completed, from pool

===



On Mon, Nov 10, 2014 at 8:01 PM, Akhil Das 
wrote:

> ​Try adding the following configurations also, might work.
>
>  spark.rdd.compress true
>
>   spark.storage.memoryFraction 1
>   spark.core.connection.ack.wait.timeout 600
>   spark.akka.frameSize 50
>
> Thanks
> Best Regards
>
> On Mon, Nov 10, 2014 at 6:51 PM, Ritesh Kumar Singh <
> riteshoneinamill...@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to submit my application using spark-submit, using following
>> spark-default.conf params:
>>
>> spark.master spark://:7077
>> spark.eventLog.enabled   true
>> spark.serializer
>> org.apache.spark.serializer.KryoSerializer
>> spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value
>> -Dnumbers="one two three"
>>
>> ===
>> But every time I am getting this error:
>>
>> 14/11/10 18:39:17 ERROR TaskSchedulerImpl: Lost executor 1 on aa.local:
>> remote Akka client disassociated
>> 14/11/10 18:39:17 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1,
>> aa.local): ExecutorLostFailure (executor lost)
>> 14/11/10 18:39:17 WARN TaskSetManager: Lost ta

Question about textFileStream

2014-11-10 Thread Saiph Kappa
Hi,

In my application I am doing something like this "new
StreamingContext(sparkConf, Seconds(10)).textFileStream("logs/")", and I
get some unknown exceptions when I copy a file with about 800 MB to that
folder ("logs/"). I have a single worker running with 512 MB of memory.

Can anyone tell me whether every 10 seconds Spark reads parts of that big file,
or whether it attempts to read the entire file in a single window? How does it
work?

Thanks.


spark SNAPSHOT repo

2014-11-10 Thread jamborta
Hi,

Just wondering if there is a public repo where I can pull the latest
snapshot or spark (ie to add to my build.sbt)?

thanks,




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-SNAPSHOT-repo-tp18502.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Kafka version dependency in Spark 1.2

2014-11-10 Thread Bhaskar Dutta
Hi,

Is there any plan to bump the Kafka version dependency in Spark 1.2 from
0.8.0 to 0.8.1.1?

Current dependency is still on Kafka 0.8.0
https://github.com/apache/spark/blob/branch-1.2/external/kafka/pom.xml

Thanks
Bhaskie


Re: spark SNAPSHOT repo

2014-11-10 Thread Sean Owen
I don't think there are any regular SNAPSHOT builds published to Maven
Central. You can always mvn install the build in your local repo or any
shares repo you want.

If you just want a recentish build of 1.2.0 without rolling your own you
could point to
https://repository.cloudera.com/cloudera/cloudera-repos/org/apache/spark/spark-core_2.10/1.2.0-cdh5.3.0-SNAPSHOT/
for now. It's not actually CDH specific.
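
For reference, wiring that into build.sbt would look roughly like the sketch
below. The resolver URL and artifact coordinates are read off the link above;
adjust the version suffix to whatever currently sits in that repo.

resolvers += "cloudera-repos" at "https://repository.cloudera.com/cloudera/cloudera-repos/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0-cdh5.3.0-SNAPSHOT"
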
Hi,

Just wondering if there is a public repo from which I can pull the latest
snapshot of Spark (i.e. to add to my build.sbt)?

thanks,




--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-SNAPSHOT-repo-tp18502.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


Re: spark SNAPSHOT repo

2014-11-10 Thread jamborta
thanks, that looks good.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-SNAPSHOT-repo-tp18502p18505.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Question about textFileStream

2014-11-10 Thread Soumitra Kumar
Entire file in a window.
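
That is, each new file dropped into the monitored directory is consumed in full
during the batch in which it is detected; it is never split across windows. A
minimal sketch of the setup being discussed (the SparkConf and paths are
placeholders, and the suggestion in the comments about splitting the file is
mine, not an API requirement):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("textFileStream-demo")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val lines = ssc.textFileStream("logs/")
// counts every line of whatever new files appeared in "logs/" during this 10s batch
lines.count().print()
// if one 800 MB file is too much for a 512 MB worker, copying it in as several
// smaller files (staggered over time) spreads the work over multiple batches
ssc.start()
ssc.awaitTermination()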

On Mon, Nov 10, 2014 at 9:20 AM, Saiph Kappa  wrote:

> Hi,
>
> In my application I am doing something like this "new
> StreamingContext(sparkConf, Seconds(10)).textFileStream("logs/")", and I
> get some unknown exceptions when I copy a file with about 800 MB to that
> folder ("logs/"). I have a single worker running with 512 MB of memory.
>
> Anyone can tell me if every 10 seconds spark reads parts of that big file,
> or if it attempts to read the entire file in a single window? How does it
> work?
>
> Thanks.
>
>


which is the recommended workflow engine for Apache Spark jobs?

2014-11-10 Thread Adamantios Corais
I have some previous experience with Apache Oozie while I was developing in
Apache Pig. Now, I am working explicitly with Apache Spark and I am looking
for a tool with similar functionality. Is Oozie recommended? What about
Luigi? What do you use / recommend?


Re: which is the recommended workflow engine for Apache Spark jobs?

2014-11-10 Thread Jimmy McErlain
I have used Oozie for all our workflows with Spark apps, but you will have
to use a Java action as the workflow element. I am interested in anyone's
experience with Luigi and/or any other tools.


On Mon, Nov 10, 2014 at 10:34 AM, Adamantios Corais <
adamantios.cor...@gmail.com> wrote:

> I have some previous experience with Apache Oozie while I was developing
> in Apache Pig. Now, I am working explicitly with Apache Spark and I am
> looking for a tool with similar functionality. Is Oozie recommended? What
> about Luigi? What do you use / recommend?
>



-- 


"Nothing under the sun is greater than education. By educating one person
and sending him/her into the society of his/her generation, we make a
contribution extending a hundred generations to come."
-Jigoro Kano, Founder of Judo-


Re: Kafka version dependency in Spark 1.2

2014-11-10 Thread Matei Zaharia
Just curious, what are the pros and cons of this? Can the 0.8.1.1 client still 
talk to 0.8.0 versions of Kafka, or do you need it to match your Kafka version 
exactly?

Matei

> On Nov 10, 2014, at 9:48 AM, Bhaskar Dutta  wrote:
> 
> Hi,
> 
> Is there any plan to bump the Kafka version dependency in Spark 1.2 from 
> 0.8.0 to 0.8.1.1?
> 
> Current dependency is still on Kafka 0.8.0
> https://github.com/apache/spark/blob/branch-1.2/external/kafka/pom.xml 
> 
> 
> Thanks
> Bhaskie



Re: Kafka version dependency in Spark 1.2

2014-11-10 Thread Sean McNamara
> Can the 0.8.1.1 client still talk to 0.8.0 versions of Kafka

Yes it can.

"0.8.1 is fully compatible with 0.8."  It is buried on this page: 
http://kafka.apache.org/documentation.html

In addition to the pom version bump SPARK-2492 would bring the kafka streaming 
receiver (which was originally based on kafka 0.7)  in line with kafka 0.8:
https://issues.apache.org/jira/browse/SPARK-2492
https://github.com/apache/spark/pull/1420

I will soon test that PR on a spark+yarn cluster

Cheers,

Sean


On Nov 10, 2014, at 11:58 AM, Matei Zaharia 
mailto:matei.zaha...@gmail.com>> wrote:

Just curious, what are the pros and cons of this? Can the 0.8.1.1 client still 
talk to 0.8.0 versions of Kafka, or do you need it to match your Kafka version 
exactly?

Matei

On Nov 10, 2014, at 9:48 AM, Bhaskar Dutta 
mailto:bhas...@gmail.com>> wrote:

Hi,

Is there any plan to bump the Kafka version dependency in Spark 1.2 from 0.8.0 
to 0.8.1.1?

Current dependency is still on Kafka 0.8.0
https://github.com/apache/spark/blob/branch-1.2/external/kafka/pom.xml

Thanks
Bhaskie




Re: Kafka version dependency in Spark 1.2

2014-11-10 Thread Helena
Version 0.8.2-beta is published. I'd consider waiting for this; it has quite a
few nice changes coming.
https://archive.apache.org/dist/kafka/0.8.2-beta/RELEASE_NOTES.html

I started the 0.8.1.1 upgrade in a branch a few weeks ago but abandoned it
because I wasn't sure if there was interest beyond myself.

- Helena
@helenaedelson



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-version-dependency-in-Spark-1-2-tp18503p18510.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Mapping SchemaRDD/Row to JSON

2014-11-10 Thread Akshat Aranya
Hi,

Does there exist a way to serialize Row objects to JSON.  In the absence of
such a way, is the right way to go:

* get hold of schema using SchemaRDD.schema
* Iterate through each individual Row as a Seq and use the schema to
convert values in the row to JSON types.

Thanks,
Akshat


Re: disable log4j for spark-shell

2014-11-10 Thread hmxxyy
Tried --driver-java-options and SPARK_JAVA_OPTS, none of them worked

Had to change the default one and rebuilt.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/disable-log4j-for-spark-shell-tp11278p18513.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Status of MLLib exporting models to PMML

2014-11-10 Thread Aris
Hello Spark and MLLib folks,

So a common problem in the real world of using machine learning is that
some data analysts use tools like R, but the more engineering-minded "data
engineers" out there will use systems like Spark MLlib or even Python
scikit-learn.

In the real world, I want to have "a system" where multiple different
modeling environments can learn from data / build models, represent the
models in a common language, and then have a layer which just takes the
model and runs model.predict() all day long -- scores the models, in other
words.

It looks like the project openscoring.io and jpmml-evaluator are some
amazing systems for this, but they fundamentally use PMML as the model
representation here.

I have read in some JIRA tickets that Xiangrui Meng is interested in getting
PMML export implemented for MLlib models; is that happening? Further, would
something like Manish Amde's boosted tree ensemble methods be representable
in PMML?

Thank you!!
Aris


MLLib Decision Trees algorithm hangs, others fine

2014-11-10 Thread tsj
Hello all,

I have some text data that I am running different algorithms on. 
I had no problems with LibSVM and Naive Bayes on the same data, 
but when I run Decision Tree, the execution hangs in the middle 
of DecisionTree.trainClassifier(). The only difference from the example 
given on the site is that I am using 6 categories instead of 2, and the 
input is text that is transformed to labeled points using TF-IDF. It
halts shortly after this log output:

spark.SparkContext: Job finished: collect at DecisionTree.scala:1347, took
1.019579676 s

Any ideas as to what could be causing this?
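
For reference, a bare-bones sketch of the call being described (MLlib 1.1-era
API; numClasses = 6 matches the setup above, while the impurity, maxDepth and
maxBins values are illustrative assumptions, not taken from the original code):

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def train(data: RDD[LabeledPoint]) = {
  val numClasses = 6
  val categoricalFeaturesInfo = Map[Int, Int]()  // empty: TF-IDF features are all continuous
  // impurity = "gini", maxDepth = 5, maxBins = 32 are placeholders
  DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, "gini", 5, 32)
}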



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tress-algorithm-hangs-others-fine-tp18515.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Custom persist or cache of RDD?

2014-11-10 Thread Benyi Wang
When I have a multi-step process flow like this:

A -> B -> C -> D -> E -> F

I need to store B and D's results into parquet files

B.saveAsParquetFile
D.saveAsParquetFile

If I don't cache/persist any step, spark might recompute from A,B,C,D and E
if something is wrong in F.

Of course, I'd better cache all steps if I have enough memory to avoid this
re-computation, or persist result to disk. But persisting B and D seems
duplicate with saving B and D as parquet files.

I'm wondering if spark can restore B and D from the parquet files using a
customized persist and restore procedure?


Re: Backporting spark 1.1.0 to CDH 5.1.3

2014-11-10 Thread Marcelo Vanzin
Hello,

CDH 5.1.3 ships with a version of Hive that's not entirely the same as
the Hive Spark 1.1 supports. So when building your custom Spark, you
should make sure you change all the dependency versions to point to
the CDH versions.

IIRC Spark depends on org.spark-project.hive:0.12.0, you'd have to
change it to something like org.apache.hive:0.12.0-cdh5.1.3. And
you'll probably run into compilation errors at that point (you can
check out cloudera's public repo for the patches needed to make Spark
1.0 compile against CDH's Hive 0.12 in [1]).

If you're still willing to go forward at this point, feel free to ask
questions, although CDH-specific questions would probably be better
asked on our mailing list instead (cdh-us...@cloudera.org).

[1] https://github.com/cloudera/spark/commits/cdh5-1.0.0_5.1.0

On Mon, Nov 10, 2014 at 3:58 AM, Zalzberg, Idan (Agoda)
 wrote:
> Hello,
>
> I have a big cluster running CDH 5.1.3 which I can’t upgrade to 5.2.0 at the
> current time.
>
> I would like to run Spark-On-Yarn in that cluster.
>
>
>
> I tried to compile spark with CDH-5.1.3 and I got HDFS to work but I am
> having problems with the connection to hive:
>
>
>
> java.sql.SQLException: Could not establish connection to
> jdbc:hive2://localhost.localdomain:1/: Required field
> 'serverProtocolVersion' is unset!
> Struct:TOpenSessionResp(status:TStatus(statusCode:SUCCESS_STATUS),
> serverProtocolVersion:null,
> sessionHandle:TSessionHandle(sessionId:THandleIdentifier(guid:C7 86 85 3D 38
> 91 41 A1 AF 02 83 DA 80 74 A5 B1, secret:62 80 00
>
> 99 D6 73 48 9B 81 13 FB D7 DB 32 32 26)), configuration:{})
>
> [info]   at
> org.apache.hive.jdbc.HiveConnection.openSession(HiveConnection.java:246)
>
> [info]   at
> org.apache.hive.jdbc.HiveConnection.(HiveConnection.java:132)
>
> [info]   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
>
> [info]   at java.sql.DriverManager.getConnection(DriverManager.java:571)
>
> [info]   at java.sql.DriverManager.getConnection(DriverManager.java:215)
>
> [info]   at
> com.agoda.mse.hadooputils.HiveTools$.getHiveConnection(HiveTools.scala:135)
>
> [info]   at
> com.agoda.mse.hadooputils.HiveTools$.withConnection(HiveTools.scala:19)
>
> [info]   at
> com.agoda.mse.hadooputils.HiveTools$.withStatement(HiveTools.scala:30)
>
> [info]   at
> com.agoda.mse.hadooputils.HiveTools$.copyFileToHdfsThenRunQuery(HiveTools.scala:110)
>
> [info]   at
> SparkAssemblyTest$$anonfun$4.apply$mcV$sp(SparkAssemblyTest.scala:41)
>
>
>
> This happens when I try to create a hive connection myself, using the
> hive-jdbc-cdh5.1.3 package ( I can connect if I don’t have the spark in the
> classpath)
>
>
>
> How can I get spark jar to be consistent with hive-jdbc for CDH5.1.3?
>
> Thanks
>
>
> 
>
> This message is confidential and is for the sole use of the intended
> recipient(s). It may also be privileged or otherwise protected by copyright
> or other legal rules. If you have received it by mistake please let us know
> by reply email and delete it from your system. It is prohibited to copy this
> message or disclose its content to anyone. Any confidentiality or privilege
> is not waived or lost by any mistaken delivery or unauthorized disclosure of
> the message. All messages sent to and from Agoda may be monitored to ensure
> compliance with company policies, to protect the company's interests and to
> remove potential malware. Electronic messages may be intercepted, amended,
> lost or deleted, or contain viruses.



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: disable log4j for spark-shell

2014-11-10 Thread hmxxyy
Even after changing
core/src/main/resources/org/apache/spark/log4j-defaults.properties to WARN
followed by a rebuild, the log level is still INFO.

Any other suggestions?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/disable-log4j-for-spark-shell-tp11278p18518.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-11-10 Thread Srinivas Chamarthi
I am trying to use spark with spray and I have the  dependency problem with
quasiquotes. The issue comes up only when I include spark dependencies. I
am not sure how this one can be excluded.

Jianshi: can you let me know what version of spray + akka + spark are you
using ?

[error]org.scalamacros:quasiquotes _2.10, _2.10.3
[trace] Stack trace suppressed: run last *:update for the full output.
[error] (*:update) Conflicting cross-version suffixes in:
org.scalamacros:quasiq
uotes


On Thu, Oct 30, 2014 at 9:50 PM, Jianshi Huang 
wrote:

> Hi Preshant, Chester, Mohammed,
>
> I switched to Spark's Akka and now it works well. Thanks for the help!
>
> (Need to exclude Akka from Spray dependencies, or specify it as provided)
>
>
> Jianshi
>
>
> On Thu, Oct 30, 2014 at 3:17 AM, Mohammed Guller 
> wrote:
>
>>  I am not sure about that.
>>
>>
>>
>> Can you try a Spray version built with 2.2.x along with Spark 1.1 and
>> include the Akka dependencies in your project’s sbt file?
>>
>>
>>
>> Mohammed
>>
>>
>>
>> *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
>> *Sent:* Tuesday, October 28, 2014 8:58 PM
>> *To:* Mohammed Guller
>> *Cc:* user
>> *Subject:* Re: Spray client reports Exception:
>> akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext
>>
>>
>>
>> I'm using Spark built from HEAD, I think it uses modified Akka 2.3.4,
>> right?
>>
>>
>>
>> Jianshi
>>
>>
>>
>> On Wed, Oct 29, 2014 at 5:53 AM, Mohammed Guller 
>> wrote:
>>
>> Try a version built with Akka 2.2.x
>>
>>
>>
>> Mohammed
>>
>>
>>
>> *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
>> *Sent:* Tuesday, October 28, 2014 3:03 AM
>> *To:* user
>> *Subject:* Spray client reports Exception:
>> akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext
>>
>>
>>
>> Hi,
>>
>>
>>
>> I got the following exceptions when using Spray client to write to
>> OpenTSDB using its REST API.
>>
>>
>>
>>   Exception in thread "pool-10-thread-2" java.lang.NoSuchMethodError:
>> akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext;
>>
>>
>>
>> It worked locally in my Intellij but failed when I launch it from
>> Spark-submit.
>>
>>
>>
>> Google suggested it's a compatibility issue in Akka. And I'm using latest
>> Spark built from the HEAD, so the Akka used in Spark-submit is 2.3.4-spark.
>>
>>
>>
>> I tried both Spray 1.3.2 (built for Akka 2.3.6) and 1.3.1 (built for
>> 2.3.4). Both failed with the same exception.
>>
>>
>>
>> Anyone has idea what went wrong? Need help!
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>


Re: disable log4j for spark-shell

2014-11-10 Thread hmxxyy
Some console messages:

14/11/10 20:04:33 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:46713
14/11/10 20:04:33 INFO util.Utils: Successfully started service 'HTTP file
server' on port 46713.
14/11/10 20:04:34 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/11/10 20:04:34 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/11/10 20:04:34 INFO util.Utils: Successfully started service 'SparkUI' on
port 4040.
14/11/10 20:04:34 INFO netty.NettyBlockTransferService: Server created on
46997
14/11/10 20:04:34 INFO storage.BlockManagerMaster: Trying to register
BlockManager
14/11/10 20:04:34 INFO storage.BlockManagerMasterActor: Registering block
manager localhost:46997 with 265.0 MB RAM, BlockManagerId(,
localhost, 46997)
14/11/10 20:04:35 INFO storage.BlockManagerMaster: Registered BlockManager

and the log4j-default.properties looks like:

cat core/src/main/resources/org/apache/spark/log4j-defaults.properties
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p
%c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=WARN

Any suggestions?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/disable-log4j-for-spark-shell-tp11278p18520.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spray with Spark-sql build fails with Incompatible dependencies

2014-11-10 Thread Srinivas Chamarthi
I am trying to use spark with spray and I have the  dependency problem with
quasiquotes. The issue comes up only when I include spark dependencies. I
am not sure how this one can be excluded.

does anyone tried this before and it works ?

[error] Modules were resolved with conflicting cross-version suffixes in
{file:/
[error]org.scalamacros:quasiquotes _2.10, _2.10.3
[trace] Stack trace suppressed: run last *:update for the full output.
[error] (*:update) Conflicting cross-version suffixes in:
org.scalamacros:quasiq
uotes

thx
srinivas


streaming linear regression is not building the model

2014-11-10 Thread Bui, Tri
Hi,

The model weights are not updating for streaming linear regression.  The code and
data below are what I am running.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val conf = new SparkConf().setMaster("local[1]").setAppName("1feature")
val ssc = new StreamingContext(conf, Seconds(25))
val trainingData = 
ssc.textFileStream("file:///data/TrainStreamDir").map(LabeledPoint.parse)
val testData = 
ssc.textFileStream("file:///data/TestStreamDir").map(LabeledPoint.parse)
val numFeatures = 3
val model = new 
StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(numFeatures))
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()

sample Data in the TrainStreamDir:

(10240,[1,21,0])
(9936,[2,21,15])
(10118,[3,21,30])
(10174,[4,21,45])
(10460,[5,22,0])
(9961,[6,22,15])
(10372,[7,22,30])
(10666,[8,22,45])
(10300,[9,23,0])

Sample of output results:
14/11/10 15:52:55 INFO scheduler.JobScheduler: Added jobs for time 
1415652775000 ms
14/11/10 15:52:55 INFO scheduler.JobScheduler: Starting job streaming job 
1415652775000 ms.0 from job set of time 141565
2775000 ms
14/11/10 15:52:55 INFO spark.SparkContext: Starting job: count at 
GradientDescent.scala:162
14/11/10 15:52:55 INFO spark.SparkContext: Job finished: count at 
GradientDescent.scala:162, took 3.1689E-5 s
14/11/10 15:52:55 INFO optimization.GradientDescent: 
GradientDescent.runMiniBatchSGD returning initial weights, no data
found
14/11/10 15:52:55 INFO regression.StreamingLinearRegressionWithSGD: Model 
updated at time 1415652775000 ms
14/11/10 15:52:55 INFO regression.StreamingLinearRegressionWithSGD: Current 
model: weights, [0.0,0.0,0.0]

Thanks
Tri



Re: Fwd: Executor Lost Failure

2014-11-10 Thread Ankur Dave
At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh 
 wrote:
> Tasks are now getting submitted, but many tasks don't happen.
> Like, after opening the spark-shell, I load a text file from disk and try
> printing its contentsas:
>
>>sc.textFile("/path/to/file").foreach(println)
>
> It does not give me any output.

That's because foreach launches tasks on the slaves. When each task tries to 
print its lines, they go to the stdout file on the slave rather than to your 
console at the driver. You should see the file's contents in each of the 
slaves' stdout files in the web UI.

This only happens when running on a cluster. In local mode, all the tasks are 
running locally and can output to the driver, so foreach(println) is more 
useful.
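
A small sketch of the usual alternatives if the goal is to see the lines at the
driver (take(20) is an arbitrary limit; collect() also works but pulls the whole
file back to the driver):

sc.textFile("/path/to/file").take(20).foreach(println)
// or, for everything (careful with large files):
sc.textFile("/path/to/file").collect().foreach(println)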

Ankur

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Master crashes job on task failure

2014-11-10 Thread Griffiths, Michael (NYC-RPM)
Hi,

I'm running Spark in standalone mode: 1 master, 15 slaves. I started the node 
with the ec2 script, and I'm currently breaking the job into many small parts 
(~2,000) to better examine progress and failure.

Pretty basic - submitting a PySpark job (via spark-submit) to the cluster. The 
job consists of loading a file from S3, performing minor parsing, storing the 
results in a RDD. The results are then saveAsTextFile to Hadoop.

Unfortunately, it keeps crashing. A small number of the jobs fail - I believe 
timeout errors - and for over half of the jobs that fail, when they are re-run 
they succeed. Still, a task failing shouldn't crash the entire job: it should 
just retry up to four times, and then give up.

However, the entire job does crash. I was wondering why, but I believe that 
when a job is assigned to SPARK_MASTER and it fails multiple times, it throws a 
SparkException and brings down Spark Master. If it was a slave, it would be OK 
- it could either re-register and continue, or not, but the entire job would 
continue (to completion).

I've run the job a few times now, and the point at which it crashes depends on 
when one of the failing jobs gets assigned to master.

The short-term solution would be exclude Master from running jobs, but I don't 
see that option. Does that exist? Can I exclude Master from accepting tasks in 
Spark standalone mode?

The long term solution, of course, is figuring what part of the job (or what 
file in S3) is causing the error and fixing it. But right now I'd just like to 
get the first results back, knowing I'll be missing 0.25% of data.

Thanks,
Michael


Re: Custom persist or cache of RDD?

2014-11-10 Thread Sean Owen
Well you can always create C by loading B from disk, and likewise for
E / D. No need for any custom procedure.

On Mon, Nov 10, 2014 at 7:33 PM, Benyi Wang  wrote:
> When I have a multi-step process flow like this:
>
> A -> B -> C -> D -> E -> F
>
> I need to store B and D's results into parquet files
>
> B.saveAsParquetFile
> D.saveAsParquetFile
>
> If I don't cache/persist any step, spark might recompute from A,B,C,D and E
> if something is wrong in F.
>
> Of course, I'd better cache all steps if I have enough memory to avoid this
> re-computation, or persist result to disk. But persisting B and D seems
> duplicate with saving B and D as parquet files.
>
> I'm wondering if spark can restore B and D from the parquet files using a
> customized persist and restore procedure?
>
>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark Master crashes job on task failure

2014-11-10 Thread Griffiths, Michael (NYC-RPM)
Never mind - I don't know what I was thinking with the below. It's just 
maxTaskFailures causing the job to fail.
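
For reference, the knob behind that is spark.task.maxFailures (default 4). A
sketch in Scala; the same key can also go into spark-defaults.conf or a PySpark
SparkConf (the app name and the value 8 are placeholders, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("s3-parse-job")
  .set("spark.task.maxFailures", "8")   // allow more per-task retries before the job aborts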

From: Griffiths, Michael (NYC-RPM) [mailto:michael.griffi...@reprisemedia.com]
Sent: Monday, November 10, 2014 4:48 PM
To: user@spark.apache.org
Subject: Spark Master crashes job on task failure

Hi,

I'm running Spark in standalone mode: 1 master, 15 slaves. I started the node 
with the ec2 script, and I'm currently breaking the job into many small parts 
(~2,000) to better examine progress and failure.

Pretty basic - submitting a PySpark job (via spark-submit) to the cluster. The 
job consists of loading a file from S3, performing minor parsing, storing the 
results in a RDD. The results are then saveAsTextFile to Hadoop.

Unfortunately, it keeps crashing. A small number of the jobs fail - I believe 
timeout errors - and for over half of the jobs that fail, when they are re-run 
they succeed. Still, a task failing shouldn't crash the entire job: it should 
just retry up to four times, and then give up.

However, the entire job does crash. I was wondering why, but I believe that 
when a job is assigned to SPARK_MASTER and it fails multiple times, it throws a 
SparkException and brings down Spark Master. If it was a slave, it would be OK 
- it could either re-register and continue, or not, but the entire job would 
continue (to completion).

I've run the job a few times now, and the point at which it crashes depends on 
when one of the failing jobs gets assigned to master.

The short-term solution would be exclude Master from running jobs, but I don't 
see that option. Does that exist? Can I exclude Master from accepting tasks in 
Spark standalone mode?

The long term solution, of course, is figuring what part of the job (or what 
file in S3) is causing the error and fixing it. But right now I'd just like to 
get the first results back, knowing I'll be missing 0.25% of data.

Thanks,
Michael


convert List to dstream

2014-11-10 Thread Josh J
Hi,

I have some data generated by some utilities that return the results as
a List. I would like to join this with a DStream of strings. How
can I do this? I tried the following, though I get Scala compiler errors:

val list_scalaconverted = ssc.sparkContext.parallelize(listvalues.toArray())
   list_queue.add(list_scalaconverted)
val list_stream = ssc.queueStream(list_queue)

 found   : org.apache.spark.rdd.RDD[Object]

 required: org.apache.spark.rdd.RDD[String]

Note: Object >: String, but class RDD is invariant in type T.

You may wish to define T as -T instead. (SLS 4.5)


and


 found   : java.util.LinkedList[org.apache.spark.rdd.RDD[String]]

 required: scala.collection.mutable.Queue[org.apache.spark.rdd.RDD[?]]


Thanks,

Josh


JavaKafkaWordCount not working under Spark Streaming

2014-11-10 Thread Something Something
I am embarrassed to admit but I can't get a basic 'word count' to work
under Kafka/Spark streaming.  My code looks like this.  I  don't see any
word counts in console output.  Also, don't see any output in UI.  Needless
to say, I am newbie in both 'Spark' as well as 'Kafka'.

Please help.  Thanks.

Here's the code:

public static void main(String[] args) {
if (args.length < 4) {
System.err.println("Usage: JavaKafkaWordCount 
  ");
System.exit(1);
}

//StreamingExamples.setStreamingLogLevels();
//SparkConf sparkConf = new
SparkConf().setAppName("JavaKafkaWordCount");

// Location of the Spark directory
String sparkHome = "/opt/mapr/spark/spark-1.0.2/";

// URL of the Spark cluster
String sparkUrl = "spark://mymachine:7077";

// Location of the required JAR files
String jarFiles =
"./spark-streaming-kafka_2.10-1.1.0.jar,./DlSpark-1.0-SNAPSHOT.jar,./zkclient-0.3.jar,./kafka_2.10-0.8.1.1.jar,./metrics-core-2.2.0.jar";

SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("JavaKafkaWordCount");
sparkConf.setJars(new String[]{jarFiles});
sparkConf.setMaster(sparkUrl);
sparkConf.set("spark.ui.port", "2348");
sparkConf.setSparkHome(sparkHome);

Map kafkaParams = new HashMap();
kafkaParams.put("zookeeper.connect", "myedgenode:2181");
kafkaParams.put("group.id", "1");
kafkaParams.put("metadata.broker.list", "myedgenode:9092");
kafkaParams.put("serializer.class",
"kafka.serializer.StringEncoder");
kafkaParams.put("request.required.acks", "1");

// Create the context with a 1 second batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new
Duration(2000));

int numThreads = Integer.parseInt(args[3]);
Map topicMap = new HashMap();
String[] topics = args[2].split(",");
for (String topic: topics) {
topicMap.put(topic, numThreads);
}

//JavaPairReceiverInputDStream messages =
//KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
JavaPairDStream messages =
KafkaUtils.createStream(jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicMap,
StorageLevel.MEMORY_ONLY_SER());


JavaDStream lines = messages.map(new
Function, String>() {
@Override
public String call(Tuple2 tuple2) {
return tuple2._2();
}
});

JavaDStream words = lines.flatMap(new
FlatMapFunction() {
@Override
public Iterable call(String x) {
return Lists.newArrayList(SPACE.split(x));
}
});

JavaPairDStream wordCounts = words.mapToPair(
new PairFunction() {
@Override
public Tuple2 call(String s) {
return new Tuple2(s, 1);
}
}).reduceByKey(new Function2() {
@Override
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
});

wordCounts.print();
jssc.start();
jssc.awaitTermination();


Building spark from source - assertion failed: org.eclipse.jetty.server.DispatcherType

2014-11-10 Thread jamborta
Hello,

I am trying to build Spark from source using the following:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M
-XX:ReservedCodeCacheSize=512m"

mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests clean
package

this works OK with branch-1.1, when I switch to branch-1.2, I get the
following error while building spark-core:

[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING]
/home/spark/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala:50:
inferred existential type (org.apache.spark.scheduler.DirectTaskResult[_$1],
Int) forSome { type _$1 }, which cannot be expressed by wildcards,  should
be enabled
by making the implicit value scala.language.existentials visible.
This can be achieved by adding the import clause 'import
scala.language.existentials'
or by setting the compiler option -language:existentials.
See the Scala docs for value scala.language.existentials for a discussion
why the feature should be explicitly enabled.
[WARNING]   val (result, size) =
serializer.get().deserialize[TaskResult[_]](serializedData) match {
[WARNING]   ^
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[ERROR] 
 while compiling:
/home/spark/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala
during phase: typer
 library version: version 2.10.4
compiler version: version 2.10.4

...

Exception in thread "main" java.lang.AssertionError: assertion failed:
org.eclipse.jetty.server.DispatcherType
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1212)
at
scala.reflect.internal.Types$ClassTypeRef$class.baseType(Types.scala:2186)
at 
scala.reflect.internal.Types$TypeRef$$anon$6.baseType(Types.scala:2544)
at scala.reflect.internal.Types$class.firstTry$1(Types.scala:6043)
at scala.reflect.internal.Types$class.isSubType2(Types.scala:6207)
at scala.reflect.internal.Types$class.isSubType(Types.scala:5816)
at scala.reflect.internal.SymbolTable.isSubType(SymbolTable.scala:13)
at scala.reflect.internal.Types$class.isSubArg$1(Types.scala:6005)
at
scala.reflect.internal.Types$$anonfun$isSubArgs$2.apply(Types.scala:6007)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Building-spark-from-source-assertion-failed-org-eclipse-jetty-server-DispatcherType-tp18529.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Building spark from source - assertion failed: org.eclipse.jetty.server.DispatcherType

2014-11-10 Thread sadhan
I ran into the same issue, reverting this commit seems to work
https://github.com/apache/spark/commit/bd86cb1738800a0aa4c88b9afdba2f97ac6cbf25



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Building-spark-from-source-assertion-failed-org-eclipse-jetty-server-DispatcherType-tp18529p18530.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: JavaKafkaWordCount not working under Spark Streaming

2014-11-10 Thread Tathagata Das
What is the Spark master that you are using? Use local[4], not local,
if you are running locally.

On Mon, Nov 10, 2014 at 3:01 PM, Something Something
 wrote:
> I am embarrassed to admit but I can't get a basic 'word count' to work under
> Kafka/Spark streaming.  My code looks like this.  I  don't see any word
> counts in console output.  Also, don't see any output in UI.  Needless to
> say, I am newbie in both 'Spark' as well as 'Kafka'.
>
> Please help.  Thanks.
>
> Here's the code:
>
> public static void main(String[] args) {
> if (args.length < 4) {
> System.err.println("Usage: JavaKafkaWordCount  
>  ");
> System.exit(1);
> }
>
> //StreamingExamples.setStreamingLogLevels();
> //SparkConf sparkConf = new
> SparkConf().setAppName("JavaKafkaWordCount");
>
> // Location of the Spark directory
> String sparkHome = "/opt/mapr/spark/spark-1.0.2/";
>
> // URL of the Spark cluster
> String sparkUrl = "spark://mymachine:7077";
>
> // Location of the required JAR files
> String jarFiles =
> "./spark-streaming-kafka_2.10-1.1.0.jar,./DlSpark-1.0-SNAPSHOT.jar,./zkclient-0.3.jar,./kafka_2.10-0.8.1.1.jar,./metrics-core-2.2.0.jar";
>
> SparkConf sparkConf = new SparkConf();
> sparkConf.setAppName("JavaKafkaWordCount");
> sparkConf.setJars(new String[]{jarFiles});
> sparkConf.setMaster(sparkUrl);
> sparkConf.set("spark.ui.port", "2348");
> sparkConf.setSparkHome(sparkHome);
>
> Map kafkaParams = new HashMap();
> kafkaParams.put("zookeeper.connect", "myedgenode:2181");
> kafkaParams.put("group.id", "1");
> kafkaParams.put("metadata.broker.list", "myedgenode:9092");
> kafkaParams.put("serializer.class",
> "kafka.serializer.StringEncoder");
> kafkaParams.put("request.required.acks", "1");
>
> // Create the context with a 1 second batch size
> JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new
> Duration(2000));
>
> int numThreads = Integer.parseInt(args[3]);
> Map topicMap = new HashMap();
> String[] topics = args[2].split(",");
> for (String topic: topics) {
> topicMap.put(topic, numThreads);
> }
>
> //JavaPairReceiverInputDStream messages =
> //KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
> JavaPairDStream messages =
> KafkaUtils.createStream(jssc,
> String.class,
> String.class,
> StringDecoder.class,
> StringDecoder.class,
> kafkaParams,
> topicMap,
> StorageLevel.MEMORY_ONLY_SER());
>
>
> JavaDStream lines = messages.map(new Function String>, String>() {
> @Override
> public String call(Tuple2 tuple2) {
> return tuple2._2();
> }
> });
>
> JavaDStream words = lines.flatMap(new
> FlatMapFunction() {
> @Override
> public Iterable call(String x) {
> return Lists.newArrayList(SPACE.split(x));
> }
> });
>
> JavaPairDStream wordCounts = words.mapToPair(
> new PairFunction() {
> @Override
> public Tuple2 call(String s) {
> return new Tuple2(s, 1);
> }
> }).reduceByKey(new Function2() {
> @Override
> public Integer call(Integer i1, Integer i2) {
> return i1 + i2;
> }
> });
>
> wordCounts.print();
> jssc.start();
> jssc.awaitTermination();
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Building spark from source - assertion failed: org.eclipse.jetty.server.DispatcherType

2014-11-10 Thread jamborta
ah, thanks, reverted a few days, works now.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Building-spark-from-source-assertion-failed-org-eclipse-jetty-server-DispatcherType-tp18529p18532.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: JavaKafkaWordCount not working under Spark Streaming

2014-11-10 Thread Something Something
I am not running locally.  The Spark master is:

"spark://:7077"



On Mon, Nov 10, 2014 at 3:47 PM, Tathagata Das 
wrote:

> What is the Spark master that you are using. Use local[4], not local
> if you are running locally.
>
> On Mon, Nov 10, 2014 at 3:01 PM, Something Something
>  wrote:
> > I am embarrassed to admit but I can't get a basic 'word count' to work
> under
> > Kafka/Spark streaming.  My code looks like this.  I  don't see any word
> > counts in console output.  Also, don't see any output in UI.  Needless to
> > say, I am newbie in both 'Spark' as well as 'Kafka'.
> >
> > Please help.  Thanks.
> >
> > Here's the code:
> >
> > public static void main(String[] args) {
> > if (args.length < 4) {
> > System.err.println("Usage: JavaKafkaWordCount 
> 
> >  ");
> > System.exit(1);
> > }
> >
> > //StreamingExamples.setStreamingLogLevels();
> > //SparkConf sparkConf = new
> > SparkConf().setAppName("JavaKafkaWordCount");
> >
> > // Location of the Spark directory
> > String sparkHome = "/opt/mapr/spark/spark-1.0.2/";
> >
> > // URL of the Spark cluster
> > String sparkUrl = "spark://mymachine:7077";
> >
> > // Location of the required JAR files
> > String jarFiles =
> >
> "./spark-streaming-kafka_2.10-1.1.0.jar,./DlSpark-1.0-SNAPSHOT.jar,./zkclient-0.3.jar,./kafka_2.10-0.8.1.1.jar,./metrics-core-2.2.0.jar";
> >
> > SparkConf sparkConf = new SparkConf();
> > sparkConf.setAppName("JavaKafkaWordCount");
> > sparkConf.setJars(new String[]{jarFiles});
> > sparkConf.setMaster(sparkUrl);
> > sparkConf.set("spark.ui.port", "2348");
> > sparkConf.setSparkHome(sparkHome);
> >
> > Map kafkaParams = new HashMap();
> > kafkaParams.put("zookeeper.connect", "myedgenode:2181");
> > kafkaParams.put("group.id", "1");
> > kafkaParams.put("metadata.broker.list", "myedgenode:9092");
> > kafkaParams.put("serializer.class",
> > "kafka.serializer.StringEncoder");
> > kafkaParams.put("request.required.acks", "1");
> >
> > // Create the context with a 1 second batch size
> > JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
> new
> > Duration(2000));
> >
> > int numThreads = Integer.parseInt(args[3]);
> > Map topicMap = new HashMap();
> > String[] topics = args[2].split(",");
> > for (String topic: topics) {
> > topicMap.put(topic, numThreads);
> > }
> >
> > //JavaPairReceiverInputDStream messages =
> > //KafkaUtils.createStream(jssc, args[0], args[1],
> topicMap);
> > JavaPairDStream messages =
> > KafkaUtils.createStream(jssc,
> > String.class,
> > String.class,
> > StringDecoder.class,
> > StringDecoder.class,
> > kafkaParams,
> > topicMap,
> > StorageLevel.MEMORY_ONLY_SER());
> >
> >
> > JavaDStream lines = messages.map(new
> Function > String>, String>() {
> > @Override
> > public String call(Tuple2 tuple2) {
> > return tuple2._2();
> > }
> > });
> >
> > JavaDStream words = lines.flatMap(new
> > FlatMapFunction() {
> > @Override
> > public Iterable call(String x) {
> > return Lists.newArrayList(SPACE.split(x));
> > }
> > });
> >
> > JavaPairDStream wordCounts = words.mapToPair(
> > new PairFunction() {
> > @Override
> > public Tuple2 call(String s) {
> > return new Tuple2(s, 1);
> > }
> > }).reduceByKey(new Function2 Integer>() {
> > @Override
> > public Integer call(Integer i1, Integer i2) {
> > return i1 + i2;
> > }
> > });
> >
> > wordCounts.print();
> > jssc.start();
> > jssc.awaitTermination();
> >
>


thrift jdbc server probably running queries as hive query

2014-11-10 Thread Sadhan Sood
I was testing out the spark thrift jdbc server by running a simple query in
the beeline client. The spark itself is running on a yarn cluster.

However, when I run a query in beeline, I see no running jobs in the
Spark UI (completely empty) and the YARN UI seems to indicate that the
submitted query is being run as a map reduce job. This is probably also
indicated by the Spark logs, but I am not completely sure:

2014-11-11 00:19:00,492 INFO  ql.Context
(Context.java:getMRScratchDir(267)) - New scratch dir is
hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-1

2014-11-11 00:19:00,877 INFO  ql.Context
(Context.java:getMRScratchDir(267)) - New scratch dir is
hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-2

2014-11-11 00:19:04,152 INFO  ql.Context
(Context.java:getMRScratchDir(267)) - New scratch dir is
hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-2

2014-11-11 00:19:04,425 INFO  Configuration.deprecation
(Configuration.java:warnOnceIfDeprecated(1009)) - mapred.submit.replication
is deprecated. Instead, use mapreduce.client.submit.file.replication

2014-11-11 00:19:04,516 INFO  client.RMProxy
(RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager
at :8032

2014-11-11 00:19:04,607 INFO  client.RMProxy
(RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager
at :8032

2014-11-11 00:19:04,639 WARN  mapreduce.JobSubmitter
(JobSubmitter.java:copyAndConfigureFiles(150)) - Hadoop command-line option
parsing not performed. Implement the Tool interface and execute your
application with ToolRunner to remedy this

2014-11-11 00:00:08,806 INFO  input.FileInputFormat
(FileInputFormat.java:listStatus(287)) - Total input paths to process :
14912

2014-11-11 00:00:08,864 INFO  lzo.GPLNativeCodeLoader
(GPLNativeCodeLoader.java:(34)) - Loaded native gpl library

2014-11-11 00:00:08,866 INFO  lzo.LzoCodec (LzoCodec.java:(76)) -
Successfully loaded & initialized native-lzo library [hadoop-lzo rev
8e266e052e423af592871e2dfe09d54c03f6a0e8]

2014-11-11 00:00:09,873 INFO  input.CombineFileInputFormat
(CombineFileInputFormat.java:createSplits(413)) - DEBUG: Terminated node
allocation with : CompletedNodes: 1, size left: 194541317

2014-11-11 00:00:10,017 INFO  mapreduce.JobSubmitter
(JobSubmitter.java:submitJobInternal(396)) - number of splits:615

2014-11-11 00:00:10,095 INFO  mapreduce.JobSubmitter
(JobSubmitter.java:printTokens(479)) - Submitting tokens for job:
job_1414084656759_0115

2014-11-11 00:00:10,241 INFO  impl.YarnClientImpl
(YarnClientImpl.java:submitApplication(167)) - Submitted application
application_1414084656759_0115


It seems like the query is being run as a hive query instead of spark
query. The same query works fine when run from spark-sql cli.


Re: disable log4j for spark-shell

2014-11-10 Thread lordjoe
public static void main(String[] args) throws Exception {
 System.out.println("Set Log to Warn");
Logger rootLogger = Logger.getRootLogger();
rootLogger.setLevel(Level.WARN);
...
 works for me




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/disable-log4j-for-spark-shell-tp11278p18535.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: convert List to dstream

2014-11-10 Thread Tobias Pfeiffer
Josh,

On Tue, Nov 11, 2014 at 7:43 AM, Josh J  wrote:
>
> I have some data generated by some utilities that returns the results as
> a List. I would like to join this with a Dstream of strings. How
> can I do this? I tried the following though get scala compiler errors
>
> val list_scalaconverted =
> ssc.sparkContext.parallelize(listvalues.toArray())
>

Your `listvalues` seems to be a java.util.List, not a
scala.collection.immutable.List, right? In that case, toArray() will return
an Array[Object], not an Array[String], which leads to the error you see.
Have a look at
http://www.scala-lang.org/api/current/index.html#scala.collection.JavaConversions$
and convert your Java list to a Scala list.
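
For example, something like this should compile (it assumes, as in your snippet,
that `listvalues` is a java.util.List[String] and `ssc` is the StreamingContext):

import scala.collection.JavaConversions._
import scala.collection.mutable
import org.apache.spark.rdd.RDD

val scalaSeq: Seq[String] = listvalues.toSeq           // implicit Java -> Scala conversion
val rdd: RDD[String] = ssc.sparkContext.parallelize(scalaSeq)

// queueStream expects a scala.collection.mutable.Queue of RDDs, not a java.util.LinkedList
val queue = mutable.Queue(rdd)
val list_stream = ssc.queueStream(queue)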

Tobias


Re: Mapping SchemaRDD/Row to JSON

2014-11-10 Thread Tobias Pfeiffer
Akshat

On Tue, Nov 11, 2014 at 4:12 AM, Akshat Aranya  wrote:
>
> Does there exist a way to serialize Row objects to JSON.
>

I can't think of any other way than the one you proposed.  A Row is more or
less an Array[Object], so you need to read JSON key and data type from the
schema.

Tobias


Re: closure serialization behavior driving me crazy

2014-11-10 Thread Tobias Pfeiffer
Sandy,

On Mon, Nov 10, 2014 at 6:01 PM, Sandy Ryza  wrote:
>
> The result array is 1867 x 5. It serialized is 80k bytes, which seems
> about right:
>   scala> SparkEnv.get.closureSerializer.newInstance().serialize(arr)
>   res17: java.nio.ByteBuffer = java.nio.HeapByteBuffer[pos=0 lim=80027
> cap=80027]
>
> If I reference it from a simple function:
>   scala> def func(x: Long) => arr.length
>   scala> SparkEnv.get.closureSerializer.newInstance().serialize(func)
> I get a NotSerializableException.
>

(Are you in fact doing this from the Scala shell? Maybe try it from
compiled code.)

I may be wrong on that, but my understanding is that a "def" is a lot
different from a "val" (holding a function) in terms of serialization.  A
"def" (such as yours defined above) that is a method of a class does not
have a class file of its own, so it must be serialized with the whole
object that it belongs to (which can fail easily if you reference a
SparkContext etc.).
A "val" (such as `val func: Long => Long = x => arr.length`) will be
compiled into a class file of its own, so only that (pretty small) function
object will have to be serialized.  That may explain why some functions are
serializable while others with the same content aren't, and also the
difference in size.

Tobias


Question about RDD Union and SubtractByKey

2014-11-10 Thread Darin McBeath
I have the following code where I'm using RDD 'union' and 'subtractByKey' to 
create a new baseline RDD.  All of my RDDs are a key pair with the 'key' a 
String and the 'value' a String (xml document).
// **
// Merge the daily deletes/updates/adds with the baseline
// **

// Concat the Updates, Deletes into one PairRDD
JavaPairRDD updateDeletePairRDD = updatePairRDD.union(deletePairRDD);

// Remove the update/delete keys from the baseline
JavaPairRDD workSubtractBaselinePairRDD = baselinePairRDD.subtractByKey(updateDeletePairRDD);

// Add in the Adds
JavaPairRDD workAddBaselinePairRDD = workSubtractBaselinePairRDD.union(addPairRDD);

// Add in the Updates
JavaPairRDD newBaselinePairRDD = workAddBaselinePairRDD.union(updatePairRDD);

When I go to 'count' the newBaselinePairRDD:

// Output count for new baseline
log.info("Number of new baseline records: " + newBaselinePairRDD.count());
I'm getting the following exception (the above log.info is SparkSync.java:785).
What I find odd is the reference to Spark SQL.  So, I'm curious as to whether,
under the covers, the RDD union and/or subtractByKey are implemented with Spark
SQL. I wouldn't think so, but thought I would ask.  I'm also suspicious of the
reference to the '<' and whether that is because of the xml document in the
value portion of the key pair.  Any insights would be appreciated.  If there
are thoughts on how to better approach my problem (even debugging), I would be
interested in that as well.  The updatePairRDD, deletePairRDD, baselinePairRDD,
addPairRDD, and updateDeletePairRDD are all 'hashPartitioned'.
It's also a bit difficult to trace things because my application is a 'java' 
application and the stack references a lot of scala and very few references to 
my application other than one (SparkSync.java:785).  My application is using 
Spark SQL for some other tasks so perhaps an RDD (part) is being re-calculated 
and is resulting in this error.  But, based on other logging statements 
throughout my application, I don't believe this is the case.
Thanks.
Darin.
14/11/10 22:35:27 INFO scheduler.DAGScheduler: Failed to run count at SparkSync.java:785
14/11/10 22:35:27 WARN scheduler.TaskSetManager: Lost task 0.3 in stage 40.0 (TID 10674, ip-10-149-76-35.ec2.internal): com.fasterxml.jackson.core.JsonParseException: Unexpected character ('<' (code 60)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: java.io.StringReader@e8f759e; line: 1, column: 2]
        com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1524)
        com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:557)
        com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:475)
        com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1415)
        com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:679)
        com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3024)
        com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2971)
        com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2091)
        org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:275)
        org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:274)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryColumnarTableScan.scala:62)
        org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryColumnarTableScan.scala:50)
        org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)
        org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
        org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
        org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.ut

inconsistent edge counts in GraphX

2014-11-10 Thread Buttler, David
Hi,
I am building a graph from a large CSV file.  Each record contains a couple of 
nodes and about 10 edges.  When I try to load a large portion of the graph, 
using multiple partitions, I get inconsistent results in the number of edges 
between different runs.  However, if I use a single partition, or a small 
portion of the CSV file (say 1000 rows), then I get a consistent number of 
edges.  Is there anything I should be aware of as to why this could be 
happening in GraphX?

Thanks,
Dave



Re: To generate IndexedRowMatrix from an RowMatrix

2014-11-10 Thread buring
You should supply more information about your input data.
For example, I generate an IndexedRowMatrix from the ALS algorithm's input data
format; my code looks like this:

val inputData = sc.textFile(fname).map{
  line=>
val parts = line.trim.split(' ')
(parts(0).toLong,parts(1).toInt,parts(2).toDouble)
}   

val ncol = inputData.map(_._2).max()+1
val nrows = inputData.map(_._1).max()+1
logInfo(s"rows:$nrows,columns:$ncol")

val dataRows = inputData.groupBy(_._1).map[IndexedRow]{ row =>
  val (indices, values) = row._2.map(e => (e._2, e._3)).unzip
  new IndexedRow(row._1, new SparseVector(ncol, indices.toArray,
values.toArray))
}   

val svd  = new
IndexedRowMatrix(dataRows.persist(),nrows,ncol).computeSVD(rank,computeU =
true)
 
    If your input data has no index information, I think you should pay attention
to the ordering of rows in your RowMatrix; your matrix multiply should not
rely on the assumption that the RowMatrix rows are ordered.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/To-generate-IndexedRowMatrix-from-an-RowMatrix-tp18490p18541.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Mapping SchemaRDD/Row to JSON

2014-11-10 Thread Michael Armbrust
There is a JIRA for adding this:
https://issues.apache.org/jira/browse/SPARK-4228

Your described approach sounds reasonable.
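
In the meantime, a rough sketch of that approach (it assumes a flat schema of
primitive types and does only minimal string escaping; field names come from the
schema, values from the Row by position):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SchemaRDD

def toJsonStrings(schemaRDD: SchemaRDD): RDD[String] = {
  val fieldNames = schemaRDD.schema.fields.map(_.name)
  schemaRDD.map { row =>
    fieldNames.zip(row.toSeq).map { case (name, value) =>
      val v = value match {
        case null      => "null"
        case s: String => "\"" + s.replace("\"", "\\\"") + "\""
        case other     => other.toString       // numbers and booleans print as-is
      }
      "\"" + name + "\":" + v
    }.mkString("{", ",", "}")
  }
}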

On Mon, Nov 10, 2014 at 5:10 PM, Tobias Pfeiffer  wrote:

> Akshat
>
> On Tue, Nov 11, 2014 at 4:12 AM, Akshat Aranya  wrote:
>>
>> Does there exist a way to serialize Row objects to JSON.
>>
>
> I can't think of any other way than the one you proposed.  A Row is more
> or less an Array[Object], so you need to read JSON key and data type from
> the schema.
>
> Tobias
>
>
>


Checkpoint bugs in GraphX

2014-11-10 Thread Xu Lijie
Hi, all. I'm not sure whether someone has reported this bug:


There should be a checkpoint() method in EdgeRDD and VertexRDD as follows:

override def checkpoint(): Unit = { partitionsRDD.checkpoint() }


The current EdgeRDD and VertexRDD use *RDD.checkpoint()*, which only checkpoints
the edges/vertices but not the critical partitionsRDD.


Also, the variables (partitionsRDD and targetStorageLevel) in EdgeRDD and
VertexRDD should be transient.

class EdgeRDD[@specialized ED: ClassTag, VD: ClassTag](
    @transient val partitionsRDD: RDD[(PartitionID, EdgePartition[ED, VD])],
    @transient val targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
  extends RDD[Edge[ED]](partitionsRDD.context, List(new OneToOneDependency(partitionsRDD))) {

class VertexRDD[@specialized VD: ClassTag](
    @transient val partitionsRDD: RDD[ShippableVertexPartition[VD]],
    @transient val targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
  extends RDD[(VertexId, VD)](partitionsRDD.context, List(new OneToOneDependency(partitionsRDD))) {


These two bugs usually lead to stack overflow errors in iterative applications
written with GraphX.


Re: thrift jdbc server probably running queries as hive query

2014-11-10 Thread Cheng Lian

Hey Sadhan,

I really don't think this is Spark log... Unlike Shark, Spark SQL 
doesn't even provide a Hive mode to let you execute queries against 
Hive. Would you please check whether there is an existing HiveServer2 
running there? Spark SQL HiveThriftServer2 is just a Spark port of 
HiveServer2, and they share the same default listening port. I guess the 
Thrift server didn't start successfully because the HiveServer2 occupied 
the port, and your Beeline session was probably linked against HiveServer2.


Cheng

On 11/11/14 8:29 AM, Sadhan Sood wrote:
I was testing out the spark thrift jdbc server by running a simple 
query in the beeline client. The spark itself is running on a yarn 
cluster.


However, when I run a query in Beeline, I see no running jobs in the
Spark UI (it is completely empty), and the YARN UI seems to indicate that the
submitted query is being run as a MapReduce job. The logs below probably
indicate the same thing, but I am not completely sure:


2014-11-11 00:19:00,492 INFO  ql.Context 
(Context.java:getMRScratchDir(267)) - New scratch dir is 
hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-1


2014-11-11 00:19:00,877 INFO  ql.Context 
(Context.java:getMRScratchDir(267)) - New scratch dir is 
hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-2


2014-11-11 00:19:04,152 INFO  ql.Context 
(Context.java:getMRScratchDir(267)) - New scratch dir is 
hdfs://:9000/tmp/hive-ubuntu/hive_2014-11-11_00-19-00_367_3847629323646885865-2


2014-11-11 00:19:04,425 INFO Configuration.deprecation 
(Configuration.java:warnOnceIfDeprecated(1009)) - 
mapred.submit.replication is deprecated. Instead, use 
mapreduce.client.submit.file.replication


2014-11-11 00:19:04,516 INFO  client.RMProxy 
(RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager 
at :8032


2014-11-11 00:19:04,607 INFO  client.RMProxy 
(RMProxy.java:createRMProxy(92)) - Connecting to ResourceManager 
at :8032


2014-11-11 00:19:04,639 WARN mapreduce.JobSubmitter 
(JobSubmitter.java:copyAndConfigureFiles(150)) - Hadoop command-line 
option parsing not performed. Implement the Tool interface and execute 
your application with ToolRunner to remedy this


2014-11-11 00:00:08,806 INFO  input.FileInputFormat 
(FileInputFormat.java:listStatus(287)) - Total input paths to process 
: 14912


2014-11-11 00:00:08,864 INFO  lzo.GPLNativeCodeLoader 
(GPLNativeCodeLoader.java:(34)) - Loaded native gpl library


2014-11-11 00:00:08,866 INFO  lzo.LzoCodec 
(LzoCodec.java:(76)) - Successfully loaded & initialized 
native-lzo library [hadoop-lzo rev 
8e266e052e423af592871e2dfe09d54c03f6a0e8]


2014-11-11 00:00:09,873 INFO  input.CombineFileInputFormat 
(CombineFileInputFormat.java:createSplits(413)) - DEBUG: Terminated 
node allocation with : CompletedNodes: 1, size left: 194541317


2014-11-11 00:00:10,017 INFO  mapreduce.JobSubmitter 
(JobSubmitter.java:submitJobInternal(396)) - number of splits:615


2014-11-11 00:00:10,095 INFO  mapreduce.JobSubmitter 
(JobSubmitter.java:printTokens(479)) - Submitting tokens for job: 
job_1414084656759_0115


2014-11-11 00:00:10,241 INFO  impl.YarnClientImpl 
(YarnClientImpl.java:submitApplication(167)) - Submitted application 
application_1414084656759_0115



It seems like the query is being run as a Hive query instead of a Spark
query. The same query works fine when run from the spark-sql CLI.






Discuss how to do checkpoint more efficently

2014-11-10 Thread Xu Lijie
Hi, all. I want to seek suggestions on how to do checkpointing more
efficiently, especially for iterative applications written with GraphX.


For iterative applications, the lineage of a job can be very long, which
easily causes stack overflow errors. One solution is to checkpoint. However,
checkpointing is time-consuming and not easy for ordinary users to get right
(e.g., deciding which RDDs need checkpointing and when to checkpoint them).
Moreover, to shorten the lineage, iterative applications need to checkpoint
frequently (e.g., every 10 iterations). As a result, checkpointing is too
heavy for iterative applications, especially those written with GraphX.

I'm wondering if there is an elegant way to solve the problem: shortening
the lineage and also saving the intermediate data/results in a lightweight
way.

Maybe we can develop a new API like checkpoint(StorageLevel), which combines
the behavior of cache() with that of the current checkpoint().
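
For comparison, a minimal sketch of the usual workaround today (the names
"initial" and "update", the checkpoint directory, and the 10-iteration
interval are all illustrative): persist every iteration and checkpoint
periodically so the lineage stays short.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def iterate[T](sc: SparkContext, initial: RDD[T],
               update: RDD[T] => RDD[T], numIterations: Int): RDD[T] = {
  sc.setCheckpointDir("/tmp/spark-checkpoints")  // an HDFS path on a real cluster
  var current = initial
  for (i <- 1 to numIterations) {
    val previous = current
    current = update(current).persist(StorageLevel.MEMORY_AND_DISK)
    if (i % 10 == 0) {
      current.checkpoint()  // marks the RDD; data is written when it is computed
      current.count()       // force materialization so the lineage is truncated now
    }
    previous.unpersist(blocking = false)
  }
  current
}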




Examples:

The lineage is very long without checkpointing, even in the first iteration
of a GraphX job.

[Iter 1][DEBUG] (2) EdgeRDD[33] at RDD at EdgeRDD.scala:35
 |  EdgeRDD ZippedPartitionsRDD2[32] at zipPartitions at
ReplicatedVertexView.scala:114
 |  EdgeRDD MapPartitionsRDD[12] at mapPartitionsWithIndex at
EdgeRDD.scala:169
 |  MappedRDD[11] at map at Graph.scala:392
 |  MappedRDD[10] at distinct at KCoreCommonDebug.scala:115
 |  ShuffledRDD[9] at distinct at KCoreCommonDebug.scala:115
 +-(2) MappedRDD[8] at distinct at KCoreCommonDebug.scala:115
|  FilteredRDD[7] at filter at KCoreCommonDebug.scala:112
|  MappedRDD[6] at map at KCoreCommonDebug.scala:102
|  MappedRDD[5] at repartition at KCoreCommonDebug.scala:101
|  CoalescedRDD[4] at repartition at KCoreCommonDebug.scala:101
|  ShuffledRDD[3] at repartition at KCoreCommonDebug.scala:101
+-(2) MapPartitionsRDD[2] at repartition at KCoreCommonDebug.scala:101
   |  D:\graphData\verylarge.txt MappedRDD[1] at textFile at
KCoreCommonDebug.scala:100
   |  D:\graphData\verylarge.txt HadoopRDD[0] at textFile at
KCoreCommonDebug.scala:100
 |  ShuffledRDD[31] at partitionBy at ReplicatedVertexView.scala:112
 +-(2) ReplicatedVertexView.updateVertices - shippedVerts false false
(broadcast) MapPartitionsRDD[30] at mapPartitions at VertexRDD.scala:347
|  VertexRDD ZippedPartitionsRDD2[28] at zipPartitions at
VertexRDD.scala:174
|  VertexRDD, VertexRDD MapPartitionsRDD[18] at mapPartitions at
VertexRDD.scala:441
|  MapPartitionsRDD[17] at mapPartitions at VertexRDD.scala:457
|  ShuffledRDD[16] at ShuffledRDD at RoutingTablePartition.scala:36
+-(2) VertexRDD.createRoutingTables - vid2pid (aggregation)
MapPartitionsRDD[15] at mapPartitions at VertexRDD.scala:452
   |  EdgeRDD MapPartitionsRDD[12] at mapPartitionsWithIndex at
EdgeRDD.scala:169
   |  MappedRDD[11] at map at Graph.scala:392
   |  MappedRDD[10] at distinct at KCoreCommonDebug.scala:115
   |  ShuffledRDD[9] at distinct at KCoreCommonDebug.scala:115
   +-(2) MappedRDD[8] at distinct at KCoreCommonDebug.scala:115
  |  FilteredRDD[7] at filter at KCoreCommonDebug.scala:112
  |  MappedRDD[6] at map at KCoreCommonDebug.scala:102
  |  MappedRDD[5] at repartition at KCoreCommonDebug.scala:101
  |  CoalescedRDD[4] at repartition at KCoreCommonDebug.scala:101
  |  ShuffledRDD[3] at repartition at KCoreCommonDebug.scala:101
  +-(2) MapPartitionsRDD[2] at repartition at
KCoreCommonDebug.scala:101
 |  D:\graphData\verylarge.txt MappedRDD[1] at textFile at
KCoreCommonDebug.scala:100
 |  D:\graphData\verylarge.txt HadoopRDD[0] at textFile at
KCoreCommonDebug.scala:100
|  VertexRDD ZippedPartitionsRDD2[26] at zipPartitions at
VertexRDD.scala:200
|  VertexRDD, VertexRDD MapPartitionsRDD[18] at mapPartitions at
VertexRDD.scala:441
|  MapPartitionsRDD[17] at mapPartitions at VertexRDD.scala:457
|  ShuffledRDD[16] at ShuffledRDD at RoutingTablePartition.scala:36
+-(2) VertexRDD.createRoutingTables - vid2pid (aggregation)
MapPartitionsRDD[15] at mapPartitions at VertexRDD.scala:452
   |  EdgeRDD MapPartitionsRDD[12] at mapPartitionsWithIndex at
EdgeRDD.scala:169
   |  MappedRDD[11] at map at Graph.scala:392
   |  MappedRDD[10] at distinct at KCoreCommonDebug.scala:115
   |  ShuffledRDD[9] at distinct at KCoreCommonDebug.scala:115
   +-(2) MappedRDD[8] at distinct at KCoreCommonDebug.scala:115
  |  FilteredRDD[7] at filter at KCoreCommonDebug.scala:112
  |  MappedRDD[6] at map at KCoreCommonDebug.scala:102
  |  MappedRDD[5] at repartition at KCoreCommonDebug.scala:101
  |  CoalescedRDD[4] at repartition at KCoreCommonDebug.scala:101
  |  ShuffledRDD[3] at repartition at KCoreCommonDebug.scala:101
  +-(2) MapPartitionsRDD[2] at repartition at
KCoreCommonDebug.scala:101
 |  D:\graphData\verylarge.txt MappedRDD[1] at textFile at
KCoreCommonDebug.scala:100
 |  D:\graphData\ver

Re: Checkpoint bugs in GraphX

2014-11-10 Thread Xu Lijie
Nice. We are currently hitting a stack overflow error caused by this bug.

We also found that the constructor parameters "partitionsRDD" and
"targetStorageLevel" will not be serialized even without adding @transient.

However, @transient still seems to affect the JVM stack. Our guess is:

if we do not add @transient, references to "partitionsRDD" and
"targetStorageLevel" are kept on the stack; otherwise, the stack keeps no
information about the two variables during serialization/deserialization.

I'm wondering whether this guess is right.

2014-11-11 11:16 GMT+08:00 GuoQiang Li :

> I have been trying to fix this bug.
> The related PR:
> https://github.com/apache/spark/pull/2631
>
> -- Original --
> *From: * "Xu Lijie";;
> *Date: * Tue, Nov 11, 2014 10:19 AM
> *To: * "user"; "dev";
> *Subject: * Checkpoint bugs in GraphX
>
> Hi, all. I'm not sure whether someone has reported this bug:
>
>
> There should be a checkpoint() method in EdgeRDD and VertexRDD as follows:
>
> override def checkpoint(): Unit = { partitionsRDD.checkpoint() }
>
>
> Current EdgeRDD and VertexRDD use *RDD.checkpoint()*, which only checkpoints
> the edges/vertices but not the critical partitionsRDD.
>
>
> Also, the variables (partitionsRDD and targetStorageLevel) in EdgeRDD and
> VertexRDD should be transient.
>
> class EdgeRDD[@specialized ED: ClassTag, VD: ClassTag]( @transient val
> partitionsRDD: RDD[(PartitionID, EdgePartition[ED, VD])], @transient val
> targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY) extends
> RDD[Edge[ED]](partitionsRDD.context, List(new
> OneToOneDependency(partitionsRDD))) {
>
>
> class VertexRDD[@specialized VD: ClassTag]( @transient val partitionsRDD:
> RDD[ShippableVertexPartition[VD]], @transient val targetStorageLevel:
> StorageLevel = StorageLevel.MEMORY_ONLY) extends RDD[(VertexId,
> VD)](partitionsRDD.context, List(new OneToOneDependency(partitionsRDD))) {
>
>
> These two bugs usually lead to stack overflow errors in iterative applications
> written with GraphX.
>
>


Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-11-10 Thread Jianshi Huang
Hi Srinivas,

Here are the versions I'm using:

1.2.0-SNAPSHOT
1.3.2
1.3.0
org.spark-project.akka
2.3.4-spark

I'm using Spark built from master, so it's 1.2.0-SNAPSHOT.
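
For completeness, a minimal sbt sketch of that combination (artifact names,
versions, and the exclusion rule below are illustrative assumptions, not
copied from an actual build):

// build.sbt: drop the Akka that Spray pulls in, and mark Spark as "provided"
// so the cluster's org.spark-project.akka 2.3.4-spark is the one used at runtime.
libraryDependencies ++= Seq(
  ("io.spray" %% "spray-client" % "1.3.2")
    .excludeAll(ExclusionRule(organization = "com.typesafe.akka")),
  "org.apache.spark" %% "spark-core" % "1.2.0-SNAPSHOT" % "provided"
)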

Jianshi



On Tue, Nov 11, 2014 at 4:06 AM, Srinivas Chamarthi <
srinivas.chamar...@gmail.com> wrote:

> I am trying to use Spark with Spray and I have a dependency problem
> with quasiquotes. The issue comes up only when I include the Spark
> dependencies. I am not sure how this one can be excluded.
>
> Jianshi: can you let me know what version of spray + akka + spark are you
> using ?
>
> [error]    org.scalamacros:quasiquotes _2.10, _2.10.3
> [trace] Stack trace suppressed: run last *:update for the full output.
> [error] (*:update) Conflicting cross-version suffixes in:
> org.scalamacros:quasiquotes
>
>
> On Thu, Oct 30, 2014 at 9:50 PM, Jianshi Huang 
> wrote:
>
>> Hi Preshant, Chester, Mohammed,
>>
>> I switched to Spark's Akka and now it works well. Thanks for the help!
>>
>> (Need to exclude Akka from Spray dependencies, or specify it as provided)
>>
>>
>> Jianshi
>>
>>
>> On Thu, Oct 30, 2014 at 3:17 AM, Mohammed Guller 
>> wrote:
>>
>>>  I am not sure about that.
>>>
>>>
>>>
>>> Can you try a Spray version built with 2.2.x along with Spark 1.1 and
>>> include the Akka dependencies in your project’s sbt file?
>>>
>>>
>>>
>>> Mohammed
>>>
>>>
>>>
>>> *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
>>> *Sent:* Tuesday, October 28, 2014 8:58 PM
>>> *To:* Mohammed Guller
>>> *Cc:* user
>>> *Subject:* Re: Spray client reports Exception:
>>> akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext
>>>
>>>
>>>
>>> I'm using Spark built from HEAD, I think it uses modified Akka 2.3.4,
>>> right?
>>>
>>>
>>>
>>> Jianshi
>>>
>>>
>>>
>>> On Wed, Oct 29, 2014 at 5:53 AM, Mohammed Guller 
>>> wrote:
>>>
>>> Try a version built with Akka 2.2.x
>>>
>>>
>>>
>>> Mohammed
>>>
>>>
>>>
>>> *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com]
>>> *Sent:* Tuesday, October 28, 2014 3:03 AM
>>> *To:* user
>>> *Subject:* Spray client reports Exception:
>>> akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> I got the following exceptions when using Spray client to write to
>>> OpenTSDB using its REST API.
>>>
>>>
>>>
>>>   Exception in thread "pool-10-thread-2" java.lang.NoSuchMethodError:
>>> akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext;
>>>
>>>
>>>
>>> It worked locally in my Intellij but failed when I launch it from
>>> Spark-submit.
>>>
>>>
>>>
>>> Google suggested it's a compatibility issue in Akka. And I'm using
>>> latest Spark built from the HEAD, so the Akka used in Spark-submit is
>>> 2.3.4-spark.
>>>
>>>
>>>
>>> I tried both Spray 1.3.2 (built for Akka 2.3.6) and 1.3.1 (built for
>>> 2.3.4). Both failed with the same exception.
>>>
>>>
>>>
>>> Anyone has idea what went wrong? Need help!
>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Strange behavior of spark-shell while accessing hdfs

2014-11-10 Thread hmxxyy
I am trying out spark-shell on a single host and am seeing some strange
behavior.

If I run bin/spark-shell without connecting to a master, it can access an
HDFS file on a remote cluster with Kerberos authentication.

scala> val textFile =
sc.textFile("hdfs://*.*.*.*:8020/user/lih/drill_test/test.csv")
scala> textFile.count()
res0: Long = 9

However, if I start the master and a slave on the same host, connect using
bin/spark-shell --master spark://*.*.*.*:7077
and run the same commands:

scala> textFile.count()
14/11/11 05:00:23 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1,
stgace-launcher06.diy.corp.ne1.yahoo.com): java.io.IOException: Failed on
local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN, KERBEROS]; Host Details : local host is:
"*.*.*.*.com/98.138.236.95"; destination host is: "*.*.*.*":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1375)
at org.apache.hadoop.ipc.Client.call(Client.java:1324)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:225)
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy20.getBlockLocations(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1165)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1155)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1145)
at
org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:268)
at 
org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:235)
at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:228)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1318)
at
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:293)
at
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:289)
at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at
org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:289)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:108)
at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:233)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:210)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:657)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
at
org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:621)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:368)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1423)
at org.apache.hadoop.ipc.Client.call(Client.java:1342)
... 38 more
Caused by: org.apache.ha

Re: closure serialization behavior driving me crazy

2014-11-10 Thread Matei Zaharia
Hey Sandy,

Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the JVM to 
print the contents of the objects. In addition, something else that helps is to 
do the following:

{
  val _arr = arr            // copy the outer value into a local val
  models.map(... _arr ...)  // reference only the local copy inside the closure
}

Basically, copy the global variable into a local one. Then the field access 
from outside (from the interpreter-generated object that contains the line 
initializing arr) is no longer required, and the closure no longer has a 
reference to that.
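
Spelled out in the shell, the same trick looks roughly like this (the RDD and
array sizes are illustrative; sc is the SparkContext predefined by spark-shell):

scala> val arr = Array.fill(1867, 5)(0.0)   // stand-in for the regression results
scala> val rdd = sc.parallelize(1 to 1000)
scala> val lengths = {
     |   val localArr = arr              // local copy; only this val is captured
     |   rdd.map(_ => localArr.length)   // closure no longer drags in the REPL line object
     | }
scala> lengths.first()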

I'm really confused as to why Array.ofDim would be so large by the way, but are 
you sure you haven't flipped around the dimensions (e.g. it should be 5 x 1867)? 
A 5-double array will consume more than 5*8 bytes (probably something like 60 at 
least), and an array of those will still have a pointer to each one, so I'd 
expect that many of them to be more than 80 KB (which is very close to 
1867*5*8).

Matei

> On Nov 10, 2014, at 1:01 AM, Sandy Ryza  wrote:
> 
> I'm experiencing some strange behavior with closure serialization that is 
> totally mind-boggling to me.  It appears that two arrays of equal size take 
> up vastly different amounts of space inside closures if they're generated in 
> different ways.
> 
> The basic flow of my app is to run a bunch of tiny regressions using Commons 
> Math's OLSMultipleLinearRegression and then reference a 2D array of the 
> results from a transformation.  I was running into OOME's and 
> NotSerializableExceptions and tried to get closer to the root issue by 
> calling the closure serializer directly.
>   scala> val arr = models.map(_.estimateRegressionParameters()).toArray
> 
> The result array is 1867 x 5. Serialized, it is 80k bytes, which seems about 
> right:
>   scala> SparkEnv.get.closureSerializer.newInstance().serialize(arr)
>   res17: java.nio.ByteBuffer = java.nio.HeapByteBuffer[pos=0 lim=80027 
> cap=80027]
> 
> If I reference it from a simple function:
>   scala> val func = (x: Long) => arr.length
>   scala> SparkEnv.get.closureSerializer.newInstance().serialize(func)
> I get a NotSerializableException.
> 
> If I take pains to create the array using a loop:
>   scala> val arr = Array.ofDim[Double](1867, 5)
>   scala> for (s <- 0 until models.length) {
>   | arr(s) = models(s).estimateRegressionParameters()
>   | }
> Serialization works, but the serialized closure for the function is a 
> whopping 400MB.
> 
> If I pass in an array of the same length that was created in a different way, 
> the size of the serialized closure is only about 90K, which seems about right.
> 
> Naively, it seems like somehow the history of how the array was created is 
> having an effect on what happens to it inside a closure.
> 
> Is this expected behavior?  Can anybody explain what's going on?
> 
> any insight very appreciated,
> Sandy


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: how to use JNI in spark?

2014-11-10 Thread tangweihan
You just need to pass --driver-library-path with the directory containing the
native library in your spark-submit command. And on each worker node, make sure
the library is available in the right work directory.
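
For example (class, paths, and jar name are illustrative;
spark.executor.extraLibraryPath is one way to cover the executors):

spark-submit --class com.example.MyApp \
  --driver-library-path /opt/myapp/native \
  --conf spark.executor.extraLibraryPath=/opt/myapp/native \
  myapp.jar

and make sure the same native .so files exist under that path on the driver
and on every worker node.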



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-use-JNI-in-spark-tp530p18551.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: which is the recommended workflow engine for Apache Spark jobs?

2014-11-10 Thread Adamantios Corais
Hi again,

As Jimmy said, any thoughts about Luigi and/or any other tools? So far it
seems that Oozie is the best and only choice here. Is that right?

On Mon, Nov 10, 2014 at 8:43 PM, Jimmy McErlain 
wrote:

> I have used Oozie for all our workflows with Spark apps, but you will have
> to use a java action as the workflow element. I am interested in anyone's
> experience with Luigi and/or any other tools.
>
>
> On Mon, Nov 10, 2014 at 10:34 AM, Adamantios Corais <
> adamantios.cor...@gmail.com> wrote:
>
>> I have some previous experience with Apache Oozie while I was developing
>> in Apache Pig. Now, I am working explicitly with Apache Spark and I am
>> looking for a tool with similar functionality. Is Oozie recommended? What
>> about Luigi? What do you use / recommend?
>>
>
>
>
> --
>
>
> "Nothing under the sun is greater than education. By educating one person
> and sending him/her into the society of his/her generation, we make a
> contribution extending a hundred generations to come."
> -Jigoro Kano, Founder of Judo-
>