Re: Spark: Could not load native gpl library

2014-08-12 Thread Andrew Ash
Hi Jikai, The reason I ask is because your stacktrace has this section in it: com.hadoop.compression.lzo.GPLNativeCodeLoader.<clinit>(GPLNativeCodeLoader.java:32) at com.hadoop.compression.lzo.LzoCodec.<clinit>(LzoCodec.java:71) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(C

Re: Parallelizing a task makes it freeze

2014-08-12 Thread sparkuser2345
Actually the program hangs just by calling dataAllRDD.count(). I suspect creating the RDD is not successful when its elements are too big. When nY = 3000, dataAllRDD.count() works (each element of dataAll = 3000*400*64 bits = 9.6 MB), but when nY = 4000, it hangs (4000*400*64 bits = 12.8 MB). Wha
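For reference, a rough sketch of the setup as described (nY, nX, the element type, and the number of elements are assumptions based on the sizes quoted above). One suspect worth checking, though it is not confirmed in this thread, is the spark.akka.frameSize limit, which defaulted to 10 MB at the time: parallelize() ships each element from the driver inside the tasks, and 10 MB falls exactly between the 9.6 MB case that works and the 12.8 MB case that hangs.

    import scala.util.Random

    val nY = 4000   // hangs according to the message; 3000 works
    val nX = 400
    // dataAll: a driver-side collection whose elements are nY x nX Double matrices
    val dataAll = Seq.fill(8)(Array.fill(nY, nX)(Random.nextDouble()))
    val dataAllRDD = sc.parallelize(dataAll, numSlices = dataAll.size)
    dataAllRDD.count()   // reported to hang once each element reaches ~12.8 MB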

Re: Transform RDD[List]

2014-08-12 Thread Sean Owen
-incubator, +user So, are you trying to "transpose" your data?
val rdd = sc.parallelize(List(List(1,2,3,4,5),List(6,7,8,9,10))).repartition(2)
First you could pair each value with its position in its list:
val withIndex = rdd.flatMap(_.zipWithIndex)
then group by that position, and discard the

Killing spark app problem

2014-08-12 Thread Grzegorz Białek
Hi, when I run a Spark application on my local machine using spark-submit: $SPARK_HOME/bin/spark-submit --driver-memory 1g When I want to interrupt the computation with Ctrl-C, it interrupts the current stage but then waits and exits after around 5 minutes, and sometimes doesn't exit at all, and the only way

Re: java.lang.StackOverflowError when calling count()

2014-08-12 Thread Tathagata Das
The long lineage causes a long/deep Java object tree (DAG of RDD objects), which needs to be serialized as part of the task creation. When serializing, the whole object DAG needs to be traversed leading to the stackoverflow error. TD On Mon, Aug 11, 2014 at 7:14 PM, randylu wrote: > hi, TD. I
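One common way to keep the lineage, and hence the serialized object graph, shallow is periodic checkpointing. A minimal sketch (an addition here, not quoted from TD's reply; the checkpoint directory and interval are arbitrary):

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // any reliable storage path

    var rdd = sc.parallelize(1 to 1000000).map(_.toLong)
    for (i <- 1 to 1000) {
      rdd = rdd.map(_ + 1)
      if (i % 50 == 0) {
        rdd.checkpoint()   // truncate the lineage every 50 iterations
        rdd.count()        // force materialization so the checkpoint actually happens
      }
    }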

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread Xiangrui Meng
Assuming that your data is very sparse, I would recommend RDD.repartition. But if it is not the case and you don't want to shuffle the data, you can try a CombineInputFormat and then parse the lines into labeled points. Coalesce may cause locality problems if you didn't use the right number of part
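A minimal sketch of the repartition route (the input path, partition count, and iteration count are placeholders):

    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    val points = MLUtils.loadLibSVMFile(sc, "hdfs:///data/training.libsvm")
      .repartition(200)   // shuffles once, but spreads the sparse data evenly
      .cache()
    val model = LogisticRegressionWithSGD.train(points, 100)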

Re: How to save mllib model to hdfs and reload it

2014-08-12 Thread Xiangrui Meng
For linear models, the constructors are now public. You can save the weights to HDFS, then load the weights back and use the constructor to create the model. -Xiangrui On Mon, Aug 11, 2014 at 10:27 PM, XiaoQinyu wrote: > hello: > > I want to know,if I use history data to training model and I want
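A sketch of that approach for a logistic regression model, assuming a build where the constructor is public as described; the object-file format and paths are illustrative, and model is an already-trained LogisticRegressionModel:

    import org.apache.spark.mllib.classification.LogisticRegressionModel
    import org.apache.spark.mllib.linalg.Vector

    // save: persist just the parameters
    sc.parallelize(Seq((model.weights, model.intercept)), 1)
      .saveAsObjectFile("hdfs:///models/lr-model")

    // load: read the parameters back and rebuild the model via the public constructor
    val (weights, intercept) = sc.objectFile[(Vector, Double)]("hdfs:///models/lr-model").first()
    val restored = new LogisticRegressionModel(weights, intercept)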

Re: Using very large files for KMeans training -- cluster centers size?

2014-08-12 Thread Xiangrui Meng
What did you set for driver memory? The default value is 256m or 512m, which is too small. Try to set "--driver-memory 10g" with spark-submit or spark-shell and see whether it works or not. -Xiangrui On Mon, Aug 11, 2014 at 6:26 PM, durin wrote: > I'm trying to apply KMeans training to some text

Re: Transform RDD[List]

2014-08-12 Thread Kevin Jung
Thanks for your answer. Yes, I want to transpose data. At this point, I have one more question. I tested it with RDD1:
List(1, 2, 3, 4, 5)
List(6, 7, 8, 9, 10)
List(11, 12, 13, 14, 15)
List(16, 17, 18, 19, 20)
And the result is...
ArrayBuffer(11, 1, 16, 6)
ArrayBuffer(2, 12, 7, 17)
ArrayBuffer(3,

RE: [spark-streaming] kafka source and flow control

2014-08-12 Thread Gwenhael Pasquiers
I was hoping I could make the system behave like a blocking queue: if the output is too slow, the buffers (storage space for RDDs) fill up and then block, instead of dropping existing RDDs, until the input itself blocks (slows down its consumption). On a side note I was wondering: is there the same

Re: Transform RDD[List]

2014-08-12 Thread Sean Owen
Sure, just add ".toList.sorted" in there. Putting together in one big expression:
val rdd = sc.parallelize(List(List(1,2,3,4,5), List(6,7,8,9,10)))
val result = rdd.flatMap(_.zipWithIndex).groupBy(_._2).values.map(_.map(_._1).toList.sorted)
List(2, 7)
List(1, 6)
List(4, 9)
List(3, 8)
List(5, 10)

Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-12 Thread chutium
spark.speculation was not set, any speculative execution on tachyon side? tachyon-env.sh only changed the following:
export TACHYON_MASTER_ADDRESS=test01.zala
#export TACHYON_UNDERFS_ADDRESS=$TACHYON_HOME/underfs
export TACHYON_UNDERFS_ADDRESS=hdfs://test01.zala:8020
export TACHYON_WORKER_MEMORY_SIZE=

Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-12 Thread chutium
More interesting is that if spark-shell is started on the master node (test01), then
parquetFile.saveAsParquetFile("tachyon://test01.zala:19998/parquet_tablex")
14/08/12 11:42:06 INFO : initialize(tachyon://... ... ...
14/08/12 11:42:06 INFO : File does not exist: tachyon://test01.zala:19998/parquet_tablex/_

RE: KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-12 Thread Ge, Yao (Y.)
I figured it out. My indices parameters for the sparse vector were messed up. It was a good lesson for me: when using Vectors.sparse(int size, int[] indices, double[] values) to generate a vector, size is the size of the whole vector, not just the number of elements with a value. The indices a
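A small illustration of that distinction (the values are made up):

    import org.apache.spark.mllib.linalg.Vectors

    // A vector of total length 10 with non-zeros at positions 1, 4 and 9.
    // size = 10 (length of the whole vector), NOT 3 (the number of non-zero entries);
    // indices must be zero-based, strictly increasing, and each < size.
    val v = Vectors.sparse(10, Array(1, 4, 9), Array(2.0, 5.0, 1.5))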

Re: java.lang.StackOverflowError when calling count()

2014-08-12 Thread randylu
hi, TD. Thanks very much! I got it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-StackOverflowError-when-calling-count-tp5649p11980.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

LDA in MLBase

2014-08-12 Thread Aslan Bekirov
Hi All, I have a question regarding LDA topic modeling. I need to do topic modeling on ad data. Does MLBase support LDA topic modeling? Or is there any stable, tested LDA implementation on Spark? BR, Aslan

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread ZHENG, Xu-dong
Hi Xiangrui, Thanks for your reply! Yes, our data is very sparse, but RDD.repartition invokes RDD.coalesce(numPartitions, shuffle = true) internally, so I think it has the same effect as #2, right? As for CombineInputFormat, although I haven't tried it, it sounds like it will combine multiple

Fwd: how to split RDD by key and save to different path

2014-08-12 Thread Fengyun RAO
1. Be careful, HDFS is better for large files, not bunches of small files. 2. If that's really what you want, roll your own:
def writeLines(iterator: Iterator[(String, String)]) = {
  val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map
  try {
    while (iterator.hasN
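A hedged completion of that idea, writing one local file per key from each partition; the output directory is a placeholder, error handling is omitted, and pairRdd is assumed to be an RDD[(String, String)]. For HDFS you would open the streams through the Hadoop FileSystem API instead.

    import java.io.{BufferedWriter, FileWriter}
    import scala.collection.mutable

    def writeLines(iterator: Iterator[(String, String)], dir: String): Unit = {
      val writers = mutable.HashMap.empty[String, BufferedWriter]   // (key, writer)
      try {
        iterator.foreach { case (key, line) =>
          val writer = writers.getOrElseUpdate(key,
            new BufferedWriter(new FileWriter(s"$dir/$key.txt", true)))
          writer.write(line)
          writer.newLine()
        }
      } finally {
        writers.values.foreach(_.close())
      }
    }

    pairRdd.foreachPartition(it => writeLines(it, "/tmp/output"))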

Re: Spark SQL JDBC

2014-08-12 Thread John Omernik
Yin helped me with that, and I appreciate the on-list follow-up. A few questions: Why is this the case? I guess, does building it with the thrift server add much more time/size to the final build? It seems that unless this is documented well, people will miss it and this situation will occur. Why would we no

Re: saveAsTextFiles file not found exception

2014-08-12 Thread Chen Song
Thanks for putting this together, Andrew. On Tue, Aug 12, 2014 at 2:11 AM, Andrew Ash wrote: > Hi Chen, > > Please see the bug I filed at > https://issues.apache.org/jira/browse/SPARK-2984 with the > FileNotFoundException on _temporary directory issue. > > Andrew > > > On Mon, Aug 11, 2014 at 1

Re: anaconda and spark integration

2014-08-12 Thread Oleg Ruchovets
Hello. Is there an integration of Spark (PySpark) with Anaconda? I googled a lot and didn't find relevant information. Could you please point me to a tutorial or simple example? Thanks in advance, Oleg.

RE: Benchmark on physical Spark cluster

2014-08-12 Thread Mozumder, Monir
An on-list follow up: http://prof.ict.ac.cn/BigDataBench/#Benchmarks looks promising as it has spark as one of the platforms used. Bests, -Monir From: Mozumder, Monir Sent: Monday, August 11, 2014 7:18 PM To: user@spark.apache.org Subject: Benchmark on physical Spark cluster I am trying to get

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread Xiangrui Meng
Sorry, I missed #2. My suggestion is the same as #2. You need to set a bigger numPartitions to avoid hitting integer bound or 2G limitation, at the cost of increased shuffle size per iteration. If you use a CombineInputFormat and then cache, it will try to give you roughly the same size per partiti
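A sketch of the CombineInputFormat route (this assumes a Hadoop 2.x client that ships CombineTextInputFormat; the path and the 256 MB split cap are arbitrary examples):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

    val conf = new Configuration(sc.hadoopConfiguration)
    // cap each combined split at 256 MB so partitions come out roughly equal in size
    conf.set("mapreduce.input.fileinputformat.split.maxsize", (256L * 1024 * 1024).toString)

    val lines = sc.newAPIHadoopFile(
        "hdfs:///data/training/*",
        classOf[CombineTextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        conf)
      .map(_._2.toString)   // then parse each line into a LabeledPoint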

Re: Spark SQL JDBC

2014-08-12 Thread Michael Armbrust
Hive pulls in a ton of dependencies that we were afraid would break existing spark applications. For this reason all hive submodules are optional. On Tue, Aug 12, 2014 at 7:43 AM, John Omernik wrote: > Yin helped me with that, and I appreciate the onlist followup. A few > questions: Why is th

Reference External Variables in Map Function (Inner class)

2014-08-12 Thread Sunny Khatri
I have a class defining an inner static class (map function). The inner class tries to refer to a variable instantiated in the outer class, which results in a NullPointerException. Sample code as follows:
class SampleOuterClass {
  private static ArrayList someVariable;
  SampleOuterClass

Re: Reference External Variables in Map Function (Inner class)

2014-08-12 Thread Sean Owen
I don't think static members are going to be serialized in the closure? The instance of Parse will be looking at its local SampleOuterClass, which may not be initialized on the remote JVM. On Tue, Aug 12, 2014 at 6:02 PM, Sunny Khatri wrote: > I have a class defining an inner static class (map

Re: Reference External Variables in Map Function (Inner class)

2014-08-12 Thread Sunny Khatri
Are there any other workarounds that could be used to pass in the values from someVariable to the transformation function? On Tue, Aug 12, 2014 at 10:48 AM, Sean Owen wrote: > I don't think static members are going to be serialized in the > closure? the instance of Parse will be looking at i

Re: Reference External Variables in Map Function (Inner class)

2014-08-12 Thread Marcelo Vanzin
You could create a copy of the variable inside your "Parse" class; that way it would be serialized with the instance you create when calling map() below. On Tue, Aug 12, 2014 at 10:56 AM, Sunny Khatri wrote: > Are there any other workarounds that could be used to pass in the values > from someVar
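The code in the thread is Java, but the same pattern sketched in Scala (names are placeholders, and rdd is assumed to be an RDD[String]): pass the value into the function object's constructor so it is serialized along with the instance handed to map(), instead of reading a static or outer field that may not be initialized on the executors.

    // Hypothetical stand-in for the inner "Parse" class from the thread
    class Parse(someVariable: Seq[String]) extends (String => String) with Serializable {
      def apply(record: String): String =
        if (someVariable.contains(record)) record.toUpperCase else record
    }

    val someVariable = Seq("a", "b", "c")           // initialized on the driver
    val result = rdd.map(new Parse(someVariable))   // the copy travels with the instance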

DistCP - Spark-based

2014-08-12 Thread Gary Malouf
We are probably still the minority, but our analytics platform based on Spark + HDFS does not have map/reduce installed. I'm wondering if there is a distcp equivalent that leverages Spark to do the work. Our team is trying to find the best way to do cross-datacenter replication of our HDFS data t
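A very rough sketch of what a Spark-driven copy could look like, using the Hadoop FileSystem API from the executors (a Hadoop 2.x client is assumed, the namenode URIs are placeholders, and permissions, retries, and nested directories are ignored):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
    import java.net.URI

    val srcBase = "hdfs://dc1-nn:8020/data"
    val dstBase = "hdfs://dc2-nn:8020/data"

    // list the files to copy on the driver
    val conf = new Configuration(sc.hadoopConfiguration)
    val srcFs = FileSystem.get(URI.create(srcBase), conf)
    val files = srcFs.listStatus(new Path(srcBase)).filter(_.isFile).map(_.getPath.toString)

    // copy them in parallel from the executors
    sc.parallelize(files, files.length max 1).foreach { file =>
      val c = new Configuration()   // built on the executor; Configuration is not serializable
      val src = new Path(file)
      val dst = new Path(file.replace(srcBase, dstBase))
      FileUtil.copy(src.getFileSystem(c), src, dst.getFileSystem(c), dst, false, c)
    }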

Re: Reference External Variables in Map Function (Inner class)

2014-08-12 Thread Sunny Khatri
That should work. Gonna give it a try. Thanks ! On Tue, Aug 12, 2014 at 11:01 AM, Marcelo Vanzin wrote: > You could create a copy of the variable inside your "Parse" class; > that way it would be serialized with the instance you create when > calling map() below. > > On Tue, Aug 12, 2014 at 10:

Jobs get stuck at reduceByKey stage with spark 1.0.1

2014-08-12 Thread Shivani Rao
Hello spark aficionados, We upgraded from spark 1.0.0 to 1.0.1 when the new release came out and started noticing some weird errors. Even a simple operation like "reduceByKey" or "count" on an RDD gets stuck in "cluster mode". This issue does not occur with spark 1.0.0 (in cluster or local mode)

Re: Spark Hbase job taking long time

2014-08-12 Thread Amit Singh Hora
Hi, Today I created a table with 3 regions and 2 job trackers, but the Spark job is still taking a lot of time. I also noticed that the client's memory was increasing linearly; is the Spark job first bringing the complete data into memory? On Thu, Aug 7, 2014 at 7:31 PM, Ted Yu [vi

Spark Streaming example on your mesos cluster

2014-08-12 Thread Zia Syed
Hi, I'm trying to run the streaming.NetworkWordCount example on a Mesos cluster (0.19.1). I am able to run the SparkPi example on my Mesos cluster and can run the streaming example in local[n] mode, but no luck so far with the Spark streaming examples running on my Mesos cluster. I don't particularly see any

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-12 Thread Yin Huai
Hi Jenny, Have you copied hive-site.xml to spark/conf directory? If not, can you put it in conf/ and try again? Thanks, Yin On Mon, Aug 11, 2014 at 8:57 PM, Jenny Zhao wrote: > > Thanks Yin! > > here is my hive-site.xml, which I copied from $HIVE_HOME/conf, didn't > experience problem conne

Re: DistCP - Spark-based

2014-08-12 Thread Matei Zaharia
Good question; I don't know of one but I believe people at Cloudera had some thoughts of porting Sqoop to Spark in the future, and maybe they'd consider DistCP as part of this effort. I agree it's missing right now. Matei On August 12, 2014 at 11:04:28 AM, Gary Malouf (malouf.g...@gmail.com) wr

Task closures and synchronization

2014-08-12 Thread Tom Vacek
This is a back-to-basics question. How do we know when Spark will clone an object and distribute it with task closures versus synchronize access to it? For example, the old rookie mistake of random number generation:
import scala.util.Random
val randRDD = sc.parallelize(0 until 1000).map(ii => R
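The snippet above is cut off, so as a reconstruction of the usual illustration (not a quote, and using java.util.Random since it is unambiguously serializable): the driver-side generator is serialized into the closure, so every task starts from an identical copy rather than sharing one synchronized instance; creating a generator per partition inside the task avoids that surprise.

    import java.util.Random

    // The driver-side instance is serialized with the closure: every task starts
    // from an identical copy, so identical seeds produce identical sequences.
    val rng = new Random(42)
    val cloned = sc.parallelize(0 until 1000).map(i => rng.nextInt())

    // Per-partition generators created inside the task sidestep the shared-copy question.
    val perPartition = sc.parallelize(0 until 1000).mapPartitionsWithIndex { (idx, iter) =>
      val localRng = new Random(42 + idx)
      iter.map(_ => localRng.nextInt())
    }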

how to access workers from spark context

2014-08-12 Thread S. Zhou
I tried to access worker info from the Spark context, but it seems the Spark context does not expose such an API. The reason for doing that is that the Spark context itself does not seem to have logic to detect whether its workers are dead, so I would like to add such logic myself. BTW, it seems the Spark web UI has

Re: how to split RDD by key and save to different path

2014-08-12 Thread 诺铁
Understood, thank you. Small files are a problem; I am considering processing the data before putting it in HDFS. On Tue, Aug 12, 2014 at 9:37 PM, Fengyun RAO wrote: > 1. be careful, HDFS are better for large files, not bunches of small files. > > 2. if that's really what you want, roll it your own. > > de

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-12 Thread Jenny Zhao
Hi Yin, hive-site.xml was copied to spark/conf and is the same as the one under $HIVE_HOME/conf. Through the Hive CLI, I don't see any problem, but for Spark in yarn-cluster mode I am not able to switch to a database other than the default one; in yarn-client mode it works fine. Thanks! Jenny On

Re: spark.files.userClassPathFirst=true Not Working Correctly

2014-08-12 Thread Marcelo Vanzin
Hi, sorry for the delay. Would you have yarn available to test? Given the discussion in SPARK-2878, this might be a different incarnation of the same underlying issue. The option in Yarn is spark.yarn.user.classpath.first On Mon, Aug 11, 2014 at 1:33 PM, DNoteboom wrote: > I'm currently running
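For reference, a minimal way to set that flag from code (the property name is taken from the message above; the app name is a placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("userClasspathFirstTest")
      .set("spark.yarn.user.classpath.first", "true")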

Fwd: Task closures and synchronization

2014-08-12 Thread Tobias Pfeiffer
Uh, for some reason I don't seem to automatically reply to the list any more. Here is again my message to Tom. -- Forwarded message -- Tom, On Wed, Aug 13, 2014 at 5:35 AM, Tom Vacek wrote: > This is a back-to-basics question. How do we know when Spark will clone > an object a

Re: Spark Streaming example on your mesos cluster

2014-08-12 Thread Tobias Pfeiffer
Hi, On Wed, Aug 13, 2014 at 4:24 AM, Zia Syed wrote: > > I dont particularly see any errors on my logs, either on console, or on > slaves. I see slave downloads the spark-1.0.2-bin-hadoop1.tgz file and > unpacks them as well. Mesos Master shows quiet alot of Tasks created and > Finished. I dont

Running time bottleneck on a few worker

2014-08-12 Thread Bin
Hi All, I ran into a problem where, for each stage, most workers finished fast (around 1 min), but a few workers took around 7 min to finish, which significantly slows down the process. As shown below, the running time is very unevenly distributed across workers. I wonder whether this is normal? Is

Re: Mllib : Save SVM model to disk

2014-08-12 Thread Hoai-Thu Vuong
You should try watching this video: https://www.youtube.com/watch?v=sPhyePwo7FA. For more details, please search in the archives; I had the same kind of question and others helped me solve the problem. On Tue, Aug 12, 2014 at 12:36 PM, XiaoQinyu wrote: > Have you solved this problem?? > >

training recsys model

2014-08-12 Thread Hoai-Thu Vuong
In MLlib, I found the method to train a matrix factorization model to predict a user's taste. In this function, there are some parameters such as lambda and rank; I cannot find the best values to set for these parameters or how to optimize them. Could you please give me some recommendations? -- T

OutOfMemor​yError when spark streaming receive flume events

2014-08-12 Thread jason chen
I checked the javacore file, there is:
Dump Event "systhrow" (0004) Detail "java/lang/OutOfMemoryError" "Java heap space" received
After checking the failure thread, I found it occurs in the SparkFlumeEvent.readExternal() method:
71 for (i <- 0 until numHeaders) {
72 val keyLength = in.rea

Re: how to access workers from spark context

2014-08-12 Thread S. Zhou
Sometimes workers are dead, but the Spark context does not know it and still sends jobs. On Tuesday, August 12, 2014 7:14 PM, Stanley Shi wrote: Why do you need to detect the worker status in the application? Your application generally doesn't need to know where it is executed. On Wed, Aug 13, 2

Re: how to access workers from spark context

2014-08-12 Thread Stanley Shi
This seems a bug, right? It's not the user's responsibility to manage the workers. On Wed, Aug 13, 2014 at 11:28 AM, S. Zhou wrote: > Sometimes workers are dead but spark context does not know it and still > send jobs. > > > On Tuesday, August 12, 2014 7:14 PM, Stanley Shi > wrote: > > > Why

Re: how to access workers from spark context

2014-08-12 Thread S. Zhou
Actually, if you search the Spark mail archives you will find many similar topics. For now, I just want to manage it myself. On Tuesday, August 12, 2014 8:46 PM, Stanley Shi wrote: This seems a bug, right? It's not the user's responsibility to manage the workers. On Wed, Aug 13,

Re: anaconda and spark integration

2014-08-12 Thread Oleg Ruchovets
Hi Nick, Thank you for the link. Do I understand correctly that the AMI is for the cloud only? Currently we are NOT on the cloud. Is there an option to use Anaconda with Spark when not on the Amazon cloud? Thanks On Wed, Aug 13, 2014 at 12:38 AM, Nick Pentreath wrote: > You may want to take a look at this thread

Re: Spark SQL JDBC

2014-08-12 Thread ZHENG, Xu-dong
Hi Cheng, I also ran into some issues when trying to start the ThriftServer based on a build from the master branch (I could successfully run it from the branch-1.0-jdbc branch). Below is my build command: ./make-distribution.sh --skip-java-test -Phadoop-2.4 -Phive -Pyarn -Dyarn.version=2.4.0 -Dhadoop.version=2.

Re: Spark SQL JDBC

2014-08-12 Thread ZHENG, Xu-dong
Just found that this is because the lines below in make-distribution.sh don't work:
if [ "$SPARK_HIVE" == "true" ]; then
  cp "$FWDIR"/lib_managed/jars/datanucleus*.jar "$DISTDIR/lib/"
fi
There is no definition of $SPARK_HIVE in make-distribution.sh; I have to set it explicitly. On Wed, Aug 13, 2014 at

Re: Spark SQL JDBC

2014-08-12 Thread Cheng Lian
Oh, thanks for reporting this. This should be a bug; since SPARK_HIVE was deprecated, we shouldn't rely on it any more. On Wed, Aug 13, 2014 at 1:23 PM, ZHENG, Xu-dong wrote: > Just find this is because below lines in make_distribution.sh doesn't work: > > if [ "$SPARK_HIVE" == "true" ]; then

Re: training recsys model

2014-08-12 Thread Xiangrui Meng
You can define an evaluation metric first and then use a grid search to find the best set of training parameters. Ampcamp has a tutorial showing how to do this for ALS: http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html -Xiangrui On Tue, Aug 12, 2014 at 8:01 PM,
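A condensed sketch of that recipe, grid-searching rank and lambda against RMSE on a held-out split (ratings is assumed to be an RDD[Rating]; the split ratio, iteration count, and parameter grids are arbitrary):

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
    import org.apache.spark.rdd.RDD

    def rmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
      val predictions = model.predict(data.map(r => (r.user, r.product)))
        .map(p => ((p.user, p.product), p.rating))
      val actual = data.map(r => ((r.user, r.product), r.rating))
      math.sqrt(actual.join(predictions).values
        .map { case (a, p) => (a - p) * (a - p) }.mean())
    }

    val Array(training, validation) = ratings.randomSplit(Array(0.8, 0.2), seed = 42L)
    training.cache(); validation.cache()

    // try every (rank, lambda) combination and keep the one with the lowest validation RMSE
    val best = (for {
      rank   <- Seq(10, 20, 50)
      lambda <- Seq(0.01, 0.1, 1.0)
    } yield {
      val model = ALS.train(training, rank, 20, lambda)
      ((rank, lambda), rmse(model, validation))
    }).minBy(_._2)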

Re: Running time bottleneck on a few worker

2014-08-12 Thread Akhil Das
This happens when you are playing around with sortByKey, mapPartitions, groupBy, reduceByKey and similar operations. One thing you can try is providing the number of partitions (possibly > 2x the number of CPUs) while doing these operations. Thanks, Best Regards On Wed, Aug 13, 2014 at 7:54 AM, Bin wrote:
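For instance (pairs is assumed to be an RDD[(String, Int)]; the right partition count is workload-dependent):

    val numPartitions = 2 * sc.defaultParallelism         // roughly 2x the available cores
    val counts = pairs.reduceByKey(_ + _, numPartitions)  // same idea for groupByKey/sortByKey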