RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-24 Thread Judy Nash
This is what I got from jar tf: org/spark-project/guava/common/base/Preconditions.class org/spark-project/guava/common/math/MathPreconditions.class com/clearspring/analytics/util/Preconditions.class parquet/Preconditions.class I seem to have the line that reported missing, but I am missing this fi

Re: How to insert complex types like map> in spark sql

2014-11-24 Thread critikaled
Thanks for the reply Michael, here is the stack trace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 1 times, most recent failure: Lost task 3.0 in stage 0.0 (TID 3, localhost): scala.MatchError: MapType(StringType,StringType,true) (of class org.apache.

Re: Spark performance optimization examples

2014-11-24 Thread Akhil Das
Here are the tuning guidelines if you haven't seen them already. http://spark.apache.org/docs/latest/tuning.html You could try the following to get it loaded: - Use Kryo serialization - Enable RDD compression - Set the storage level to
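A minimal sketch of the three settings Akhil lists (the values and input path are placeholders, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("tuning-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo serialization
      .set("spark.rdd.compress", "true")                                     // compress serialized RDD blocks
    val sc = new SparkContext(conf)

    // Keep the cached data serialized (and compressed, per the flag above) instead of as raw objects.
    val data = sc.textFile("hdfs:///some/input")
    data.persist(StorageLevel.MEMORY_ONLY_SER)
    println(data.count())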

Re: New Codes in GraphX

2014-11-24 Thread Deep Pradhan
Could it be because my edge list file is in the form (1 2), where there is an edge between node 1 and node 2? On Tue, Nov 18, 2014 at 4:13 PM, Ankur Dave wrote: > At 2014-11-18 15:51:52 +0530, Deep Pradhan > wrote: > > Yes the above command works, but there is this problem. Most of the > ti

Edge List File in GraphX

2014-11-24 Thread Deep Pradhan
Hi, Is it necessary for every vertex to have an attribute when we load a graph into GraphX? In other words, I have an edge list file containing pairs of vertices, i.e., <1 2> means that there is an edge between node 1 and node 2. Now, when I run PageRank on this data it returns NaN. Can I use th
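For reference, a hedged sketch of what the question describes (the path and tolerance are made up; GraphLoader.edgeListFile assigns every vertex a default attribute, so per-vertex attributes in the input file are not required):

    import org.apache.spark.graphx.GraphLoader

    // sc is an existing SparkContext; edges.txt holds "srcId dstId" pairs, one per line.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
    val ranks = graph.pageRank(0.0001).vertices   // run PageRank until convergence within the tolerance
    ranks.take(5).foreach { case (id, rank) => println(s"$id -> $rank") }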

Re: How to access application name in the spark framework code.

2014-11-24 Thread Kartheek.R
Hi Deng, Thank you. That works perfectly:) Regards Karthik. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-access-application-name-in-the-spark-framework-code-tp19719p19723.html Sent from the Apache Spark User List mailing list archive at Nabble.co

Re: How to access application name in the spark framework code.

2014-11-24 Thread Deng Ching-Mallete
Hi, I think it should be accessible via the SparkConf in the SparkContext. Something like sc.getConf().get("spark.app.name")? Thanks, Deng On Tue, Nov 25, 2014 at 12:40 PM, rapelly kartheek wrote: > Hi, > > When I submit a spark application like this: > > ./bin/spark-submit --class org.apache.
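A minimal sketch of that lookup (assuming an already running SparkContext named sc):

    // The application name set via --name or SparkConf.setAppName ends up in spark.app.name.
    val appName = sc.getConf.get("spark.app.name")
    println(s"Running as: $appName")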

Re: advantages of SparkSQL?

2014-11-24 Thread Cheng Lian
For the “never register a table” part, actually you can use Spark SQL without registering a table via its DSL. Say you’re going to extract an Int field named key from the table and double it: import org.apache.spark.sql.catalyst.dsl._ val data = sqc.parquetFile(path) val double = (i: Int
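A hedged sketch of that DSL route against the Spark 1.1/1.2-era SchemaRDD API (the parquet path and the column name "key" are assumptions, not from the thread):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext._                                   // brings the Symbol-based expression DSL into scope

    // No registerTempTable / SQL string needed: project the column with the DSL, then use plain RDD ops.
    val data = sqlContext.parquetFile("hdfs:///tmp/table.parquet")
    val doubled = data.select('key).map(row => row.getInt(0) * 2)
    doubled.take(5).foreach(println)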

How to access application name in the spark framework code.

2014-11-24 Thread rapelly kartheek
Hi, When I submit a spark application like this: ./bin/spark-submit --class org.apache.spark.examples.SparkKMeans --deploy-mode client --master spark://karthik:7077 $SPARK_HOME/examples/*/scala-*/spark-examples-*.jar /k-means 4 0.001 Which part of the spark framework code deals with the name of t

Control number of parquet generated from JavaSchemaRDD

2014-11-24 Thread tridib
Hello, I am reading around 1000 input files from disk into an RDD and generating parquet files. It always produces the same number of parquet files as the number of input files. I tried to merge them using rdd.coalesce(n) and/or rdd.repartition(n). also tried using: int MB_128 = 128*1024*1024; sc
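One hedged way to get the effect being asked about (Spark 1.1/1.2-era API; the schema, paths and partition count are placeholders): coalesce the source RDD before it becomes a SchemaRDD, so the parquet writer only sees that many partitions and emits that many part files.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD                     // implicit RDD[Product] -> SchemaRDD conversion

    case class Record(id: Int, name: String)              // hypothetical schema

    val input = sc.textFile("hdfs:///input/*.txt")        // ~1000 files -> ~1000 partitions
      .map(_.split(","))
      .map(a => Record(a(0).toInt, a(1)))
      .coalesce(10)                                       // 10 partitions -> 10 parquet part files

    input.saveAsParquetFile("hdfs:///output/parquet")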

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-24 Thread Cheng Lian
Hm, I tried exactly the same commit and build command locally, but couldn’t reproduce this. Usually this kind of error is caused by classpath misconfiguration. Could you please try this to ensure the corresponding Guava classes are included in the assembly jar you built? jar tf assembly/

Re: Is there a way to turn on spark eventLog on the worker node?

2014-11-24 Thread Harihar Nahak
You can set the same parameter when launching an application: if you use spark-submit, try --conf to pass those variables, or you can also set them via SparkConf for both the driver and the workers. - --Harihar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabb

Re: Is there a way to turn on spark eventLog on the worker node?

2014-11-24 Thread Marcelo Vanzin
Hello, What exactly are you trying to see? Workers don't generate any events that would be logged by enabling that config option. Workers generate logs, and those are captured and saved to disk by the cluster manager, generally, without you having to do anything. On Mon, Nov 24, 2014 at 7:46 PM,

Is there a way to turn on spark eventLog on the worker node?

2014-11-24 Thread Xuelin Cao
Hi, I'm going to debug some Spark applications on our testing platform. And it would be helpful if we could see the eventLog on the worker node. I've tried to turn on spark.eventLog.enabled and set the spark.eventLog.dir parameter on the worker node. However, it doesn't work. I do ha

Re: Spark saveAsText file size

2014-11-24 Thread Yanbo Liang
In-memory caching may blow up the size of the RDD. It is generally the case that an RDD takes more space in memory than on disk. There are options to configure and optimize storage space efficiency in Spark, take a look at this https://spark.apache.org/docs/latest/tuning.html 2014-11-25 10:38 GMT+08:00

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Evan R. Sparks
You can try recompiling spark with that option, and doing an sbt/sbt publish-local, then change your spark version from 1.1.0 to 1.2.0-SNAPSHOT (assuming you're building from the 1.1 branch) - sbt or maven (whichever you're compiling your app with) will pick up the version of spark that you just bu

Re: Negative Accumulators

2014-11-24 Thread Peter Thai
Great! Worked like a charm :) On Mon, Nov 24, 2014 at 9:56 PM, Shixiong Zhu wrote: > int overflow? If so, you can use BigInt like this: > > scala> import org.apache.spark.AccumulatorParamimport > org.apache.spark.AccumulatorParam > > scala> :paste// Entering paste mode (ctrl-D to finish) > impl

Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-11-24 Thread Harihar Nahak
Hi All, I have been exploring Spark for the past 2 months. I'm looking at some concrete features of both Spark and GraphX so that I can decide what to use, based upon which gets the highest performance. According to the documentation, GraphX runs 10x faster than normal Spark. So I ran PageRank a

Re: Negative Accumulators

2014-11-24 Thread Shixiong Zhu
int overflow? If so, you can use BigInt like this: scala> import org.apache.spark.AccumulatorParamimport org.apache.spark.AccumulatorParam scala> :paste// Entering paste mode (ctrl-D to finish) implicit object BigIntAccumulatorParam extends AccumulatorParam[BigInt] { def addInPlace(t1: BigInt,
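A self-contained version of the pattern the snippet sketches (the variable names and the toy job are mine):

    import org.apache.spark.AccumulatorParam

    // Accumulate into BigInt so large sums cannot overflow a 32-bit Int and go negative.
    implicit object BigIntAccumulatorParam extends AccumulatorParam[BigInt] {
      def addInPlace(t1: BigInt, t2: BigInt): BigInt = t1 + t2
      def zero(initialValue: BigInt): BigInt = BigInt(0)
    }

    val acc = sc.accumulator(BigInt(0))                   // picks up the implicit param above
    sc.parallelize(1 to 1000000).foreach(x => acc += x)
    println(acc.value)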

Spark saveAsText file size

2014-11-24 Thread Alan Prando
Hi Folks! I'm running a Spark job on a cluster with 9 slaves and 1 master (250GB RAM, 32 cores and 1TB of storage each). This job generates 1.200 TB of data in an RDD with 1200 partitions. When I call saveAsTextFile("hdfs://..."), Spark creates 1200 files named "part-000*" in the HDFS folder. H

Spark performance optimization examples

2014-11-24 Thread SK
Hi, Is there any document that provides some guidelines with some examples that illustrate when different performance optimizations would be useful? I am interested in knowing the guidelines for using optimizations like cache(), persist(), repartition(), coalesce(), and broadcast variables. I stu

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread agg212
I am running it in local mode. How can I use the built version (in local mode) so that I can use the native libraries? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-native-netlib-java-OpenBLAS-tp19662p19705.html Sent from the Apache Spark User List maili

Negative Accumulators

2014-11-24 Thread Peter Thai
Hello! Does anyone know why I may be receiving negative final accumulator values? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Negative-Accumulators-tp19706.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Is spark streaming +MlLib for online learning?

2014-11-24 Thread Joanne Contact
Thank you Tobias! On Mon, Nov 24, 2014 at 5:13 PM, Tobias Pfeiffer wrote: > Hi, > > On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact > wrote: >> >> I seemed to read somewhere that spark is still batch learning, but spark >> streaming could allow online learning. >> > > Spark doesn't do Machine L

Re: Setup Remote HDFS for Spark

2014-11-24 Thread Tobias Pfeiffer
Hi, On Sat, Nov 22, 2014 at 12:13 AM, EH wrote: > Unfortunately whether it is possible to have both Spark and HDFS running on > the same machine is not under our control. :( Right now we have Spark and > HDFS running in different machines. In this case, is it still possible to > hook up a rem

Re: Is spark streaming +MlLib for online learning?

2014-11-24 Thread Tobias Pfeiffer
Hi, On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact wrote: > > I seemed to read somewhere that spark is still batch learning, but spark > streaming could allow online learning. > Spark doesn't do Machine Learning itself, but MLlib does. MLlib currently can do online learning only for linear regr
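A hedged sketch of that online-learning path with StreamingLinearRegressionWithSGD (Spark 1.1/1.2-era API; the directories, batch interval and feature count are placeholders):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val training = ssc.textFileStream("hdfs:///streams/train").map(LabeledPoint.parse)
    val testing  = ssc.textFileStream("hdfs:///streams/test").map(LabeledPoint.parse)

    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(3))                // 3 features, chosen arbitrarily here

    model.trainOn(training)                               // weights are updated as each batch arrives
    model.predictOn(testing.map(_.features)).print()

    ssc.start()
    ssc.awaitTermination()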

Is spark streaming +MlLib for online learning?

2014-11-24 Thread Joanne Contact
Hi Gurus, Sorry for my naive question. I am new. I seemed to read somewhere that spark is still batch learning, but spark streaming could allow online learning. I could not find this on the website now. http://spark.apache.org/docs/latest/streaming-programming-guide.html I know MLLib uses incr

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan R. Sparks
Neat hack! This is cute and actually seems to work. The fact that it works is a little surprising and somewhat unintuitive. On Mon, Nov 24, 2014 at 8:08 AM, Ian O'Connell wrote: > > object MyCoreNLP { > @transient lazy val coreNLP = new coreNLP() > } > > and then refer to it from your map/redu

Re: Python Scientific Libraries in Spark

2014-11-24 Thread Davies Liu
These libraries can be used in PySpark easily. For example, MLlib uses NumPy heavily; it can accept np.array or SciPy sparse matrices as vectors. On Mon, Nov 24, 2014 at 10:56 AM, Rohit Pujari wrote: > Hello Folks: > > Since spark exposes python bindings and allows you to express your logic in

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread Marcelo Vanzin
That's an interesting question for which I do not know the answer. Probably a question for someone with more knowledge of the internals of the shell interpreter... On Mon, Nov 24, 2014 at 2:19 PM, aecc wrote: > Ok, great, I'm gonna do do it that way, thanks :). However I still don't > understand

Spark SQL - Any time line to move beyond Alpha version ?

2014-11-24 Thread Manoj Samel
Is there any timeline where Spark SQL goes beyond alpha version? Thanks,

Re: Spark S3 Performance

2014-11-24 Thread Daniil Osipov
Can you verify that it's reading the entire file on each worker using network monitoring stats? If it is, that would be a bug in my opinion. On Mon, Nov 24, 2014 at 2:06 PM, Nitay Joffe wrote: > Andrei, Ashish, > > To be clear, I don't think it's *counting* the entire file. It just seems > from

RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-24 Thread Judy Nash
Thank you Cheng for responding. Here is the commit SHA1 on the 1.2 branch I saw this failure in: commit 6f70e0295572e3037660004797040e026e440dbd Author: zsxwing Date: Fri Nov 21 00:42:43 2014 -0800 [SPARK-4472][Shell] Print "Spark context available as sc." only when SparkContext is create

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread aecc
Ok, great, I'm gonna do it that way, thanks :). However, I still don't understand why this object should be serialized and shipped. aaa.s and sc are both the same object, org.apache.spark.SparkContext@1f222881. However this: aaa.s.parallelize(1 to 10).filter(_ == myNumber).count needs to be ser

Ideas on how to use Spark for anomaly detection on a stream of data

2014-11-24 Thread Natu Lauchande
Hi all, I am getting started with Spark. I would like to use it for a spike on anomaly detection in a massive stream of metrics. Can Spark easily handle this use case? Thanks, Natu

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread Marcelo Vanzin
On Mon, Nov 24, 2014 at 1:56 PM, aecc wrote: > I checked sqlContext, they use it in the same way I would like to use my > class, they make the class Serializable with transient. Does this affects > somehow the whole pipeline of data moving? I mean, will I get performance > issues when doing this b

Re: Spark S3 Performance

2014-11-24 Thread Nitay Joffe
Andrei, Ashish, To be clear, I don't think it's *counting* the entire file. It just seems from the logging and the timing that it is doing a get of the entire file, then figures out it only needs some certain blocks, does another get of only the specific block. Regarding # partitions - I think I

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread aecc
Yes, I'm running this in the Shell. In my compiled Jar it works perfectly, the issue is I need to do this on the shell. Any available workarounds? I checked sqlContext, they use it in the same way I would like to use my class, they make the class Serializable with transient. Does this affects som

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Shivaram Venkataraman
Can you clarify what is the Spark master URL you are using ? Is it 'local' or is it a cluster ? If it is 'local' then rebuilding Spark wouldn't help as Spark is getting pulled in from Maven and that'll just pick up the released artifacts. Shivaram On Mon, Nov 24, 2014 at 1:08 PM, agg212 wrote:

Re: Merging Parquet Files

2014-11-24 Thread Michael Armbrust
Parquet does a lot of serial metadata operations on the driver which makes it really slow when you have a very large number of files (especially if you are reading from something like S3). This is something we are aware of and that I'd really like to improve in 1.3. You might try the (brand new a

Re: How to insert complex types like map> in spark sql

2014-11-24 Thread Michael Armbrust
Can you give the full stack trace? You might be hitting: https://issues.apache.org/jira/browse/SPARK-4293 On Sun, Nov 23, 2014 at 3:00 PM, critikaled wrote: > Hi, > I am trying to insert particular set of data from rdd to a hive table I > have Map[String,Map[String,Int]] in scala which I want

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread Marcelo Vanzin
Hello, On Mon, Nov 24, 2014 at 12:07 PM, aecc wrote: > This is the stacktrace: > > org.apache.spark.SparkException: Job aborted due to stage failure: Task not > serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$AAA > - field (class "$iwC$$iwC$$iwC$$iwC", name: "aaa", typ

Re: How can I read this avro file using spark & scala?

2014-11-24 Thread Michael Armbrust
Thanks for the feedback, I filed a couple of issues: https://github.com/databricks/spark-avro/issues On Fri, Nov 21, 2014 at 5:04 AM, thomas j wrote: > I've been able to load a different avro file based on GenericRecord with: > > val person = sqlContext.avroFile("/tmp/person.avro") > > When I tr

Re: advantages of SparkSQL?

2014-11-24 Thread Michael Armbrust
Akshat is correct about the benefits of parquet as a columnar format, but I'll add that some of this is lost if you just use a lambda function to process the data. Since your lambda function is a black box Spark SQL does not know which columns it is going to use and thus will do a full tablescan.
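A hedged illustration of that point (Spark 1.1/1.2-era API; the path, table and column names are made up):

    val logs = sqlContext.parquetFile("hdfs:///logs.parquet")
    logs.registerTempTable("logs")

    // Opaque lambda: Spark SQL cannot tell which fields are used, so every column is read.
    val viaLambda = logs.map(row => row.getString(1).length).reduce(_ + _)

    // Declarative projection: only the "url" column needs to be read from the parquet files.
    val viaSql = sqlContext.sql("SELECT url FROM logs").map(_.getString(0).length).reduce(_ + _)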

Re: How does Spark SQL traverse the physical tree?

2014-11-24 Thread Michael Armbrust
You are pretty close. The QueryExecution is what drives the phases from parsing to execution. Once we have a final SparkPlan (the physical plan), toRdd just calls execute() which recursively calls execute() on children until we hit a leaf operator. This gives us an RDD[Row] that will compute the

Unable to use Kryo

2014-11-24 Thread Daniel Haviv
Hi, I want to test Kryo serialization but when starting spark-shell I'm hitting the following error: java.lang.ClassNotFoundException: org.apache.spark.KryoSerializer the kryo-2.21.jar is on the classpath so I'm not sure why it's not picking it up. Thanks for your help, Daniel

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread agg212
I tried building Spark from the source, by downloading it and running: mvn -Pnetlib-lgpl -DskipTests clean package I then installed OpenBLAS by doing the following: - Download and unpack .tar from http://www.openblas.net/ - Run `make` I then linked /usr/lib/libblas.so.3 to /usr/lib/libopenblas.

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread aecc
If, instead of using myNumber, I actually use the value 5, the exception is not thrown. E.g.: aaa.s.parallelize(1 to 10).filter(_ == 5).count works perfectly -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-Context-as-an-attribute-of-a-class-canno

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread aecc
Marcelo Vanzin wrote > Do you expect to be able to use the spark context on the remote task? Not At all, what I want to create is a wrapper of the SparkContext, to be used only on the driver node. I would like to have in this "AAA" wrapper several attributes, such as the SparkContext and other con

Building Yarn mode with sbt

2014-11-24 Thread Akshat Aranya
Is it possible to enable the Yarn profile while building Spark with sbt? It seems like yarn project is strictly a Maven project and not something that's known to the sbt parent project. -Akshat

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Evan R. Sparks
Additionally - I strongly recommend using OpenBLAS over the Atlas build from the default Ubuntu repositories. Alternatively, you can build ATLAS on the hardware you're actually going to be running the matrix ops on (the master/workers), but we've seen modest performance gains doing this vs. OpenBLA

Inaccurate Estimate of weights model from StreamingLinearRegressionWithSGD

2014-11-24 Thread Bui, Tri
Hi, I am getting an incorrect weights model from StreamingLinearRegressionWithSGD. The one-feature input data is: (1,[1]) (2,[2]) ... (20,[20]) The result from the current model: weights is [-4.432], which is not correct. Also, how do I turn on the intercept value for the StreamingLinearRegressi

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Xiangrui Meng
Try building Spark with -Pnetlib-lgpl, which includes the JNI library in the Spark assembly jar. This is the simplest approach. If you want to include it as part of your project, make sure the library is inside the assembly jar or you specify it via `--jars` with spark-submit. -Xiangrui On Mon, No

Re: Store kmeans model

2014-11-24 Thread Xiangrui Meng
KMeansModel is serializable. So you can use Java serialization, try sc.parallelize(Seq(model)).saveAsObjectFile(outputDir) sc.objectFile[KMeansModel](outputDir).first() We will try to address model export/import more formally in 1.3, e.g., https://www.github.com/apache/spark/pull/3062 -Xiangrui
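A hedged sketch of that Java-serialization workaround (outputDir is a placeholder and model is assumed to be an already trained KMeansModel):

    import org.apache.spark.mllib.clustering.KMeansModel

    val outputDir = "hdfs:///models/kmeans"
    sc.parallelize(Seq(model), 1).saveAsObjectFile(outputDir)      // write the model
    val restored = sc.objectFile[KMeansModel](outputDir).first()   // read it back
    println(restored.clusterCenters.mkString(", "))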

Re: Python Logistic Regression error

2014-11-24 Thread Xiangrui Meng
The data is in LIBSVM format. So this line won't work: values = [float(s) for s in line.split(' ')] Please use the util function in MLUtils to load it as an RDD of LabeledPoint. http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point from pyspark.mllib.util import MLUtils examp

RE: Spark Cassandra Guava version issues

2014-11-24 Thread Ashic Mahtab
Did the workaround work for you? Doesn't seem to work for me. Date: Mon, 24 Nov 2014 16:44:17 +0100 Subject: Re: Spark Cassandra Guava version issues From: shahab.mok...@gmail.com To: as...@live.com CC: user@spark.apache.org I faced the same problem, and a workaround solution is here: https://gi

Python Scientific Libraries in Spark

2014-11-24 Thread Rohit Pujari
Hello Folks: Since spark exposes python bindings and allows you to express your logic in Python, Is there a way to leverage some of the sophisticated libraries like NumPy, SciPy, Scikit in spark job and run at scale? What's been your experience, any insights you can share in terms of what's possi

Re: Spark streaming job failing after some time.

2014-11-24 Thread pankaj channe
I have figured out the problem here. Turned out that there was a problem with my SparkConf when I was running my application with yarn in cluster mode. I was setting my master to be local[4] inside my application, whereas I was setting it to yarn-cluster with spark-submit. Now I have changed my Spa

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread Marcelo Vanzin
Do you expect to be able to use the spark context on the remote task? If you do, that won't work. You'll need to rethink what it is you're trying to do, since SparkContext is not serializable and it doesn't make sense to make it so. If you don't, you could mark the field as @transient. But the tw
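A hedged sketch of the @transient variant, using the AAA wrapper from this thread (the wrapper becomes Serializable while the context itself stays out of the shipped closure):

    import org.apache.spark.SparkContext

    class AAA(@transient val s: SparkContext) extends Serializable {
      // other driver-side configuration could live here
    }

    val aaa = new AAA(sc)
    val myNumber = 5
    aaa.s.parallelize(1 to 10).filter(_ == myNumber).count()       // the SparkContext is not serialized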

Spark error in execution

2014-11-24 Thread Blackeye
I created an application in Spark. When I run it with Spark, everything works fine. But when I export my application with the libraries (via sbt) and try to run it as an executable jar, I get the following error: 14/11/24 20:06:11 ERROR OneForOneStrategy: exception during creation akka.actor.A

Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread aecc
Hello guys, I'm using Spark 1.0.0 and Kryo serialization. In the Spark Shell, when I create a class that contains the SparkContext as an attribute, in this way: class AAA(val s: SparkContext) { } val aaa = new AAA(sc) and I execute any action using that attribute, like: val myNumber = 5 aaa.s.tex

How does Spark SQL traverse the physical tree?

2014-11-24 Thread Tim Chou
Hi All, I'm learning the code of Spark SQL. I'm confused about how SchemaRDD executes each operator. I'm tracing the code. I found toRDD() function in QueryExecution is the start for running a query. toRDD function will run SparkPlan, which is a tree structure. However, I didn't find any iterat

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan R. Sparks
This is probably not the right venue for general questions on CoreNLP - the project website (http://nlp.stanford.edu/software/corenlp.shtml) provides documentation and links to mailing lists/stack overflow topics. On Mon, Nov 24, 2014 at 9:08 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wr

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Madabhattula Rajesh Kumar
Hello, I'm new to Stanford CoreNLP. Could any one share good training material and examples(java or scala) on NLP. Regards, Rajesh On Mon, Nov 24, 2014 at 9:38 PM, Ian O'Connell wrote: > > object MyCoreNLP { > @transient lazy val coreNLP = new coreNLP() > } > > and then refer to it from your

Re: advantages of SparkSQL?

2014-11-24 Thread Akshat Aranya
Parquet is a column-oriented format, which means that you need to read in less data from the file system if you're only interested in a subset of your columns. Also, Parquet pushes down selection predicates, which can eliminate needless deserialization of rows that don't match a selection criterio

Mllib native netlib-java/OpenBLAS

2014-11-24 Thread agg212
Hi, I'm trying to improve performance for Spark's MLlib, and I am having trouble getting native netlib-java libraries installed/recognized by Spark. I am running on a single machine, Ubuntu 14.04, and here is what I've tried: sudo apt-get install libgfortran3 sudo apt-get install libatlas3-base li

advantages of SparkSQL?

2014-11-24 Thread mrm
Hi, Is there any advantage to storing data as a parquet format, loading it using the sparkSQL context, but never registering as a table/using sql on it? Something like: data = sqc.parquetFile(path) results = data.map(lambda x: applyfunc(x.field)) Is this faster/more optimised th

Connected Components running for a long time and failing eventually

2014-11-24 Thread nitinkak001
I am trying to run connected components on a graph generated by reading an edge file. It's running for a long time (3-4 hrs) and then eventually failing. Can't find any error in the log file. The file I am testing it on has 27M rows (edges). Is there something obviously wrong with the code? I tested the s

Re: How to keep a local variable in each cluster?

2014-11-24 Thread Yanbo
Sent from my iPad > On Nov 24, 2014, at 9:41 AM, zh8788 <78343...@qq.com> wrote: > > Hi, > > I am new to spark. This is the first time I am posting here. Currently, I > try to implement ADMM optimization algorithms for Lasso/SVM > Then I come across a problem: > > Since the training data(label, feature) is larg

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Ian O'Connell
object MyCoreNLP { @transient lazy val coreNLP = new coreNLP() } and then refer to it from your map/reduce/map partitions or that it should be fine (presuming its thread safe), it will only be initialized once per classloader per jvm On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks wrote: > We ha

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan Sparks
We have gotten this to work, but it requires instantiating the CoreNLP object on the worker side. Because of the initialization time it makes a lot of sense to do this inside of a .mapPartitions instead of a .map, for example. As an aside, if you're using it from Scala, have a look at sistanlp,
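Putting the two suggestions together, a hedged sketch (the annotator list, input path and output are placeholders; the CoreNLP pipeline is built lazily on the worker so the non-serializable object never crosses the wire):

    import java.util.Properties
    import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

    object NLP {
      @transient lazy val pipeline: StanfordCoreNLP = {
        val props = new Properties()
        props.setProperty("annotators", "tokenize, ssplit, pos")
        new StanfordCoreNLP(props)
      }
    }

    val docs = sc.textFile("hdfs:///corpus")
    val annotated = docs.mapPartitions { iter =>
      val pipeline = NLP.pipeline                 // initialized once per executor JVM
      iter.map { text =>
        val doc = new Annotation(text)
        pipeline.annotate(doc)
        doc.toString
      }
    }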

Re: MLLib: LinearRegressionWithSGD performance

2014-11-24 Thread Yanbo
The metrics page reveals that only two executors work in parallel for each iteration. You need to increase the number of parallel tasks. Some tips may be helpful: Increase "spark.default.parallelism"; Use repartition() or coalesce() to increase the partition number. > On Nov 22, 2014, at 3:18 AM, Sameer Ti

Re: Writing collection to file error

2014-11-24 Thread Akhil Das
To get the results in a single file, you could do a repartition(1) and then save it. ratesAndPreds.repartition(1).saveAsTextFile("/path/CFOutput") Thanks Best Regards On Mon, Nov 24, 2014 at 8:32 PM, Saurabh Agrawal wrote: > > > Thanks for your help Akhil, however, this is creating an output

Spark and Stanford CoreNLP

2014-11-24 Thread tvas
Hello, I was wondering if anyone has gotten the Stanford CoreNLP Java library to work with Spark. My attempts to use the parser/annotator fail because of task serialization errors since the class StanfordCoreNLP cannot be serialized. I've tried the remedies of registering StanfordCoreNLP throug

Re: Spark Cassandra Guava version issues

2014-11-24 Thread shahab
I faced the same problem, and a workaround solution is here: https://github.com/datastax/spark-cassandra-connector/issues/292 best, /Shahab On Mon, Nov 24, 2014 at 3:21 PM, Ashic Mahtab wrote: > I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using > sbt-assembly to create an uber

Re: Converting a column to a map

2014-11-24 Thread Yanbo
jsonFiles in your code is a SchemaRDD rather than an RDD[Array]. If it is a column in the SchemaRDD, you can first use a Spark SQL query to get that column. SchemaRDD also supports SQL-like operations such as select / where, which can likewise return a specific column. > On Nov 24, 2014, at 4:01 AM, Daniel Haviv wrote: > > H

Spark SQL (1.0)

2014-11-24 Thread david
Hi, I build 2 tables from files. Table F1 joins with table F2 on c5=d4. F1 has 46730613 rows, F2 has 3386740 rows. All keys d4 exist in F1.c5, so I expect to retrieve 46730613 rows. But it returns only 3437 rows. // --- begin code --- val sqlContext = new org.apache.spark.sql.SQLContext(s

RE: Writing collection to file error

2014-11-24 Thread Saurabh Agrawal
Thanks for your help Akhil, however, this is creating an output folder and storing the result sets in multiple files. Also the record count in the result set seems to have multiplied!! Is there any other way to achieve this? Thanks!! Regards, Saurabh Agrawal Vice President Markit Green Boul

RE: ClassNotFoundException in standalone mode

2014-11-24 Thread Benoit Pasquereau
I finally managed to get the example working, here are the details that may help other users. I have 2 windows nodes for the test system, PN01 and PN02. Both have the same shared drive S: (it is mapped to C:\source on PN02). If I run the worker and master from S:\spark-1.1.0-bin-hadoop2.4, then

Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-24 Thread Rishi Yadav
We keep conf as a symbolic link so that an upgrade is as simple as a drop-in replacement. On Monday, November 24, 2014, riginos wrote: > OK thank you very much for that! > On 23 Nov 2014 21:49, "Denny Lee [via Apache Spark User List]" <[hidden > email]

Re: Spark SQL with Apache Phoenix lower and upper Bound

2014-11-24 Thread Josh Mahonin
Hi Alaa Ali, That's right, when using the PhoenixInputFormat, you can do simple 'WHERE' clauses and then perform any aggregate functions you'd like from within Spark. Any aggregations you run won't be quite as fast as running the native Spark queries, but once it's available as an RDD you can also

ExternalAppendOnlyMap: Thread spilling in-memory map of to disk many times slowly

2014-11-24 Thread Romi Kuntsman
Hello, I have a large data calculation in Spark, distributed across several nodes. In the end, I want to write to a single output file. For this I do: output.coalesce(1, false).saveAsTextFile(filename). What happens is all the data from the workers flows to a single worker, and that one writ

Spark Cassandra Guava version issues

2014-11-24 Thread Ashic Mahtab
I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using sbt-assembly to create a uber jar to submit to the stand alone master. I'm using the hadoop 1 prebuilt binaries for Spark. As soon as I try to do sc.CassandraTable(...) I get an error that's likely to be a Guava versioning issu

Re: Use case question

2014-11-24 Thread Gordon Benjamin
Great, thanks. On Monday, November 24, 2014, Akhil Das wrote: > I'm not quite sure if I understood you correctly, but here's the thing: if > you use Spark Streaming, it will likely refresh your dashboard for > each batch. So for every batch your dashboard will be updated with the new > data.

spark broadcast error

2014-11-24 Thread Ke Wang
I want to run my Spark program on a YARN cluster. But when I tested the broadcast function in my program, I got an error. Exception in thread "main" org.apache.spark.SparkException: Error sending message as driverActor is null [message = UpdateBlockInfo(BlockManagerId(, in160-011.byted.org, 19704, 0

Re: EC2 cluster with SSD ebs

2014-11-24 Thread Hao Ren
Hi, I found that the ec2 script has been improved a lot. And the option "ebs-vol-type" is added to specify ebs type. However, it seems that the option does not work, the cmd I used is the following: $SPARK_HOME/ec2/spark-ec2 -k sparkcv -i spark.pem -m r3.4xlarge -s 3 -t r3.2xlarge --ebs-vol-typ

Re: Writing collection to file error

2014-11-24 Thread Akhil Das
Hi Saurabh, Here your ratesAndPreds is an RDD of type [((Int, Int), (Double, Double))], not an Array. Now, if you want to save it to disk, you can simply call saveAsTextFile and provide the location. So change your last line from this: ratesAndPreds.foreach(pw.println) to this: ratesAnd

Re: Use case question

2014-11-24 Thread Akhil Das
I'm not quite sure if I understood you correctly, but here's the thing: if you use Spark Streaming, it will likely refresh your dashboard for each batch. So for every batch your dashboard will be updated with the new data. And yes, the end user won't notice anything while you do the coalesce/repa

Writing collection to file error

2014-11-24 Thread Saurabh Agrawal
import org.apache.spark.mllib.recommendation.ALS import org.apache.spark.mllib.recommendation.Rating // Load and parse the data val data = sc.textFile("/path/CFReady.txt") val ratings = data.map(_.split('\t') match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)

Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-24 Thread riginos
OK thank you very much for that! On 23 Nov 2014 21:49, "Denny Lee [via Apache Spark User List]" < ml-node+s1001560n19598...@n3.nabble.com> wrote: > It sort of depends on your environment. If you are running on your local > environment, I would just download the latest Spark 1.1 binaries and you'l

Re: issue while running the code in standalone mode: "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory"

2014-11-24 Thread vdiwakar.malladi
Thanks for your response. I gave the correct master URL. Moreover, as I mentioned in my post, I was able to run the sample program by using spark-submit. But it is not working when I'm running it from my machine. Any clue on this? Thanks in advance. -- View this message in context: http://apache-sp

Re: issue while running the code in standalone mode: "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory"

2014-11-24 Thread Sean Owen
Wouldn't it likely be the opposite? Too much memory / too many cores being requested relative to the resource that YARN makes available? On Nov 24, 2014 11:00 AM, "Akhil Das" wrote: > This can happen mainly because of the following: > > - Wrong master url (Make sure you give the master url which

Re: Use case question

2014-11-24 Thread Gordon Benjamin
Thanks. Yes d3 ones. Just to clarify--we could take our current system, which is incrementally adding partitions and overlay an Apache streaming layer to ingest these partitions? Then nightly, we could coalesce these partitions for example? I presume that while we are carrying out a coalesce, the e

Re: Use case question

2014-11-24 Thread Akhil Das
Streaming would be easy to implement, all you have to do is to create the stream, do some transformation (depends on your usecase) and finally write it to your dashboards backend. What kind of dashboards are you building? For d3.js based ones, you can have websocket and write the stream output to t

Use case question

2014-11-24 Thread Gordon Benjamin
hi, We are building an analytics dashboard. Data will be updated every 5 minutes for now and eventually every 1 minute, maybe more frequently. The amount of data coming in is not huge, maybe 30 records per minute per customer, although we could have 500 customers. Is streaming correct for this? Instead

Re: issue while running the code in standalone mode: "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory"

2014-11-24 Thread Akhil Das
This can happen mainly because of the following: - Wrong master URL (make sure you give the master URL listed in the top left corner of the web UI, running on 8080) - Allocating more memory/cores than are available when creating the SparkContext. Thanks Best Regards On Mon, Nov 24, 2014 at 4:13 PM, vdiwaka

Re: Submit Spark driver on Yarn Cluster in client mode

2014-11-24 Thread Akhil Das
Not sure if it will work, but you can try creating a dummy hadoop conf directory and put those files (*-site.xml) files inside it and hopefully spark will pick it up and submit it on that remote cluster. (If there isn't any network/firewall issues). Thanks Best Regards On Mon, Nov 24, 2014 at 4:1

Re: Submit Spark driver on Yarn Cluster in client mode

2014-11-24 Thread Naveen Kumar Pokala
Hi Akhil, But the driver and YARN are on different networks; how do I specify the (export HADOOP_CONF_DIR=XXX) path? The driver is on my Windows machine and YARN is on a Unix machine on a different network. -Naveen. From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Monday, November 24, 20

issue while running the code in standalone mode: "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory"

2014-11-24 Thread vdiwakar.malladi
Hi, When I try to execute the program from my laptop by connecting to the HDP environment (on which Spark is also configured), I'm getting the warning ("Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory") and the job is being

Re: Submit Spark driver on Yarn Cluster in client mode

2014-11-24 Thread Akhil Das
You can export the hadoop configurations dir (export HADOOP_CONF_DIR=XXX) in the environment and then submit it like: ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master yarn-cluster \ # can also be `yarn-client` for client mode --executor-memory 20G \ --num-executor
