parameter passed for AppendOnlyMap initialCapacity

2015-02-09 Thread fightf...@163.com
Hi all, can any expert show me how to change the initialCapacity of org.apache.spark.util.collection.AppendOnlyMap? We have run into problems using Spark to process large data sets during sort shuffle. Does Spark offer a configurable parameter for suppo

Re: ImportError: No module named pyspark, when running pi.py

2015-02-09 Thread Mohit Singh
I think you have to run that using $SPARK_HOME/bin/pyspark /path/to/pi.py instead of normal "python pi.py" On Mon, Feb 9, 2015 at 11:22 PM, Ashish Kumar wrote: > *Command:* > sudo python ./examples/src/main/python/pi.py > > *Error:* > Traceback (most recent call last): > File "./examples/src/m

ImportError: No module named pyspark, when running pi.py

2015-02-09 Thread Ashish Kumar
*Command:* sudo python ./examples/src/main/python/pi.py *Error:* Traceback (most recent call last): File "./examples/src/main/python/pi.py", line 22, in from pyspark import SparkContext ImportError: No module named pyspark

Re: textFile partitions

2015-02-09 Thread Kostas Sakellis
The partitions parameter to textFile is the "minPartitions". So there will be at least that level of parallelism. Spark delegates to Hadoop to create the splits for that file (yes, even for a text file on disk and not hdfs). You can take a look at the code in FileInputFormat - but briefly it will c
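
A minimal sketch of what this describes, assuming the sc available in spark-shell (the file path and numbers are illustrative):

    // The second argument to textFile is minPartitions: Spark asks Hadoop's
    // FileInputFormat for at least this many splits, so the result can be
    // higher but not lower.
    val rdd = sc.textFile("README.md", 8)
    println(rdd.partitions.size)                                 // >= 8

    // coalesce with shuffle = true repartitions explicitly, independent of splits.
    println(rdd.coalesce(100, shuffle = true).partitions.size)   // 100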

Re: running spark project using java -cp command

2015-02-09 Thread Akhil Das
Yes like this: /usr/lib/jvm/java-7-openjdk-i386/bin/java -cp ::/home/akhld/mobi/localcluster/spark-1/conf:/home/akhld/mobi/localcluster/spark-1/lib/spark-assembly-1.1.0-hadoop1.0.4.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus-core-3.2.2.jar:/home/akhld/mobi/localcluster/spark-1/lib/da

running spark project using java -cp command

2015-02-09 Thread Hafiz Mujadid
hi experts! Is there any way to run a spark application using the java -cp command? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/running-spark-project-using-java-cp-command-tp21567.html Sent from the Apache Spark User List mailing list archive at Na

textFile partitions

2015-02-09 Thread Yana Kadiyska
Hi folks, puzzled by something pretty simple: I have a standalone cluster with default parallelism of 2, spark-shell running with 2 cores sc.textFile("README.md").partitions.size returns 2 (this makes sense) sc.textFile("README.md").coalesce(100,true).partitions.size returns 100, also makes sense

Re: Will Spark serialize an entire Object or just the method referred in an object?

2015-02-09 Thread Yitong Zhou
Hi Marcelo, Thanks for the explanation! So you mean in this way, actually only the output of the map closure would need to be serialized so that it could be passed further for other operations (maybe reduce or else)? And we don't have to worry about Utils.funcX because for each closure instance we

Re: How to create spark AMI in AWS

2015-02-09 Thread Nicholas Chammas
OK, good luck! On Mon Feb 09 2015 at 6:41:14 PM Guodong Wang wrote: > Hi Nicholas, > > Thanks for your quick reply. > > I'd like to try to build a image with create_image.sh. Then let's see how > we can launch spark cluster in region cn-north-1. > > > > Guodong > > On Tue, Feb 10, 2015 at 3:59 A

Re: How to create spark AMI in AWS

2015-02-09 Thread Guodong Wang
Hi Nicholas, Thanks for your quick reply. I'd like to try to build an image with create_image.sh. Then let's see how we can launch a spark cluster in region cn-north-1. Guodong On Tue, Feb 10, 2015 at 3:59 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Guodong, > > spark-ec2 does n

Can spark job server be used to visualize streaming data?

2015-02-09 Thread Su She
Hello Everyone, I was reading this blog post: http://homes.esat.kuleuven.be/~bioiuser/blog/a-d3-visualisation-from-spark-as-a-service/ and was wondering if this approach can be taken to visualize streaming data...not just historical data? Thank you! -Suh

Re: SparkSQL DateTime

2015-02-09 Thread Michael Armbrust
The standard way to add timestamps is java.sql.Timestamp. On Mon, Feb 9, 2015 at 3:23 PM, jay vyas wrote: > Hi spark ! We are working on the bigpetstore-spark implementation in > apache bigtop, and want to implement idiomatic date/time usage for SparkSQL. > > It appears that org.joda.time.DateTi
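
A short sketch of that suggestion against the Spark 1.2-era SparkSQL API (the case class and sample data are illustrative):

    import java.sql.Timestamp
    import org.apache.spark.sql.SQLContext

    case class Visit(user: String, at: Timestamp)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit conversion for case-class RDDs

    val visits = sc.parallelize(Seq(Visit("a", new Timestamp(System.currentTimeMillis))))
    visits.registerTempTable("visits")
    sqlContext.sql("SELECT user, at FROM visits").collect().foreach(println)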

SparkSQL DateTime

2015-02-09 Thread jay vyas
Hi spark ! We are working on the bigpetstore-spark implementation in apache bigtop, and want to implement idiomatic date/time usage for SparkSQL. It appears that org.joda.time.DateTime isn't in SparkSQL's rolodex of reflection types. I'd rather not force an artificial dependency on hive dates j

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Michael Armbrust
You could add a new ColumnType . PRs welcome :) On Mon, Feb 9, 2015 at 3:01 PM, Manoj Samel wrote: > Hi Michael, > > As a test, I have same data loaded as another parquet - excep

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
Hi Michael, As a test, I have same data loaded as another parquet - except with the 2 decimal(14,4) replaced by double. With this, the on disk size is ~345MB, the in-memory size is 2GB (v.s. 12 GB) and the cached query runs in 1/2 the time of uncached query. Would it be possible for Spark to sto

RE: Check if spark was built with hive

2015-02-09 Thread Ashic Mahtab
Awesome...thanks Sean. > From: so...@cloudera.com > Date: Mon, 9 Feb 2015 22:43:45 + > Subject: Re: Check if spark was built with hive > To: as...@live.com > CC: user@spark.apache.org > > https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L217 > > Yes all releas

Re: Check if spark was built with hive

2015-02-09 Thread Sean Owen
https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L217 Yes all releases are built with -Phive except the 'without-hive' build. On Mon, Feb 9, 2015 at 10:41 PM, Ashic Mahtab wrote: > Is there an easy way to check if a spark binary release was built with Hive > suppo
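
As a complementary runtime check, a rough sketch assuming the spark-shell from the binary in question: if the build omitted -Phive, the Hive classes are simply absent and this line fails with a class-loading error.

    // Succeeds only if the distribution was built with Hive support.
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)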

Check if spark was built with hive

2015-02-09 Thread Ashic Mahtab
Is there an easy way to check if a spark binary release was built with Hive support? Are any of the prebuilt binaries on the spark website built with hive support? Thanks, Ashic.

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
Could you share which data types are optimized in the in-memory storage and how are they optimized ? On Mon, Feb 9, 2015 at 2:33 PM, Michael Armbrust wrote: > You'll probably only get good compression for strings when dictionary > encoding works. We don't optimize decimals in the in-memory colu

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Michael Armbrust
You'll probably only get good compression for strings when dictionary encoding works. We don't optimize decimals in the in-memory columnar storage, so you are paying expensive serialization there likely. On Mon, Feb 9, 2015 at 2:18 PM, Manoj Samel wrote: > Flat data of types String, Int and cou
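
A hedged workaround sketch based on this explanation (the table and column names are illustrative): cast the decimal columns to double before caching, so the in-memory columnar store treats them as primitives.

    // Assumes an SQLContext/HiveContext with a table "fact" containing decimal columns.
    val asDouble = sqlContext.sql(
      "SELECT CAST(amount AS DOUBLE) AS amount, CAST(fee AS DOUBLE) AS fee, key FROM fact")
    asDouble.registerTempTable("fact_double")
    sqlContext.cacheTable("fact_double")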

Re: Will Spark serialize an entire Object or just the method referred in an object?

2015-02-09 Thread Marcelo Vanzin
`func1` and `func2` never get serialized. They must exist on the other end in the form of a class loaded by the JVM. What gets serialized is an instance of a particular closure (the argument to your "map" function). That's a separate class. The instance of that class that is serialized contains re
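
A small sketch of the distinction being made here, reusing the Utils example from the question (the method bodies are illustrative, and an RDD[String] named rdd is assumed):

    object Utils {
      def func1(s: String): String = s.toUpperCase
      def func2(s: String): String = s.reverse
    }

    // The lambda below compiles to its own anonymous class; the instance that is
    // serialized is that closure. Utils itself is not serialized: Utils.func1 is
    // resolved on the executor by class loading, and func2 is never touched.
    val upper = rdd.map(r => Utils.func1(r))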

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
Flat data of types String, Int and couple of decimal(14,4) On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust wrote: > Is this nested data or flat data? > > On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel > wrote: > >> Hi Michael, >> >> The storage tab shows the RDD resides fully in memory (10 partit

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Michael Armbrust
Is this nested data or flat data? On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel wrote: > Hi Michael, > > The storage tab shows the RDD resides fully in memory (10 partitions) with > zero disk usage. Tasks for subsequent select on this table in cache shows > minimal overheads (GC, queueing, shuffle

Will Spark serialize an entire Object or just the method referred in an object?

2015-02-09 Thread Yitong Zhou
If we define an Utils object: object Utils { def func1 = {..} def func2 = {..} } And then in a RDD we refer to one of the function: rdd.map{r => Utils.func1(r)} Will Utils.func2 also get serialized or not? Thanks, Yitong -- View this message in context: http://apache-spark-user-list.10

Re: SQL group by on Parquet table slower when table cached

2015-02-09 Thread Manoj Samel
Hi Michael, The storage tab shows the RDD resides fully in memory (10 partitions) with zero disk usage. Tasks for subsequent select on this table in cache shows minimal overheads (GC, queueing, shuffle write etc. etc.), so overhead is not issue. However, it is still twice as slow as reading uncach

[ANNOUNCE] Apache Spark 1.2.1 Released

2015-02-09 Thread Patrick Wendell
Hi All, I've just posted the 1.2.1 maintenance release of Apache Spark. We recommend all 1.2.0 users upgrade to this release, as this release includes stability fixes across all components of Spark. - Download this release: http://spark.apache.org/downloads.html - View the release notes: http://s

Re: SparkSQL 1.2 and ElasticSearch-Spark 1.4 not working together, NoSuchMethodError problems

2015-02-09 Thread Aris
Thank you Costin, yes your solution worked. Just to be explicit - I used the development snapshot and put that dependency in my build.sbt. This should help people. The dependency: "org.elasticsearch" %% "elasticsearch-spark" % "2.1.0.BUILD-SNAPSHOT" With the resolver: Resolver.sonatypeRepo(

RE: no space left at worker node

2015-02-09 Thread ey-chih chow
I changed the submit command to the following, but the jar file is still copied to the directory of ./spark/work/app-xx-xx. /root/spark/bin/spark-submit --class com.crowdstar.etl.ParseAndClean --master spark://ec2-54-213-73-150.us-west-2.compute.amazonaws.com:7077 local:///root/etl-admin/jar/s

Re: External Data Source in SPARK

2015-02-09 Thread Michael Armbrust
You need to pass the fully qualified class name as the argument to USING. Nothing special should be required to make it work for python. On Mon, Feb 9, 2015 at 10:21 AM, Addanki, Santosh Kumar < santosh.kumar.adda...@sap.com> wrote: > Hi, > > > > We implemented an External Data Source by extendi
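
A hedged sketch of the registration step with a fully qualified class name (the class name and option are illustrative); since it is plain SQL, the same statement also works from the Python SQLContext:

    sqlContext.sql("""
      CREATE TEMPORARY TABLE my_table
      USING com.example.spark.MyDataSource
      OPTIONS (path '/some/path')
    """)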

RE: [MLlib] Performance issues when building GBM models

2015-02-09 Thread Christopher Thom
I haven't been able to see any evidence from the logs that there are RDDs being excluded. This is a test dataset, so quite small (<100k rows), so I'd be shocked if it was an OOM error. Where should I look in the UI to see whether RDDs are being excluded? In case it helps, here's the full log

Re: python api and gzip compression

2015-02-09 Thread Kane Kim
Found it - used saveAsHadoopFile On Mon, Feb 9, 2015 at 9:11 AM, Kane Kim wrote: > Hi, how to compress output with gzip using python api? > > Thanks! >

Re: SparkSQL 1.2 and ElasticSearch-Spark 1.4 not working together, NoSuchMethodError problems

2015-02-09 Thread Costin Leau
Hi, Spark 1.2 changed the APIs a bit which is what's causing the problem with es-spark 2.1.0.Beta3. This has been addressed a while back in es-spark proper; you can get a hold of the dev build (the upcoming 2.1.Beta4) here [1]. P.S. Do note that a lot of things have happened in es-hadoop/spark

RE: no space left at worker node

2015-02-09 Thread ey-chih chow
In other words, the working command is: /root/spark/bin/spark-submit --class com.crowdstar.etl.ParseAndClean --master spark://ec2-54-213-73-150.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --total-executor-cores 4 file:///root/etl-admin/jar/spark-etl-0.0.1-SNAPSHOT.jar s3://pixlog

Re: [MLlib] Performance issues when building GBM models

2015-02-09 Thread Xiangrui Meng
Could you check the Spark UI and see whether there are RDDs being kicked out during the computation? We cache the residual RDD after each iteration. If we don't have enough memory/disk, it gets recomputed, resulting in something like `t(n) = t(n-1) + const`. We might cache the features multiple times

Re: MLLib: feature standardization

2015-02-09 Thread Xiangrui Meng
`mean()` and `variance()` are not defined in `Vector`. You can use the mean and variance implementation from commons-math3 (http://commons.apache.org/proper/commons-math/javadocs/api-3.4.1/index.html) if you don't want to implement them. -Xiangrui On Fri, Feb 6, 2015 at 12:50 PM, SK wrote: > Hi,
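
Alternatively, a sketch using MLlib's own summary statistics rather than commons-math3, assuming an existing RDD[Vector] named features:

    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.mllib.feature.StandardScaler

    val summary = Statistics.colStats(features)
    println(summary.mean)       // per-column means
    println(summary.variance)   // per-column variances

    // Or standardize directly (withMean = true requires dense vectors):
    val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)
    val scaled = features.map(scaler.transform)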

Re: How to create spark AMI in AWS

2015-02-09 Thread Nicholas Chammas
Guodong, spark-ec2 does not currently support the cn-north-1 region, but you can follow [SPARK-4241](https://issues.apache.org/jira/browse/SPARK-4241) to find out when it does. The base AMI used to generate the current Spark AMIs is very old. I'm not sure anyone knows what it is anymore. What I k

Re: naive bayes text classifier with tf-idf in pyspark

2015-02-09 Thread Xiangrui Meng
On Fri, Feb 6, 2015 at 2:08 PM, Imran Akbar wrote: > Hi, > > I've got the following code that's almost complete, but I have 2 questions: > > 1) Once I've computed the TF-IDF vector, how do I compute the vector for > each string to feed into the LabeledPoint? > If I understand your code correctly

Re: no option to add intercepts for StreamingLinearAlgorithm

2015-02-09 Thread Xiangrui Meng
No particular reason. We didn't add it in the first version. Let's add it in 1.4. -Xiangrui On Thu, Feb 5, 2015 at 3:44 PM, jamborta wrote: > hi all, > > just wondering if there is a reason why it is not possible to add intercepts > for streaming regression models? I understand that run method in

Re: Number of goals to win championship

2015-02-09 Thread Xiangrui Meng
Logistic regression outputs probabilities if the data fits the model assumption. Otherwise, you might need to calibrate its output to correctly read it. You may be interested in reading this: http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression/. We have isotonic reg

Re: word2vec more distributed

2015-02-09 Thread Xiangrui Meng
The C implementation of Word2Vec updates the model using multiple threads without locking. It is hard to implement it in a distributed way. In the MLlib implementation, each worker holds the entire model in memory and outputs the part of the model that gets updated. The driver still needs to collect and aggre

SparkSQL 1.2 and ElasticSearch-Spark 1.4 not working together, NoSuchMethodError problems

2015-02-09 Thread Aris
Hello Spark community and Holden, I am trying to follow Holden Karau's SparkSQL and ElasticSearch tutorial from Spark Summit 2014. I am trying to use elasticsearch-spark 2.1.0.Beta3 and SparkSQL 1.2 together. https://github.com/holdenk/elasticsearchspark *(Side Note: This very nice tutorial does

Re: rdd filter

2015-02-09 Thread Xiangrui Meng
How was this RDD generated? Any randomness involved? -Xiangrui On Mon, Feb 9, 2015 at 10:47 AM, SK wrote: > Hi, > > I am using the filter() method to separate the rdds based on a predicate,as > follows: > > val rdd1 = data.filter (t => { t._2 >0.0 && t._2 <= 1.0}) // t._2 is a > Double > val rdd
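
If randomness is involved, a sketch of why the counts can disagree and the usual fix (the data is illustrative):

    import scala.util.Random

    val data = sc.parallelize(1 to 1000000).map(i => (i, Random.nextDouble * 4.0))

    // Without caching, every filter below re-runs the map and draws fresh random
    // values, so rdd1.count + rdd2.count need not equal rdd3.count.
    data.cache()

    val rdd1 = data.filter(t => t._2 > 0.0 && t._2 <= 1.0)
    val rdd2 = data.filter(t => t._2 > 1.0 && t._2 <= 4.0)
    val rdd3 = data.filter(t => t._2 > 0.0 && t._2 <= 4.0)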

Re: word2vec: how to save an mllib model and reload it?

2015-02-09 Thread Xiangrui Meng
We are working on import/export for MLlib models. The umbrella JIRA is https://issues.apache.org/jira/browse/SPARK-4587. In 1.3, we are going to have save/load for linear models, naive Bayes, ALS, and tree models. I created a JIRA for Word2Vec and set the target version to 1.4. If anyone is interes

RE: no space left at worker node

2015-02-09 Thread ey-chih chow
Thanks. But, in spark-submit, I specified the jar file in the form of local:/spark-etl-0.0.1-SNAPSHOT.jar. It comes back with the following. What's wrong with this? Ey-Chih Chow === Date: Sun, 8 Feb 2015 22:27:17 -0800Sending launch command to spark://ec2-54-213-73-150.us-west-2.c

rdd filter

2015-02-09 Thread SK
Hi, I am using the filter() method to separate the rdds based on a predicate,as follows: val rdd1 = data.filter (t => { t._2 >0.0 && t._2 <= 1.0}) // t._2 is a Double val rdd2 = data.filter (t => { t._2 >1.0 && t._2 <= 4.0}) val rdd3 = data.filter (t => { t._2 >0.0 && t._2 <= 4.0}) // this sho

External Data Source in SPARK

2015-02-09 Thread Addanki, Santosh Kumar
Hi, We implemented an External Data Source by extending TableScan. We added the classes to the classpath. The data source works fine when run in the Spark shell. But currently we are unable to use this same data source in a Python environment. So when we execute the following below in an IPython

Re: Spark streaming app shutting down

2015-02-09 Thread Mukesh Jha
Thanks for the info guys. For now I'm using the high level consumer; I will give this one a try. As far as the queries are concerned, checkpointing helps. I'm still not sure what's the best way to gracefully stop the application in yarn cluster mode. On 5 Feb 2015 09:38, "Dibyendu Bhattacharya"

sum of columns in rowMatrix and linear regression

2015-02-09 Thread Donbeo
I have a matrix X of type: res39: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@6cfff1d3 with n rows and p columns I would like to obtain an array S of size n*1 defined as the sum of the columns of X. S will then be replaced by val s2
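
One possible sketch of the per-row sum, assuming X is the RowMatrix in question (summing the p columns elementwise is the same as summing each row's entries):

    // X.rows is an RDD[Vector]; summing each row's entries yields the n x 1 result.
    val rowSums = X.rows.map(_.toArray.sum)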

python api and gzip compression

2015-02-09 Thread Kane Kim
Hi, how to compress output with gzip using python api? Thanks!

spark and breeze random number generator ( ClassNotFoundException)

2015-02-09 Thread Donbeo
Hi, I receive a dependency error when I try to use breeze.stats.distributions.Uniform() in spark. Here there is the full description of my problem http://stackoverflow.com/questions/28414224/spark-and-breeze-random-number-generator-classnotfoundexception I think I have to include somehow the

Re: Spark (yarn-client mode) Hangs in final stages of Collect or Reduce

2015-02-09 Thread nitin
If the application has failed/succeeded, the logs get pushed to HDFS and can be accessed with the following command :- yarn logs --applicationId If it's still running, you can find executors' logs on the corresponding data nodes in the hadoop logs directory. Path should be something like :- /data/hadoop_logs

Re: Spark Driver Host under Yarn

2015-02-09 Thread Al M
Yarn-cluster. When I run in yarn-client the driver is just run on the machine that runs spark-submit. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Driver-Host-under-Yarn-tp21536p21558.html Sent from the Apache Spark User List mailing list archive a

Re: generate a random matrix with uniform distribution

2015-02-09 Thread Burak Yavuz
Sorry about that, yes, it should be uniformVectorRDD. Thanks Sean! Burak On Mon, Feb 9, 2015 at 2:05 AM, Sean Owen wrote: > Yes the example given here should have used uniformVectorRDD. Then it's > correct. > > On Mon, Feb 9, 2015 at 9:56 AM, Luca Puggini wrote: > > Thanks a lot! > > Can I ask
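
For reference, a minimal sketch with the corrected call (the dimensions are illustrative); uniformVectorRDD draws entries from U(0, 1), so shift and scale for other ranges:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.mllib.random.RandomRDDs.uniformVectorRDD

    val rows = uniformVectorRDD(sc, numRows = 1000, numCols = 10)                  // U(0, 1)
    val shifted = rows.map(v => Vectors.dense(v.toArray.map(x => -1.0 + 2.0 * x))) // U(-1, 1)
    val mat = new RowMatrix(shifted)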

Re: Spark (yarn-client mode) Hangs in final stages of Collect or Reduce

2015-02-09 Thread nitin
Have you checked the corresponding executor logs as well? I think information provided by you here is less to actually understand your issue. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-yarn-client-mode-Hangs-in-final-stages-of-Collect-or-Reduce-tp

Re: Spark Driver Host under Yarn

2015-02-09 Thread nitin
Are you running in yarn-cluster or yarn-client mode? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Driver-Host-under-Yarn-tp21536p21556.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Spark SQL - Point lookup optimisation in SchemaRDD?

2015-02-09 Thread nitin
Hi All, I have a use case where I have cached my schemaRDD and I want to launch executors just on the partition which I know of (prime use-case of PartitionPruningRDD). I tried something like following :- val partitionIdx = 2 val schemaRdd = hiveContext.table("myTable") //myTable is cached in me
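
A hedged sketch of that approach; PartitionPruningRDD is a DeveloperApi, so its surface may change between versions:

    import org.apache.spark.rdd.PartitionPruningRDD

    val partitionIdx = 2
    val schemaRdd = hiveContext.table("myTable")   // the cached table from the question

    // Tasks are launched only for the partitions the predicate keeps.
    val pruned = PartitionPruningRDD.create(schemaRdd, idx => idx == partitionIdx)
    pruned.collect()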

Re: saveAsTextFile of RDD[Array[Any]]

2015-02-09 Thread Jong Wook Kim
If you have `RDD[Array[Any]]` you can do rdd.map(_.mkString("\t")) or with some other delimiter to make it `RDD[String]`, and then call `saveAsTextFile`. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-of-RDD-Array-Any-tp21548p21554.html Se
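
A minimal sketch of that suggestion (the delimiter, sample data, and output path are illustrative):

    val rdd = sc.parallelize(Seq(Array[Any]("a", 1, 2.5), Array[Any]("b", 2, 3.5)))

    // Join each Array[Any] into one delimited line, then write as text.
    rdd.map(_.mkString("\t")).saveAsTextFile("/tmp/array-out")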

Re: Custom streaming receiver slow on YARN

2015-02-09 Thread Jong Wook Kim
replying to my own thread; I realized that this only happens when the replication level is 1. Regardless of whether setting memory_only or disk or deserialized, I had to make the replication level >= 2 to make the streaming work properly on YARN. I still don't get it why, because intuitively less

Need a spark application.

2015-02-09 Thread Kartheek.R
Hi, Can someone please suggest some real life application implemented in spark ( things like gene sequencing) that is of type below code. Basically, the application should have jobs submitted via as many threads as possible. I need similar kind of spark application for benchmarking. val threadA
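
A minimal sketch of the shape being described, assuming a single shared SparkContext (the RDDs and job bodies are placeholders):

    // SparkContext is thread-safe for job submission, so independent actions can
    // be launched from separate threads and run concurrently, subject to scheduling.
    val rddA = sc.parallelize(1 to 1000000)
    val rddB = sc.parallelize(1 to 1000000)

    val threadA = new Thread(new Runnable {
      def run() { println("sum A = " + rddA.sum()) }
    })
    val threadB = new Thread(new Runnable {
      def run() { println("even count B = " + rddB.filter(_ % 2 == 0).count()) }
    })
    threadA.start(); threadB.start()
    threadA.join(); threadB.join()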

[MLlib] Performance problem in GeneralizedLinearAlgorithm

2015-02-09 Thread Josh Devins
I've been looking into a performance problem when using LogisticRegressionWithLBFGS (and in turn GeneralizedLinearAlgorithm). Here's an outline of what I've figured out so far and it would be great to get some confirmation of the problem, some input on how wide-spread this problem might be and any

Re: Spark certifications

2015-02-09 Thread Paco Nathan
Great question! O'Reilly Media and Databricks partnered to create the professionally recognized certification program for Apache Spark developers. A landing page with additional info is at: http://go.databricks.com/spark-certified-developer Then the registration is at http://www.oreilly.com

Re: How to create spark AMI in AWS

2015-02-09 Thread Franc Carter
Hi, I'm very new to Spark, but experienced with AWS - so take that into account with my suggestions. I started with an AWS base image and then added the pre-built Spark-1.2. I then made a 'Master' version and a 'Worker' version and made AMIs for them. The Master comes up with a sta

Re: Installing a python library along with ec2 cluster

2015-02-09 Thread gen tang
Hi, Please take a look at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami-ebs.html Cheers Gen On Mon, Feb 9, 2015 at 6:41 AM, Chengi Liu wrote: > Hi I am very new both in spark and aws stuff.. > Say, I want to install pandas on ec2.. (pip install pandas) > How do I create

How to create spark AMI in AWS

2015-02-09 Thread Guodong Wang
Hi guys, I want to launch spark cluster in AWS. And I know there is a spark_ec2.py script. I am using the AWS service in China. But I can not find the AMI in the region of China. So, I have to build one. My question is 1. Where is the bootstrap script to create the Spark AMI? Is it here( https:/

using spark in web services

2015-02-09 Thread Hafiz Mujadid
Hi experts! I am trying to use spark in my RESTful web services. I am using the Scala Lift framework for writing web services. Here is my boot class class Boot extends Bootable { def boot { Constants.loadConfiguration val sc=new SparkContext(new SparkConf().setMaster("local").setAppName("servi

Executor Lost with StorageLevel.MEMORY_AND_DISK_SER

2015-02-09 Thread Marius Soutier
Hi there, I’m trying to improve performance on a job that has GC troubles and takes longer to compute simply because it has to recompute failed tasks. After deferring object creation as much as possible, I’m now trying to improve memory usage with StorageLevel.MEMORY_AND_DISK_SER and a custom K
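
For context, a hedged sketch of the combination being described; the record class and registrator are illustrative, not the poster's code, and the master is assumed to come from spark-submit:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.serializer.KryoRegistrator
    import org.apache.spark.storage.StorageLevel

    case class MyRecord(id: Long, payload: Array[Byte])

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyRecord])
      }
    }

    val conf = new SparkConf()
      .setAppName("kryo-ser-demo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")
    val sc = new SparkContext(conf)

    val records = sc.parallelize(1 to 1000).map(i => MyRecord(i, Array.fill(1024)(0: Byte)))
    // Serialized, spill-to-disk storage: smaller in memory, extra CPU to deserialize.
    records.persist(StorageLevel.MEMORY_AND_DISK_SER)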

Re: generate a random matrix with uniform distribution

2015-02-09 Thread Sean Owen
Yes the example given here should have used uniformVectorRDD. Then it's correct. On Mon, Feb 9, 2015 at 9:56 AM, Luca Puggini wrote: > Thanks a lot! > Can I ask why this code generates a uniform distribution? > > If dist is N(0,1) data should be N(-1, 2). > > Let me know. > Thanks, > Luca > > 20

OutofMemoryError: Java heap space

2015-02-09 Thread Yifan LI
Hi, I just found the following errors during computation(graphx), anyone has ideas on this? thanks so much! (I think the memory is sufficient, spark.executor.memory 30GB ) 15/02/09 00:37:12 ERROR Executor: Exception in task 162.0 in stage 719.0 (TID 7653) java.lang.OutOfMemoryError: Java hea

Re: generate a random matrix with uniform distribution

2015-02-09 Thread Luca Puggini
Thanks a lot! Can I ask why this code generates a uniform distribution? If dist is N(0,1) data should be N(-1, 2). Let me know. Thanks, Luca 2015-02-07 3:00 GMT+00:00 Burak Yavuz : > Hi, > > You can do the following: > ``` > import org.apache.spark.mllib.linalg.distributed.RowMatrix > import o

Re: Spark certifications

2015-02-09 Thread Emre Sevinc
Hello, Please see the following certification: http://www.oreilly.com/data/sparkcert.html -- Emre Sevinç On Mon, Feb 9, 2015 at 10:42 AM, Saurabh Agrawal wrote: > > > Can somebody please suggest the best (and professionally recognized) > training and certification programs in Apache Spark

Spark certifications

2015-02-09 Thread Saurabh Agrawal
Can somebody please suggest the best (and professionally recognized) training and certification programs in Apache Spark across industry? Thanks!! Regards, Saurabh Agrawal Vice President Markit Green Boulevard B-9A, Tower C 3rd Floor, Sector - 62, Noida 201301, India +91 120 611 8274 Office

Re: Error when running example (pi.py)

2015-02-09 Thread Akhil Das
It says permission denied, just make sure the user ashish has permission over the directory /home/ashish/Downloads/spark-1.1.0-bin-hadoop2.4. chown -R ashish:ashish /home/ashish/Downloads/spark-1.1.* would do. Thanks Best Regards On Mon, Feb 9, 2015 at 12:47 PM, Ashish Kumar wrote: > > Trace

Re: Spark 1.2.x Yarn Auxiliary Shuffle Service

2015-02-09 Thread Arush Kharbanda
Is this what you are looking for 1. Build Spark with the YARN profile . Skip this step if you are using a pre-packaged distribution. 2. Locate the spark--yarn-shuffle.jar. This should be under $SPARK_HOME/network/yarn/target/s

Re: getting error when submit spark with master as yarn

2015-02-09 Thread Al M
Open up 'yarn-site.xml' in your hadoop configuration. You want to create configuration for yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb. Have a look here for details on how they work: https://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default