Unsupported language features in query

2014-09-02 Thread centerqi hu
hql("""CREATE TABLE tmp_adclick_gm_all ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' as SELECT SUM(uv) as uv, round(SUM(cost),2) as total, round(SUM(cost)/SUM(uv),2) FROM tmp_adclick_sellplat """) 14/09/02 15:32:28 INFO ParseDriver: Parse Completed java.lang.RuntimeExce

RE: Unsupported language features in query

2014-09-02 Thread Cheng, Hao
Currently SparkSQL doesn't support the row format/serde in CTAS. The workaround is to create the table first.
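
A minimal sketch of that workaround, following the query from the original message (the HiveContext setup is assumed, and the column names for the new table are made up):

    // assumes: val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc); import hiveContext._

    // 1. Create the delimited table up front with plain DDL (no CTAS).
    hql("""CREATE TABLE tmp_adclick_gm_all (uv BIGINT, total DOUBLE, avg_cost DOUBLE)
           ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'""")

    // 2. Populate it with a separate INSERT ... SELECT.
    hql("""INSERT INTO TABLE tmp_adclick_gm_all
           SELECT SUM(uv), round(SUM(cost), 2), round(SUM(cost)/SUM(uv), 2)
           FROM tmp_adclick_sellplat""")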

Re: Unsupported language features in query

2014-09-02 Thread centerqi hu
Thanks Cheng Hao. Is there a way to obtain the list of Hive statements that Spark SQL supports? Thanks.

RE: Unsupported language features in query

2014-09-02 Thread Cheng, Hao
I am afraid not, but you can report it in JIRA (https://issues.apache.org/jira/browse/SPARK) if you run into missing functionality in SparkSQL. SparkSQL aims to support all of the Hive functionality (or at least most of it) for the HQL dialect.

Re: Unsupported language features in query

2014-09-02 Thread centerqi hu
Thanks Cheng Hao. This bug has been submitted: https://issues.apache.org/jira/browse/SPARK-3343

New features (Discretization) for v1.x in xiangrui.pdf

2014-09-02 Thread filipus
Is there any news about discretization in Spark? Is there anything on git? I didn't find it yet.

pyspark yarn got exception

2014-09-02 Thread Oleg Ruchovets
Hi, I've installed PySpark on an HDP (Hortonworks) cluster. Executing the pi example:
command: spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563]# ./bin/spark-submit --master spark://10.193.1.71:7077 examples/src/main/python/pi.py 1000
exception: 14/09/02 17:34:02 INFO SecurityManager: U

save schemardd to hive

2014-09-02 Thread centerqi hu
I want to save a SchemaRDD to Hive.
val usermeta = hql(" SELECT userid,idlist from usermeta WHERE day='2014-08-01' limit 1000")
case class SomeClass(name:String,idlist:String)
val schemardd = usermeta.map(t=>{SomeClass(t(0).toString,t(1).toString)})
How do I save the schemardd to Hive? Thanks

Re: New features (Discretization) for v1.x in xiangrui.pdf

2014-09-02 Thread filipus
I guess I found it: https://github.com/LIDIAgroup/SparkFeatureSelection

Re: save schemardd to hive

2014-09-02 Thread centerqi hu
I got it:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
val usermeta = hql(" SELECT userid,idlist from meta WHERE day='2014-08-01' limit 1000")
case class SomeClass(name:String,i

Re: save schemardd to hive

2014-09-02 Thread Silvio Fiorito
You can use saveAsTable, or use a Spark SQL INSERT statement if you need other Hive query features, like partitioning.
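
A short sketch of the saveAsTable route, assuming a HiveContext with `import hiveContext._`, the usermeta query and SomeClass case class from the earlier message, and a Spark version where SchemaRDD.saveAsTable is available; the target table name is made up:

    case class SomeClass(name: String, idlist: String)
    val schemardd = usermeta.map(t => SomeClass(t(0).toString, t(1).toString))
    // the implicit createSchemaRDD from `import hiveContext._` turns the RDD of case
    // classes into a SchemaRDD, which can then be written out as a Hive table
    schemardd.saveAsTable("usermeta_snapshot")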

Re: save schemardd to hive

2014-09-02 Thread centerqi
Can a SchemaRDD execute SQL statements?

Re: save schemardd to hive

2014-09-02 Thread Silvio Fiorito
Once you've registered an RDD as a table, you can use it in SparkSQL statements:
myrdd.registerAsTable("my_table")
hql("FROM my_table INSERT INTO TABLE my_other_table")

Re: save schemardd to hive

2014-09-02 Thread centerqi
Thank you very much. So that works too; it seems I know too little about RDDs.

Re: [Streaming] Cannot get executors to stay alive

2014-09-02 Thread Yana
Trying to bump this -- I'm basically asking if anyone has noticed the executor leaking memory. I have a large key space but no churn in the RDD so I don't understand why memory consumption grows with time. Any experiences with streaming welcomed -- I'm hoping I'm doing something wrong -- V

Spark Java Configuration.

2014-09-02 Thread pcsenthil
Team, I am new to Apache Spark and I don't have much knowledge of Hadoop or big data. I need clarification on the below: how does Spark configuration work? From a tutorial I got the below:
SparkConf conf = new SparkConf().setAppName("Simple application") .setMaste

Re: Where to save intermediate results?

2014-09-02 Thread Daniel Siegmann
I don't have any personal experience with Spark Streaming. Whether you store your data in HDFS or a database or something else probably depends on the nature of your use case. On Fri, Aug 29, 2014 at 10:38 AM, huylv wrote: > Hi Daniel, > > Your suggestion is definitely an interesting approach.

Using Spark's ActorSystem for performing analytics using Akka

2014-09-02 Thread Aniket Bhatnagar
Sorry about the noob question, but I was just wondering if we use Spark's ActorSystem (SparkEnv.actorSystem), would it distribute actors across worker nodes or would the actors only run in driver JVM?

Re: pyspark yarn got exception

2014-09-02 Thread Andrew Or
Hi Oleg, If you are running Spark on a yarn cluster, you should set --master to yarn. By default this runs in client mode, which redirects all output of your application to your console. This is failing because it is trying to connect to a standalone master that you probably did not start. I am so

Re: Spark-shell return results when the job is executing?

2014-09-02 Thread Andrew Or
Spark-shell, or any other Spark application, does not return the full results of the job until it has finished executing. You could add a hook for it to write partial results to a file, but you may want to do so sparingly to incur fewer I/Os. If you have a large file and the result contains many lines, it

Re: Spark Java Configuration.

2014-09-02 Thread Yana Kadiyska
JavaSparkContext java_SC = new JavaSparkContext(conf); is the spark context. An application has a single spark context -- you won't be able to "keep calling this" -- you'll see an error if you try to create a second such object from the same application. Additionally, depending on your configurati

Spark on YARN question

2014-09-02 Thread Greg Hill
I'm working on setting up Spark on YARN using the HDP technical preview - http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/ I have installed the Spark JARs on all the slave nodes and configured YARN to find the JARs. It seems like everything is working. Unless I'm misunderstan

Re: Spark on YARN question

2014-09-02 Thread Matt Narrell
I’ve put my Spark JAR into HDFS, and specify the SPARK_JAR variable to point to the HDFS location of the jar. I’m not using any specialized configuration files (like spark-env.sh), but rather setting things either by environment variable per node, passing application arguments to the job, or ma

Re: Spark on YARN question

2014-09-02 Thread Andrew Or
Hi Greg, You should not need to even manually install Spark on each of the worker nodes or put it into HDFS yourself. Spark on Yarn will ship all necessary jars (i.e. the assembly + additional jars) to each of the containers for you. You can specify additional jars that your application depends on

Re: Spark on YARN question

2014-09-02 Thread Greg Hill
Thanks. That sounds like how I was thinking it worked. I did have to install the JARs on the slave nodes for yarn-cluster mode to work, FWIW. It's probably just whichever node ends up spawning the application master that needs it, but it wasn't passed along from spark-submit. Greg From: And

Spark on Mesos: Pyspark python libraries

2014-09-02 Thread Daniel Rodriguez
Hi all, I am getting started with Spark and Mesos. I already have Spark running on a Mesos cluster and I am able to start the Scala Spark and PySpark shells, yay! I still have questions on how to distribute 3rd party Python libraries, since I want to use stuff like nltk and MLlib on PySpark that re

Serialized 3rd party libs

2014-09-02 Thread Matt Narrell
Hello, I’m using Spark streaming to aggregate data from a Kafka topic in sliding windows. Usually we want to persist this aggregated data to a MongoDB cluster, or republish to a different Kafka topic. When I include these 3rd party drivers, I usually get a NotSerializableException due to the

Re: saveAsTextFile makes no progress without caching RDD

2014-09-02 Thread jerryye
As an update: I'm still getting the same issue. I ended up doing a coalesce instead of a cache to get around the memory issue, but saveAsTextFile still won't proceed without the coalesce or cache first.

Re: Serialized 3rd party libs

2014-09-02 Thread Sean Owen
The problem is not using the drivers per se, but writing your functions in a way that you are trying to serialize them. You can't serialize them, and indeed don't want to. Instead your code needs to reopen connections and so forth when the function is instantiated on the remote worker. static var
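
A minimal sketch of that pattern using foreachPartition, so the non-serializable client is constructed on the worker instead of being captured in the closure; FakeClient is a made-up stand-in for a real Mongo/Kafka driver:

    // Stand-in for a non-serializable 3rd party client.
    class FakeClient(host: String) {
      def write(record: String): Unit = println(s"writing $record to $host")
      def close(): Unit = ()
    }

    val records = sc.parallelize(Seq("a", "b", "c"))

    records.foreachPartition { partition =>
      // Open the connection inside the task, once per partition, on the worker.
      val client = new FakeClient("db-host:27017")
      partition.foreach(rec => client.write(rec))
      client.close()
    }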

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-09-02 Thread Sean Owen
+user@ An executor is specific to an application, but an application can be executing many jobs at once. So as I understand many jobs' tasks can be executing at once on an executor. You may not use your full 80-way parallelism if, for example, your data set doesn't have 80 partitions. I also beli

Regarding function unpersist on rdd

2014-09-02 Thread Zijing Guo
Hello, can someone enlighten me on whether calling unpersist on an RDD is expensive? What is the best way to uncache a cached RDD? Thanks, Edwin

Publishing a transformed DStream to Kafka

2014-09-02 Thread Massimiliano Tomassi
Hello all, after having applied several transformations to a DStream I'd like to publish all the elements in all the resulting RDDs to Kafka. What would be the best way to do that? Just using DStream.foreach and then RDD.foreach? Is there any other built-in utility for this use case? Thanks a lot

Re: Publishing a transformed DStream to Kafka

2014-09-02 Thread Tim Smith
I'd be interested in finding the answer too. Right now, I do:
val kafkaOutMsgs = kafkInMessages.map(x => myFunc(x._2, someParam))
kafkaOutMsgs.foreachRDD((rdd, time) => { rdd.foreach(rec => { writer.output(rec) }) })
// where writer.output is a method that takes a string and writer is an instance of a
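
A hedged variant of the same idea that creates the producer once per partition instead of per record; KafkaWriter is a made-up stub standing in for a real producer wrapper, and kafkaOutMsgs is the transformed DStream[String] from the snippet above:

    // Made-up stub standing in for a real Kafka producer wrapper.
    class KafkaWriter(brokers: String) {
      def output(rec: String): Unit = println(s"to kafka ($brokers): $rec")
      def close(): Unit = ()
    }

    kafkaOutMsgs.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val writer = new KafkaWriter("broker1:9092")   // created on the worker, per partition
        partition.foreach(rec => writer.output(rec))
        writer.close()
      }
    }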

pyspark and cassandra

2014-09-02 Thread Oleg Ruchovets
Hi all, is it possible to use Cassandra as input data for PySpark? I found an example for Java - http://java.dzone.com/articles/sparkcassandra-stack-perform?page=0,0 - and I am looking for something similar for Python. Thanks, Oleg.

mllib performance on cluster

2014-09-02 Thread SK
Hi, I evaluated the runtime performance of some of the MLlib classification algorithms on a local machine and a cluster with 10 nodes. I used standalone mode and Spark 1.0.1 in both cases. Here are the results for the total runtime: Local Cluster Logi

Re: Serialized 3rd party libs

2014-09-02 Thread Matt Narrell
Sean, thanks for pointing this out. I'd have to experiment with the mapPartitions method, but you're right, this seems to address the issue directly. I'm also connecting to Zookeeper to retrieve SparkConf parameters. I run into the same issue with my Zookeeper driver; however, this is before a

Re: Spark Streaming : Could not compute split, block not found

2014-09-02 Thread Tim Smith
I am seeing similar errors in my job's logs. TD - Are you still waiting for debug logs? If yes, can you please let me know how to generate debug logs? I am using Spark/Yarn and setting "NodeManager" logs to "DEBUG" level doesn't seem to produce anything but INFO logs. Thanks, Tim >Aaah sorry, I

Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
How many iterations are you running? Can you provide the exact details about the size of the dataset? (how many data points, how many features) Is this sparse or dense - and for the sparse case, how many non-zeroes? How many partitions is your data RDD? For very small datasets the scheduling overh

Re: Spark on Mesos: Pyspark python libraries

2014-09-02 Thread Davies Liu
PYSPARK_PYTHON may work for you, it's used to specify which Python interpreter should be used in both driver and worker. For example, if anaconda was installed as /anaconda on all the machines, then you can specify PYSPARK_PYTHON=/anaconda/bin/python to use anaconda virtual environment in PySpark.

Re: pyspark and cassandra

2014-09-02 Thread Kan Zhang
In Spark 1.1, it is possible to read from Cassandra using Hadoop jobs. See examples/src/main/python/cassandra_inputformat.py for an example. You may need to write your own key/value converters. On Tue, Sep 2, 2014 at 11:10 AM, Oleg Ruchovets wrote: > Hi All , >Is it possible to have cassand

Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
Also - what hardware are you running the cluster on? And what is the local machine hardware? On Tue, Sep 2, 2014 at 11:57 AM, Evan R. Sparks wrote: > How many iterations are you running? Can you provide the exact details > about the size of the dataset? (how many data points, how many features)

MLLib decision tree: Weights

2014-09-02 Thread Sameer Tilak
Hi everyone, We are looking to apply a weight to each training example; this weight should be used when computing the penalty of a misclassified example. For instance, without weighting, each example is penalized 1 point when evaluating the model of a classifier, such as a decision tree

Re: spark-ec2 [Errno 110] Connection time out

2014-09-02 Thread Daniil Osipov
Make sure your key pair is configured to access whatever region you're deploying to - it defaults to us-east-1, but you can provide a custom one with parameter --region. On Sat, Aug 30, 2014 at 12:53 AM, David Matheson wrote: > I'm following the latest documentation on configuring a cluster on

Re: mllib performance on cluster

2014-09-02 Thread SK
Num iterations: for LR and SVM, I am using the default value of 100. For all the other parameters I am also using the default values. I am pretty much reusing the code from BinaryClassification.scala. For Decision Tree, I don't see any parameter for the number of iterations in the example code, so I did

flattening a list in spark sql

2014-09-02 Thread gtinside
Hi, I am using jsonRDD in Spark SQL and having trouble iterating through an array inside the JSON object. Please refer to the schema below:
 |-- Preferences: struct (nullable = true)
 |    |-- destinations: array (nullable = true)
 |-- user: string (nullable = true)
Sample Data: -- Preferences: st

Re: Spark Streaming with Kafka, building project with 'sbt assembly' is extremely slow

2014-09-02 Thread Daniil Osipov
What version of sbt are you using? There is a bug in early versions of 0.13 that causes assembly to be extremely slow - make sure you're using the latest one. On Fri, Aug 29, 2014 at 1:30 PM, Aris wrote: > Hi folks, > > I am trying to use Kafka with Spark Streaming, and it appears I cannot do >

Creating an RDD in another RDD causes deadlock

2014-09-02 Thread cjwang
My code seemed to deadlock when I tried to do this:
object MoreRdd extends Serializable {
  def apply(i: Int) = {
    val rdd2 = sc.parallelize(0 to 10)
    rdd2.map(j => i*10 + j).collect
  }
}
val rdd1 = sc.parallelize(0 to 10)
val y = rdd1.map(i => MoreRdd(i)).

Re: Creating an RDD in another RDD causes deadlock

2014-09-02 Thread Sean Owen
Yes, you can't use RDDs inside RDDs. But of course you can do this:
val nums = (0 to 10)
val y = nums.map(i => MoreRdd(i)).collect
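
Spelled out slightly, a sketch of the same fix: keep the outer loop as a plain local collection on the driver and only use RDDs inside it.

    // The outer "loop" is an ordinary Scala collection on the driver; only the inner
    // computation is an RDD, so no RDD is ever created inside another RDD's tasks.
    def moreRdd(i: Int): Array[Int] = {
      val rdd2 = sc.parallelize(0 to 10)
      rdd2.map(j => i * 10 + j).collect()
    }

    val y: Seq[Array[Int]] = (0 to 10).map(moreRdd)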

Re: Creating an RDD in another RDD causes deadlock

2014-09-02 Thread cjwang
I didn't know this restriction. Thank you.

I am looking for a Java sample of a Partitioner

2014-09-02 Thread Steve Lewis
Assume, say, JavaWordCount. I call the equivalent of a Mapper: JavaPairRDD ones = words.mapToPair(... Now right here I want to guarantee that each word starting with a particular letter is processed in a specific partition. (Don't tell me this is a dumb idea - I know that, but in Hadoop code a cus

Spark Streaming - how to implement multiple calculation using the same data set

2014-09-02 Thread salemi
Hi, I am planning to use an incoming DStream and calculate different measures from the same stream. I was able to calculate the individual measures separately and now I have to merge them, and Spark Streaming doesn't support outer join yet. handlingtimePerWorker List(workerId, handlingTime) filePr

Re: mllib performance on cluster

2014-09-02 Thread Bharath Mundlapudi
Those are interesting numbers. You haven't mentioned the dataset size in your thread. This is a classic example of scalability and performance, assuming your baseline numbers are correct and you tuned everything correctly on your cluster. Putting on my outside cap, there are multiple reasons for this,

Re: Spark on YARN question

2014-09-02 Thread Dimension Data, LLC.
Hello friends: I have a follow-up to Andrew's well-articulated answer below (thank you for that). (1) I've seen both of these invocations in various places: (a) '--master yarn' (b) '--master yarn-client', the latter of which doesn't appear in 'pyspark|spark-submit|spark-

Re: flattening a list in spark sql

2014-09-02 Thread Michael Armbrust
Check out LATERAL VIEW explode: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView On Tue, Sep 2, 2014 at 1:26 PM, gtinside wrote: > Hi , > > I am using jsonRDD in spark sql and having trouble iterating through array > inside the json object. Please refer to the schema
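
For reference, a small sketch of that syntax against the schema from the question; the table name and HiveContext setup are assumed:

    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)

    // prefs is assumed to be registered as a table whose Preferences.destinations field is an array
    hiveContext.hql("""
      SELECT user, destination
      FROM prefs
      LATERAL VIEW explode(Preferences.destinations) dests AS destination
    """).collect().foreach(println)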

Re: Spark on YARN question

2014-09-02 Thread Andrew Or
Hi Didata, (1) Correct. The default deploy mode is `client`, so both masters `yarn` and `yarn-client` run Spark in client mode. If you explicitly specify master as `yarn-cluster`, Spark will run in cluster mode. If you implicitly specify one deploy mode through the master (e.g. yarn-client) but se

Re: mllib performance on cluster

2014-09-02 Thread SK
The dataset is quite small: 5.6 KB. It has 200 rows and 3 features, and 1 column of labels. From this dataset, I split 80% for the training set and 20% for the test set. The features are integer counts and the labels are binary (1/0). Thanks

RE: [Streaming] Akka-based receiver with messages defined in uploaded jar

2014-09-02 Thread Anton Brazhnyk
It works with "spark.executor.extraClassPath" – no exceptions in this case and I'm getting the expected results. But to me it limits/complicates the usage of Akka-based receivers a lot. Do you think it should be considered a bug? Even if it's not, can it be fixed/worked around by some classloading magic

Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
Hmm... something is fishy here. That's a *really* small dataset for a spark job, so almost all your time will be spent in these overheads, but still you should be able to train a logistic regression model with the default options and 100 iterations in <1s on a single machine. Are you caching your
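
A small sketch of the caching point for MLlib logistic regression; the file path and parsing are made up:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Cache the training RDD so the 100 SGD iterations don't re-read and re-parse the input.
    val training = sc.textFile("hdfs:///path/to/train.csv").map { line =>
      val parts = line.split(',')
      LabeledPoint(parts.last.toDouble, Vectors.dense(parts.init.map(_.toDouble)))
    }.cache()

    val model = LogisticRegressionWithSGD.train(training, 100)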

Re: [Streaming] Akka-based receiver with messages defined in uploaded jar

2014-09-02 Thread Tathagata Das
I am not sure if there is a quick fix for this, as the actor is started in the same actor system as Spark's. And since that actor system is started as soon as the executor is launched, even before the application code is launched, there isn't much classloader magic that can be done.

Re: [PySpark] large # of partitions causes OOM

2014-09-02 Thread Matthew Farrellee
On 08/29/2014 06:05 PM, Nick Chammas wrote: Here's a repro for PySpark:
a = sc.parallelize(["Nick","John","Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
When I try this on an EC2 cluster with 1.1.0-rc2 and Python 2.7, this is what I get:

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-09-02 Thread Victor Tso-Guillen
I'm pretty sure the issue was an interaction with another subsystem. Thanks for your patience with me! On Tue, Sep 2, 2014 at 10:05 AM, Sean Owen wrote: > +user@ > > An executor is specific to an application, but an application can be > executing many jobs at once. So as I understand many jobs'

Re: zip equal-length but unequally-partition

2014-09-02 Thread Matthew Farrellee
On 09/01/2014 11:39 PM, Kevin Jung wrote: http://www.adamcrume.com/blog/archive/2014/02/19/fixing-sparks-rdd-zip Please check this URL. I got the same problem in v1.0.1. In some cases, the RDD loses several elements after zip, so th

Re: Spark-shell return results when the job is executing?

2014-09-02 Thread Matthew Farrellee
What do you think about using a streamrdd in this case? Assuming streaming is available for PySpark, you can collect based on the number of events. Best, Matt

Re: flattening a list in spark sql

2014-09-02 Thread gtinside
Thanks. I am not using a HiveContext. I am loading data from Cassandra, converting it into JSON, and then querying it through a SQLContext. Can I use a HiveContext to query a jsonRDD? Gaurav

Re: Spark Streaming - how to implement multiple calculation using the same data set

2014-09-02 Thread Tobias Pfeiffer
Hi, On Wed, Sep 3, 2014 at 6:54 AM, salemi wrote: > I was able to calculate the individual measures separately and know I have > to merge them and spark streaming doesn't support outer join yet. > Can't you assign some dummy key (e.g., index) before your processing and then join on that key usi
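
A rough sketch of that suggestion for the two worker measures mentioned above, assuming both streams are already keyed by workerId (the names and types are guesses):

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations such as join

    // handlingTimePerWorker: DStream[(String, Double)]   -- (workerId, handlingTime)
    // filesProcessedPerWorker: DStream[(String, Long)]   -- (workerId, fileCount)
    val merged = handlingTimePerWorker.join(filesProcessedPerWorker)

    // merged: DStream[(String, (Double, Long))] -- one record per worker carrying both measures
    merged.print()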

What is the appropriate privileges needed for writting files into checkpoint directory?

2014-09-02 Thread Tao Xiao
I tried to run KafkaWordCount in a Spark standalone cluster. In this application, the checkpoint directory was set as follows:
val sparkConf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
After sub

Re: Spark Streaming - how to implement multiple calculation using the same data set

2014-09-02 Thread Alireza Salemi
Tobias, that was what I was planning to do, but my technical lead is of the opinion that we should somehow process a message only once and calculate all the measures for the worker. I was wondering if there is a solution out there for that? Thanks, Ali

Re: flattening a list in spark sql

2014-09-02 Thread Michael Armbrust
Yes, you can. HiveContext's functionality is a strict superset of SQLContext's.
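
A brief sketch of that combination, assuming the JSON strings built from Cassandra are available as an RDD[String] named jsonStrings:

    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)

    val prefs = hiveContext.jsonRDD(jsonStrings)   // same jsonRDD call, just on a HiveContext
    prefs.registerTempTable("prefs")               // registerAsTable on older releases

    // HiveQL-only features such as LATERAL VIEW explode are now available on the JSON data
    hiveContext.hql("SELECT user, d FROM prefs LATERAL VIEW explode(Preferences.destinations) t AS d")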

Re: pyspark yarn got exception

2014-09-02 Thread Oleg Ruchovets
Hi Andrew, what should I do to set the master to yarn? Can you please point me to the command or documentation for how to do it? I am doing the following: executed start-all.sh
[root@HDOP-B sbin]# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /root/spark-1.0.1.2.1.3.0-

Number of elements in ArrayBuffer

2014-09-02 Thread Deep Pradhan
Hi, I have the following ArrayBuffer: ArrayBuffer(5,3,1,4). Now, I want to get the number of elements in this ArrayBuffer and also the first element of the ArrayBuffer. I used .length and .size but they are returning 1 instead of 4. I also used .head and .last for getting the first and the last

Re: pyspark yarn got exception

2014-09-02 Thread Oleg Ruchovets
Hi, I changed my command to:
./bin/spark-submit --master spark://HDOP-B.AGT:7077 --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 examples/src/main/python/pi.py 1000
and it fixed the problem. I still have a couple of questions: PROCESS_LOCAL is not Yarn executi

Re: Number of elements in ArrayBuffer

2014-09-02 Thread Madabhattula Rajesh Kumar
Hi Deep, please find below the results for ArrayBuffer in the Scala REPL:
scala> import scala.collection.mutable.ArrayBuffer
import scala.collection.mutable.ArrayBuffer
scala> val a = ArrayBuffer(5,3,1,4)
a: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(5, 3, 1, 4)
scala> a.head
res2: Int = 5

Re: zip equal-length but unequally-partition

2014-09-02 Thread Kevin Jung
I just created it. Here's the ticket: https://issues.apache.org/jira/browse/SPARK-3364 Thanks, Kevin

Re: pyspark yarn got exception

2014-09-02 Thread Oleg Ruchovets
Hi, I changed the master to point to YARN and got the following exceptions:
[root@HDOP-B spark-1.0.1.2.1.3.0-563-bin-2.4.0.2.1.3.0-563]# ./bin/spark-submit --master yarn://HDOP-M.AGT:8032 --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 examples/src/main/python/pi.py 1000
/

Re: pyspark yarn got exception

2014-09-02 Thread Sandy Ryza
Hi Oleg. To run on YARN, simply set master to "yarn". The YARN configuration, located in a yarn-site.xml, determines where to look for the YARN ResourceManager. PROCESS_LOCAL is orthogonal to the choice of cluster resource manager. A task is considered PROCESS_LOCAL when the executor it's running

Re: Number of elements in ArrayBuffer

2014-09-02 Thread Deep Pradhan
I have a problem here. When I run the commands that Rajesh suggested in the Scala REPL, they work fine. But I want this to work in Spark code, where I need to find the number of elements in an ArrayBuffer. In Spark code, these things are not working. How should I do that?

Re: MLLib decision tree: Weights

2014-09-02 Thread Xiangrui Meng
This is not supported in MLlib. Hopefully, we will add support for weighted examples in v1.2. If you want to train weighted instances with the current tree implementation, please try importance sampling first to adjust the weights. For instance, an example with weight 0.3 is sampled with probabilit
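
A rough sketch of that idea, assuming weights in (0, 1] and an RDD of (weight, LabeledPoint) pairs; the names are made up:

    import scala.util.Random
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Keep each example with probability equal to its weight, so e.g. weight 0.3 survives
    // roughly 30% of the time; the result can be fed to the usual DecisionTree API.
    def sampleByWeight(data: RDD[(Double, LabeledPoint)]): RDD[LabeledPoint] =
      data.mapPartitionsWithIndex { (idx, iter) =>
        val rng = new Random(idx)   // per-partition seed for reproducibility
        iter.filter { case (weight, _) => rng.nextDouble() < weight }.map(_._2)
      }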

Re: New features (Discretization) for v1.x in xiangrui.pdf

2014-09-02 Thread Xiangrui Meng
We have a pending PR (https://github.com/apache/spark/pull/216) for discretization but it has performance issues. We will try to spend more time to improve it. -Xiangrui