SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
Hi, I am trying to write some unit tests, following the Spark programming guide. But I observed that the unit test runs very slowly (the code is just a SparkPi), so I turned the log level to trace and looked through the log output, and found creat

RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-16 Thread linkpatrickliu
Hi, Hao Cheng. I have done other tests. And the result shows the thriftServer can connect to Zookeeper. However, I found some more interesting things. And I think I have found a bug! Test procedure: Test1: (0) Use beeline to connect to thriftServer. (1) Switch database "use dw_op1"; (OK) The log

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
Thanks Christian! I tried compiling from source but am still getting the same hadoop client version error when reading from HDFS. Will have to poke deeper... perhaps I've got some classpath issues. FWIW I compiled using: $ MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

Re: SchemaRDD saveToCassandra

2014-09-16 Thread lmk
Hi Michael, Please correct me if I am wrong. The error seems to originate from spark only. Please have a look at the stack trace of the error which is as follows: [error] (run-main-0) java.lang.NoSuchMethodException: Cannot resolve any suitable constructor for class org.apache.spark.sql.catalyst.e

Re: SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
I connected my sample project to a hosted CI service; it only takes 3 seconds to run there... while the same tests take 2 minutes on my MacBook Pro. So maybe this is a Mac OS specific problem? On Tue, Sep 16, 2014 at 3:06 PM, 诺铁 wrote: > hi, > > I am trying to write some unit test, following spark

Re: Broadcast error

2014-09-16 Thread Chengi Liu
Cool.. let me try that.. any other suggestion(s) on things I can try? On Mon, Sep 15, 2014 at 9:59 AM, Davies Liu wrote: > I think the 1.1 will be really helpful for you, it's all compatible > with 1.0, so it's > not hard to upgrade to 1.1. > > On Mon, Sep 15, 2014 at 2:35 AM, Chengi Liu

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Christian Chua
Is 1.0.8 working for you? You indicated your last known good version is 1.0.0. Maybe we can track down where it broke. > On Sep 16, 2014, at 12:25 AM, Paul Wais wrote: > > Thanks Christian! I tried compiling from source but am still getting the > same hadoop client version error when read

Re: Spark SQL Thrift JDBC server deployment for production

2014-09-16 Thread vasiliy
it works, thanks

Re: SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
Sorry for the disturbance; please ignore this mail. In the end, I found it was slow because of a lack of memory on my machine.. sorry again. On Tue, Sep 16, 2014 at 3:26 PM, 诺铁 wrote: > I connect my sample project to a hosted CI service, it only takes 3 > seconds to run there...while the same tests takes 2min

Re: Serving data

2014-09-16 Thread Marius Soutier
Writing to Parquet and querying the result via SparkSQL works great (except for some strange SQL parser errors). However, the problem remains: how do I get that data back to a dashboard? So I guess I’ll have to use a database after all. You can batch up data & store into parquet partitions as we

RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-16 Thread Cheng, Hao
Thank you for pasting the steps, I will look at this, hopefully come out with a solution soon. -Original Message- From: linkpatrickliu [mailto:linkpatrick...@live.com] Sent: Tuesday, September 16, 2014 3:17 PM To: u...@spark.incubator.apache.org Subject: RE: SparkSQL 1.1 hang when "DROP"

Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Yifan LI
Dear Ankur, Thanks! :) - from [1], and my understanding, the existing inactive feature in graphx pregel api is “if there is no in-edges, from active vertex, to this vertex, then we will say this one is inactive”, right? For instance, there is a graph in which every vertex has at least one in-e

Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Ankur Dave
At 2014-09-16 10:55:37 +0200, Yifan LI wrote: > - from [1], and my understanding, the existing inactive feature in graphx > pregel api is “if there is no in-edges, from active vertex, to this vertex, > then we will say this one is inactive”, right? Well, that's true when messages are only sent

Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Yifan LI
Thanks, :) but I am wondering if there is a message(none?) sent to the target vertex(the rank change is less than tolerance) in below dynamic page rank implementation, def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = { if (edge.srcAttr._2 > tol) { Iterator((edge.dstI

Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Ankur Dave
At 2014-09-16 12:23:10 +0200, Yifan LI wrote: > but I am wondering if there is a message(none?) sent to the target vertex(the > rank change is less than tolerance) in below dynamic page rank implementation, > > def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = { > if (edge.src
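For readers following this thread: the sendMessage under discussion is, roughly, the one below — a sketch reconstructed from the GraphX dynamic PageRank of the Spark 1.1 era, not verbatim source. Only a source vertex whose last rank delta exceeds tol sends a message, so a vertex none of whose in-neighbours are still changing receives no messages and stays inactive in the next round.

    import org.apache.spark.graphx.{EdgeTriplet, VertexId}

    val tol = 0.001 // hypothetical convergence tolerance

    // Vertex attribute is (rank, delta); edge attribute is the normalized edge weight.
    def sendMessage(edge: EdgeTriplet[(Double, Double), Double]): Iterator[(VertexId, Double)] = {
      if (edge.srcAttr._2 > tol) {
        // Source still changing: propagate its delta along this edge.
        Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
      } else {
        // Source has converged: send nothing, so a target that receives no
        // messages at all is skipped in the next Pregel iteration.
        Iterator.empty
      }
    }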

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Sean Owen
From the caller / application perspective, you don't care what version of Hadoop Spark is running on in the cluster. The Spark API you compile against is the same. When you spark-submit the app, at runtime, Spark is using the Hadoop libraries from the cluster, which are the right version. So when

Re: How to set executor num on spark on yarn

2014-09-16 Thread Sean Owen
How many cores do your machines have? --executor-cores should be the number of cores each executor uses. Fewer cores means more executors in general. From your data, it sounds like, for example, there are 7 nodes with 4+ cores available to YARN, and 2 more nodes with 2-3 cores available. Hence when

Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
I have a standalone spark cluster and from within the same scala application I'm creating 2 different spark contexts to run two different spark streaming jobs, as SparkConf is different for each of them. I'm getting this error that... I don't really understand: 14/09/16 11:51:35 ERROR OneForOneStra

Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
It seems that, as I have a single scala application, the scheduler is the same and there is a collision between executors of both spark context. Is there a way to change how the executor ID is generated (maybe an uuid instead of a sequential number..?) 2014-09-16 13:07 GMT+01:00 Luis Ángel Vicente

Re: PySpark on Yarn - how group by data properly

2014-09-16 Thread Oleg Ruchovets
I am expanding my data set and executing pyspark on yarn. I noticed that only 2 processes processed the data: 14210 yarn 20 0 2463m 2.0g 9708 R 100.0 4.3 8:22.63 python2.7 32467 yarn 20 0 2519m 2.1g 9720 R 99.3 4.4 7:16.97 python2.7 *Question:* *how to configure

Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
When I said scheduler I meant executor backend. 2014-09-16 13:26 GMT+01:00 Luis Ángel Vicente Sánchez < langel.gro...@gmail.com>: > It seems that, as I have a single scala application, the scheduler is the > same and there is a collision between executors of both spark context. Is > there a way t

Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
I dug a bit more and the executor ID is a number so it's seems there is not possible workaround. Looking at the code of the CoarseGrainedSchedulerBackend.scala: https://github.com/apache/spark/blob/6324eb7b5b0ae005cb2e913e36b1508bd6f1b9b8/core/src/main/scala/org/apache/spark/scheduler/cluster/Coa

Reduce Tuple2 to Tuple2>>

2014-09-16 Thread Tom
From my map function I create Tuple2 pairs. Now I want to reduce them, and get something like Tuple2>. The only way I found to do this was by treating all variables as String, and in the reduceByKey do /return a._2 + "," + b._2/ //in which both are numeric values saved in a String. After which I

Re: Reduce Tuple2 to Tuple2>>

2014-09-16 Thread Sean Owen
If you mean you have (key,value) pairs, and want pairs with key, and all values for that key, then you're looking for groupByKey On Tue, Sep 16, 2014 at 2:42 PM, Tom wrote: > From my map function I create Tuple2 pairs. Now I want to > reduce them, and get something like Tuple2>. > > The only way
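A minimal sketch of that suggestion, assuming string keys, integer values and an existing SparkContext sc (the pair contents are made up):

    import org.apache.spark.SparkContext._ // pair RDD functions in Spark 1.x

    // Hypothetical (key, value) pairs standing in for the map output.
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // All values per key, without concatenating them into a comma-separated String.
    val grouped = pairs.groupByKey() // RDD[(String, Iterable[Int])]

    // If only an aggregate per key is needed, reduceByKey avoids building the full collection.
    val sums = pairs.reduceByKey(_ + _) // RDD[(String, Int)]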

Re: Serving data

2014-09-16 Thread Yana Kadiyska
If your dashboard is doing ajax/pull requests against say a REST API you can always create a Spark context in your rest service and use SparkSQL to query over the parquet files. The parquet files are already on disk so it seems silly to write both to parquet and to a DB...unless I'm missing somethi
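A rough sketch of that approach (Spark 1.1 SQLContext API; the path, table name and query are hypothetical, and the surrounding REST layer is omitted):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // One long-lived context inside the dashboard's REST service.
    val sc = new SparkContext(new SparkConf().setAppName("dashboard-query-service"))
    val sqlContext = new SQLContext(sc)

    // Point SparkSQL at the Parquet files the batch job already writes.
    val events = sqlContext.parquetFile("hdfs:///dashboards/events.parquet")
    events.registerTempTable("events")

    // Each dashboard request becomes a SQL query over the Parquet-backed table.
    val topPages = sqlContext.sql(
      "SELECT page, COUNT(*) AS hits FROM events GROUP BY page ORDER BY hits DESC LIMIT 10")
    topPages.collect()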

java.util.NoSuchElementException: key not found

2014-09-16 Thread Brad Miller
Hi All, I suspect I am experiencing a bug. I've noticed that while running larger jobs, they occasionally die with the exception "java.util.NoSuchElementException: key not found xyz", where "xyz" denotes the ID of some particular task. I've excerpted the log from one job that died in this way bel

org.apache.spark.SparkException: java.io.FileNotFoundException: does not exist)

2014-09-16 Thread Hui Li
Hi, I am new to Spark. I just set up a small cluster and wanted to run some simple MLlib examples. By following the instructions of https://spark.apache.org/docs/0.9.0/mllib-guide.html#binary-classification-1, I could successfully run everything until the step of SVMWithSGD, where I got the follow

Spark as a Library

2014-09-16 Thread Ruebenacker, Oliver A
Hello, Suppose I want to use Spark from an application that I already submit to run in another container (e.g. Tomcat). Is this at all possible? Or do I have to split the app into two components, and submit one to Spark and one to the other container? In that case, what is the preferred

collect on hadoopFile RDD returns wrong results

2014-09-16 Thread vasiliy
Hello. I have a hadoopFile RDD and I tried to collect items to the driver program, but it returns an array of identical records (equal to the last record of my file). My code is like this: val rdd = sc.hadoopFile( "hdfs:///data.avro", classOf[org.apache.avro.mapred.AvroInputFormat[MyAv

HBase and non-existent TableInputFormat

2014-09-16 Thread Y. Dong
Hello, I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply read from hbase. The Java code is attached. However, the problem is that TableInputFormat does not even exist in the hbase-client API; is there any other way I can read from hbase? Thanks SparkConf sconf = new SparkConf().set

Re: combineByKey throws ClassCastException

2014-09-16 Thread Tao Xiao
This problem was caused by the fact that I used a package jar with a Spark version (0.9.1) different from that of the cluster (0.9.0). When I used the correct package jar (spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar) instead the application can run as expected. 2014-09-15 14:57 G

Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Ted Yu
bq. TableInputFormat does not even exist in hbase-client API. It is in the hbase-server module. Take a look at http://hbase.apache.org/book.html#mapreduce.example.read On Tue, Sep 16, 2014 at 8:18 AM, Y. Dong wrote: > Hello, > > I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simp

Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Ted Yu
hbase-client module serves client facing APIs. hbase-server module is supposed to host classes used on server side. There is still some work to be done so that the above goal is achieved. On Tue, Sep 16, 2014 at 9:06 AM, Y. Dong wrote: > Thanks Ted. It is indeed in hbase-server. Just curious, w

Re: scala 2.11?

2014-09-16 Thread Mohit Jaggi
Can I load that plugin in spark-shell? Or perhaps, due to the 2-phase compilation, quasiquotes won't work in the shell? On Mon, Sep 15, 2014 at 7:15 PM, Mark Hamstra wrote: > Okay, that's consistent with what I was expecting. Thanks, Matei. > > On Mon, Sep 15, 2014 at 5:20 PM, Matei Zaharia > wrote: >

RE: HBase and non-existent TableInputFormat

2014-09-16 Thread abraham.jacob
Hi, I had a similar situation in which I needed to read data from HBase and work with the data inside of a Spark context. After much googling, I finally got mine to work. There are a bunch of steps that you need to do to get this working - The problem is that the spark context does not know anyt
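For readers hitting the same wall, the pattern those steps usually lead to looks roughly like the sketch below — assuming hbase-site.xml is on the classpath, the hbase-server 0.98 artifact (which ships TableInputFormat) is on the driver and executor classpaths, and the table name is made up:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hbase-read"))

    // The HBase configuration tells TableInputFormat which table (and ZooKeeper quorum) to scan.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // hypothetical table name

    // TableInputFormat is a "new API" Hadoop InputFormat, hence newAPIHadoopRDD.
    val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    println(s"rows: ${hbaseRDD.count()}")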

Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Nicholas Chammas
Btw, there are some examples in the Spark GitHub repo that you may find helpful. Here's one related to HBase. On Tue, Sep 16, 2014 at 1:22 PM, wrote: > *Hi, * > > > > *I had a similar

Re: Spark as a Library

2014-09-16 Thread Matei Zaharia
If you want to run the computation on just one machine (using Spark's local mode), it can probably run in a container. Otherwise you can create a SparkContext there and connect it to a cluster outside. Note that I haven't tried this though, so the security policies of the container might be too

Re: Spark as a Library

2014-09-16 Thread Soumya Simanta
It depends on what you want to do with Spark. The following has worked for me. Let the container handle the HTTP request and then talk to Spark using another HTTP/REST interface. You can use the Spark Job Server for this. Embedding Spark inside the container is not a great long term solution IMO b

RE: HBase and non-existent TableInputFormat

2014-09-16 Thread abraham.jacob
Yes, that was very helpful… ☺ Here are a few more I found on my quest to get HBase working with Spark – This one gives details about HBase dependencies and Spark classpaths: http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html This one has a code overview – http://www.abcn.net/2014/

Re: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-16 Thread Yin Huai
Seems https://issues.apache.org/jira/browse/HIVE-5474 is related? On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao wrote: > Thank you for pasting the steps, I will look at this, hopefully come out > with a solution soon. > > -Original Message- > From: linkpatrickliu [mailto:linkpatrick...@liv

Re: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-16 Thread Yin Huai
I meant it may be a Hive bug since we also call Hive's drop table internally. On Tue, Sep 16, 2014 at 1:44 PM, Yin Huai wrote: > Seems https://issues.apache.org/jira/browse/HIVE-5474 is related? > > On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao wrote: > >> Thank you for pasting the steps, I will

RDD projection and sorting

2014-09-16 Thread Sameer Tilak
Hi All, I have data in the following format: the 1st column is the userid and the second column onward are class ids for various products. I want to save this in LIBSVM format, and an intermediate step is to sort the class ids (in ascending order). For example: I/P uid1 12433580 2670122
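A sketch of that transformation, under the assumptions that the input is whitespace-separated, the first field is the user id, each class id becomes an "index:1" feature, and the paths are hypothetical:

    // One line per user: "<userid> <classid> <classid> ...".
    val lines = sc.textFile("hdfs:///input/user_classes.txt")

    val libsvmLines = lines.map { line =>
      val fields = line.trim.split("\\s+")
      val userId = fields.head
      // LIBSVM requires feature indices in ascending order, hence the sort.
      val classIds = fields.tail.map(_.toInt).sorted
      userId + " " + classIds.map(id => s"$id:1").mkString(" ")
    }

    libsvmLines.saveAsTextFile("hdfs:///output/user_classes_libsvm")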

RE: Spark as a Library

2014-09-16 Thread Ruebenacker, Oliver A
Hello, Thanks for the response and great to hear it is possible. But how do I connect to Spark without using the submit script? I know how to start up a master and some workers and then connect to the master by packaging the app that contains the SparkContext and then submitting the

Re: NullWritable not serializable

2014-09-16 Thread Du Li
Hi, The test case is separated out as follows. The call to rdd2.first() breaks when spark version is changed to 1.1.0, reporting exception NullWritable not serializable. However, the same test passed with spark 1.0.2. The pom.xml file is attached. The test data README.md was copied from spark.

Re: Spark as a Library

2014-09-16 Thread Daniel Siegmann
You can create a new SparkContext inside your container pointed to your master. However, for your script to run you must call addJars to put the code on your workers' classpaths (except when running locally). Hopefully your webapp has some lib folder which you can point to as a source for the jars
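A minimal sketch of that setup, with a hypothetical master URL and jar path inside the webapp:

    import org.apache.spark.{SparkConf, SparkContext}

    // Created inside the container (e.g. on webapp startup) instead of via spark-submit.
    val conf = new SparkConf()
      .setAppName("embedded-in-container")
      .setMaster("spark://master-host:7077") // hypothetical cluster master
      .setJars(Seq("/opt/tomcat/webapps/myapp/WEB-INF/lib/my-spark-logic.jar")) // code the workers need

    val sc = new SparkContext(conf)
    val doubledSum = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)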

R: Spark as a Library

2014-09-16 Thread Paolo Platter
Hi, the Spark Job Server by Ooyala is the right tool for the job. It exposes a REST API, so calling it from a web app is suitable. It is open source; you can find it on GitHub. Best Paolo Platter From: Ruebenacker, Oliver A Sent:

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
Hi Sean, Great catch! Yes I was including Spark as a dependency and it was making its way into my uber jar. Following the advice I just found at Stackoverflow[1], I marked Spark as a provided dependency and that appeared to fix my Hadoop client issue. Thanks for your help!!! Perhaps they mainta
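For reference, marking Spark as a provided dependency in sbt looks like the one-liner below (a sketch; the Maven equivalent is <scope>provided</scope> on the spark-core dependency):

    // build.sbt — keep Spark out of the assembled uber jar so the cluster's own
    // Spark and Hadoop classes are picked up at runtime.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"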

Spark processing small files.

2014-09-16 Thread cem
Hi all, Spark is taking too much time to start the first stage with many small files in HDFS. I am reading a folder that contains RC files: sc.hadoopFile("hdfs://hostname :8020/test_data2gb/", classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]], classOf[LongWritable], classOf[BytesRe

Indexed RDD

2014-09-16 Thread Akshat Aranya
Hi, I'm trying to implement a custom RDD that essentially works as a distributed hash table, i.e. the key space is split up into partitions and within a partition, an element can be looked up efficiently by the key. However, the RDD lookup() function (in PairRDDFunctions) is implemented in a way i
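One way to sketch the kind of targeted lookup being described, assuming the RDD is partitioned by key so the relevant partition can be computed on the driver (the helper name and layout are made up):

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Run a job on only the one partition that can contain `key`, instead of scanning all of them.
    def lookupInPartition[K, V: ClassTag](sc: SparkContext, rdd: RDD[(K, V)], key: K): Seq[V] = {
      val partitioner = rdd.partitioner.getOrElse(
        sys.error("RDD must be partitioned by key for a targeted lookup"))
      val target = partitioner.getPartition(key)
      val perPartition = sc.runJob(
        rdd,
        (iter: Iterator[(K, V)]) => iter.filter(_._1 == key).map(_._2).toArray,
        Seq(target),
        allowLocal = false)
      perPartition.flatten.toSeq
    }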

Re: Categorical Features for K-Means Clustering

2014-09-16 Thread st553
Does MLlib provide utility functions to do this kind of encoding?

Re: Categorical Features for K-Means Clustering

2014-09-16 Thread Sean Owen
I think it's on the table but not yet merged? https://issues.apache.org/jira/browse/SPARK-1216 On Tue, Sep 16, 2014 at 10:04 PM, st553 wrote: > Does MLlib provide utility functions to do this kind of encoding? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.

Memory under-utilization

2014-09-16 Thread francisco
Hi, I'm a Spark newbie. We had installed spark-1.0.2-bin-cdh4 on a 'super machine' with 256gb memory and 48 cores. Tried to allocate a task with 64gb memory but for whatever reason Spark is only using around 9gb max. Submitted spark job with the following command: " /bin/spark-submit -class Sim

Re: Memory under-utilization

2014-09-16 Thread Boromir Widas
Perhaps your job does not use more than 9g. Even though the dashboard shows 64g, the process only uses what's needed and grows to 64g max. On Tue, Sep 16, 2014 at 5:40 PM, francisco wrote: > Hi, I'm a Spark newbie. > > We had installed spark-1.0.2-bin-cdh4 on a 'super machine' with 256gb > memory

Questions about Spark speculation

2014-09-16 Thread Nicolas Mai
Hi, guys My current project is using Spark 0.9.1, and after increasing the level of parallelism and partitions in our RDDs, stages and tasks seem to complete much faster. However it also seems that our cluster becomes more "unstable" after some time: - stalled stages still showing under "active st

Re: Memory under-utilization

2014-09-16 Thread francisco
Thanks for the reply. I doubt that's the case though ... the executor kept having to do a file dump because memory is full. ... 14/09/16 15:00:18 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67 MB to disk (668 times so far) 14/09/16 15:00:21 WARN ExternalAppendOnlyMap: Spilling in-memor

How do I manipulate values outside of a GraphX loop?

2014-09-16 Thread crockpotveggies
Brand new to Apache Spark and I'm a little confused how to make updates to a value that sits outside of a .mapTriplets iteration in GraphX. I'm aware mapTriplets is really only for modifying values inside the graph. What about using it in conjunction with other computations? See below: def mapTrip

Re: Categorical Features for K-Means Clustering

2014-09-16 Thread Aris
Yeah - another vote here to do what's called One-Hot encoding, just convert the single categorical feature into N columns, where N is the number of distinct values of that feature, with a single one and all the other features/columns set to zero. On Tue, Sep 16, 2014 at 2:16 PM, Sean Owen wrote:
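A hand-rolled sketch of that encoding for a single categorical column (made-up data, and assuming the number of distinct values is small enough to collect to the driver):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // Hypothetical categorical column.
    val colors: RDD[String] = sc.parallelize(Seq("red", "green", "blue", "green"))

    // Category -> column-index mapping, built once on the driver.
    val index: Map[String, Int] = colors.distinct().collect().zipWithIndex.toMap
    val numCategories = index.size

    // One sparse vector per row: a single 1.0 in the category's column, zeros elsewhere.
    val encoded: RDD[Vector] = colors.map { c =>
      Vectors.sparse(numCategories, Array(index(c)), Array(1.0))
    }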

Re: Memory under-utilization

2014-09-16 Thread Boromir Widas
I see, what does http://localhost:4040/executors/ show for memory usage? I personally find it easier to work with a standalone cluster with a single worker by using the sbin/start-master.sh and then connecting to the master. On Tue, Sep 16, 2014 at 6:04 PM, francisco wrote: > Thanks for the rep

MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-16 Thread Aris
Hello Spark Community - I am using the support vector machine / SVM implementation in MLlib with the standard linear kernel; however, I noticed the Spark documentation for StandardScaler *specifically* mentions that SVMs which use the RBF kernel work really well when you have standardized da
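As a side note on the StandardScaler mention: standardizing features before MLlib 1.1's (linear-kernel-only) SVMWithSGD looks roughly like the sketch below, with a hypothetical input path; withMean is left off so sparse vectors stay sparse:

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils

    // Hypothetical LIBSVM-format training data.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/training_libsvm.txt")

    // Fit the scaler on the features only, then rescale each LabeledPoint.
    val scaler = new StandardScaler(withMean = false, withStd = true).fit(data.map(_.features))
    val scaled = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))

    // MLlib 1.1 ships only a linear kernel; an RBF kernel would need another library
    // or an explicit feature mapping.
    val model = SVMWithSGD.train(scaled, 100)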

Re: org.apache.spark.SparkException: java.io.FileNotFoundException: does not exist)

2014-09-16 Thread Aris
This should be a really simple problem, but you haven't shared enough code to determine what's going on here. On Tue, Sep 16, 2014 at 8:08 AM, Hui Li wrote: > Hi, > > I am new to SPARK. I just set up a small cluster and wanted to run some > simple MLLIB examples. By following the instructions of

Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Dimension Data, LLC.
Hello friends: Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN distribution. Everything went fine, and everything seems to work, but for the following. Following are two invocations of the 'pyspark' script, one with enclosing quotes around the options passed to '--driver-java-op

partitioned groupBy

2014-09-16 Thread Akshat Aranya
I have a use case where my RDD is set up like this: Partition 0: K1 -> [V1, V2] K2 -> [V2] Partition 1: K3 -> [V1] K4 -> [V3] I want to invert this RDD, but only within a partition, so that the operation does not require a shuffle. It doesn't matter if the partitions of the inverted RDD have non uni

Re: partitioned groupBy

2014-09-16 Thread Patrick Wendell
If each partition can fit in memory, you can do this using mapPartitions and then building an inverse mapping within each partition. You'd need to construct a hash map within each partition yourself. On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya wrote: > I have a use case where my RDD is set up
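A sketch of that idea, with string keys and values assumed for illustration; as noted, each partition's inverted map has to fit in memory:

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    // Invert key -> values into value -> keys within each partition only; no shuffle occurs.
    def invertWithinPartitions(rdd: RDD[(String, Seq[String])]): RDD[(String, Seq[String])] =
      rdd.mapPartitions { iter =>
        val inverted = mutable.Map.empty[String, List[String]]
        for ((key, values) <- iter; value <- values) {
          inverted(value) = key :: inverted.getOrElse(value, Nil)
        }
        inverted.iterator.map { case (value, keys) => (value, keys: Seq[String]) }
      }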

Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Sandy Ryza
Hi team didata, This doesn't directly answer your question, but with Spark 1.1, instead of using the driver options, it's better to pass your Spark properties using the "conf" option. E.g. pyspark --master yarn-client --conf spark.shuffle.spill=true --conf spark.yarn.executor.memoryOverhead=512M

Re: Memory under-utilization

2014-09-16 Thread francisco
Thanks for the tip. http://localhost:4040/executors/ is showing Executors(1) Memory: 0.0 B used (294.9 MB Total) Disk: 0.0 B Used However, running as standalone cluster does resolve the problem. I can see a worker process running w/ the allocated memory. My conclusion (I may be wrong) is for 'l

how to report documentation bug?

2014-09-16 Thread Andy Davidson
http://spark.apache.org/docs/latest/quick-start.html#standalone-applications Click on the Java tab. There is a bug in the Maven section: 1.1.0-SNAPSHOT should be 1.1.0. Hope this helps, Andy

RE: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-16 Thread Cheng, Hao
Thank you, Yin Huai. This is probably true. I saw in the hive-site.xml that Liu has changed the entry hive.support.concurrency (Enable Hive's Table Lock Manager Service) to true; its default should be false. Someone is working on upgrading Hive to 0.13 for SparkSQL (https://gi

Re: Unable to ship external Python libraries in PYSPARK

2014-09-16 Thread daijia
Is there some way to ship a text file just like shipping Python libraries? Thanks in advance, Daijia

Re: how to report documentation bug?

2014-09-16 Thread Nicholas Chammas
You can send an email like you just did or open an issue in the Spark issue tracker. This looks like a problem with how the version is generated in this file. On Tue, Sep 16, 2014 at 8:55 PM, And

Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Dimension Data, LLC.
Hi Sandy: Thank you. I have not tried that mechanism (I wasn't aware of it). I will try that instead. Is it possible to also represent '--driver-memory' and '--executor-memory' (and basically all properties) using the '--conf' directive? The Reason: I actually discovered the below issue while

CPU RAM

2014-09-16 Thread VJ Shalish
Hi, I need to get the CPU utilisation, RAM usage, network IO and other metrics using a Java program. Can anyone help me with this? Thanks, Shalish.

Re: Unable to ship external Python libraries in PYSPARK

2014-09-16 Thread Davies Liu
Yes, sc.addFile() is what you want: | addFile(self, path) | Add a file to be downloaded with this Spark job on every node. | The C{path} passed can be either a local file, a file in HDFS | (or other Hadoop-supported filesystems), or an HTTP, HTTPS or | FTP URI. | |

Re: CPU RAM

2014-09-16 Thread Amit
Not particularly related to Spark, but you can check out the SIGAR API. It lets you get CPU, memory, network, filesystem and process-based metrics. Amit On Sep 16, 2014, at 20:14, VJ Shalish wrote: > Hi > > I need to get the CPU utilisation, RAM usage, Network IO and other metrics > using Java

Re: CPU RAM

2014-09-16 Thread VJ Shalish
Thank you for the response, Amit. So is it that we cannot measure the CPU consumption and RAM usage of a Spark job through a Java program? On Tue, Sep 16, 2014 at 11:23 PM, Amit wrote: > Not particularly related to Spark, but you can check out SIGAR API. It > lets you get CPU, Memory, Network, Files

Re: CPU RAM

2014-09-16 Thread VJ Shalish
Sorry for the confusion, team. My requirement is to measure the CPU utilisation, RAM usage, network IO and other metrics of a Spark job using a Java program. Please help with the same. On Tue, Sep 16, 2014 at 11:23 PM, Amit wrote: > Not particularly related to Spark, but you can check out SIGAR API.

YARN mode not available error

2014-09-16 Thread Barrington
Hi, I am running Spark in cluster mode with Hadoop YARN as the underlying cluster manager. I get this error when trying to initialize the SparkContext. Exception in thread "main" org.apache.spark.SparkException: YARN mode not available ? at org.apache.spark.SparkContext$.org$apache$spar

The difference between pyspark.rdd.PipelinedRDD and pyspark.rdd.RDD

2014-09-16 Thread edmond_huo
Hi, I am new to Spark. I tried to run a job like the wordcount example in Python. But when I tried to get the top 10 popular words in the file, I got the message: AttributeError: 'PipelinedRDD' object has no attribute 'sortByKey'. So my question is: what is the difference between PipelinedRDD

permission denied on local dir

2014-09-16 Thread style95
I am running Spark on a shared YARN cluster. My user ID is "online", but I found that when I run my Spark application, local directories are created under the "yarn" user ID. So I am unable to delete the local directories, and finally the application failed. Please refer to my log below: 14/09/16 21:59:02 ERROR Di

Re: CPU RAM

2014-09-16 Thread Akhil Das
Ganglia does give you cluster-wide and per-machine utilization of resources, but I don't think it gives you per-Spark-job figures. If you want to build something from scratch then you can do something like: 1. Log in to the machine. 2. Get the PIDs. 3. For network IO per process, you can have a look at http
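If only the JVM's own view from inside a Java/Scala program is needed (per process, not per job or per cluster the way Ganglia reports), the standard management beans are a starting point; network IO would still need /proc or something like SIGAR. A sketch:

    import java.lang.management.ManagementFactory
    import com.sun.management.OperatingSystemMXBean // HotSpot-specific bean with CPU counters

    // CPU and heap usage of the current JVM (e.g. the driver), not of the whole Spark job.
    val os = ManagementFactory.getOperatingSystemMXBean.asInstanceOf[OperatingSystemMXBean]
    val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage

    println(f"process CPU load: ${os.getProcessCpuLoad * 100}%.1f%%")
    println(s"heap used: ${heap.getUsed / (1024 * 1024)} MB of ${heap.getMax / (1024 * 1024)} MB")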

Re: The difference between pyspark.rdd.PipelinedRDD and pyspark.rdd.RDD

2014-09-16 Thread Davies Liu
PipelinedRDD is an RDD generated by a Python mapper/reducer; for example, rdd.map(func) will be a PipelinedRDD. PipelinedRDD is a subclass of RDD, so it should have all the APIs which RDD has. >>> sc.parallelize(range(10)).map(lambda x: (x, str(x))).sortByKey().count() 10 I'm wondering how can you

Re: collect on hadoopFile RDD returns wrong results

2014-09-16 Thread vasiliy
it also appears in streaming hdfs fileStream