Re: CheckpointRDD has different number of partitions than original RDD

2014-04-08 Thread Tathagata Das
Yes, that is correct. If you are executing a Spark program across multiple machines, then you need to use a distributed file system (HDFS API compatible) for reading and writing data. In your case, your setup is across multiple machines. So what is probably happening is that the RDD data is bei
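
A minimal Scala sketch of the setup Tathagata describes: on a multi-machine cluster the checkpoint directory must live on a shared, HDFS-compatible filesystem rather than a node-local path (the address and path below are hypothetical).

    // checkpoints must be written somewhere every node can read them back
    sc.setCheckpointDir("hdfs://namenode:9000/spark/checkpoints")
    val rdd = sc.parallelize(1 to 100).map(_ * 2)
    rdd.checkpoint()
    rdd.count()   // an action materializes the RDD and writes the checkpoint files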

How to execute a function from class in distributed jar on each worker node?

2014-04-08 Thread Adnan
Hello, I am running a Cloudera 4-node cluster with 1 Master and 3 Slaves. I am connecting to the Spark Master from Scala using SparkContext. I am trying to execute a simple Java function from the distributed jar on every Spark Worker but haven't found a way to communicate with each worker or a Spark A

Only TraversableOnce?

2014-04-08 Thread wxhsdp
In my application, data parts inside an RDD partition have relations, so I need to do some operations between them. For example, RDD T1 has several partitions, and each partition has three parts A, B and C. Then I transform T1 to T2. After the transform, T2 also has three parts D, E and F, D = A+B, E = A+C

Re: Only TraversableOnce?

2014-04-08 Thread Nan Zhu
So, the data structure looks like: D consists of D1, D2, D3 (DX is a partition) and DX consists of d1, d2, d3 (dx is the part in your context)? What you want to do is to transform DX to (d1 + d2, d1 + d3, d2 + d3)? Best, -- Nan Zhu On Tuesday, April 8, 2014 at 8:09 AM, wxhsdp wrote:

Re: Only TraversableOnce?

2014-04-08 Thread wxhsdp
Yes, how can I do this conveniently? I can use filter, but there will be so many RDDs and it's not concise -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Only-TraversableOnce-tp3873p3875.html Sent from the Apache Spark User List mailing list archive at Nabb

Re: Only TraversableOnce?

2014-04-08 Thread Nan Zhu
If that’s the case, I think mapPartition is what you need, but it seems that you have to load the partition into memory as a whole via toArray: rdd.mapPartition{D => {val p = D.toArray; ...}} -- Nan Zhu On Tuesday, April 8, 2014 at 8:40 AM, wxhsdp wrote: > yes, how can i do this convenie
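
A small sketch of the approach Nan suggests (the method in the current API is mapPartitions). It assumes t1 is the RDD from the original question, with exactly three numeric parts per partition; the names are illustrative.

    val t2 = t1.mapPartitions { iter =>
      val Array(a, b, c) = iter.toArray   // load the whole partition into memory
      Iterator(a + b, a + c, b + c)       // produce the combined parts d, e, f
    }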

Re: spark-shell on standalone cluster gives error " no mesos in java.library.path"

2014-04-08 Thread Christoph Böhm
Forgot to post the solution. I messed up the master URL. In particular, I gave the host (master), not a URL. My bad. The error message is weird, though. Seems like the URL regex matches master for mesos://... No idea about the Java Runtime Environment Error. On Mar 26, 2014, at 3:52 PM, Chris

Re: Only TraversableOnce?

2014-04-08 Thread wxhsdp
Thank you for your help! Let me have a try. Nan Zhu wrote > If that’s the case, I think mapPartition is what you need, but it seems > that you have to load the partition into the memory as whole by toArray > > rdd.mapPartition{D => {val p = D.toArray; ...}} > > -- > Nan Zhu > > > > On Tue

Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
As requested, here is the script I am running. It is a simple shell script which calls the spark-ec2 wrapper script. I execute it from the 'ec2' directory of Spark, as usual. The AMI used is the raw one from the AWS Quick Start section. It is the first option (an Amazon Linux paravirtual image). Any id

Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
Another thing I didn't mention. The AMI and user used: naturally I've created several of my own AMIs with the following characteristics, none of which worked. 1) Enabling ssh as root as per this guide ( http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/). When doing this, I

NPE using saveAsTextFile

2014-04-08 Thread Nick Pentreath
Hi, I'm using Spark 0.9.0. When calling saveAsTextFile on a custom Hadoop InputFormat (loaded with newAPIHadoopRDD), I get the error below. If I call count, I get the correct number of records, so the InputFormat is being read correctly... the issue only appears when trying to
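
One common pitfall with Hadoop-format RDDs (not necessarily the cause of this NPE) is that the record reader reuses its Writable objects, so copying values into plain Scala types before saving is a cheap thing to rule out. A sketch using the stock new-API TextInputFormat as a stand-in for the custom format, with hypothetical paths:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val raw = sc.newAPIHadoopFile("hdfs:///in/records",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    raw.map { case (_, v) => v.toString }     // copy out of the reused Writable
       .saveAsTextFile("hdfs:///out/records")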

Urgently need help interpreting duration

2014-04-08 Thread Yana Kadiyska
Hi Spark users, I'm very much hoping someone can help me out. I have a strict performance requirement on a particular query. One of the stages shows great variance in duration -- from 300ms to 10sec. The stage is mapPartitionsWithIndex at Operator.scala:210 (running Spark 0.8) I have run the job

Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
I was able to keep the "workaround" ...around... by overwriting the generated '/root/.ssh/authorized_keys' file with a known good one, in the '/etc/rc.local' file On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini < silvio.costant...@granatads.com> wrote: > Another thing I didn't mention. The AMI

Spark and HBase

2014-04-08 Thread Flavio Pompermaier
Hi everybody, in recent days I have looked a bit at the evolution of the big data stacks and it seems that HBase is somehow fading away in favour of Spark+HDFS. Am I correct? Do you think that Spark and HBase should work together or not? Best regards, Flavio

Re: Spark and HBase

2014-04-08 Thread Bin Wang
Hi Flavio, I happened to attend (actually, am attending) the 2014 Apache Conf, where I heard about a project called "Apache Phoenix", which fully leverages HBase and is supposed to be 1000x faster than Hive. And it is not memory bounded, whereas memory sets up a limit for Spark. It is still in the incubating group and t

Re: Spark and HBase

2014-04-08 Thread Christopher Nguyen
Flavio, the two are best at two orthogonal use cases: HBase on the transactional side, and Spark on the analytic side. Spark is not intended for row-based random-access updates, while being far more flexible and efficient for dataset-scale aggregations and general computations. So yes, you can easily see

Re: Spark and HBase

2014-04-08 Thread Flavio Pompermaier
Thanks for the quick reply Bin. Phoenix is something I'm going to try for sure, but it seems somehow useless if I can use Spark. Probably, as you said, since Phoenix uses a dedicated data structure within each HBase table it has more effective memory usage, but if I need to deserialize data stored in a

Re: Urgently need help interpreting duration

2014-04-08 Thread Aaron Davidson
Off the top of my head, the most likely cause would be driver GC issues. You can diagnose this by enabling GC printing at the driver and you can fix this by increasing the amount of memory your driver program has (see http://spark.apache.org/docs/0.9.0/tuning.html#garbage-collection-tuning). The "

Re: Urgently need help interpreting duration

2014-04-08 Thread Aaron Davidson
Also, take a look at the driver logs -- if there is overhead before the first task is launched, the driver logs would likely reveal this. On Tue, Apr 8, 2014 at 9:21 AM, Aaron Davidson wrote: > Off the top of my head, the most likely cause would be driver GC issues. > You can diagnose this by e

Re: Driver Out of Memory

2014-04-08 Thread Aaron Davidson
The driver does not do processing (for the most part), but it does do scheduling and block management, so it can keep around a significant amount of metadata about all the stored RDD and broadcast blocks, as well as statistics about prior executions. Much of this is bounded in some way, though, so

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
I tried again with the latest master, which includes the commit below, but the UI page still shows nothing on the storage tab. koert commit ada310a9d3d5419e101b24d9b41398f609da1ad3 Author: Andrew Or Date: Mon Mar 31 23:01:14 2014 -0700 [Hot Fix #42] Persisted RDD disappears on storage page if re-used

Re: Spark and HBase

2014-04-08 Thread Bin Wang
First, I have not tried it myself. However, from what I have heard it has some basic SQL features, so you can query your HBase table much like you query content on HDFS using Hive. So it is not just "query a simple column"; I believe you can do joins and other SQL queries. Maybe you can wrap up an EMR cluster with Hbas

Re: ui broken in latest 1.0.0

2014-04-08 Thread Xiangrui Meng
That commit did work for me. Could you confirm the following: 1) After you called cache(), did you make any actions like count() or reduce()? If you don't materialize the RDD, it won't show up in the storage tab. 2) Did you run ./make-distribution.sh after you switched to the current master? Xia
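
A minimal sketch of the check Xiangrui is asking about, with a hypothetical path: an RDD appears on the storage tab only after it is both marked cached and materialized by an action.

    val rdd = sc.textFile("hdfs:///some/path").cache()
    rdd.count()   // forces evaluation; only now do the cached blocks show up in the UI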

RDD creation on HDFS

2014-04-08 Thread gtanguy
I read on the RDD paper (http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf) : "For example, an RDD representing an HDFS file has a partition for each block of the file and knows which machines each block is on" And that on http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html "To minimi
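
A small sketch for inspecting this from the shell, assuming a hypothetical HDFS path: each HDFS block becomes one partition, and the RDD reports which hosts hold each block.

    val rdd = sc.textFile("hdfs://namenode:9000/big/file")
    rdd.partitions.take(3).foreach { p =>
      println(s"partition ${p.index} prefers ${rdd.preferredLocations(p).mkString(", ")}")
    }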

assumption that lib_managed is present

2014-04-08 Thread Koert Kuipers
When I start spark-shell I now see: ls: cannot access /usr/local/lib/spark/lib_managed/jars/: No such file or directory. We do not package a lib_managed with our Spark build (never did). Maybe the logic in compute-classpath.sh that searches for datanucleus should check for the existence of lib_mana

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
Note that for a cached RDD in the spark shell it all works fine, but something is going wrong with the spark-shell in our applications that extensively cache and re-use RDDs. On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers wrote: > i tried again with latest master, which includes commit below, bu

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
Sorry, I meant to say: note that for a cached RDD in the spark shell it all works fine, but something is going wrong with the SPARK-APPLICATION-UI in our applications that extensively cache and re-use RDDs. On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers wrote: > note that for a cached rdd in the

Re: ui broken in latest 1.0.0

2014-04-08 Thread Xiangrui Meng
That commit fixed the exact problem you described. That is why I want to confirm that you switched to the master branch. bin/spark-shell doesn't detect code changes, so you need to run ./make-distribution.sh to re-compile Spark first. -Xiangrui On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers wrote

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
Yes, I call an action after cache, and I can see that the RDDs are fully cached using context.getRDDStorageInfo, which we expose via our own API. I did not run make-distribution.sh; we have our own scripts to build a distribution. However, if your question is whether I correctly deployed the latest build,
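
For reference, a sketch of the kind of check Koert mentions, calling SparkContext.getRDDStorageInfo directly rather than relying on the web UI:

    sc.getRDDStorageInfo
      .filter(_.numCachedPartitions > 0)
      .foreach { info =>
        println(s"RDD ${info.id} '${info.name}': " +
          s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, ${info.memSize} bytes in memory")
      }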

Re: Spark and HBase

2014-04-08 Thread Nicholas Chammas
Just took a quick look at the overview here and the quick start guide here. It looks like Apache Phoenix aims to provide flexible SQL access to data, both for transactional and analytic p

Re: Pig on Spark

2014-04-08 Thread Mayur Rustagi
Hi Ankit, Thanks for all the work on Pig. Finally got it working. A couple of high-level bugs right now: - Getting it working on Spark 0.9.0 - Getting UDFs working - Getting generate functionality working - Exhaustive test suite for Pig on Spark. Are you maintaining a Jira somewhere? I am

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
Yes, I am definitely using the latest. On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng wrote: > That commit fixed the exact problem you described. That is why I want to > confirm that you switched to the master branch. bin/spark-shell doesn't > detect code changes, so you need to run ./make-distributio

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
I put some println statements in BlockManagerUI. I have RDDs that are cached in memory. I see this: *** onStageSubmitted ** rddInfo: RDD "2" (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;Tac

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
Yet at the same time I can see via our own API: "storageInfo": { "diskSize": 0, "memSize": 19944, "numCachedPartitions": 1, "numPartitions": 1 } On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers wrote: > i put some println statements in BlockManagerUI > > i hav

Re: assumption that lib_managed is present

2014-04-08 Thread Aaron Davidson
Yup, sorry about that. This error message should not produce incorrect behavior, but it is annoying. Posted a patch to fix it: https://github.com/apache/spark/pull/361 Thanks for reporting it! On Tue, Apr 8, 2014 at 9:54 AM, Koert Kuipers wrote: > when i start spark-shell i now see > > ls: can

java.io.IOException: Call to dev/17.29.25.4:50070 failed on local exception: java.io.EOFException

2014-04-08 Thread reegs
I am trying to read a file from HDFS in the Spark shell and getting the error below. When I create the first RDD it works fine, but when I try to do a count on that RDD, it throws a connection error. I have a single-node HDFS setup and, on the same machine, Spark running. Please help. When I run "jps" c

Re: Urgently need help interpreting duration

2014-04-08 Thread Yana Kadiyska
Thank you -- this actually helped a lot. Strangely it appears that the task detail view is not accurate in 0.8 -- that view shows 425ms duration for one of the tasks, but in the driver log I do indeed see Finished TID 125 in 10940ms. On that "slow" worker I see the following: 14/04/08 18:06:24 IN

Why doesn't the driver node do any work?

2014-04-08 Thread Nicholas Chammas
So I have a cluster in EC2 doing some work, and when I take a look here http://driver-node:4040/executors/ I see that my driver node is snoozing on the job: No tasks, no memory used, and no RDD blocks cached. I'm assuming that it was a conscious design choice not to have the driver node partake

Re: java.io.IOException: Call to dev/17.29.25.4:50070 failed on local exception: java.io.EOFException

2014-04-08 Thread reegs
There are a couple of issues here which I was able to figure out. 1: We should not use the web port which we use to access the web UI; I was using that initially, so it was not working. 2: All requests should go to the NameNode and not anything else. 3: By replacing localhost:9000 in the above request, it
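
A minimal sketch of the working setup described above: point sc.textFile at the NameNode's RPC port (e.g. 9000) rather than the 50070 web UI port; the host name and path are hypothetical.

    val lines = sc.textFile("hdfs://namenode-host:9000/user/data/input.txt")
    lines.count()   // the action is where a wrong address usually surfaces as an IOException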

Measuring Network Traffic for Spark Job

2014-04-08 Thread yxzhao
Hi All, I want to measure the total network traffic for a Spark Job. But I did not see related information from the log. Does anybody know how to measure it? Thanks very much in advance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Measuring-

ETL for postgres to hadoop

2014-04-08 Thread Manas Kar
Hi All, I have some spatial data on a Postgres machine. I want to be able to move that data to Hadoop and do some geo-processing. I tried using Sqoop to move the data to Hadoop but it complained about the position data (which it says it can't recognize). Does anyone have any idea as to h

Re: ui broken in latest 1.0.0

2014-04-08 Thread Andrew Or
Hi Koert, Thanks for pointing this out. However, I am unable to reproduce this locally. It seems that there is a discrepancy between what the BlockManagerUI and the SparkContext think is persisted. This is strange because both sources ultimately derive this information from the same place - by doi

Re: ETL for postgres to hadoop

2014-04-08 Thread andy petrella
Hello Manas, I don't know Sqoop that much, but my best guess is that you're probably using PostGIS, which has specific structures for Geometry and so on. And if you need some spatial operators my gut feeling is that things will be harder ^^ (but a raw import won't need that...). So I did a quick ch
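
One hedged alternative to Sqoop is to read the table over JDBC directly into an RDD, exporting PostGIS geometries as WKT with ST_AsText so downstream jobs only see text. This assumes a Spark version that ships org.apache.spark.rdd.JdbcRDD and that the PostgreSQL JDBC driver is on the classpath; the connection string, table and columns are hypothetical.

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val url = "jdbc:postgresql://db-host:5432/gisdb?user=etl&password=secret"
    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(url),
      "SELECT id, ST_AsText(geom) FROM places WHERE id >= ? AND id <= ?",
      lowerBound = 1, upperBound = 1000000, numPartitions = 10,
      mapRow = rs => (rs.getLong(1), rs.getString(2)))
    rows.map { case (id, wkt) => s"$id\t$wkt" }.saveAsTextFile("hdfs:///staging/places")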

Spark with SSL?

2014-04-08 Thread kamatsuoka
Can Spark be configured to use SSL for all its network communication? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-SSL-tp3916.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
1) At the end of the callback. 2) Yes, we simply expose sc.getRDDStorageInfo to the user via REST. 3) Yes, exactly. We define the RDDs at startup, and all of them are cached; from that point on we only do calculations on these cached RDDs. I will add some more println statements for storageStatusList

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
Our one cached RDD in this run has id 3. *** onStageSubmitted ** rddInfo: RDD "2" (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0 B; DiskSize: 0.0 B _rddInfoMap: Map(2 -> RDD

Re: Spark with SSL?

2014-04-08 Thread Andrew Ash
Not that I know of, but it would be great if that was supported. The way I typically handle security now is to put the Spark servers in their own subnet with strict inbound/outbound firewalls. On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka wrote: > Can Spark be configured to use SSL for all its ne

Re: Measuring Network Traffic for Spark Job

2014-04-08 Thread Andrew Ash
If you set up Spark's metrics reporting to write to the Ganglia backend that will give you a good idea of how much network/disk/CPU is being used and on what machines. https://spark.apache.org/docs/0.9.0/monitoring.html On Tue, Apr 8, 2014 at 12:57 PM, yxzhao wrote: > Hi All, > > I want

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Andrew Ash
One downside I can think of to having the driver node act as a temporary member of the cluster would be that you may have firewalls between the workers and the driver machine that would prevent shuffles from working properly. Now you'd need to poke holes in the firewalls to get the cluster to prop

Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Shivaram Venkataraman
Is there any reason why you want to start with a vanilla Amazon AMI rather than the ones we build and provide as a part of the Spark EC2 scripts? The AMIs we provide are close to the vanilla AMI but have the root account set up properly and install packages like Java that are used by Spark. If you wis

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Sean Owen
If you want the machine that hosts the driver to also do work, you can designate it as a worker too, if I'm not mistaken. I don't think the driver should do work, logically, but, that's not to say that the machine it's on shouldn't do work. -- Sean Owen | Director, Data Science | London On Tue, A

Re: How to execute a function from class in distributed jar on each worker node?

2014-04-08 Thread Andrew Ash
One thing you could do is create an RDD of [1,2,3] and set a partitioner that puts all three values on their own nodes. Then .foreach() over the RDD and call your function that will run on each node. Why do you need to run the function on every node? Is it some sort of setup code that needs to b
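
A rough sketch of Andrew's suggestion, assuming three workers and a hypothetical setup function shipped in the application jar; note that without a custom partitioner Spark does not strictly guarantee that one task lands on each distinct node.

    val numWorkers = 3
    sc.parallelize(1 to numWorkers, numWorkers).foreachPartition { _ =>
      MySetup.init()   // hypothetical per-node initialization from the distributed jar
    }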

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Nicholas Chammas
Alright, so I guess I understand now why spark-ec2 allows you to select different instance types for the driver node and worker nodes. If the driver node is just driving and not doing any large collect()s or heavy processing, it can be much smaller than the worker nodes. With regards to data local

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Nan Zhu
Maybe unrelated to the question itself, but just FYI: you can run your driver program on a worker node with Spark 0.9. http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster Best, -- Nan Zhu On Tuesday, April 8, 2014 at 5:11 PM, Nicholas Chammas wrote

Preconditions on RDDs for creating a Graph?

2014-04-08 Thread Adam Novak
Hello, I'm trying to create a GraphX Graph by calling Graph(vertices: RDD[(VertexId, VD)], edges: RDD[Edge[ED]]): Graph[VD, ED]. I'm passing in two RDDs: one with vertices keyed by ID, and one with edges. I make sure to coalesce both these RDDs down to the same number of partitions beforehand; it

Re: Spark with SSL?

2014-04-08 Thread Ognen Duzlevski
Ideally, you just run it in Amazon's VPC or whatever other providers' equivalent is. In this case running things over SSL would be an overkill. On 4/8/14, 3:31 PM, Andrew Ash wrote: Not that I know of, but it would be great if that was supported. The way I typically handle security now is to p

Re: Spark with SSL?

2014-04-08 Thread Benjamin Black
Only if you trust the provider networks and everyone who might have access to them. I don't. On Tuesday, April 8, 2014, Ognen Duzlevski wrote: > Ideally, you just run it in Amazon's VPC or whatever other providers' > equivalent is. In this case running things over SSL would be an overkill. > >

A series of meetups about machine learning with Spark in San Francisco

2014-04-08 Thread DB Tsai
Hi guys, We're going to hold a series of meetups about machine learning with Spark in San Francisco. The first one will be on April 24. Xiangrui Meng from Databricks will talk about Spark, Spark/Python, features engineering, and MLlib. See http://www.meetup.com/sfmachinelearning/events/174560212

Re: [BLOG] For Beginners

2014-04-08 Thread weida xu
Dears, I'm very interested in this. However, the links mentioned above are not accessible from China. Is there any other way to read the two blog pages? Thanks a lot. 2014-04-08 12:54 GMT+08:00 prabeesh k : > Hi all, > > Here I am sharing a blog for beginners, about creating spark streaming >

Re: Measuring Network Traffic for Spark Job

2014-04-08 Thread yxzhao
Thanks Andrew, I will take a look at it. On Tue, Apr 8, 2014 at 3:35 PM, Andrew Ash [via Apache Spark User List] < ml-node+s1001560n3920...@n3.nabble.com> wrote: > If you set up Spark's metrics reporting to write to the Ganglia backend > that will give you a good idea of how much network/disk/CPU

Re: Spark with SSL?

2014-04-08 Thread Evan R. Sparks
A bandaid might be to set up ssh tunneling between slaves and master - has anyone tried deploying this way? I would expect it to pretty negatively impact performance on communication-heavy jobs. On Tue, Apr 8, 2014 at 3:23 PM, Benjamin Black wrote: > Only if you trust the provider networks and

Re: Spark with SSL?

2014-04-08 Thread Benjamin Black
Although connection setup is expensive, the overhead of AES on any recent Intel processor is almost zero. AES-NI is good stuff. On Tuesday, April 8, 2014, Evan R. Sparks wrote: > A bandaid might be to set up ssh tunneling between slaves and master - has > anyone tried deploying this way? I would

Re: Spark RDD to Shark table IN MEMORY conversion

2014-04-08 Thread abhietc31
Anybody, please help with the above query. It's challenging but will open a new horizon for in-memory analysis. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-RDD-to-Shark-table-IN-MEMORY-conversion-tp3682p3968.html Sent from the Apache Spark User List mail

issue of driver's HA

2014-04-08 Thread 林武康
Hi all, We have some trouble with the issue of the driver's HA. We run a long-lived driver in Spark standalone mode which serves as a server that submits jobs as requests arrive. Therefore we have come across the driver process's HA problem, i.e. how to resume jobs after the driver process fails.

java.io.NotSerializableException exception - custom Accumulator

2014-04-08 Thread Dhimant Jayswal
Hi, I am getting a java.io.NotSerializableException while executing the following program. import org.apache.spark.SparkContext._ import org.apache.spark.SparkContext import org.apache.spark.AccumulatorParam object App { class Vector (val data: Array[Double]) {} implicit object VectorAP
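
A hedged sketch of one common fix for this class of error: the value type held by the accumulator must itself be serializable, so Vector extends Serializable below. This mirrors the shape of the program above but is not a confirmed diagnosis of this particular exception.

    import org.apache.spark.AccumulatorParam

    class Vector(val data: Array[Double]) extends Serializable

    implicit object VectorAP extends AccumulatorParam[Vector] {
      def zero(v: Vector): Vector = new Vector(new Array[Double](v.data.length))
      def addInPlace(v1: Vector, v2: Vector): Vector =
        new Vector((v1.data, v2.data).zipped.map(_ + _))   // element-wise sum
    }

    val acc = sc.accumulator(new Vector(Array.fill(10)(0.0)))   // picks up the implicit param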

Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread Dong Mo
Dear list, SBT compiles fine, but when I do the following: sbt/sbt gen-idea, import the project as an SBT project into IDEA 13.1, Make Project, then these errors show up: Error:(28, 8) object FileContext is not a member of package org.apache.hadoop.fs import org.apache.hadoop.fs.{FileContext, FileStatus, File

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread DB Tsai
Hi Dong, This is pretty much what I did. I ran into the same issue you have. Since I'm not developing yarn-related stuff, I just excluded those two yarn-related projects from IntelliJ, and it works. PS, you may need to exclude the java8 project as well now. Sincerely, DB Tsai

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread Sean Owen
I let IntelliJ read the Maven build directly and that works fine. -- Sean Owen | Director, Data Science | London On Wed, Apr 9, 2014 at 6:14 AM, Dong Mo wrote: > Dear list, > > SBT compiles fine, but when I do the following: > sbt/sbt gen-idea > import project as SBT project to IDEA 13.1 > Make

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread Xiangrui Meng
After sbt/sbt gen-idea, do not import as an SBT project but choose "open project" and point it to the spark folder. -Xiangrui On Tue, Apr 8, 2014 at 10:45 PM, Sean Owen wrote: > I let IntelliJ read the Maven build directly and that works fine. > -- > Sean Owen | Director, Data Science | London >

Re: Preconditions on RDDs for creating a Graph?

2014-04-08 Thread Ankur Dave
On Tue, Apr 8, 2014 at 2:33 PM, Adam Novak wrote: > What, exactly, needs to be true about the RDDs that you pass to Graph() to > be sure of constructing a valid graph? (Do they need to have the same > number of partitions? The same number of partitions and no empty > partitions? Do you need to re
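
For reference, a minimal sketch of the Graph constructor being discussed, with hypothetical data; the apply method also accepts a default attribute for vertices that are referenced by the edge RDD but missing from the vertex RDD.

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges: RDD[Edge[Int]] =
      sc.parallelize(Seq(Edge(1L, 2L, 7), Edge(2L, 3L, 4)))
    val graph: Graph[String, Int] = Graph(vertices, edges, defaultVertexAttr = "missing")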