Re: Cartesian issue with user defined objects

2015-02-26 Thread Marco Gaido
Thanks, my issue was exactly that: the function that extracts the objects from the file was reusing the same instance, only mutating it for each record. Creating a new object for each item solved the issue. Thank you very much for your reply. Best regards. > On Feb 26, 2015, at 10:25 PM, Imran Rashid wrote
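
A minimal sketch of the pitfall Marco describes, assuming an existing SparkContext sc and a hypothetical SequenceFile path: Hadoop RecordReaders reuse a single Writable instance per split, so caching or collecting the raw records yields many references to one mutated object.

  import org.apache.hadoop.io.Text

  // Every (k, v) pair here may be the SAME two Text instances, mutated per record.
  val raw = sc.sequenceFile("hdfs:///tmp/input", classOf[Text], classOf[Text])

  // Copy each record into a fresh immutable object before caching or collecting.
  val safe = raw.map { case (k, v) => (k.toString, v.toString) }.cache()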

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Zhan Zhang
Here is my understanding. When running on top of YARN, the number of cores determines how many tasks can run concurrently in one executor, but all of those cores live in the same JVM. Parallelism typically controls the balance of tasks. For example, if you have 200 cores but only 50 partitions, there will be 150
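
A quick illustration of the mismatch, as a hedged sketch (the path and numbers are made up): with 200 cores but 50 partitions, 150 cores sit idle, so repartitioning to at least the core count puts them all to work.

  val data = sc.textFile("hdfs:///tmp/events")   // partition count follows the input splits
  println(data.partitions.length)                // suppose this prints 50

  // With 200 total cores available, give every core (at least) one task.
  val rebalanced = data.repartition(200)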

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
You don't need to know the RDD dependencies to maximize concurrency. Internally the scheduler will construct the DAG and trigger the execution if there are no shuffle dependencies between the RDDs. Thanks. Zhan Zhang On Feb 26, 2015, at 1:28 PM, Corey Nolet wrote: > Let's say I'm given 2 RDDs and
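
A rough sketch of walking the lineage for shuffle boundaries using only the public dependencies API; this naive recursion revisits shared parents and is only illustrative.

  import org.apache.spark.ShuffleDependency
  import org.apache.spark.rdd.RDD

  def hasShuffleAncestor(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists {
      case _: ShuffleDependency[_, _, _] => true        // a stage boundary
      case narrow => hasShuffleAncestor(narrow.rdd)     // keep walking the narrow deps
    }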

Error: no snappyjava in java.library.path

2015-02-26 Thread Dan Dong
Hi, All, When I run a small program in spark-shell, I got the following error: ... Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886) at java.lang.Runtime.loadLibrary0(Runtime.java:849) at java.lang

Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Kannan Rajah
SparkConf.scala logs a warning saying SPARK_CLASSPATH is deprecated and we should use spark.executor.extraClassPath instead. But the online documentation states that spark.executor.extraClassPath is only meant for backward compatibility. https://spark.apache.org/docs/1.2.0/configuration.html#execu

Re: Running spark function on parquet without sql

2015-02-26 Thread Zhan Zhang
When you use SQL (or the API from SchemaRDD/DataFrame) to read data from Parquet, the optimizer will do column pruning, predicate pushdown, etc., so you get the benefits of Parquet's columnar format. After that, you can operate on the SchemaRDD (DF) like a regular RDD. Thanks. Zhan Zhang On Feb 26, 20
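
A small sketch of what Zhan describes, assuming a SQLContext and a hypothetical Parquet file; only the referenced columns are scanned, and the filter is pushed into the Parquet reader.

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  sqlContext.parquetFile("hdfs:///tmp/events.parquet").registerTempTable("events")

  // Column pruning and predicate pushdown happen under the hood here.
  val pruned = sqlContext.sql("SELECT userId, ts FROM events WHERE ts > 100")

  // From here on it behaves like a regular RDD[Row].
  pruned.map(row => row.getLong(1)).take(5)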

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Zhan, I think it might be helpful to point out that I'm trying to run the RDDs in different threads to maximize the amount of work that can be done concurrently. Unfortunately, right now if I had something like this: val rdd1 = ..cache() val rdd2 = rdd1.map().() future { rdd1.saveAsHaso

RE: Help me understand the partition, parallelism in Spark

2015-02-26 Thread java8964
Imran, thanks for your explanation of the parallelism; that is very helpful. In my test case, I am only using a one-box cluster with one executor. So if I give it 10 cores, then 10 concurrent tasks will run within this one executor, each handling more data than in the 4-core case, which led to the OOM. I

Re: Error: no snappyjava in java.library.path

2015-02-26 Thread Marcelo Vanzin
Hi Dan, This is a CDH issue, so I'd recommend using cdh-u...@cloudera.org for those questions. This is an issue that was fixed in recent CM 5.3 updates; if you're not using CM, or want a workaround, you can manually configure "spark.driver.extraLibraryPath" and "spark.executor.extraLibraryPath" to in
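
A hedged sketch of that workaround; the native-library path below is an assumption, so point it at wherever the Snappy native libs actually live on your nodes.

  val conf = new org.apache.spark.SparkConf()
    .set("spark.driver.extraLibraryPath",   "/opt/cloudera/parcels/CDH/lib/hadoop/lib/native")
    .set("spark.executor.extraLibraryPath", "/opt/cloudera/parcels/CDH/lib/hadoop/lib/native")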

Re: GraphX:java.lang.NoSuchMethodError:org.apache.spark.graphx.Graph$.apply

2015-02-26 Thread hnahak
I am able to run it without any issue standalone as well as on a cluster. spark-submit --class org.graphx.test.GraphFromVerteXEdgeArray --executor-memory 1g --driver-memory 6g --master spark://VM-Master:7077 spark-graphx.jar The code is exactly the same as above

Re: How to tell if one RDD depends on another

2015-02-26 Thread Sean Owen
To distill this a bit further, I don't think you actually want rdd2 to wait on rdd1 in this case. What you want is for a request for partition X to wait if partition X is already being calculated in a persisted RDD. Otherwise the first partition of rdd2 waits on the final partition of rdd1 even whe

Converting SchemaRDD/Dataframe to RDD[vector]

2015-02-26 Thread mobsniuk
I've been searching around and have seen others ask similar questions. Given a SchemaRDD, I extract a result set that contains numbers, both Ints and Doubles. How do I construct an RDD[Vector] from it? In 1.2 I wrote the results to a text file and then read them back in, splitting them with some code I found in

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
What confused me is the statement "The final result is that rdd1 is calculated twice." Is it the expected behavior? Thanks. Zhan Zhang On Feb 26, 2015, at 3:03 PM, Sean Owen wrote: To distill this a bit further, I don't think you actually want rdd2 to wait on r

Re: How to tell if one RDD depends on another

2015-02-26 Thread Sean Owen
The issue is that both RDDs are being evaluated at once. rdd1 is cached, which means that as its partitions are evaluated, they are persisted. Later requests for the partition hit the cached partition. But we have two threads causing two jobs to evaluate partitions of rdd1 at the same time. If they

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
I should probably mention that my example is much oversimplified. Let's say I've got a fairly complex tree where I begin a series of jobs at the root, which calculates a bunch of really complex joins, and as I move down the tree, I'm creating reports from the data that's already b

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
> What confused me is the statement "The final result is that rdd1 is calculated twice." Is it the expected behavior? To be perfectly honest, performing an action on a cached RDD in two different threads and having them (at the partition level) block until the parents are cached would be the

Re: Apache Ignite vs Apache Spark

2015-02-26 Thread Jay Vyas
- https://wiki.apache.org/incubator/IgniteProposal has, I think, been updated recently and has a good comparison. - Although GridGain has been around since the early Spark days, Apache Ignite is quite new and just getting started, I think, so you will probably want to reach out to the developers for

Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
I'm considering whether or not it is worth introducing Spark at my new company. The data is nowhere near Hadoop size at this point (it sits in an RDS Postgres cluster). I'm wondering at which point it is worth the overhead of adding the Spark infrastructure (deployment scripts, monitoring, etc.).

RE: Apache Ignite vs Apache Spark

2015-02-26 Thread nate
The Ignite guys spoke at the Bigtop workshop last week at SCALE; slides are posted here: https://cwiki.apache.org/confluence/display/BIGTOP/SCALE13x A couple of main points around comments made during the preso: although it is incubating at Apache (the first code drop was last week, I believe), the tech is battle-tested with

Spark+Cassandra and how to visualize results

2015-02-26 Thread Jan Algermissen
Hi, I am planning to process an event stream in the following way: - write the raw stream through Spark Streaming to Cassandra for later analytics use cases - 'fork off' the stream, do some stream analysis, and make that information available for building dashboards. Since I am having ElasticSear

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
In this case, it is slow to wait for rdd1.saveAsHadoopFile(...) to finish, probably due to writing to HDFS. A workaround for this particular case may be as follows. val rdd1 = ..cache() val rdd2 = rdd1.map().() rdd1.count future { rdd1.saveAsHadoopFile(...) } future { rdd2.saveAsHadoo
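
The same workaround as a self-contained sketch (RDD contents and paths are hypothetical); the point is that rdd1.count() fills the cache once before the two jobs race for it.

  import scala.concurrent.Future
  import scala.concurrent.ExecutionContext.Implicits.global
  import org.apache.spark.SparkContext._   // pair-RDD functions (pre-1.3)

  val rdd1 = sc.textFile("hdfs:///tmp/in").map(line => (line, 1)).cache()
  val rdd2 = rdd1.mapValues(_ + 1)

  rdd1.count()   // materialize the cache before forking the two actions

  val f1 = Future { rdd1.saveAsTextFile("hdfs:///tmp/out1") }
  val f2 = Future { rdd2.saveAsTextFile("hdfs:///tmp/out2") }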

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Todd Nist
Hi Kannan, I believe you should be able to use --jars for this when invoking spark-shell or performing a spark-submit. Per the docs: --jars JARS  Comma-separated list of local jars to include on the driver and executor classpaths. HTH. -Todd On Thu, Feb

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Zhan, This is exactly what I'm trying to do except, as I mentioned in my first message, I am being given rdd1 and rdd2 only, and I don't necessarily know at that point whether or not rdd1 is a cached RDD. Further, I don't know at that point whether or not rdd2 depends on rdd1. On Thu, Feb 26, 2015

Re: How to tell if one RDD depends on another

2015-02-26 Thread Sean Owen
Yeah, I believe Corey knows that much and is using foreachPartition(i => None) to materialize. The question is, how would you do this with an arbitrary DAG? in this simple example we know what the answer is but he's trying to do it programmatically. On Thu, Feb 26, 2015 at 11:54 PM, Zhan Zhang wr

Non-deterministic Accumulator Values

2015-02-26 Thread Peter Thai
Hi all, I'm incrementing several accumulators inside a foreach. Most of the time, the accumulators will return the same value for the same dataset. However, they sometimes differ. I'm not sure how accumulators are implemented. Could this behavior be caused by data not arriving before I print out

Re: How to tell if one RDD depends on another

2015-02-26 Thread Ted Yu
bq. whether or not rdd1 is a cached rdd RDD has a getStorageLevel method which would return the RDD's current storage level. SparkContext has this method: /** Return information about what RDDs are cached, if they are in mem or on disk, how much space they take, etc. */ @DeveloperApi d
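
Both calls Ted quotes, as a short sketch assuming an RDD named rdd1 and a context sc:

  import org.apache.spark.storage.StorageLevel

  val isCached = rdd1.getStorageLevel != StorageLevel.NONE

  // Cluster-wide view of what is cached and how much space it takes.
  sc.getRDDStorageInfo.foreach { info =>
    println(s"${info.name}: memory=${info.memSize} disk=${info.diskSize}")
  }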

Re: How to augment data to existing MatrixFactorizationModel?

2015-02-26 Thread Xiangrui Meng
It may take some work to do online updates with a MatrixFactorizationModel because you need to update some rows of the user/item factors. You may be interested in spark-indexedrdd (http://spark-packages.org/package/amplab/spark-indexedrdd). We support save/load in Scala/Java. We are going to add

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Marcelo Vanzin
SPARK_CLASSPATH is definitely deprecated, but my understanding is that spark.executor.extraClassPath is not, so maybe the documentation needs fixing. I'll let someone who might know otherwise comment, though. On Thu, Feb 26, 2015 at 2:43 PM, Kannan Rajah wrote: > SparkConf.scala logs a warning s

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Ted, that one I know. It was the dependency part I was curious about. On Feb 26, 2015 7:12 PM, "Ted Yu" wrote: > bq. whether or not rdd1 is a cached rdd > > RDD has a getStorageLevel method which would return the RDD's current > storage level. > > SparkContext has this method: >* Return informat

Re: Converting SchemaRDD/Dataframe to RDD[vector]

2015-02-26 Thread Xiangrui Meng
Try the following:

df.map { case Row(id: Int, num: Int, value: Double, x: Float) =>
  // replace those with your types
  (id, Vectors.dense(num, value, x))
}.toDF("id", "features")

-Xiangrui On Thu, Feb 26, 2015 at 3:08 PM, mobsniuk wrote: > I've been searching around and see others have asked si
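
The same suggestion as a self-contained sketch, assuming df is your SchemaRDD/DataFrame; the column types are placeholders to replace with your schema, and toDF exists only from Spark 1.3 on (in 1.2, keep the result as a plain RDD).

  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.sql.Row

  val vectors: org.apache.spark.rdd.RDD[(Int, Vector)] =
    df.map { case Row(id: Int, num: Int, value: Double, x: Float) =>
      (id, Vectors.dense(num.toDouble, value, x.toDouble))
    }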

Re: spark streaming: stderr does not roll

2015-02-26 Thread Tathagata Das
If the mentioned conf is enabled, the rolling of the stderr should work. If it is not, then there is probably some bug. Take a look at the Worker's logs and see if there is any error about rolling of the Executor's stderr. If there is a bug, then it needs to be fixed (maybe you can take a crack at

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
I missed that part. Thanks for the explanation. It is a challenging problem implementation-wise. To do it programmatically: 1. Pre-analyze all DAGs to form a complete DAG with the root as the source and the leaves as all actions. 2. Any RDD (node) that has more than one downstream node needs to be marked

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
Currently in Spark, it looks like there is no easy way to know the dependencies; it is solved at run time. Thanks. Zhan Zhang On Feb 26, 2015, at 4:20 PM, Corey Nolet wrote: Ted. That one I know. It was the dependency part I was curious about On Feb 26, 2015 7:12 P

Re: NegativeArraySizeException when doing joins on skewed data

2015-02-26 Thread Tristan Blakers
Hi Imran, I can confirm this still happens when calling Kryo serialisation directly; note that I'm using Java. The output file is at about 440MB at the time of the crash. Kryo is version 2.21. When I get a chance I'll see if I can make a shareable test case and try Kryo 3.0, though I doubt they'd be intere

Re: How to tell if one RDD depends on another

2015-02-26 Thread Sean Owen
I think we already covered that in this thread. You get dependencies from RDD.dependencies() On Fri, Feb 27, 2015 at 12:31 AM, Zhan Zhang wrote: > Currently in spark, it looks like there is no easy way to know the > dependencies. It is solved at run time. > > Thanks. > > Zhan Zhang > > On Feb 26,

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
I think I'm getting more confused the longer this thread goes. So rdd1.dependencies provides the immediate parents of rdd1. For now I'm going to walk my internal DAG from the root down and see where running the caching of siblings concurrently gets me. I still like your point, Sean, about trying to do

partions, SQL tables, and Parquet I/O

2015-02-26 Thread Daniel, Ronald (ELS-SDG)
Short story: I want to write some parquet files so they are pre-partitioned by the same key. Then, when I read them back in, joining the two tables on that key should be about as fast as things can be done. Can I do that, and if so, how? I don't see how to control the partitioning of a SQL table
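
Spark SQL in this era does not preserve a partitioner through a Parquet round trip, but the co-partitioned-join idea does work at the RDD level; a sketch with hypothetical pair RDDs leftRaw and rightRaw:

  import org.apache.spark.HashPartitioner
  import org.apache.spark.SparkContext._   // pair-RDD functions (pre-1.3)

  val part = new HashPartitioner(64)

  // Partition both sides the same way and keep them resident.
  val left  = leftRaw.partitionBy(part).persist()
  val right = rightRaw.partitionBy(part).persist()

  // Co-partitioned inputs: the join itself needs no shuffle.
  val joined = left.join(right)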

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
Gary, On Fri, Feb 27, 2015 at 8:40 AM, Gary Malouf wrote: > I'm considering whether or not it is worth introducing Spark at my new > company. The data is no-where near Hadoop size at this point (it sits in > an RDS Postgres cluster). > Will it ever become "Hadoop size"? Looking at the overhead

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Kannan Rajah
There is a usability concern I have with the current way of specifying --jars. Imagine a use case like HBase, where a lot of jobs need it in their classpath; this needs to be set every time. If we use spark.executor.extraClassPath, then we just need to set it once. But there is no programmatic way to s

Re: Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
The honest answer is that it is unclear to me at this point. I guess what I am really wondering is if there are cases where one would find it beneficial to use Spark against one or more RDBs? On Thu, Feb 26, 2015 at 8:06 PM, Tobias Pfeiffer wrote: > Gary, > > On Fri, Feb 27, 2015 at 8:40 AM, Ga

Re: group by order by fails

2015-02-26 Thread Michael Armbrust
Assign an alias to the count in the select clause and use that alias in the order by clause. On Wed, Feb 25, 2015 at 11:17 PM, Tridib Samanta wrote: > Actually I just realized , I am using 1.2.0. > > Thanks > Tridib > > -- > Date: Thu, 26 Feb 2015 12:37:06 +0530 > Sub
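
A minimal example of the fix, assuming a registered table t:

  sqlContext.sql(
    "SELECT name, COUNT(*) AS cnt FROM t GROUP BY name ORDER BY cnt DESC")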

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
Hi On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf wrote: > The honest answer is that it is unclear to me at this point. I guess what > I am really wondering is if there are cases where one would find it > beneficial to use Spark against one or more RDBs? > Well, RDBs are all about *storage*, wh

Re: Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
So when deciding whether to take on installing/configuring Spark, the size of the data does not automatically make that decision in your mind. Thanks, Gary On Thu, Feb 26, 2015 at 8:55 PM, Tobias Pfeiffer wrote: > Hi > > On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf > wrote: > >> The honest a

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Todd Nist
Hi Kannan, The issues with using --jars make sense. I believe you can set the classpath via --conf spark.executor.extraClassPath= or in your driver with .set("spark.executor.extraClassPath", "."). I believe you are correct about the localization as well, as long as you're guaranteed that
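
A sketch of the programmatic route, with an assumed HBase lib location; note that extraClassPath entries must already exist on every node, since nothing is localized for you.

  val conf = new org.apache.spark.SparkConf()
    .set("spark.executor.extraClassPath", "/opt/hbase/lib/*")
    .set("spark.driver.extraClassPath",   "/opt/hbase/lib/*")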

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Yong, for the 200 tasks in stage 2 and 3 -- this actually comes from the shuffle setting: spark.sql.shuffle.partitions On Thu, Feb 26, 2015 at 5:51 PM, java8964 wrote: > Imran, thanks for your explaining about the parallelism. That is very > helpful. > > In my test case, I am only use one box cl
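
For a small one-box test, the 200 post-shuffle tasks can be dialed down, e.g.:

  sqlContext.setConf("spark.sql.shuffle.partitions", "8")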

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
On Fri, Feb 27, 2015 at 10:57 AM, Gary Malouf wrote: > So when deciding whether to take on installing/configuring Spark, the size > of the data does not automatically make that decision in your mind. > You got me there ;-) Tobias

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Marcelo Vanzin
On Thu, Feb 26, 2015 at 5:12 PM, Kannan Rajah wrote: > Also, I would like to know if there is a localization overhead when we use > spark.executor.extraClassPath. Again, in the case of hbase, these jars would > be typically available on all nodes. So there is no need to localize them > from the no

Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Ryan Williams
If anyone is curious to try exporting Spark metrics to Graphite, I just published a post about my experience doing that, building dashboards in Grafana, and using them to monitor Spark jobs: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/ Code

Spark-submit not working when application jar is in hdfs

2015-02-26 Thread dilm
I'm trying to run a Spark application using bin/spark-submit. When I reference my application jar inside my local filesystem, it works. However, when I copied my application jar to a directory in HDFS, I get the following exception: Warning: Skip remote jar hdfs://localhost:9000/user/hdfs/jars/sim

RE: Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Shao, Saisai
Cool, great job☺. Thanks Jerry From: Ryan Williams [mailto:ryan.blake.willi...@gmail.com] Sent: Thursday, February 26, 2015 6:11 PM To: user; d...@spark.apache.org Subject: Monitoring Spark with Graphite and Grafana If anyone is curious to try exporting Spark metrics to Graphite, I just publish

Re: Clean up app folders in worker nodes

2015-02-26 Thread markjgreene
I think the setting you are missing is 'spark.worker.cleanup.appDataTtl'. This setting controls how old a file has to be before it is deleted. More info here: https://spark.apache.org/docs/1.0.1/spark-standalone.html. Also, the 'spark.worker.cleanup.interval' you have configured is pretty

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Love to hear some input on this. I did get a standalone cluster up on my local machine and the problem didn't present itself. I'm pretty confident that means the problem is in the LocalBackend or something near it. On Thu, Feb 26, 2015 at 1:37 PM, Victor Tso-Guillen wrote: > Okay I confirmed my

Re: Apache Ignite vs Apache Spark

2015-02-26 Thread Ognen Duzlevski
Thank you all! On Thu, Feb 26, 2015 at 5:50 PM, wrote: > Ignite guys spoke at the bigtop workshop last week at Scale, posted slides > here: > > https://cwiki.apache.org/confluence/display/BIGTOP/SCALE13x > > Couple main pts around comments made during the preso.., although > incubating > apache

Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-02-26 Thread Arun Luthra
Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949 I tried this as a workaround: import org.apache.spark.scheduler._ import org.roaringbitmap._ ... kryo.register(classOf[org.roaringbitmap.RoaringBitmap]) kryo.register(classOf[org.roaringbitmap.RoaringArray]) kryo.r
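
The registrations from the thread reassembled as a sketch (inside a KryoRegistrator, where kryo is in scope); the preview is cut off mid-line, so the array registrations below are an assumption about what else typically needs registering, not Arun's exact list.

  kryo.register(classOf[org.roaringbitmap.RoaringBitmap])
  kryo.register(classOf[org.roaringbitmap.RoaringArray])
  kryo.register(classOf[Array[Short]])   // assumed: backing arrays used by the bitmap
  kryo.register(classOf[Array[Long]])    // assumed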

Reg. KNN on MLlib

2015-02-26 Thread Deep Pradhan
Has KNN classification algorithm been implemented on MLlib? Thank You Regards, Deep

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Kannan Rajah
Thanks Marcelo. Do you think it would be useful to make spark.executor.extraClassPath pick up some environment variable that can be set from spark-env.sh? Here is an example.

spark-env.sh
--
executor_extra_cp = get_hbase_jars_for_cp
export executor_extra_cp

spark-default

Speedup in Cluster

2015-02-26 Thread Deep Pradhan
What should be the expected performance of Spark Applications with the increase in the number of nodes in a cluster, other parameters being constant? Thank You Regards, Deep

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Of course, breakpointing on every status update and revive offers invocation kept the problem from happening. Where could the race be? On Thu, Feb 26, 2015 at 7:55 PM, Victor Tso-Guillen wrote: > Love to hear some input on this. I did get a standalone cluster up on my > local machine and the pro

Get importerror when i run pyspark with ipython=1

2015-02-26 Thread sourabhguha
I get the above error when I try to run pyspark with the ipython option. I do not get this error when I run it without the ipython option. I have Java 8, Scala 2.10.4 and Enthought Canopy Python on my box. OS Wi

Re: Reg. KNN on MLlib

2015-02-26 Thread Xiangrui Meng
It is not in MLlib. There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-2336 and Ashutosh has an implementation for integer values. -Xiangrui On Thu, Feb 26, 2015 at 8:18 PM, Deep Pradhan wrote: > Has KNN classification algorithm been implemented on MLlib? > > Thank You > Regards,

One of the executor not getting StopExecutor message

2015-02-26 Thread twinkle sachdeva
Hi, I am running a Spark application on YARN in cluster mode. One of my executors appears to be in a hung state for a long time, and finally gets killed by the driver. Compared to the other executors, it has not received the StopExecutor message from the driver. Here are the logs at the end of this c

Re: Get importerror when i run pyspark with ipython=1

2015-02-26 Thread Jey Kottalam
Hi Sourabh, could you try it with the stable 2.4 version of IPython? On Thu, Feb 26, 2015 at 8:54 PM, sourabhguha wrote: > > > I get the above error when I try to run pyspark with the ipython option. I > do not ge

Global sequential access of elements in RDD

2015-02-26 Thread Wush Wu
Dear all, I want to implement some sequential algorithm on RDD. For example:

val conf = new SparkConf()
conf.setMaster("local[2]").setAppName("SequentialSuite")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(Array(1, 3, 2, 7, 1, 4, 2, 5, 1, 8, 9), 2).
  sortBy(x => x, true)
r
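
For globally ordered, one-element-at-a-time access after the sort, one hedged option is toLocalIterator, which streams the sorted partitions to the driver in order:

  val it: Iterator[Int] = rdd.toLocalIterator
  it.foreach(println)   // 1, 1, 1, 2, 2, 3, 4, 5, 7, 8, 9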
