Re: Spark 2.0 with Hadoop 3.0?

2016-10-28 Thread Zoltán Zvara
Worked for me two weeks ago with a 3.0.0-alpha2 snapshot; I just changed hadoop.version while building.

On Fri, Oct 28, 2016, 11:50 Sean Owen wrote:
> I don't think it works, but there is no Hadoop 3.0 right now either. As
> the version implies, it's going to be somewhat different API-wise.
>
> On …
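For reference, a hedged example of such a build invocation (the -Dhadoop.version override is standard in Spark's Maven build; the exact version string and profile set depend on your snapshot):

    ./build/mvn -Pyarn -Dhadoop.version=3.0.0-alpha2-SNAPSHOT -DskipTests clean package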

Re: Spark Streaming updateStateByKey Implementation

2015-11-08 Thread Zoltán Zvara
It is implemented with cogroup. Basically, it stores the states in a separate RDD and cogroups the target RDD with that state RDD; this is hidden from you. See StateDStream.scala; everything you need to know is there.

On Fri, Nov 6, 2015 at 6:25 PM Hien Luu wrote:
> Hi,
>
> I am interested in le…
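A condensed sketch of that per-batch cogroup, with names simplified from the real StateDStream code (illustrative, not the exact source):

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    def updateState[K: ClassTag, V: ClassTag, S: ClassTag](
        prevStateRDD: RDD[(K, S)],   // the hidden state RDD
        batchRDD: RDD[(K, V)],       // new data from this batch
        updateFunc: (Seq[V], Option[S]) => Option[S]): RDD[(K, S)] = {
      // cogroup pairs each key's previous state with its new values;
      // returning None from updateFunc drops the key from the new state.
      prevStateRDD.cogroup(batchRDD).flatMapValues {
        case (states, newValues) =>
          updateFunc(newValues.toSeq, states.headOption).toSeq
      }
    }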

Re: Shuffle Write v/s Shuffle Read

2015-10-02 Thread Zoltán Zvara
Hi,

Shuffle output always goes to local disk; as far as I know, it is never kept in memory.

On Fri, Oct 2, 2015 at 1:26 PM Adrian Tanase wrote:
> I'm not sure this is related to memory management – the shuffle is the
> central act of moving data around nodes when the computations need the data
> on ano…
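Those files land under spark.local.dir on each node. A hedged example of pointing that scratch space (and with it the shuffle files) at a specific disk; the path is illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("shuffle-local-dir-example")
      .set("spark.local.dir", "/mnt/ssd/spark-tmp")  // map output files go here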

Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zoltán Zvara
Hey, I'd try to debug and profile ResolvedDataSource. As far as I know, your write will be performed by the JVM.

On Mon, Sep 7, 2015 at 4:11 PM Tóth Zoltán wrote:
> Unfortunately I'm getting the same error:
> The other interesting things are that:
> - the parquet files got actually written to HDFS…
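If the OOM turns out to be on the driver, a hedged first step is to raise driver memory and ask the JVM for a heap dump on failure (--driver-memory and -XX:+HeapDumpOnOutOfMemoryError are standard; the 4g value and the rest of the command line are illustrative):

    spark-submit --driver-memory 4g \
      --conf "spark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError" \
      ...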

Re: What's the best practice for developing new features for spark ?

2015-08-19 Thread Zoltán Zvara
I personally build with SBT and run Spark on YARN with IntelliJ. You need to connect to the remote JVMs with a remote debugger. You need to do the same if you use Python, because it launches a JVM on the driver as well.

On Wed, Aug 19, 2015 at 2:10 PM canan chen wrote:
> Thanks Ted. I notice…
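A hedged sketch of the wiring, as spark-defaults.conf entries (the JDWP agent string is standard JVM; the ports and suspend settings are illustrative):

    spark.driver.extraJavaOptions    -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
    spark.executor.extraJavaOptions  -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5006

Then attach an IntelliJ "Remote" debug configuration to the driver host on port 5005.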

Re: Always two tasks slower than others, and then job fails

2015-08-14 Thread Zoltán Zvara
Data skew is still a problem with Spark.

- If you use groupByKey, try to express your logic without groupByKey (see the sketch after this list).
- If you need to use groupByKey, all you can do is scale vertically.
- If you can, repartition with a finer HashPartitioner. You will have many tasks for each stage, but tasks ar…
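For the first point, a hedged sketch of replacing groupByKey with reduceByKey (map-side combining shrinks the shuffled data; the names and the partition count are illustrative):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    def counts(pairs: RDD[(String, Int)]): RDD[(String, Int)] = {
      // instead of pairs.groupByKey().mapValues(_.sum), which ships every
      // value across the network, combine on the map side first and spread
      // the reduce work over more, smaller tasks:
      pairs.reduceByKey(new HashPartitioner(400), _ + _)
    }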

Re: What is the Effect of Serialization within Stages?

2015-08-13 Thread Zoltán Zvara
Serialization occurs intra-stage only when you are using Python and, as far as I know, only in the first stage, when the data is read and passed to the Python interpreter for the first time. Multiple operations are just chains of simple *map* and *flatMap* operators at the task level on simple Scala…
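To illustrate on the Scala side (assuming sc is a SparkContext; the path is made up): these narrow transformations are fused into a single stage and run as one iterator chain per task, with no serialization between the steps.

    val words = sc.textFile("hdfs:///tmp/input")  // illustrative path
      .map(_.toLowerCase)        // all three run in the same task,
      .filter(_.nonEmpty)        // chained as plain Scala Iterator calls,
      .flatMap(_.split("\\s+"))  // not serialized between operators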

Re: YARN mode startup takes too long (10+ secs)

2015-05-08 Thread Zoltán Zvara
> …essentially the same place that Zoltán Zvara picked:
>
> 15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
> 15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster04:55237/user/Executor#-1495507…

Re: YARN mode startup takes too long (10+ secs)

2015-05-07 Thread Zoltán Zvara
Without considering everything, just a few hints: you are running on YARN. From 09:18:34 to 09:18:37 your application is in state ACCEPTED. There is a noticeable overhead introduced by communicating with YARN's ResourceManager and NodeManagers, and by the fact that the YARN scheduler needs time to make a…
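You can watch those state transitions yourself with the standard YARN CLI; a hedged example (the application ID is illustrative):

    yarn application -status application_1431000000000_0001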

Re: JAVA for SPARK certification

2015-05-05 Thread Zoltán Zvara
I might join this conversation with a request: would someone point me to a decent exercise that approximates the level of this exam (from above)? Thanks!

On Tue, May 5, 2015 at 3:37 PM Kartik Mehta wrote:
> Production - not whole lot of companies have implemented Spark in
> production an…

Re: spark-defaults.conf

2015-04-27 Thread Zoltán Zvara
You should distribute your configuration file to the workers and set the appropriate environment variables, like HADOOP_HOME, SPARK_HOME, HADOOP_CONF_DIR, and SPARK_CONF_DIR.

On Mon, Apr 27, 2015 at 12:56 PM James King wrote:
> I renamed spark-defaults.conf.template to spark-defaults.conf
> and invoked…
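A hedged example of what that could look like (the keys are standard Spark/Hadoop settings; all values and paths are illustrative):

    # spark-defaults.conf
    spark.master            yarn-client
    spark.executor.memory   2g

    # spark-env.sh
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export SPARK_CONF_DIR=/opt/spark/conf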

Re: How to debug Spark on Yarn?

2015-04-27 Thread Zoltán Zvara
You can check the container logs from the RM web UI or, when log aggregation is enabled, with the yarn command (example below). There are other, but less convenient, options.

On Mon, Apr 27, 2015 at 8:53 AM ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
> Spark 1.3
>
> 1. View stderr/stdout from executor from Web UI: when the job is running i
> fi…
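With log aggregation on (yarn.log-aggregation-enable=true), the command looks like this; the application ID is illustrative:

    yarn logs -applicationId application_1430000000000_0042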

Re: Complexity of transformations in Spark

2015-04-26 Thread Zoltán Zvara
You can work out the complexity of these operators by looking at RDD.scala, basically. There you will find, for example, what happens when you call map on an RDD: it is a simple Scala map function on a simple Iterator of type T. Distinct has been implemented with mapping and grouping on the i…
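For instance, distinct in RDD.scala boils down to a map, a reduceByKey (hence a full shuffle), and a map back, roughly like this simplified version of the real code:

    // simplified from RDD.scala; implicit ClassTags and Ordering omitted
    def distinct(numPartitions: Int): RDD[T] =
      map(x => (x, null)).reduceByKey((x, _) => x, numPartitions).map(_._1)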