Worked for me 2 weeks ago with a 3.0.0-alpha2 snapshot. Just changed
hadoop.version while building.
On Fri, Oct 28, 2016, 11:50 Sean Owen wrote:
> I don't think it works, but, there is no Hadoop 3.0 right now either. As
> the version implies, it's going to be somewhat different API-wise.
>
> On
It is implemented with cogroup. Basically, it stores state in a separate
RDD and cogroups the target RDD with the state RDD, which is then hidden
from you. See StateDStream.scala; it contains everything you need to know.
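The cogroup-based state update can be sketched in plain Scala — no Spark here; the object and method names are illustrative, not Spark's actual internals:

```scala
// Plain-Scala sketch of how a stateful stream operator can be expressed as
// a cogroup of the new batch with the previous state RDD. Names are
// illustrative, not Spark's internals.
object CogroupState {
  // For each key, pair the batch's new values with the old state (if any),
  // then let the user-supplied update function produce the next state.
  def updateState[K, V, S](
      batch: Seq[(K, V)],
      state: Map[K, S],
      update: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
    val keys = batch.map(_._1).toSet ++ state.keySet // keys on either side
    keys.flatMap { k =>
      val values = batch.collect { case (`k`, v) => v }
      update(values, state.get(k)).map(k -> _) // None drops the key's state
    }.toMap
  }
}
```

With a running-count update function this behaves like the classic stateful word count: each batch is cogrouped against the state built so far.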
On Fri, Nov 6, 2015 at 6:25 PM Hien Luu wrote:
> Hi,
>
> I am interested in le
Hi,
Shuffle output goes to local disk each time, as far as I know, never to
memory.
On Fri, Oct 2, 2015 at 1:26 PM Adrian Tanase wrote:
> I’m not sure this is related to memory management – the shuffle is the
> central act of moving data around nodes when the computations need the data
> on ano
Hey, I'd try to debug and profile ResolvedDataSource. As far as I know, your
write will be performed by the JVM.
On Mon, Sep 7, 2015 at 4:11 PM Tóth Zoltán wrote:
> Unfortunately I'm getting the same error:
> The other interesting things are that:
> - the parquet files got actually written to HDFS
I personally build with SBT and run Spark on YARN with IntelliJ. You need
to connect to the remote JVMs with a remote debugger. You also need to do
something similar if you use Python, because it will launch a JVM on the
driver as well.
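For reference, a minimal way to expose the driver JVM to a remote debugger is the standard JDWP agent; this is only a config sketch (the port and suspend settings are examples, not requirements):

```shell
# Attach a remote debugger (e.g. IntelliJ) to the driver JVM via JDWP.
# suspend=y would make the JVM wait for the debugger before starting.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" \
  ...
```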
On Wed, Aug 19, 2015 at 2:10 PM canan chen wrote:
> Thanks Ted. I notice
Data skew is still a problem with Spark.
- If you use groupByKey, try to express your logic without groupByKey.
- If you need to use groupByKey, all you can do is scale vertically.
- If you can, repartition with a finer HashPartitioner. You will have many
tasks for each stage, but tasks ar
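The partition-assignment rule itself is simple; here is a plain-Scala sketch of the nonnegative-modulo rule a hash partitioner uses (outside of Spark, for illustration only):

```scala
// Plain-Scala sketch of hash partitioning: a finer partitioner (more
// partitions) spreads a skewed key space over more, smaller tasks.
object HashPartitioning {
  // Nonnegative modulo of the key's hashCode, the usual hash-partition rule.
  def partition(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }
}
```

Note that going from, say, 4 to 64 partitions does not fix skew on a single hot key, but it does split the rest of the key space into smaller tasks.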
Serialization only occurs intra-stage when you are using Python, and as
far as I know, only in the first stage, when reading the data and passing
it to the Python interpreter for the first time.
Multiple operations are just chains of simple *map* and *flatMap* operators
at task level on simple Scala
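That chaining can be illustrated with plain Scala iterators (not Spark code; it just shows why consecutive narrow operators cost a single pass over a partition):

```scala
// Sketch of intra-stage pipelining: chained flatMap/map are composed,
// lazy iterator transformations over a partition's data, with no
// intermediate collection materialized between the operators.
object Pipelining {
  def process(partition: Iterator[String]): Iterator[Int] =
    partition
      .flatMap(_.split(" ")) // tokenize lazily...
      .map(_.length)         // ...then transform, element by element
}
```

For example, `Pipelining.process(Iterator("spark on yarn", "cogroup")).toList` yields `List(5, 2, 4, 7)` in one pass.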
sentially
> the same place that Zoltán Zvara picked:
>
>
>
> 15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
>
> 15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor:
> Actor[akka.tcp://sparkExecutor@cluster04:55237/user/Executor#-1495507
Without considering everything, just a few hints:
You are running on YARN. From 09:18:34 to 09:18:37 your application is in
state ACCEPTED. There is noticeable overhead introduced by communicating
with YARN's ResourceManager and NodeManager, and given that the YARN
scheduler needs time to make a
I might join in on this conversation with a question. Would someone point me
to a decent exercise that approximates the level of this exam (from above)?
Thanks!
On Tue, May 5, 2015 at 3:37 PM Kartik Mehta
wrote:
> Production - not whole lot of companies have implemented Spark in
> production an
You should distribute your configuration file to workers and set the
appropriate environment variables, like HADOOP_HOME, SPARK_HOME,
HADOOP_CONF_DIR, SPARK_CONF_DIR.
On Mon, Apr 27, 2015 at 12:56 PM James King wrote:
> I renamed spark-defaults.conf.template to spark-defaults.conf
> and invoked
You can check container logs from the RM web UI or, when log aggregation is
enabled, with the yarn command. There are other, less convenient options.
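With log aggregation enabled, the aggregated container logs can be pulled with the yarn CLI (the application id below is a placeholder):

```shell
# Fetch aggregated container logs for a finished application.
yarn logs -applicationId <application_id>
```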
On Mon, Apr 27, 2015 at 8:53 AM ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
> Spark 1.3
>
> 1. View stderr/stdout from executor from Web UI: when the job is running i
> fi
You can work out the complexity of these operators by looking at RDD.scala.
There you will find, for example, what happens when you call map on an RDD:
it's a simple Scala map function on a simple Iterator of type T. Distinct
has been implemented with mapping and grouping on the i
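The idea behind distinct can be sketched in plain Scala — keying each element by itself, then grouping; the helper name is mine, not Spark's:

```scala
// Sketch of how distinct can be built from map and grouping, mirroring
// the RDD.scala approach of map(x => (x, null)) followed by key grouping.
object DistinctSketch {
  def distinctVia[T](xs: Seq[T]): Seq[T] =
    xs.map(x => (x, null)) // key each element by itself
      .groupBy(_._1)       // collapse equal keys
      .keys
      .toSeq
}
```

Note the result's ordering is whatever the grouping produces, which matches the fact that a shuffle does not preserve order.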