I was surprised I couldn't find a way of solving this in Spark, as it must
be a very common problem for users, so I decided to ask here.
Consider the code below:
```
val joinColumns = Seq("a", "b")
val df1 = Seq(("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a4", null, "c4")).toDF("a", "b", "c")
va
```
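The excerpt is cut off before the actual question, but the sample data (a null in join column "b") suggests the familiar issue that equality joins silently drop rows whose join keys are null. A minimal sketch of a null-safe join, assuming a hypothetical second dataframe `df2` with the same join columns:
```
// Hypothetical second dataframe sharing the join columns "a" and "b".
val df2 = Seq(("a1", "b1", "d1"), ("a4", null, "d4")).toDF("a", "b", "d")

// A plain join on joinColumns drops the ("a4", null) row, because
// null = null evaluates to null. The null-safe equality operator <=>
// treats two nulls as equal, so that row survives the join.
val joinExpr = joinColumns.map(c => df1(c) <=> df2(c)).reduce(_ && _)
val joined = df1.join(df2, joinExpr, "inner")
```
Note that this form keeps the join columns from both sides, so the duplicates from `df2` may need to be dropped afterwards.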
Hi,
I am using Spark on EMR 5.28.0.
We were having a problem in production where, after a join between 2
dataframes, in some situations all the data was being moved to a single node,
and then the cluster was failing after retrying many times.
Our join is something like this:
```
df1.join(df2,
df
```
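The thread does not show the fix, but a common remedy for this kind of single-node skew, sketched below with hypothetical names (`df1` as the skewed side, `"key"` as the join column, `numSalts` chosen arbitrarily), is key salting: spread a hot key over several artificial sub-keys so its rows no longer land in one partition.
```
import org.apache.spark.sql.functions.{array, explode, lit, rand}

val numSalts = 16

// Skewed side: tag every row with a random salt in [0, numSalts).
val saltedLeft = df1.withColumn("salt", (rand() * numSalts).cast("int"))

// Other side: replicate each row once per possible salt value.
val saltedRight = df2.withColumn("salt", explode(array((0 until numSalts).map(lit): _*)))

// Joining on (key, salt) spreads a hot key over numSalts partitions.
val joined = saltedLeft.join(saltedRight, Seq("key", "salt"))
```
If the smaller side fits in memory, a simple `broadcast(df2)` hint avoids the shuffle entirely and is usually the first thing to try.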
devan wrote:
> You can do it with a custom RDD implementation.
> You will mainly implement "getPartitions" - the logic to split your input
> into partitions - and "compute" - to compute and return the values from the
> executors.
>
> On Tue, 17 Sep 2019 at 08:47, Marcelo
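A minimal skeleton of what is being described, with hypothetical names and a made-up partitioning scheme (blocks of `linesPerBlock` line numbers), just to show where `getPartitions` and `compute` go:
```
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition per "block" of lines (hypothetical layout).
case class BlockPartition(index: Int, firstLine: Long, numLines: Long) extends Partition

// Skeleton custom RDD: getPartitions decides how the input is split,
// compute produces the values for one partition on an executor.
class BlockRDD(sc: SparkContext, totalLines: Long, linesPerBlock: Long)
  extends RDD[Long](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    val numBlocks = ((totalLines + linesPerBlock - 1) / linesPerBlock).toInt
    Array.tabulate[Partition](numBlocks) { i =>
      val first = i.toLong * linesPerBlock
      BlockPartition(i, first, math.min(linesPerBlock, totalLines - first))
    }
  }

  override def compute(split: Partition, context: TaskContext): Iterator[Long] = {
    val block = split.asInstanceOf[BlockPartition]
    // A real implementation would open the file here and read block.numLines
    // lines starting at block.firstLine; this sketch just emits the offsets.
    (block.firstLine until block.firstLine + block.numLines).iterator
  }
}
```
The driver calls `getPartitions` once to plan the tasks; `compute` then runs once per partition on the executors.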
with spark
On Tue, 17 Sep 2019 at 16:28, Marcelo Valle wrote:
> Hi,
>
> I want to create a custom RDD which will read n lines in sequence from a
> file, which I call a block, and each block should be converted to a spark
> dataframe to be processed in parallel.
>
> Qu
Hi,
I want to create a custom RDD which will read n lines in sequence from a
file, which I call a block, and each block should be converted to a spark
dataframe to be processed in parallel.
Question - do I have to implement a custom hadoop input format to achieve
this? Or is it possible to do it
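One possible answer, sketched here rather than taken from the thread: without a custom Hadoop InputFormat you can number the lines with `zipWithIndex` and group every n consecutive lines into a block (the path and block size are placeholders):
```
val path = "s3://some-bucket/input.txt"   // placeholder path
val n = 1000L                             // lines per block

val blocks = spark.sparkContext
  .textFile(path)
  .zipWithIndex()                              // (line, global line number)
  .map { case (line, idx) => (idx / n, line) } // block id = line number / n
  .groupByKey()                                // RDD[(blockId, lines of the block)]
```
Each block can then be processed inside its task; turning every block into its own dataframe, as asked above, would still have to happen on the driver, because dataframes cannot be created inside executors.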
Maybe you can look at the Spark UI. The physical plan has no timing
> information.
> On 2019/8/13 at 10:45 PM, Marcelo Valle wrote:
>
> Hi,
>
> I have a job running on AWS EMR. It's basically a join between 2 tables
> (parquet files on s3), one somewhat large (around 50 g
Hi,
I have a job running on AWS EMR. It's basically a join between 2 tables
(parquet files on s3), one somewhat large (around 50 GB) and the other small
(less than 1 GB).
The small table is the result of other operations, but it is a dataframe
persisted with `.persist(StorageLevel.MEMORY_AND_DISK_SER)` and the c
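One thing worth checking for a join this lopsided (a sketch with placeholder names `largeDf`, `smallDf` and join key `"id"`): whether the plan is actually a broadcast join. The default `spark.sql.autoBroadcastJoinThreshold` is 10 MB, so a ~1 GB table gets a shuffle join unless it is hinted explicitly and fits in memory:
```
import org.apache.spark.sql.functions.broadcast

// Forcing a broadcast join keeps the ~50 GB side from being shuffled at all;
// only do this if the small table really fits in driver/executor memory.
val joined = largeDf.join(broadcast(smallDf), Seq("id"))
```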
rrari wrote:
> Hi Marcelo,
>
> I'm used to working with https://github.com/jupyter/docker-stacks. There's
> a Scala+Jupyter option too, though there might be a better option with
> Zeppelin as well.
> Hth
>
>
> On Tue, 11 Jun 2019, 11:52 Marcelo Valle, wrote:
>
Hi,
I would like to run the Spark shell + Scala in a Docker environment, just to
play around on a development machine without having to install a JVM + a lot
of other things.
Is there something like an "official Docker image" I am recommended to use? I
saw some on Docker Hub, but it seems they are all contr
what you need?
>
> // Bruno
>
>
> On 6 June 2019 at 16:02, Marcelo Valle wrote:
>
> Generating the city id (child) is easy; monotonically_increasing_id worked
> for me.
>
> The problem is the country (parent), which has to be in both the countries
> and cities data frames
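For the child ids mentioned above, a one-line sketch using `monotonically_increasing_id` on the `denormalized_cities` dataframe named in the question further down:
```
import org.apache.spark.sql.functions.monotonically_increasing_id

// Surrogate key per city row; the generated values are unique but not consecutive.
val citiesWithId = denormalized_cities.withColumn("CITY_ID", monotonically_increasing_id())
```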
broadcast join unless your dataset is
> pre-bucketed along non-colliding country-name lines, then the
> partition-based solution is probably faster. Or better yet, pre-create a
> list of all the world's countries with an id and do a broadcast join
> straight away.
>
>
> Reg
t of citynames is a non-shuffling,
> fast operation; add a row_number column and do a broadcast join with the
> original dataset and then split into two subsets. Probably a bit faster
> than reshuffling the entire dataframe. As always, the proof is in the
> pudding.
>
> //Magn
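A sketch of the approach suggested above, applied to the COUNTRY column of the `denormalized_cities` dataframe from the question further down (column names assumed from that question):
```
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{broadcast, row_number}

// The distinct country names form a tiny dataframe, so numbering them with
// row_number over a single (unpartitioned) window is cheap at this size.
val countries = denormalized_cities
  .select("COUNTRY").distinct()
  .withColumn("COUNTRY_ID", row_number().over(Window.orderBy("COUNTRY")))

// Broadcast the tiny lookup back onto the full dataset instead of reshuffling it.
val withCountryIds = denormalized_cities.join(broadcast(countries), Seq("COUNTRY"))
```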
write code or define RDDs with map/reduce
> functions.
>
> Akshay Bhardwaj
> +91-97111-33849
>
>
> On Thu, May 30, 2019 at 4:05 AM Marcelo Valle
> wrote:
>
>> Hi all,
>>
>> I am new to spark and I am trying to write an application using
>> dataframes t
Hi all,
I am new to Spark and I am trying to write an application using dataframes
that normalizes data.
So I have a dataframe `denormalized_cities` with 3 columns: COUNTRY, CITY,
CITY_NICKNAME
Here is what I want to do:
1. Map by country, then for each country generate a new ID and write t
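The listed question is cut off, but building on the two sketches above, the split into normalized parent and child outputs could look like this (column names still assumed):
```
import org.apache.spark.sql.functions.monotonically_increasing_id

// Parent table: one row per country with its generated id.
val countriesTable = withCountryIds.select("COUNTRY_ID", "COUNTRY").distinct()

// Child table: cities referencing the parent by COUNTRY_ID instead of the name.
val citiesTable = withCountryIds
  .select("COUNTRY_ID", "CITY", "CITY_NICKNAME")
  .withColumn("CITY_ID", monotonically_increasing_id())
```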