dataframe null safe joins given a list of columns

2020-02-06 Thread Marcelo Valle
I was surprised I couldn't find a way of solving this in Spark, as it must be a very common problem for users, so I decided to ask here. Consider the code below: ``` val joinColumns = Seq("a", "b") val df1 = Seq(("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a4", null, "c4")).toDF("a", "b", "c") va…
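A minimal sketch of one way to do this, building a null-safe join condition from the column list with Spark's `<=>` (null-safe equality) operator. `df2` and its columns are assumptions, since the original message is truncated:

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._ // already in scope in spark-shell

val joinColumns = Seq("a", "b")
val df1 = Seq(("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a4", null, "c4")).toDF("a", "b", "c")
// df2 is a hypothetical right-hand side; the original post is cut off here.
val df2 = Seq(("a1", "b1", "d1"), ("a4", null, "d4")).toDF("a", "b", "d")

// AND together one null-safe equality per join column, so null matches null.
val cond = joinColumns.map(c => df1(c) <=> df2(c)).reduce(_ && _)

// Join, then drop the duplicated right-side join columns.
val joined = joinColumns.foldLeft(df1.join(df2, cond, "inner")) { (df, c) => df.drop(df2(c)) }
joined.show()
```

With a plain `===` condition the `("a4", null, ...)` rows would never match; `<=>` treats two nulls as equal, which is usually what "null-safe join" means.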

join with just 1 record causes all data to go to a single node

2019-11-21 Thread Marcelo Valle
Hi, I am using Spark on EMR 5.28.0. We were having a problem in production where, after a join between 2 dataframes, in some situations all data was being moved to a single node, and the cluster would then fail after retrying many times. Our join is something like this: ``` df1.join(df2, df…
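The usual suspect when a join funnels everything to one node is a heavily skewed join key (one value, often null, owning most rows). A hedged sketch of two things worth trying; `df1`, `df2` and the key name `id` are assumptions, since the original join is truncated:

```
import org.apache.spark.sql.functions.{broadcast, desc}

// 1. Check whether one key value dominates the large side; if so, every row
//    with that value hashes to the same post-shuffle partition.
df1.groupBy("id").count().orderBy(desc("count")).show(10)

// 2. If the other side is small enough, broadcast it so the large side is
//    never shuffled at all.
val joined = df1.join(broadcast(df2), Seq("id"))
```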

Re: custom rdd - do I need a hadoop input format?

2019-09-18 Thread Marcelo Valle
…devan wrote: > You can do it with a custom RDD implementation. You will mainly implement "getPartitions" (the logic to split your input into partitions) and "compute" (to compute and return the values from the executors). > On Tue, 17 Sep 2019 at 08:47, Marcelo…
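A minimal sketch of that suggestion, under the assumption that the file is reachable from every executor (e.g. a shared filesystem); `BlockPartition`, `LineBlockRDD` and the naive read-and-slice logic are illustrative, not the poster's actual code:

```
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition per block of `numLines` lines, starting at line `start`.
case class BlockPartition(index: Int, start: Long, numLines: Int) extends Partition

class LineBlockRDD(sc: SparkContext, path: String, totalLines: Long, blockSize: Int)
    extends RDD[Seq[String]](sc, Nil) {

  // getPartitions: split the input into fixed-size line blocks.
  override def getPartitions: Array[Partition] =
    (0L until totalLines by blockSize.toLong).zipWithIndex.map {
      case (start, i) => BlockPartition(i, start, blockSize): Partition
    }.toArray

  // compute: read this partition's block of lines on the executor.
  // (Re-reading and slicing the file per task is simple but O(file) per
  // partition; a real implementation would seek to a byte offset instead.)
  override def compute(split: Partition, context: TaskContext): Iterator[Seq[String]] = {
    val p = split.asInstanceOf[BlockPartition]
    val src = scala.io.Source.fromFile(path)
    try Iterator.single(
      src.getLines().slice(p.start.toInt, p.start.toInt + p.numLines).toVector)
    finally src.close()
  }
}
```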

Re: custom rdd - do I need a hadoop input format?

2019-09-17 Thread Marcelo Valle
…with Spark. On Tue, 17 Sep 2019 at 16:28, Marcelo Valle wrote: > Hi, > I want to create a custom RDD which will read n lines in sequence from a file, which I call a block, and each block should be converted to a Spark dataframe to be processed in parallel. > Qu…

custom rdd - do I need a hadoop input format?

2019-09-17 Thread Marcelo Valle
Hi, I want to create a custom RDD which will read n lines in sequence from a file, which I call a block, and each block should be converted to a Spark dataframe to be processed in parallel. Question: do I have to implement a custom Hadoop input format to achieve this? Or is it possible to do it…
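For what it's worth, a custom Hadoop InputFormat is not the only route. A hedged sketch of grouping lines into n-line blocks with `zipWithIndex`, assuming plain `textFile` input and that global line order is what defines a block:

```
val n = 1000 // lines per block (assumed)
val blocks = spark.sparkContext
  .textFile("data.txt")          // path is a placeholder
  .zipWithIndex()                // (line, global line number)
  .map { case (line, idx) => (idx / n, (idx, line)) }
  .groupByKey()                  // one group per n-line block
  .mapValues(_.toSeq.sortBy(_._1).map(_._2))
```

Note that a dataframe cannot be created inside an executor, so each block would have to be processed as a plain collection, or the block boundaries collected to the driver and turned into dataframes there.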

Re: help understanding physical plan

2019-08-16 Thread Marcelo Valle
Maybe you can look at the Spark UI; the physical plan contains no timing information. > On 2019/8/13 at 10:45 PM, Marcelo Valle wrote: > > Hi, > I have a job running on AWS EMR. It's basically a join between 2 tables (parquet files on S3), one somewhat large (around 50 G…

help understanding physical plan

2019-08-13 Thread Marcelo Valle
Hi, I have a job running on AWS EMR. It's basically a join between 2 tables (parquet files on S3), one somewhat large (around 50 GB) and the other small (less than 1 GB). The small table is the result of other operations, but it was a dataframe with `.persist(StorageLevel.MEMORY_AND_DISK_SER)` and the c…
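Since one side is under 1 GB, an explicit broadcast hint is often the first thing to try for this shape of join. A sketch, with `largeDf`, `smallDf` and `join_key` as stand-ins for the real tables:

```
import org.apache.spark.sql.functions.broadcast

// Broadcasting skips the shuffle of the ~50 GB side entirely. The default
// spark.sql.autoBroadcastJoinThreshold is only 10 MB, so a ~1 GB table needs
// an explicit hint (and enough driver/executor memory to hold it).
val result = largeDf.join(broadcast(smallDf), Seq("join_key"))
```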

Re: best docker image to use

2019-06-13 Thread Marcelo Valle
…rrari wrote: > Hi Marcelo, > I'm used to working with https://github.com/jupyter/docker-stacks. There's a Scala + Jupyter option too, though there might be a better option with Zeppelin. > Hth > On Tue, 11 Jun 2019, 11:52 Marcelo Valle wrote: >…

best docker image to use

2019-06-11 Thread Marcelo Valle
Hi, I would like to run spark shell + Scala in a Docker environment, just to play with it on a development machine without having to install a JVM + a lot of other things. Is there an "official Docker image" I am recommended to use? I saw some on Docker Hub, but it seems they are all contr…

Re: adding a column to a groupBy (dataframe)

2019-06-07 Thread Marcelo Valle
…what you need? > // Bruno > On 6 June 2019 at 16:02, Marcelo Valle wrote: > Generating the city id (child) is easy; a monotonically increasing id worked for me. The problem is the country (parent), which has to be in both the countries and cities data fram…
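The child-id part mentioned above, sketched with `monotonically_increasing_id` (unique but deliberately non-consecutive ids); `denormalized_cities` is the dataframe from the original question at the bottom of this thread:

```
import org.apache.spark.sql.functions.monotonically_increasing_id

// Unique 64-bit ids, built from partition id + per-partition counter, so
// they are not consecutive across partitions.
val citiesWithId = denormalized_cities.withColumn("CITY_ID", monotonically_increasing_id())
```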

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Marcelo Valle
…broadcast join, unless your dataset is pre-bucketed along non-colliding country-name lines, in which case the partition-based solution is probably faster. Or better yet, pre-create a list of all the world's countries with an id and do a broadcast join straight away. > Reg…
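A sketch of the "pre-create the country list" idea; `countryList` is a hypothetical (COUNTRY, COUNTRY_ID) table and the sample rows are invented:

```
import org.apache.spark.sql.functions.broadcast

// A small static lookup table is cheap to broadcast to every executor.
val countryList = Seq(("UK", 1L), ("France", 2L)).toDF("COUNTRY", "COUNTRY_ID")
val withCountryId = denormalized_cities.join(broadcast(countryList), Seq("COUNTRY"))
```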

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Marcelo Valle
…t of citynames is a non-shuffling, fast operation; add a row_number column and do a broadcast join with the original dataset, then split into two subsets. Probably a bit faster than reshuffling the entire dataframe. As always, the proof is in the pudding. > //Magn…
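A hedged sketch of that suggestion end to end, using the COUNTRY/CITY/CITY_NICKNAME columns from the original question below. The `Window.orderBy` without a partition pulls everything into one partition, which is acceptable only because the distinct country list is tiny:

```
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{broadcast, row_number}

// Distinct country names: small, and cheap compared to reshuffling the
// full dataframe.
val countries = denormalized_cities
  .select("COUNTRY").distinct()
  .withColumn("COUNTRY_ID", row_number().over(Window.orderBy("COUNTRY")))

// Broadcast the small id table back onto the original data, then split.
val enriched       = denormalized_cities.join(broadcast(countries), Seq("COUNTRY"))
val countriesTable = countries.select("COUNTRY_ID", "COUNTRY")
val citiesTable    = enriched.select("COUNTRY_ID", "CITY", "CITY_NICKNAME")
```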

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Marcelo Valle
…write code or define RDDs with map/reduce functions. > Akshay Bhardwaj > +91-97111-33849 > On Thu, May 30, 2019 at 4:05 AM Marcelo Valle wrote: >> Hi all, >> I am new to Spark and I am trying to write an application using dataframes t…

adding a column to a groupBy (dataframe)

2019-05-29 Thread Marcelo Valle
Hi all, I am new to Spark and I am trying to write an application that uses dataframes to normalize data. I have a dataframe `denormalized_cities` with 3 columns: COUNTRY, CITY, CITY_NICKNAME. Here is what I want to do: 1. Map by country, then for each country generate a new ID and write t…
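To make the setup concrete, a sketch of the input shape described above; the sample rows are invented for illustration:

```
// Denormalized input: the country repeats on every city row.
val denormalized_cities = Seq(
  ("UK", "London", "The Big Smoke"),
  ("UK", "Manchester", "Cottonopolis"),
  ("France", "Paris", "City of Light")
).toDF("COUNTRY", "CITY", "CITY_NICKNAME")
```

The replies above this message sketch ways to generate the per-country ID: distinct + row_number, a pre-built country list, or monotonically_increasing_id for the city side.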