Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-01 Thread Joris Billen
Hi, as said, thanks for the little discussion over mail. I understand that the action is triggered in the end at the write, and then all of a sudden everything is executed at once. But I don't really need to trigger an action before. I am caching somewhere a df that will be reused several times (sligh…
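A minimal Scala sketch of the pattern being described; the source path, column names, and output paths are hypothetical stand-ins for the job in the thread:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-reuse").getOrCreate()
import spark.implicits._

// Hypothetical source standing in for the df described above.
val baseDf = spark.read.parquet("/data/events")

// Cache only because baseDf feeds more than one action below;
// a df used once is cheaper to recompute than to cache.
baseDf.cache()

// Two downstream writes, each an action that reuses the cached df.
baseDf.filter($"status" === "ok").write.parquet("/out/ok")
baseDf.groupBy($"user").count().write.parquet("/out/counts")

// Release executor memory once the reuse is over, which is one way
// to avoid the steady memory growth in the subject line.
baseDf.unpersist()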

data type missing

2022-04-01 Thread capitnfrakass
Hello. After I converted the dataframe to an RDD I found the data type was missing.

scala> df.show
+----+---+
|name|age|
+----+---+
|jone| 12|
|rosa| 21|
+----+---+

scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)

scala> df.rdd.map{ row => (r…
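df.rdd yields an RDD[Row], and a Row only knows its field types at runtime, which is why the static types appear to go missing. A short sketch of two ways to get them back, assuming the two-column df from the transcript:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("row-types").getOrCreate()
import spark.implicits._

val df = Seq(("jone", 12), ("rosa", 21)).toDF("name", "age")

// Option 1: extract typed fields from each untyped Row by name.
val pairs = df.rdd.map { row =>
  (row.getAs[String]("name"), row.getAs[Int]("age"))
}

// Option 2: convert to a typed Dataset first, so the resulting
// RDD carries Person rather than Row.
case class Person(name: String, age: Int)
val typedRdd = df.as[Person].rdd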

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-01 Thread Sean Owen
This feels like premature optimization, and it's not clear it's optimizing anything, but maybe. Caching things that are used once is worse than not caching. It looks like a straight line through to the write, so I doubt caching helps anything here. On Fri, Apr 1, 2022 at 2:49 AM Joris Billen wrote: > Hi, > as…

how to change data type for columns of dataframe

2022-04-01 Thread capitnfrakass
Hi, I got a dataframe object from another application, meaning this obj is not generated by me. How can I change the data types of some columns in this dataframe? For example, change a column's type from Int to Float. Thanks.

Re: how to change data type for columns of dataframe

2022-04-01 Thread ayan guha
Please use cast. Also, I would strongly recommend going through the Spark docs; they're pretty good. On Sat, 2 Apr 2022 at 12:43 pm, wrote: > Hi > > I got a dataframe object from other application, it means this obj is > not generated by me. > How can I change the data types for some columns in this data…
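For reference, a minimal sketch of the cast approach, assuming an Int column named age (the names and values are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("cast-demo").getOrCreate()
import spark.implicits._

val df = Seq(("jone", 12), ("rosa", 21)).toDF("name", "age")

// cast accepts a type name string or a DataType such as FloatType;
// withColumn replaces the Int column with its Float copy.
val recast = df.withColumn("age", col("age").cast("float"))

recast.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: float (nullable = false)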