Re: DropNa in Spark for Columns

Vitali Lupusor Fri, 26 Feb 2021 23:45:36 -0800

Hello Chetan,

I don’t know about Scala, but in PySpark there is no elegant way of dropping 
NAs along the column axis.


Here is a possible solution to your problem:

>>> data = [(None, 1,  2), (0, None, 2), (0, 1, 2)]
>>> columns = ('A', 'B', 'C')
>>> df = spark.createDataFrame(data, columns)
>>> df.show()
+----+----+---+
|   A|   B|  C|
+----+----+---+
|null|   1|  2|
|   0|null|  2|
|   0|   1|  2|
+----+----+---+
>>> for column in df.columns:
...     # first() returns None when the column has no null rows
...     if df.select(column).where(df[column].isNull()).first() is not None:
...         df = df.drop(column)
... 
>>> df.show()
+---+
|  C|
+---+
|  2|
|  2|
|  2|
+---+
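
The loop above launches one Spark job per column. Here is a minimal sketch of a
single-pass variant, assuming column names without dots or other special
characters: count the nulls of every column in one aggregation, then drop the
columns whose count is non-zero. The helper names columns_with_nulls and
drop_null_columns are made up for illustration.

```python
def columns_with_nulls(null_counts):
    """Return the names whose null count is greater than zero, in input order."""
    return [name for name, count in null_counts.items() if count > 0]


def drop_null_columns(df):
    """Drop every column of a Spark DataFrame that contains at least one null."""
    # Imported here so the pure-Python helper above works without PySpark.
    from pyspark.sql import functions as F

    # One aggregation: count() skips nulls, and when() without otherwise()
    # yields null unless the condition holds, so this counts null cells.
    counts = df.select(
        [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
    ).first().asDict()

    to_drop = columns_with_nulls(counts)
    return df.drop(*to_drop) if to_drop else df
```

Applied to the example DataFrame, drop_null_columns(df) should leave only
column C. Note that this checks for null only; NaN values in floating-point
columns would need F.isnan as well.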

If your DataFrame fits in the driver’s memory, I suggest you bring it into 
Pandas (starting again from the original three-column DataFrame):

>>> df_pd = df.toPandas()
>>> df_pd
     A    B  C
0  NaN  1.0  2
1  0.0  NaN  2
2  0.0  1.0  2
>>> df_pd = df_pd.dropna(axis='columns')
>>> df_pd
   C
0  2
1  2
2  2

You can then bring it back into Spark:

>>> df = spark.createDataFrame(df_pd)
>>> df.show()
+---+
|  C|
+---+
|  2|
|  2|
|  2|
+---+

Hope that helps.

Regards,
V

> On 27 Feb 2021, at 05:25, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
> 
> Hi Users, 
> 
> What is equivalent of df.dropna(axis='columns') of Pandas in the Spark/Scala?
> 
> Thanks
