Hello Chetan,
I don’t know about Scala, but in PySpark there is no elegant built-in way to
drop columns that contain NAs.
Here is a possible solution to your problem:
>>> data = [(None, 1, 2), (0, None, 2), (0, 1, 2)]
>>> columns = ('A', 'B', 'C')
>>> df = spark.createDataFrame(data, columns)
>>> df.show()
+----+----+---+
|   A|   B|  C|
+----+----+---+
|null|   1|  2|
|   0|null|  2|
|   0|   1|  2|
+----+----+---+
>>> for column in df.columns:
...     if df.select(column).where(df[column].isNull()).first():
...         df = df.drop(column)
...
>>> df.show()
+---+
|  C|
+---+
|  2|
|  2|
|  2|
+---+
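The loop above launches one Spark job per column. If that gets slow on a wide
DataFrame, a variant that counts the nulls of every column in a single pass
might look like the sketch below. The function names columns_to_drop and
drop_null_columns are my own, not part of PySpark, and I am assuming column
names without dots or other special characters.

```python
def columns_to_drop(null_counts):
    """Return the names of the columns whose null count is greater than zero.

    null_counts is a plain dict mapping column name -> number of nulls.
    """
    return [name for name, count in null_counts.items() if count > 0]


def drop_null_columns(df):
    """Drop every column of a Spark DataFrame that contains at least one null.

    Hypothetical helper: all null counts are computed in one aggregation.
    """
    # Imported here so columns_to_drop stays usable without Spark installed.
    from pyspark.sql import functions as F

    # count() ignores nulls, so this counts only the rows where the column is null.
    counts = df.select(
        [F.count(F.when(F.col(c).isNull(), 1)).alias(c) for c in df.columns]
    ).first().asDict()

    to_drop = columns_to_drop(counts)
    return df.drop(*to_drop) if to_drop else df
```

With the example DataFrame, drop_null_columns(df) should leave only column C,
the same as the loop.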
If your DataFrame fits in the driver’s memory, I suggest you instead bring
the original DataFrame (as it was before the loop above) into Pandas.
>>> df_pd = df.toPandas()
>>> df_pd
     A    B  C
0  NaN  1.0  2
1  0.0  NaN  2
2  0.0  1.0  2
>>> df_pd = df_pd.dropna(axis='columns')
>>> df_pd
   C
0  2
1  2
2  2
You can then bring the result back into Spark:
>>> df = spark.createDataFrame(df_pd)
>>> df.show()
+---+
|  C|
+---+
|  2|
|  2|
|  2|
+---+
Hope that helps.
Regards,
V
> On 27 Feb 2021, at 05:25, Chetan Khatri <[email protected]> wrote:
>
> Hi Users,
>
> What is equivalent of df.dropna(axis='columns') of Pandas in the Spark/Scala?
>
> Thanks