Re: DropNa in Spark for Columns

Peyman Mohajerian Sat, 27 Feb 2021 08:04:55 -0800

I don't have personal experience with Koalas, but it does seem to have the
same API:
https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.dropna.html


On Fri, Feb 26, 2021 at 11:46 PM Vitali Lupusor <vitalilupu...@gmail.com>
wrote:

> Hello Chetan,
>
> I don’t know about Scala, but in PySpark there is no elegant way of
> dropping NAs along the column axis.
>
> Here is a possible solution to your problem:
>
> >>> data = [(None, 1,  2), (0, None, 2), (0, 1, 2)]
> >>> columns = ('A', 'B', 'C')
> >>> df = spark.createDataFrame(data, columns)
> >>> df.show()
> +----+----+---+
> |   A|   B|  C|
> +----+----+---+
> |null|   1|  2|
> |   0|null|  2|
> |   0|   1|  2|
> +----+----+---+
> >>> for column in df.columns:
> ...     # first() returns a Row if the column has any null, else None
> ...     if df.select(column).where(df[column].isNull()).first():
> ...         df = df.drop(column)
> ...
> >>> df.show()
> +---+
> |  C|
> +---+
> |  2|
> |  2|
> |  2|
> +---+
>
> If your DataFrame fits in the driver's memory, I suggest you bring it
> into Pandas.
>
> >>> df_pd = df.toPandas()
> >>> df_pd
>      A    B  C
> 0  NaN  1.0  2
> 1  0.0  NaN  2
> 2  0.0  1.0  2
> >>> df_pd = df_pd.dropna(axis='columns')
> >>> df_pd
>    C
> 0  2
> 1  2
> 2  2
>
> You can then bring it back into Spark:
>
> >>> df = spark.createDataFrame(df_pd)
> >>> df.show()
> +---+
> |  C|
> +---+
> |  2|
> |  2|
> |  2|
> +---+
>
> Hope that helps.
>
> Regards,
> V
>
> On 27 Feb 2021, at 05:25, Chetan Khatri <chetan.opensou...@gmail.com>
> wrote:
>
> Hi Users,
>
> What is the equivalent of Pandas' df.dropna(axis='columns') in
> Spark/Scala?
>
> Thanks
>
>
>
