Many, many thanks!

I have been looking for a PySpark DataFrame column_bind() solution for 
several months. Hopefully pyspark.pandas works. The only other solution I was 
aware of was to use Spark's DataFrame.join(), which does not scale for obvious 
reasons.
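
For the archives: a minimal sketch of what a column bind could look like with 
the pandas API on Spark, assuming pyspark.pandas.concat with axis=1 behaves 
like pandas concat; the frames psdf1 and psdf2 are made up for illustration.

import pyspark.pandas as ps

# two pandas-on-Spark frames sharing a compatible index (illustrative)
psdf1 = ps.DataFrame({'A': range(3)})
psdf2 = ps.DataFrame({'B': range(1, 4)})

# axis=1 concatenates column-wise, i.e. a cbind()-style bind
cbound = ps.concat([psdf1, psdf2], axis=1)
print(cbound.head())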

Andy


From: Bjørn Jørgensen <bjornjorgen...@gmail.com>
Date: Tuesday, March 15, 2022 at 2:19 PM
To: Andrew Davidson <aedav...@ucsc.edu>
Cc: Mich Talebzadeh <mich.talebza...@gmail.com>, "user @spark" 
<user@spark.apache.org>
Subject: Re: pivoting panda dataframe

Hi Andrew. Mich asked, and I answered with transpose():
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html

And now you are asking in the same thread about the pandas API on Spark and 
transform().

Apache Spark has a pandas API on Spark.

This means that Spark provides API calls for pandas functions, and when you use 
the pandas API on Spark, it is Spark you are using underneath.

Add this line to your imports:

from pyspark import pandas as ps


Now you can pass your dataframe back and forth to the pandas API on Spark by 
using

pf01 = f01.to_pandas_on_spark()


f01 = pf01.to_spark()
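
A minimal end-to-end sketch of that round trip, assuming an active 
SparkSession; the DataFrame f01 is created here only so the example is 
self-contained.

from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = SparkSession.builder.getOrCreate()

# a plain Spark DataFrame, created only for illustration
f01 = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])

pf01 = f01.to_pandas_on_spark()  # Spark -> pandas-on-Spark
f01 = pf01.to_spark()            # pandas-on-Spark -> Spark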


Note that I have changed pd to ps here.

df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})

df.transform(lambda x: x + 1)

You will now see that all numbers are incremented by 1.
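
For reference, the output I would expect from that call, as a sketch:

>>> df.transform(lambda x: x + 1)
   A  B
0  1  2
1  2  3
2  3  4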

You can find more information about the pandas API on Spark transform() at
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
or in your notebook:
df.transform?


Signature:
df.transform(
    func: Callable[..., ForwardRef('Series')],
    axis: Union[int, str] = 0,
    *args: Any,
    **kwargs: Any,
) -> 'DataFrame'

Docstring:
Call ``func`` on self producing a Series with transformed values
and that has the same length as its input.

See also `Transform and apply a function
<https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html>`_.

.. note:: this API executes the function once to infer the type which is
    potentially expensive, for instance, when the dataset is created after
    aggregations or sorting.

    To avoid this, specify return type in ``func``, for instance, as below:

    >>> def square(x) -> ps.Series[np.int32]:
    ...     return x ** 2

    pandas-on-Spark uses return type hint and does not try to infer the type.

.. note:: the series within ``func`` is actually multiple pandas series as the
    segments of the whole pandas-on-Spark series; therefore, the length of each
    series is not guaranteed. As an example, an aggregation against each series
    does work as a global aggregation but an aggregation of each segment. See
    below:

    >>> def func(x) -> ps.Series[np.int32]:
    ...     return x + sum(x)

Parameters
----------
func : function
    Function to use for transforming the data. It must work when pandas Series
    is passed.
axis : int, default 0 or 'index'
    Can only be set to 0 at the moment.
*args
    Positional arguments to pass to func.
**kwargs
    Keyword arguments to pass to func.

Returns
-------
DataFrame
    A DataFrame that must have the same length as self.

Raises
------
Exception : If the returned DataFrame has a different length than self.

See Also
--------
DataFrame.aggregate : Only perform aggregating type operations.
DataFrame.apply : Invoke function on DataFrame.
Series.transform : The equivalent function for Series.

Examples
--------
>>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)}, columns=['A', 'B'])
>>> df
   A  B
0  0  1
1  1  2
2  2  3

>>> def square(x) -> ps.Series[np.int32]:
...     return x ** 2
>>> df.transform(square)
   A  B
0  0  1
1  1  4
2  4  9

You can omit the type hint and let pandas-on-Spark infer its type.

>>> df.transform(lambda x: x ** 2)
   A  B
0  0  1
1  1  4
2  4  9

For multi-index columns:

>>> df.columns = [('X', 'A'), ('X', 'B')]
>>> df.transform(square)  # doctest: +NORMALIZE_WHITESPACE
   X
   A  B
0  0  1
1  1  4
2  4  9

>>> (df * -1).transform(abs)  # doctest: +NORMALIZE_WHITESPACE
   X
   A  B
0  0  1
1  1  2
2  2  3

You can also specify extra arguments.

>>> def calculation(x, y, z) -> ps.Series[int]:
...     return x ** y + z
>>> df.transform(calculation, y=10, z=20)  # doctest: +NORMALIZE_WHITESPACE
      X
      A      B
0    20     21
1    21   1044
2  1044  59069

File:      /opt/spark/python/pyspark/pandas/frame.py
Type:      method




On Tue, 15 Mar 2022 at 19:33, Andrew Davidson <aedav...@ucsc.edu> wrote:
Hi Bjørn

I have been looking for Spark transform for a while. Can you send me a link to 
the pyspark function?

I assume pandas transform is not really an option. I think it will try to pull 
the entire dataframe into the driver's memory.

Kind regards

Andy

P.S. My real problem is that Spark does not allow you to bind columns. You can 
use union() to bind rows. I could get the equivalent of cbind() using 
union().transform().
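
For context, a sketch of the join-based workaround I mean, on plain Spark 
DataFrames; row_number() over a global window is one common way to align rows 
by position, and that single unpartitioned window is exactly why it does not 
scale. Names here are illustrative.

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([('a',), ('b',)], ['x'])
df2 = spark.createDataFrame([(1,), (2,)], ['y'])

# attach a positional row id to each side, then join on it;
# the unpartitioned window pulls everything into one partition
w = Window.orderBy(F.monotonically_increasing_id())
df1i = df1.withColumn('rid', F.row_number().over(w))
df2i = df2.withColumn('rid', F.row_number().over(w))

cbound = df1i.join(df2i, on='rid').drop('rid')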

From: Bjørn Jørgensen <bjornjorgen...@gmail.com>
Date: Tuesday, March 15, 2022 at 10:37 AM
To: Mich Talebzadeh <mich.talebza...@gmail.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: pivoting panda dataframe

We have that transpose in the pandas API on Spark too:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html
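
A minimal transpose() sketch with the pandas API on Spark, turning rows into 
column headings; the data is made up.

import pyspark.pandas as ps

psdf = ps.DataFrame({'col1': [1, 3], 'col2': [2, 4]}, index=['row1', 'row2'])

# rows become columns and columns become rows
print(psdf.transpose())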

You also have stack() and multilevel reshaping:
https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
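
And a small stack() sketch: stack() pivots the column labels into the 
innermost level of the row index. The data is made up, and pandas-on-Spark 
mirrors the pandas behaviour here.

import pyspark.pandas as ps

psdf = ps.DataFrame({'weight': [0, 1], 'height': [2, 3]}, index=['cat', 'dog'])

# result is a Series indexed by (animal, measurement)
print(psdf.stack())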



On Tue, 15 Mar 2022 at 17:50, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:


Hi,

Is it possible to pivot a pandas dataframe by making a row the column heading?

Thanks




view my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

