Hi Andrew. Mitch asked, and I answered with transpose():
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html

And now you are asking in the same thread about the pandas API on Spark and
transform().

Apache Spark has a pandas API on Spark.

This means Spark exposes the pandas functions through its own API, so when you
use the pandas API on Spark, it is still Spark doing the work.

Add this line to your imports:

from pyspark import pandas as ps


Now you can convert your DataFrame back and forth between Spark and the
pandas API on Spark by using

pf01 = f01.to_pandas_on_spark()


f01 = pf01.to_spark()
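To make the round trip concrete, here is a minimal sketch; the SparkSession
and the small example DataFrame are placeholders I made up, not something from
this thread:

from pyspark.sql import SparkSession
from pyspark import pandas as ps

spark = SparkSession.builder.getOrCreate()

# a small Spark DataFrame, just for illustration
f01 = spark.createDataFrame([(0, 1), (1, 2), (2, 3)], ["A", "B"])

# Spark DataFrame -> pandas-on-Spark DataFrame
pf01 = f01.to_pandas_on_spark()

# ... do pandas-style work on pf01 here ...

# pandas-on-Spark DataFrame -> Spark DataFrame
f01 = pf01.to_spark()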


Note that I have changed pd to ps here.

df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})

df.transform(lambda x: x + 1)

You will now see that every value has been increased by 1.
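Run in a notebook, the result should look something like this:

   A  B
0  1  2
1  2  3
2  3  4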

You can find more information about the pandas API on Spark transform() here:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
or in your notebook with
df.transform?

Signature:
df.transform(
    func: Callable[..., ForwardRef('Series')],
    axis: Union[int, str] = 0,
    *args: Any,
    **kwargs: Any,
) -> 'DataFrame'
Docstring:
Call ``func`` on self producing a Series with transformed values
and that has the same length as its input.

See also `Transform and apply a function
<https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html>`_.

.. note:: this API executes the function once to infer the type which is
     potentially expensive, for instance, when the dataset is created after
     aggregations or sorting.

     To avoid this, specify return type in ``func``, for instance, as below:

     >>> def square(x) -> ps.Series[np.int32]:
     ...     return x ** 2

     pandas-on-Spark uses return type hint and does not try to infer the type.

.. note:: the series within ``func`` is actually multiple pandas series as the
    segments of the whole pandas-on-Spark series; therefore, the length of each
    series is not guaranteed. As an example, an aggregation against each series
    does work as a global aggregation but an aggregation of each segment. See
    below:

    >>> def func(x) -> ps.Series[np.int32]:
    ...     return x + sum(x)

Parameters
----------
func : function
    Function to use for transforming the data. It must work when pandas Series
    is passed.
axis : int, default 0 or 'index'
    Can only be set to 0 at the moment.
*args
    Positional arguments to pass to func.
**kwargs
    Keyword arguments to pass to func.

Returns
-------
DataFrame
    A DataFrame that must have the same length as self.

Raises
------
Exception : If the returned DataFrame has a different length than self.

See Also
--------
DataFrame.aggregate : Only perform aggregating type operations.
DataFrame.apply : Invoke function on DataFrame.
Series.transform : The equivalent function for Series.

Examples
--------
>>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)}, columns=['A', 'B'])
>>> df
   A  B
0  0  1
1  1  2
2  2  3

>>> def square(x) -> ps.Series[np.int32]:
...     return x ** 2
>>> df.transform(square)
   A  B
0  0  1
1  1  4
2  4  9

You can omit the type hint and let pandas-on-Spark infer its type.

>>> df.transform(lambda x: x ** 2)
   A  B
0  0  1
1  1  4
2  4  9

For multi-index columns:

>>> df.columns = [('X', 'A'), ('X', 'B')]
>>> df.transform(square)  # doctest: +NORMALIZE_WHITESPACE
   X
   A  B
0  0  1
1  1  4
2  4  9

>>> (df * -1).transform(abs)  # doctest: +NORMALIZE_WHITESPACE
   X
   A  B
0  0  1
1  1  2
2  2  3

You can also specify extra arguments.

>>> def calculation(x, y, z) -> ps.Series[int]:
...     return x ** y + z
>>> df.transform(calculation, y=10, z=20)  # doctest: +NORMALIZE_WHITESPACE
      X
      A      B
0    20     21
1    21   1044
2  1044  59069
File:      /opt/spark/python/pyspark/pandas/frame.py
Type:      method





On Tue, 15 Mar 2022 at 19:33, Andrew Davidson <aedav...@ucsc.edu> wrote:

> Hi Bjorn
>
>
>
> I have been looking for spark transform for a while. Can you send me a
> link to the pyspark function?
>
>
>
> I assume pandas transform is not really an option. I think it will try to
> pull the entire dataframe into the driver's memory.
>
>
>
> Kind regards
>
>
>
> Andy
>
>
>
> p.s. My real problem is that spark does not allow you to bind columns. You
> can use union() to bind rows. I could get the equivalent of cbind() using
> union().transform()
>
>
>
> *From: *Bjørn Jørgensen <bjornjorgen...@gmail.com>
> *Date: *Tuesday, March 15, 2022 at 10:37 AM
> *To: *Mich Talebzadeh <mich.talebza...@gmail.com>
> *Cc: *"user @spark" <user@spark.apache.org>
> *Subject: *Re: pivoting panda dataframe
>
>
>
>
> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html
> We have that transpose in the pandas API on Spark too.
>
>
>
> You also have stack() and multilevel
> https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
>
>
>
>
>
>
>
> On Tue, 15 Mar 2022 at 17:50, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>
> hi,
>
>
>
> Is it possible to pivot a pandas dataframe by making the rows the column
> headings?
>
>
>
> thanks
>
>
>
>
>
>  view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
> --
>
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297
