Re: [AirFlow]: Pandas DataFrame Between Tasks

Anton Zayniev Wed, 25 Dec 2019 09:13:14 -0800

Maybe the simplest solution would be to generate a temp CSV file from
pandas and pass its path through XCom to the next task. To make it idempotent you
can dynamically generate the filename to avoid collisions.
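
A minimal sketch of that idea, written as plain Python callables so it runs
without Airflow installed. The function names, the run_id argument and the
"df_<run_id>.csv" naming scheme are assumptions made up for illustration; in a
real DAG the identifier would come from the task context, and the two callables
would be wrapped in PythonOperator tasks. Deriving the filename from the run
identifier (rather than a random value) means a rerun overwrites its own file
and different runs never collide.

```python
import os
import tempfile

import pandas as pd


def build_csv_path(run_id, base_dir=None):
    """Return a path that is stable for one run and distinct across runs.

    The naming scheme is hypothetical; any deterministic function of the
    run identifier would do.
    """
    base_dir = base_dir or tempfile.gettempdir()
    # Replace characters such as ':' and '+' that are awkward in filenames.
    safe_run_id = "".join(c if c.isalnum() else "_" for c in run_id)
    return os.path.join(base_dir, "df_{}.csv".format(safe_run_id))


def produce(run_id, base_dir=None):
    """First task: write the DataFrame to CSV and return only its path."""
    df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    path = build_csv_path(run_id, base_dir)
    df.to_csv(path, index=False)
    # When used as a PythonOperator callable, the returned value is what
    # gets pushed to XCom, so only this short string is stored there.
    return path


def consume(path):
    """Second task: load the DataFrame from the path pulled from XCom."""
    return pd.read_csv(path)
```

In a DAG the second task would obtain the path with something like
ti.xcom_pull(task_ids="...") and pass it to consume(). Note that a local temp
directory only works when both tasks run on the same machine; with distributed
workers the file would need to live on shared storage instead.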


On Wed, Dec 25, 2019, 16:55 Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> I think it really depends on what kind of data you have, what size it is, how
> frequently you are going to use it and what the usage pattern will be. It's
> best to make a conscious choice based on knowing the options you have :).
>
> There are a number of options on top of those mentioned above. From what I
> hear, Avro is becoming more and more popular; most services (like BQ
> and others) support it. Parquet is also an interesting one and is natively
> supported by pandas.
>
> There are some converters that can be used to convert between different
> formats (for example https://github.com/ynqa/pandavro for pandas<>Avro, or the
> "to_parquet" method built into pandas itself:
>
> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html
> ).
> Avro is record based (like CSV) with nested data capability, whereas Parquet
> is column based (where the set of columns can change over time).
>
> But those are just a few examples and it's up to you to choose the right
> approach for you, so here are some articles to explore:
>
>    - Here you can find a nice comparison/benchmark of different formats for
>    pandas serialisation
>
> https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
>    - Also a nice explanation on SO of the benefits of using Parquet:
>
> https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats
>    - And finally a very nice article describing different types of file
>    formats (record, column, nested, hierarchical, array, model...),
>    including comparisons and properties of each type:
>
> https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats
>
>
> J.
>
>
>
>
> On Tue, Dec 24, 2019 at 10:50 AM Deng Xiaodong <xd.den...@gmail.com>
> wrote:
>
> > Yep, exactly what I suggested below.
> >
> > In terms of format, Feather (suggested by Robin below) should be favoured
> > over .csv given it persists schema as well.
> >
> >
> > XD
> >
> > On Tue, Dec 24, 2019 at 17:44 Tomasz Urbaszek <tomasz.urbas...@polidea.com>
> > wrote:
> >
> > > Personally I would use a .csv format and store the file in an S3/GCS
> > > bucket. XCom is meant to store small amounts of data.
> > >
> > > T.
> > >
> > > On Tue, Dec 24, 2019 at 10:33 AM Robin Edwards <r...@bidnamic.com>
> > > wrote:
> > >
> > > > Feather is probably a good option for data frames:
> > > >
> > > > https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_feather.html
> > > >
> > > > R
> > > >
> > > > On Tue, 24 Dec 2019 at 07:52, Deng Xiaodong <xd.den...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Hi David.
> > > > >
> > > > > The only “out of the box” way to share data/information between tasks
> > > > > is XCom (
> > > > > https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom#xcoms
> > > > > ).
> > > > >
> > > > > For your case, the quick suggestions I can share are
> > > > >
> > > > > - either merging your tasks
> > > > > - or persisting your pandas DataFrame somewhere and then loading it in
> > > > > your 2nd task (e.g. using pickle)
> > > > >
> > > > >
> > > > > XD
> > > > >
> > > > > On Tue, Dec 24, 2019 at 15:00 David Muñoz <david.munoz4...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Excuse me, I am new to this and maybe this topic has already been
> > > > > > covered.
> > > > > >
> > > > > > I would like to know if there is a way to "share/pass" pandas
> > > > > > DataFrames between tasks in Airflow.
> > > > > >
> > > > > > Any help would be appreciated.
> > > > > >
> > > > > > Thank you!!!
> > > > > >
> > > > > > David.
> > > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Tomasz Urbaszek
> > > Polidea <https://www.polidea.com/> | Software Engineer
> > >
> > > M: +48 505 628 493 <+48505628493>
> > > E: tomasz.urbas...@polidea.com <tomasz.urbasz...@polidea.com>
> > >
> > > Unique Tech
> > > Check out our projects! <https://www.polidea.com/our-work>
> > >
> >
>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
>
