Thanks! Very helpful.
Eila

On Mon, Jan 14, 2019 at 4:35 AM Robert Bradshaw <rober...@google.com> wrote:
> I am not aware of any built-in transform that can do this; however, it
> should not be that difficult to do with a group-by-key.
>
> Suppose one reads the CSV file into a PCollection of dictionaries of the
> format {'original_column_1': value1, 'original_column_2': value2, ...}.
> Suppose further that original_column_N is the index column (whose values
> will become the new column names). To compute the transpose you can use
> the PTransform
>
> class Transpose(beam.PTransform):
>   def __init__(self, index_column):
>     self._index_column = index_column
>
>   def expand(self, pcoll):
>     return (pcoll
>         # Map to tuples of the form (column_name, (index, value)).
>         | beam.FlatMap(lambda original_row, ix_col: [
>               (col, (original_row[ix_col], value))
>               for col, value in original_row.items()
>               if col != ix_col], self._index_column)
>         # Group all values for a column together.
>         | beam.GroupByKey()
>         # Map each (column, grouped values) pair to a dictionary of the
>         # form {index_value: value, ..., 'original_column_name': column}.
>         # (Python 3 has no tuple-unpacking lambdas, hence kv[0]/kv[1].)
>         | beam.Map(lambda kv: dict(kv[1], original_column_name=kv[0])))
>
> You can then apply this to your PCollection by writing
>
> transposed_pcoll = pcoll | Transpose('original_column_N')
>
> On Sun, Jan 13, 2019 at 5:19 PM Sameer Abhyankar <saabhyan...@google.com>
> wrote:
>
>> Hi Eila - While I am not aware of a transpose transform available for
>> CSV files, there is a sample pipeline available to transpose a BigQuery
>> table and write the results to a different table [1]. It might be
>> possible to modify this to work on a CSV source.
>>
>> [1]
>> https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-bigquery-transpose
>>
>> On Sun, Jan 13, 2019 at 1:58 AM OrielResearch Eila Arich-Landkof <
>> e...@orielresearch.org> wrote:
>>
>>> Hi all,
>>>
>>> I am working with many CSV files whose common part is the row names;
>>> therefore, my processing should be done by columns. My plan is to
>>> transpose the tables and have the combined tables written into BQ.
>>> So, the code should:
>>> 1. Transpose the tables (columns -> new_rows, rows -> new_columns).
>>> new_rows x new_columns = new_table
>>> 2. Extract the new_rows values from the new tables and write them to
>>> BigQuery.
>>>
>>> Is there an easy way to transpose the CSV files? I am avoiding the
>>> pandas library because the tables could be very large. Should I be
>>> concerned about the table size? Is this consideration relevant, or
>>> should Apache Beam be able to handle the resources for pandas?
>>>
>>> What are my other options? Is there any built-in transpose method that
>>> I am not aware of?
>>>
>>> Thanks for your help,
>>> --
>>> Eila
>>> www.orielresearch.org
>>> https://www.meetup.com/Deep-Learning-In-Production/

--
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/
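
A minimal, self-contained sketch of the Transpose transform from Robert's
reply, runnable end to end. The in-memory sample rows, the 'sample_id' index
column, and the implicit DirectRunner are illustrative assumptions, not part
of the thread:

import apache_beam as beam

class Transpose(beam.PTransform):
    """Pivots row-dicts so values of `index_column` become column names."""
    def __init__(self, index_column):
        self._index_column = index_column

    def expand(self, pcoll):
        return (pcoll
            # Emit (column_name, (index_value, cell_value)) pairs.
            | beam.FlatMap(lambda row, ix: [
                  (col, (row[ix], value))
                  for col, value in row.items() if col != ix],
                  self._index_column)
            # Collect every (index_value, cell_value) pair per column.
            | beam.GroupByKey()
            # One output dict per original column.
            | beam.Map(lambda kv: dict(kv[1], original_column_name=kv[0])))

# Illustrative input; 'sample_id' plays the role of original_column_N.
rows = [
    {'sample_id': 's1', 'gene_a': 1.0, 'gene_b': 2.0},
    {'sample_id': 's2', 'gene_a': 3.0, 'gene_b': 4.0},
]

with beam.Pipeline() as p:  # DirectRunner by default
    (p
     | beam.Create(rows)
     | Transpose('sample_id')
     # Prints e.g. {'s1': 1.0, 's2': 3.0, 'original_column_name': 'gene_a'}
     | beam.Map(print))

From there, each transposed dictionary could be written to BigQuery with
beam.io.WriteToBigQuery, given an appropriate table schema, which is the
step Eila's pipeline would need next.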