> (1) if I want to cast n columns to a different type (e.g., float to int),
> what is the smallest memory overhead I can achieve? (the overhead of 1
> column, n columns, or 100 columns?)
You should be able to do this with only 1 column of overhead, though you
might need to go a little out of your way to ensure the table is deleted so
it's not holding onto the old columns.

Example:

```
import pyarrow as pa
import pyarrow.compute as pc

my_table = pa.Table.from_pydict({'a': list(range(100)),
                                 'b': list(range(100)),
                                 'c': list(range(100))})
print('Starting table')
print(my_table)
print(f'Starting RAM usage: {pa.default_memory_pool().bytes_allocated()}')

cols = my_table.columns
names = my_table.column_names
del my_table

for idx in range(len(cols)):
    cols[idx] = pc.cast(cols[idx], pa.int16())
    print(f'RAM usage after converting col {idx}: '
          f'{pa.default_memory_pool().bytes_allocated()}')

new_table = pa.Table.from_arrays(cols, names=names)
print('Final table')
print(new_table)
print(f'Final RAM usage: {pa.default_memory_pool().bytes_allocated()}')
```

Output:

```
Starting table
pyarrow.Table
a: int64
b: int64
c: int64
----
a: [[0,1,2,3,4,...,95,96,97,98,99]]
b: [[0,1,2,3,4,...,95,96,97,98,99]]
c: [[0,1,2,3,4,...,95,96,97,98,99]]
Starting RAM usage: 2496
RAM usage after converting col 0: 1984
RAM usage after converting col 1: 1472
RAM usage after converting col 2: 960
Final table
pyarrow.Table
a: int16
b: int16
c: int16
----
a: [[0,1,2,3,4,...,95,96,97,98,99]]
b: [[0,1,2,3,4,...,95,96,97,98,99]]
c: [[0,1,2,3,4,...,95,96,97,98,99]]
Final RAM usage: 960
```

On Wed, Feb 15, 2023 at 2:59 PM Aldrin <akmon...@ucsc.edu.invalid> wrote:

> I think you can replace the schema metadata using [1]. You can perhaps
> also do the same for the field metadata, depending on where timezone
> metadata may be on a timestamp array [2].
> [1]:
> https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.replace_schema_metadata
> [2]:
> https://arrow.apache.org/docs/python/generated/pyarrow.Field.html#pyarrow.Field.with_metadata
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
> On Wed, Feb 15, 2023 at 2:52 PM Li Jin <ice.xell...@gmail.com> wrote:
>
> > Oh thanks, that could be a workaround! I thought pa tables are supposed
> > to be immutable; is there a safe way to just change the metadata?
> >
> > On Wed, Feb 15, 2023 at 5:44 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
> >
> > > Well, that's suboptimal. As a workaround I suppose you could just
> > > change the metadata if the array is timezone aware.
> > >
> > > On Wed, Feb 15, 2023 at 10:37 PM Li Jin <ice.xell...@gmail.com> wrote:
> > >
> > > > Oh, found this comment:
> > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_cast_temporal.cc#L156
> > > >
> > > > On Wed, Feb 15, 2023 at 4:23 PM Li Jin <ice.xell...@gmail.com> wrote:
> > > >
> > > > > Not sure if this is actually a bug or expected behavior - I filed
> > > > > https://github.com/apache/arrow/issues/34210
> > > > >
> > > > > On Wed, Feb 15, 2023 at 4:15 PM Li Jin <ice.xell...@gmail.com> wrote:
> > > > >
> > > > >> Hmm... something feels off here - I did the following experiment
> > > > >> on Arrow 11, and casting timestamp-naive to int64 is much faster
> > > > >> than casting timestamp-naive to timestamp-utc:
> > > > >>
> > > > >> In [16]: %time table.cast(schema_int)
> > > > >> CPU times: user 114 µs, sys: 30 µs, total: 144 µs
> > > > >> Wall time: 231 µs
> > > > >> Out[16]:
> > > > >> pyarrow.Table
> > > > >> time: int64
> > > > >> ----
> > > > >> time: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]
> > > > >>
> > > > >> In [17]: %time table.cast(schema_tz)
> > > > >> CPU times: user 119 ms, sys: 140 ms, total: 260 ms
> > > > >> Wall time: 259 ms
> > > > >> Out[17]:
> > > > >> pyarrow.Table
> > > > >> time: timestamp[ns, tz=UTC]
> > > > >> ----
> > > > >> time: [[1970-01-01 00:00:00.000000000,1970-01-01
> > > > >> 00:00:00.000000001,1970-01-01 00:00:00.000000002,1970-01-01
> > > > >> 00:00:00.000000003,1970-01-01 00:00:00.000000004,...,1970-01-01
> > > > >> 00:00:00.099999995,1970-01-01 00:00:00.099999996,1970-01-01
> > > > >> 00:00:00.099999997,1970-01-01 00:00:00.099999998,1970-01-01
> > > > >> 00:00:00.099999999]]
> > > > >>
> > > > >> In [18]: table
> > > > >> Out[18]:
> > > > >> pyarrow.Table
> > > > >> time: timestamp[ns]
> > > > >> ----
> > > > >> time: [[1970-01-01 00:00:00.000000000,1970-01-01
> > > > >> 00:00:00.000000001,1970-01-01 00:00:00.000000002,1970-01-01
> > > > >> 00:00:00.000000003,1970-01-01 00:00:00.000000004,...,1970-01-01
> > > > >> 00:00:00.099999995,1970-01-01 00:00:00.099999996,1970-01-01
> > > > >> 00:00:00.099999997,1970-01-01 00:00:00.099999998,1970-01-01
> > > > >> 00:00:00.099999999]]
> > > > >>
> > > > >> On Wed, Feb 15, 2023 at 2:52 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
> > > > >>
> > > > >>> I'm not sure about (1), but I'm pretty sure for (2) doing a cast
> > > > >>> of tz-aware timestamp to tz-naive should be a metadata-only
> > > > >>> change.
> > > > >>>
> > > > >>> On Wed, Feb 15, 2023 at 4:19 PM Li Jin <ice.xell...@gmail.com> wrote:
> > > > >>>
> > > > >>> > Asking (2) because IIUC this is a metadata operation that could
> > > > >>> > be zero copy, but I am not sure if this is actually the case.
> > > > >>> >
> > > > >>> > On Wed, Feb 15, 2023 at 10:17 AM Li Jin <ice.xell...@gmail.com> wrote:
> > > > >>> >
> > > > >>> > > Hello!
> > > > >>> > >
> > > > >>> > > I have some questions about type casting memory usage with
> > > > >>> > > pyarrow Table. Let's say I have a pyarrow Table with 100
> > > > >>> > > columns.
> > > > >>> > >
> > > > >>> > > (1) If I want to cast n columns to a different type (e.g.,
> > > > >>> > > float to int), what is the smallest memory overhead I can
> > > > >>> > > achieve? (the overhead of 1 column, n columns, or 100
> > > > >>> > > columns?)
> > > > >>> > >
> > > > >>> > > (2) If I want to cast n timestamp columns from tz-naive to
> > > > >>> > > tz-UTC, what is the smallest memory overhead I can achieve?
> > > > >>> > > (0, 1 column, n columns, or 100 columns?)
> > > > >>> > >
> > > > >>> > > Thanks!
> > > > >>> > > Li