[ https://issues.apache.org/jira/browse/ARROW-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17662345#comment-17662345 ]
Rok Mihevc commented on ARROW-5324:
-----------------------------------

This issue has been migrated to [issue #21786|https://github.com/apache/arrow/issues/21786] on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542] for further details.

> [Plasma] API requests
> ---------------------
>
>                 Key: ARROW-5324
>                 URL: https://issues.apache.org/jira/browse/ARROW-5324
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++ - Plasma
>            Reporter: Darren Weber
>            Priority: Minor
>
> Copied from [https://github.com/apache/arrow/issues/4318] (it's easier to
> read there, sorry hate Jira formatting)
> Related to https://issues.apache.org/jira/browse/ARROW-3444
> While working with the plasma API to create/seal an object for a table, using
> a custom object-ID, it would help to have a convenience API to get the size
> of the table.
> The following code might help to illustrate the request and notes below:
> {code:java}
> if not parquet_path:
>     parquet_path = f"./data/dataset_{size}.parquet"
> if not plasma_path:
>     plasma_path = f"./data/dataset_{size}.plasma"
> try:
>     plasma_client = plasma.connect(plasma_path)
> except:
>     plasma_client = None
> if plasma_client:
>     table_id = plasma.ObjectID(bytes(parquet_path[:20], encoding='utf8'))
>     try:
>         table = plasma_client.get(table_id, timeout_ms=4000)
>         if table.__name__ == 'ObjectNotAvailable':
>             raise ValueError('Failed to get plasma object')
>     except ValueError:
>         table = pq.read_table(parquet_path, use_threads=True)
>         plasma_client.create_and_seal(table_id, table)
> {code}
>
> The use case is a workflow something like this:
> - process-A
> ** generate a pandas DataFrame `df`
> ** save the `df` to parquet, using pyarrow.parquet, with a unique parquet
> path
> ** (this process will not save directly to plasma)
> - process-B
> ** get the data from plasma or load it into plasma from the parquet file
> ** use the unique parquet path to generate a unique object-ID
>
> Notes:
> - `plasma_client.put` for the same data-table is not idempotent: it
> generates unique object-ID values that are not based on any hash of the data
> payload, so every put saves a new object-ID. Could it use a data hash for
> idempotent puts? e.g.
> {code:java}
> In : plasma_client.put(table)
> ObjectID(666625fcb60959d23b6bfc739f88816da29e04d6)
> In : plasma_client.put(table)
> ObjectID(d2a4662999db30177b090f9fc2bf6b28687d2f8d)
> In : plasma_client.put(table)
> ObjectID(b2928ad786de2fdb74d374055597f6e7bd97fd61)
> In : hash(table)
> TypeError: unhashable type: 'pyarrow.lib.Table'{code}
> - In process-B, when the data is not already in plasma, it reads data from a
> parquet file into a pyarrow.Table and then needs an object-ID and the table
> size to use plasma `client.create_and_seal`, but it's not easy to get the
> table size. This might be related to github issue #2707 (#3444). It might
> be ideal if `client.create_and_seal` accepted responsibility for the size
> of the object to be created when given a pyarrow data object like a table.
> - When the plasma store does not have the object, it could have a default
> timeout rather than hang indefinitely. It's also a bit clumsy to return an
> object that is not easily checked with `isinstance`; it could be better to
> have an exception handling pattern (or something like the requests 404
> patterns and options?)
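
One way to approximate the idempotent object-IDs asked about in the notes is to derive the ID bytes from a hash instead of slicing the path. This is only a sketch, not anything Plasma itself provides. The helper names `object_id_bytes_from_path` and `object_id_bytes_from_payload` are hypothetical, and the example path is made up for illustration. It relies on Plasma object IDs being 20 bytes long, which happens to match the length of a SHA-1 digest, and it avoids the collisions that `parquet_path[:20]` would produce for paths sharing their first 20 characters.

```python
import hashlib

# Assumption: Plasma object IDs are 20 bytes long, the same size as a SHA-1 digest.
OBJECT_ID_SIZE = 20


def object_id_bytes_from_path(path: str) -> bytes:
    """Hypothetical helper: deterministic 20-byte ID derived from a file path."""
    # Hash the whole path so that paths sharing a prefix do not collide,
    # and short paths still yield exactly 20 bytes.
    return hashlib.sha1(path.encode("utf8")).digest()


def object_id_bytes_from_payload(payload: bytes) -> bytes:
    """Hypothetical helper: deterministic 20-byte ID derived from the data itself."""
    return hashlib.sha1(payload).digest()


# Illustrative path only; the same input always maps to the same ID bytes.
id_a = object_id_bytes_from_path("./data/dataset_1000.parquet")
id_b = object_id_bytes_from_path("./data/dataset_1000.parquet")
id_c = object_id_bytes_from_path("./data/dataset_1000_v2.parquet")
```

The resulting bytes could then be passed to `plasma.ObjectID(...)` in place of `bytes(parquet_path[:20], encoding='utf8')`. Whether hashing the path or the payload is the better choice depends on whether the parquet path is guaranteed to be unique per dataset, as the workflow above assumes.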