Can we create a JIRA to track this issue?
On Wed, Feb 21, 2018 at 5:04 AM, ALBERTO Bocchinfuso <alberto_boc...@hotmail.it> wrote:
> Hi,
>
> Have you had any news on this issue?
> Do you plan to solve it for the next releases of Arrow, or is there any way
> to avoid the problem?
>
> Thanks in advance,
> Alberto
>
> From: Philipp Moritz<mailto:pcmor...@gmail.com>
> Sent: Friday, February 9, 2018 00:30
> To: dev@arrow.apache.org<mailto:dev@arrow.apache.org>
> Subject: Re: [Python] Retrieving a RecordBatch from plasma inside a function
>
> Thanks! I can indeed reproduce this problem. I'm a bit busy right now and
> plan to look into it on the weekend.
>
> Here is the preliminary backtrace for everybody interested:
>
> CESS (code=1, address=0x111138158)
>     frame #0: 0x000000010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
> ->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
>     0x10e645800 <+32>: callq 0x10e698170 ; symbol stub for: PyInt_FromLong
>     0x10e645805 <+37>: testq %rax, %rax
>     0x10e645808 <+40>: je 0x10e64580c ; <+44>
> (lldb) bt
> * thread #1: tid = 0xf1378e, 0x000000010e6457fc
>   lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28,
>   queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1,
>   address=0x111138158)
>   * frame #0: 0x000000010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
>     frame #1: 0x000000010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 133
>     frame #2: 0x000000010e613b25 lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
>     frame #3: 0x000000010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
>     frame #4: 0x000000010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305
>
> On Tue, Feb 6, 2018 at 1:24 AM, ALBERTO Bocchinfuso <alberto_boc...@hotmail.it> wrote:
>
>> Hi,
>>
>> I’m using python 3.5.2 and pyarrow 0.8.0
>>
>> As key, I put a string of 20 bytes, of course.
>> I’m doing it differently from the canonical way since I’m no longer using
>> Python 2.7 but Python 3, and this seemed to me to be the right way to
>> create a string of 20 bytes.
>> The full code is:
>>
>> import pyarrow as pa
>> import pyarrow.plasma as plasma
>>
>> def retrieve1():
>>     client = plasma.connect('test', "", 0)
>>
>>     key = "keynumber1keynumber1"
>>     pid = plasma.ObjectID(bytearray(key,'UTF-8'))
>>
>>     [buff] = client.get_buffers([pid])
>>     batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>>
>>     print(batch)
>>     print(batch.schema)
>>     print(batch[0])
>>
>>     return batch
>>
>> client = plasma.connect('test', "", 0)
>>
>> test1 = [1, 12, 23, 3, 21, 34]
>> test1 = pa.array(test1, pa.int32())
>>
>> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
>>
>> key = "keynumber1keynumber1"
>> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
>> sink = pa.MockOutputStream()
>> stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
>> stream_writer.write_batch(batch)
>> stream_writer.close()
>>
>> bff = client.create(pid, sink.size())
>>
>> stream = pa.FixedSizeBufferWriter(bff)
>> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
>> writer.write_batch(batch)
>> client.seal(pid)
>>
>> batch = retrieve1()
>> print(batch)
>> print(batch.schema)
>> print(batch[0])
>>
>> I hope this helps,
>> thank you
>>
>> From: Philipp Moritz<mailto:pcmor...@gmail.com>
>> Sent: Tuesday, February 6, 2018 00:00
>> To: dev@arrow.apache.org<mailto:dev@arrow.apache.org>
>> Subject: Re: [Python] Retrieving a RecordBatch from plasma inside a function
>>
>> Hey Alberto,
>>
>> Thanks for your message! I'm trying to reproduce it.
>>
>> Can you attach the code you use to write the batch into the store?
>>
>> Also can you say which version of Python and Arrow you are using?
>> On my installation, I get
>>
>> ```
>> In [5]: plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
>> ---------------------------------------------------------------------------
>> ValueError                                Traceback (most recent call last)
>> <ipython-input-5-fbec5bb33c33> in <module>()
>> ----> 1 plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
>>
>> plasma.pyx in pyarrow.plasma.ObjectID.__cinit__()
>>
>> ValueError: Object ID must by 20 bytes, is keynumber1keynumber1
>> ```
>>
>> (the canonical way to do this would be
>> plasma.ObjectID(b"keynumber1keynumber1"))
>>
>> Best,
>> Philipp.
>>
>> On Mon, Feb 5, 2018 at 10:09 AM, ALBERTO Bocchinfuso <alberto_boc...@hotmail.it> wrote:
>>
>> > Good morning,
>> >
>> > I am experiencing problems with the RecordBatches stored in plasma in a
>> > particular situation.
>> >
>> > If I return a RecordBatch as the result of a Python function, I am able
>> > to read just the metadata, while I get an error when reading the columns.
>> >
>> > For example, the following code
>> >
>> > def retrieve1():
>> >     client = plasma.connect('test', "", 0)
>> >
>> >     key = "keynumber1keynumber1"
>> >     pid = plasma.ObjectID(bytearray(key,'UTF-8'))
>> >
>> >     [buff] = client.get_buffers([pid])
>> >     batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>> >     return batch
>> >
>> > batch = retrieve1()
>> > print(batch)
>> > print(batch.schema)
>> > print(batch[0])
>> >
>> > is a simple piece of Python code in which a function is in charge of
>> > retrieving the RecordBatch from the plasma store and then returns it to
>> > the caller.
>> > Running the previous example I get:
>> >
>> > <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
>> > FIELD1: int32
>> > metadata
>> > --------
>> > {}
>> > <pyarrow.lib.Int32Array object at 0x7fd0ebfc0f98>
>> > [
>> >   1,
>> >   12,
>> >   23,
>> >   3,
>> >   21,
>> >   34
>> > ]
>> > <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
>> > FIELD1: int32
>> > metadata
>> > --------
>> > {}
>> > Segmentation fault (core dumped)
>> >
>> >
>> > If I retrieve and use the data in the same part of the code (as I do in
>> > the function retrieve1(); it also works when I put everything in the
>> > main program), everything runs without problems.
>> >
>> > Also, the problem seems to be related to the particular case in which I
>> > retrieve the RecordBatch from the plasma store, since the following
>> > (simpler) code:
>> >
>> > def create():
>> >     test1 = [1, 12, 23, 3, 21, 34]
>> >     test1 = pa.array(test1, pa.int32())
>> >
>> >     batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
>> >     print(batch)
>> >     print(batch.schema)
>> >     print(batch[0])
>> >     return batch
>> >
>> > batch1 = create()
>> > print(batch1)
>> > print(batch1.schema)
>> > print(batch1[0])
>> >
>> > Prints:
>> >
>> > <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
>> > FIELD1: int32
>> > <pyarrow.lib.Int32Array object at 0x7f5f691fd9a8>
>> > [
>> >   1,
>> >   12,
>> >   23,
>> >   3,
>> >   21,
>> >   34
>> > ]
>> > <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
>> > FIELD1: int32
>> > <pyarrow.lib.Int32Array object at 0x7f5f7e29f318>
>> > [
>> >   1,
>> >   12,
>> >   23,
>> >   3,
>> >   21,
>> >   34
>> > ]
>> >
>> > Which is what I expect.
>> >
>> > Is this issue known, or am I doing something wrong when retrieving the
>> > RecordBatch from plasma?
>> >
>> > I would also like to point out that this problem was as easy to find as
>> > it was hard to re-create.
>> > For this reason, there may be other situations in which the same
>> > problem arises that I have not experienced, since I mostly deal with
>> > plasma and have only been using Python so far: the description I gave
>> > is not intended to be complete.
>> >
>> > Thank you,
>> > Alberto
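
On the 20-byte key discussed in the thread: a minimal sketch, using only the standard library, of building the key as a bytes object in Python 3 and checking its length before it is handed to plasma.ObjectID. The helper name object_id_bytes and its size parameter are hypothetical and not part of pyarrow. The plasma call itself is left as a comment because it needs a running store. This only concerns how the key is built and says nothing about the segmentation fault.

```python
def object_id_bytes(key: str, size: int = 20) -> bytes:
    """Encode `key` as UTF-8 and check that it is exactly `size` bytes long.

    Hypothetical helper for illustration; not a pyarrow function.
    """
    raw = key.encode("utf-8")
    if len(raw) != size:
        raise ValueError(
            f"Object ID must be {size} bytes, got {len(raw)} for {key!r}"
        )
    return raw


raw_id = object_id_bytes("keynumber1keynumber1")
print(len(raw_id))

# With a running store, the result would be passed on as in the thread:
# pid = plasma.ObjectID(raw_id)
```

The length is checked on the encoded bytes rather than on the string, since non-ASCII characters take more than one byte in UTF-8.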