Seems like we might want to write down some best practices for this
level of large-scale usage, essentially a supercomputer-like rig. I
wouldn't even know where to come by a machine with more than 2TB of
memory for scalability / concurrency load testing.

On Mon, Jul 16, 2018 at 2:59 PM, Robert Nishihara
<robertnishih...@gmail.com> wrote:
> Are you using the same plasma client from all of the different threads? If
> so, that could cause race conditions as the client is not thread safe.
>
> Alternatively, if you have a separate plasma client for each thread, then
> you may be running out of file descriptors somewhere (either the client
> process or the store).
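>
> For example, something along these lines (a sketch; the socket path and
> pool size are placeholders, and the three-argument connect() signature is
> the pyarrow 0.9.x one):
>
>     import multiprocessing
>
>     import pyarrow.plasma as plasma
>
>     _client = None  # one client per worker process; never shared across threads
>
>     def init_worker():
>         global _client
>         # pyarrow 0.9.x: connect(store_socket, manager_socket, release_delay)
>         _client = plasma.connect("/tmp/plasma", "", 0)
>
>     def process_file(path):
>         pass  # use _client here; it is private to this worker process
>
>     pool = multiprocessing.Pool(processes=8, initializer=init_worker)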
>
> Can you check whether the object store is evicting objects (it prints
> something to stdout/stderr when this happens)? Could you be running out of
> memory but failing to release the objects?
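>
> If the goal is just to reclaim space, you could also try dropping your
> Python references to the buffers and explicitly asking the store to evict
> sealed objects (a sketch; the byte count is arbitrary):
>
>     del buf  # release the memory-mapped buffer reference first
>     client.evict(2 * 1024 ** 3)  # ask the store to evict ~2GB of sealed objects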
>
> On Tue, Jul 10, 2018 at 9:48 AM Corey Nolet <cjno...@gmail.com> wrote:
>
>> Update:
>>
>> I'm investigating the possibility that I've reached the overcommit limit in
>> the kernel as a result of all the parallel processes.
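>>
>> One way to check (a sketch using the standard Linux proc files) is to
>> compare Committed_AS against CommitLimit:
>>
>>     with open("/proc/sys/vm/overcommit_memory") as f:
>>         print("overcommit_memory:", f.read().strip())  # 0, 1, or 2
>>     with open("/proc/meminfo") as f:
>>         for line in f:
>>             if line.startswith(("CommitLimit", "Committed_AS")):
>>                 print(line.strip())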
>>
>> This still doesn't fix the client.release() problem but it might explain
>> why the processing appears to halt, after some time, until I restart the
>> Jupyter kernel.
>>
>> On Tue, Jul 10, 2018 at 12:27 PM Corey Nolet <cjno...@gmail.com> wrote:
>>
>> > Wes,
>> >
>> > Unfortunately, my code is on a separate network. I'll try to explain what
>> > I'm doing, and if you need further detail, I can certainly pseudocode the
>> > specifics.
>> >
>> > I am using multiprocessing.Pool() to fire up a bunch of worker processes
>> > for different filenames. In each worker, I'm performing a pd.read_csv(),
>> > sorting by the timestamp field (rounded to the day), and chunking the
>> > DataFrame into separate DataFrames. I create a new Plasma ObjectID for each
>> > of the chunked DataFrames, convert them to RecordBatch objects, stream the
>> > bytes to Plasma, and seal the objects. Only the ObjectIDs are returned to
>> > the orchestration process.
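>> >
>> > Roughly, each worker does something like this (a sketch rather than my
>> > exact code; assume _client is the worker's own Plasma client created in a
>> > Pool initializer, and the "timestamp" column name and 20-byte random IDs
>> > are placeholders):
>> >
>> >     import numpy as np
>> >     import pandas as pd
>> >     import pyarrow as pa
>> >     import pyarrow.plasma as plasma
>> >
>> >     def put_df(client, df):
>> >         batch = pa.RecordBatch.from_pandas(df)
>> >         # first pass: measure the serialized size
>> >         mock = pa.MockOutputStream()
>> >         writer = pa.RecordBatchStreamWriter(mock, batch.schema)
>> >         writer.write_batch(batch)
>> >         writer.close()
>> >         # second pass: write into a buffer created in the store, then seal
>> >         object_id = plasma.ObjectID(np.random.bytes(20))
>> >         buf = client.create(object_id, mock.size())
>> >         writer = pa.RecordBatchStreamWriter(
>> >             pa.FixedSizeBufferWriter(buf), batch.schema)
>> >         writer.write_batch(batch)
>> >         writer.close()
>> >         client.seal(object_id)
>> >         return object_id
>> >
>> >     def process_file(path):
>> >         df = pd.read_csv(path, parse_dates=["timestamp"])
>> >         day_key = df["timestamp"].dt.floor("D")
>> >         return [(day, put_df(_client, chunk))
>> >                 for day, chunk in df.groupby(day_key)]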
>> >
>> > In follow-on processing, I'm combining the ObjectIDs for each of the
>> > unique day timestamps into lists, and I'm passing those lists into a
>> > function in parallel using multiprocessing.Pool(). In that function, I'm
>> > iterating through the lists of ObjectIDs, loading them back into
>> > DataFrames, appending them together until their size exceeds some
>> > predefined threshold, and then performing a df.to_parquet().
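>> >
>> > The follow-on step then looks roughly like this (again a sketch,
>> > continuing the one above; the 2GB threshold and output layout are
>> > placeholders, and on older pyarrow get_buffers() may instead be get()):
>> >
>> >     import os
>> >
>> >     def write_day(day, object_ids, threshold=2 * 1024 ** 3):
>> >         outdir = "out/%s" % day.strftime("%Y-%m-%d")
>> >         os.makedirs(outdir, exist_ok=True)
>> >         frames, nbytes, part = [], 0, 0
>> >         for oid in object_ids:
>> >             # read the sealed object back into a pandas DataFrame
>> >             [buf] = _client.get_buffers([oid])
>> >             table = pa.RecordBatchStreamReader(pa.BufferReader(buf)).read_all()
>> >             frames.append(table.to_pandas())
>> >             nbytes += frames[-1].memory_usage(deep=True).sum()
>> >             if nbytes > threshold:
>> >                 pd.concat(frames).to_parquet(
>> >                     "%s/part-%05d.parquet" % (outdir, part))
>> >                 part += 1
>> >                 frames, nbytes = [], 0
>> >         if frames:
>> >             pd.concat(frames).to_parquet(
>> >                 "%s/part-%05d.parquet" % (outdir, part))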
>> >
>> > The steps above are performed in a loop, batching up 500-1k files at a
>> > time for each iteration.
>> >
>> > After a few iterations of this loop, the Plasma client eventually locks
>> > up. As for the release() fault, it doesn't seem to matter when or where I
>> > call it (in the orchestration process or in the workers); it always seems
>> > to crash the Jupyter kernel. I suspect I'm using it wrong; I'm just trying
>> > to figure out where and how.
>> >
>> > Thanks again!
>> >
>> > On Tue, Jul 10, 2018 at 12:05 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> >> hi Corey,
>> >>
>> >> Can you provide the code (or a simplified version thereof) that shows
>> >> how you're using Plasma?
>> >>
>> >> - Wes
>> >>
>> >> On Tue, Jul 10, 2018 at 11:45 AM, Corey Nolet <cjno...@gmail.com> wrote:
>> >> > I'm on a system with 12TB of memory and attempting to use PyArrow's
>> >> > Plasma client to convert a series of CSV files (via pandas) into a
>> >> > Parquet store.
>> >> >
>> >> > I've got a little over 20k CSV files to process, each about 1-2GB. I'm
>> >> > loading 500 to 1000 files at a time.
>> >> >
>> >> > In each iteration, I'm loading a series of files, partitioning them by
>> >> > a time field into separate DataFrames, and then writing Parquet files
>> >> > into per-day directories.
>> >> >
>> >> > The problem I'm having is that the Plasma client & server appear to
>> >> > lock up after about 2-3 iterations. They lock up to the point where I
>> >> > can't even CTRL+C the server. I am able to stop the notebook, but
>> >> > retrying the code just continues to lock up when interacting with
>> >> > Jupyter. There are no errors in my logs to tell me something's wrong.
>> >> >
>> >> > Just to make sure I'm not simply being impatient and need to wait for
>> >> > some background work to finish, I allowed the code to run overnight,
>> >> > and it was still in the same state when I came in to work this morning.
>> >> > I'm running the Plasma server with a 4TB maximum.
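>> >> >
>> >> > For reference, I'm launching the store along these lines, where -m is
>> >> > the maximum memory in bytes and -s is the socket path (the path itself
>> >> > is a placeholder):
>> >> >
>> >> >     plasma_store -m 4000000000000 -s /tmp/plasma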
>> >> >
>> >> > In an attempt to proactively free up some of the ObjectIDs that I no
>> >> > longer need, I also tried to use the client.release() function, but I
>> >> > cannot seem to figure out how to make it work properly. It crashes my
>> >> > Jupyter kernel each time I try.
>> >> >
>> >> > I'm using PyArrow 0.9.0.
>> >> >
>> >> > Thanks in advance.
>> >>
>> >
>>
