Seems like we might want to write down some best practices for this level of
large-scale usage, essentially a supercomputer-like rig. I wouldn't even know
where to come by a machine with > 2TB of memory for scalability/concurrency
load testing.
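If we do write that up, the first entry is probably Robert's point below: one
plasma client per worker process, created inside the worker. A rough sketch of
that pattern (the socket path, pool size, and file list are made up, and the
connect() signature has changed across releases):

    # One client per worker *process*: plasma clients are not thread safe,
    # so never share one across threads. Socket path, pool size, and file
    # list are made-up values for illustration.
    import multiprocessing as mp

    import pandas as pd
    import pyarrow.plasma as plasma

    PLASMA_SOCKET = "/tmp/plasma"  # assumes: plasma_store -m ... -s /tmp/plasma

    def load_one(path):
        # Connect inside the worker so each process owns its client.
        # (On PyArrow 0.9 this was plasma.connect(PLASMA_SOCKET, "", 0).)
        client = plasma.connect(PLASMA_SOCKET)
        df = pd.read_csv(path)
        object_id = client.put(df)  # put() serializes, stores, and seals
        client.disconnect()
        # Return the raw 20 bytes; they pickle cleanly back to the parent.
        return object_id.binary()

    if __name__ == "__main__":
        files = ["day1.csv", "day2.csv"]  # placeholder file list
        with mp.Pool(processes=8) as pool:
            ids = pool.map(load_one, files)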
On Mon, Jul 16, 2018 at 2:59 PM, Robert Nishihara <robertnishih...@gmail.com> wrote:

> Are you using the same plasma client from all of the different threads? If
> so, that could cause race conditions, as the client is not thread safe.
>
> Alternatively, if you have a separate plasma client for each thread, then
> you may be running out of file descriptors somewhere (either in the client
> process or in the store).
>
> Can you check whether the object store is evicting objects (it prints
> something to stdout/stderr when this happens)? Could you be running out of
> memory but failing to release the objects?
>
> On Tue, Jul 10, 2018 at 9:48 AM Corey Nolet <cjno...@gmail.com> wrote:
>
>> Update:
>>
>> I'm investigating the possibility that I've reached the overcommit limit
>> in the kernel as a result of all the parallel processes.
>>
>> This still doesn't fix the client.release() problem, but it might explain
>> why the processing appears to halt, after some time, until I restart the
>> Jupyter kernel.
>>
>> On Tue, Jul 10, 2018 at 12:27 PM Corey Nolet <cjno...@gmail.com> wrote:
>>
>> > Wes,
>> >
>> > Unfortunately, my code is on a separate network. I'll try to explain
>> > what I'm doing, and if you need further detail, I can certainly
>> > pseudocode the specifics.
>> >
>> > I am using multiprocessing.Pool() to fire up a bunch of threads for
>> > different filenames. In each thread, I'm performing a pd.read_csv(),
>> > sorting by the timestamp field (rounded to the day), and chunking the
>> > DataFrame into separate DataFrames. I create a new Plasma ObjectID for
>> > each of the chunked DataFrames, convert them to RecordBatch objects,
>> > stream the bytes to Plasma, and seal the objects. Only the ObjectIDs
>> > are returned to the orchestration thread.
>> >
>> > In follow-on processing, I'm combining the ObjectIDs for each of the
>> > unique day timestamps into lists and passing those into a function in
>> > parallel using multiprocessing.Pool(). In this function, I'm iterating
>> > through the lists of ObjectIDs, loading them back into DataFrames, and
>> > appending them together until their size exceeds some predefined
>> > threshold, then performing a df.to_parquet().
>> >
>> > The steps in the two paragraphs above are performed in a loop, batching
>> > up 500-1k files at a time in each iteration.
>> >
>> > When I run this loop a few times, it eventually locks up the Plasma
>> > client. With regards to the release() fault, it doesn't seem to matter
>> > when or where I call it (in the orchestration thread or in other
>> > threads); it always crashes the Jupyter kernel. I'm thinking I might be
>> > using it wrong; I'm just trying to figure out where and how.
>> >
>> > Thanks again!
>> >
>> > On Tue, Jul 10, 2018 at 12:05 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> >> hi Corey,
>> >>
>> >> Can you provide the code (or a simplified version thereof) that shows
>> >> how you're using Plasma?
>> >>
>> >> - Wes
>> >>
>> >> On Tue, Jul 10, 2018 at 11:45 AM, Corey Nolet <cjno...@gmail.com> wrote:
>> >> > I'm on a system with 12TB of memory and attempting to use Pyarrow's
>> >> > Plasma client to convert a series of CSV files (via Pandas) into a
>> >> > Parquet store.
>> >> >
>> >> > I've got a little over 20k CSV files to process, which are about
>> >> > 1-2gb each. I'm loading 500 to 1000 files at a time.
>> >> >
>> >> > In each iteration, I'm loading a series of files, partitioning them
>> >> > by a time field into separate DataFrames, then writing Parquet files
>> >> > into directories for each day.
>> >> >
>> >> > The problem I'm having is that the Plasma client & server appear to
>> >> > lock up after about 2-3 iterations. It locks up to the point where I
>> >> > can't even CTRL+C the server. I am able to stop the notebook, but
>> >> > re-running the code just continues to lock up when interacting with
>> >> > Jupyter. There are no errors in my logs to tell me something's wrong.
>> >> >
>> >> > Just to make sure I'm not simply being impatient and possibly need
>> >> > to wait for some background services to finish, I allowed the code
>> >> > to run overnight, and it was still in the same state when I came in
>> >> > to work this morning. I'm running the Plasma server with a 4TB
>> >> > maximum.
>> >> >
>> >> > In an attempt to proactively free up some of the object IDs I no
>> >> > longer need, I also attempted to use the client.release() function,
>> >> > but I cannot seem to figure out how to make it work properly. It
>> >> > crashes my Jupyter kernel each time I try.
>> >> >
>> >> > I'm using Pyarrow 0.9.0.
>> >> >
>> >> > Thanks in advance.
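As a starting point for such a write-up, here is a rough sketch of the
chunk-to-Plasma round trip Corey describes above, following the RecordBatch
streaming pattern from the Plasma docs. The socket path and random 20-byte IDs
are illustrative, and the client API has moved between releases (connect()'s
signature, get_buffers()), so treat it as the shape of the approach rather
than 0.9-exact code:

    # Sketch of the per-chunk write path: DataFrame -> RecordBatch -> sized
    # buffer in the store -> seal; plus the read-back. Names and values are
    # illustrative, not taken from the actual job.
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.plasma as plasma

    client = plasma.connect("/tmp/plasma")  # 0.9: plasma.connect("/tmp/plasma", "", 0)

    def put_chunk(client, df):
        batch = pa.RecordBatch.from_pandas(df)

        # Pass 1: measure the serialized size with a throwaway sink.
        mock = pa.MockOutputStream()
        writer = pa.RecordBatchStreamWriter(mock, batch.schema)
        writer.write_batch(batch)
        writer.close()

        # Pass 2: create a buffer of exactly that size in the store, write
        # the batch into it, and seal so other clients can see it.
        object_id = plasma.ObjectID(np.random.bytes(20))
        buf = client.create(object_id, mock.size())
        writer = pa.RecordBatchStreamWriter(pa.FixedSizeBufferWriter(buf),
                                            batch.schema)
        writer.write_batch(batch)
        writer.close()
        client.seal(object_id)
        return object_id

    def get_chunk(client, object_id):
        # The returned buffer is a zero-copy view; the store pins the object
        # while the buffer is alive. Dropping all references to it is what
        # hands the object back to the store in the Python API, rather than
        # an explicit release() call.
        [data] = client.get_buffers([object_id])
        reader = pa.RecordBatchStreamReader(pa.BufferReader(data))
        return reader.read_next_batch().to_pandas()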