On Jan 2, 2018 18:27, Rustom Mody <rustompm...@gmail.com> wrote:
>
> Someone who works in Hadoop asked me:
>
> If our data is in terabytes, can we do statistical (i.e. numpy, pandas,
> etc.) analysis on it?
>
> I said: No (I don't think so, at least!), i.e. I expect numpy (pandas,
> etc.) not to work if the data does not fit in memory.
>
> Well, sure, *Python* can handle (streams of) terabyte data, I guess;
> *numpy* cannot.
>
> Is there a more sophisticated answer?
>
> ["Terabyte" is just a figure of speech for "too large for main memory".]
Have a look at PySpark and pyspark.ml. PySpark has its own kind of
DataFrame. Very, very cool stuff.

Dask DataFrames have already been mentioned. numpy also has memory-mapped
arrays:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.memmap.html
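
For the memmap route, here is a rough sketch (the file name, shape and
chunk size are made up for illustration) of reducing over a disk-backed
array that would not fit in RAM; only the slices you touch get paged in:

    import numpy as np

    # Disk-backed array much larger than typical RAM, reduced chunk by chunk.
    n_rows, n_cols = 200_000_000, 5          # ~8 GB of float64, on disk
    # mode="w+" creates the backing file; use mode="r" to map an existing one
    data = np.memmap("huge.dat", dtype=np.float64, mode="w+",
                     shape=(n_rows, n_cols))

    chunk = 1_000_000
    col_sums = np.zeros(n_cols)
    for start in range(0, n_rows, chunk):
        # Only the pages touched by this slice are pulled into memory
        col_sums += data[start:start + chunk].sum(axis=0)
    print(col_sums)

And if you want to keep a pandas-style API, dask.dataframe covers much of
it while partitioning the data and evaluating lazily (the file pattern and
column names below are invented):

    import dask.dataframe as dd

    df = dd.read_csv("logs-*.csv")            # lazy; nothing is loaded yet
    # compute() runs the aggregation out of core, partition by partition
    print(df.groupby("user_id")["bytes"].sum().compute())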