On Jan 2, 2018 18:27, Rustom Mody <rustompm...@gmail.com> wrote:
>
> Someone who works in Hadoop asked me:
>
> If our data is in terabytes, can we do statistical (i.e. numpy, pandas,
> etc.) analysis on it?
>
> I said: No (I don't think so, at least!), i.e. I expect numpy (pandas,
> etc.) not to work if the data does not fit in memory.
>
> Well, sure, *Python* can handle (streams of) terabyte data, I guess;
> *numpy* cannot.
>
> Is there a more sophisticated answer?
>
> ["Terabyte" is just a figure of speech for "too large for main memory".]
Have a look at PySpark and pyspark.ml. PySpark has its own kind of
DataFrame. Very, very cool stuff.

Dask DataFrames have already been mentioned. numpy also has memory-mapped
arrays:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.memmap.html
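
For the memmap route, here is a rough sketch (the file name, shape and
chunk size are made up for illustration) of reducing over a disk-backed
array that would not fit in RAM; only the slices you touch get paged in:

    import numpy as np

    # Disk-backed array much larger than typical RAM, reduced chunk by chunk.
    n_rows, n_cols = 200_000_000, 5          # ~8 GB of float64, on disk
    # mode="w+" creates the backing file; use mode="r" to map an existing one
    data = np.memmap("huge.dat", dtype=np.float64, mode="w+",
                     shape=(n_rows, n_cols))

    chunk = 1_000_000
    col_sums = np.zeros(n_cols)
    for start in range(0, n_rows, chunk):
        # Only the pages touched by this slice are pulled into memory
        col_sums += data[start:start + chunk].sum(axis=0)
    print(col_sums)

And if you want to keep a pandas-style API, dask.dataframe covers much of
it while partitioning the data and evaluating lazily (the file pattern and
column names below are invented):

    import dask.dataframe as dd

    df = dd.read_csv("logs-*.csv")            # lazy; nothing is loaded yet
    # compute() runs the aggregation out of core, partition by partition
    print(df.groupby("user_id")["bytes"].sum().compute())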