Re: Numpy and Terabyte data

2018-01-03 Thread Albert-Jan Roskam
On Jan 2, 2018 18:27, Rustom Mody wrote:
> Someone who works in hadoop asked me:
>
> If our data is in terabytes, can we do statistical (i.e. numpy, pandas,
> etc.) analysis on it?
>
> I said: No (I don't think so, at least!), i.e. I expect numpy (pandas,
> etc.) not to work if the data does not fit in memory …

Re: Numpy and Terabyte data

2018-01-02 Thread Rustom Mody
On Wednesday, January 3, 2018 at 1:43:40 AM UTC+5:30, Paul Moore wrote:
> On 2 January 2018 at 17:24, Rustom Mody wrote:
> > Someone who works in hadoop asked me:
> >
> > If our data is in terabytes, can we do statistical (i.e. numpy, pandas,
> > etc.) analysis on it?
> >
> > I said: No (I don't think so, at least!) …

Re: Numpy and Terabyte data

2018-01-02 Thread Paul Moore
On 2 January 2018 at 17:24, Rustom Mody wrote:
> Someone who works in hadoop asked me:
>
> If our data is in terabytes, can we do statistical (i.e. numpy, pandas,
> etc.) analysis on it?
>
> I said: No (I don't think so, at least!), i.e. I expect numpy (pandas,
> etc.) not to work if the data does not fit in memory …

Re: Numpy and Terabyte data

2018-01-02 Thread Irving Duran
I've never heard of, or done, that type of testing on a large dataset solely in Python, so I don't know where the cap is, memory-wise, on what Python can handle given the available memory. Now, if I understand what you are trying to do, you can achieve that by leveraging Apache Spark and invo…
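A minimal sketch of the Spark route suggested above, assuming a CSV dataset on HDFS and the standard pyspark API (the path and the "value" column are assumptions for illustration):

    # Hypothetical sketch: summary statistics over a terabyte-scale CSV
    # with PySpark. Spark partitions the work across the cluster, so the
    # data never has to fit in a single machine's memory.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("tb-stats").getOrCreate()

    # Path, header, and schema inference are illustrative assumptions.
    df = spark.read.csv("hdfs:///data/huge.csv", header=True, inferSchema=True)

    # count, mean, stddev, min, max for the column, computed distributed:
    df.describe("value").show()

    # Or request specific aggregates explicitly:
    df.agg(F.min("value"), F.max("value"), F.avg("value")).show()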

Re: Numpy and Terabyte data

2018-01-02 Thread jason
I'm not sure if I'll be laughed at, but a statistical sampling of a randomized sample should resemble the whole.

If you need the min/max, then: min(min(each node)).

If you need the average, then you need: sum(sum(each node)) / sum(count(each node))*

*You'll likely need to use logs here, as you'll probably overflow …
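That merge rule is easy to sketch in plain Python, assuming each node has already reduced its shard to a (min, max, sum, count) tuple; the partials below are invented for illustration:

    # Sketch: merging per-node partial aggregates into global statistics.
    # Each tuple is (min, max, sum, count) for one node's shard.
    partials = [
        (0.3, 98.5, 1.2e9, 4_000_000),   # node 1 (invented numbers)
        (0.1, 97.2, 1.1e9, 3_800_000),   # node 2
        (0.7, 99.9, 1.3e9, 4_100_000),   # node 3
    ]

    global_min = min(p[0] for p in partials)    # min of the per-node mins
    global_max = max(p[1] for p in partials)    # max of the per-node maxes
    global_sum = sum(p[2] for p in partials)    # sum of the per-node sums
    global_count = sum(p[3] for p in partials)  # sum of the per-node counts
    global_mean = global_sum / global_count

    print(global_min, global_max, global_mean)

Note that Python's integers are arbitrary precision, so the overflow caveat mainly applies to floating-point sums.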

Numpy and Terabyte data

2018-01-02 Thread Rustom Mody
Someone who works in hadoop asked me:

If our data is in terabytes, can we do statistical (i.e. numpy, pandas, etc.) analysis on it?

I said: No (I don't think so, at least!), i.e. I expect numpy (pandas, etc.) not to work if the data does not fit in memory.

Well, sure, *python* can handle (streams of) terabyte data …
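For the single-machine case the question asks about, pandas can in fact stream a file larger than RAM; a minimal sketch, assuming a large CSV with a "value" column (the filename, column name, and chunk size are illustrative):

    # Sketch: a streaming mean over a CSV that does not fit in memory.
    # read_csv(chunksize=...) yields DataFrames of bounded size, so only
    # one chunk is ever held in memory at a time.
    import pandas as pd

    total = 0.0
    count = 0
    for chunk in pd.read_csv("huge.csv", chunksize=1_000_000):
        total += chunk["value"].sum()
        count += len(chunk)

    print("mean:", total / count)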