On Jan 2, 2018 18:27, Rustom Mody wrote:

Someone who works in Hadoop asked me:

If our data is in terabytes, can we do statistical (i.e. numpy, pandas, etc.)
analysis on it?

I said: No (I don't think so, at least!), i.e. I expect numpy (pandas, etc.)
not to work if the data does not fit in memory.
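For what it's worth, the usual workaround is to stream the data rather than load
it whole. A minimal sketch with pandas' chunked CSV reader (the file and column
names are made up):

import pandas as pd

total = 0.0
count = 0
# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is ever held in memory at a time.
for chunk in pd.read_csv("measurements.csv", chunksize=1_000_000):
    total += chunk["value"].sum()
    count += len(chunk)

print("mean:", total / count)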
I've never heard of, or done, that type of testing for a large dataset solely in
Python, so I don't know what the cap is on what Python can handle, given the
available memory. Now, if I understand what you are trying to do, you can
achieve that by leveraging Apache Spark and invoking it from Python.
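A rough sketch of what that could look like from Python via pyspark, assuming
the data sits in CSV files on HDFS (the path and column name are made up):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tb-stats").getOrCreate()

# Spark distributes the scan and the aggregation across the cluster,
# so no single machine has to hold the whole dataset in memory.
df = spark.read.csv("hdfs:///data/events/*.csv", header=True, inferSchema=True)

df.select(
    F.count("value").alias("n"),
    F.mean("value").alias("mean"),
    F.min("value").alias("min"),
    F.max("value").alias("max"),
).show()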
I'm not sure if I'll be laughed at, but statistics computed on a randomized
sample should resemble the whole.
If you need the min, then it's min(min(each node)) (and similarly for max).
If you need the average, then it's sum(sum(each node)) / sum(count(each node)).*
*You'll likely need to use logs here, as you'll probably overflow.
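As a toy illustration of that combine step, assuming each node has already
reduced its shard to a (min, sum, count) triple (the numbers are made up):

# Hypothetical per-node partial results: (min, sum, count).
partials = [
    (0.3, 1_250_000.0, 500_000),
    (0.1, 2_480_000.0, 990_000),
    (0.7, 640_000.0, 255_000),
]

global_min = min(p[0] for p in partials)      # min(min(each node))
global_sum = sum(p[1] for p in partials)      # sum(sum(each node))
global_count = sum(p[2] for p in partials)    # sum(count(each node))

print(global_min, global_sum / global_count)

(Python's ints are arbitrary precision, so the overflow concern applies mainly
to fixed-width types on the nodes themselves.)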
Well sure, *python* can handle (streams of) terabyte data.
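For example, a single pass over a stream with constant memory, here computing
the mean and variance with Welford's online algorithm (the one-number-per-line
input file is hypothetical):

def running_mean_var(values):
    # Welford's online algorithm: one pass, O(1) memory.
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, (m2 / (n - 1) if n > 1 else 0.0)

def stream(path):
    # Lazily yield one float per line; the file is never loaded whole.
    with open(path) as f:
        for line in f:
            yield float(line)

mean, var = running_mean_var(stream("huge_file.txt"))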