Hi all, I have a batch dataset and I want to compute some standard statistics for its columns (min, max, avg, etc.). To achieve this I wrote a simple program that uses SQL on the Table API, like the following:
SELECT MAX(col1), MIN(col1), AVG(col1),
       MAX(col2), MIN(col2), AVG(col2),
       MAX(col3), MIN(col3), AVG(col3)
FROM MYTABLE

My dataset has about 50 fields, so the query becomes quite big (and so does the job plan). This kind of job seems to cause the cluster to crash (too much garbage collection). Is there any smarter way to achieve this goal (apart from running one job per column)? Is this "normal", or is it a bug in Flink?

Best,
Flavio
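P.S. For reference, a minimal sketch of the kind of program I mean is below. The column list is just a placeholder, MYTABLE is assumed to be registered already, and the environment setup follows the current Table API, so it may look different depending on the Flink version:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

import java.util.ArrayList;
import java.util.List;

public class ColumnStats {
    public static void main(String[] args) {
        // Batch-mode table environment
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // MYTABLE is assumed to be registered already (e.g. via CREATE TABLE DDL or a catalog)

        // Build MAX/MIN/AVG expressions for every column; with ~50 columns this
        // yields ~150 aggregate expressions in a single SELECT
        String[] columns = {"col1", "col2", "col3" /* ... up to col50 */};
        List<String> aggs = new ArrayList<>();
        for (String c : columns) {
            aggs.add("MAX(" + c + ")");
            aggs.add("MIN(" + c + ")");
            aggs.add("AVG(" + c + ")");
        }
        String query = "SELECT " + String.join(", ", aggs) + " FROM MYTABLE";

        // Execute the query and print the single result row
        tEnv.executeSql(query).print();
    }
}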