What do you advise for computing column stats? Should I run multiple jobs (one per column) or try to compute everything at once?
Are you ever going to consider supporting ANALYZE TABLE (like in Hive or Spark) in the Flink Table API?

Best,
Flavio

On Thu, Nov 29, 2018 at 9:45 AM Fabian Hueske <fhue...@gmail.com> wrote:
> Hi,
>
> You could try to enable object reuse.
> Alternatively, you can give the job more heap memory or fine-tune the GC parameters.
>
> I would not consider it a bug in Flink, but it might be something that could be improved.
>
> Fabian
>
> On Wed, Nov 28, 2018 at 18:19, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>
>> Hi to all,
>> I have a batch dataset and I want to get some standard info about its
>> columns (like min, max, avg, etc.).
>> To achieve this I wrote a simple program that uses SQL on the Table
>> API, like the following:
>>
>> SELECT
>>   MAX(col1), MIN(col1), AVG(col1),
>>   MAX(col2), MIN(col2), AVG(col2),
>>   MAX(col3), MIN(col3), AVG(col3)
>> FROM MYTABLE
>>
>> In my dataset I have about 50 fields, so the query becomes quite big (and
>> the job plan too).
>> It seems that this kind of job causes the cluster to crash (too much
>> garbage collection).
>> Is there any smarter way to achieve this goal (apart from running a job
>> per column)?
>> Is this "normal" or is this a bug in Flink?
>>
>> Best,
>> Flavio
>>
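For reference, Fabian's object-reuse suggestion can be applied on the batch ExecutionEnvironment before running the stats query. Below is a minimal sketch, not the thread's actual program: it assumes the legacy DataSet/BatchTableEnvironment API (Flink 1.9-1.13; older versions use TableEnvironment.getTableEnvironment(env) instead), and it assumes a table named MYTABLE with columns col1..col3 as in the quoted query.

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;

public class ColumnStatsJob {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Fabian's suggestion: reuse objects instead of allocating a new instance
        // per record. This lowers GC pressure, but is only safe if user functions
        // do not cache or mutate the records they receive.
        env.getConfig().enableObjectReuse();

        BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);

        // Assumes MYTABLE was registered beforehand, e.g.:
        // tEnv.registerDataSet("MYTABLE", someDataSet);
        Table stats = tEnv.sqlQuery(
            "SELECT " +
            "  MAX(col1), MIN(col1), AVG(col1), " +
            "  MAX(col2), MIN(col2), AVG(col2), " +
            "  MAX(col3), MIN(col3), AVG(col3) " +
            "FROM MYTABLE");

        // Printing the single result row triggers execution of the batch job.
        tEnv.toDataSet(stats, Row.class).print();
    }
}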
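On the "one job per column vs. everything at once" question, a possible middle ground (not something proposed in the thread) is to compute the statistics in small groups of columns, so each query plan stays small while still covering all ~50 fields in a handful of jobs. A hedged sketch under those assumptions; the helper, the chunk size, and the column names are illustrative only:

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;

public class ChunkedColumnStats {

    /**
     * Computes MAX/MIN/AVG for the given columns, issuing one SQL query
     * per chunk of `chunkSize` columns instead of one huge query.
     */
    static List<Row> computeStats(BatchTableEnvironment tEnv, String table,
                                  List<String> columns, int chunkSize) throws Exception {
        List<Row> results = new ArrayList<>();
        for (int i = 0; i < columns.size(); i += chunkSize) {
            List<String> chunk = columns.subList(i, Math.min(i + chunkSize, columns.size()));

            // Build "SELECT MAX(c), MIN(c), AVG(c), ... FROM table" for this chunk.
            StringBuilder select = new StringBuilder("SELECT ");
            for (int c = 0; c < chunk.size(); c++) {
                if (c > 0) {
                    select.append(", ");
                }
                select.append(String.format("MAX(%1$s), MIN(%1$s), AVG(%1$s)", chunk.get(c)));
            }
            select.append(" FROM ").append(table);

            Table stats = tEnv.sqlQuery(select.toString());
            // Each collect() runs a separate batch job with a small plan.
            results.addAll(tEnv.toDataSet(stats, Row.class).collect());
        }
        return results;
    }
}

With, say, chunks of 10 columns this would run 5 jobs instead of 50, which may avoid the oversized plan without paying the per-column job overhead.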