I'd try to tune it in a single query. If that does not work, go for as few queries as possible, splitting by column for better projection push-down.
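For example, a rough sketch of the second option (untested; it assumes the Flink 1.7-era Table API with a BatchTableEnvironment, the MYTABLE from the query quoted below, and an arbitrary group size):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;

public class ColumnStatsJob {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    BatchTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);
    // ... register MYTABLE here, e.g. tEnv.registerTable("MYTABLE", myTable);

    String[] columns = {"col1", "col2", "col3"}; // ... list all ~50 columns
    int groupSize = 10; // columns aggregated per query; tune to taste

    for (int i = 0; i < columns.length; i += groupSize) {
      StringBuilder sql = new StringBuilder("SELECT ");
      for (int j = i; j < Math.min(i + groupSize, columns.length); j++) {
        if (j > i) {
          sql.append(", ");
        }
        String c = columns[j];
        sql.append("MAX(").append(c).append("), MIN(").append(c)
           .append("), AVG(").append(c).append(")");
      }
      sql.append(" FROM MYTABLE");

      Table stats = tEnv.sqlQuery(sql.toString());
      DataSet<Row> result = tEnv.toDataSet(stats, Row.class);
      result.print(); // one (smaller) job per column group
    }
  }
}

Each print() triggers a separate, much smaller job, and because each query only references a subset of the columns, projection push-down can prune the rest at the source.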
This is the first time I hear somebody requesting ANALYZE TABLE. I don't
see a reason why it shouldn't be added in the future.

On Thu, Nov 29, 2018 at 12:08 PM Flavio Pompermaier <pomperma...@okkam.it> wrote:

> What do you advise to compute column stats?
> Should I run multiple jobs (one per column) or try to compute all at once?
>
> Are you ever going to consider supporting ANALYZE TABLE (like in Hive or
> Spark) in the Flink Table API?
>
> Best,
> Flavio
>
> On Thu, Nov 29, 2018 at 9:45 AM Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi,
>>
>> You could try to enable object reuse.
>> Alternatively, you can give it more heap memory or fine-tune the GC
>> parameters.
>>
>> I would not consider it a bug in Flink, but it might be something that
>> could be improved.
>>
>> Fabian
>>
>> On Wed, Nov 28, 2018 at 6:19 PM Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>
>>> Hi to all,
>>> I have a batch dataset and I want to get some standard info about its
>>> columns (like min, max, avg, etc.).
>>> To achieve this I wrote a simple program that uses SQL on the Table API,
>>> like the following:
>>>
>>> SELECT
>>>   MAX(col1), MIN(col1), AVG(col1),
>>>   MAX(col2), MIN(col2), AVG(col2),
>>>   MAX(col3), MIN(col3), AVG(col3)
>>> FROM MYTABLE
>>>
>>> My dataset has about 50 fields, so the query (and the job plan) becomes
>>> quite big.
>>> It seems that this kind of job causes the cluster to crash (too much
>>> garbage collection).
>>> Is there any smarter way to achieve this goal (apart from running one
>>> job per column)?
>>> Is this "normal" or is it a bug in Flink?
>>>
>>> Best,
>>> Flavio
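For reference, the object-reuse setting suggested further up in the thread is a one-line switch on the ExecutionConfig; a minimal sketch (untested, against the Flink 1.7-era DataSet API):

import org.apache.flink.api.java.ExecutionEnvironment;

public class ObjectReuseSetup {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    // Let Flink reuse record objects between operators instead of
    // allocating a new instance per record, which reduces GC pressure.
    env.getConfig().enableObjectReuse();

    // ... build the TableEnvironment and run the stats queries as above ...
  }
}

Giving the TaskManagers more heap (taskmanager.heap.size in flink-conf.yaml, if I recall the key correctly) or tuning the JVM GC options are the other knobs mentioned above.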