Hi Wojtek,

Thanks for explaining the detail. I understand you have a larger amount of
data than `hive.fetch.task.conversion.threshold`.

Taking a glance, SimpleFetchOptimizer is likely to respect LIMIT if
`hive.fetch.task.caching` is disabled and all predicates are for partition
pruning. The case you mentioned [1] would work as far as I tried.

Could you check if your query works as expected with
`hive.fetch.task.caching=false`? Looks like, HIVE-25976 [2] introduced the
option and I'm not sure why Hive ignores LIMIT with it enabled.

Regards,
Okumin

- [1] select * from tbl limit 100
- [2] https://issues.apache.org/jira/browse/HIVE-25976

On Thu, Aug 3, 2023 at 9:14 PM Wojtek Meler <wme...@wp.pl> wrote:

> Disabling hive.limit.optimize.enable make situation even worse - TEZ job
> scans all files in partition which is unnecessary.
>
> I've run debugger and discovered that partition size in my table obtained
> from metastore  exceed  hive.fetch.task.conversion.threshold.
> It seems that since hive-1.0.1 optimization process changed a lot -
> earlier version used estimator to get data size for the job, now hive use
> stats from metastore if they are available.
> SimpleFetchOpitimizer is unaware of GlobalLimitOptimizer and input data
> size reduction through partition prunning.
>
> Solution for me is to tune the threshold and maybe fill RFE to change
> SimpleFetchOptimizer behaviour in case of GlobalLimitOptimizer qualified
> the optimize that reduces input size for limit.
>
> Regards
> Wojtek
>
> Dnia 06 lipca 2023 18:07 Okumin <m...@okumin.com> napisał(a):
>
> Hi Wojtek,
> I tried to submit the query with the given configurations on
> Hive 4.0.0-alpha-2 on Tez on YARN. In my environment, the query is
> converted to a single fetch task.
> Could you please give us the precise revision of Hive, your table
> definition, the amount of data, and so on? Also, I'm curious what if you
> disable `hive.limit.optimize.enable`.
> Regards,
> Okumin
> On Wed, Jul 5, 2023 at 9:40 PM Wojtek Meler <wme...@wp.pl> wrote:
>
> Hi, after switching to Hive 4.0 and Tez on yarn I've noticed that simple
> fetch queries run much longer.
> I have following configuration:
> hive.fetch.task.conversion=more
> hive.fetch.task.conversion.threshold=1073741824
> hive.limit.optimize.enable=true
> hive.limit.optimize.fetch.max=50000
> hive.limit.optimize.limit.file=10
> hive.limit.pushdown.memory.usage=0.1
> hive.limit.row.max.size=100000
> and query
> select * from tbl limit 100
> runs on Tez containers instead of being run internaly inside hiveserver2.
> How to configure fetch task conversion properly as even clicking on Hue
> has bad experience with data previews run on Tez on yarn...
> Regards,
> Wojtek
>
>
>
>

Reply via email to