Hi Zsolt,

It's an interesting suggestion. I often use hooks.

It does seem that some hooks, or parts of some hooks, could run
asynchronously. However, as Stamatis says, some hooks are not safe to
run asynchronously. For example, DisallowTransformHook[1] is expected to
complete before a DAG starts and is probably not interruptible.
OperatorStatsReaderHook[2] is not thread-safe because it touches
Configuration and mutates state. In my opinion, a hook developer should
explicitly mark the hooks, or the parts of hooks, that are safe to run
asynchronously.

So I guess we need to design a good API that lets Hive identify which
parts are safe to run asynchronously. Does anyone have a good idea?
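
To make the question concrete, here is a minimal sketch of one possible
shape. Everything in it is an assumption for discussion: the @AsyncSafe
annotation, the simplified Hook interface (a stand-in for
ExecuteWithHookContext), and the runHooks dispatcher are hypothetical
names, not existing Hive APIs.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncHookSketch {

  /** Hypothetical marker: the hook author declares the hook safe to run off the query thread. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.TYPE)
  public @interface AsyncSafe {}

  /** Simplified stand-in for Hive's ExecuteWithHookContext (assumption, not the real interface). */
  public interface Hook {
    void run(String hookContext);
  }

  /** Unmarked hook: must complete before the caller proceeds. */
  public static class GuardHook implements Hook {
    public volatile String ranOn;
    @Override
    public void run(String hookContext) {
      ranOn = Thread.currentThread().getName();
    }
  }

  /** Marked hook: the author opted in to asynchronous execution. */
  @AsyncSafe
  public static class AuditHook implements Hook {
    public volatile String ranOn;
    @Override
    public void run(String hookContext) {
      ranOn = Thread.currentThread().getName();
    }
  }

  /**
   * Runs unmarked hooks synchronously on the calling thread and submits
   * @AsyncSafe hooks to the given pool. Returns the futures of the
   * asynchronous hooks so the caller can decide whether to wait for them.
   */
  public static List<CompletableFuture<Void>> runHooks(
      List<Hook> hooks, String hookContext, ExecutorService pool) {
    List<CompletableFuture<Void>> pending = new ArrayList<>();
    for (Hook hook : hooks) {
      if (hook.getClass().isAnnotationPresent(AsyncSafe.class)) {
        pending.add(CompletableFuture.runAsync(() -> hook.run(hookContext), pool));
      } else {
        hook.run(hookContext); // default: keep today's synchronous behavior
      }
    }
    return pending;
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    GuardHook guard = new GuardHook();
    AuditHook audit = new AuditHook();
    List<CompletableFuture<Void>> pending =
        runHooks(Arrays.asList(guard, audit), "ctx", pool);
    System.out.println("guard ran on " + guard.ranOn);
    pending.forEach(CompletableFuture::join);
    System.out.println("audit ran on " + audit.ranOn);
    pool.shutdown();
  }
}
```

In this sketch an unmarked hook stays synchronous, so existing hooks
would keep their current behavior unless their authors opt in. It does
not address timeouts or interruptibility, and it marks whole hooks
rather than parts of a hook.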

- [1] https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/security/authorization/plugin/DisallowTransformHook.java
- [2] https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/OperatorStatsReaderHook.java

Regards,
Okumin

On Wed, Sep 25, 2024 at 6:57 PM Stamatis Zampetakis <zabe...@gmail.com> wrote:
>
> Hey Zsolt,
>
> The same argument could be generalized to many phases of the query
> execution. If something goes bad when compiling or running the query
> there is a risk that the whole HS2 could go down.
>
> In many cases, adding asynchronous executions seems to alleviate a
> problem but at the same time it makes the code more complex and error
> prone. Moreover, whenever we add more thread pools we have to be
> mindful that this can affect the performance of the whole process. If
> a hook goes wild and takes a lot of time, then even if it runs on a
> separate thread it will have an impact on the whole server and it may
> be even more difficult to diagnose what is going wrong. Furthermore,
> adding a timeout assumes that the hooks are interruptible which might
> not always be the case.
>
> Bottom line is that there are trade-offs to consider and with the
> information I have so far I am neither for nor against this proposal.
>
> Best,
> Stamatis
>
>
> On Wed, Sep 25, 2024 at 10:21 AM Zsolt Miskolczi
> <zsolt.miskol...@gmail.com> wrote:
> >
> > Hi folks!
> >
> > At this point, Hive hooks are running synchronously:
> >
> > for (ExecuteWithHookContext hook : hooks) {
> >   perfLogger.perfLogBegin(CLASS_NAME, prefix + hook.getClass().getName());
> >   hook.run(hookContext);
> >   perfLogger.perfLogEnd(CLASS_NAME, prefix + hook.getClass().getName());
> > }
> >
> > My current problem is that if anything goes wrong inside a hook, it
> > slows Hive down. In our case, we have a hook with retry logic in it,
> > and it consumed a lot of time on the Hive side.
> >
> > I'm thinking about two possible solutions:
> > - running the hooks asynchronously, so that Hive wouldn't even care about 
> > how long the hooks are running
> > - adding a timeout (like 2 seconds) to run the hook.
> >
> > What are your thoughts?
> >
> > Thank you,
> > Zsolt
