Re: Maintaining big and complex Hive queries

Stephen Sprague Wed, 21 Dec 2016 12:03:07 -0800

my 2 cents. :)

as soon as you say "complex query" i would submit you've lost the upperhand
and you're behind the eight-ball right off the bat.  And you know this too
otherwise you wouldn't have posted here. ha!


i use cascading CTAS statements so that i can examine the intermediate
tables.  Another approach is to use CTE's but while that makes things
easier to read it's still one big query and you don't get insight to the
"work" tables.

yes, it could take longer execution time if those intermediate tables can't
be run in parallel but small price to pay compared to human debug time in
my book anyway.

thoughts?

Cheers,
Stephen.





On Wed, Dec 21, 2016 at 10:07 AM, Saumitra Shahapure <
saumitra.offic...@gmail.com> wrote:

> Hi Elliot,
>
> Thanks for letting me know. HPL-SQL sounded particularly interesting. But
> in the documentation I could not see any way to pass output generated by
> one Hive query to the next one. The tool looks good as a homogeneous PL-SQL
> platform for multiple big-data systems (http://www.hplsql.org/about).
>
> However in order to break single complex hive query, DDLs look to be only
> way in HPL-SQL too. Or is there any alternate way that I might have missed?
>
> -- Saumitra S. Shahapure
>
> On Thu, Dec 15, 2016 at 6:21 PM, Elliot West <tea...@gmail.com> wrote:
>
>> I notice that HPL/SQL is not mentioned on the page I referenced, however
>> I expect that is another approach that you could use to modularise:
>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
>> http://www.hplsql.org/doc
>>
>> On 15 December 2016 at 17:17, Elliot West <tea...@gmail.com> wrote:
>>
>>> Some options are covered here, although there is no definitive guidance
>>> as far as I know:
>>>
>>> https://cwiki.apache.org/confluence/display/Hive/Unit+Testin
>>> g+Hive+SQL#UnitTestingHiveSQL-Modularisation
>>>
>>> On 15 December 2016 at 17:08, Saumitra Shahapure <
>>> saumitra.offic...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are running and maintaining quite big and complex Hive SELECT query
>>>> right now. It's basically a single SELECT query which performs JOIN of
>>>> about ten other SELECT query outputs.
>>>>
>>>> A simplest way to refactor that we can think of is to break this query
>>>> down into multiple views and then join the views. There is similar
>>>> possibility to create intermediate tables.
>>>>
>>>> However creating multiple DDLs in order to maintain a single DML is not
>>>> very smooth. We would end up polluting metadata database by creating views
>>>> / intermediate tables which are used in just this ETL.
>>>>
>>>> What are the other efficient ways to maintain complex SQL queries
>>>> written in Hive? Are there better ways to break Hive query into multiple
>>>> modules?
>>>>
>>>> -- Saumitra S. Shahapure
>>>>
>>>
>>>
>>
>

Re: Maintaining big and complex Hive queries

Reply via email to