Re: Maintaining big and complex Hive queries

Edward Capriolo Wed, 21 Dec 2016 12:07:25 -0800

I have been contemplating attaching meta data for the query lineage to each
table such that I can know where the data came from and have a 1 click
regenerate button.


On Wed, Dec 21, 2016 at 3:02 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> my 2 cents. :)
>
> as soon as you say "complex query" i would submit you've lost the
> upperhand and you're behind the eight-ball right off the bat.  And you know
> this too otherwise you wouldn't have posted here. ha!
>
> i use cascading CTAS statements so that i can examine the intermediate
> tables.  Another approach is to use CTE's but while that makes things
> easier to read it's still one big query and you don't get insight to the
> "work" tables.
>
> yes, it could take longer execution time if those intermediate tables
> can't be run in parallel but small price to pay compared to human debug
> time in my book anyway.
>
> thoughts?
>
> Cheers,
> Stephen.
>
>
>
>
>
> On Wed, Dec 21, 2016 at 10:07 AM, Saumitra Shahapure <
> saumitra.offic...@gmail.com> wrote:
>
>> Hi Elliot,
>>
>> Thanks for letting me know. HPL-SQL sounded particularly interesting. But
>> in the documentation I could not see any way to pass output generated by
>> one Hive query to the next one. The tool looks good as a homogeneous PL-SQL
>> platform for multiple big-data systems (http://www.hplsql.org/about).
>>
>> However in order to break single complex hive query, DDLs look to be only
>> way in HPL-SQL too. Or is there any alternate way that I might have missed?
>>
>> -- Saumitra S. Shahapure
>>
>> On Thu, Dec 15, 2016 at 6:21 PM, Elliot West <tea...@gmail.com> wrote:
>>
>>> I notice that HPL/SQL is not mentioned on the page I referenced, however
>>> I expect that is another approach that you could use to modularise:
>>>
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pa
>>> geId=59690156
>>> http://www.hplsql.org/doc
>>>
>>> On 15 December 2016 at 17:17, Elliot West <tea...@gmail.com> wrote:
>>>
>>>> Some options are covered here, although there is no definitive guidance
>>>> as far as I know:
>>>>
>>>> https://cwiki.apache.org/confluence/display/Hive/Unit+Testin
>>>> g+Hive+SQL#UnitTestingHiveSQL-Modularisation
>>>>
>>>> On 15 December 2016 at 17:08, Saumitra Shahapure <
>>>> saumitra.offic...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are running and maintaining quite big and complex Hive SELECT query
>>>>> right now. It's basically a single SELECT query which performs JOIN of
>>>>> about ten other SELECT query outputs.
>>>>>
>>>>> A simplest way to refactor that we can think of is to break this query
>>>>> down into multiple views and then join the views. There is similar
>>>>> possibility to create intermediate tables.
>>>>>
>>>>> However creating multiple DDLs in order to maintain a single DML is
>>>>> not very smooth. We would end up polluting metadata database by creating
>>>>> views / intermediate tables which are used in just this ETL.
>>>>>
>>>>> What are the other efficient ways to maintain complex SQL queries
>>>>> written in Hive? Are there better ways to break Hive query into multiple
>>>>> modules?
>>>>>
>>>>> -- Saumitra S. Shahapure
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Maintaining big and complex Hive queries

Reply via email to