I have been contemplating attaching meta data for the query lineage to each table such that I can know where the data came from and have a 1 click regenerate button.
On Wed, Dec 21, 2016 at 3:02 PM, Stephen Sprague <sprag...@gmail.com> wrote: > my 2 cents. :) > > as soon as you say "complex query" i would submit you've lost the > upperhand and you're behind the eight-ball right off the bat. And you know > this too otherwise you wouldn't have posted here. ha! > > i use cascading CTAS statements so that i can examine the intermediate > tables. Another approach is to use CTE's but while that makes things > easier to read it's still one big query and you don't get insight to the > "work" tables. > > yes, it could take longer execution time if those intermediate tables > can't be run in parallel but small price to pay compared to human debug > time in my book anyway. > > thoughts? > > Cheers, > Stephen. > > > > > > On Wed, Dec 21, 2016 at 10:07 AM, Saumitra Shahapure < > saumitra.offic...@gmail.com> wrote: > >> Hi Elliot, >> >> Thanks for letting me know. HPL-SQL sounded particularly interesting. But >> in the documentation I could not see any way to pass output generated by >> one Hive query to the next one. The tool looks good as a homogeneous PL-SQL >> platform for multiple big-data systems (http://www.hplsql.org/about). >> >> However in order to break single complex hive query, DDLs look to be only >> way in HPL-SQL too. Or is there any alternate way that I might have missed? >> >> -- Saumitra S. Shahapure >> >> On Thu, Dec 15, 2016 at 6:21 PM, Elliot West <tea...@gmail.com> wrote: >> >>> I notice that HPL/SQL is not mentioned on the page I referenced, however >>> I expect that is another approach that you could use to modularise: >>> >>> https://cwiki.apache.org/confluence/pages/viewpage.action?pa >>> geId=59690156 >>> http://www.hplsql.org/doc >>> >>> On 15 December 2016 at 17:17, Elliot West <tea...@gmail.com> wrote: >>> >>>> Some options are covered here, although there is no definitive guidance >>>> as far as I know: >>>> >>>> https://cwiki.apache.org/confluence/display/Hive/Unit+Testin >>>> g+Hive+SQL#UnitTestingHiveSQL-Modularisation >>>> >>>> On 15 December 2016 at 17:08, Saumitra Shahapure < >>>> saumitra.offic...@gmail.com> wrote: >>>> >>>>> Hello, >>>>> >>>>> We are running and maintaining quite big and complex Hive SELECT query >>>>> right now. It's basically a single SELECT query which performs JOIN of >>>>> about ten other SELECT query outputs. >>>>> >>>>> A simplest way to refactor that we can think of is to break this query >>>>> down into multiple views and then join the views. There is similar >>>>> possibility to create intermediate tables. >>>>> >>>>> However creating multiple DDLs in order to maintain a single DML is >>>>> not very smooth. We would end up polluting metadata database by creating >>>>> views / intermediate tables which are used in just this ETL. >>>>> >>>>> What are the other efficient ways to maintain complex SQL queries >>>>> written in Hive? Are there better ways to break Hive query into multiple >>>>> modules? >>>>> >>>>> -- Saumitra S. Shahapure >>>>> >>>> >>>> >>> >> >