Hi, My bad, you are right I intended to give the query as below (TXN left joining CURRENCY not the other way around):
*Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC from TXN LEFT JOIN CURRENCY on (CURRENCY.CCY_CD = TXN.CCY_CD) where TXN_DT between EFF_ST_DT and EFF_END_DT; Thanks and regards, Rams On Wed, Aug 23, 2017 at 9:17 PM, Furcy Pin <furcy....@flaminem.com> wrote: > Ho, in that case... > > (First I notice that you say you want all records in TXN but in the query > you give, you perform your join the other way round.) > > This is a typical use case that SQL is not very good at handling... > > The solutions I see are: > > - use RANK as you suggested. > Note that Hive is smart enough to optimize it correctly : > if you use a filter WHERE Rank() < K, it will take the K first values in > RAM and ignore the rest, rather than ranking everything and filtering > afterwards. > > - perform a GROUP BY TXN.ROW_NUM so you have only one line per > transaction. Aggregate the rest the way you like (max, max_by or arbitrary) > To handle the missing EUR line, just add a (OR CURRENCY.CCY_CD IS NULL) > to your where clause (and reverse the left join). > > - Most complicated way to implement, but simplest to use afterwards: > replace your CURRENCY table with a table where you have only one row per > currency, and store the multiple start/end dates in an array of structs, > then a UDF that given such array and a date returns the correct > description. It only works if you don't have too many description per > currency though. > This is where Spark comes handy: it is simpler to define UDFs with it. > > > > On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan < > ramasubramanian.naraya...@gmail.com> wrote: > >> Hi, >> >> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT. >> It needs to be equated. >> >> >> regards, >> Rams >> >> On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <furcy....@flaminem.com> >> wrote: >> >>> I would suggest to use a subquery >>> >>> WITH unique_currency AS ( >>> SELECT >>> CCY_CD, >>> MAX(CNTRY_DESC) as CNTRY_DESC >>> FROM CURRENCY >>> GROUP BY CCY_CD >>> ) >>> >>> and then perform your left join on it. >>> >>> Some SQL engine (e.g. Presto) have aggregation functions like >>> arbitrary(col) that take any value and are a little less costly than a >>> max. >>> Sometimes, they also have functions like max_by(x, y) >>> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that >>> would allow you to get the most recent description. >>> >>> It is a shame that this function is not included in Hive yet, but still >>> you can find some UDAF implementations on github >>> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group> >>> . >>> >>> >>> >>> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan < >>> ramasubramanian.naraya...@gmail.com> wrote: >>> >>>> Hi, >>>> >>>> Need your suggestion on the below. >>>> >>>> Have two tables TXN and CURRENCY. >>>> >>>> Need all records in TXN and hence doing Left Join with CURRENCY. >>>> >>>> *Two problems :* >>>> 1. CURRENCY table may contain duplicate records hence it needs to be >>>> handled through RANK or some other function. >>>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where >>>> clause' then we will loose the EUR records which should not happen. >>>> >>>> Please suggest a solution to over come both the problems. For >>>> duplicated records it is fine if we select any of the CNTRY_DESC. >>>> >>>> *Table : CURRENCY* >>>> >>>> *Table : TXN* >>>> >>>> *CCY_CD* >>>> >>>> *CNTRY_DESC* >>>> >>>> *EFF_ST_DT* >>>> >>>> *EFF_END_DT* >>>> >>>> *ROW_NUM* >>>> >>>> *CCY_CD* >>>> >>>> *TXN_DT* >>>> >>>> INR >>>> >>>> Indian Rupee >>>> >>>> 1-Jan-15 >>>> >>>> 20-Feb-16 >>>> >>>> 1 >>>> >>>> INR >>>> >>>> 16-Feb-17 >>>> >>>> INR >>>> >>>> Indian Rupee New >>>> >>>> 21-Feb-16 >>>> >>>> 20-Feb-99 >>>> >>>> 2 >>>> >>>> USD >>>> >>>> 16-Feb-17 >>>> >>>> USD >>>> >>>> US Dollar >>>> >>>> 1-Jan-15 >>>> >>>> 20-Feb-16 >>>> >>>> 3 >>>> >>>> SGD >>>> >>>> 16-Feb-17 >>>> >>>> SGD >>>> >>>> Singapore Dollar >>>> >>>> 1-Jan-15 >>>> >>>> 20-Feb-17 >>>> >>>> 4 >>>> >>>> EUR >>>> >>>> 16-Feb-17 >>>> >>>> SGD >>>> >>>> Singapore Dollar New >>>> >>>> 15-Feb-17 >>>> >>>> 20-Feb-99 >>>> >>>> SGD >>>> >>>> Singapore Dollar Latest >>>> >>>> 16-Feb-17 >>>> >>>> 16-Feb-17 >>>> >>>> *Expected Output* >>>> >>>> *ROW_NUM* >>>> >>>> *CCY_CD* >>>> >>>> *TXN_DT* >>>> >>>> *CNTRY_DESC* >>>> >>>> 1 >>>> >>>> INR >>>> >>>> 16-Feb-17 >>>> >>>> Indian Rupee >>>> >>>> 2 >>>> >>>> USD >>>> >>>> 16-Feb-17 >>>> >>>> US Dollar >>>> >>>> 3 >>>> >>>> SGD >>>> >>>> 16-Feb-17 >>>> >>>> Singapore Dollar Latest (Any of three valid valid is fine) >>>> >>>> 4 >>>> >>>> EUR >>>> >>>> 16-Feb-17 >>>> >>>> <Null> >>>> >>>> >>>> >>>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC >>>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD) >>>> where >>>> TXN_DT between EFF_ST_DT and EFF_END_DT; >>>> >>>> >>>> >>>> This query will drop the "EUR" record because of the where clause used. >>>> It cannot be handled with case statement instead of 'where clause' as >>>> we have more than one record for 'SGD' when TXN_DT is 16-FEB. >>>> >>>> regards, >>>> Rams >>>> >>> >>> >> >