Ho, in that case... (First I notice that you say you want all records in TXN but in the query you give, you perform your join the other way round.)
This is a typical use case that SQL is not very good at handling... The solutions I see are: - use RANK as you suggested. Note that Hive is smart enough to optimize it correctly : if you use a filter WHERE Rank() < K, it will take the K first values in RAM and ignore the rest, rather than ranking everything and filtering afterwards. - perform a GROUP BY TXN.ROW_NUM so you have only one line per transaction. Aggregate the rest the way you like (max, max_by or arbitrary) To handle the missing EUR line, just add a (OR CURRENCY.CCY_CD IS NULL) to your where clause (and reverse the left join). - Most complicated way to implement, but simplest to use afterwards: replace your CURRENCY table with a table where you have only one row per currency, and store the multiple start/end dates in an array of structs, then a UDF that given such array and a date returns the correct description. It only works if you don't have too many description per currency though. This is where Spark comes handy: it is simpler to define UDFs with it. On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan < ramasubramanian.naraya...@gmail.com> wrote: > Hi, > > TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT. > It needs to be equated. > > > regards, > Rams > > On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <furcy....@flaminem.com> wrote: > >> I would suggest to use a subquery >> >> WITH unique_currency AS ( >> SELECT >> CCY_CD, >> MAX(CNTRY_DESC) as CNTRY_DESC >> FROM CURRENCY >> GROUP BY CCY_CD >> ) >> >> and then perform your left join on it. >> >> Some SQL engine (e.g. Presto) have aggregation functions like >> arbitrary(col) that take any value and are a little less costly than a >> max. >> Sometimes, they also have functions like max_by(x, y) >> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that >> would allow you to get the most recent description. >> >> It is a shame that this function is not included in Hive yet, but still >> you can find some UDAF implementations on github >> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group> >> . >> >> >> >> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan < >> ramasubramanian.naraya...@gmail.com> wrote: >> >>> Hi, >>> >>> Need your suggestion on the below. >>> >>> Have two tables TXN and CURRENCY. >>> >>> Need all records in TXN and hence doing Left Join with CURRENCY. >>> >>> *Two problems :* >>> 1. CURRENCY table may contain duplicate records hence it needs to be >>> handled through RANK or some other function. >>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where >>> clause' then we will loose the EUR records which should not happen. >>> >>> Please suggest a solution to over come both the problems. For duplicated >>> records it is fine if we select any of the CNTRY_DESC. >>> >>> *Table : CURRENCY* >>> >>> *Table : TXN* >>> >>> *CCY_CD* >>> >>> *CNTRY_DESC* >>> >>> *EFF_ST_DT* >>> >>> *EFF_END_DT* >>> >>> *ROW_NUM* >>> >>> *CCY_CD* >>> >>> *TXN_DT* >>> >>> INR >>> >>> Indian Rupee >>> >>> 1-Jan-15 >>> >>> 20-Feb-16 >>> >>> 1 >>> >>> INR >>> >>> 16-Feb-17 >>> >>> INR >>> >>> Indian Rupee New >>> >>> 21-Feb-16 >>> >>> 20-Feb-99 >>> >>> 2 >>> >>> USD >>> >>> 16-Feb-17 >>> >>> USD >>> >>> US Dollar >>> >>> 1-Jan-15 >>> >>> 20-Feb-16 >>> >>> 3 >>> >>> SGD >>> >>> 16-Feb-17 >>> >>> SGD >>> >>> Singapore Dollar >>> >>> 1-Jan-15 >>> >>> 20-Feb-17 >>> >>> 4 >>> >>> EUR >>> >>> 16-Feb-17 >>> >>> SGD >>> >>> Singapore Dollar New >>> >>> 15-Feb-17 >>> >>> 20-Feb-99 >>> >>> SGD >>> >>> Singapore Dollar Latest >>> >>> 16-Feb-17 >>> >>> 16-Feb-17 >>> >>> *Expected Output* >>> >>> *ROW_NUM* >>> >>> *CCY_CD* >>> >>> *TXN_DT* >>> >>> *CNTRY_DESC* >>> >>> 1 >>> >>> INR >>> >>> 16-Feb-17 >>> >>> Indian Rupee >>> >>> 2 >>> >>> USD >>> >>> 16-Feb-17 >>> >>> US Dollar >>> >>> 3 >>> >>> SGD >>> >>> 16-Feb-17 >>> >>> Singapore Dollar Latest (Any of three valid valid is fine) >>> >>> 4 >>> >>> EUR >>> >>> 16-Feb-17 >>> >>> <Null> >>> >>> >>> >>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC >>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD) >>> where >>> TXN_DT between EFF_ST_DT and EFF_END_DT; >>> >>> >>> >>> This query will drop the "EUR" record because of the where clause used. >>> It cannot be handled with case statement instead of 'where clause' as we >>> have more than one record for 'SGD' when TXN_DT is 16-FEB. >>> >>> regards, >>> Rams >>> >> >> >