Re: LEFT JOIN and WHERE CLAUSE - How to handle

Ramasubramanian Narayanan Thu, 24 Aug 2017 03:00:09 -0700

Hi,

My bad, you are right I intended to give the query  as below (TXN left
joining CURRENCY not the other way around):


*Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
from TXN LEFT JOIN CURRENCY  on (CURRENCY.CCY_CD = TXN.CCY_CD)
where
TXN_DT between EFF_ST_DT and EFF_END_DT;

Thanks and regards,
Rams


On Wed, Aug 23, 2017 at 9:17 PM, Furcy Pin <furcy....@flaminem.com> wrote:

> Ho, in that case...
>
> (First I notice that you say you want all records in TXN but in the query
> you give, you perform your join the other way round.)
>
> This is a typical use case that SQL is not very good at handling...
>
> The solutions I see are:
>
> - use RANK as you suggested.
>   Note that Hive is smart enough to optimize it correctly :
>   if you use a filter WHERE Rank() < K, it will take the K first values in
> RAM and ignore the rest, rather than ranking everything and filtering
> afterwards.
>
> - perform a GROUP BY TXN.ROW_NUM so you have only one line per
> transaction. Aggregate the rest the way you like (max, max_by or arbitrary)
>   To handle the missing EUR line, just add a (OR CURRENCY.CCY_CD IS NULL)
> to your where clause (and reverse the left join).
>
> - Most complicated way to implement, but simplest to use afterwards:
>   replace your CURRENCY table with a table where you have only one row per
> currency, and store the multiple start/end dates in an array of structs,
>   then a UDF that given such array and a date returns the correct
> description. It only works if you don't have too many description per
> currency though.
>   This is where Spark comes handy: it is simpler to define UDFs with it.
>
>
>
> On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan <
> ramasubramanian.naraya...@gmail.com> wrote:
>
>> Hi,
>>
>> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT.
>> It needs to be equated.
>>
>>
>> regards,
>> Rams
>>
>> On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <furcy....@flaminem.com>
>> wrote:
>>
>>> I would suggest to use a subquery
>>>
>>> WITH unique_currency AS (
>>>   SELECT
>>>     CCY_CD,
>>>     MAX(CNTRY_DESC) as CNTRY_DESC
>>>   FROM CURRENCY
>>>   GROUP BY CCY_CD
>>> )
>>>
>>> and then perform your left join on it.
>>>
>>> Some SQL engine (e.g. Presto) have aggregation functions like
>>> arbitrary(col) that take any value and are a little less costly than a
>>> max.
>>> Sometimes, they also have functions like max_by(x, y)
>>> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
>>> would allow you to get the most recent description.
>>>
>>> It is a shame that this function is not included in Hive yet, but still
>>> you can find some UDAF implementations on github
>>> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>
>>> .
>>>
>>>
>>>
>>> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
>>> ramasubramanian.naraya...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Need your suggestion on the below.
>>>>
>>>> Have two tables TXN and CURRENCY.
>>>>
>>>> Need all records in TXN and hence doing Left Join with CURRENCY.
>>>>
>>>> *Two problems :*
>>>> 1. CURRENCY table may contain duplicate records hence it needs to be
>>>> handled through RANK or some other function.
>>>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
>>>> clause' then we will loose the EUR records which should not happen.
>>>>
>>>> Please suggest a solution to over come both the problems. For
>>>> duplicated records it is fine if we select any of the CNTRY_DESC.
>>>>
>>>> *Table : CURRENCY*
>>>>
>>>> *Table : TXN*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *CNTRY_DESC*
>>>>
>>>> *EFF_ST_DT*
>>>>
>>>> *EFF_END_DT*
>>>>
>>>> *ROW_NUM*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *TXN_DT*
>>>>
>>>> INR
>>>>
>>>> Indian Rupee
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-16
>>>>
>>>> 1
>>>>
>>>> INR
>>>>
>>>> 16-Feb-17
>>>>
>>>> INR
>>>>
>>>> Indian Rupee New
>>>>
>>>> 21-Feb-16
>>>>
>>>> 20-Feb-99
>>>>
>>>> 2
>>>>
>>>> USD
>>>>
>>>> 16-Feb-17
>>>>
>>>> USD
>>>>
>>>> US Dollar
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-16
>>>>
>>>> 3
>>>>
>>>> SGD
>>>>
>>>> 16-Feb-17
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-17
>>>>
>>>> 4
>>>>
>>>> EUR
>>>>
>>>> 16-Feb-17
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar New
>>>>
>>>> 15-Feb-17
>>>>
>>>> 20-Feb-99
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar Latest
>>>>
>>>> 16-Feb-17
>>>>
>>>> 16-Feb-17
>>>>
>>>> *Expected Output*
>>>>
>>>> *ROW_NUM*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *TXN_DT*
>>>>
>>>> *CNTRY_DESC*
>>>>
>>>> 1
>>>>
>>>> INR
>>>>
>>>> 16-Feb-17
>>>>
>>>> Indian Rupee
>>>>
>>>> 2
>>>>
>>>> USD
>>>>
>>>> 16-Feb-17
>>>>
>>>> US Dollar
>>>>
>>>> 3
>>>>
>>>> SGD
>>>>
>>>> 16-Feb-17
>>>>
>>>> Singapore Dollar Latest (Any of three valid valid is fine)
>>>>
>>>> 4
>>>>
>>>> EUR
>>>>
>>>> 16-Feb-17
>>>>
>>>> <Null>
>>>>
>>>>
>>>>
>>>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
>>>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
>>>> where
>>>> TXN_DT between EFF_ST_DT and EFF_END_DT;
>>>>
>>>>
>>>>
>>>> This query will drop the "EUR" record because of the where clause used.
>>>> It cannot be handled with case statement instead of 'where clause' as
>>>> we have   more than one record for 'SGD' when  TXN_DT is 16-FEB.
>>>>
>>>> regards,
>>>> Rams
>>>>
>>>
>>>
>>
>

Re: LEFT JOIN and WHERE CLAUSE - How to handle

Reply via email to