Re: LEFT JOIN and WHERE CLAUSE - How to handle

peter zhang Wed, 23 Aug 2017 17:30:29 -0700

How about splitting your txn data into two parts, one for the tx that has
currency info (just use join) and the other part for the tx that can't find
currency info then use a union all operator combines two parts tx, as below:


SELECT ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
  FROM CURRENCY JOIN TXN ON (CURRENCY.CCY_CD = TXN.CCY_CD)
 WHERE TXN_DT BETWEEN EFF_ST_DT AND EFF_END_DT;
 UNION ALL
SELECT ROW_NUM,CCY_CD,TXN_DT, null AS CNTRY_DESC
  FROM TXN
 WHERE NOT EXISTS (SELECT 1 FROM CURRENCY WHERE CURRENCY.CCY_CD =
TXN.CCY_CD)

2017-08-23 23:47 GMT+08:00 Furcy Pin <[email protected]>:

> Ho, in that case...
>
> (First I notice that you say you want all records in TXN but in the query
> you give, you perform your join the other way round.)
>
> This is a typical use case that SQL is not very good at handling...
>
> The solutions I see are:
>
> - use RANK as you suggested.
>   Note that Hive is smart enough to optimize it correctly :
>   if you use a filter WHERE Rank() < K, it will take the K first values in
> RAM and ignore the rest, rather than ranking everything and filtering
> afterwards.
>
> - perform a GROUP BY TXN.ROW_NUM so you have only one line per
> transaction. Aggregate the rest the way you like (max, max_by or arbitrary)
>   To handle the missing EUR line, just add a (OR CURRENCY.CCY_CD IS NULL)
> to your where clause (and reverse the left join).
>
> - Most complicated way to implement, but simplest to use afterwards:
>   replace your CURRENCY table with a table where you have only one row per
> currency, and store the multiple start/end dates in an array of structs,
>   then a UDF that given such array and a date returns the correct
> description. It only works if you don't have too many description per
> currency though.
>   This is where Spark comes handy: it is simpler to define UDFs with it.
>
>
>
> On Wed, Aug 23, 2017 at 5:21 PM, Ramasubramanian Narayanan <
> [email protected]> wrote:
>
>> Hi,
>>
>> TXN.TXN_DT should be between CURRENCY.EFF_ST_DT and CURRENCY.EFF_END_DT.
>> It needs to be equated.
>>
>>
>> regards,
>> Rams
>>
>> On Wed, Aug 23, 2017 at 7:55 PM, Furcy Pin <[email protected]>
>> wrote:
>>
>>> I would suggest to use a subquery
>>>
>>> WITH unique_currency AS (
>>>   SELECT
>>>     CCY_CD,
>>>     MAX(CNTRY_DESC) as CNTRY_DESC
>>>   FROM CURRENCY
>>>   GROUP BY CCY_CD
>>> )
>>>
>>> and then perform your left join on it.
>>>
>>> Some SQL engine (e.g. Presto) have aggregation functions like
>>> arbitrary(col) that take any value and are a little less costly than a
>>> max.
>>> Sometimes, they also have functions like max_by(x, y)
>>> <https://prestodb.io/docs/current/functions/aggregate.html#max_by> that
>>> would allow you to get the most recent description.
>>>
>>> It is a shame that this function is not included in Hive yet, but still
>>> you can find some UDAF implementations on github
>>> <https://github.com/dataiku/dataiku-hive-udf#first_of_group-last_of_group>
>>> .
>>>
>>>
>>>
>>> On Wed, Aug 23, 2017 at 3:37 PM, Ramasubramanian Narayanan <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Need your suggestion on the below.
>>>>
>>>> Have two tables TXN and CURRENCY.
>>>>
>>>> Need all records in TXN and hence doing Left Join with CURRENCY.
>>>>
>>>> *Two problems :*
>>>> 1. CURRENCY table may contain duplicate records hence it needs to be
>>>> handled through RANK or some other function.
>>>> 2. If we equate TXN_DT between EFF_ST_DT and EFF_END_DT in the 'where
>>>> clause' then we will loose the EUR records which should not happen.
>>>>
>>>> Please suggest a solution to over come both the problems. For
>>>> duplicated records it is fine if we select any of the CNTRY_DESC.
>>>>
>>>> *Table : CURRENCY*
>>>>
>>>> *Table : TXN*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *CNTRY_DESC*
>>>>
>>>> *EFF_ST_DT*
>>>>
>>>> *EFF_END_DT*
>>>>
>>>> *ROW_NUM*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *TXN_DT*
>>>>
>>>> INR
>>>>
>>>> Indian Rupee
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-16
>>>>
>>>> 1
>>>>
>>>> INR
>>>>
>>>> 16-Feb-17
>>>>
>>>> INR
>>>>
>>>> Indian Rupee New
>>>>
>>>> 21-Feb-16
>>>>
>>>> 20-Feb-99
>>>>
>>>> 2
>>>>
>>>> USD
>>>>
>>>> 16-Feb-17
>>>>
>>>> USD
>>>>
>>>> US Dollar
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-16
>>>>
>>>> 3
>>>>
>>>> SGD
>>>>
>>>> 16-Feb-17
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar
>>>>
>>>> 1-Jan-15
>>>>
>>>> 20-Feb-17
>>>>
>>>> 4
>>>>
>>>> EUR
>>>>
>>>> 16-Feb-17
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar New
>>>>
>>>> 15-Feb-17
>>>>
>>>> 20-Feb-99
>>>>
>>>> SGD
>>>>
>>>> Singapore Dollar Latest
>>>>
>>>> 16-Feb-17
>>>>
>>>> 16-Feb-17
>>>>
>>>> *Expected Output*
>>>>
>>>> *ROW_NUM*
>>>>
>>>> *CCY_CD*
>>>>
>>>> *TXN_DT*
>>>>
>>>> *CNTRY_DESC*
>>>>
>>>> 1
>>>>
>>>> INR
>>>>
>>>> 16-Feb-17
>>>>
>>>> Indian Rupee
>>>>
>>>> 2
>>>>
>>>> USD
>>>>
>>>> 16-Feb-17
>>>>
>>>> US Dollar
>>>>
>>>> 3
>>>>
>>>> SGD
>>>>
>>>> 16-Feb-17
>>>>
>>>> Singapore Dollar Latest (Any of three valid valid is fine)
>>>>
>>>> 4
>>>>
>>>> EUR
>>>>
>>>> 16-Feb-17
>>>>
>>>> <Null>
>>>>
>>>>
>>>>
>>>> *Query : *Select ROW_NUM,CCY_CD,TXN_DT,CNTRY_DESC
>>>> from CURRENCY LEFT JOIN TXN on (CURRENCY.CCY_CD = TXN.CCY_CD)
>>>> where
>>>> TXN_DT between EFF_ST_DT and EFF_END_DT;
>>>>
>>>>
>>>>
>>>> This query will drop the "EUR" record because of the where clause used.
>>>> It cannot be handled with case statement instead of 'where clause' as
>>>> we have   more than one record for 'SGD' when  TXN_DT is 16-FEB.
>>>>
>>>> regards,
>>>> Rams
>>>>
>>>
>>>
>>
>

Re: LEFT JOIN and WHERE CLAUSE - How to handle

Reply via email to