ozankabak commented on issue #318:
URL: https://github.com/apache/datafusion/issues/318#issuecomment-2194718023

   @jirislav I'm not sure if you chose your tone intentionally to be this way 
and I have to say I don't really appreciate it. 
   
   Nevertheless I will try to answer as best as I can. My concern about the 
`ASOF` join is not specific to a particular implementation -- it is a 
conceptual one. The pain point is of course real, but one doesn't really need a 
new, special syntax to solve it.
   
   Consider a scenario where we are computing the most recent values of each 
position in a stock portfolio (akin to @alamb's example). Let's create some 
sample data via:
   
   ```sql
   CREATE TABLE holdings (
       ts TIMESTAMP,
       ticker VARCHAR(10),
       amount INT
   );
   
   INSERT INTO holdings (ts, ticker, amount) VALUES
     ('2024-06-27 09:00:00', 'AAPL', '50'),
     ('2024-06-27 09:00:00', 'GOOGL', '30'),
     ('2024-06-27 09:00:00', 'MSFT', '20'),
     ('2024-06-28 09:00:00', 'AAPL', '60'),
     ('2024-06-28 09:00:00', 'GOOGL', '35'),
     ('2024-06-28 09:00:00', 'MSFT', '25');
   
   CREATE TABLE prices (
       ts TIMESTAMP,
       ticker VARCHAR(10),
       price DECIMAL(10, 2)
   );
   
   INSERT INTO prices (ts, ticker, price) VALUES
     ('2024-06-27 09:00:00', 'AAPL', '145.50'),
     ('2024-06-27 09:00:00', 'GOOGL', '2754.36'),
     ('2024-06-27 09:00:00', 'MSFT', '301.22'),
     ('2024-06-28 09:00:00', 'AAPL', '146.20'),
     ('2024-06-28 09:00:00', 'GOOGL', '2770.01'),
     ('2024-06-28 09:00:00', 'MSFT', '305.50');
   ```
   
   Consider the following standard query:
   ```sql
   SELECT h.*, h.amount * LAST_VALUE(p.price ORDER BY p.ts)
   FROM holdings as h, prices as p
   WHERE h.ts >= p.ts AND h.ticker = p.ticker
   GROUP BY h.ts, h.ticker, h.amount
   ORDER BY h.ts, h.ticker
   ```
   
   We get the following (expected) result:
   ```
   +---------------------+--------+--------+----------+
   | ts                  | ticker | amount | value    |
   +---------------------+--------+--------+----------+
   | 2024-06-27T09:00:00 | AAPL   | 50     | 7275.00  |
   | 2024-06-27T09:00:00 | GOOGL  | 30     | 82630.80 |
   | 2024-06-27T09:00:00 | MSFT   | 20     | 6024.40  |
   | 2024-06-28T09:00:00 | AAPL   | 60     | 8772.00  |
   | 2024-06-28T09:00:00 | GOOGL  | 35     | 96950.35 |
   | 2024-06-28T09:00:00 | MSFT   | 25     | 7637.50  |
   +---------------------+--------+--------+----------+
   ```
   
   This solves the problem, it generalizes to arbitrary join conditions, and it 
already works in `main` branch. The plan one gets from OOTB DataFusion for this 
query is not very efficient today, but it can be made more efficient. For 
example, if we specify that the input tables are ordered w.r.t. `ts` (which we 
already allow today for external tables), the `ORDER BY` clause becomes 
unnecessary and the query gets even more succinct [1]. We can make versions of 
join algorithms that will take that data ordering and execute even more 
efficiently than the standard hash join this query compiles into. The nice 
thing about DF is that one can make this query efficiently run even in more 
challenging contexts (like streaming data) by implementing custom join 
operators. This is what we did at Synnada.
   
   Given that we can solve this problem in a general way without any special 
syntax, I do not see any real benefit of adding another syntax for an 
already-solved problem in upstream DF. We are also improving built-in join 
operators constantly and adding new ones, so our plans will also get ever more 
efficient over time.
   
   Downstream users are obviously free to customize DF in any way they wish. I 
hope this helps.
   
   [1] @alamb, writing up this example makes me think it could be a good idea 
to file ticket to support order specification for memory tables too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to