ozankabak commented on issue #318:
URL: https://github.com/apache/datafusion/issues/318#issuecomment-2194718023
@jirislav I'm not sure if you chose your tone intentionally to be this way
and I have to say I don't really appreciate it.
Nevertheless I will try to answer as best as I can. My concern about the
`ASOF` join is not specific to a particular implementation -- it is a
conceptual one. The pain point is of course real, but one doesn't really need a
new, special syntax to solve it.
Consider a scenario where we are computing the most recent values of each
position in a stock portfolio (akin to @alamb's example). Let's create some
sample data via:
```sql
CREATE TABLE holdings (
ts TIMESTAMP,
ticker VARCHAR(10),
amount INT
);
INSERT INTO holdings (ts, ticker, amount) VALUES
('2024-06-27 09:00:00', 'AAPL', '50'),
('2024-06-27 09:00:00', 'GOOGL', '30'),
('2024-06-27 09:00:00', 'MSFT', '20'),
('2024-06-28 09:00:00', 'AAPL', '60'),
('2024-06-28 09:00:00', 'GOOGL', '35'),
('2024-06-28 09:00:00', 'MSFT', '25');
CREATE TABLE prices (
ts TIMESTAMP,
ticker VARCHAR(10),
price DECIMAL(10, 2)
);
INSERT INTO prices (ts, ticker, price) VALUES
('2024-06-27 09:00:00', 'AAPL', '145.50'),
('2024-06-27 09:00:00', 'GOOGL', '2754.36'),
('2024-06-27 09:00:00', 'MSFT', '301.22'),
('2024-06-28 09:00:00', 'AAPL', '146.20'),
('2024-06-28 09:00:00', 'GOOGL', '2770.01'),
('2024-06-28 09:00:00', 'MSFT', '305.50');
```
Consider the following standard query:
```sql
SELECT h.*, h.amount * LAST_VALUE(p.price ORDER BY p.ts)
FROM holdings as h, prices as p
WHERE h.ts >= p.ts AND h.ticker = p.ticker
GROUP BY h.ts, h.ticker, h.amount
ORDER BY h.ts, h.ticker
```
We get the following (expected) result:
```
+---------------------+--------+--------+----------+
| ts | ticker | amount | value |
+---------------------+--------+--------+----------+
| 2024-06-27T09:00:00 | AAPL | 50 | 7275.00 |
| 2024-06-27T09:00:00 | GOOGL | 30 | 82630.80 |
| 2024-06-27T09:00:00 | MSFT | 20 | 6024.40 |
| 2024-06-28T09:00:00 | AAPL | 60 | 8772.00 |
| 2024-06-28T09:00:00 | GOOGL | 35 | 96950.35 |
| 2024-06-28T09:00:00 | MSFT | 25 | 7637.50 |
+---------------------+--------+--------+----------+
```
This solves the problem, it generalizes to arbitrary join conditions, and it
already works in `main` branch. The plan one gets from OOTB DataFusion for this
query is not very efficient today, but it can be made more efficient. For
example, if we specify that the input tables are ordered w.r.t. `ts` (which we
already allow today for external tables), the `ORDER BY` clause becomes
unnecessary and the query gets even more succinct [1]. We can make versions of
join algorithms that will take that data ordering and execute even more
efficiently than the standard hash join this query compiles into. The nice
thing about DF is that one can make this query efficiently run even in more
challenging contexts (like streaming data) by implementing custom join
operators. This is what we did at Synnada.
Given that we can solve this problem in a general way without any special
syntax, I do not see any real benefit of adding another syntax for an
already-solved problem in upstream DF. We are also improving built-in join
operators constantly and adding new ones, so our plans will also get ever more
efficient over time.
Downstream users are obviously free to customize DF in any way they wish. I
hope this helps.
[1] @alamb, writing up this example makes me think it could be a good idea
to file ticket to support order specification for memory tables too.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]