(Please note that this random string function is not a good way to do it; I should instead generate a random integer and use it as an index into an array containing all the letters.)
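For the record, the index-into-an-alphabet approach mentioned above could be sketched like this (a Python illustration of the idea, not the SQL used in the thread; the function name mirrors the SQL one but is otherwise my own):

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def rc_random_string(n: int) -> str:
    # Draw a uniform random index for each position and look the character
    # up in the alphabet, so every character is equally likely.
    return "".join(ALPHABET[random.randrange(len(ALPHABET))] for _ in range(n))
```

By contrast, the SQL version's `(random()*36)::int + 1` rounds to the nearest integer, so it can produce positions 1 through 37: position 37 yields an empty substring, and positions 1 and 37 are only half as likely as the others, which is presumably why the author flags it.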
Thanks a lot for this new version! It seems to be slower than your first solution (no index use, I guess; I gave up after 5 minutes vs 5 sec for the previous one). Moreover, I can't make any assumption about a fixed interval (2 sec in your example). But I think I see where you are going. After some tests, the fastest approach is using BETWEEN on the range bounds (strangely, it is way faster than using the <@ operator). Here is the code:
-------------------------------------------------------
--useful function to fill with random text
CREATE OR REPLACE FUNCTION rc_random_string(INTEGER)
RETURNS text AS $$
    SELECT array_to_string(
        ARRAY(
            SELECT substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
                             FROM (random()*36)::int + 1 FOR 1)
            FROM generate_series(1, $1)
        )
    , '')
$$ LANGUAGE sql;

--creating tables
DROP TABLE IF EXISTS a;
DROP TABLE IF EXISTS b;

CREATE TABLE a(gid int, t numeric, r numrange, data text);
CREATE TABLE b(gid int, t numeric, data text);
CREATE INDEX ON a (t);
CREATE INDEX ON b (t);
--CREATE INDEX ON a USING spgist (r);
CREATE INDEX ON a (r);

--filling tables
WITH the_serie AS (
    SELECT s AS gid,
           s + random()/2 - 0.5 AS s,
           rc_random_string(100) AS data
    FROM generate_series(1, 100000) AS s
)
INSERT INTO a (gid, t, r, data)
SELECT gid,
       s,
       numrange((lag(s, 1) OVER (ORDER BY gid ASC))::numeric, s::numeric),
       data
FROM the_serie;

WITH the_serie AS (
    SELECT s AS gid,
           s + (random() - 0.5) * 2 AS s,
           rc_random_string(100) AS data
    FROM generate_series(1, 100000) AS s
)
INSERT INTO b (gid, t, data)
SELECT gid, s, data
FROM the_serie;

ANALYZE a;
ANALYZE b;

--computing join with range
--slow: 80 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.*
FROM b LEFT JOIN a ON (b.t <@ a.r)
ORDER BY gid ASC
LIMIT 30;

--slow: 80 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.*
FROM a, b
WHERE b.t <@ a.r;

--fast: 3 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.*, a.data AS d2
FROM a, b
WHERE b.t BETWEEN lower(a.r) AND upper(a.r);

--fast: 8 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT a.t AS a_t, b.t AS b_t
FROM (
    SELECT t, least(least(t, mint), least(t, maxt)) AS t2
    FROM (
        SELECT t,
               (SELECT t FROM a WHERE a.t >= b.t ORDER BY a.t LIMIT 1) AS mint,
               (SELECT t FROM a WHERE a.t < b.t ORDER BY a.t DESC LIMIT 1) AS maxt
        FROM b
    ) AS tmp
) AS tmp2
INNER JOIN a ON (tmp2.t2 = a.t)
INNER JOIN b ON (tmp2.t = b.t);
-------------------------------------------------------

Thanks again,
Rémi-C

2014-04-11 20:18 GMT+02:00 Rémi Cura <remi.c...@gmail.com>:
> Wow, many thanks!
>
> I had thought about the ORDER BY plus LIMIT approach because it is the
> natural way to express the problem, but I had discarded it for fear of
> such bad complexity (theoretically, for each row of B, compute the
> distance to every other row of A!).
>
> And it's okay if 2 rows from B share the same joined row of A, because
> when interpolating the result will be different.
>
> Here is the test env with realistic numbers; your solution is very fast,
> I have to raise my hat (4 sec!)
> -------------------------------------------------------
>
> --useful function to fill with random text
> CREATE OR REPLACE FUNCTION rc_random_string(INTEGER)
> RETURNS text AS $$
>     SELECT array_to_string(
>         ARRAY(
>             SELECT substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
>                              FROM (random()*36)::int + 1 FOR 1)
>             FROM generate_series(1, $1)
>         )
>     , '')
> $$ LANGUAGE sql;
>
> --creating tables
> DROP TABLE IF EXISTS a;
> DROP TABLE IF EXISTS b;
>
> CREATE TABLE a(t float, data text);
> CREATE TABLE b(t float, data text);
> CREATE INDEX ON a (t);
> CREATE INDEX ON b (t);
>
> --filling tables
> WITH the_serie AS (
>     SELECT s + random()/2 AS s, rc_random_string(100) AS data
>     FROM generate_series(1, 100000) AS s
> )
> INSERT INTO a
> SELECT s, data
> FROM the_serie;
>
> WITH the_serie AS (
>     SELECT s + (random() - 0.5) * 2 AS s, rc_random_string(100) AS data
>     FROM generate_series(1, 100000) AS s
> )
> INSERT INTO b
> SELECT s, data
> FROM the_serie;
>
> ANALYZE a;
> ANALYZE b;
>
> --computing result
> DROP TABLE IF EXISTS t;
> CREATE TABLE t AS
> SELECT
> a.t AS a_t, b.t AS b_t
> FROM (
>     SELECT t, least(least(t, mint), least(t, maxt)) AS t2
>     FROM (
>         SELECT t,
>                (SELECT t FROM a WHERE a.t >= b.t ORDER BY a.t LIMIT 1) AS mint,
>                (SELECT t FROM a WHERE a.t < b.t ORDER BY a.t DESC LIMIT 1) AS maxt
>         FROM b
>     ) AS tmp
> ) AS tmp2
> INNER JOIN a ON (tmp2.t2 = a.t)
> INNER JOIN b ON (tmp2.t = b.t);
> --------------------------------------------------------
>
>
> 2014-04-11 19:16 GMT+02:00 Andy Colson <a...@squeakycode.net>:
>
>> On 4/11/2014 7:50 AM, Rémi Cura wrote:
>>
>>> Hey dear List,
>>>
>>> I'm looking for some advice about the best way to perform a "fuzzy"
>>> join, that is, joining two tables based on approximate matching.
>>>
>>> It is about temporal matching:
>>> given a table A with rows containing data and a control_time (for
>>> instance 1; 5; 6; ... sec, not necessarily rounded or evenly spaced),
>>>
>>> given another table B with rows on no precise timing (e.g. control_time =
>>> 2.3; 5.8; 6.2 for example),
>>>
>>> how to join every row of B to A based on
>>> min(@(A.control_time - B.control_time))
>>> (that is, for every row of B, get the row of A that is temporally the
>>> closest), in an efficient way?
>>> (To be explicit, 2.3 would match to 1, 5.8 to 6, and 6.2 to 6.)
>>>
>>> Optionally, how to get interpolation efficiently (meaning one has to get
>>> the previous time and the next time for 1st-order interpolation, 2 before
>>> and 2 after for 2nd-order interpolation, and so on)?
>>> (To be explicit, 5.8 would match to 5 and 6, the weights being 0.2 and
>>> 0.8 respectively.)
>>>
>>> Currently my data is spatial, so I use PostGIS functions to interpolate
>>> a point on a line, but it is far from efficient or general, and I don't
>>> have control over the interpolation (only the spatial values are
>>> interpolated).
>>>
>>> Cheers,
>>> Rémi-C
>>
>>
>> Ok, here is a just-SQL way. No ranges. No idea if it's right. A first
>> pass, so to speak.
>>
>> create table a(t float, data text);
>> create table b(t float, data text);
>>
>> insert into a values (1), (5), (6);
>> insert into b values (2.3), (5.8), (6.2);
>>
>> select a.t, b.t
>> from (
>>     select t, least( least(t, mint), least(t, maxt)) as t2 from (
>>         select t,
>>             (select t from a where a.t >= b.t order by a.t limit 1) as mint,
>>             (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
>>         from b
>>     ) as tmp
>> ) as tmp2
>> inner join a on (tmp2.t2 = a.t)
>> inner join b on (tmp2.t = b.t)
>>
>> The middle part is the magic:
>>
>> select t,
>>     (select t from a where a.t >= b.t order by a.t limit 1) as mint,
>>     (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
>> from b
>>
>> The rest is just to make it usable. If t is indexed, it'll probably be
>> fast too.
>>
>> -Andy
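For readers skimming the archive: the nearest-neighbour idea the thread converges on (for each b.t, fetch the first a.t at or above it and the last one below it, then keep the temporally closer of the two), plus the 1st-order interpolation weights Rémi asks about, can be sketched in Python with `bisect`. This is an illustration of the logic only, not the SQL itself; the function names are my own:

```python
import bisect

def nearest(sorted_a, t):
    """Return the value in sorted_a closest to t (ties go to the lower value).

    Mirrors the two ordered sub-selects: the first element >= t and the
    last element < t, keeping whichever is closer.
    """
    i = bisect.bisect_left(sorted_a, t)
    if i == 0:
        return sorted_a[0]          # t is below the whole series
    if i == len(sorted_a):
        return sorted_a[-1]         # t is above the whole series
    below, above = sorted_a[i - 1], sorted_a[i]
    return below if t - below <= above - t else above

def interp_weights(sorted_a, t):
    """Previous/next neighbours of t and their 1st-order interpolation weights.

    Assumes min(sorted_a) < t <= max(sorted_a); out-of-range t is not handled.
    """
    i = bisect.bisect_left(sorted_a, t)
    prev_t, next_t = sorted_a[i - 1], sorted_a[i]
    w_next = (t - prev_t) / (next_t - prev_t)
    return (prev_t, 1.0 - w_next), (next_t, w_next)

# The thread's toy example: A = {1, 5, 6}, B = {2.3, 5.8, 6.2};
# 2.3 matches 1, 5.8 matches 6, 6.2 matches 6, and 5.8 interpolates
# between 5 and 6 with weights 0.2 and 0.8.
a = [1.0, 5.0, 6.0]
```

A binary search per row of B is what makes the index-backed SQL plans fast: the feared "compare each row of B to every row of A" cost never materialises, because each ordered sub-select with LIMIT 1 is a single index probe.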