(please note that this random string function is NOT the right way to do it;
I should draw a random int and use it as an index into an array containing all
the letters; a sketch of that approach follows the function below)

Thanks a lot for this new version!
It seems to be slower than your first solution (no index use, I guess; I gave
up after 5 minutes vs 5 sec for the previous one). Moreover, I can't make any
assumption about a fixed interval (2 sec in your example). But I think I see
where you are going.


After some tests, the fastest is using BETWEEN on the range bounds
(strangely, it is way faster than using the <@ operator).

Here is the code:
-------------------------------------------------------

--useful function to fill with random text
    CREATE OR REPLACE FUNCTION rc_random_string(INTEGER)
    RETURNS text AS $$
    SELECT array_to_string(
        ARRAY(
            SELECT substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
                   FROM floor(random()*36)::int + 1 FOR 1)  -- floor keeps the index within 1..36
            FROM generate_series(1,$1)
        )
    , '')
    $$ LANGUAGE sql;
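
--(for reference, a sketch of the array-index approach mentioned above: draw a
-- random int and use it to index an array of characters; the name
-- rc_random_string_v2 is hypothetical and this version was not benchmarked here)
    CREATE OR REPLACE FUNCTION rc_random_string_v2(INTEGER)
    RETURNS text AS $$
    SELECT string_agg(
        (string_to_array('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', NULL))
            [floor(random()*36)::int + 1]  -- random index into the 36-character array
    , '')
    FROM generate_series(1,$1)
    $$ LANGUAGE sql;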


--creating tables
    DROP TABLE IF EXISTS a;
    DROP TABLE IF EXISTS b;

    create table a(gid int, t numeric, r numrange, data text);
    create table b(gid int, t numeric, data text);
    CREATE INDEX ON a (t);
    CREATE INDEX ON b (t);
    --CREATE INDEX ON a USING spgist (r);
    CREATE INDEX ON a (r);

--filling tables
    WITH the_serie AS (
        SELECT s AS gid, s+random()/2-0.5 AS s, rc_random_string(100) AS data
        FROM generate_series(1,100000) AS s
    )
    INSERT INTO a (gid, t, r, data)
    SELECT gid, s,
        numrange((lag(s,1) OVER (ORDER BY gid ASC))::numeric, s::numeric),
        data
    FROM the_serie;


    WITH the_serie AS (
        SELECT s AS gid, s+(random()-0.5)*2 AS s, rc_random_string(100) AS data
        FROM generate_series(1,100000) AS s
    )
    INSERT INTO b (gid, t, data)
    SELECT gid, s, data
    FROM the_serie;

ANALYZE a;
ANALYZE b;

--computing join with range

    --slow: 80 sec
    DROP TABLE IF EXISTS t;
    CREATE TABLE t AS
        SELECT b.*
        FROM b LEFT JOIN a ON (b.t <@ a.r)
        ORDER BY gid ASC
        LIMIT 30;

    --slow: 80 sec
    DROP TABLE IF EXISTS t;
    CREATE TABLE t AS
        SELECT b.*
        FROM a, b
        WHERE b.t <@ a.r;
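
    --(a guess, not benchmarked here: the plain btree index on r cannot serve the
    -- <@ containment operator, while a GiST index on the range column can;
    -- it might narrow the gap with the BETWEEN version below)
    --CREATE INDEX ON a USING gist (r);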

    --fast: 3 sec
    DROP TABLE IF EXISTS t;
    CREATE TABLE t AS
        SELECT b.*, a.data AS d2
        FROM a, b
        WHERE b.t BETWEEN lower(a.r) AND upper(a.r);

    --fast: 8 sec
    DROP TABLE IF EXISTS t;
    CREATE TABLE t AS
        SELECT a.t AS a_t, b.t AS b_t
        FROM (
            SELECT t, least(least(t, mint), least(t, maxt)) AS t2
            FROM (
                SELECT t,
                    (SELECT t FROM a WHERE a.t >= b.t ORDER BY a.t ASC  LIMIT 1) AS mint, -- closest a.t at or after b.t
                    (SELECT t FROM a WHERE a.t <  b.t ORDER BY a.t DESC LIMIT 1) AS maxt  -- closest a.t before b.t
                FROM b
            ) AS tmp
        ) AS tmp2
        INNER JOIN a ON (tmp2.t2 = a.t)
        INNER JOIN b ON (tmp2.t = b.t);
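
    --(sketch, not run here: the 1st-order interpolation weights discussed in the
    -- thread could be derived from the same two neighbours, assuming
    -- maxt < b.t <= mint; e.g. for b.t = 5.8 between 5 and 6 this gives 0.2 for
    -- the previous point and 0.8 for the next one)
    SELECT t,
        (mint - t) / NULLIF(mint - maxt, 0) AS weight_prev,  -- weight of the previous a.t
        (t - maxt) / NULLIF(mint - maxt, 0) AS weight_next   -- weight of the next a.t
    FROM (
        SELECT t,
            (SELECT t FROM a WHERE a.t >= b.t ORDER BY a.t ASC  LIMIT 1) AS mint,
            (SELECT t FROM a WHERE a.t <  b.t ORDER BY a.t DESC LIMIT 1) AS maxt
        FROM b
    ) AS tmp;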
-------------------------------------------------------

Thanks again,
Rémi-C


2014-04-11 20:18 GMT+02:00 Rémi Cura <remi.c...@gmail.com>:

> Wow many thanks!
>
> I had thought about the order by and limit because it is the natural way
> to express the problem,
> but I had discarded it for fear of such bad complexity
> (theoretically, for each row of B, compute the distance to every other
> row of A!).
>
> And it's okay if 2 rows from B share the same joined row from A, because
> when interpolating the result will be different.
>
> Here is the test env with realistic numbers; your solution is very fast,
> hats off (4 sec!)
> -------------------------------------------------------
>
> --useful function to fill with random text
>     CREATE OR REPLACE FUNCTION rc_random_string(INTEGER )
>     RETURNS text AS $$
>     SELECT array_to_string(
>     ARRAY(
>     SELECT
>     substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789' FROM
> (random()*36)::int + 1 FOR 1)
>     FROM generate_series(1,$1)
>     )
>     ,'')
>     $$ LANGUAGE sql;
>
>
> --creating tables
>     DROP TABLE IF EXISTS a;
>     DROP TABLE IF EXISTS b;
>
>     create table a(t float, data text);
>     create table b(t float, data text);
>     CREATE INDEX ON a (t);
>     CREATE INDEX ON b (t);
>
>
> --filling tables
>     WITH the_serie AS (
>         SELECT  s+random()/2 AS s, rc_random_string(100) aS data
>         FROM generate_series(1,100000) AS s
>
>     )
>     insert into a SELECT s, data
>     FROM the_serie;
>
>     WITH the_serie AS (
>         SELECT  s+(random()-0.5)*2 AS s, rc_random_string(100) aS data
>         FROM generate_series(1,100000) AS s
>
>     )
>     insert into b SELECT s, data
>     FROM the_serie;
>
> ANALYZE a;
> ANALYZE b;
>
> --computing result
> DROP TABLE IF EXISTS t;
> CREATE TABLE t AS
>     select a.t As a_t, b.t as b_t
>
>     from (
>       select t, least( least(t, mint), least(t, maxt)) as t2 from (
>         select t,
>          (select t from a where a.t >= b.t order by a.t limit 1) as mint,
>          (select t from a where a.t < b.t order by a.t desc limit 1) as
> maxt
>       from b
>       ) as tmp
>     ) as tmp2
>     inner join a on (tmp2.t2 = a.t)
>     inner join b on (tmp2.t = b.t)
> --------------------------------------------------------
>
>
>
> 2014-04-11 19:16 GMT+02:00 Andy Colson <a...@squeakycode.net>:
>
> On 4/11/2014 7:50 AM, Rémi Cura wrote:
>>
>>> Hey dear List,
>>>
>>> I'm looking for some advice about the best way to perform a "fuzzy"
>>> join, that is, joining two tables based on approximate matching.
>>>
>>> It is about temporal matching
>>> given a table A with rows containing data and a control_time (for
>>> instance 1; 5; 6; .. sec, not necessarily rounded or evenly spaced)
>>>
>>> given another table B with rows at no precise timing (e.g. control_time =
>>> 2.3; 5.8; 6.2)
>>>
>>> How to join every row of B to A based on
>>> min(@(A.control_time-B.control_time))
>>> (that is, for every row of B, get the row of A that is temporally the
>>> closest),
>>> in an efficient way?
>>> (to be explicit, 2.3 would match to 1, 5.8 to 6, 6.2 to 6)
>>>
>>> Optionally, how to get interpolation efficiently (meaning one has to get
>>> the previous time and next time for 1st-order interpolation, 2 before
>>> and 2 after for 2nd-order interpolation, and so on)?
>>> (to be explicit 5.8 would match to 5 and 6, the weight being 0.2 and 0.8
>>> respectively)
>>>
>>>
>>> Currently my data is spatial, so I use a PostGIS function to interpolate a
>>> point on a line, but it is far from efficient or general, and I don't
>>> have control over the interpolation (only the spatial values are interpolated).
>>>
>>>
>>> Cheers,
>>> Rémi-C
>>>
>>
>>
>> Ok, here is a just-SQL way.  No ranges.  No idea if it's right.  A first
>> pass, so to speak.
>>
>>
>>
>> create table a(t float, data text);
>> create table b(t float, data text);
>>
>> insert into a values (1), (5), (6);
>> insert into b values (2.3), (5.8), (6.2);
>>
>>
>> select a.t, b.t
>> from (
>>   select t, least( least(t, mint), least(t, maxt)) as t2 from (
>>     select t,
>>      (select t from a where a.t >= b.t order by a.t limit 1) as mint,
>>      (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
>>   from b
>>   ) as tmp
>> ) as tmp2
>> inner join a on (tmp2.t2 = a.t)
>> inner join b on (tmp2.t = b.t)
>>
>>
>>
>>
>> The middle part is the magic:
>>
>> select t,
>>  (select t from a where a.t >= b.t order by a.t limit 1) as mint,
>>  (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
>> from b
>>
>> The rest is just to make it usable.  If t is indexed, it'll probably be
>> fast too.
>>
>> -Andy
>>
>>
>>
>
