Re: Array of integer indexed nested-loop semi join

Mickael van der Beek Mon, 23 May 2022 00:57:50 -0700

Hello Jeff,

Sadly, the query you suggested won't work because you are only returning
the first row of the matching inner query rows.
Example:


SELECT
>   u.idx,
>   u.page_idxs
> FROM
>   (
>     VALUES
>       (1, ARRAY[11, 21, 31]),
>       (2, ARRAY[12, 21, 32]),
>       (3, ARRAY[13, 23, 31])
>   )
>     AS u(idx, page_idxs)
> WHERE
>   u.page_idxs && ARRAY[(
>     SELECT
>       p.idx
>     FROM
>       (
>         VALUES
>           (11, ARRAY[101, 201, 301]),
>           (21, ARRAY[102, 201, 302]),
>           (13, ARRAY[103, 203, 301])
>       )
>         AS p(idx, attribute_idxs)
>     WHERE
>       p.attribute_idxs && ARRAY[201]
>     FETCH FIRST 1 ROWS ONLY
>   )]
> ;


This query only returns one row while it should actually return two:

1 {11,21,31}


The INNER JOIN version of the query will return all matching rows but also
include duplicates:

SELECT
>   u.idx,
>   u.page_idxs
> FROM
>   (
>     VALUES
>       (1, ARRAY[11, 21, 31]),
>       (2, ARRAY[12, 21, 32]),
>       (3, ARRAY[13, 23, 31])
>   )
>     AS u(idx, page_idxs)
> INNER JOIN
>   (
>     SELECT
>       p.idx
>     FROM
>       (
>         VALUES
>           (11, ARRAY[101, 201, 301]),
>           (21, ARRAY[102, 201, 302]),
>           (13, ARRAY[103, 203, 301])
>       )
>         AS p(idx, attribute_idxs)
>     WHERE
>       p.attribute_idxs && ARRAY[201]
>   )
>   AS p2
>   ON u.page_idxs && ARRAY[p2.idx]
> ;


Results:

1 {11,21,31}
> 1 {11,21,31}
> 2 {12,21,32}


As far as I know, the the IN + sub-expression query can't work since the
left side of the operation is an array of integers and the right side a set
of rows with a single integer column.
The reason I'm using integer arrays is because it is the only way I have
found in PostgreSQL to get fast inclusion / exclusion checks on large
datasets (hundreds of millions of values).
Did I misunderstand your response?
Thank you for the ongoing help,

Mickael

On Mon, May 23, 2022 at 4:45 AM Jeff Janes <[email protected]> wrote:

>
>
> On Fri, May 20, 2022 at 6:42 AM Mickael van der Beek <
> [email protected]> wrote:
>
>>
>> Query:
>>
>> EXPLAIN (
>>>   ANALYZE,
>>>   VERBOSE,
>>>   COSTS,
>>>   BUFFERS,
>>>   TIMING
>>> )
>>> SELECT
>>>   fu.w2_page_idxs
>>> FROM
>>>   fact_users
>>>     AS fu
>>> WHERE
>>>   EXISTS (
>>>     SELECT
>>>     FROM
>>>       (
>>>         SELECT
>>>           ARRAY[idx] AS page_idx
>>>         FROM
>>>           fact_pages
>>>         WHERE
>>>           attribute_idxs && ARRAY[300000160]
>>>         FETCH FIRST 1 ROWS ONLY
>>>       )
>>>         AS fp
>>>     WHERE
>>>       fu.w2_page_idxs && fp.page_idx
>>>   )
>>> ;
>>
>>
>> Without any surprises, the planner is using a sequential scan on the
>> "fact_users" table which is very large instead of using the GIN index set
>> on the "w2_page_idxs" column.
>>
>
> For me, using the subquery in and expression, instead of the EXISTS, does
> get it to use the gin index.  And I think it must give the same results.
>
> SELECT
>   fu.w2_page_idxs
> FROM  fact_users AS fu
> WHERE
>       fu.w2_page_idxs && ARRAY[(select idx from fact_pages where
> attribute_idxs && ARRAY[3003] FETCH FIRST 1 ROWS ONLY)];
>
> But why are you using intarray?  That is unnecessary here, and by creating
> ambiguity about the array operators it might be harmful.
>
> Cheers,
>
> Jeff
>
>>

-- 
Mickael van der BeekWeb developer & Security analyst

[email protected]

Re: Array of integer indexed nested-loop semi join

Reply via email to