[GENERAL] Enforcing Parameterised Nested Loop Join Order for Foreign Table Joins

Adam Zegelin Sun, 17 Mar 2013 21:10:27 -0700

Hello,

I’m in the process of writing a Postgres FDW that can interface with web 
service endpoints. Certain FDW columns would act as web service parameters, 
while others would be the output.


For example:

    adam=# select * from bing where query = 'xbox';
     query |               url               |     description
    -------+---------------------------------+---------------------
     xbox  | http://www.xbox.com/en-AU/index | Xbox Australia is
       ⋮   |                ⋮                |           ⋮

For the simple cases, extracting the quals (such as var [query] = const “xbox”) 
works perfectly.

I’d like to join FDW tables with other tables, possibly local or foreign.
ex: `select * from search_terms, bing where bing.query = search_terms.term`, 
where `search_terms` is a local table.

Without any parameterised paths, Postgres, as expected, will attempt to perform 
unfiltered foreign sequence scans followed by a hash join of the two tables. 
Some service endpoints have no concept of unqualified queries. In the example 
above, a ‘sequence scan’ of Bing is a not possible.

I generate parameterised paths inside the FDW handler function 
`GetForeignPaths`. I call `create_foreignscan_path` with a set of req_outer 
relids found by scanning PlannerInfo’s eq_classes, left_join_clauses and 
right_join_clauses.

        Bitmapset* outer_relids = NULL;

        foreach(lc, root->eq_classes) {
            EquivalenceClass* ec = (EquivalenceClass *) lfirst(lc);
            if (ec->ec_members->length > 1)
                outer_relids = bms_union(outer_relids, ec->ec_relids);
        }

        foreach(lc, list_union(root->left_join_clauses, 
root->right_join_clauses)) {
            RestrictInfo *ri = (RestrictInfo *) lfirst(lc);
            outer_relids = bms_union(outer_relids, ri->outer_relids);
        }

        Bitmapset* req_outer = bms_difference(outer_relids,
                                              
bms_make_singleton(baserel->relid));

        foreignPath = create_foreignscan_path(root, baserel, nbrows, 0, 0, NIL, 
req_outer, NULL);


For simple joins this works. `BeginForeignScan` can access a list of quals with 
params, and the planner generates an appropriate nested loop join over the 
foreign table.

    adam=# explain select * from search_terms, bing where bing.query = 
search_terms.term;
                                  QUERY PLAN
    -----------------------------------------------------------------------
     Nested Loop  (cost=0.00..49.30 rows=6550 width=96)
       ->  Seq Scan on search_terms  (cost=0.00..23.10 rows=1310 width=32)
       ->  Foreign Scan on bing  (cost=0.00..0.00 rows=2 width=64)
             Filter: (search_terms.term = query)

Even a query over two foreign tables works correctly:

    adam=# create foreign table ft1 (inp text, out text) server test;
    adam=# create foreign table ft2 (inp text, out text) server test;

    adam=# explain select * from ft1, ft2 where ft1.inp = ‘hello’ and ft2.inp = 
ft1.out;
                                  QUERY PLAN
    ----------------------------------------------------------------------
     Nested Loop  (cost=0.00..500020.00 rows=5000 width=128)
       ->  Foreign Scan on ft1  (cost=0.00..500000.00 rows=1000 width=64)
             Filter: (inp = 'hello'::text)
       ->  Foreign Scan on ft2  (cost=0.00..0.00 rows=2 width=64)
             Filter: (ft1."out" = inp)


But, on a more complex query consisting of multiple foreign tables the planner 
generates something a little less desirable:

    adam=# create foreign table ft3 (inp text, out text) server test;

    adam=# explain select * from ft1, ft2, ft3 where ft1.inp = 'hello' and 
ft2.inp = ft1.out and ft3.inp = ft2.out;
                                        QUERY PLAN
    
----------------------------------------------------------------------------------
     Nested Loop  (cost=500012.50..1000290.00 rows=25000 width=192)
       ->  Hash Join  (cost=500012.50..1000190.00 rows=5000 width=128)
             Hash Cond: (ft1."out" = ft2.inp)
             ->  Foreign Scan on ft1  (cost=0.00..500000.00 rows=1000 width=64)
                   Filter: (inp = 'hello'::text)
             ->  Hash  (cost=500000.00..500000.00 rows=1000 width=64)
                   ->  Foreign Scan on ft2  (cost=0.00..500000.00 rows=1000 
width=64)
       ->  Foreign Scan on ft3  (cost=0.00..0.00 rows=2 width=64)
             Filter: (ft2."out" = inp)

The high total costs are the result of my attempts to coerce the planner to 
select the parameterised paths and generate filtered foreign scans rather than 
preferring unfiltered foreign scans.

I’ve tried adjusting the query planner tuneables (enable_hashjoin, et al) and 
the path costs with some degree of success, but often the generated plans will 
filter the tables in the wrong order -- the output column of table 1 will be 
filtered by the input column of table 2 -- which is technically correct as the 
operation should be associative and transitive, but in this case, table 2 must 
be filtered by the output of table 1, not vice-versa.

Is there a way to convince Postgres to always generate nested loop joins 
without sub merges, hashes or materialisations and in the correct order?

I’m thinking that the current method I’m using to generate the foreign scan 
path doesn’t take into account the required order and which fields can and 
cannot be parameterised. I’m not sure how to proceed with this.

Thanks,
Adam

smime.p7s
Description: S/MIME cryptographic signature

[GENERAL] Enforcing Parameterised Nested Loop Join Order for Foreign Table Joins

Reply via email to