Perhaps you can first create a temp table that contains only the records that will match? See the UNION ALL trick at http://www.mail-archive.com/[email protected]/msg01906.html
________________________________ From: Brad Ruderman <[email protected]> To: [email protected] Sent: Monday, July 29, 2013 11:38 AM Subject: Best Performance on Large Scale Join Hi All- I have 2 tables: CREATE TABLE users ( a bigint, b int ) CREATE TABLE products ( a bigint, c int ) Each table has about 8 billion records (roughly 2k files total mappers). I want to know the most performant way to do the following query: SELECT u.b, p.c, count(*) as count FROM users u INNER JOIN products p ON u.a = p.a GROUP BY u.b, p.c Right now the reducing is killing me. Any suggestions on improving performance? Would a mapbucket join be optimal here? Thanks, Brad
