That is probably true, regarding needing skewed joins, but our users
rarely encounter those situations, and I have never knowingly done so
- though I may have worked my way round one or two without knowing it.

As to the bugs, I used the dev branch for a long time, so my
recommendations are colored by that, and some are owing to
peculiarities with our storage UDFs. I don't remember exactly, but from
around 0.2 and for a few releases after, lots of features were on the
books but were really 'Yahoo only.' :D

Pig is much more stable now. I should try more features and be more expressive.

On Jan 7, 2011, at 11:22 AM, Dmitriy Ryaboy <[email protected]> wrote:

> Would love to see what bugs you are running into with skewed and replicated
> joins.
> I use them all the time to great effect.
>
> You are correct in saying putting biggest relation to the left in regular
> joins is effective but totally wrong when saying it's the same thing as
> skewed join; we've encountered queries that are simply not possible to
> finish without skewed join.
>
> D
>
> On Thu, Jan 6, 2011 at 11:44 PM, Russell Jurney
> <[email protected]> wrote:
>
>> I wrote this up for LinkedIn Hadoop Users today, figured it was worth
>> sharing.  If you have any other tips, or edits, please submit and I'll put
>> these in a wiki some place:
>>
>> /* Russell's philosophy of Pig:
>>  1) The pig is powerful, but cannot be trusted. His nature is perverse. He
>> will eat anything, and his diet affects his mood.
>>  In time, you will understand his nature.  In the meantime, do as little
>> as possible at each step - each line of Pig code.
>>  Don't tempt the pig, for he will fuck your world, going from a tool that
>> enables you to a tool that gores you.
>>  2) Whenever possible, do all similar operations on a relation together in
>> one step. In one foreach.
>>  3) When operations on a single field of a relation become too chained,
>> too complex to read, do not be ashamed to break them into
>>  two foreaches, one right after the other. Pig is smart enough to combine
>> those into one job.
>>  4) Each GROUP BY/FOREACH is an MR job.  Each LOAD/FOREACH is an MR job.
>> Pig 0.8 will hint at this by telling you the relations
>>  used in each job, but it is helpful to yourself as you edit your code
>> later, and to others that follow, to label your scripts
>>  as you learn to infer which Pig code chunks correspond to which Pig
>> lines.  Examples of this are below.
>>  5) Always strip relation names, even if it means another
>> FOREACH/GENERATE.  For example, after a JOIN your relations may have two
>>  namespaces like one_type::thing and another_type::gearbox. Take the time
>> to do:
>>
>>     relation = FOREACH relation GENERATE one_type::thing AS thing,
>>                                          another_type::gearbox AS gearbox;
>>
>>   Either you, or the person inheriting your code, will thank you later.
>>
>>  6) Pig Latin Code Highlighting/Formatting. If you aren't using TextMate
>> http://macromates.com/ to edit your Pig Latin code... shoot yourself in
>> the foot.
>>  Facilities has pistols and first-aid kits suitable for this masochism.
>> Stop the bleeding, then download TextMate here:
>> http://download.macromates.com/TextMate_1.5.10.zip
>>  and make a helpdesk ticket for a permanent licence here:****** Then,
>> download the
>>  Pig Latin syntax and install it by pasting the commands listed here into
>> your shell, and restarting TextMate:
>> http://tommy.chheng.com/index.php/2009/09/pig-textmate-bundle/
>>  7) Never use the special variables $0/$1/$2 to represent a field, unless
>> you are creating a throwaway script.  In that case, do not save the script.
>>
>>  8) If your code fails, you are probably being too clever for the Pig.
>> Back off doing things in combination, starting with the point of failure,
>>  working back.  Add steps when sensible.  Think like the pig parser - and
>> know that there are really 5 or so Pig parsers, each one thinking a little
>>  differently.
>>  9) Use ILLUSTRATE if it works.  Complain bitterly to *** if it does not.
>> And give *** or
>>  *** or *** the evil eye, then learn their material weaknesses and bribe
>> them to fix Pig's parser to make it work.
>>
>>  In the absence of ILLUSTRATE, use:
>>     foo = SAMPLE my_relation 0.01; STORE foo INTO '/tmp/foo5125'; cat /tmp/foo5125
>>       OR
>>     foo = SAMPLE my_relation 0.01; foo = LIMIT foo 100; DUMP foo;
>>     -- Be aware that this sorts the 'sample' and makes it less random.
>>
>>  11) Format your Pig code so that lists of things being generated line up
>> with CRs after each thing, as below.  This makes it readable.
>>  12) Always put the smaller dataset to the left in a JOIN. This puts it in
>> RAM if possible, resulting in a 10-1000x performance improvement.
>>  The other join types are often much buggier, so I personally never use
>> them.
>>  13) There are undocumented limitations to Pig.  If you run into a
>> problem, search Pig's JIRA: https://issues.apache.org/jira/browse/PIG
>>  Do not feel afraid to file bugs, after you email
>> [email protected].  We have contributed enough to the Pig project,
>> both through UDFs, steering feedback, events and marketing, that they are
>> sensitive to our needs if we make reasonable requests.
>>  14) PUT YOUR UDFs IN THE NEW PIGGYBANK.  It is on github and is on
>> wilbur.  https://github.com/wilbur/Piggybank  Fork the project, git clone
>> it, add your UDF, and do a pull request.  Email me at [email protected]
>> when you do so and I will immediately approve it.  Congrats - you've
>> contributed to the pig project!
>> */
>>
