Hello,

Our data is skewed, so we are using a 'skewed' join, but the join is still taking a long time. From the documentation, it appears that Pig samples the data and creates a file that is passed via the 'pig.keyDistFile' config property. It also appears that, for our data, this sample is somewhat biased.
Our data for this join is fairly static, and we think we can create a better sample. Our questions are:

1) If we pass -Dpig.keyDistFile=/path/to/our data, would that work?
2) Is the format of this file "key, from, to"? E.g. key1, 0, 4 (would key1 then be distributed to the first 4 reducers?)
3) We want to specify only 10 keys in this file. The rest can go through normal processing, so should we just omit them from this file?
4) Feel free to say this is a terrible idea and we shouldn't do it :) But then please suggest a better idea ;)

Thanks for your time.
