Re: Combine multiple row values based upon a condition.

2013-02-03 Thread Martijn van Leeuwen
Thank you guys! I will have a look at this. Kind regards, Martijn On Feb 3, 2013, at 8:36 PM, Edward Capriolo wrote: > You may want to look at sort by, distribute by, and cluster by. This > syntax controls which Reducers the data end up on and how it is sorted > on each reducer. > > On Sun, Fe

Re: Combine multiple row values based upon a condition.

2013-02-03 Thread John Omernik
Yes, I agree with this. If you did a Hive transform to, say, a Python script that collected your offsets per doc id, and used "distribute by" to ensure that the script you sent the data to had all the data to work with, you could then do the logic to join what you need to join together and emit th
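
A minimal sketch of the "collect your offsets per doc id" step in Python, as it might look inside a script called from a Hive transform. It assumes the script receives tab-separated rows with the columns docId, Name, Type, length, offset (the header quoted elsewhere in this thread); the names `Entity` and `collect_by_doc` are hypothetical, and the column order is an assumption rather than something confirmed by the posters.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Entity(NamedTuple):
    """One extracted name; field order is an assumed column layout."""
    doc_id: str
    name: str
    type: str
    length: int
    offset: int


def collect_by_doc(lines: Iterable[str]) -> Dict[str, List[Entity]]:
    """Group tab-separated rows by document id and sort each group by offset."""
    by_doc: Dict[str, List[Entity]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        doc_id, name, typ, length, offset = line.split("\t")
        by_doc[doc_id].append(Entity(doc_id, name, typ, int(length), int(offset)))
    # Sort every document's rows by offset so neighbours end up adjacent.
    return {doc_id: sorted(rows, key=lambda r: r.offset)
            for doc_id, rows in by_doc.items()}
```

In a real transform script the function could be fed `sys.stdin`. Holding every row in a dictionary only works if each script instance sees a manageable amount of data; it does not rely on the input already being sorted.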

Re: Combine multiple row values based upon a condition.

2013-02-03 Thread Edward Capriolo
You may want to look at sort by, distribute by, and cluster by. This syntax controls which Reducers the data end up on and how it is sorted on each reducer. On Sun, Feb 3, 2013 at 2:27 PM, Martijn van Leeuwen wrote: > yes there is. Each document has a UUID as its identifier. The actual output > o

Re: Combine multiple row values based upon a condition.

2013-02-03 Thread Martijn van Leeuwen
Yes, there is. Each document has a UUID as its identifier. The actual output of my map reduce job that produces the list of person names looks like this: docId Name Type length offset f83c6ca3-9585-4c66-b9b0-f4c3bd57cc

Re: Combine multiple row values based upon a condition.

2013-02-03 Thread John Omernik
Is there something akin to a document ID so we can ensure all rows belonging to the same document are sent to one mapper? On Feb 3, 2013 1:00 PM, "Martijn van Leeuwen" wrote: > Hi John, > > Here is some background about my data and what I want as output. > > I have 215K documents containin

Re: Combine multiple row values based upon a condition.

2013-02-03 Thread Martijn van Leeuwen
Hi John, Here is some background about my data and what I want as output. I have 215K documents containing text. From those text files I extract names of persons, organisations and locations by using the Stanford NER library. (see http://nlp.stanford.edu/software/CRF-NER.shtml) Looking at t

Re: Combine multiple row values based upon a condition.

2013-02-03 Thread Dean Wampler
If you really only need to consider adjacent rows, it might just be easier to write a UDF or use streaming, where your code remembers the last record seen and emits a new record if you want to do the join with the current record. On Sat, Feb 2, 2013 at 1:21 PM, Martijn van Leeuwen wrote: > Hi all
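
The "remember the last record seen" idea can be sketched in Python as below. This is an illustration under stated assumptions, not the poster's code: rows are assumed to arrive already grouped by document and sorted by offset, and two rows are treated as joinable when they share a document and type and the second starts at most `max_gap` characters after the first ends. That join condition, and the names `Entity` and `merge_adjacent`, are hypothetical; the thread does not spell out the actual condition.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Entity(NamedTuple):
    """One extracted name; field order is an assumed column layout."""
    doc_id: str
    name: str
    type: str
    length: int
    offset: int


def merge_adjacent(rows: Iterable[Entity], max_gap: int = 1) -> Iterator[Entity]:
    """Merge consecutive rows that look like parts of one name.

    Assumes rows are sorted by (doc_id, offset). The adjacency rule
    is an assumption made for this sketch.
    """
    prev: Optional[Entity] = None
    for row in rows:
        if (prev is not None
                and row.doc_id == prev.doc_id
                and row.type == prev.type
                and 0 <= row.offset - (prev.offset + prev.length) <= max_gap):
            # Extend the remembered record instead of emitting it.
            prev = Entity(prev.doc_id,
                          prev.name + " " + row.name,
                          prev.type,
                          row.offset + row.length - prev.offset,
                          prev.offset)
        else:
            if prev is not None:
                yield prev  # previous record is complete
            prev = row
    if prev is not None:
        yield prev  # flush the last remembered record
```

Only one record is held in memory at a time, which is why this approach suits streaming; the trade-off is that it depends entirely on the input order.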

Re: Combine multiple row values based upon a condition.

2013-02-03 Thread John Omernik
Well, there are some methods that may work, but I'd have to understand your data and your constraints more. You want to be able to (as it sounds) sort by offset, and then look at one row, and then the next row, to determine if the two items should be joined. It "looks" like you are doing a