Thank you guys! I will have a look at this.
Kind regards,
Martijn
On Feb 3, 2013, at 8:36 PM, Edward Capriolo wrote:
> You may want to look at sort by, distribute by, and cluster by. This
> syntax controls which Reducers the data end up on and how it is sorted
> on each reducer.
Yes, I agree with this. If you did a Hive transform to, say, a Python script
that collected your offsets per doc id, and used "distribute by" to ensure
that the script you sent the data to had all the data to work with, you
could then do the logic to join what you need to join together and emit
the …
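A minimal sketch of what such a transform script could look like, assuming the rows reach it as tab-separated docId, name, type, length and offset; the Hive query in the comment (table and column names are invented for the example) is what would distribute by docId and sort by docId and offset:

#!/usr/bin/env python
# Sketch of a Hive TRANSFORM script. The driving query is assumed to be
# roughly (all names made up):
#
#   FROM (SELECT docid, name, type, length, offset FROM entities
#         DISTRIBUTE BY docid SORT BY docid, offset) t
#   SELECT TRANSFORM (t.docid, t.name, t.type, t.length, t.offset)
#   USING 'python merge_entities.py'
#   AS (docid, name, type, length, offset)
import sys

def process_doc(docid, rows):
    # Thanks to DISTRIBUTE BY, every row for this document is in `rows`.
    # The join/merge logic would go here; for now the rows are re-emitted as-is.
    for name, etype, length, offset in rows:
        sys.stdout.write('\t'.join([docid, name, etype, str(length), str(offset)]) + '\n')

current_doc, buffered = None, []
for line in sys.stdin:
    docid, name, etype, length, offset = line.rstrip('\n').split('\t')
    if docid != current_doc and current_doc is not None:
        process_doc(current_doc, buffered)   # finished one document, flush it
        buffered = []
    current_doc = docid
    buffered.append((name, etype, int(length), int(offset)))
if current_doc is not None:
    process_doc(current_doc, buffered)       # flush the final document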
You may want to look at sort by, distribute by, and cluster by. This
syntax controls which Reducers the data end up on and how it is sorted
on each reducer.
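A toy way to picture what those clauses do (plain Python standing in for the shuffle, not Hive itself): distribute by picks the reducer by hashing the key, sort by orders rows within each reducer, and cluster by x is shorthand for distribute by x plus sort by x. The rows and the two "reducers" below are made up:

# Made-up rows of (docid, offset) and a pretend cluster of 2 reducers.
rows = [('doc-b', 40), ('doc-a', 10), ('doc-b', 12), ('doc-a', 55)]
num_reducers = 2
partitions = [[] for _ in range(num_reducers)]

# DISTRIBUTE BY docid: the hash of the key decides which reducer gets the row.
for docid, offset in rows:
    partitions[hash(docid) % num_reducers].append((docid, offset))

# SORT BY docid, offset: each reducer sorts only its own rows.
for part in partitions:
    part.sort()

print(partitions)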
On Sun, Feb 3, 2013 at 2:27 PM, Martijn van Leeuwen wrote:
yes there is. Each document has a UUID as its identifier. The actual output of
my map reduce job that produces the list of person names looks like this:

docId                                  Name    Type    length    offset
f83c6ca3-9585-4c66-b9b0-f4c3bd57cc…
Is there something akin to a document id so we can ensure all rows
belonging to the same document can be sent to one mapper?
On Feb 3, 2013 1:00 PM, "Martijn van Leeuwen" wrote:
Hi John,
Here is some background about my data and what I want as output.
I have 215K documents containing text. From those text files I extract names
of persons, organisations and locations by using the Stanford NER library. (see
http://nlp.stanford.edu/software/CRF-NER.shtml)
Looking at t…
If you really only need to consider adjacent rows, it might just be easier
to write a UDF or use streaming, where your code remembers the last record
seen and emits a new record if you want to do the join with the current
record.
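A rough sketch of that "remember the last record" idea, assuming the rows for one document arrive sorted by offset as (name, type, length, offset) tuples; the rule used for "should be joined" (same type, and the next mention starts one character after the previous one ends) is only a guess at what fits this data:

def merge_adjacent(rows):
    # rows: (name, type, length, offset) tuples for one document, sorted by offset.
    merged = []
    last = None  # the last record seen
    for name, etype, length, offset in rows:
        if (last is not None and etype == last[1]
                and offset == last[3] + last[2] + 1):
            # Current row starts right after the previous one ends: extend it.
            last = (last[0] + ' ' + name, etype, last[2] + 1 + length, last[3])
        else:
            if last is not None:
                merged.append(last)
            last = (name, etype, length, offset)
    if last is not None:
        merged.append(last)
    return merged

A function like this could be dropped into the process_doc hook of the transform-script sketch further up, or the same logic could be packaged as a Java UDF instead.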
On Sat, Feb 2, 2013 at 1:21 PM, Martijn van Leeuwen wrote:
> Hi all
Well there are some methods that may work, but I'd have to understand your
data and your constraints more. You want to be able (as it sounds) to sort
by offset, and then look at one row, and then the next row, to
determine if the two items should be joined. It "looks" like you are
doing a …
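To make that adjacency check concrete with made-up numbers: if one row is ('Barack', PERSON, length 6, offset 100) and the next row after sorting by offset is ('Obama', PERSON, length 5, offset 107), then 100 + 6 + 1 = 107, so the second mention starts exactly where the first one ends and the two rows would be joined into ('Barack Obama', PERSON, 12, 100); a row starting at, say, offset 350 would be emitted as its own record. The +1-for-a-space rule is only an assumption about the data.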