Yes, I agree with this. If you ran a Hive transform through, say, a Python
script that collected your offsets per doc id, and used "distribute by" to
ensure that the script received all of the rows for a given document, you
could then apply the logic to join what you need to join together and emit
the resulting set.
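
A minimal sketch of what such a script might look like, assuming the
tab-separated column layout of the entities_extract table quoted below
(doc_id, name, type, len, offset) and the "no more than 1 character apart"
rule from the original question. The file name merge_names.py, the function
names, and the max_gap parameter are hypothetical, and the Hive invocation in
the docstring is an untested illustration rather than a verified query.

```python
"""merge_names.py (hypothetical name): merge adjacent name tokens per document.

Assumed Hive invocation (illustrative only):

    ADD FILE merge_names.py;
    FROM (SELECT doc_id, name, type, len, offset
          FROM entities_extract
          DISTRIBUTE BY doc_id
          SORT BY doc_id, offset) t
    SELECT TRANSFORM(doc_id, name, type, len, offset)
    USING 'python merge_names.py'
    AS (doc_id, fullname, offsets);
"""
import sys
from itertools import groupby


def parse(lines):
    """Yield (doc_id, name, type, length, offset) from tab-separated rows."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 5:
            continue  # skip malformed rows
        doc_id, name, etype, length, offset = parts
        yield doc_id, name, etype, int(length), int(offset)


def merge_names(rows, max_gap=1):
    """Return {fullname: [start offsets]} for the rows of one document.

    Two consecutive tokens of the same type are combined when
    next_offset - (prev_offset + prev_length + 1) <= max_gap.
    """
    merged = {}
    current = None  # [fullname, type, start offset, end of last token]
    for _doc, name, etype, length, offset in sorted(rows, key=lambda r: r[4]):
        if (current is not None and etype == current[1]
                and offset - (current[3] + 1) <= max_gap):
            current[0] += " " + name  # extend the running name
        else:
            if current is not None:
                merged.setdefault(current[0], []).append(current[2])
            current = [name, etype, offset, 0]
        current[3] = offset + length
    if current is not None:
        merged.setdefault(current[0], []).append(current[2])
    return merged


def run(lines, out):
    """Group contiguous rows by doc_id and write one row per (full) name."""
    for doc_id, rows in groupby(parse(lines), key=lambda r: r[0]):
        for fullname, offsets in merge_names(rows).items():
            out.write("%s\t%s\t%s\n"
                      % (doc_id, fullname, ",".join(map(str, offsets))))


if __name__ == "__main__":
    run(sys.stdin, sys.stdout)
```

The sketch relies on all rows of a document arriving contiguously, which is
what "distribute by" plus "sort by" on the doc id is meant to provide. It also
emits names that were not merged; dropping those would be a small change in
run().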

On Sun, Feb 3, 2013 at 1:36 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> You may want to look at sort by, distribute by, and cluster by. This
> syntax controls which Reducers the data end up on and how it is sorted
> on each reducer.
>
> On Sun, Feb 3, 2013 at 2:27 PM, Martijn van Leeuwen
> <icodesh...@gmail.com> wrote:
> > Yes, there is. Each document has a UUID as its identifier. The actual output
> > of my map reduce job that produces the list of person names looks like this:
> >
> > docId        Name Type length offset
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea     PERSON     3     10858
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea     PERSON     3     11063
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Ken     PERSON     3     11186
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Marottoli     PERSON     9     11234
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Berkowitz     PERSON     9     17073
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea     PERSON     3     17095
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     17330
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Putt     PERSON     4     17340
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     17347
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     17480
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Putt     PERSON     4     17490
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Berkowitz     PERSON     9     19498
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     19530
> >
> > Use the following code to produce a table inside Hive.
> >
> > DROP TABLE IF EXISTS entities_extract;
> >
> >     CREATE TABLE entities_extract (doc_id STRING, name STRING, type STRING,
> >     len INT, offset BIGINT)
> >     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> >     LINES TERMINATED BY '\n'
> >     STORED AS TEXTFILE
> >     LOCATION '/research/45924/hive/entities_extract';
> >
> > LOAD DATA LOCAL INPATH
> > '/home/researcher/hadoop-runnables/files/entitie_extract_by_doc.txt'
> > OVERWRITE INTO TABLE entities_extract;
> >
> >
> >
> > On Feb 3, 2013, at 8:07 PM, John Omernik <j...@omernik.com> wrote:
> >
> > Is there something akin to a document ID so we can ensure all rows
> > belonging to the same document are sent to one mapper?
> >
> > On Feb 3, 2013 1:00 PM, "Martijn van Leeuwen" <icodesh...@gmail.com> wrote:
> >>
> >> Hi John,
> >>
> >> Here is some background about my data and what I want as output.
> >>
> >> I have 215K documents containing text. From those text files I extract
> >> names of persons, organisations and locations by using the Stanford NER
> >> library (see http://nlp.stanford.edu/software/CRF-NER.shtml).
> >>
> >> Looking at the following line:
> >>
> >> Jan Janssen was on his way to Klaas to sell the vehicle Jan Janssen stole
> >> from his father.
> >>
> >> When the classifier is done annotating, the line looks like this:
> >>
> >> <PERSON>Jan<PERSON><OFFSET>0<OFFSET>
> >> <PERSON>Janssen<PERSON><OFFSET>5<OFFSET> was on his way to
> >> <PERSON>Klaas<PERSON><OFFSET>26<OFFSET> to sell the vehicle
> >> <PERSON>Jan<PERSON><OFFSET>48<OFFSET>
> >> <PERSON>Janssen<PERSON><OFFSET>50<OFFSET> stole from his father.
> >>
> >> When looping through this annotated line you can save the persons and
> >> their offsets (please note that offset is a LONG value) inside a Map, for
> >> example:
> >>
> >> MAP<STRING, LONG> entities
> >>
> >> Jan, 0
> >> Janssen, 5
> >> Klaas, 26
> >> Jan, 48
> >> Janssen, 50
> >>
> >> Jan Janssen in the line is actually one person and not two. Jan occurs
> >> at offset 0. To determine whether Janssen belongs to Jan, I could subtract
> >> the length of Jan (3) + 1 (whitespace) from Janssen's offset (5), and if
> >> the outcome isn't greater than 1, combine the two persons into one person.
> >>
> >> (offset Janssen) - (offset Jan + length Jan + whitespace) not greater than 1
> >>
> >> If this is true, then combine the two persons and save the result inside
> >> a new MAP<STRING, LONG[]> like
> >> Jan Janssen, [ 0 ].
> >>
> >> The next time we come across Jan Janssen inside the text, we just save
> >> the offset, which produces the following MAP<STRING, LONG[]>:
> >>
> >> Jan Janssen, [0, 48]
> >>
> >> I hope this clarifies my question.
> >> If things are still unclear, please don't hesitate to ask me to clarify
> >> my question further.
> >>
> >> Kind regards,
> >> Martijn
> >>
> >> On Feb 3, 2013, at 1:05 PM, John Omernik <j...@omernik.com> wrote:
> >>
> >> Well, there are some methods that may work, but I'd have to understand
> >> your data and your constraints more. You want to be able to (as it sounds)
> >> sort by offset, and then look at one row, and then the next row, to
> >> determine if the two items should be joined. It "looks" like you are doing
> >> a string comparison between numbers ("100" to "104": there is only one
> >> "position" out of three that is different, 0 vs 4). Trouble is, look at id
> >> 3 and id 4. 150 to 160 is only one position different as well; are you
> >> looking for Klaas Jan? Also, are the ID fields filled from the first match?
> >> It seems like you have some very odd data here. I don't think you've
> >> provided enough information on the data for us to be able to help you.
> >>
> >>
> >>
> >> On Sat, Feb 2, 2013 at 1:21 PM, Martijn van Leeuwen <icodesh...@gmail.com>
> >> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I am new to Apache Hive and I am doing some tests to see if it fits my
> >>> needs. One of the questions I have is whether it is possible to "peek"
> >>> at the next row in order to find out if the values should be combined.
> >>> Let me explain with an example.
> >>>
> >>> Let's say my data looks like this:
> >>>
> >>> Id name offset
> >>> 1 Jan 100
> >>> 2 Janssen 104
> >>> 3 Klaas 150
> >>> 4 Jan 160
> >>> 5 Janssen 164
> >>>
> >>> And my output to another table should be this:
> >>>
> >>> Id fullname offsets
> >>> 1 Jan Janssen [ 100, 160 ]
> >>>
> >>> I would like to combine the name values from two rows where the offsets
> >>> of the two rows are no more than 1 character apart.
> >>>
> >>> Is this type of data manipulation possible, and if it is, could someone
> >>> point me in the right direction, hopefully with some explanation?
> >>>
> >>> Kind regards
> >>> Martijn
> >>
> >>
> >>
> >
>
