Yes, I agree with this. If you did a Hive TRANSFORM to, say, a Python script that collected your offsets per doc id, and used DISTRIBUTE BY to ensure that the script you send the data to receives all the rows it needs to work with, you could then do the logic to join what you need to join together and emit the resultant set.
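Something along these lines might work. This is only an untested sketch: the script name (merge_entities.py), the output columns, and the exact adjacency test are my own assumptions; the table and column names are taken from the CREATE TABLE further down the thread. The Hive invocation is shown in the script's header comment, and a standalone check of just the adjacency rule against the five rows from your original mail is at the bottom of this message.

#!/usr/bin/env python
# merge_entities.py -- sketch of a Hive TRANSFORM script (untested).
#
# Intended to be invoked roughly like this, so that every row for a given
# doc_id reaches the same script instance, already sorted by offset:
#
#   ADD FILE merge_entities.py;
#   FROM (
#     SELECT doc_id, name, type, len, offset
#     FROM entities_extract
#     DISTRIBUTE BY doc_id
#     SORT BY doc_id, offset
#   ) src
#   SELECT TRANSFORM (src.doc_id, src.name, src.type, src.len, src.offset)
#     USING 'python merge_entities.py'
#     AS (doc_id STRING, fullname STRING, offsets STRING);
import sys

def flush(doc_id, merged):
    # merged maps a full name ("Jan Janssen") to the list of offsets where it starts.
    for fullname, offsets in merged.items():
        print('\t'.join([doc_id, fullname, ','.join(str(o) for o in offsets)]))

current_doc, prev, merged = None, None, {}

for line in sys.stdin:
    doc_id, name, etype, length, offset = line.rstrip('\n').split('\t')
    length, offset = int(length), int(offset)

    if doc_id != current_doc:            # new document: emit the previous one
        if current_doc is not None:
            flush(current_doc, merged)
        current_doc, prev, merged = doc_id, None, {}

    # Adjacency test (my assumption, matching the 100/104 example in the
    # original mail): the gap between the end of the previous name and the
    # start of this one is at most one character, i.e. the whitespace.
    if prev is not None and offset - (prev[1] + prev[2]) <= 1:
        fullname = prev[0] + ' ' + name
        merged.setdefault(fullname, []).append(prev[1])
        prev = None                      # the pair has been consumed
    else:
        # Standalone names ("Klaas", "Ken", ...) are silently dropped here,
        # matching the example output; emit them too if you need them.
        prev = (name, offset, length)

if current_doc is not None:
    flush(current_doc, merged)

The point of DISTRIBUTE BY doc_id is that all rows for a document land on the same reducer, and SORT BY doc_id, offset means the script can merge neighbours in a single pass with only one pending row of state per document.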
On Sun, Feb 3, 2013 at 1:36 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> You may want to look at sort by, distribute by, and cluster by. This
> syntax controls which reducers the data end up on and how it is sorted
> on each reducer.
>
> On Sun, Feb 3, 2013 at 2:27 PM, Martijn van Leeuwen <icodesh...@gmail.com> wrote:
> > Yes there is. Each document has a UUID as its identifier. The actual
> > output of my map reduce job that produces the list of person names
> > looks like this:
> >
> > docId                                  Name       Type    length  offset
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Lea        PERSON  3       10858
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Lea        PERSON  3       11063
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Ken        PERSON  3       11186
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Marottoli  PERSON  9       11234
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Berkowitz  PERSON  9       17073
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Lea        PERSON  3       17095
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Stephanie  PERSON  9       17330
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Putt       PERSON  4       17340
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Stephanie  PERSON  9       17347
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Stephanie  PERSON  9       17480
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Putt       PERSON  4       17490
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Berkowitz  PERSON  9       19498
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Stephanie  PERSON  9       19530
> >
> > I use the following code to produce the table inside Hive:
> >
> > DROP TABLE IF EXISTS entities_extract;
> >
> > CREATE TABLE entities_extract (doc_id STRING, name STRING, type STRING,
> >     len INT, offset BIGINT)
> >   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> >   LINES TERMINATED BY '\n'
> >   STORED AS TEXTFILE
> >   LOCATION '/research/45924/hive/entities_extract';
> >
> > LOAD DATA LOCAL INPATH
> >   '/home/researcher/hadoop-runnables/files/entitie_extract_by_doc.txt'
> >   OVERWRITE INTO TABLE entities_extract;
> >
> > On Feb 3, 2013, at 8:07 PM, John Omernik <j...@omernik.com> wrote:
> >
> > Is there something akin to a document id so we can ensure all rows
> > belonging to the same document are sent to one mapper?
> >
> > On Feb 3, 2013 1:00 PM, "Martijn van Leeuwen" <icodesh...@gmail.com> wrote:
> >>
> >> Hi John,
> >>
> >> Here is some background about my data and what I want as output.
> >>
> >> I have 215K documents containing text. From those text files I extract
> >> names of persons, organisations and locations by using the Stanford NER
> >> library (see http://nlp.stanford.edu/software/CRF-NER.shtml).
> >>
> >> Looking at the following line:
> >>
> >> Jan Janssen was on his way to Klaas to sell the vehicle Jan Janssen
> >> stole from his father.
> >>
> >> when the classifier is done annotating, the line looks like this:
> >>
> >> <PERSON>Jan<PERSON><OFFSET>0<OFFSET>
> >> <PERSON>Janssen<PERSON><OFFSET>5<OFFSET> was on his way to
> >> <PERSON>Klaas<PERSON><OFFSET>26<OFFSET> to sell the vehicle
> >> <PERSON>Jan<PERSON><OFFSET>48<OFFSET>
> >> <PERSON>Janssen<PERSON><OFFSET>50<OFFSET> stole from his father.
> >>
> >> When looping through this annotated line you can save the persons and
> >> their offsets (please note that offset is a LONG value) inside a Map,
> >> for example:
> >>
> >> MAP<STRING, LONG> entities
> >>
> >> Jan, 0
> >> Janssen, 5
> >> Klaas, 26
> >> Jan, 48
> >> Janssen, 50
> >>
> >> Jan Janssen in the line is actually one person and not two.
> >> Jan occurs at offset 0. To determine whether Janssen belongs to Jan I
> >> could subtract the length of Jan (3) + 1 (whitespace) from Janssen's
> >> offset (5), and if the outcome is not greater than 1 then combine the
> >> two persons into one person:
> >>
> >> (offset Janssen) - (offset Jan + length + whitespace) not greater than 1
> >>
> >> If this is true then combine the two persons and save them inside a new
> >> MAP<STRING, LONG[]> like
> >>
> >> Jan Janssen, [ 0 ]
> >>
> >> The next time we come across Jan Janssen inside the text we just save
> >> the offset, which produces the following MAP<STRING, LONG[]>:
> >>
> >> Jan Janssen, [0, 48]
> >>
> >> I hope this clarifies my question.
> >> If things are still unclear please don't hesitate to ask me to clarify
> >> my question further.
> >>
> >> Kind regards,
> >> Martijn
> >>
> >> On Feb 3, 2013, at 1:05 PM, John Omernik <j...@omernik.com> wrote:
> >>
> >> Well, there are some methods that may work, but I'd have to understand
> >> your data and your constraints more. You want to be able to (as it
> >> sounds) sort by offset, then look at one row and the next row to
> >> determine whether the two items should be joined. It "looks" like you
> >> are doing a string comparison between numbers: from "100" to "104"
> >> there is only one "position" out of three that is different (0 vs 4).
> >> Trouble is, look at id 3 and id 4: 150 to 160 is only one position
> >> different as well; are you looking for Klaas Jan? Also, is the id field
> >> filled from the first match? It seems like you have some very odd data
> >> here. I don't think you've provided enough information on the data for
> >> us to be able to help you.
> >>
> >> On Sat, Feb 2, 2013 at 1:21 PM, Martijn van Leeuwen <icodesh...@gmail.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I'm new to Apache Hive and I am doing some tests to see if it fits my
> >>> needs. One of the questions I have is whether it is possible to "peek"
> >>> at the next row in order to find out if the values should be combined.
> >>> Let me explain with an example.
> >>>
> >>> Let's say my data looks like this:
> >>>
> >>> Id  name     offset
> >>> 1   Jan      100
> >>> 2   Janssen  104
> >>> 3   Klaas    150
> >>> 4   Jan      160
> >>> 5   Janssen  164
> >>>
> >>> And my output to another table should be this:
> >>>
> >>> Id  fullname     offsets
> >>> 1   Jan Janssen  [ 100, 160 ]
> >>>
> >>> I would like to combine the name values from two rows where the
> >>> offsets of the two rows are no more than 1 character apart.
> >>>
> >>> Is this type of data manipulation possible, and if it is, could someone
> >>> point me in the right direction, hopefully with some explanation?
> >>>
> >>> Kind regards,
> >>> Martijn
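P.S. For completeness, here is the adjacency rule from the thread run locally against the five rows in the original mail. Again just a sketch under the same assumption: two names are merged when the second starts at most one character after the first one ends.

# Standalone check of the adjacency rule against the rows from the original mail.
rows = [
    (1, 'Jan',     100),
    (2, 'Janssen', 104),
    (3, 'Klaas',   150),
    (4, 'Jan',     160),
    (5, 'Janssen', 164),
]

merged = {}   # full name -> list of starting offsets
prev = None   # (name, offset) of the row not yet merged

for _id, name, offset in rows:
    if prev is not None and offset - (prev[1] + len(prev[0])) <= 1:
        # e.g. "Janssen" at 104 starts one character after "Jan" (100 + 3) ends
        merged.setdefault(prev[0] + ' ' + name, []).append(prev[1])
        prev = None
    else:
        prev = (name, offset)

print(merged)   # {'Jan Janssen': [100, 160]}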