Re: [DISCUSS] Apache Pig bylaws

2010-10-05 Thread Alan Gates
Comments inlined. However, I feel like we're getting stuck in a rathole on this one issue of consensus and 2/3's votes. So I would like to ask two questions now: 1) Are there any other issues besides voting we feel should be discussed before we move to a vote? 2) For those who have expre

[VOTE] Bylaws for the Pig project

2010-10-07 Thread Alan Gates
I propose that we adopt the bylaws proposed at http://wiki.apache.org/pig/ProposedByLaws as the bylaws for the Pig project. In a self referential use of these bylaws I further propose that this vote will be open for 10 business days and require +1 votes from two thirds of PMC members (which

Re: Pig Streaming with Python Scripts

2010-10-08 Thread Alan Gates
I don't think Pig understands that this is a Python script. What happens if you put #!/bin/python (or whatever is appropriate in your system) at the beginning of your GroupStreamer? Alternatively you could explicitly call python on this file in your command by saying STREAM test THROUGH `

Re: Statistics optimizer

2010-10-14 Thread Alan Gates
AFAIK no one is working on that currently. Our next thoughts on optimizer improvement were to start using the new optimizer framework in the MR optimizer so we can bring some order to the madness of visitors that is the MR optimizer. I think Thejas plans on starting work on that in 0.9.

Re: [VOTE] Bylaws for the Pig project

2010-10-15 Thread Alan Gates
Alan. On Oct 7, 2010, at 9:22 AM, Alan Gates wrote: I propose that we adopt the bylaws proposed at http://wiki.apache.org/pig/ProposedByLaws as the bylaws for the Pig project. In a self referential use of these bylaws I further propose that this vote will be open for 10 business days and requi

Re: UDF Loader - one line in input result in multiple tuples

2010-10-27 Thread Alan Gates
The easiest way to do this might be to have your loader return a single tuple that contains a bag, with all of the tuples you want to return in that bag. Then your next statement can be a foreach with a flatten to turn each of those into its own record. A = load 'foo' as (b:bag{}); B = forea
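A rough Python sketch of the idea above, not Pig code: the loader hands back one tuple whose only field is a bag, and a flatten step then turns each tuple in the bag into its own record. The `load_line` function and the comma-separated input are hypothetical stand-ins for whatever the real loader parses.

```python
def load_line(line):
    """Hypothetical loader step: one input line becomes one tuple holding a bag."""
    bag = [(token,) for token in line.split(",")]  # each bag element is itself a tuple
    return (bag,)


def flatten(records):
    """Mimic `foreach ... generate flatten(b)`: emit one record per tuple in the bag."""
    for (bag,) in records:
        for inner in bag:
            yield inner


loaded = [load_line("a,b,c"), load_line("d")]
print(list(flatten(loaded)))  # [('a',), ('b',), ('c',), ('d',)]
```

The loader still returns exactly one tuple per call, which is why this shape fits an interface that only allows a single return value per input line.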

Re: UDF Loader - one line in input result in multiple tuples

2010-10-28 Thread Alan Gates
On Oct 28, 2010, at 8:36 AM, John Hui wrote: I look into the return data bag as an option. The problem is the Loader interface require me to return a Tuple object. public Tuple getNext() throws IOException { but the DataBag interface is not a derive class of Tuple so this means I will

Re: Reporting progress in a storage UDF

2010-10-28 Thread Alan Gates
Progress is reported by Pig operators, so in general other operators in your pipeline should be reporting progress so that the store function does not need to. The store function is not passed a reference to the progress reporter, so it would not be able to report progress anyway. The pro

Re: Registering Jars from HDFS?

2010-11-01 Thread Alan Gates
It is integrated into Pig in 0.8. You could back port the patch into 0.7 if you wanted. The JIRA with the patch is https://issues.apache.org/jira/browse/PIG-1505 . Alan. On Nov 1, 2010, at 12:54 PM, Zach Bailey wrote: Is it possible to feed a path of the format "hdfs:///path/to/ my.j

Re: Finding records with a given prefix

2010-11-02 Thread Alan Gates
Basically you want to join on a regular expression, correct? Unfortunately Map Reduce (and thus Pig) is spectacularly bad at non-equijoins. Is 'prefixes' small enough to fit in memory? If so, you could write a UDF that loaded it into memory and did the comparison. This way the join woul
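The in-memory comparison suggested above could look roughly like the following Python sketch. It is only a model of the logic; a real Pig UDF would load the prefix list from a file, and the names `make_prefix_filter` and `has_known_prefix` are made up for illustration.

```python
def make_prefix_filter(prefixes):
    """Return a predicate that keeps the (small) prefix list in memory."""
    prefix_tuple = tuple(prefixes)  # str.startswith accepts a tuple of candidates

    def matches(value):
        return value.startswith(prefix_tuple)

    return matches


has_known_prefix = make_prefix_filter(["ab", "xy"])
records = ["abacus", "zebra", "xylophone"]
print([r for r in records if has_known_prefix(r)])  # ['abacus', 'xylophone']
```

Because the prefix list is held by the function itself, each record is checked where it is read and no shuffle on a join key is needed.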

Fwd: Hadoop workshop at SARA, December 7

2010-11-03 Thread Alan Gates
Begin forwarded message: From: Evert Lammerts Date: November 3, 2010 9:13:08 AM PDT To: Evert Lammerts Subject: Hadoop workshop at SARA, December 7 Reply-To: "gene...@hadoop.apache.org" What: SARA Apache Hadoop computing workshop When: Tuesday, December 7th Where: SARA, Science Park 121 Am

Re: Any update on when Pig 0.8 will be release

2010-11-05 Thread Alan Gates
We've been running the 0.8 branch in a beta mode at Yahoo for a couple of months and have found it stable. The big bugs we have found we've fixed and committed to the branch. Alan. On Nov 5, 2010, at 8:52 AM, Robert Goodman wrote: Is there any updated on when pig 0.8 will be release? Ther

Re: Implementation of ORDER and LIMIT

2010-11-15 Thread Alan Gates
POSort is only used for sorts of bags in memory (such as sort inside a foreach) not top level sorts. In both cases the physical operators only capture part of the actual operations, since much of the work is done by the Hadoop framework. Very briefly, order by works by taking a sample of t

Re: How to make PIG delete its temporary files ?

2010-11-29 Thread Alan Gates
Pig is supposed to remove all these temporary files, as long as the java process finishes in such a way that it has a chance to clean up (ie, no one does a kill -9 on it or something). Can you file a JIRA with a reproducible case so we can track this down and fix it? Alan. On Nov 22, 2010

Re: Easy question...difference between this::form and this.form?

2010-12-06 Thread Alan Gates
The reason it's needed is that ambiguities would result otherwise. A = load 'foo' as (x, y, z); B = load 'bar' as (w, x, y, z); C = join A by x, B by x; D = filter C by z > 0; -- which z? As long as the name is not ambiguous, the :: is not required. So in the above example it would be perfec
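A small Python model of the rule described above, assuming the joined relation carries field names of the form alias::field. The `resolve` function is purely illustrative and is not how Pig implements name resolution.

```python
def resolve(name, schema):
    """Resolve a field name against a joined schema with 'alias::field' entries."""
    if name in schema:  # already fully qualified
        return name
    matches = [f for f in schema if f.split("::")[-1] == name]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise KeyError(f"no field named {name!r}")
    raise ValueError(f"{name!r} is ambiguous: {matches}")


# Schema of C = join A by x, B by x, using the relations from the example above.
joined = ["A::x", "A::y", "A::z", "B::w", "B::x", "B::y", "B::z"]
print(resolve("w", joined))     # B::w, since only B has a w
print(resolve("A::z", joined))  # A::z, qualified names always resolve
```

Calling `resolve("z", joined)` raises `ValueError`, which corresponds to the "which z?" question in the example.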

Re: questions about dependent jars of func in piggybank

2010-12-08 Thread Alan Gates
On Dec 8, 2010, at 12:16 AM, Lin Guo wrote: Hi, All, I want to check in some functions into piggybank and have some questions about dependent jars: 1. it depends on some new jars, where should I add them? updating ivy.xml under trunk to include them? That's what we've done so far. 2

Re: How to divide by the minimum number in a set in Pig?

2010-12-14 Thread Alan Gates
Actually, in 0.8 the code you give will work, if you cast min_generated to an int. 0.8 is in the release process now. Are you in a position to use new code? Alan. On Dec 14, 2010, at 10:32 AM, Jonathan Coveney wrote: I'm not sure if Pig can handle this...perhaps in this specific case th

Re: How to divide by the minimum number in a set in Pig?

2010-12-14 Thread Alan Gates
as stable as you'd want right at the moment. Alan. On Dec 14, 2010, at 10:52 AM, Jonathan Coveney wrote: I can use new code, yes. If I simply use the dev version of pig, will it support this then? 2010/12/14 Alan Gates Actually, in 0.8 the code you give will work, if you cast m

Re: Pig 0.8.0 is released!

2010-12-20 Thread Alan Gates
See https://issues.apache.org/jira/browse/INFRA-3288. Alan. On Dec 18, 2010, at 5:42 PM, Eli Collins wrote: Congrats guys! Any ideas when this will be reflected in the git repo? There's no 0.8 release tag and the last change in the trunk and branch-0.8 branches is from Oct 5th. Thanks, Eli

Re: Master thesis about Hive/Pig/MapReduce

2011-01-04 Thread Alan Gates
Hi Michal, A couple of areas where you could study performance without duplicating Robert Stewart's work come to mind. One is in the area of how data skew affects performance. This is a very real world concern since in my experience almost all input data is power law distributed. Consi

Re: Taking advantage of structure when doing UDFs and whatnot?

2011-01-04 Thread Alan Gates
Answers inline. On Jan 4, 2011, at 11:10 AM, Jonathan Coveney wrote: I wasn't quite sure what title this, but hopefully it'll make sense. I have a couple of questions relating to a query that ultimately seeks to do this You have 1 10 1 12 1 15 1 16 2 1 2 2 2 3 2 6 You want your output to

Re: Taking advantage of structure when doing UDFs and whatnot?

2011-01-04 Thread Alan Gates
. We haven't yet added the ability for them to extend the Algebraic and Accumulator interfaces. Alan. The internal sort in the foreach and the using 'collected' (assuming I can get it to work :) should be big wins. 2011/1/4 Alan Gates Answers inline. On Jan 4, 2011, a

Re: Iterative MapReduce with PIG

2011-01-10 Thread Alan Gates
This is one of our major initiatives for 0.9. See http://wiki.apache.org/pig/TuringCompletePig and https://issues.apache.org/jira/browse/PIG-1479. But until that's ready you'll have to use Java or piglet as recommended by Dmitriy. Alan. On Jan 10, 2011, at 3:09 AM, deepak@wipro.com w

Re: wild card for all fields in a tuple

2011-01-12 Thread Alan Gates
There isn't a way to do that yet. See https://issues.apache.org/jira/browse/PIG-1693 for our plans on adding it in the next release. Alan. On Jan 12, 2011, at 2:51 PM, Dexin Wang wrote: Hi, Hope there is some simple answer to this. I have bunch of rows, for each row, I want to add a colu

Re: wild card for all fields in a tuple

2011-01-12 Thread Alan Gates
Jonathan is right, you can do all fields in a tuple with *. I was thinking of doing all fields in between two fields, which you can't do yet. Alan. On Jan 12, 2011, at 3:18 PM, Alan Gates wrote: There isn't a way to do that yet. See https://issues.apache.org/jira/browse/PIG

Re: Custom partitioning and order for optimum hbase store

2011-01-24 Thread Alan Gates
Do you want to order the groups or just within the groups? If you want to order within the groups you can do that in Pig in a single job. Alan. On Jan 24, 2011, at 1:20 PM, Dmitriy Lyubimov wrote: Thanks. So i take there's no way in pig to specify custom partitioner And the ordering in one

Re: Custom partitioning and order for optimum hbase store

2011-01-24 Thread Alan Gates
er in the PARTITIONED BY clause. I guess what would really solve the problem is custom partitioner in the ORDER BY. so using GROUP would just be a hack. On Mon, Jan 24, 2011 at 1:28 PM, Alan Gates wrote: Do you want to order the groups or just within the groups? If you want to order

Re: What do you guys write your pig code in?

2011-01-25 Thread Alan Gates
See http://wiki.apache.org/pig/PigTools, which lists editing highlight scripts for Eclipse, emacs, TextMate, and vim. Alan. On Jan 25, 2011, at 10:34 AM, Jonathan Coveney wrote: Howdy, I think I saw in one post that some people use TextMate, but what do those among you who use Windows dev

Fwd: CFP - Berlin Buzzwords 2011 - Search, Score, Scale

2011-01-25 Thread Alan Gates
Begin forwarded message: From: Isabel Drost Date: January 25, 2011 12:53:28 PM PST To: "u...@mahout.apache.org" Cc: "gene...@lucene.apache.org", "gene...@hadoop.apache.org", "u...@hbase.apache.org", "solr-u...@lucene.apache.org", "java-u...@lucene.apache.org", "u...@nutch.apache.or

Re: Unexpected data type -1 found in stream.

2011-01-26 Thread Alan Gates
On Jan 26, 2011, at 10:17 AM, Jonathan Coveney wrote: I have never really had to raise a bug before, what should I do? Open a ticket, attach the code and the description, and posit that it may be a serializable error? Yes. It would make me feel so good if there was a bug, though...I'v

Re: Efficient ways to do non-equijoins?

2011-01-27 Thread Alan Gates
The script you propose will work, but if your data is of even reasonable size it will be very slow. A quick search of the web turned up one paper with an algorithm for parallel non-equijoins that at first glance might work in your case. Alan. On Jan 26, 2011, at 5:15 PM, Jonathan Coveney

Re: Efficient ways to do non-equijoins?

2011-01-27 Thread Alan Gates
This time with the link to the paper: http://www.vldb.org/conf/1991/P443.PDF :) Alan. On Jan 27, 2011, at 8:48 AM, Alan Gates wrote: The script you propose will work, but if your data is of even reasonable size it will be very slow. A quick search of the web turned up one paper with an

Re: How would you implement a custom join?

2011-01-28 Thread Alan Gates
Depending on the join algorithm you may be able to implement it with cogroup, a custom UDF, and possibly a custom partitioner. I haven't finished reading the band join algorithm paper I sent a link for, but I suspect it requires some records to be duplicated (since records within the band

[VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-02 Thread Alan Gates
Howl is a table management system built to provide metadata and storage management across data processing tools in Hadoop (Pig, Hive, MapReduce, ...). You can learn more details at http://wiki.apache.org/pig/Howl . For the last six months the code has been hosted at github. The Howl team

Re: [VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-02 Thread Alan Gates
d On Feb 2, 2011, at 1:18 PM, Alan Gates wrote: Howl is a table management system built to provide metadata and storage management across data processing tools in Hadoop (Pig, Hive, MapReduce, ...). You can learn more details at http://wiki.apache.org/pig/Howl . For the last six months th

Re: [VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-02 Thread Alan Gates
rote: On Wed, Feb 2, 2011 at 5:08 PM, Jeff Hammerbacher wrote: Awesome! Huge +1. On Wed, Feb 2, 2011 at 1:18 PM, Alan Gates wrote: Howl is a table management system built to provide metadata and storage management across data processing tools in Hadoop (Pig, Hive, MapReduce, ...). You ca

Re: [VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-03 Thread Alan Gates
Alan, I see your points. I agree with you and I am +1. (incubator/subproject is not important to me) You mentioned that hive is cautious about checking changes into the meta-store. I would not say we (hive) are cautious. Hive is getting pulled by many people in many directions (this is a goo

Re: [VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-03 Thread Alan Gates
Yes, it adds Input and Output formats for MapReduce and load and store functions for Pig. In the future we expect it will continue to add more layers. Alan. On Feb 3, 2011, at 2:49 PM, John Sichi wrote: But Howl does layer on some additional code, right? https://github.com/

Re: [VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-03 Thread Alan Gates
at 3:11 PM, Alan Gates wrote: Yes, it adds Input and Output formats for MapReduce and load and store functions for Pig. In the future we expect it will continue to add more layers. Alan. On Feb 3, 2011, at 2:49 PM, John Sichi wrote: But Howl does layer on some additional

Re: [VOTE] Sponsoring Howl as an Apache Incubator project

2011-02-08 Thread Alan Gates
With 8 +1 votes and no -1s, the vote passes. Alan. On Feb 2, 2011, at 1:18 PM, Alan Gates wrote: Howl is a table management system built to provide metadata and storage management across data processing tools in Hadoop (Pig, Hive, MapReduce, ...). You can learn more details at http

Sponsoring Howl as an incubator project

2011-02-08 Thread Alan Gates
Last week I sent an email proposing that we, the Pig project, sponsor Howl as an incubator project. You can see the thread at http://tinyurl.com/4acfut4 . However, in proposing this I did not realize that I was also proposing that Howl should become a Pig subproject upon graduation from the

Re: PARALLEL INSIDE a nested foreach block / DEFAULT_PARALLEL not workin?!

2011-02-11 Thread Alan Gates
Possible, but it will be ignored. Anything done inside a nested foreach block will be executed at the parallel level of the preceding group by. Alan. On Feb 11, 2011, at 8:57 AM, Charles Gonçalves wrote: Is it possible to use a parallel statement inside a nested foreach block like in : 28

Re: PARALLEL INSIDE a nested foreach block / DEFAULT_PARALLEL not workin?!

2011-02-11 Thread Alan Gates
tube.com/watch?v=hMtZfW2z9dw>no space to process everything) I tried Group by something and worked. Could be some optimization issue!? On Fri, Feb 11, 2011 at 3:10 PM, Alan Gates wrote: Possible, but it will be ignored. Anything done inside a nested foreach block will be executed at t

Re: Pig 0.8: DESCRIBE and DUMP are in disagreement after a GROUP BY and a FLATTEN

2011-02-15 Thread Alan Gates
The issue here is that describe is incorrectly removing the second level of tuple, even though dump is doing the right thing. I tested it against the top-of-trunk code, and describe now does the right thing. I suspect this is a side effect of the semantics work that's been going on (see htt

Re: Parameter to Pig Script contains forward slash - Script treats it as a Division operator. Escape not working

2011-02-15 Thread Alan Gates
If you pass '10/24/2010' instead of 10/24/2010 (that is, with single quotes), I think it will work. You'll have to be careful to actually get the single quotes through your shell, which will try to eat them. Alan. On Feb 13, 2011, at 12:29 PM, Arun A K wrote: Hi I am passing a parameter
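One way to check that the single quotes survive the shell is to build the command line programmatically. This Python sketch uses the standard `shlex` module; the parameter name `date` and the file name `script.pig` are hypothetical, since the original script is not shown in the thread.

```python
import shlex

value = "'10/24/2010'"   # the single quotes must reach Pig intact
param = "date=" + value  # `date` is a hypothetical parameter name
command = "pig -param " + shlex.quote(param) + " script.pig"
print(command)

# Round trip: splitting the command the way a POSIX shell would
# gives back the parameter with its single quotes unchanged.
print(shlex.split(command)[2] == param)  # True
```

Typing the command by hand works too, as long as the inner single quotes are escaped or wrapped in double quotes so the shell does not consume them.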

Re: Cases of Work using pig on Industry to cite on my MSc

2011-02-19 Thread Alan Gates
There have been talks given at the Bay Area HUGs about how people use Pig. I know for example Yahoo Mail did one on how it uses Pig for spam detection. Presentations for those talks are posted to Yahoo's Hadoop blog: http://developer.yahoo.com/blogs/hadoop/ Alan. On Feb 19, 2011, at 1:1

Fwd: March 2011 San Francisco Hadoop User Meetup ("integration")

2011-02-24 Thread Alan Gates
Begin forwarded message: From: Aaron Kimball Date: February 23, 2011 2:07:28 PM PST To: "gene...@hadoop.apache.org", "common-u...@hadoop.apache.org" Subject: March 2011 San Francisco Hadoop User Meetup ("integration") Reply-To: "common-u...@hadoop.apache.org" Hadoop fans, I'm please

Re: Using the DistributedCache with pig

2011-02-28 Thread Alan Gates
https://issues.apache.org/jira/browse/PIG-1752 has a patch (checked into trunk) that allows UDFs to store files in the distributed cache. Alan. On Feb 28, 2011, at 2:43 PM, Jonathan Coveney wrote: I was just curious if anyone might have a decent code example of using the distributed cache

Re: Using dynamic invokers (InvokeForString)

2011-03-01 Thread Alan Gates
IIRC Java won't let you update the classpath on the fly (for security reasons I think). But giving a better error message would definitely be good. Alan. On Mar 1, 2011, at 10:04 AM, Dmitriy Ryaboy wrote: patches accepted :-) D On Tue, Mar 1, 2011 at 10:02 AM, Dan Brickley wrote:

Re: Shared resources

2011-03-02 Thread Alan Gates
There is no method in the eval func that gets called on the backend before any exec calls. You can keep a flag that tracks whether you have done the initialization so that you only do it the first time. Alan. On Mar 2, 2011, at 5:29 AM, Lai Will wrote: Hello, I wrote a EvalFunc implement
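The flag-based initialization described above might be sketched as follows. This is plain Python rather than a real EvalFunc: the `LookupFunc` class, its `exec_` method, and the in-memory table are all hypothetical, and `init_calls` exists only so the example can show that setup runs once.

```python
class LookupFunc:
    """Hypothetical stand-in for an EvalFunc; `exec_` plays the role of exec()."""

    def __init__(self):
        self._initialized = False
        self._table = None
        self.init_calls = 0  # only here so the example can show init runs once

    def _initialize(self):
        # Expensive one-time setup would go here (for example, reading a side file).
        self._table = {"a": 1, "b": 2}
        self.init_calls += 1
        self._initialized = True

    def exec_(self, key):
        if not self._initialized:  # the flag guards the one-time setup
            self._initialize()
        return self._table.get(key)


func = LookupFunc()
print([func.exec_(k) for k in ["a", "b", "c"]], func.init_calls)  # [1, 2, None] 1
```

The setup cost is paid on the first call only; every later call just checks the flag.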

Re: Shared resources

2011-03-02 Thread Alan Gates
. Alan. On Mar 2, 2011, at 7:54 AM, Lai Will wrote: So I still get the redundant work whenever the same clusternode/vm creates multiple instances of my EvalFunc? And is it usual to have several instance of the EvalFunc on the same clusternode/vm? Will -Original Message- From: Alan Gates

Re: Shared resources

2011-03-02 Thread Alan Gates
ginal Message- From: Alan Gates [mailto:ga...@yahoo-inc.com] Sent: Wednesday, March 02, 2011 5:17 PM To: user@pig.apache.org Subject: Re: Shared resources There is no shared inter-task processing in Hadoop. Each task runs in a separate JVM and is locked off from all other tasks. This is p

Re: Hadoop Version.

2011-03-02 Thread Alan Gates
When Hadoop 0.21 was released it was declared not production worthy. We don't generally port Pig to a version of Hadoop until it is production worthy. Alan. On Mar 2, 2011, at 3:25 PM, Jane Chen wrote: Hi, I noticed that Pig 0.8 runs on Hadoop 0.20.2. Is there any plan to upgrade to 0

Re: [DISCUSSION] Pig.next

2011-03-03 Thread Alan Gates
I agree that there will probably need to be several 0.9.x releases as the new optimization and parser work mature. As a consequence of this it may be longer between 0.9 and Pig.next than there has been between the last few releases. That only delays the question of what we call Pig.next,

Re: frontend & backend

2011-03-03 Thread Alan Gates
Frontend is the machine you launch Pig from. This is where all planning is done. The backend is the Hadoop cluster where your jobs are executed. Alan. On Mar 3, 2011, at 2:03 PM, Lai Will wrote: Hello, When skimming through the API I sometimes find docs like: "Get the JobConf. This sho

Re: PerformanceTimerFactory error?

2011-03-08 Thread Alan Gates
What version of Pig are you using? PerformanceTimerFactory should only be used in debugging code, it shouldn't be involved in a regular usage. Alan. On Mar 8, 2011, at 6:50 AM, Jonathan Coveney wrote: I have a rather large query that took quite a while to execute (11hours, probably on t

Fwd: First Hadoop meetup in Houston

2011-03-09 Thread Alan Gates
Begin forwarded message: From: Mark Kerzner Date: March 7, 2011 7:37:38 PM PST To: Hadoop Discussion Group Subject: First Hadoop meetup in Houston Reply-To: "common-u...@hadoop.apache.org" Hi, I have just created the Houston Hadoop Meetup group, and all suggestions are welcome. http

Re: Converting Pig DataTypes to Java Data Types

2011-03-10 Thread Alan Gates
Apache lists don't allow attachments. Instead you can file a JIRA and attach your code there. Alan. On Mar 10, 2011, at 3:04 PM, Jonathan Holloway wrote: I ran into an issue tonight with parsing log lines whereby I had to generate a schema in a user defined function. Part of that involved

Re: loading data

2011-03-11 Thread Alan Gates
Check out the XMLLoader in Piggybank. http://wiki.apache.org/pig/PiggyBank Alan. On Mar 10, 2011, at 3:44 PM, Andrew Hammond wrote: Hi all, Our application that generates a single xml file per transaction. I was wondering what the best practice is for getting this kind of data loaded int

Re: DUMP or STORE Depending on Parameter Input

2011-03-15 Thread Alan Gates
If you know before you start the script which you want, you can use parameter substitution: A = load 'foo'; ... Z = foreach Y generate ...; $DO_OUTPUT Then, depending on which you want, run pig with pig -pDO_OUTPUT='dump Z;'; or pig -pDO_OUTPUT="store Z into 'outfile';" If you want to deci
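To see what the substitution above amounts to, here is a Python model of it. Pig does this itself when given the parameter; `string.Template` is used here only as a stand-in because it happens to use the same `$NAME` placeholder style, and the two-line script body is invented for the example.

```python
from string import Template

# Stand-in for the script text; $DO_OUTPUT is the parameter from the example above.
SCRIPT = Template("A = load 'foo';\nZ = foreach A generate *;\n$DO_OUTPUT\n")


def render(do_output):
    """Return the script text after replacing $DO_OUTPUT."""
    return SCRIPT.substitute(DO_OUTPUT=do_output)


print(render("dump Z;"))
print(render("store Z into 'outfile';"))
```

The last line of the rendered script is either a dump or a store, so the choice is fixed before the script is parsed.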

Re: Schema?

2011-03-17 Thread Alan Gates
Currently there is no way to specify the schema for values in the map up front. You have to cast them when you bring them out of the map. We hope to resolve that in 0.9. Alan. On Mar 17, 2011, at 2:11 AM, deepak kumar v wrote: I have a UDF , the output is a tuple of the following format

Re: question about Pig UDF

2011-03-17 Thread Alan Gates
It will be instantiated multiple times; once for each map or reduce (depending on which it is in). Pig itself also constructs your UDF during planning on the machine you launch your job on. Alan. On Mar 17, 2011, at 11:12 AM, souri datta wrote: Hi, If in a UDF , say in the constructo

Re: question about Pig UDF

2011-03-17 Thread Alan Gates
will be running simultaneously). Alan. On Mar 17, 2011, at 11:37 AM, souri datta wrote: so if i make the list static also, it will be created multiple times as each instance will be created in different machine's JVM . is that correct? On Thu, Mar 17, 2011 at 11:59 PM, Alan Gates

Re: possibly Pig throttles the number of mappers

2011-03-23 Thread Alan Gates
What version of Pig are you using? Starting in 0.8 Pig will combine small blocks into a single map. This prevents jobs that actually are reading small amounts of data from taking a lot of slots on the cluster. You can turn this off by adding -Dpig.noSplitCombination=true to your command

Re: Anti-Joins

2011-03-24 Thread Alan Gates
A = load 'input1' as (x, y); B = load 'input2' as (u, v); C = cogroup A by x, B by u; D = filter C by IsEmpty(B); E = foreach D generate flatten(A); Alan. On Mar 24, 2011, at 4:28 PM, mike st. john wrote: Are there any examples of Anti-Joins using Pig. Thanks Msj
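For readers who want to trace what the cogroup, filter, and flatten steps above do, here is a Python model of the same logic. It is a sketch only: the rows are invented, the first field of each row plays the role of the key (x for A, u for B), and `anti_join` is not a Pig function.

```python
from collections import defaultdict


def anti_join(a_rows, b_rows):
    """Return rows of A whose key (first field) has no match in B."""
    groups = defaultdict(lambda: ([], []))  # key -> (bag of A rows, bag of B rows)
    for row in a_rows:
        groups[row[0]][0].append(row)
    for row in b_rows:
        groups[row[0]][1].append(row)
    # Keep groups where the B bag is empty, then flatten the A bag.
    return [row for a_bag, b_bag in groups.values() if not b_bag for row in a_bag]


A = [(1, "p"), (2, "q"), (3, "r"), (3, "s")]
B = [(2, "u"), (4, "v")]
print(anti_join(A, B))  # [(1, 'p'), (3, 'r'), (3, 's')]
```

Key 4 appears only in B, so its A bag is empty and flattening it contributes no rows.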

Please welcome Xuefu as our latest committer

2011-03-29 Thread Alan Gates
It's my pleasure to announce that Pig has added Xuefu as our newest committer. He has been working on pig for several months now. In particular he has done the bulk of the work for moving the Pig Latin parser from javacc to antlr as part of 0.9. Please join me in welcoming him. Alan.

Re: Pig on CDH? or roll your own

2011-03-29 Thread Alan Gates
Cloudera's distribution is nice in that they bundle and test it together with Hadoop and other related tools, so you get the whole suite and you know it works together. The downside is that there is a lag between when Pig does a release and Cloudera picks it up, so you have to wait a few m

Re: Unexpected data type -1 found in stream

2011-03-29 Thread Alan Gates
What are you putting in as the value of the map? From the error it looks like you're passing in a datatype Pig doesn't understand, and when it tries to write it out to the screen it doesn't know how to. Alan. On Mar 29, 2011, at 1:34 PM, Xavier Stevens wrote: I'm currently getting a really

Re: Unexpected data type -1 found in stream

2011-03-29 Thread Alan Gates
types it has and get back to you. Thanks Alan. On 3/29/11 1:56 PM, Alan Gates wrote: What are you putting in as the value of the map? From the error it looks like you're passing in a datatype Pig doesn't understand, and when it tries to write it out to the screen it doesn't know ho

Re: Unexpected data type -1 found in stream

2011-03-29 Thread Alan Gates
3/29/11 1:56 PM, Alan Gates wrote: What are you putting in as the value of the map? From the error it looks like you're passing in a datatype Pig doesn't understand, and when it tries to write it out to the screen it doesn't know how to. Alan. On Mar 29, 2011, at 1:34 PM, Xavier Stev

Re: jline and commons-lang - building 0.8.0 download

2011-03-31 Thread Alan Gates
Isn't ivy picking it up for you? That's what is supposed to happen. Alan. On Mar 28, 2011, at 11:32 AM, Jeremy Hanna wrote: Is there a standard way to get jline and commons-lang into pig? I work around by copying them into my build/ivy/lib/Pig directory but didn't know if there was a simp

Re: Geographically Distributed Hadoop Cluster

2011-04-05 Thread Alan Gates
The concern I have with that approach is I don't think you can guarantee that Hadoop will never assign tasks to read from the geographically distributed nodes. At Yahoo we have separate Hadoop clusters in separate geographic locations and use tools such as distcp to move data between them.

Re: Geographically Distributed Hadoop Cluster

2011-04-05 Thread Alan Gates
Same thing as Yahoo or same thing as Deepak suggests? Alan. On Apr 5, 2011, at 2:05 PM, Dmitriy Ryaboy wrote: Google apparently does the same with GFS / MR, at least by my reading of the Megastore paper. On Tue, Apr 5, 2011 at 9:59 AM, Alan Gates wrote: The concern I have with that

Re: Processing fixed length records with Pig

2011-04-06 Thread Alan Gates
PigStorage (the default load and store function) does not handle this case. You would need to write your own load function. Using PigStorage as a model this should not be too difficult. Instead of looking for field separators you just parse out the fields based on length, and then use sta
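The length-based parsing step could be sketched like this in Python. It only models the field-splitting logic, not a Pig load function; the widths in `WIDTHS` and the sample record are hypothetical.

```python
def parse_fixed(line, widths):
    """Cut `line` into fields of the given widths, trimming padding."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return tuple(fields)


WIDTHS = [5, 3, 8]  # hypothetical record layout: name, code, date
print(parse_fixed("alan 04220110406", WIDTHS))  # ('alan', '042', '20110406')
```

A real loader would apply the same slicing to each record it reads and wrap the fields in a tuple.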

Re: Setup error: cannot find Hadoop configurations in classpath even when classpath is correctly set

2011-04-12 Thread Alan Gates
Set the environment variable PIG_CLASSPATH=/opt/hadoop/conf. Alan. On Apr 12, 2011, at 1:39 PM, W.P. McNeill wrote: I am having trouble getting Pig to see my Hadoop configuration files despite following the "Classpath in MapReduce Mode" instructions in the Troubleshooting

Re: About Pig Joins

2011-04-13 Thread Alan Gates
You can also take a look at the beta of the _Programming Pig_ book, http://ofps.oreilly.com/titles/9781449302641/ Specialized joins are described in chapter six. How to do semi-joins is also described under Cogroup in the same chapter. Any feedback you have on the descriptions would be wel

Re: Filter on contents of other dataset

2011-04-15 Thread Alan Gates
Is your comparison function equals or is there some transformation that could be applied to hdata and skey so it could be equals? If so you could use semi join instead, which should be much more efficient. Alan. On Apr 14, 2011, at 8:21 PM, Aniket Mokashi wrote: Hi, What would be the bes

Re: pig mismatch hadoop version problem

2011-04-15 Thread Alan Gates
Pig 0.8 only works with Hadoop 0.20.2. There is no version of Pig that works with 0.19 out of the box. Pig 0.4, patched as explained here https://issues.apache.org/jira/browse/PIG-573 can be made to work with 0.19. Alan. On Apr 15, 2011, at 2:32 AM, fengcheng he wrote: Hi friends,

Re: JSONToTuple for pig UDF

2011-04-19 Thread Alan Gates
On Apr 19, 2011, at 11:44 AM, Daniel Eklund wrote: A quick question about the UDF's registered at the top of a pig script: does REGISTER myJar.jar distribute the jar across HDFS (like a Hadoop job jar) so that the distribution of the code to the cluster nodes is transparent? In other words

Re: Question about bags and UDFs

2011-04-21 Thread Alan Gates
Starting with Pig 0.9 (not yet released but you can build it off the branch) a UDF can specify a file to put in the distributed cache. You could thus have your UDF pick up the file locally on your box and put it in the distributed cache, and then read it from the distributed cache on the b

Requesting feedback on _Programming Pig_ book

2011-04-23 Thread Alan Gates
As you may know, I am working on a book _Programming Pig_ for O'Reilly. You can view what I've written so far at http://ofps.oreilly.com/titles/9781449302641/ and leave feedback. This is updated regularly as I check in new sections and chapters. Currently I have mostly finished the intr

Re: SAMPLE after a GROUP BY

2011-04-25 Thread Alan Gates
You are not insane. Pig rewrites sample into filter, and then pushes that filter in front of the group. It shouldn't push that filter since the UDF is non-deterministic. If you add "-t PushUpFilter" to your command line when invoking pig this won't happen. Could you file a JIRA for this

Re: TOP ordering

2011-04-26 Thread Alan Gates
topResults = foreach D { srtd = order A by second; top3 = limit srtd 3; generate flatten(top3); }; Alan. On Apr 26, 2011, at 6:11 AM, ugo jardonnet wrote: Hi. I am looking for a way to get the result of top ordered. Is it possible ? Example: A = LOAD 'datatest' USING

Error Executing a Fragment Replicated Join

2011-04-26 Thread Alan Gates
Sent for Renato, since Apache's mail system has decided it doesn't like him. Alan. I am getting an error while trying to execute a simple fragment replicated join on two files (one of 77MB and the other one of 32MB). I am using the 32MB file as the small one to be replicated, but I keep ge

Re: TOP ordering

2011-04-26 Thread Alan Gates
this will give the bottom 3 records. You'll need to change it to 'srtd = order A by second desc;' to get the top 3. Alan. On Apr 26, 2011, at 9:36 AM, ugo jardonnet wrote: 2011/4/26 Alan Gates topResults = foreach D { srtd = order A by second; top3
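The effect of the descending sort can be checked with a small Python model of the nested foreach: group the rows, sort each group descending on the second field, keep three, and flatten. The rows and the `top_n_per_group` function are made up for illustration.

```python
from collections import defaultdict


def top_n_per_group(rows, n=3):
    """Group rows by first field, sort each group descending on the second, keep n."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    result = []
    for bag in groups.values():
        srtd = sorted(bag, key=lambda r: r[1], reverse=True)  # order ... desc
        result.extend(srtd[:n])  # limit n, then flatten
    return result


rows = [("a", 5), ("a", 9), ("a", 1), ("a", 7), ("b", 2), ("b", 8)]
print(top_n_per_group(rows))
# [('a', 9), ('a', 7), ('a', 5), ('b', 8), ('b', 2)]
```

Dropping `reverse=True` reproduces the original behaviour, where the limit keeps the three smallest values in each group.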

Re: Error Executing a Fragment Replicated Join

2011-04-27 Thread Alan Gates
Dmitriy tried to reply and say: "Renato, can you send along the pig script and pig version?" but his message got blocked too. Alan. On Apr 27, 2011, at 3:42 PM, Renato Marroquín Mogrovejo wrote: Does anybody have any suggestions? Please??? Thanks again. Renato M. 2011/4/26 Alan Gates Sent for

Re: Confusion looking at source for PigStorage

2011-04-28 Thread Alan Gates
Originally it used a regular expression. At some point we changed that to a single character because it was much faster than a regex. Apparently we missed a spot in the documentation when we made the change. Alan. On Apr 28, 2011, at 8:30 AM, Jonathan Coveney wrote: I'm sure this is we

Pig user meetup at Yahoo 6/30

2011-04-29 Thread Alan Gates
All, Yahoo has offered to host a Pig user meetup on June 30th, the day after the Hadoop summit. It has also volunteered me to plan it. So, I'd like to know what would interest people. What should we focus on? What will be the best way to enable Pig users and developers to connect and

Fwd: [hadoop] Hadoop Summit 2011 by Yahoo!: June 29th, Santa Clara Convention Center. Register and submit abstract for presentation: www.hadoopsummit.org

2011-05-02 Thread Alan Gates
Begin forwarded message: From: Avik Dey Date: May 2, 2011 12:02:52 PM PDT To: "'gene...@hadoop.apache.org'" Subject: [hadoop] Hadoop Summit 2011 by Yahoo!: June 29th, Santa Clara Convention Center. Register and submit abstract for presentation: www.hadoopsummit.org Reply-To: "gene...@ha

Re: Understanding incompatibilities with different versions of hadoop?

2011-05-03 Thread Alan Gates
We, the Yahoo Pig team, test Pig against 0.20.2 Hadoop and the internal Yahoo version of Hadoop (hopefully soon to be released through Apache as 0.20.203). My impression of CDH3 was that it was very close to 0.20.203 with HDFS append added. The Cloudera guys would better be able to answer

Fwd: 33 Days left to Berlin Buzzwords 2011

2011-05-04 Thread Alan Gates
Begin forwarded message: From: Simon Willnauer Date: May 4, 2011 12:38:42 AM PDT To: java-user , java-dev >, "lucy-u...@incubator.apache.org" u...@incubator.apache.org>, "gene...@lucene.apache.org" >, "us...@elasticsearch.com" , "connectors-u...@incubator.apache.org " , "common-u...@hadoop.a

Re: run pig in mode hadoop

2011-05-10 Thread Alan Gates
Are you asking how to point Pig to your Hadoop cluster in grid5000? If so, by default Pig runs in Hadoop mode. In order to point it towards a particular cluster you need to set the PIG_CLASSPATH environment variable to the directory where your mapred-site.xml and hdfs-site.xml configurati

Re: Working with an unknown number of values

2011-05-10 Thread Alan Gates
TOKENIZE takes a string and returns a bag. Its issue is that right now it only allows you to split on whitespace. It would make sense to generalize this to take a delimiter. Alan. On May 7, 2011, at 7:55 PM, Jacob Perkins wrote: Dmitriy, I see your point. It would definitely be nice to ha
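A minimal sketch of the behavior described above, assuming a hypothetical input file with one chararray field per line. TOKENIZE yields a bag of words per record, and FLATTEN turns that bag into one record per word.

```pig
-- Hypothetical input: one line of text per record.
lines = LOAD 'input.txt' AS (line:chararray);
-- One bag of tokens per input line (split on whitespace, as noted in the thread).
tokens = FOREACH lines GENERATE TOKENIZE(line) AS words;
-- One output record per token.
flat = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
```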

Re: Pig 0.7 download mirror sites not working

2011-05-12 Thread Alan Gates
Hadoop has removed the release artifacts of its former subprojects (including Pig) from the mirrors. You can still find the release in Apache's archive: http://archive.apache.org/dist/hadoop/pig/pig-0.7.0/ Alan. On May 12, 2011, at 9:16 AM, Subhramanian, Deepak wrote: The mirror sites I

Re: Explain Plan in Pig

2011-05-12 Thread Alan Gates
http://ofps.oreilly.com/titles/9781449302641/developing_and_testing.html#dev_tools Alan. On May 12, 2011, at 4:32 PM, sonia gehlot wrote: Hi Guys, Can anyone please tell me how to read Explain plan in pig? When I do explain plan for any of my pig query it gives me really good flow diagram,

Welcome to Aniket Mokashi

2011-05-19 Thread Alan Gates
Please join me in welcoming Aniket Mokashi as a new committer on Pig. Aniket has been contributing to Pig since last summer. He wrote or helped shepherd several major features in 0.8, including the Python UDF work, the new mapreduce functionality, and the custom partitioner. We look forw

Re: General question about pig

2011-05-20 Thread Alan Gates
Pig Latin programs get translated into a set of physical operators which are placed in MapReduce jobs. Whether a particular operator ends up in a map or a reduce depends on a number of factors. We presented a paper at VLDB a few years ago that goes into this in detail, http://www.vldb.org

Re: how to operate a map type

2011-05-24 Thread Alan Gates
Can't you mimic dynamic key support with static keys by making your map have two static keys 'key' and 'value'? Alan. On May 24, 2011, at 3:05 AM, Jameson Li wrote: OK.OK.I know that just write UDFs. I have to write UDFs, and see you.. And I still think there should be grammar support fo
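One way to read this suggestion, as a sketch under assumptions: each record carries a map with exactly two static keys, 'key' and 'value', so the "dynamic" key becomes ordinary data that can be filtered on. The relation names, the input path, and the 'color' literal are hypothetical.

```pig
-- Hypothetical input: each record holds a map such as [key#color,value#red].
pairs = LOAD 'pairs' AS (id:chararray, m:map[]);
-- Pull the two static keys out into regular fields.
kv = FOREACH pairs GENERATE id, (chararray)m#'key' AS k, (chararray)m#'value' AS v;
-- The former map key is now a field, so it can be selected at run time.
colors = FILTER kv BY k == 'color';
```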

Fwd: Spring Scale-A-Thon RTP 2011 -- June 18th

2011-05-24 Thread Alan Gates
Begin forwarded message: From: Grant Ingersoll Date: May 20, 2011 5:43:12 AM PDT To: "gene...@hadoop.apache.org" Subject: Spring Scale-A-Thon RTP 2011 -- June 18th Reply-To: "gene...@hadoop.apache.org" Those in the Raleigh/Durham/Chapel Hill North Carolina area or those willing to travel
