Re: Skip Badly Compressed Input Files

2011-01-25 Thread Kim Vogt
sure :-) On Tue, Jan 25, 2011 at 5:54 PM, Dmitriy Ryaboy wrote: > I do it pre-pig. > I think this has to be handled at the RecordReader level if you wanted to do it in the framework. > Hey want to contribute to the error handling design discussion? :) We haven't thought about LoadFuncs y

Re: Skip Badly Compressed Input Files

2011-01-25 Thread Dmitriy Ryaboy
I do it pre-pig. I think this has to be handled at the RecordReader level if you wanted to do it in the framework. Hey want to contribute to the error handling design discussion? :) We haven't thought about LoadFuncs yet.. http://wiki.apache.org/pig/PigErrorHandlingInScripts On Tue, Jan 25, 201

Re: python udf doesnt work

2011-01-25 Thread Richard Ding
You're right. There are two issues here. First, the Jython script needs to locate the modules in its search path (e.g. python.path). If you have the right env variable set, the Jython script should be able to find and import the module. Second, Pig currently doesn't automatically ship the module file
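Richard's two points can be sketched concretely (all paths and names below are placeholders, not from the thread): point the Jython `python.path` property at the directory that holds the imported module when launching Pig, and register the UDF script itself in the Pig script. Since Pig doesn't ship the imported module automatically, the module directory must also be reachable from the task nodes.

```pig
-- launched as (hypothetical path):
--   pig -Dpython.path=/home/me/pylib myscript.pig
register 'myudfs.py' using jython as myudfs;
A = load 'input' as (line:chararray);
B = foreach A generate myudfs.my_func(line);
```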

Re: Skip Badly Compressed Input Files

2011-01-25 Thread Kim Vogt
Do you catch the error when you load with pig, or is that a pre-pig step? If I wanted to catch the error in a pig load, is it possible? Where would that code go? -Kim On Tue, Jan 25, 2011 at 4:44 PM, Dmitriy Ryaboy wrote: > Yeah so the unexpected EOF is the most common one we get (lzo requires

Re: Skip Badly Compressed Input Files

2011-01-25 Thread Dmitriy Ryaboy
Yeah so the unexpected EOF is the most common one we get (lzo requires a footer, and sometimes filehandles are closed before a footer is written, if the network hiccups or something). Right now what we do is scan before moving to the DW, and if not successful, extract what's extractable, catch the
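Dmitriy's pre-ingestion scan can be approximated with plain JDK streams: read each gzip file through to the end and treat any IOException (EOFException included, which is what a missing footer produces) as corruption. A minimal sketch; the class and method names are made up for illustration:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical pre-ingestion check: returns true only if the whole gzip
// stream decompresses cleanly. Truncated files (e.g. a writer that died
// before flushing the footer) throw EOFException (a subclass of
// IOException) partway through, which we treat as "bad".
public class GzipValidator {
    public static boolean isValid(String path) {
        try (InputStream in = new GZIPInputStream(new FileInputStream(path))) {
            byte[] buf = new byte[64 * 1024];
            while (in.read(buf) != -1) {
                // keep reading; we only care whether decompression succeeds
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String p : args) {
            System.out.println(p + "\t" + (isValid(p) ? "OK" : "CORRUPT"));
        }
    }
}
```

Files that fail the check can then be quarantined or partially re-extracted before anything is moved into the warehouse directory that Pig reads.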

Re: Skip Badly Compressed Input Files

2011-01-25 Thread Kim Vogt
This is the error I'm getting:

java.io.EOFException: Unexpected end of input stream
    at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)

Re: joining on a group?

2011-01-25 Thread felix gao
last time I checked, I don't think you can do join on groups. But that was like a year ago. On Tue, Jan 25, 2011 at 12:49 PM, Neil Kodner wrote: > I've created a relation by grouping on a composite key. I then join a > similar relation using the grouped key as the join key. > > outgoing = FOREA
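For what it's worth, the usual way to combine two relations by the same composite key without a JOIN on the group field is a single COGROUP, which groups both relations by that key in one step. A hedged sketch, assuming both relations have fields `(k1, k2, v)`:

```pig
-- COGROUP groups A and B by the same composite key in one operation,
-- which is typically what a join-on-group is after; the result has
-- the key tuple plus one bag per input relation
c = cogroup A by (k1, k2), B by (k1, k2);
```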

Re: Skip Badly Compressed Input Files

2011-01-25 Thread Dmitriy Ryaboy
How badly compressed are they? Problems in the codec, or in the data that comes out of the codec? We've had some lzo corruption problems, and so far have simply been dealing with that by doing correctness tests in our log mover pipeline before moving into the "data warehouse" area. Skipping bad f

Skip Badly Compressed Input Files

2011-01-25 Thread Kim Vogt
Hi, I'm processing gzipped compressed files in a directory, but some files are corrupted and can't be decompressed. Is there a way to skip the bad files with a custom load func? -Kim
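One coarse workaround, independent of the load func (a sketch only, and whether Pig's `set` passes this Hadoop 0.20-era property through to the job conf depends on the Pig version): let the job tolerate a percentage of failed map tasks, so a task that dies on a corrupt .gz file doesn't fail the whole job.

```pig
-- allow up to 10% of map tasks to fail (e.g. on corrupt gzip inputs)
set mapred.max.map.failures.percent 10;
```

The trade-off is that this also masks genuine failures, so it is only reasonable when losing a bad file's data is acceptable.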

Fwd: CFP - Berlin Buzzwords 2011 - Search, Score, Scale

2011-01-25 Thread Alan Gates
Begin forwarded message: From: Isabel Drost Date: January 25, 2011 12:53:28 PM PST To: "u...@mahout.apache.org" Cc: "gene...@lucene.apache.org", "gene...@hadoop.apache.org", "u...@hbase.apache.org", "solr-u...@lucene.apache.org", "java-u...@lucene.apache.org", "u...@nutch.apache.or

Re: python udf doesnt work

2011-01-25 Thread Xiaomeng Wan
Hi Daniel, I did put jython.jar in the classpath. By comparing other python udfs with this one, I find that the udfs which work do not import anything. Could that be the cause? Do I need to do anything extra to import a module in my udf? Thanks! Shawn On Mon, Jan 24, 2011 at 5:28 PM, Daniel Dai wrote: > P

Re: Unexpected data type -1 found in stream.

2011-01-25 Thread Jonathan Coveney
package squeal.fun;

import java.util.Iterator;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.util.HashMap;
import java.util.Set;
import java.util.HashSet;
import java.io.IOException;

import org.apache.pig.PigException;
import org.apache.pig.backend.executionen

Re: Unexpected data type -1 found in stream.

2011-01-25 Thread Gianmarco
From what I see, the data type of the DataBag is not correctly recognized. I guess that -1 comes from DataType.findType(), which is returning ERROR. I also assume (but I am not sure) that the concrete type of getValue() should be AccumulativeBag, but for some reason it is something different. Ma

Re: What do you guys write your pig code in?

2011-01-25 Thread Jonathan Coveney
Thanks, perfect. 2011/1/25 Alan Gates > See http://wiki.apache.org/pig/PigTools, which lists editing highlight scripts for Eclipse, emacs, TextMate, and vim. > Alan. > On Jan 25, 2011, at 10:34 AM, Jonathan Coveney wrote: > Howdy, > I think I saw in one post that some people use T

Re: What do you guys write your pig code in?

2011-01-25 Thread Alan Gates
See http://wiki.apache.org/pig/PigTools, which lists editing highlight scripts for Eclipse, emacs, TextMate, and vim. Alan. On Jan 25, 2011, at 10:34 AM, Jonathan Coveney wrote: Howdy, I think I saw in one post that some people use TextMate, but what do those among you who use Windows dev

Re: What do you guys write your pig code in?

2011-01-25 Thread Shane Eller
Not all of these are exactly what you were looking for, but there are a few highlighter plugins for the likes of Eclipse, TextMate, Emacs, and Vim. Hope it helps... http://wiki.apache.org/pig/PigTools -- Shane On 01/25/2011 01:34 PM, Jonathan Coveney wrote: Howdy, I think I saw in one post

Unexpected data type -1 found in stream.

2011-01-25 Thread Jonathan Coveney
I've been able to isolate the problem, but have no idea what is causing it. The input is in this form (this is correct): {({(a),(b),(c)}),({(a),(b),(c)}),({(a),(b),(c)})} and the output is in this form: {(b,c,3),(c,a,3),(b,a,3)} which is also correct. By placing prints and whatnot, I can see t

Custom Seq File Loader: ClassNotFoundException

2011-01-25 Thread C 4.6
Hi All, I am having ClassNotFound problems w.r.t. a custom Load function. - I am using Pig-0.7.0 with Hadoop-0.20.2 - Input to the job is a sequence file with custom key/value data - I am including the load UDF source below. Note that the UDF does not care about what is inside the sequence file
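A common cause of ClassNotFoundException with custom load functions is that the jar containing the UDF (and any custom Writable key/value classes the sequence file uses) never reaches the backend; registering it in the script usually fixes that. A sketch with placeholder names:

```pig
-- ship the jar with the custom LoadFunc and the custom key/value
-- Writable classes to the cluster (jar and class names are hypothetical)
register my-seqfile-loader.jar;
A = load '/data/seq' using com.example.MySeqFileLoader();
```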

Re: Simple Pig query returns inaccurate result size for HBase tables of 1.8m+ rows

2011-01-25 Thread Mr. Lukas
Hello again, I just found something interesting in the logs: INFO org.apache.pig.backend.hadoop.hbase.HBaseTableInputFormat: setScan with ranges: 5192296858534827628530496329220096 - 5192343374370748142029900260897474 ( 46515835920513499403931677378) But in my case, it should be more like from 1020576