I think the fix is to change
tuple.set(0, new DataByteArray(url));
to
tuple.set(0, url);
Thanks,
Aniket
On Fri, April 22, 2011 8:30 pm, Steve Watt wrote:
> Richard, if you're coming to OSCON or Hadoop Summit, please let me know
> so I can buy you a beer. Thanks for the help. This now works with the
>
If the expected return type of your loader is (String, String, String), you
should just put Strings into the tuple (no conversion to DataByteArrays) and
report your schema to Pig via an implementation of LoadMetadata.getSchema().
D
On Fri, Apr 22, 2011 at 5:30 PM, Steve Watt wrote:
> Richard, if
Richard, if you're coming to OSCON or Hadoop Summit, please let me know so I
can buy you a beer. Thanks for the help. This now works with the excite
log using PigStorage();
It is however still not working with my custom LoadFunc and data. For
reference, I am using Pig 0.8. I have written a cus
Hi Daniel,
I did test to see that it was fixed, and the description (as in
the jira) did not directly seem to apply to this issue (when I did a
cursory search) - hence the query.
Since the columns were getting re-aliased (and after a join in one
case), I was not expecting initial aliase
raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
query:chararray);
queries = FILTER raw BY (INDEXOF(query,'yahoo') >= 0);
dump queries;
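For readers who want to see what that FILTER does row by row, here is a rough plain-Java sketch of the same check. It is not Pig's INDEXOF implementation; the class name, method name and sample rows are made up for illustration, and it assumes tab-separated lines with the query in the third field, as in the LOAD statement above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the Pig script above; not Pig code.
public class IndexOfFilterSketch {

    // Keeps the lines whose third tab-separated field contains the needle,
    // i.e. the rows for which indexOf(...) >= 0.
    static List<String> filterByQuery(List<String> lines, String needle) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\t", -1);
            if (fields.length < 3) {
                continue; // malformed row: no query field to test
            }
            if (fields[2].indexOf(needle) >= 0) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented sample rows in (user, time, query) order.
        List<String> sample = Arrays.asList(
                "u1\t970916105432\tyahoo chat",
                "u2\t970916001949\tweather");
        System.out.println(filterByQuery(sample, "yahoo"));
    }
}
```

Note that the check only works when the field really is a string, which is why the query column is declared as chararray in the script.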
On 4/22/11 2:25 PM, "Steve Watt" wrote:
Hi Folks
I've done a load of a dataset and I am attempting to filter out unwanted
records by c
Hi Folks
I've done a load of a dataset and I am attempting to filter out unwanted
records by checking that one of my tuple fields contains a particular
string. I've distilled this issue down to the sample excite.log that ships
with Pig for easy recreation. I've read through the INDEXOF code and I
Hi, Mridul,
Sorry, I was confused when you said "alias re-use" :). PIG-1705 happens if
the same column is eventually used twice in a relation. Here in z {m::k,
m::v, y::aa, y::data}, both m::k and y::aa can be traced back to m.k. I
did try PIG-1705 and verified that it is the cause. The patch is n
I may be misunderstanding what you are asking. The tricky part is measuring
MR time *without* wait time, which one cannot control (it depends mostly on
the size and utilization level of your cluster). This tricky bit is what
PigStats helps you with.
If you just want to measure the full time, includ
Follow-up question: how do you add it to the cache in a Pig script, and once
it's in there, can you access it from the UDF using regular Java file I/O?
That is, is it as simple as saying:
copyFromLocal $localFilePath udfFile.txt
DEFINE someudf org.someudf CACHE('udfFile.txt#udfFile.txt');
And the
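On the "regular Java file I/O" part of the question, a minimal sketch follows. It assumes the cached file really does show up in the task's working directory under the name given after the '#' (udfFile.txt here); the thread does not confirm that this happens for a UDF defined this way. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper a UDF could call; not part of Pig.
public class CacheFileReaderSketch {

    // Reads every line of the named file, resolved relative to the
    // current working directory when the name is not absolute.
    static List<String> readLines(String name) {
        try {
            return Files.readAllLines(Paths.get(name), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Surface a missing or unreadable file as an unchecked error.
            throw new UncheckedIOException(e);
        }
    }
}
```

Inside the UDF this would be called as CacheFileReaderSketch.readLines("udfFile.txt"), ideally once and kept in a field rather than on every call.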
I think I may have to go with your second option - but thanks for the info,
I'll keep an eye on 0.9.0.
On Thu, Apr 21, 2011 at 4:16 PM, Alan Gates wrote:
> Starting with Pig 0.9 (not yet released but you can build it off the
> branch) a UDF can specify a file to put in the distributed cache. Yo
It is the alias vs. relation difference.
The bug is about an alias issue, not a relation, IIRC.
Everything comes from a limited number of relations, which are loaded
anyway :-)
- Mridul
On Friday 22 April 2011 06:40 AM, Jianyong Dai wrote:
m is actually reused. z is joining two relations both stemming from m.