query returns sometext instead of none

2011-02-09 Thread Cam Bazz
Hello, I am making a query such that: insert overwrite table selection_hourly_clicks partition (date_hour = PARTNAME) select sel_sid, count(*) cc from (select split(parse_url(iv.referrer_url,'PATH'), '_')[1] sel_sid from item_raw iv where iv.date_hour='PARTNAME' AND iv.referrer_url is not null AN

Re: filtering out crawlers

2011-02-09 Thread Wil -
Hi, There are quite a few databases online with known robots. http://www.robotstxt.org/db.html and http://www.botsvsbrowsers.com/category/1/index.html come to mind. The hardest part is figuring out the suspect robots which do not identify themselves. From: Ca

RE: for each partition

2011-02-09 Thread Christopher, Pat
CB: does the dynamic partitioning fill your need? I don't totally understand it but if it does, awesome. Otherwise there isn't a for/each construct in HiveQL. You'd have to write an external program. I'm curious though, do you have to reprocess each partition each day or is there a partitio

partitioned views

2011-02-09 Thread John Sichi
One of the impediments for uptake of the CREATE VIEW feature in Hive has been the lack of partition awareness. This made it non-transparent to replace a table with a view, e.g. for renaming purposes. To address this as well as some other use cases, I'm proposing the first steps towards view pa

Re: periodic execution

2011-02-09 Thread Alejandro Abdelnur
Hi Cam, A bit of information that may be useful for you, Cloudera's Oozie has a Hive action that you can use from workflow jobs. Cheers Alejandro On Wed, Feb 9, 2011 at 11:44 AM, Cam Bazz wrote: > Hello, > > I am looking over oozie's coordinator. But meanwhile, I managed to > write a simple j

Re: periodic execution

2011-02-09 Thread Appan Thirumaligai
Try Azkaban - We use it here @ngmoco to run MR Jobs (not Hive Queries) and it's pretty good - http://sna-projects.com/azkaban/ Also, it is quicker to learn and easy to set up. I have never worked on Oozie so I can't compare but you can google it. On Feb 8, 2011, at 7:44 PM, Cam Bazz wrote: > Hello

Re: for each partition

2011-02-09 Thread Namit Jain
You can use dynamic partitioning: insert overwrite table item_view_aggregate partition (date_hour) select iv.sid, count(*), date_hour from item_view iv where (iv.date_hour='2011310116' or date_hour=''' or date_hour='.) group by iv.sid, date_hour; On 2/9/11 5:49 AM, "Cam Bazz" wrote: >We

Re: for each partition

2011-02-09 Thread Cam Bazz
Well, I designed my dataflow to work incrementally based on partitions. But I have a number of datafiles now, and for the first run, I have to run, for example: insert overwrite table item_view_aggregate partition (date_hour=2011310116) select iv.sid, count(*) from item_view iv where iv.date_hour='2011
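
Since the thread notes that HiveQL has no for/each construct and that an external program is needed, here is a minimal sketch of what such a program might look like. It only builds the statement text, one per partition. The helper name `build_statements`, the second partition value, and the trailing `group by iv.sid` are assumptions for illustration; the quoted query is cut off in the archive, so its real ending is unknown.

```python
# Hypothetical helper: generate one INSERT OVERWRITE statement per partition.
# The table and column names come from the quoted message; the
# "group by iv.sid" clause is an assumption, because the original
# query is truncated in the archive.
TEMPLATE = (
    "insert overwrite table item_view_aggregate "
    "partition (date_hour={part}) "
    "select iv.sid, count(*) from item_view iv "
    "where iv.date_hour='{part}' group by iv.sid;"
)


def build_statements(partitions):
    """Return one HiveQL statement per partition value, in input order."""
    statements = []
    for part in partitions:
        # Partition values in the thread are plain digit strings; reject
        # anything else so no stray quotes end up in the query text.
        if not part.isdigit():
            raise ValueError("unexpected partition value: %r" % part)
        statements.append(TEMPLATE.format(part=part))
    return statements


if __name__ == "__main__":
    # "2011310116" appears in the thread; "2011310117" is a placeholder.
    for stmt in build_statements(["2011310116", "2011310117"]):
        print(stmt)
```

The printed statements could be saved to a file and run with `hive -f`. The dynamic partitioning approach suggested elsewhere in the thread may avoid the loop entirely.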