Thanks! Do you have this partitioner implemented? Perhaps it would be good to try to get this into Camus as a built-in option. HivePartitioner? :)
-Ao

> On Mar 11, 2015, at 13:11, Bhavesh Mistry <mistry.p.bhav...@gmail.com> wrote:
>
> Hi Andrew,
>
> You have to implement a custom partitioner, and you will also have to
> create whatever path you choose (based on the message, e.g. the log line
> timestamp, or however you choose to build the directory hierarchy from
> each message).
>
> You will need to implement your own Partitioner class:
> https://github.com/linkedin/camus/blob/master/camus-api/src/main/java/com/linkedin/camus/etl/Partitioner.java
> and use the configuration "etl.partitioner.class=CLASSNAME"; then you can
> organize the output any way you like.
>
> I hope this helps.
>
> Thanks,
>
> Bhavesh
>
> On Wed, Mar 11, 2015 at 8:36 AM, Andrew Otto <ao...@wikimedia.org> wrote:
>
>>> e.g. file produced by the camus job:
>>> /user/[hive.user]/output/partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/
>>
>> Bhavesh, how do you get Camus to write into a directory hierarchy like
>> this? Is it reading the partition values from your messages' timestamps?
>>
>>> On Mar 11, 2015, at 11:29, Bhavesh Mistry <mistry.p.bhav...@gmail.com> wrote:
>>>
>>> Hi Yang,
>>>
>>> We do this today, camus to Hive (without Avro), just plain old
>>> tab-separated log lines.
>>>
>>> We use the hive -f command to add dynamic partitions to the Hive table.
>>>
>>> A bash shell script adds time buckets into the Hive table before the
>>> camus job runs:
>>>
>>> for partition in "${@//\//,}"; do
>>>   echo "ALTER TABLE ${env:TABLE_NAME} ADD IF NOT EXISTS PARTITION ($partition);"
>>> done | hive -f
>>>
>>> e.g. file produced by the camus job:
>>> /user/[hive.user]/output/partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09/
>>>
>>> The above will add Hive dynamic partitions before the camus job runs.
>>> It works, and you can have any schema:
>>>
>>> CREATE EXTERNAL TABLE IF NOT EXISTS ${env:TABLE_NAME} (
>>>   SOME Table FIELDS...
>>> )
>>> PARTITIONED BY (
>>>   partition_month_utc STRING,
>>>   partition_day_utc STRING,
>>>   partition_minute_bucket STRING
>>> )
>>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>>> STORED AS SEQUENCEFILE
>>> LOCATION '${env:TABLE_LOCATION_CAMUS_OUTPUT}'
>>> ;
>>>
>>> I hope this helps! You will have to construct the Hive query according
>>> to the partitions you define.
>>>
>>> Thanks,
>>>
>>> Bhavesh
>>>
>>> On Wed, Mar 11, 2015 at 7:24 AM, Andrew Otto <ao...@wikimedia.org> wrote:
>>>
>>>>> Hive provides the ability to specify custom patterns for partitions.
>>>>> You can use this in combination with MSCK REPAIR TABLE to
>>>>> automatically detect and load the partitions into the metastore.
>>>>
>>>> I tried this yesterday, and as far as I can tell it doesn't work with a
>>>> custom partition layout, at least not with external tables. MSCK REPAIR
>>>> TABLE reports that there are directories in the table's location that
>>>> are not partitions of the table, but it won't actually add the
>>>> partitions unless the directory layout matches Hive's default
>>>> (key1=value1/key2=value2, etc.).
>>>>
>>>>> On Mar 9, 2015, at 17:16, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>>>>>
>>>>> If I understood your question correctly, you want to be able to read
>>>>> the output of Camus in Hive and know the partition values. If my
>>>>> understanding is right, you can do so as follows.
>>>>>
>>>>> Hive provides the ability to specify custom patterns for partitions.
>>>>> You can use this in combination with MSCK REPAIR TABLE to
>>>>> automatically detect and load the partitions into the metastore.
>>>>>
>>>>> Take a look at this SO answer:
>>>>> http://stackoverflow.com/questions/24289571/hive-0-13-external-table-dynamic-partitioning-custom-pattern
>>>>>
>>>>> Does that help?
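The bash substitution in Bhavesh's script can be illustrated standalone: `"${@//\//,}"` replaces every `/` in each script argument with `,`, turning a key=value directory path into a Hive PARTITION spec. A minimal sketch, assuming the script is given one Camus output directory as an argument (`my_table` and the sample path are illustrative stand-ins for `${env:TABLE_NAME}` and the real output location):

```shell
#!/bin/bash
# Sketch of the substitution used in Bhavesh's loop. "my_table" and the
# sample path below are hypothetical stand-ins.

# Simulate passing one Camus output directory (key=value style) as a
# script argument:
set -- "partition_month_utc=2015-03/partition_day_utc=2015-03-11/partition_minute_bucket=2015-03-11-02-09"

# "${@//\//,}" replaces every "/" in each argument with ",":
for partition in "${@//\//,}"; do
  echo "ALTER TABLE my_table ADD IF NOT EXISTS PARTITION ($partition);"
done
# Emits:
# ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (partition_month_utc=2015-03,partition_day_utc=2015-03-11,partition_minute_bucket=2015-03-11-02-09);
```

Note that for STRING partition columns Hive generally expects quoted values (e.g. `partition_month_utc='2015-03'`), so the real arguments presumably carry quoting that the flattened email does not show.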
>>>>>
>>>>> On Mon, Mar 9, 2015 at 1:42 PM, Yang <teddyyyy...@gmail.com> wrote:
>>>>>
>>>>>> I believe many users like us export the output from camus as a Hive
>>>>>> external table, but the directory structure of camus is like
>>>>>> /YYYY/MM/DD/xxxxxx, while Hive generally expects
>>>>>> /year=YYYY/month=MM/day=DD/xxxxxx if you define that table to be
>>>>>> partitioned by (year, month, day). Otherwise you'd have to add the
>>>>>> partitions created by camus through a separate command. But in the
>>>>>> latter case, would a camus job create more than one partition? How
>>>>>> would we find out the YYYY/MM/DD values from outside? Well, you could
>>>>>> always do something with hadoop dfs -ls and then grep the output, but
>>>>>> that's not very clean...
>>>>>>
>>>>>> thanks
>>>>>> yang
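Yang's "hadoop dfs -ls and grep" idea, combined with Andrew's finding that MSCK REPAIR TABLE ignores non-key=value layouts, amounts to generating explicit ADD PARTITION ... LOCATION statements from the listed directories. A hedged sketch, not a definitive recipe: the table name and sample paths are hypothetical, and the hard-coded paths stand in for real `hadoop fs -ls` output.

```shell
#!/bin/bash
# Hypothetical sketch: register Camus' default /YYYY/MM/DD directories as
# Hive partitions with explicit LOCATION clauses, since MSCK REPAIR TABLE
# only recognizes key=value layouts. The paths below stand in for real
# `hadoop fs -ls` output; "my_table" and "mytopic" are illustrative names.

paths='/camus/mytopic/2015/03/09
/camus/mytopic/2015/03/10'

while IFS= read -r p; do
  # Pull YYYY, MM, DD out of the trailing path components.
  IFS=/ read -r y m d <<< "${p#*/mytopic/}"
  echo "ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (year='$y', month='$m', day='$d') LOCATION '$p';"
done <<< "$paths"
```

Piping the emitted statements into the hive CLI, as in Bhavesh's script, would register each day directory without renaming anything on HDFS.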