Yes, you are correct. Performance will be better with 24 columns. Better still, add one more column (a 25th) to keep the daily aggregate for all 24 hours. This will give a further performance boost.
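A minimal sketch of what this could look like as Phoenix DDL (the table name AD_DAILY and column names are hypothetical, not from the thread; IMP03 through IMP23 are elided for brevity):

```sql
-- Hypothetical wide-row schema: one row per (ADID, COUNTRY, DAYID),
-- 24 hourly impression columns plus a 25th column for the daily total.
CREATE TABLE IF NOT EXISTS AD_DAILY (
    ADID      BIGINT NOT NULL,
    COUNTRY   VARCHAR NOT NULL,
    DAYID     INTEGER NOT NULL,
    CF.IMP01  BIGINT,
    CF.IMP02  BIGINT,
    -- ... IMP03 through IMP23 elided ...
    CF.IMP24  BIGINT,
    CF.IMPDAY BIGINT,  -- the 25th column: daily aggregate, maintained at load time
    CONSTRAINT PK PRIMARY KEY (ADID, COUNTRY, DAYID)
);

-- Hourly load would then write the hour's count and the running daily
-- total together, so daily reports read one cell instead of summing 24.
UPSERT INTO AD_DAILY (ADID, COUNTRY, DAYID, IMP12, IMPDAY)
VALUES (1, 'US', 20150624, 2000, 2000);
```

Note the daily total here is maintained by the loader, not by the database; keeping it consistent with the hourly columns is the application's job.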
From: Pariksheet Barapatre [mailto:pbarapa...@gmail.com]
Sent: Tuesday, June 23, 2015 8:07 PM
To: user@phoenix.apache.org
Subject: Re: Data Model Suggestion

Hi Puneet,

The upsert would be easy. E.g., if we want to load data for 2015062412:

---------------------------------------------
ADID | COUNTRY | HOUR ID    | CF.IMP |
---------------------------------------------
1    | US      | 2015062412 | 2000   |
---------------------------------------------

then the upsert will be:

UPSERT INTO myTABLE (ADID, COUNTRY, DAYID, IMP12) VALUES (1, 'US', 20150624, 2000);

Your point is true that HBase will still store 24 key-values. One of the eBay blogs says to think of your data as a sorted map:

SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>

If we follow this, I guess keeping 24 columns does make a difference versus putting the hour ID in the row key.

Thanks
Pari

On 23 June 2015 at 19:10, Puneet Kumar Ojha <puneet.ku...@pubmatic.com> wrote:

Even if you keep 24 columns, HBase will still store them as 24 records (key-values). There are benefits for certain use cases, though, such as breaking the data into column families, and how updates behave for a particular rowkey if you go with 24 columns.

From: Pariksheet Barapatre [mailto:pbarapa...@gmail.com]
Sent: Tuesday, June 23, 2015 7:07 PM
To: user@phoenix.apache.org
Subject: Re: Data Model Suggestion

Thanks Skanda for your suggestion.

If we go with 24 columns, won't it reduce the number of rows over a large data set and improve scan performance? If I want a daily report, I only have to scan a single row. Your thoughts?

Thanks
Pari

On 23 June 2015 at 18:57, Skanda <skanda.ganapa...@gmail.com> wrote:

Hi Pari,

For your use case, having the hour as part of the rowkey should be a better design than creating so many columns.
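Pari's "scan only one row" point for the daily report could be sketched as follows under the wide-row schema (table and column names are assumed for illustration, with IMP03 through IMP23 elided):

```sql
-- Daily report for one advertisement in one country: a point lookup
-- on the full primary key (ADID, COUNTRY, DAYID), touching one row.
-- COALESCE guards against NULL hours (e.g. no UK traffic in hour 01).
SELECT ADID, COUNTRY, DAYID,
       COALESCE(IMP01, 0) + COALESCE(IMP02, 0)
       -- ... + COALESCE(IMP03..IMP23, 0) elided ...
       + COALESCE(IMP24, 0) AS DAILY_IMP
FROM AD_DAILY
WHERE ADID = 1 AND COUNTRY = 'US' AND DAYID = 20150623;
```

The trade-off Puneet raises still holds underneath: Phoenix reads one logical row, but HBase materializes it from up to 24 key-values in the same row.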
Regards,
Skanda

On Tue, Jun 23, 2015 at 6:49 PM, Pariksheet Barapatre <pbarapa...@gmail.com> wrote:

Hello All,

This is more of an HBase question, but as I am planning to use Phoenix as the access layer, I hope Phoenix users can help me.

I would like to store time-series data to get on-the-fly analytics. The use case is adTech. Report: what are the hourly, daily, and weekly impression counts at country level for a given advertisement ID (ADID)?

I am doing hourly aggregation and loading into a Phoenix table.

Primary key: ADID | COUNTRY | HOUR ID

---------------------------------------------
ADID | COUNTRY | HOUR ID    | CF.IMP |
---------------------------------------------
1    | US      | 2015062301 | 3000   |
1    | US      | 2015062302 | 3421   |
1    | UK      | 2015062302 | 1212   |
---------------------------------------------

Is this a good schema design, or shall I create an alternate schema as below?

Primary key: ADID | COUNTRY | DAY ID

--------------------------------------------------------
ADID | COUNTRY | DAY ID   | CF.IMP01 | CF.IMP02 | ...
--------------------------------------------------------
1    | US      | 20150623 | 3000     | 3421     |
1    | UK      | 20150623 | NULL     | 1212     |
--------------------------------------------------------

Here, I have taken the hour part of the hour ID and created 24 columns. I have gone through many time-series NoSQL blog posts, and most authors suggest going with wider rows as above. This reduces the scan, but I don't see much difference between the two data models in terms of scan latency.

Can anybody please suggest a good approach for my use case?

Cheers,
Pari
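For comparison, the hour-in-rowkey alternative Skanda favors could be sketched like this (the table name AD_HOURLY is hypothetical; the row layout follows the first table in the question):

```sql
-- Hypothetical hour-in-rowkey schema: one row per (ADID, COUNTRY, HOURID).
CREATE TABLE IF NOT EXISTS AD_HOURLY (
    ADID    BIGINT NOT NULL,
    COUNTRY VARCHAR NOT NULL,
    HOURID  BIGINT NOT NULL,   -- e.g. 2015062301 = 2015-06-23, hour 01
    CF.IMP  BIGINT,
    CONSTRAINT PK PRIMARY KEY (ADID, COUNTRY, HOURID)
);

-- Daily report: a range scan over at most 24 contiguous rows. Because
-- HOURID is the trailing row-key column, all hours of one day for one
-- (ADID, COUNTRY) sort together and the scan stays narrow.
SELECT SUM(IMP) AS DAILY_IMP
FROM AD_HOURLY
WHERE ADID = 1 AND COUNTRY = 'US'
  AND HOURID BETWEEN 2015062300 AND 2015062323;
```

Either way the leading key columns (ADID, COUNTRY) bound the scan, which is why, as Pari observes, the latency difference between the two models is modest: both read roughly 24 key-values per day per (ADID, COUNTRY).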