Yes, you are correct. Performance will be better with 24 columns. Better still, add one more column (a 25th) to keep the daily aggregate of the 24 hours; this will give a further performance boost.
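
For illustration, one way that 25th column could be kept in sync (IMP_DAY is a made-up name; myTABLE and the IMP01..IMP24 columns are as sketched further down the thread):

-- Assumed 25th column for the daily total; IMP_DAY is a hypothetical name.
ALTER TABLE myTABLE ADD CF.IMP_DAY BIGINT;

-- Recompute the daily total for one day from the 24 hourly columns.
UPSERT INTO myTABLE (ADID, COUNTRY, DAYID, IMP_DAY)
SELECT ADID, COUNTRY, DAYID,
       COALESCE(IMP01,0) + COALESCE(IMP02,0) + COALESCE(IMP03,0) + COALESCE(IMP04,0)
     + COALESCE(IMP05,0) + COALESCE(IMP06,0) + COALESCE(IMP07,0) + COALESCE(IMP08,0)
     + COALESCE(IMP09,0) + COALESCE(IMP10,0) + COALESCE(IMP11,0) + COALESCE(IMP12,0)
     + COALESCE(IMP13,0) + COALESCE(IMP14,0) + COALESCE(IMP15,0) + COALESCE(IMP16,0)
     + COALESCE(IMP17,0) + COALESCE(IMP18,0) + COALESCE(IMP19,0) + COALESCE(IMP20,0)
     + COALESCE(IMP21,0) + COALESCE(IMP22,0) + COALESCE(IMP23,0) + COALESCE(IMP24,0)
FROM myTABLE
WHERE DAYID = 20150624;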

From: Pariksheet Barapatre [mailto:pbarapa...@gmail.com]
Sent: Tuesday, June 23, 2015 8:07 PM
To: user@phoenix.apache.org
Subject: Re: Data Model Suggestion

Hi Puneet,
The upsert could be easy -
e.g. if we want to load data for hour 2015062412 (hour 12 of day 20150624):
---------------------------------------------
ADID | COUNTRY | HOUR ID    | CF.IMP |
---------------------------------------------
1    | US      | 2015062412 | 2000   |
---------------------------------------------

Then the upsert will be -
UPSERT INTO myTABLE (ADID,COUNTRY,DAYID,IMP12) VALUES(1,'US',20150624,2000) ;
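
For reference, a minimal sketch of the wide table that upsert assumes (the column types and exact DDL are my guess, not from this thread):

CREATE TABLE IF NOT EXISTS myTABLE (
    ADID BIGINT NOT NULL, COUNTRY VARCHAR NOT NULL, DAYID INTEGER NOT NULL,
    CF.IMP01 BIGINT, CF.IMP02 BIGINT, CF.IMP03 BIGINT, CF.IMP04 BIGINT,
    CF.IMP05 BIGINT, CF.IMP06 BIGINT, CF.IMP07 BIGINT, CF.IMP08 BIGINT,
    CF.IMP09 BIGINT, CF.IMP10 BIGINT, CF.IMP11 BIGINT, CF.IMP12 BIGINT,
    CF.IMP13 BIGINT, CF.IMP14 BIGINT, CF.IMP15 BIGINT, CF.IMP16 BIGINT,
    CF.IMP17 BIGINT, CF.IMP18 BIGINT, CF.IMP19 BIGINT, CF.IMP20 BIGINT,
    CF.IMP21 BIGINT, CF.IMP22 BIGINT, CF.IMP23 BIGINT, CF.IMP24 BIGINT
    CONSTRAINT pk PRIMARY KEY (ADID, COUNTRY, DAYID)
);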

Your point is true that HBase will store 24 key-values either way. One of the eBay blogs puts it this way - think of your key-values as a sorted map:

SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>
Going by this, I guess keeping 24 columns does make a difference versus putting the hour id in the row key.
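
To make that concrete, here is a rough, illustrative layout of the two models (values taken from the tables further down; a sketch, not actual HBase output):

Hour in rowkey - 24 rows per ADID/COUNTRY/day, one cell each:
  (1, US, 2015062301) -> { CF.IMP = 3000 }
  (1, US, 2015062302) -> { CF.IMP = 3421 }

Hour as columns - 1 row per ADID/COUNTRY/day, up to 24 cells:
  (1, US, 20150623) -> { CF.IMP01 = 3000, CF.IMP02 = 3421 }

Either way HBase stores one key-value per populated hour, and each cell carries the full rowkey; the wide row mainly changes how many logical rows a scan has to touch.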

Thanks
Pari

On 23 June 2015 at 19:10, Puneet Kumar Ojha <puneet.ku...@pubmatic.com> wrote:
Even if you keep 24 columns, HBase will still store them as 24 key-values.

There are benefits for some use cases, though, if you go with 24 columns - like breaking the columns into column families, and how updates behave for a particular rowkey; see the sketch below.
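
As an illustration of the column-family point (the family names A/B and the hot/cold grouping are made up, not from this thread), frequently read hours could live apart from the rest, so a report over business hours only reads one family:

CREATE TABLE IF NOT EXISTS myTABLE_SPLIT (
    ADID BIGINT NOT NULL, COUNTRY VARCHAR NOT NULL, DAYID INTEGER NOT NULL,
    -- family A: "hot" business hours, read by most reports
    A.IMP09 BIGINT, A.IMP10 BIGINT, A.IMP11 BIGINT, A.IMP12 BIGINT,
    A.IMP13 BIGINT, A.IMP14 BIGINT, A.IMP15 BIGINT, A.IMP16 BIGINT,
    A.IMP17 BIGINT, A.IMP18 BIGINT,
    -- family B: "cold" remaining hours
    B.IMP01 BIGINT, B.IMP02 BIGINT, B.IMP03 BIGINT, B.IMP04 BIGINT,
    B.IMP05 BIGINT, B.IMP06 BIGINT, B.IMP07 BIGINT, B.IMP08 BIGINT,
    B.IMP19 BIGINT, B.IMP20 BIGINT, B.IMP21 BIGINT, B.IMP22 BIGINT,
    B.IMP23 BIGINT, B.IMP24 BIGINT
    CONSTRAINT pk PRIMARY KEY (ADID, COUNTRY, DAYID)
);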


From: Pariksheet Barapatre [mailto:pbarapa...@gmail.com]
Sent: Tuesday, June 23, 2015 7:07 PM
To: user@phoenix.apache.org
Subject: Re: Data Model Suggestion

Thanks Skanda for your suggestion. If we go with 24 columns, won't it reduce the number of rows over a large data set and improve scan performance? If I want a daily report, I only have to scan one row.
Your thoughts?
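
For illustration, the daily report under each model might look like this (names as in the examples above; myTABLE_HOURLY is a made-up name for the hour-in-rowkey table):

-- Wide model: one point lookup returns the whole day.
SELECT * FROM myTABLE
WHERE ADID = 1 AND COUNTRY = 'US' AND DAYID = 20150623;

-- Hour-in-rowkey model: a small range scan over up to 24 rows.
SELECT SUM(IMP) FROM myTABLE_HOURLY
WHERE ADID = 1 AND COUNTRY = 'US'
  AND HOURID BETWEEN 2015062301 AND 2015062324;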
Thanks
Pari

On 23 June 2015 at 18:57, Skanda <skanda.ganapa...@gmail.com> wrote:
Hi Pari,
For your use case, having the hour as part of the rowkey should be a better design than creating so many columns.
Regards
Skanda

On Tue, Jun 23, 2015 at 6:49 PM, Pariksheet Barapatre <pbarapa...@gmail.com> wrote:
Hello All,

This is more of an HBase question, but as I am planning to use Phoenix as the access layer, I hope Phoenix users can help me.
I would like to store time series data to get on-the-fly analytics.
The use case is for adTech.
Report - hourly, daily, and weekly impression counts at country level for a given advertisement ID (ADID).
I am doing hourly aggregation and loading the results into a Phoenix table.
Primary Key - ADID | COUNTRY | HOUR ID

---------------------------------------------
ADID | COUNTRY | HOUR ID    | CF.IMP |
---------------------------------------------
1    | US      | 2015062301 | 3000   |
1    | US      | 2015062302 | 3421   |
1    | UK      | 2015062302 | 1212   |
---------------------------------------------
Is this a good schema design, or shall I create an alternate schema as below?
Primary Key - ADID | COUNTRY | DAY ID
---------------------------------------------------
ADID | COUNTRY | DAY ID   | CF.IMP01 | CF.IMP02 |
---------------------------------------------------
1    | US      | 20150623 | 3000     | 3421     |
1    | UK      | 20150623 | NULL     | 1212     |
---------------------------------------------------
Here, I have taken the hour part out of the hour ID and created 24 columns.
I have gone through many time-series NoSQL blog posts, and most authors suggest going with wider rows as above. This reduces the scan, but I don't see much difference between the two data models in terms of scan latency.
Can anybody please suggest a good approach for my use case?
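
For example, the weekly report in the first schema would be an on-the-fly rollup over the hour range (myTABLE_HOURLY is a made-up table name; the hour ids follow the format in the table above):

SELECT ADID, COUNTRY, SUM(IMP) AS WEEKLY_IMP
FROM myTABLE_HOURLY
WHERE ADID = 1
  AND HOURID BETWEEN 2015061701 AND 2015062324
GROUP BY ADID, COUNTRY;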

Cheers,
Pari




--
Cheers,
Pari



--
Cheers,
Pari
