[ 
https://issues.apache.org/jira/browse/HIVE-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372501#comment-14372501
 ] 

Alan Gates commented on HIVE-9693:
----------------------------------

HiveConf:
METASTORE_HBASE_AGGREGATE_STATS_CACHE_MAX_VARIANCE 0.01 seems like a really low 
default for this.  I was thinking we should start closer to 0.1 or 0.2.

AggregateStatsCache:
For these classes that won't be used outside of the o.a.h.h.metastore.hbase 
package I've been marking both classes and methods as package scoped instead of 
public.  This way it's clear to people that this is not a public API.
I'm nervous about the read/write lock on AggregateStatsList.  I see why you're 
doing it, but I'm nervous that the readers could spend a long time in there 
comparing and block writers.  One thing we could do is make it so that the 
readers only hold the lock for long enough to copy the list and then release 
the lock.  Or we could add a timeout to the write lock so we bound the amount 
of time the writer will wait, and if it can't write in that amount of time it 
just drops it.
Line 46, we know a priori how big the hash table can get from the config value, 
so we should use that in the construction of the hash table as that will avoid 
possibly expensive resizings later.
findBestMatch() This should also check whether an AggregateStats entry is past 
it's ttl.  If so, it should reject it even if it's a match.  This avoids way 
old data getting used in the case where the cache is large relative to the 
system usage and the cleaner hasn't been kicked off recently.
line 149, bogus javadoc, there's no param candidateList
Why have both a cleaner thread and evictOneNode?  It's simpler to spawn the 
cleaner and add the node, even if that puts the cache a few over.  Then the 
cleaner should first remove all timed out nodes, and if it is still above the 
cleaning threshold it should start evicting nodes on an LRU basis.  Actually, 
once it starts cleaning it should make sure it goes a ways below the threshold 
so it doesn't have to start cleaning again immediately.  Thus I'd add a 
{{private float cleanUntil = 0.8f}}
maxFull and cleanUntil should be configurable.  You'll want it for testing 
purposes if nothing else.
In evictOneNode, if I understand the javadoc on ConcurrentHashMap correctly you 
could get a key from the keyset that returns null on get, as some other thread 
could have already removed the key.  Moving evictOneNode into the cleaner 
thread would get rid of this issue as well since it would guarantee only one 
thread was ever removing entries from the list.
line 323, the cast is unnecessary.

HBaseReadWrite
line 1419, 1456 bogus javadoc, param name is tblName not tableName

HBaseUtils
You should break the ColumnStatsAggregator interface and implementing classes 
into separate classes.  This is an area where we'll have more work to do 
(especially in combining NDVs) and it will be convenient to have it in separate 
code rather than letting this class get huge (or more huge, anyway).  You could 
even put it in an hbase.aggregator package if you wanted.

BitVector
line 60, bogus javadoc, no param index

BloomFilter
Are there not existing implementations of a bloom filter?  Do we have to 
implement one ourselves?

I haven't reviewed the tests yet, as you said you were still adding a bunch.


> Introduce a stats cache for HBase metastore  [hbase-metastore branch]
> ---------------------------------------------------------------------
>
>                 Key: HIVE-9693
>                 URL: https://issues.apache.org/jira/browse/HIVE-9693
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Metastore
>            Reporter: Vaibhav Gumashta
>            Assignee: Vaibhav Gumashta
>         Attachments: HIVE-9693.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to