SHORT EXPLANATION: after several minutes of very heavy querying, a much higher
percentage of queries to Phoenix become exceptionally slow.

LONGER EXPLANATION:

I've been using Phoenix for about a year as a data store for web-based
reporting tools, and it works well.

Now I'm trying to use the data in a different (much more request-intensive)
way, and I'm encountering some issues.

The scenario is basically this:

Daily, I ingest very large CSV files with data for widgets.

Each input file has hundreds of rows of data for each widget, and tens of 
thousands of unique widgets.

As a first step, I want to de-duplicate this data against my Phoenix-based DB
(I can't rely on just upserting the data for de-duplication, because it will
go through several ETL steps before being stored in Phoenix/HBase).

So, per widget, I run a query against Phoenix (the table is keyed on the
unique widget ID + sample point). I fetch all the data for a given widget ID
within a certain period of time, and then ingest only the rows for that
widget that are new to me.
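
Concretely, the per-widget query looks roughly like this (WIDGET_DATA,
WIDGET_ID, and SAMPLE_TS are illustrative stand-ins for my real table and
key columns):

    SELECT *
    FROM WIDGET_DATA
    WHERE WIDGET_ID = ?
      AND SAMPLE_TS BETWEEN ? AND ?

Since WIDGET_ID leads the row key, this should be a narrow range scan, which
fits the sub-10 ms responses I see when things are healthy.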

I'm doing this in Java in a single pass: I loop through my input file and
run one query per widget, using the same Connection object to Phoenix
throughout.
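
Here's a trimmed-down sketch of the loop (WidgetBatch, readBatches, and
ingest are stand-ins for my real parsing/ETL code, and the table/column
names are the same illustrative ones as above):

    import java.sql.*;
    import java.util.HashSet;
    import java.util.Set;

    // One connection for the whole run, one prepared statement reused per widget.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-quorum");
         PreparedStatement ps = conn.prepareStatement(
             "SELECT * FROM WIDGET_DATA " +
             "WHERE WIDGET_ID = ? AND SAMPLE_TS BETWEEN ? AND ?")) {
        for (WidgetBatch batch : readBatches(csvFile)) {  // all input rows for one widget
            ps.setString(1, batch.widgetId);
            ps.setTimestamp(2, batch.windowStart);
            ps.setTimestamp(3, batch.windowEnd);
            // Collect the sample points Phoenix already has for this widget.
            Set<Timestamp> known = new HashSet<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    known.add(rs.getTimestamp("SAMPLE_TS"));
                }
            }
            // Only rows with previously-unseen sample points go on to the ETL steps.
            ingest(batch.rowsNotIn(known));
        }
    }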

THE ISSUE:

What I'm finding is that for the first several thousand queries, I almost
always get a very fast response (less than 10 ms), which is good.

But after 15-20 thousand queries, the responses start to get MUCH slower.
Some queries respond as expected, but many take as long as 2-3 minutes,
pushing the total time to prime the data structure into the 12-15 hour
range, when it would take only 2-3 hours if all the queries were fast.

The exact same queries, when run manually and not as part of this bulk
process, return in the expected < 10 ms.

So it SEEMS like the burst of queries puts Phoenix into some sort of busy state 
that causes it to respond far too slowly.

The connection properties I'm setting are:

phoenix.query.timeoutMs: 90000
phoenix.query.keepAliveMs: 90000
phoenix.query.threadPoolSize: 256
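
I pass these in as JDBC connection properties when the connection is
created, roughly like this ("zk-quorum" is a placeholder for our real
ZooKeeper hosts):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    Properties props = new Properties();
    props.setProperty("phoenix.query.timeoutMs", "90000");
    props.setProperty("phoenix.query.keepAliveMs", "90000");
    props.setProperty("phoenix.query.threadPoolSize", "256");
    Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-quorum", props);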

Our cluster has 9 (beefy) region servers, and the table I'm querying has 511
regions. We went through a lot of pain to get the data split extremely well,
and I don't think schema design is the issue here.

Can anyone help me understand how to make this better? Is there a better 
approach I could take? A better set of configuration parameters? Is our cluster 
just too small for this?


Thanks!