SHORT EXPLANATION: After several minutes of very heavy querying, a much higher percentage of queries to Phoenix return exceptionally slowly.
LONGER EXPLANATION: I've been using Phoenix for about a year as a data store for web-based reporting tools, and it works well. Now I'm trying to use the data in a different (much more request-intensive) way and am running into problems.

The scenario is basically this: daily, I ingest very large CSV files with data for widgets. Each input file has hundreds of rows of data per widget, and there are tens of thousands of unique widgets. As a first step, I want to de-duplicate this data against my Phoenix-based DB (I can't rely on simply upserting the data for de-dup, because it goes through several ETL steps before being stored in Phoenix/HBase). So, per widget, I run a query against Phoenix (the table is keyed on the unique widget ID + sample point): I fetch all the data for a given widget ID within a certain period of time, and then ingest only the rows for that widget that are new to me. I'm doing this in Java in a single step: I loop through my input file and perform one query per widget, reusing the same Connection object to Phoenix. (A simplified sketch of the connection setup and query loop is included below.)

THE ISSUE: For the first several thousand queries, I almost always get a very fast (< 10 ms) response, which is good. But after 15-20 thousand queries, the responses start to get MUCH slower. Some queries respond as expected, but many take as long as 2-3 minutes, pushing the total time to prime the data structure into the 12-15 hour range, when it would only take 2-3 hours if all the queries were fast. The same exact queries, when run manually and not as part of this bulk process, return in the expected < 10 ms. So it SEEMS like the burst of queries puts Phoenix into some sort of busy state that causes it to respond far too slowly.

The connection properties I'm setting are:

phoenix.query.timeoutMs: 90000
phoenix.query.keepAliveMs: 90000
phoenix.query.threadPoolSize: 256

Our cluster is 9 (beefy) region servers, and the table I'm referencing has 511 regions. We went through a lot of pain to get the data split extremely well, and I don't think schema design is the issue here.

Can anyone help me understand how to make this better? Is there a better approach I could take? A better set of configuration parameters? Is our cluster just too small for this?

Thanks!
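For reference, this is roughly how the connection is opened. The ZooKeeper quorum string is a placeholder, and the properties are the ones listed above, passed as connection Properties:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.Properties;

    public class PhoenixDedupSetup {
        // Placeholder ZK quorum; the real cluster's quorum goes here.
        private static final String PHOENIX_URL = "jdbc:phoenix:zk-host-1,zk-host-2,zk-host-3:2181";

        static Connection openConnection() throws SQLException {
            Properties props = new Properties();
            // Phoenix client-side query settings (note the lower-case "phoenix.query.*" names).
            props.setProperty("phoenix.query.timeoutMs", "90000");
            props.setProperty("phoenix.query.keepAliveMs", "90000");
            props.setProperty("phoenix.query.threadPoolSize", "256");
            return DriverManager.getConnection(PHOENIX_URL, props);
        }
    }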
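And this is a simplified sketch of the per-widget query loop. The table and column names (WIDGET_DATA, WIDGET_ID, SAMPLE_TS) are placeholders for the real schema, but the shape is the same: one PreparedStatement reused for every widget over a single Connection, with the result used to filter the widget's CSV rows before the ETL steps:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class WidgetDedup {

        // Returns the sample points already stored for one widget within the time window,
        // so rows with those sample points can be skipped during ingest.
        static Set<Timestamp> existingSamplePoints(PreparedStatement ps, String widgetId,
                                                   Timestamp from, Timestamp to) throws SQLException {
            ps.setString(1, widgetId);
            ps.setTimestamp(2, from);
            ps.setTimestamp(3, to);
            Set<Timestamp> seen = new HashSet<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    seen.add(rs.getTimestamp("SAMPLE_TS"));
                }
            }
            return seen;
        }

        static void dedup(Connection conn, List<String> widgetIds,
                          Timestamp from, Timestamp to) throws SQLException {
            String sql = "SELECT SAMPLE_TS FROM WIDGET_DATA "
                       + "WHERE WIDGET_ID = ? AND SAMPLE_TS BETWEEN ? AND ?";
            // Same Connection and PreparedStatement reused for every widget in the input file.
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (String widgetId : widgetIds) {
                    Set<Timestamp> seen = existingSamplePoints(ps, widgetId, from, to);
                    // ... compare the widget's CSV rows against 'seen' and keep only the new ones
                }
            }
        }
    }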