Hi,
We have sensor input that creates very wide rows and operations on these rows 
have started to timeout regulary. We have been trying to find a solution to 
dividing wide rows but keep hitting limitations that move the problem around 
instead of solving it.
We have a partition key consisting of a sensorUnitId and a sensorId and use a 
time field to access each column in the row. We tried adding a time based 
entry, timeShardId, to the partition key that consists of the year and week of 
year during which the reading was taken. This works for a number of queries but 
for scanning all the readings against a particular sensorUnitId and sensorId 
combination, we seem to be stuck.
We won't know the range of valid values of the timeShardId for a given 
sensorUnitId and sensorId combination so would have to write to an additional 
table to track the valid timeShardId. We suspect this would create tombstone 
accumulation problems given the number of updates required to the same row so 
haven't tried this option.

Alternatively, we hit a different bottleneck in the form of SELECT DISTINCT in 
trying to directly access the partition keys. Since SELECT DISTINCT does not 
allow for a where clause to filter on the partition key values, we have to 
filter several hundred thousand partition keys just to find those related to 
the relevant sensorUnitId and sensorId. This problem will only grow worse for 
us.

Are there any other approaches that can be suggested? We have been looking 
around, but haven't found any references beyond the initial suggestion to add 
some sort of shard id to the partition key to handle wide rows.
Thanks,
Jason

Reply via email to