S3 maybe?

On Mon, Apr 11, 2016 at 7:05 PM Robert Wille <rwi...@fold3.com> wrote:
> I do realize it's kind of a weird use case, but it is legitimate. I have a
> collection of documents that I need to index, and I want to perform entity
> extraction on them and give the extracted entities special treatment in my
> full-text index. Because entity extraction costs money, and each document
> will end up being indexed multiple times, I want to cache the extracted
> entities in Cassandra. The document text is the obvious key to retrieve
> entities from the cache. If I use the document ID, then I have to track
> timestamps. I know that sounds like a simple workaround, but I'm presenting
> a much-simplified view of my actual data model.
>
> The reason for needing the text in the table, and not just a digest, is
> that sometimes entity extraction has to be deferred due to license
> limitations. In those cases, the entity extraction occurs in a background
> process, and the entities will be included in the index the next time the
> document is indexed.
>
> I will use a digest as the key. I suspected that would be the answer, but
> it's good to get confirmation.
>
> Robert
>
> On Apr 11, 2016, at 4:36 PM, Jan Kesten <j.kes...@enercast.de> wrote:
>
> > Hi Robert,
> >
> > why do you need the actual text as a key? It sounds a bit unnatural, at
> > least to me. Keep in mind that you cannot do "like" queries on keys in
> > Cassandra. For performance and to keep things more readable, I would
> > prefer hashing your text and using the hash as the key.
> >
> > You should also consider storing the keys (hashes) in a separate table
> > per day / hour or something like that, so you can quickly get all keys
> > for a time range. A query without the partition key may be very slow.
> >
> > Jan
> >
> > On 11.04.2016 at 23:43, Robert Wille wrote:
> >> I need to be able to use the text of a document as the primary key in
> >> a table. These texts are usually less than 1 KB, but can sometimes be
> >> tens of KB in size. Would it be better to use a digest of the text as
> >> the key? I have a background process that will occasionally need to do
> >> a full table scan and retrieve all of the texts, so using the digest
> >> doesn't eliminate the need to store the text. Anyway, is it better to
> >> keep primary keys small, or is C* okay with large primary keys?
> >>
> >> Robert
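
A minimal sketch of the digest-as-key approach discussed in the quoted thread, assuming SHA-256 as the digest and using an in-memory dict as a stand-in for the Cassandra table. The names `text_digest`, `EntityCache`, and `pending` are hypothetical and do not come from the thread. In a real table the digest would presumably be the partition key, with the text and entities stored as regular columns.

```python
import hashlib


def text_digest(text: str) -> bytes:
    """Return the 32-byte SHA-256 digest of the UTF-8 encoded document text."""
    return hashlib.sha256(text.encode("utf-8")).digest()


class EntityCache:
    """In-memory stand-in (assumption) for a table keyed by digest.

    Each row keeps the full text next to the entities, so a background
    job can still scan for documents whose extraction was deferred.
    """

    def __init__(self):
        self._rows = {}  # digest -> {"text": str, "entities": list or None}

    def put(self, text, entities=None):
        """Store a document; entities=None marks extraction as deferred."""
        key = text_digest(text)
        self._rows[key] = {"text": text, "entities": entities}
        return key

    def get(self, text):
        """Return cached entities, or None if missing or still deferred."""
        row = self._rows.get(text_digest(text))
        # Compare the stored text as a guard against a digest collision.
        if row is not None and row["text"] == text:
            return row["entities"]
        return None

    def pending(self):
        """Full scan: texts that still need entity extraction."""
        return [r["text"] for r in self._rows.values() if r["entities"] is None]
```

With this layout the key stays at a fixed 32 bytes regardless of document size, while the text remains available for the occasional full scan.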