Secondary indexes in Cassandra are not a good fit for high-cardinality values.
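
As a rough illustration of the alternative to a secondary index, here is a minimal Python sketch of option A from the quoted message below (rows keyed directly by login name). It uses a plain dict as a stand-in for a column family and does not talk to Cassandra at all. The function names and the example.com login names are made up for illustration.

```python
# In-memory stand-in for a column family keyed by login name (option A).
# Plain dicts only; no Cassandra client is involved.

customer = {
    "id": 1122,
    "gender": "MALE",
    "birthdate": "1987.11.09",
    "name": "Alfred Tester",
    "pwd": "e72c504dc16c8fcd2fe8c74bb492affa",
}


def write_customer(table, logins, data):
    """Store one full copy of the customer data under every login name."""
    for login in logins:
        table[login] = dict(data)  # copy, so each row is independent


def find_by_login(table, login):
    """Single key lookup: returns the customer data, or None if unknown."""
    return table.get(login)


logins_table = {}
# Placeholder login names (assumption, not taken from the thread).
write_customer(logins_table, ["alfred@example.com", "tester@example.com"], customer)
print(find_by_login(logins_table, "alfred@example.com"))
```

The point of the sketch is that the login itself is the key, so the read is one lookup with no index involved. The cost is that every change to the customer data has to be written once per login.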
On Fri, Nov 18, 2011 at 7:14 AM, Dan Hendry <[email protected]> wrote:
> I think they are not limited to repeating values, but the Datastax docs[1] on
> secondary indexes certainly seem to indicate they would be a poor fit for
> this case (high read load, many unique values).
>
> [1] http://www.datastax.com/docs/1.0/ddl/indexes
>
> Dan
>
> From: Maciej Miklas [mailto:[email protected]]
> Sent: November-18-11 1:39
> To: [email protected]
> Subject: Re: Data Model Design for Login Servie
>
> But a secondary index is limited to repeating values like enums. In my
> case I would have a performance issue, right?
>
> On 18.11.2011, at 02:08, Maxim Potekhin <[email protected]> wrote:
>
> 1122: {
>     gender: MALE
>     birthdate: 1987.11.09
>     name: Alfred Tester
>     pwd: e72c504dc16c8fcd2fe8c74bb492affa
>     alias1: [email protected]
>     alias2: [email protected]
>     alias3: [email protected]
> }
>
> ...and you can use secondary indexes to query on anything.
>
> Maxim
>
> On 11/17/2011 4:08 PM, Maciej Miklas wrote:
>
> Hello all,
>
> I need your help designing the structure for a simple login service. It contains
> about 100.000.000 customers and each one can have about 10 different logins,
> which results in 1.000.000.000 different logins.
>
> Each customer contains the following data:
> - one to many login names as strings, max 20 UTF-8 characters long
> - ID as long - one customer has only one ID
> - gender
> - birth date
> - name
> - password as MD5
>
> The login process needs to find a user by login name.
> Data in Cassandra is replicated - this is necessary to obtain all required
> login data in a single call. Also, we usually expect low write traffic and
> heavy read traffic - round trips for reading data should be avoided.
> Below I've described two possible Cassandra data models based on an example: we
> have two users, the first user has three logins and the second user has two logins.
>
> A) Skinny rows
> - the row key contains the login name - this is the main search criterion
> - login data is replicated - each possible login is stored as a single row
> which contains all user data - 10 logins for a single customer create 10 rows,
> where each row has a different key and the same content
>
> // the first 3 rows have different keys and the same replicated data
> [email protected] {
>     id: 1122
>     gender: MALE
>     birthdate: 1987.11.09
>     name: Alfred Tester
>     pwd: e72c504dc16c8fcd2fe8c74bb492affa
> },
> [email protected] {
>     id: 1122
>     gender: MALE
>     birthdate: 1987.11.09
>     name: Alfred Tester
>     pwd: e72c504dc16c8fcd2fe8c74bb492affa
> },
> [email protected] {
>     id: 1122
>     gender: MALE
>     birthdate: 1987.11.09
>     name: Alfred Tester
>     pwd: e72c504dc16c8fcd2fe8c74bb492affa
> },
>
> // the two following rows again have the same data, for the second customer
> [email protected] {
>     id: 1133
>     gender: MALE
>     birthdate: 1997.02.01
>     name: Manfredus Maximus
>     pwd: e44c504ff16c8fcd2fe8c74bb492adda
> },
> [email protected] {
>     id: 1133
>     gender: MALE
>     birthdate: 1997.02.01
>     name: Manfredus Maximus
>     pwd: e44c504ff16c8fcd2fe8c74bb492adda
> }
>
> B) Rows grouped by alphabetical prefix
> - The number of rows is limited - the row key is, for example, the first letter
> of the login name
> - Each row contains all logins which begin with the row key - the row with key 'a'
> contains all logins which begin with 'a'
> - Data might be unbalanced, but we avoid skinny rows - this might have a
> positive performance impact (??)
> - to avoid super columns, each row directly contains columns, where the column
> name is the user login and the column value is the corresponding data in a kind of
> serialized form (I would like to have it human readable)
>
> a {
>     [email protected]:"1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
>     [email protected]@xyz.de:"1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa",
>     [email protected]@xyz.de:"1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa"
> },
>
> m {
>     [email protected]:"1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
> },
>
> r {
>     [email protected]:"1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda"
> }
>
> Which solution is better, especially for read performance? Do you
> have a better idea?
>
> Thanks,
> Maciej
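
For comparison, option B from the message above can be sketched the same way. This is only an in-memory approximation in Python: nested dicts stand in for rows and columns, and the semicolon-separated value follows the example in the message. The names `Customer`, `put_login` and `find_by_login` and the example.com login are assumptions for illustration. It also assumes logins are non-empty and that no field contains a ';'.

```python
from typing import NamedTuple, Optional


class Customer(NamedTuple):
    id: int
    gender: str
    birthdate: str
    name: str
    pwd: str


def serialize(c: Customer) -> str:
    """Human-readable column value: id;gender;birthdate;name;pwd."""
    return ";".join([str(c.id), c.gender, c.birthdate, c.name, c.pwd])


def deserialize(value: str) -> Customer:
    id_, gender, birthdate, name, pwd = value.split(";")
    return Customer(int(id_), gender, birthdate, name, pwd)


def put_login(rows: dict, login: str, c: Customer) -> None:
    """Row key is the first letter of the login; column name is the login."""
    rows.setdefault(login[0].lower(), {})[login] = serialize(c)


def find_by_login(rows: dict, login: str) -> Optional[Customer]:
    value = rows.get(login[0].lower(), {}).get(login)
    return deserialize(value) if value is not None else None


rows = {}
manfred = Customer(1133, "MALE", "1997.02.01", "Manfredus Maximus",
                   "e44c504ff16c8fcd2fe8c74bb492adda")
put_login(rows, "manfred@example.com", manfred)  # placeholder login
print(find_by_login(rows, "manfred@example.com"))
```

One thing the sketch makes visible: with roughly 1.000.000.000 logins and only a few dozen possible first letters, each row would end up holding tens of millions of columns, so the "unbalanced data" concern in option B is probably the bigger risk.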
