Good example, Ed. I'm glad to see other people doing things like this. Even though the official DataStax docs recommend not mixing static and dynamic columns, to me that advice is a huge disservice to Cassandra users.
If someone really wants to stick to the relational model, then NewSQL is a better fit, and it gives users the full power of SQL with subqueries, LIKE, and joins. But NewSQL can't handle these kinds of use cases, due to the static nature of relational tables and their row-size and column-count limits.

On Thu, Feb 20, 2014 at 6:18 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> CASSANDRA-6561 is interesting, though having statically defined columns
> is not exactly a solution for doing everything in "thrift".
>
> http://planetcassandra.org/blog/post/poking-around-with-an-idea-ranged-metadata/
>
> Before collections or CQL existed, I used some of these concepts myself.
>
> Say you have a column family named AllMyStuff.
>
> Columns named "friends_*" would be strings, and together they would form
> a "Map" of friend name to age:
>
> set AllMyStuff[edward][friends_bob]=34
> set AllMyStuff[edward][friends_sara]=33
>
> A column named password could be a string:
>
> set AllMyStuff[edward][password]='mother'
>
> Columns named phone[00] through phone[100] would be an array of phone
> numbers:
>
> set AllMyStuff[edward][phone[00]]='555-5555'
>
> It was quite easy for me to slice all the phone numbers:
>
> startkey: phone
> endkey: phone[100]
>
> Then again, every column starting with "action_xxxx" could be a page hit,
> and I could have thousands or tens of thousands of these.
>
> In many cases CQL has nice/nicer abstractions for some of these things.
> But its largest detraction for me is that I cannot take this
> already-existing column family AllMyStuff and 'explain' it to CQL. It is
> a perfectly valid way to design something, and might be (probably is)
> more space-efficient than the system of composites that CQL uses to pack
> things. I feel that as a data access language it dictates too much
> schema: not only the in-row schema, but the format of the data on disk as
> well. Also, schemas like mine above are perfectly valid, but selecting
> them into a table of fixed rows and columns does not map well.
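As a rough illustration of Ed's slicing trick (not Cassandra API): a row's sorted columns behave like a sorted map, and a start/end column slice is just a sub-map over the name range. The names and the `~` end-key sentinel below are made up for this sketch.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class WideRowSlice {
    public static void main(String[] args) {
        // One row ("edward") of a column family like AllMyStuff:
        // columns are kept sorted by name, as in a Cassandra row.
        TreeMap<String, String> row = new TreeMap<>();
        row.put("friends_bob", "34");
        row.put("friends_sara", "33");
        row.put("password", "mother");
        row.put("phone[00]", "555-5555");
        row.put("phone[01]", "555-1234");

        // Slice all phone numbers: start at "phone" and stop just past
        // the last possible phone column, mirroring a start/end key slice.
        // '~' sorts after ']' in ASCII, so it acts as an upper sentinel here.
        SortedMap<String, String> phones = row.subMap("phone", "phone[99]~");
        System.out.println(phones); // prints {phone[00]=555-5555, phone[01]=555-1234}
    }
}
```

Because the columns are physically sorted, this slice never touches the thousands of unrelated "action_xxxx" columns that might live in the same row.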
> The way Hive tackles this problem is that the metadata is interpreted by
> a SerDe, so the physical data and the logical definition are not coupled.
>
> On Thu, Feb 20, 2014 at 5:23 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>
>> Rüdiger,
>>
>> "SortedMap<byte[], SortedMap<byte[], Pair<Long, byte[]>>>"
>>
>> When using RandomPartitioner or Murmur3Partitioner, the outer map is a
>> simple Map, not a SortedMap.
>>
>> The only case where you have a SortedMap over row keys is when using
>> OrderPreservingPartitioner, which is clearly not advised for most cases
>> because of hot spots in the cluster.
>>
>> On Thu, Feb 20, 2014 at 10:49 PM, Rüdiger Klaehn <rkla...@gmail.com> wrote:
>>
>>> Hi Sylvain,
>>>
>>> I applied the patch to the cassandra-2.0 branch (this required some
>>> manual work, since I could not figure out which commit it was supposed
>>> to apply to, and it did not apply to the head of cassandra-2.0).
>>>
>>> The benchmark now runs in pretty much identical time to the
>>> thrift-based benchmark: ~30s for 1000 inserts of 10000 key/value pairs
>>> each. Great work!
>>>
>>> I still have some questions regarding the mapping. Please bear with me
>>> if these are stupid questions; I am quite new to Cassandra.
>>>
>>> The basic Cassandra data model for a keyspace is something like this,
>>> right?
>>>
>>> SortedMap<byte[], SortedMap<byte[], Pair<Long, byte[]>>>
>>>           ^ row key: determines which server(s) the rest is stored on
>>>                             ^ column key
>>>                                            ^ timestamp (latest one wins)
>>>                                                  ^ value (can be size 0)
>>>
>>> So if I have a table like the one in my benchmark (using blobs):
>>>
>>> CREATE TABLE IF NOT EXISTS test.wide (
>>>     time blob,
>>>     name blob,
>>>     value blob,
>>>     PRIMARY KEY (time, name))
>>> WITH COMPACT STORAGE
>>>
>>> then from reading http://www.datastax.com/dev/blog/thrift-to-cql3 it
>>> seems that
>>>
>>> - time maps to the row key and name maps to the column key, without
>>>   any overhead
>>> - value maps directly to the value in the model above, without any
>>>   prefix
>>>
>>> Is that correct, or is there some overhead in CQL over the raw model
>>> as described above? If so, where exactly?
>>>
>>> Kind regards, and many thanks for your help,
>>>
>>> Rüdiger
>>>
>>> On Thu, Feb 20, 2014 at 8:36 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>>
>>>> On Wed, Feb 19, 2014 at 9:38 PM, Rüdiger Klaehn <rkla...@gmail.com> wrote:
>>>>
>>>>> I have cloned the cassandra repo, applied the patch, and built it.
>>>>> But when I want to run the benchmark I get an exception; see below.
>>>>> I tried with a non-managed dependency on
>>>>> cassandra-driver-core-2.0.0-rc3-SNAPSHOT-jar-with-dependencies.jar,
>>>>> which I compiled from source because I read that that might help.
>>>>> But that did not make a difference.
>>>>>
>>>>> So currently I don't know how to give the patch a try. Any ideas?
>>>>> cheers,
>>>>>
>>>>> Rüdiger
>>>>>
>>>>> Exception in thread "main" java.lang.IllegalArgumentException: replicate_on_write is not a column defined in this metadata
>>>>>     at com.datastax.driver.core.ColumnDefinitions.getAllIdx(ColumnDefinitions.java:273)
>>>>>     at com.datastax.driver.core.ColumnDefinitions.getFirstIdx(ColumnDefinitions.java:279)
>>>>>     at com.datastax.driver.core.Row.getBool(Row.java:117)
>>>>>     at com.datastax.driver.core.TableMetadata$Options.<init>(TableMetadata.java:474)
>>>>>     at com.datastax.driver.core.TableMetadata.build(TableMetadata.java:107)
>>>>>     at com.datastax.driver.core.Metadata.buildTableMetadata(Metadata.java:128)
>>>>>     at com.datastax.driver.core.Metadata.rebuildSchema(Metadata.java:89)
>>>>>     at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:259)
>>>>>     at com.datastax.driver.core.ControlConnection.tryConnect(ControlConnection.java:214)
>>>>>     at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:161)
>>>>>     at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:77)
>>>>>     at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:890)
>>>>>     at com.datastax.driver.core.Cluster$Manager.newSession(Cluster.java:910)
>>>>>     at com.datastax.driver.core.Cluster$Manager.access$200(Cluster.java:806)
>>>>>     at com.datastax.driver.core.Cluster.connect(Cluster.java:158)
>>>>>     at cassandra.CassandraTestMinimized$delayedInit$body.apply(CassandraTestMinimized.scala:31)
>>>>>     at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>>>>>     at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>>>>>     at scala.App$$anonfun$main$1.apply(App.scala:71)
>>>>>     at scala.App$$anonfun$main$1.apply(App.scala:71)
>>>>>     at scala.collection.immutable.List.foreach(List.scala:318)
>>>>>     at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>>>>>     at scala.App$class.main(App.scala:71)
>>>>>     at cassandra.CassandraTestMinimized$.main(CassandraTestMinimized.scala:5)
>>>>>     at cassandra.CassandraTestMinimized.main(CassandraTestMinimized.scala)
>>>>
>>>> I believe you've tried the cassandra trunk branch? trunk is basically
>>>> the future Cassandra 2.1, and the driver is currently unhappy because
>>>> the replicate_on_write option has been removed in that version. I'm
>>>> supposed to have fixed that on the driver 2.0 branch like 2 days ago,
>>>> so maybe you're also using a slightly old version of the driver
>>>> sources in there? Or maybe I've screwed up my fix; I'll double check.
>>>> But anyway, it would be overall simpler to test with the
>>>> cassandra-2.0 branch of Cassandra, with which you shouldn't run into
>>>> that.
>>>>
>>>> --
>>>> Sylvain
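For what it's worth, Rüdiger's mental model of a keyspace, with DuyHai's correction that the outer map is unordered under Murmur3Partitioner, can be sketched in a few lines. The Cell record, the write helper, and the string keys are illustrative stand-ins for this sketch, not Cassandra types:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class KeyspaceModel {
    // Illustrative cell: a timestamped value. When the same column is
    // written twice, the cell with the latest timestamp wins.
    record Cell(long timestamp, String value) {}

    // Last-write-wins reconciliation for a single column.
    static void write(TreeMap<String, Cell> row, String column, Cell cell) {
        row.merge(column, cell,
                (old, neu) -> neu.timestamp() >= old.timestamp() ? neu : old);
    }

    public static void main(String[] args) {
        // Outer map: row key -> row. With Murmur3Partitioner, rows are
        // placed by hashed token, so there is no global row ordering:
        // a plain Map, not a SortedMap.
        Map<String, TreeMap<String, Cell>> keyspace = new HashMap<>();

        // Inner map: column key -> cell, always sorted by column name.
        // This sorted structure is what makes column slices cheap.
        TreeMap<String, Cell> row = new TreeMap<>();
        keyspace.put("time1", row);

        write(row, "name1", new Cell(1L, "old"));
        write(row, "name1", new Cell(2L, "new"));   // later timestamp wins
        write(row, "name1", new Cell(0L, "stale")); // ignored: older timestamp

        System.out.println(row.get("name1").value()); // prints "new"
    }
}
```

Only a sketch: it ignores tombstones, TTLs, and replication, but it captures the two-level shape and the "latest timestamp wins" rule from the diagram above.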