Re: Performance problem with large wide row inserts using CQL

Edward Capriolo Thu, 20 Feb 2014 13:04:45 -0800

Peter,

I must meet you and shake your hand. I was actually having a debate with a
number of people about a week back claiming there was "no reason to mix
static and dynamic". We do it all the time. I am glad someone else besides
me "gets it" and I am not totally mad.


Ed


On Thu, Feb 20, 2014 at 3:26 PM, Peter Lin <[email protected]> wrote:

>
> Hi Duyhai,
>
> yes, I am talking about mixing static and dynamic columns in a single
> column family. Let me give you an example from retail.
>
> Say you're Amazon and you sell over 10K different products. How do you
> store all those products with all the different properties like color,
> size, dimensions, etc.? With relational databases people use EAV (entity
> attribute value) tables. This means that to query the data, the system has
> to reconstruct the object by pivoting a bunch of rows and flattening them
> out to populate the Java object. Typically there are common fields to a
> product like SKU, price, and category.
>
> Using both static and dynamic columns, data can be stored in 1 row and
> queried by 1 row. Anyone who has used the EAV approach to build product
> databases will tell you how much that sucks. Another example is from auto
> insurance. Typically a policy database will allow 1 or more types of items
> for property insurance. Property insurance is home/auto insurance.
>
> Each insurance carrier supports a different number of insurable items,
> coverages and endorsements. Many systems use the same EAV approach, but the
> problem is bigger. Typically a commercial auto policy may have hundreds of
> drivers and vehicles. Each policy may have dozens or hundreds of coverages
> and endorsements. It is common for an auto insurance model to have hundreds
> of coverages and endorsements with different properties. Using the old ORM
> approach, it's usually mapped table-per-class. Problem is, that results in
> query explosion for polymorphic queries. This is a known problem with
> polymorphic queries using traditional techniques.
>
> Given that Cassandra + thrift gives developers the ability to store
> dynamic columns of different types, it solves the performance issues
> inherent in the EAV technique.
>
> The point I was trying to make in my first response is that going with
> pure CQL makes it much harder to take advantage of the COOL features of
> Cassandra. It does require building a framework to make it "mostly"
> transparent to developers, but it is worth it in my opinion to learn and
> understand both thrift and cql. I use annotations in my framework and
> delegates to handle the serialization. This way, the developer only needs
> to annotate the class and the framework handles serialization and
> deserialization.
>
>
>
>
>
> On Thu, Feb 20, 2014 at 3:05 PM, DuyHai Doan <[email protected]> wrote:
>
>> "Developers can use whatever type they want for the name or value in a
>> dynamic column and the framework will handle it appropriately."
>>
>>  What do you mean by "dynamic" column? If you want to be able to insert
>> an arbitrary number of columns in one physical row, CQL3 clustering is
>> there and does the job pretty well.
>>
>>  If by "dynamic" you mean a column whose validation type can change at
>> runtime (like the dynamic composite type:
>> http://hector-client.github.io/hector/build/html/content/composite_with_templates.html)
>> then why don't you just use the blob type and serialize it yourself on the
>> client side?
>>
>>  More practically, in your previous example:
>>
>>   - insert into myColumnFamily(staticColumn1, staticColumn2, 20 as int,
>> dynamicColumn as string) into ('text1','text2',30.55 as double, 3500 as
>> long)
>>
>>  I can't see a real, sensible use case where you need to mix static and
>> dynamic columns in the same column family. If you need to save a domain
>> model, use a skinny row with a fixed number of columns known beforehand. If
>> you want to store a time series or timeline of data, the wide row is there.
>>
>>
>> On Thu, Feb 20, 2014 at 8:55 PM, Peter Lin <[email protected]> wrote:
>>
>>>
>>> my apologies Sylvain, I didn't mean to misquote you. I still feel that
>>> even if someone is only going to use CQL, it is "worth it" to learn thrift.
>>>
>>> In the interest of discussion, I looked at both jira tickets and I don't
>>> see how that makes it so a developer can specify the name and value type
>>> for a dynamic column.
>>>
>>> https://issues.apache.org/jira/browse/CASSANDRA-6561
>>> https://issues.apache.org/jira/browse/CASSANDRA-4851
>>>
>>> Am I missing something? If the grammar for insert statements doesn't
>>> give users the ability to declare the name and value type, it means the
>>> developer has to default name and value to bytes. In their code, they have
>>> to handle that manually or build their own framework. I built my own
>>> framework, which handles this for me. Developers can use whatever type
>>> they want for the name or value in a dynamic column and the framework will
>>> handle it appropriately.
>>>
>>> To me, developers should take time to learn both and use both. I realize
>>> it's more work to understand both and take time to read the code. Not
>>> everyone is crazy enough to spend time reading the Cassandra code base or
>>> spend hundreds of hours studying Hector and other Cassandra clients. I will
>>> say this: if I hadn't spent time studying Cassandra and reading Hector code,
>>> I wouldn't have been able to help one of DataStax's customers port Hector to
>>> .Net. I also wouldn't have been able to port Hector to C# natively in 3
>>> months.
>>>
>>> Rather than recommend people be lazy, it would be more useful to list
>>> the pros/cons. To my knowledge, there isn't a good writeup on the pros/cons
>>> of thrift and cql on cassandra.apache.org. I don't know if the DataStax
>>> docs have a detailed writeup of it. Do they?
>>>
>>>
>>>
>>>
>>> On Thu, Feb 20, 2014 at 12:46 PM, Sylvain Lebresne <[email protected]
>>> > wrote:
>>>
>>>> On Thu, Feb 20, 2014 at 6:26 PM, Peter Lin <[email protected]> wrote:
>>>>
>>>>>
>>>>> I disagree with the sentiment that "thrift is not worth the trouble".
>>>>>
>>>>
>>>> Way to quote only part of my sentence and get mental on it. My full
>>>> sentence was "it's probably not worth the trouble to start with thrift if
>>>> you're gonna use CQL later".
>>>>
>>>>
>>>>>
>>>>> CQL and all SQL inspired dialects limit one's ability to use arbitrary
>>>>> typed data in dynamic columns. With thrift it's easy and straightforward.
>>>>> With CQL there is no way to tell Cassandra the type of the name and value
>>>>> for a dynamic column. You can only set the default type. That means using
>>>>> a "pure cql" approach you can't deviate from the default type. Cassandra
>>>>> will throw an exception indicating the type is different than the default
>>>>> type.
>>>>>
>>>>
>>>>> Until such time that CQL abandons the shackles of SQL and adds the
>>>>> ability to indicate the column and value type. Something like this
>>>>>
>>>>
>>>>> insert into myColumnFamily(staticColumn1, staticColumn2, 20 as int,
>>>>> dynamicColumn as string) into ('text1','text2',30.55 as double, 3500 as
>>>>> long)
>>>>>
>>>>> This is one area where Thrift is superior to CQL. Having said that,
>>>>> it's valid to use Cassandra "as if" it was a relational database, but then
>>>>> you'd miss out on some of the unique features.
>>>>>
>>>>
>>>> Man, if I had a nickel every time someone came on that mailing list
>>>> pretending that something was possible with thrift and not CQL ... I will
>>>> claim this: with CASSANDRA-6561 and CASSANDRA-4851 that just got in, there
>>>> is *nothing* that thrift can do that CQL cannot. But well, what do I know
>>>> about Cassandra.
>>>>
>>>> --
>>>> Sylvain
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 20, 2014 at 12:12 PM, Sylvain Lebresne <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> On Thu, Feb 20, 2014 at 2:16 PM, Edward Capriolo <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> For what it is worth, your schema is simple and uses compact storage.
>>>>>>> Thus you really don't need anything in Cassandra 2.0 as far as I can
>>>>>>> tell. You might be happier with a stable release like 1.2.something and
>>>>>>> just hector or astyanax. You are really dealing with many issues you
>>>>>>> should not have to just to prototype a simple Cassandra app.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Of course, if everyone was using that reasoning, no-one would ever
>>>>>> test new features and report problems/suggest improvements. So thanks to
>>>>>> anyone like Rüdiger who actually tries stuff and takes the time to report
>>>>>> problems when they think they encounter one. Keep at it, *you* are the
>>>>>> one helping Cassandra to get better every day.
>>>>>>
>>>>>> And you are also right Rüdiger that it's probably not worth the
>>>>>> trouble to start with thrift if you're gonna use CQL later. And you
>>>>>> definitely should use CQL, it is Cassandra's future.
>>>>>>
>>>>>> --
>>>>>> Sylvain
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Thursday, February 20, 2014, Sylvain Lebresne <
>>>>>>> [email protected]> wrote:
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > On Wed, Feb 19, 2014 at 9:38 PM, Rüdiger Klaehn <[email protected]>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> I have cloned the cassandra repo, applied the patch, and built
>>>>>>> >> it. But when I want to run the benchmark I get an exception. See
>>>>>>> >> below. I tried with a non-managed dependency to
>>>>>>> >> cassandra-driver-core-2.0.0-rc3-SNAPSHOT-jar-with-dependencies.jar,
>>>>>>> >> which I compiled from source because I read that that might help.
>>>>>>> >> But that did not make a difference.
>>>>>>> >>
>>>>>>> >> So currently I don't know how to give the patch a try. Any ideas?
>>>>>>> >>
>>>>>>> >> cheers,
>>>>>>> >>
>>>>>>> >> Rüdiger
>>>>>>> >>
>>>>>>> >> Exception in thread "main" java.lang.IllegalArgumentException: replicate_on_write is not a column defined in this metadata
>>>>>>> >>     at com.datastax.driver.core.ColumnDefinitions.getAllIdx(ColumnDefinitions.java:273)
>>>>>>> >>     at com.datastax.driver.core.ColumnDefinitions.getFirstIdx(ColumnDefinitions.java:279)
>>>>>>> >>     at com.datastax.driver.core.Row.getBool(Row.java:117)
>>>>>>> >>     at com.datastax.driver.core.TableMetadata$Options.<init>(TableMetadata.java:474)
>>>>>>> >>     at com.datastax.driver.core.TableMetadata.build(TableMetadata.java:107)
>>>>>>> >>     at com.datastax.driver.core.Metadata.buildTableMetadata(Metadata.java:128)
>>>>>>> >>     at com.datastax.driver.core.Metadata.rebuildSchema(Metadata.java:89)
>>>>>>> >>     at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:259)
>>>>>>> >>     at com.datastax.driver.core.ControlConnection.tryConnect(ControlConnection.java:214)
>>>>>>> >>     at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:161)
>>>>>>> >>     at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:77)
>>>>>>> >>     at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:890)
>>>>>>> >>     at com.datastax.driver.core.Cluster$Manager.newSession(Cluster.java:910)
>>>>>>> >>     at com.datastax.driver.core.Cluster$Manager.access$200(Cluster.java:806)
>>>>>>> >>     at com.datastax.driver.core.Cluster.connect(Cluster.java:158)
>>>>>>> >>     at cassandra.CassandraTestMinimized$delayedInit$body.apply(CassandraTestMinimized.scala:31)
>>>>>>> >>     at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>>>>>>> >>     at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>>>>>>> >>     at scala.App$$anonfun$main$1.apply(App.scala:71)
>>>>>>> >>     at scala.App$$anonfun$main$1.apply(App.scala:71)
>>>>>>> >>     at scala.collection.immutable.List.foreach(List.scala:318)
>>>>>>> >>     at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>>>>>>> >>     at scala.App$class.main(App.scala:71)
>>>>>>> >>     at cassandra.CassandraTestMinimized$.main(CassandraTestMinimized.scala:5)
>>>>>>> >>     at cassandra.CassandraTestMinimized.main(CassandraTestMinimized.scala)
>>>>>>> >
>>>>>>> > I believe you've tried the cassandra trunk branch? trunk is
>>>>>>> > basically the future Cassandra 2.1 and the driver is currently unhappy
>>>>>>> > because the replicate_on_write option has been removed in that version.
>>>>>>> > I'm supposed to have fixed that on the driver 2.0 branch like 2 days ago
>>>>>>> > so maybe you're also using a slightly old version of the driver sources
>>>>>>> > in there? Or maybe I've screwed up my fix, I'll double check. But anyway,
>>>>>>> > it would be overall simpler to test with the cassandra-2.0 branch of
>>>>>>> > Cassandra, with which you shouldn't run into that.
>>>>>>> > --
>>>>>>> > Sylvain
>>>>>>>
>>>>>>> --
>>>>>>> Sorry this was sent from mobile. Will do less grammar and spell
>>>>>>> check than usual.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
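
For anyone skimming the thread, here is a minimal sketch of the EAV
comparison Peter makes in the quoted message. It uses plain Python dicts, not
a Cassandra client, and every name in it (eav_rows, wide_rows,
STATIC_COLUMNS, pivot_eav, read_product) is made up for illustration. It only
shows the shape of the two layouts: with EAV the reader has to pivot many
rows to rebuild one product, while a row that mixes static and dynamic
columns is fetched by a single key.

```python
from collections import defaultdict

# Hypothetical EAV table: one (entity, attribute, value) tuple per property.
eav_rows = [
    ("sku-1", "price", 19.99),
    ("sku-1", "category", "shirts"),
    ("sku-1", "color", "red"),
    ("sku-1", "size", "M"),
    ("sku-2", "price", 450.0),
    ("sku-2", "category", "tv"),
    ("sku-2", "screen_inches", 42),
]


def pivot_eav(rows):
    """Rebuild one dict per entity by pivoting every attribute row."""
    products = defaultdict(dict)
    for entity, attribute, value in rows:
        products[entity][attribute] = value
    return dict(products)


# Wide-row layout (assumed for this sketch): the row key is the SKU, the
# static columns exist on every product, the remaining columns are dynamic.
STATIC_COLUMNS = ("price", "category")

wide_rows = {
    "sku-1": {"price": 19.99, "category": "shirts", "color": "red", "size": "M"},
    "sku-2": {"price": 450.0, "category": "tv", "screen_inches": 42},
}


def read_product(rows, sku):
    """Fetch one product by key and split it into static and dynamic parts."""
    row = rows[sku]
    static = {name: row[name] for name in STATIC_COLUMNS}
    dynamic = {k: v for k, v in row.items() if k not in STATIC_COLUMNS}
    return static, dynamic
```

Calling read_product(wide_rows, "sku-2") touches one row, whereas
pivot_eav(eav_rows) has to walk every attribute row before any product is
complete.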

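The other idea debated in the quoted messages, storing dynamic column names
and values as bytes and handling the types on the client side, can be
sketched the same way. This is not Peter's framework or any Hector or driver
API. The one-byte tags and the helper names (encode, decode, encode_column,
decode_column) are assumptions chosen for this example only.

```python
import struct

# One-byte type tags; the tag values are arbitrary choices for this sketch.
_CODECS = {
    int: (b"i", lambda v: struct.pack(">q", v), lambda b: struct.unpack(">q", b)[0]),
    float: (b"d", lambda v: struct.pack(">d", v), lambda b: struct.unpack(">d", b)[0]),
    str: (b"s", lambda v: v.encode("utf-8"), lambda b: b.decode("utf-8")),
}
_DECODERS = {tag: dec for tag, _enc, dec in _CODECS.values()}


def encode(value):
    """Prefix the serialized value with a tag that records its type."""
    try:
        tag, enc, _dec = _CODECS[type(value)]
    except KeyError:
        raise TypeError("unsupported type: %s" % type(value).__name__)
    return tag + enc(value)


def decode(blob):
    """Read the tag byte and decode the rest with the matching codec."""
    return _DECODERS[blob[:1]](blob[1:])


def encode_column(name, value):
    """Serialize a dynamic column whose name and value may be any known type."""
    return encode(name), encode(value)


def decode_column(name_blob, value_blob):
    """Inverse of encode_column."""
    return decode(name_blob), decode(value_blob)
```

With this, the two dynamic columns from the insert example in the thread
(20 as int with a double value, "dynamicColumn" as string with a long value)
round-trip through bytes: decode_column(*encode_column(20, 30.55)) and
decode_column(*encode_column("dynamicColumn", 3500)).
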