You mean 100MB (megabytes)? Also, the data in each of my columns is about 1KB, so in that case the optimal size is 100K columns (since 100K * 1KB = 100MB), right?
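(For what it's worth, here is a minimal sketch of that sizing arithmetic in Java; the ~100MB target and ~1KB column size are the figures from this thread, not limits enforced by Cassandra:)

    // Back-of-the-envelope sizing: how many ~1KB columns fit under a
    // ~100MB partition-size target. Figures are illustrative only.
    public class PartitionBudget {
        public static void main(String[] args) {
            long targetPartitionBytes = 100L * 1024 * 1024; // ~100MB guideline
            long avgColumnBytes = 1024;                     // ~1KB per column
            long columnBudget = targetPartitionBytes / avgColumnBytes;
            // Prints 102400, i.e. roughly the "100K columns" above.
            System.out.println("Column budget per partition: " + columnBudget);
        }
    }

(Note that on-disk partition size also includes clustering keys, timestamps and other per-cell overhead, so the real column budget is somewhat lower than the raw division suggests.)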
On Sat, Oct 15, 2016 at 4:26 AM, DuyHai Doan <doanduy...@gmail.com> wrote:

> "2) so what is optimal limit in terms of data size?"
>
> --> Usual recommendations for Cassandra 2.1 are:
>
> a. max 100Mb per partition size
> b. or up to 10,000,000 physical columns per partition (including clustering columns etc ...)
>
> Recently, with the work of Robert Stupp (CASSANDRA-11206) and also with the huge enhancement from Michael Kjellman (CASSANDRA-9754), it will be easier to handle huge partitions in memory, especially with a reduced memory footprint with regards to the JVM heap.
>
> However, as long as we don't have repair and streaming processes that can be "resumed" in the middle of a partition, the operational pains will still be there. Same for compaction.
>
> On Sat, Oct 15, 2016 at 12:00 PM, Kant Kodali <k...@peernova.com> wrote:
>
>> 1) It will be great if someone can confirm that there is no limit.
>> 2) So what is the optimal limit in terms of data size?
>>
>> Finally, thanks a lot for pointing out all the operational issues!
>>
>> On Sat, Oct 15, 2016 at 2:39 AM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>
>>> "But is there still 2B columns limit on the Cassandra code?"
>>>
>>> --> I remember one of the committers saying that this 2B columns limitation comes from the Thrift era, where you were limited to a max of 2B columns returned to the client for each request. It also applies to the max size of each "page" of data.
>>>
>>> Since the introduction of the binary protocol and the paging feature, this limitation does not make sense anymore.
>>>
>>> By the way, if your partition is too wide, you'll face other operational issues way before reaching the 2B columns limit:
>>>
>>> - compaction taking a looooong time --> heap pressure --> long GC pauses --> nodes flapping
>>> - repair & over-streaming: a repair session failure in the middle forces you to re-send the whole big partition --> the receiving node has a bunch of duplicate data --> pressure on compaction
>>> - bootstrapping of new nodes: a failure to stream a partition in the middle will force re-sending the whole partition from the beginning --> the receiving node has a bunch of duplicate data --> pressure on compaction
>>>
>>> On Sat, Oct 15, 2016 at 9:15 AM, Kant Kodali <k...@peernova.com> wrote:
>>>
>>>> Compacting 10 sstables, each of which has a 15GB partition, in what duration?
>>>>
>>>> On Fri, Oct 14, 2016 at 11:45 PM, Matope Ono <matope....@gmail.com> wrote:
>>>>
>>>>> Please forget that part of my sentence.
>>>>> To be more correct, maybe I should have said "He could compact 10 sstables, each of which has a 15GB partition".
>>>>> What I wanted to say is that we can store many more rows (and columns) in a partition than before 3.6.
>>>>>
>>>>> 2016-10-15 15:34 GMT+09:00 Kant Kodali <k...@peernova.com>:
>>>>>
>>>>>> "Robert said he could treat safely 10 15GB partitions at his presentation" This sounds like there is a row limit too, not only a column limit?
>>>>>>
>>>>>> If I am reading this correctly, 10 15GB partitions means 10 partitions (like 10 row keys, that's too small) with each partition of size 15GB (that's like 15 million columns where each column can have data of size 1KB).
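(A minimal sketch of the driver-side paging DuyHai describes above, using the DataStax Java driver 3.x; the keyspace, table and column names here are made up for illustration:)

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    public class WidePartitionRead {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            try {
                Session session = cluster.connect("my_ks"); // hypothetical keyspace
                Statement stmt = new SimpleStatement(
                        "SELECT col FROM wide_table WHERE rowkey = ?", "rowkey1");
                // Caps the rows fetched per network round trip; the driver
                // transparently requests the next page as iteration crosses a
                // page boundary, so no single response has to carry the whole
                // partition -- which is why the old per-response 2B concern
                // no longer applies.
                stmt.setFetchSize(5000);
                ResultSet rs = session.execute(stmt);
                for (Row row : rs) {
                    System.out.println(row.getString("col"));
                }
            } finally {
                cluster.close();
            }
        }
    }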
>>>>>>
>>>>>> On Fri, Oct 14, 2016 at 11:30 PM, Kant Kodali <k...@peernova.com> wrote:
>>>>>>
>>>>>>> "Robert said he could treat safely 10 15GB partitions at his presentation" This sounds like there is a row limit too, not only a column limit?
>>>>>>>
>>>>>>> If I am reading this correctly, 10 15GB partitions means 10 partitions (like 10 row keys, that's too small) with each partition of size 15GB (that's like 10 million columns where each column can have data of size 1KB).
>>>>>>>
>>>>>>> On Fri, Oct 14, 2016 at 9:54 PM, Matope Ono <matope....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks to CASSANDRA-11206, I think we can have much larger partitions than before 3.6.
>>>>>>>> (Robert said he could treat safely 10 15GB partitions at his presentation: https://www.youtube.com/watch?v=N3mGxgnUiRY)
>>>>>>>>
>>>>>>>> But is there still a 2B columns limit in the Cassandra code?
>>>>>>>> If so, out of curiosity, I'd like to know where the bottleneck is.
>>>>>>>> Could anyone let me know about it?
>>>>>>>>
>>>>>>>> Thanks, Yasuharu.
>>>>>>>>
>>>>>>>> 2016-10-13 1:11 GMT+09:00 Edward Capriolo <edlinuxg...@gmail.com>:
>>>>>>>>
>>>>>>>>> The "2 billion column limit" is press-clipping "puffery". This statement seemingly became popular because of a highly trafficked story in which a tech reporter embellished a statement to make a splashy article.
>>>>>>>>>
>>>>>>>>> The effect is something like this:
>>>>>>>>> http://www.healthnewsreview.org/2012/08/iced-tea-kidney-stones-and-the-study-that-never-existed/
>>>>>>>>>
>>>>>>>>> Iced tea does not cause kidney stones! Cassandra does not store rows with 2 billion columns! It is just not true.
>>>>>>>>>
>>>>>>>>> On Wed, Oct 12, 2016 at 4:57 AM, Kant Kodali <k...@peernova.com> wrote:
>>>>>>>>>
>>>>>>>>>> Well, 1) I have not sent it to the PostgreSQL mailing lists, and 2) I thought this was an open-ended question, as it can involve ideas from everywhere, including the Cassandra Java driver mailing lists, so sorry if that bothered you for some reason.
>>>>>>>>>>
>>>>>>>>>> On Wed, Oct 12, 2016 at 1:41 AM, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Also, I'm not sure, but I don't think it's "cool" to write to multiple lists in the same message (based on the PostgreSQL mailing list rules).
>>>>>>>>>>> For example, I'm not subscribed to those, and now the messages are separated.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Oct 12, 2016 at 10:37 AM, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> There are some issues working on larger partitions.
>>>>>>>>>>>> HBase doesn't do what you say! You also have to be careful on HBase not to create large rows! But since they are globally sorted, you can easily split between them and keep rows small.
>>>>>>>>>>>>
>>>>>>>>>>>> In my opinion, the Cassandra people are wrong when they say "globally sorted is the devil!", while fb/google/etc actually use globally-sorted most of the time! You have to be careful though (just like with random partitioning).
>>>>>>>>>>>>
>>>>>>>>>>>> Can you tell what rowkey1, page1, col(x) actually are?
>>>>>>>>>>>> Maybe there is a way.
>>>>>>>>>>>> "The most recent" means there's a timestamp in there?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Oct 12, 2016 at 9:58 AM, Kant Kodali <k...@peernova.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I understand Cassandra can have a maximum of 2B rows per partition, but in practice some people seem to suggest the magic number is 100K. Why not create another partition/rowkey automatically (whenever we reach a safe limit that we consider efficient), with an auto-increment bigint as a suffix appended to the new rowkey, so that the driver can return the new rowkey indicating that there is a new partition, and so on? Now I understand this would involve allowing partial row key searches, which currently Cassandra doesn't do (but I believe HBase does), and thinking about token ranges and potentially many other things...
>>>>>>>>>>>>>
>>>>>>>>>>>>> My current problem is this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a row key followed by a bunch of columns (this is not time series data), and these columns can grow to any number, so since I have a 100K limit (or whatever the number is; say some limit) I want to break the partition into levels/pages:
>>>>>>>>>>>>>
>>>>>>>>>>>>> rowkey1, page1 -> col1, col2, col3 ......
>>>>>>>>>>>>> rowkey1, page2 -> col1, col2, col3 ......
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now say my Cassandra db is populated with data, and say my application just got booted up and I want the most recent value of a certain partition, but I don't know which page it belongs to since my application just got booted up. How do I solve this in the most efficient way possible in Cassandra today? I understand I can create MVs or other tables that can hold some auxiliary data, such as the number of pages per partition, and so on, but that involves the maintenance cost of that other table, which I cannot really afford because I already have MVs and secondary indexes for other good reasons. So it would be great if someone could explain the best way possible as of today with Cassandra. By best way I mean: is it possible with one request? If yes, then how? If not, then what is the next best way to solve this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> kant
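(One possible answer to that last question, sketched under stated assumptions: the paged table is keyed PRIMARY KEY ((rowkey, page), col) as in the rowkey1/page1 scheme above, and pages are numbered 1, 2, 3 ... without gaps. With this layout it cannot be done in one request without an auxiliary table, but an exponential probe followed by a binary search finds the newest page in O(log n) cheap single-partition reads. The keyspace, table and column names are hypothetical; DataStax Java driver 3.x:)

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class NewestPageProbe {

        // Gallop upward (1, 2, 4, 8, ...) until a page is empty, then binary
        // search the remaining gap. Assumes pages are filled without gaps.
        static long newestPage(Session session, String rowKey) {
            long hi = 1;
            while (pageExists(session, rowKey, hi)) {
                hi *= 2;
            }
            long lo = hi / 2;  // last page known to exist (0 if none exist)
            long newest = 0;   // 0 means "no pages found"
            while (lo <= hi) {
                long mid = lo + (hi - lo) / 2;
                if (pageExists(session, rowKey, mid)) {
                    newest = mid;
                    lo = mid + 1;
                } else {
                    hi = mid - 1;
                }
            }
            return newest;
        }

        // One cheap probe: a single-partition read with LIMIT 1.
        static boolean pageExists(Session session, String rowKey, long page) {
            return session.execute(new SimpleStatement(
                    "SELECT rowkey FROM paged_table WHERE rowkey = ? AND page = ? LIMIT 1",
                    rowKey, page)).one() != null;
        }

        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            try {
                Session session = cluster.connect("my_ks"); // hypothetical keyspace
                System.out.println("Newest page: " + newestPage(session, "rowkey1"));
            } finally {
                cluster.close();
            }
        }
    }

(The probe trades a handful of extra reads at application boot for not having to maintain the auxiliary pages-per-partition table; once found, the newest page number can be cached in the application.)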