Spark can do the counting on a regular table. Spark SQL would most likely be the easiest thing to get started with.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md
Go down to the Spark SQL section to get some idea of the ease of use.

On Dec 22, 2014 10:00 PM, "ziju feng" <pkdog...@gmail.com> wrote:

> Thanks for the advice, I'll definitely take a look at how Spark works and
> how it can help with counting.
>
> One last question: My current implementation of counting is 1) increment
> counter 2) read counter immediately after the write 3) write counts to
> multiple tables for different query paths and Solr. If I switch to Spark,
> do I still need to use counters, or will the counting be done by Spark on
> a regular table?
>
> On Tue, Dec 23, 2014 at 11:31 AM, Ryan Svihla <rsvi...@datastax.com>
> wrote:
>
>> An increment wouldn't be idempotent from the client unless you knew the
>> count at the time of the update (which you could do with LWT, but that
>> has pretty harsh performance). That particular JIRA is about how counters
>> are laid out and avoiding race conditions between nodes, which was
>> resolved in 2.1 beta 1 (which is now officially out in the 2.1.x branch).
>>
>> General improvements on counters in 2.1 are laid out here:
>> http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters
>>
>> As for best practice, the answer is multiple tables for multiple query
>> paths, or you can use something like Solr or Spark. Take a look at the
>> Spark Cassandra connector for a good way to count on lots of data from
>> lots of different query paths:
>> https://github.com/datastax/spark-cassandra-connector
>>
>> On Mon, Dec 22, 2014 at 9:22 PM, ziju feng <pkdog...@gmail.com> wrote:
>>
>>> I just skimmed through JIRA
>>> <https://issues.apache.org/jira/browse/CASSANDRA-4775> and it seems
>>> there has been some effort to make updates idempotent. Perhaps the
>>> problem can be fixed in the near future?
>>>
>>> Anyway, what is the current best practice for such a use case (counting
>>> and displaying counts in different queries)? I don't need a 100%
>>> accurate count or strong consistency. Performance and application
>>> complexity are my main concerns.
>>>
>>> Thanks
>>>
>>> On Mon, Dec 22, 2014 at 10:37 PM, Ryan Svihla <rsvi...@datastax.com>
>>> wrote:
>>>
>>>> You can cheat it by using the non-counter column as part of your
>>>> primary key (a clustering column specifically), but the cases where
>>>> this could work are limited and the places where it is a good idea are
>>>> even rarer.
>>>>
>>>> As for batches, using counters in batches is already a poorly regarded
>>>> practice, and counter batches have a number of troubling behaviors: as
>>>> already stated, increments aren't idempotent and a batch implies retry.
>>>>
>>>> As for DSE Search, it's doing something drastically different
>>>> internally, and the type of counting it's doing is many orders of
>>>> magnitude faster (think bitmask-style matching + proper async 2i to
>>>> minimize fanout cost).
>>>>
>>>> Generally speaking, counting accurately while being highly available
>>>> creates an interesting set of logical tradeoffs. For example, what do
>>>> you do if you're not able to communicate between two data centers, but
>>>> both are up and serving "likes" quite happily? Is your counting down?
>>>> Do you keep counting but serve up different answers? More accurately,
>>>> since problems are rarely data center to data center but more
>>>> frequently between replicas, how much availability are you willing to
>>>> give up in exchange for a globally accurate count?
>>>>
>>>> On Dec 22, 2014 6:00 AM, "DuyHai Doan" <doanduy...@gmail.com> wrote:
>>>>
>>>>> It's not possible to mix counter and non-counter columns because
>>>>> currently the semantics of a counter are only increment/decrement
>>>>> (thus NOT idempotent), and counters require some special handling
>>>>> compared to other C* columns.
>>>>>
>>>>> On Mon, Dec 22, 2014 at 11:33 AM, ziju feng <pkdog...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I was wondering if there is a plan to allow creating counter columns
>>>>>> and standard columns in the same table.
>>>>>>
>>>>>> Here is my use case:
>>>>>> I want to use a counter to count how many users like a given item in
>>>>>> my application. The like count needs to be returned along with the
>>>>>> details of the item in a query. To support querying items in
>>>>>> different ways, I use both application-maintained denormalized index
>>>>>> tables and DSE Search for indexing. (DSE Search is also used for
>>>>>> text searching.)
>>>>>>
>>>>>> Since the current counter implementation doesn't allow having
>>>>>> counter columns and non-counter columns in the same table, I have to
>>>>>> propagate the current count from the counter table to the main item
>>>>>> table and index tables, so that like counts can be returned by those
>>>>>> index tables without sending extra requests to the counter table,
>>>>>> and so that DSE Search is able to build an index on the like count
>>>>>> column in the main item table to support like-count-related queries
>>>>>> (such as sorting by like count).
>>>>>>
>>>>>> IMHO, the only way to sync data between the counter table and a
>>>>>> normal table within a reasonable time (sub-second) currently is to
>>>>>> read the current value from the counter table right after the
>>>>>> update. However, it suffers from several issues:
>>>>>> 1. Read-after-write may not return the correct count when
>>>>>> replication factor > 1 unless consistency level ALL/LOCAL_ALL is used
>>>>>> 2. There are two extra non-parallelizable round-trips between the
>>>>>> application server and Cassandra, which can have a great impact on
>>>>>> performance.
>>>>>>
>>>>>> If it were possible to store a counter in a standard column family,
>>>>>> only one write would be needed to update the like count in the main
>>>>>> table. The counter value would also eventually be synced between
>>>>>> replicas, so there would be no need for the application to use an
>>>>>> extra mechanism like a scheduled task to get the correct counts.
>>>>>>
>>>>>> A related issue is lifting the limitation of not allowing updates to
>>>>>> counter columns and normal columns in one batch, since it is quite
>>>>>> common to not only have a counter for statistics but also store the
>>>>>> details, such as storing the relation of which user likes which
>>>>>> items in my use case.
>>>>>>
>>>>>> Any idea?
>>
>> --
>>
>> Ryan Svihla
>>
>> Solution Architect
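
The idempotency point made in this thread can be illustrated with a small sketch. It does not use Cassandra or Spark; the dictionaries, sets, and function names below are hypothetical in-memory stand-ins, and the "retry" is simulated by simply repeating the write. It only contrasts a counter-style increment with storing one row per (item, user) like in a regular table and deriving the count from those rows.

```python
# Hypothetical in-memory stand-ins; these are NOT Cassandra or Spark APIs.

def increment_with_retry(counter, item, attempts):
    """Apply a counter-style increment once per attempt.

    Each retry of a timed-out increment adds again, so the final value
    depends on how many attempts reached the store.
    """
    for _ in range(attempts):
        counter[item] = counter.get(item, 0) + 1


def record_like_with_retry(likes, item, user, attempts):
    """Upsert one (item, user) row per attempt; repeated attempts are no-ops."""
    for _ in range(attempts):
        likes.add((item, user))


def count_likes(likes):
    """Derive per-item counts by scanning the rows, as a batch job might."""
    counts = {}
    for item, _user in likes:
        counts[item] = counts.get(item, 0) + 1
    return counts


counter = {}
likes = set()

# A single like whose write is attempted three times (two simulated retries).
increment_with_retry(counter, "item-1", attempts=3)
record_like_with_retry(likes, "item-1", "alice", attempts=3)
# A second user likes the same item, written once.
record_like_with_retry(likes, "item-1", "bob", attempts=1)

print(counter["item-1"])             # 3, although only one like was intended
print(count_likes(likes)["item-1"])  # 2, one row per distinct user
```

The trade-off in this sketch is that the row-per-like layout makes writes safe to retry, but the count has to be computed by scanning rows rather than read from a single cell, which is the kind of work the thread suggests handing to Spark or Solr.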