Re: reported bloom filter FP ratio

2011-12-26 Thread Radim Kolar

On 25.12.2011 20:58, Peter Schuller wrote:

Read Count: 68844

[snip]

Why is the reported bloom filter FP ratio not computed like this:

10/68844.0

0.00014525594096798558

Because the read count is the total number of reads to the CF, while the
bloom filter is per sstable. The number of individual reads to
sstables will be higher than the number of reads to the CF (unless you
happen to have exactly one sstable or no rows ever span sstables).
But the reported ratio is Bloom Filter False Ratio: 0.00495, which is higher 
than my computed ratio of 0.000145. If you were right, the reported ratio 
should be lower than the one I computed from CF reads, because there are 
more reads to sstables than to the CF.


From my investigation of the bloom filter FP ratio it seems that the default 
bloom filter FP ratio (soon user configurable) should be higher. HBase 
defaults to 1%; Cassandra defaults to 0.000744. Bloom filters are using 
quite a bit of memory now.


Re: Doubts related to composite type column names/values

2011-12-26 Thread Edward Capriolo
I would go with composites because Cassandra can do better validation. Also,
with composites you have a few more options for your slice start: inclusive
start, exclusive start, etc. If you are going to concatenate, a tilde is a
better option than ':' because of its ASCII value.

On Wednesday, December 21, 2011, aaron morton wrote:
> Keys are sorted by their token; when using the RandomPartitioner this is
> an MD5 hash, so they are essentially randomly sorted.
> I would use CompositeTypes as keys if they make sense for your app, e.g.
> you are storing time series data and the row key is the time stamp and the
> length of the time span. In this case you have a stable, known key format.
> The advantage here is the same as any time you introduce type awareness
> into a system: somewhere, some code will notice if you try to store a key
> of the wrong form.
> If you have keys that have a variable number of elements, such as a path
> hierarchy, it would not make sense to model that as a CompositeType (IMHO).
> Cheers
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> On 22/12/2011, at 1:26 AM, R. Verlangen wrote:
>
> Is it true that you can also just get the same results as when you pick a
> UTF8 key with this content:
> keyA:keyB
> Or should you really use composite keys? If so, what is the big
> advantage of composite over combined UTF-8 keys?
> Robin
>
> 2011/12/21 Sylvain Lebresne 
>>
>> On Tue, Dec 20, 2011 at 9:33 PM, Maxim Potekhin wrote:
>> > Thank you Aaron! As long as I have plain strings, would you say that I
>> > would do almost as well with concatenation?
>>
>> Not without a concatenation-aware comparator. The padding Aaron is
>> talking about is not only a mixed-type problem. What I mean here is that
>> if you use a simple string comparator (UTF8Type, AsciiType or even
>> BytesType), then you will have the following sorting:
>> "foo24:bar"
>> "foo:bar"
>> "foobar:bar"
>> because ':' is between '2' and 'b' in ASCII. You could use another
>> separator, but you get the point. In other words, concatenating strings
>> doesn't make the comparator aware of the component boundaries.
>> CompositeType, on the other hand, sorts each component separately, so it
>> will sort:
>> "foo"    : "bar"
>> "foo24"  : "bar"
>> "foobar" : "bar"
>> which is usually what you want.
>>
>> --
>> Sylvain
>>
>> >
>> > Of course I realize that mixed types are a very different case where
>> > the composite is very useful.
>> >
>> > Thanks
>> >
>> > Maxim
>> >
>> >
>> >
>> > On 12/20/2011 2:44 PM, aaron morton wrote:
>> >
>> > Component values are compared in a type-aware fashion: an Integer is an
>> > Integer, not a 10-character zero-padded string.
>> >
>> > You can also slice on the components, just like with string concat, but
>> > nicer. e.g. if your app is storing comments for a thing, and the column
>> > names are composites with the comment_id as the first component, you can
>> > slice for all properties of a comment or for all properties of comments
>> > between two comment_ids.
>> >
>> > Finally, the client library knows what's going on.
>> >
>> > Hope that helps.
>> >
>> > -
>> > Aaron Morton
>> > Freelance Developer
>> > @aaronmorton
>> > http://www.thelastpickle.com
>> >
>> > On 21/12/2011, at 7:43 AM, Maxim Potekhin wrote:
>> >
>> > With regard to static composites, what are the major benefits compared
>> > with string concatenation (with some convenient separator inserted)?
>> >
>> > Thanks
>> >
>> > Maxim
>> >
>> >
>> > On 12/20/2011 1:39 PM, Richard Low wrote:
>> >
>> > On Tue, Dec 20, 2011 at 5:28 PM, Ertio Lew wrote:
>> >
>> > With regard to the composite columns stuff in Cassandra, I have the
>> > following doubts:
>> >
>> > 1. What is the storage overhead of the composite type column names/values?
>> >
>> > The values are the same. For each dimension, there are 3 bytes of overhead.
>> >
>> > 2. What exactly is the difference between the DynamicComposite and the
>> > static Composite?
>> >
>> > The static composite type has the types of each dimension specified in the
>> > column family definition, so all names within that column family have
>> > the same type. The dynamic composite type lets you specify the type for
>> > each column, so they can be different. There is extra storage overhead
>> > for this, and care must be taken to ensure all column names remain
>> > comparable.
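As a plain-Java aside illustrating the sorting point above (this is not
Cassandra's comparator code, it only mimics a byte-wise comparator versus a
component-wise one):

import java.util.Arrays;
import java.util.Comparator;

public class ConcatVsComposite {
    public static void main(String[] args) {
        // Concatenated names compared as whole strings, the way UTF8Type or
        // BytesType would compare the full column name byte by byte.
        String[] concatenated = {"foo:bar", "foo24:bar", "foobar:bar"};
        Arrays.sort(concatenated);
        System.out.println(Arrays.toString(concatenated));
        // -> [foo24:bar, foo:bar, foobar:bar]  (':' sorts between '2' and 'b')

        // Component-wise comparison, roughly what CompositeType does:
        // compare the first component, then the second.
        String[][] composite = {{"foo", "bar"}, {"foo24", "bar"}, {"foobar", "bar"}};
        Arrays.sort(composite, new Comparator<String[]>() {
            public int compare(String[] a, String[] b) {
                int cmp = a[0].compareTo(b[0]);
                return cmp != 0 ? cmp : a[1].compareTo(b[1]);
            }
        });
        System.out.println(Arrays.deepToString(composite));
        // -> [[foo, bar], [foo24, bar], [foobar, bar]]
    }
}

The first sort shows exactly the interleaving Sylvain describes; the second
keeps all the "foo" names together, which is what the type-aware comparator
buys you.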


Re: reported bloom filter FP ratio

2011-12-26 Thread Peter Schuller
> But the reported ratio is Bloom Filter False Ratio: 0.00495, which is higher
> than my computed ratio of 0.000145. If you were right, the reported ratio
> should be lower than the one I computed from CF reads, because there are
> more reads to sstables than to the CF.

The ratio is the ratio of false positives to true positives *per
sstable*. It's not the number of false positives in each sstable *per
CF read*. Thus, there is no expectation of higher vs. lower, and the
magnitude of the discrepancy is easily explained by the fact that you
only have 10 false positives. That's not a statistically significant
sample.
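To make the two denominators concrete, here is a back-of-the-envelope sketch.
The 10 false positives and 68844 CF reads are from the thread; the true
positive count is hypothetical, chosen so the numbers line up with the
reported 0.00495 under the assumption that the metric is computed as
false / (false + true):

public class FpRatioExample {
    public static void main(String[] args) {
        long cfReads = 68844;      // reads against the column family (from the thread)
        long falsePositives = 10;  // bloom filter false positives (from the thread)
        long truePositives = 2010; // hypothetical correct positive sstable checks

        // Hand computation from the thread: false positives per CF read.
        System.out.println((double) falsePositives / cfReads);  // ~0.000145

        // Roughly what the reported metric expresses: false positives relative
        // to positive bloom filter outcomes across sstable checks.
        System.out.println(
                (double) falsePositives / (falsePositives + truePositives));  // ~0.00495
    }
}

The two numbers measure different things, so neither is expected to bound the
other.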

> From my investigation of the bloom filter FP ratio it seems that the default
> bloom filter FP ratio (soon user configurable) should be higher. HBase
> defaults to 1%; Cassandra defaults to 0.000744. Bloom filters are using
> quite a bit of memory now.

I don't understand how you reached that conclusion. There is a direct
trade-off between memory use and false positive hit rate, yes. That
does not mean that HBase's 1% is magically the correct choice.

I definitely think it should be tweakable (and IIRC there's work
happening on a JIRA to make this an option now), but a 1% false
positive hit rate will be completely unacceptable in some
circumstances. In others it is perfectly acceptable, given the decrease
in memory use and the small number of reads.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Newbie question about writer/reader consistency

2011-12-26 Thread Vladimir Mosgalin
Hello everybody.

I am a developer of a financial application, and I'm currently evaluating
various NoSQL databases for our current goal: storing various views which show
the state of the system, in different aspects, after each transaction.

The write load seems to be bigger than a typical SQL database would handle
without problems: under a test load of tens of transactions per second, each
transaction generates changes in a dozen views, which adds up to hundreds of
messages per second in total. Each message ("change") for each view must be
stored, as well as the resulting view (generated as a kind of update of the old
view); this means multiple inserts and updates per message, which go as a
single transaction. I started to look into NoSQL databases. I'm a bit puzzled
by the guarantees of atomicity and isolation that Cassandra provides, so my
question is about how to (if possible at all) attain the required level of
consistency in Cassandra. I've read various documents and introductions to
Cassandra's data model but still can't understand the basics of data
consistency. This discussion
http://stackoverflow.com/questions/6033888/cassandra-atomicity-isolation-of-column-updates-on-a-single-row-on-on-single-n
makes me feel disappointed about consistency in Cassandra, but I wonder
whether there is a way to work around it.

The requirements are like this. There is one writer, which modifies two
"tables" (I'm sorry for using SQL terms, I just don't want to create
more confusion by mapping them into Cassandra terms at this stage). For
the first table, it's a simple insert; the index is a unique SCN which is
guaranteed to be larger than the previous one.

Let's say it inserts
SCN DATA
1   AAA
2   BBB
3   CCC

The goal for the client (reader) is to get all the data from SCN N to SCN M
without gaps. It is fine if it can't see the very latest SCN yet, that is, it
gets "1:AAA" and "2:BBB" on the request "SCN: 1..END"; what is NOT fine is to
get only "1:AAA" and "3:CCC". In other words, does Cassandra provide
consistency between writer and reader regarding the order of changes? Or,
under some conditions (say, very fast writes - but always from a single
writer - and many concurrent reads, or something), might it be possible to get
that kind of gap?

The second question is similar, but on a bigger scale. The second table must be
modified in a more complicated way; both inserts and updates of old data are
required. Sometimes it's a few inserts and a few updates, which must be done
atomically - under no conditions should a reader be able to see the mid-state
of these inserts/updates. Fortunately, all these new changes will have a new
key (new SCNs), so if it were possible to use a column in a separate table
which stores the "last safe SCN", it would work - but I'm not confident that
Cassandra offers that level of consistency. For example, let's say it works
like this:

current last safe SCN: 1000

update (must be viewed as an atomic "transaction"):
SCN   DATA
1001  AAA
1002  BBB
800   1001
1003  DDD

new last safe SCN: 1003

Here, readers need a means to filter out lines with SCN > 1000 until the writer
is done writing the "1003:DDD" line. They also need to filter out the
"800:1001" line because it references an SCN which is after the current "last
safe" one.

The "last safe SCN" is stored somewhere, and for this pattern to work I once
again need "execution order" consistency - no reader should ever see the "last
safe: 1003" line before all the previous lines were committed; and any reader
who saw the "last safe: 1003" line must be able to see all the lines from that
update just as they are right now.

Is this possible to do in Cassandra?
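One way this pattern is commonly sketched on top of Cassandra: write every
data column of a batch first, advance a "last safe SCN" marker afterwards, and
have readers fetch the marker before slicing and never read past it. The
sketch below is plain Java against a hypothetical store interface (it is not a
real Cassandra client API); it assumes every operation uses QUORUM (R + W > N)
and that the marker only advances after all of the batch's writes have
succeeded, so a reader that sees the new marker can also read every column the
marker covers.

import java.util.List;

// Hypothetical wrapper around a Cassandra client; method names are illustrative.
interface ViewStore {
    void writeRow(long scn, String data);                      // QUORUM write of one SCN column
    void writeLastSafeScn(long scn);                           // QUORUM write of the marker
    long readLastSafeScn();                                    // QUORUM read of the marker
    List<String> sliceRows(long fromScn, long toScnInclusive); // QUORUM slice by SCN
}

class ScnProtocol {
    // Writer: persist every column of the batch, then publish it by moving the marker.
    static void writeBatch(ViewStore store, long[] scns, String[] data, long newLastSafe) {
        for (int i = 0; i < scns.length; i++) {
            store.writeRow(scns[i], data[i]);
        }
        // Only advance the marker once all columns above are durably written.
        store.writeLastSafeScn(newLastSafe);
    }

    // Reader: read the marker first and never slice past it; columns with a
    // higher SCN may belong to an in-flight batch and are simply not requested.
    static List<String> readUpTo(ViewStore store, long fromScn) {
        long lastSafe = store.readLastSafeScn();
        return store.sliceRows(fromScn, lastSafe);
    }
}

This hides a batch until the marker moves, but it is not isolation in the SQL
sense (Cassandra has no multi-row transactions), and if the writer dies between
writing the columns and advancing the marker, the batch stays invisible until
the write is retried.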



Re: reported bloom filter FP ratio

2011-12-26 Thread Radim Kolar
My misunderstanding of the FP ratio was based on the assumption that the ratio
is counted from node start, while it is actually getRecentBloomFilterFalseRatio().


> I don't understand how you reached that conclusion.

On my nodes most memory is consumed by bloom filters. Also, 1.0 creates 
larger bloom filters than 0.8, leading to higher memory consumption; I 
just checked a few sstables for the bloom filter to index size ratio on the 
same dataset. In 0.8 bloom filters are about 13% of the index size, and in 
1.0 it's about 16%. The key used in the CF is a fixed-size 4-byte integer.


Cassandra does not measure memory used by index sampling yet; I suspect 
that it is memory hungry too and can be safely lowered by default - I see 
very little difference when changing index sampling from 64 to 512.


The basic problem with Cassandra daily administration which I am currently 
solving is that memory consumption grows with your dataset size. I don't 
really like this design - you put more data in and the cluster can OOM. This 
makes Cassandra a suboptimal solution for data archiving. It will get 
better once tunable bloom filters are committed.
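For anyone who wants to watch these numbers directly, the recent and lifetime
false ratios are exposed per column family over JMX. A minimal standalone
sketch is below; the MBean object name and attribute names follow the 1.0-era
naming and the host, keyspace and CF values are placeholders, so verify them
against your own build (for example by browsing with jconsole) before relying
on this:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BloomFpProbe {
    public static void main(String[] args) throws Exception {
        // 7199 is the default Cassandra JMX port; adjust host/port as needed.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // Assumed 1.0-era MBean name for a column family; verify with jconsole.
            ObjectName cf = new ObjectName(
                    "org.apache.cassandra.db:type=ColumnFamilies,keyspace=MyKeyspace,columnfamily=MyCF");
            System.out.println("recent FP ratio:   "
                    + mbs.getAttribute(cf, "RecentBloomFilterFalseRatio"));
            System.out.println("lifetime FP ratio: "
                    + mbs.getAttribute(cf, "BloomFilterFalseRatio"));
        } finally {
            jmxc.close();
        }
    }
}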


Merging 3 rows that are mostly read together from CF into single rows with composite col names ?

2011-12-26 Thread Asil Klin
If 3 rows in a column family always need to be read together, is it
preferable to just merge them into 1 row using composite column names (instead
of keeping 3 rows)? Does this improve read performance at all?


better anti OOM

2011-12-26 Thread Radim Kolar

If a node is low on memory (0.95+ of the heap used), it could:

1. stop repair
2. stop the largest compaction
3. reduce the number of compaction slots
4. switch compaction to single-threaded

Flushing the largest memtable / reducing caches is not enough.


Re: reported bloom filter FP ratio

2011-12-26 Thread Peter Schuller
>> I don't understand how you reached that conclusion.
>
> On my nodes most memory is consumed by bloom filters. Also 1.0 creates

The point is that just because that's the problem you have doesn't
mean the default is wrong, since it quite clearly depends on the use case.
If your relative number of rows is low compared to the cost of
sustaining a read-heavy workload, the trade-off is different.

> Cassandra does not measure memory used by index sampling yet; I suspect that
> it is memory hungry too and can be safely lowered by default - I see very
> little difference when changing index sampling from 64 to 512.

Bloom filters and index sampling are the two major contributors to
memory use that scale with the number of rows (and thus typically with
data size). This is known. Index sampling can indeed be significant.

The default is 128 though, not 64. Here again it's a matter of
trade-offs; 512 may have worked for you, but it doesn't mean it's an
appropriate default (I am not arguing for 128 either, I am just saying
that it's more complex than observing that in your particular case you
didn't see a problem with 512). Part of the trade-off is additional
CPU usage implied in streaming and deserializing a larger amount of
data per average sstable index read; part of the trade-off is also
effects on I/O: a sparser index sampling could result in a higher
number of seeks per index lookup.

> The basic problem with Cassandra daily administration which I am currently
> solving is that memory consumption grows with your dataset size. I don't
> really like this design - you put more data in and the cluster can OOM. This
> makes Cassandra a suboptimal solution for data archiving. It will get better
> once tunable bloom filters are committed.

That is a good reason for both to be configurable, IMO.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
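To put rough numbers on that trade-off, the classic bloom filter sizing
formula bits/key = -ln(p) / (ln 2)^2 approximates the memory cost of a target
false positive rate. The sketch below is back-of-the-envelope math only
(Cassandra's actual filter sizing is bucketed and will differ somewhat); the
200 million row figure is borrowed from later in this thread:

public class BloomSizing {
    // Approximate bits per key for a target false positive rate p,
    // using the standard formula bits/key = -ln(p) / (ln 2)^2.
    static double bitsPerKey(double p) {
        return -Math.log(p) / (Math.log(2) * Math.log(2));
    }

    public static void main(String[] args) {
        long rows = 200000000L;             // ~200 million rows
        double[] rates = {0.000744, 0.01};  // the two defaults mentioned in the thread
        for (double p : rates) {
            double bits = bitsPerKey(p);
            double megabytes = rows * bits / 8 / (1024.0 * 1024.0);
            System.out.printf("p=%.6f -> %.1f bits/key, ~%.0f MB for %d rows%n",
                    p, bits, megabytes, rows);
        }
    }
}

By this estimate, 0.000744 costs roughly 15 bits per key (~360 MB for 200M
rows) versus roughly 9.6 bits per key (~230 MB) at 1%, which is exactly the
memory-versus-accuracy trade-off being discussed.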


Re: better anti OOM

2011-12-26 Thread Peter Schuller
> If a node is low on memory (0.95+ of the heap used), it could:
>
> 1. stop repair
> 2. stop the largest compaction
> 3. reduce the number of compaction slots
> 4. switch compaction to single-threaded
>
> Flushing the largest memtable / reducing caches is not enough.

Note that the "emergency" flushing is just a stop-gap. You should run
with appropriately sized heaps under normal conditions; the emergency
flushing stuff is intended to mitigate the effects of having a heap that
is too small; it is not expected to avoid the detrimental effects
completely.

Also note that things like compaction do not normally contribute
significantly to the live size on your heap, but they typically do
contribute to the allocation rate, which can cause promotion failures or
concurrent mode failures if your heap size is too small and/or your
concurrent mark/sweep settings are not aggressive enough. Aborting
compaction wouldn't really help with anything other than avoiding a
fallback to full GC in the short term.

I suggest you describe exactly what problem you are having and why
you think stopping compaction/repair is the appropriate solution.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: better anti OOM

2011-12-26 Thread Radim Kolar
I suggest you describe exactly what problem you are having and why you
think stopping compaction/repair is the appropriate solution.


Compacting a 41.7 GB CF with about 200 million rows adds ~600 MB to the
heap; the node logs messages like:


 WARN [ScheduledTasks:1] 2011-12-27 00:20:57,972 GCInspector.java (line 
146) Heap is 0.9712791382159731 full.  You may need to reduce memtable 
and/or cache sizes.  Cassandra will now flush up to the two largest 
memtables to free up memory.  Adjust flush_largest_memtables_at 
threshold in cassandra.yaml if you don't want Cassandra to do this 
automatically
 INFO [ScheduledTasks:1] 2011-12-27 00:21:12,362 StorageService.java 
(line 2608) Unable to reduce heap usage since there are no dirty column 
families


And it's pretty dead; killing compaction will make it alive again.

After node boot
Heap Memory (MB) : 1157.98 / 1985.00

disabled gossip + thrift, only compaction running
Heap Memory (MB) : 1981.00 / 1985.00


Re: better anti OOM

2011-12-26 Thread Peter Schuller
> I suggest you describe exactly what problem you are having and why you
> think stopping compaction/repair is the appropriate solution.
>
> Compacting a 41.7 GB CF with about 200 million rows adds ~600 MB to the
> heap; the node logs messages like:

I don't know what you are basing that on. It seems unlikely to me that
the working set of a compaction is 600 MB. However, it may very well
be that the allocation rate is such that it contributes an additional
600 MB of average heap usage after a CMS phase has completed.

> After node boot
> Heap Memory (MB) : 1157.98 / 1985.00
>
> disabled gossip + thrift, only compaction running
> Heap Memory (MB) : 1981.00 / 1985.00

Using "nodetool info" to monitor heap usage is not really useful
unless it is done continuously over time, observing the free heap after
CMS phases have completed. Regardless, the heap is always expected to
grow in usage to the occupancy trigger which kick-starts CMS. That
said, 1981/1985 does indicate a non-desirable state for Cassandra, but
it does not mean that compaction is "using" 600 MB as such (in terms
of live set). You might say that it implies >= 600 MB of extra heap
required at your current heap size and GC settings.

If you want to understand what's happening I suggest attaching with
visualvm/jconsole and looking at the GC behavior, and running with
-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps. When attached with visualvm/jconsole you can
hit "perform GC" and see how far the heap drops, to judge what the
actual live set is.

Also, you say it's "pretty dead". What exactly does that mean? Does it
OOM? I suspect you're just seeing fallbacks to full GC and long pauses
because you're allocating and promoting to old-gen fast enough that
CMS is just not keeping up, rather than it having to do with memory
"use" per se.

In your case, I suspect you simply need to run with a bigger heap or
reconfigure CMS to use additional threads for concurrent marking
(-XX:ParallelCMSThreads=XXX - try XXX = the number of CPU cores, for
example, in this case). Alternatively, use a larger young gen to avoid
so much data getting promoted during compaction.

But really, in short: The easiest fix is probably to increase the heap
size. I know this e-mail doesn't begin to explain details but it's
such a long story.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Retrieve all composite columns from a row, whose composite name's first component matches from a list of Integers

2011-12-26 Thread Aditya
I need to store data for all activities by a user's followees in a single row.
I am trying to do that using composite column names in a single user-specific
row named 'rowX'.

On any activity by a user's followee on an item, a column is stored in
'rowX'. The column has a composite column name made up of
itemId+userId (which makes the column name unique), and the column value
contains the activity data related to that item by that followee.


Now I want to retrieve the activity by all users on a list of items, so I need
to retrieve all composite columns whose first component matches a given itemId.
Is it possible to do such a query in Cassandra? I am using Hector.
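A rough Hector sketch of that prefix slice follows; treat it as an outline
rather than verified API usage, since the CF name, row key and component types
are made up for the example and the exact Hector method signatures should be
checked against the version in use. The idea is to build two Composite bounds
that each contain only the first component: the start with EQUAL and the end
with GREATER_THAN_EQUAL, so the slice covers every (itemId, userId) column
sharing that itemId.

import me.prettyprint.cassandra.serializers.CompositeSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.beans.Composite;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.SliceQuery;

public class FollowActivityQuery {
    // Returns all columns in rowKey whose composite name starts with itemId.
    static ColumnSlice<Composite, String> activityForItem(
            Keyspace keyspace, String rowKey, long itemId) {
        Composite start = new Composite();
        start.addComponent(0, itemId, Composite.ComponentEquality.EQUAL);
        Composite end = new Composite();
        end.addComponent(0, itemId, Composite.ComponentEquality.GREATER_THAN_EQUAL);

        SliceQuery<String, Composite, String> query = HFactory.createSliceQuery(
                keyspace, StringSerializer.get(), new CompositeSerializer(),
                StringSerializer.get());
        query.setColumnFamily("UserFeed");              // placeholder CF name
        query.setKey(rowKey);
        query.setRange(start, end, false, Integer.MAX_VALUE);  // page with a sane count in practice
        return query.execute().get();
    }
}

To cover a list of itemIds you would repeat the slice per itemId (or issue a
multiget-style query per item), since one contiguous slice can only cover a
single prefix range.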


Peregrine: A new map reduce framework for iterative/pipelined jobs.

2011-12-26 Thread Kevin Burton
I'm pleased to announce Peregrine 0.5.0 - a new map reduce framework
optimized for iterative and pipelined map reduce jobs.

http://peregrine_mapreduce.bitbucket.org/

This originally started off with some internal work at Spinn3r to build a
fast and efficient PageRank implementation. We realized that what we wanted
was an MR runtime optimized for this type of work, which differs radically
from the traditional Hadoop design.

Peregrine implements a partitioned distributed filesystem where key/value
pairs are routed to defined partitions. This enables work to be joined
against previous iterations or different units of work by the same key on
the same local system.

Peregrine is optimized for ETL jobs where the primary data storage system
is an external database such as Cassandra, HBase, MySQL, etc. Jobs are then
run as Extract, Transform and Load stages, with intermediate data being
stored in the Peregrine FS.

We enable features such as Map/Reduce/Merge as well as some additional
functionality like ExtractMap and ReduceLoad (in ETL parlance).

A key innovation here is a partitioning layout algorithm that can support
fast many-to-many recovery, similar to HDFS, but still support partitioned
operation with deterministic key placement.

We've also tried to optimize for single-instance performance and use modern
IO primitives as much as possible. This includes NOT shying away from
operating-system-specific features such as mlock, fadvise, fallocate, etc.

There is still a bit more work I want to do before I am ready to benchmark
it against Hadoop. Instead of implementing a synthetic benchmark, we wanted
to get a production-ready version first, which would allow people to port
existing applications and see what the before/after performance numbers
looked like in the real world.

For more information please see:

http://peregrine_mapreduce.bitbucket.org/

As well as our design documentation:

http://peregrine_mapreduce.bitbucket.org/design/



-- 
-- 

Founder/CEO Spinn3r.com 

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*