I'm trying to decide which NoSQL database to use, and I've already decided
against MongoDB due to its use of mmap. I'm wondering whether Cassandra would
suffer from a similar inefficiency with small documents. In MongoDB, if you
have a large set of small documents (each much smaller than the 4KB page size)
you will need far more RAM to fit your working set into memory, since a large
percentage of each 4KB chunk can easily be infrequently accessed data outside
of your working set. Cassandra doesn't use mmap, but it would still have to
intelligently discard the excess data in the same on-disk allocation unit that
doesn't belong to the small document it is reading into RAM. As an example,
let's say your cluster size is also 4KB, and a given query has to fetch 1,000
small 256-byte documents scattered across the disk (the total number of
documents is over 1 billion). I want to make sure the database only consumes
roughly 256,000 bytes of RAM for those 1,000 documents, not 4,096,000 bytes.
When it first fetches a cluster from disk it may consume 4KB of cache, but it
should ultimately hold only the relevant bytes in RAM. If Cassandra just
indiscriminately uses RAM in 4KB blocks, then that is unacceptable to me,
because if my working set at any given time is just 20% of a huge collection
of small documents, I don't want to have to use servers with 5X as much RAM.
That's a huge expense.
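To make the numbers concrete, here's a rough back-of-the-envelope sketch in
Python (the 4KB page size, 256-byte documents, and the worst case of every
document landing on its own page are just the assumptions from the example
above):

# Read amplification for small, scattered documents (worst case: each
# document sits on a different 4KB page/cluster).
PAGE_SIZE = 4096        # bytes per page / allocation unit
DOC_SIZE = 256          # bytes per document
DOCS_FETCHED = 1000     # documents touched by the query

useful_bytes = DOCS_FETCHED * DOC_SIZE         # 256,000 bytes
paged_in_bytes = DOCS_FETCHED * PAGE_SIZE      # 4,096,000 bytes
amplification = paged_in_bytes / useful_bytes  # 16x

print(f"useful:   {useful_bytes:,} bytes")
print(f"paged in: {paged_in_bytes:,} bytes")
print(f"amplification: {amplification:.0f}x")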

Thanks,
Ben

P.S. Here's a detailed post I made this morning to the MongoDB user group
about this topic.

People have often complained that because MongoDB memory-maps everything and
leaves memory management to the OS's virtual memory system, the swapping
algorithm isn't optimized for database usage. I disagree. For the most part, a
database's own paging algorithm can't do much better than the sophisticated
algorithms (such as LRU-based ones) that OSes have refined over many years.
Why reinvent the wheel? Yes, you could potentially ensure that certain data
(such as the indexes) never gets swapped out to disk, because even if it
hasn't been accessed recently, the cost of reading it back in when it is
actually needed is too high. But that's not the bigger issue.
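(For what it's worth, keeping data pinned in RAM is something the OS already
lets you do. Here's a minimal Python sketch using Linux's mlockall() via
ctypes; this is just an illustration of the OS-level mechanism, not anything
MongoDB actually exposes.)

# Ask the kernel to keep all of this process's mapped pages resident, so
# nothing (e.g. an mmap'd index) gets paged out. Linux-only; needs a
# sufficient RLIMIT_MEMLOCK or CAP_IPC_LOCK.
import ctypes
import ctypes.util

MCL_CURRENT = 1   # lock pages that are currently mapped
MCL_FUTURE = 2    # lock pages mapped from now on

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
    raise OSError(ctypes.get_errno(), "mlockall failed")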

It breaks down when documents are much smaller than the page size

This is where using virtual memory for everything really becomes an issue.
Suppose you've got a bunch of really tiny documents (e.g. ~256 bytes) that are
much smaller than the virtual memory page size (e.g. 4KB). Now let's say
you've determined your working set (e.g. the documents in your collection that
account for, say, 99% of accesses in a given hour) to be 20GB, but your entire
collection is actually 100GB (it's just that 20% of your documents are much
more likely to be accessed in a given time period; it's not uncommon for a
small minority of documents to be accessed a large majority of the time). If
your collection is randomly distributed on disk (as would happen if you simply
inserted new documents as they arrive), then in this example only about 20% of
the documents that fit onto a 4KB page will be part of the working set (i.e.
the data you need frequent access to at the moment). The rest will be much
less frequently accessed documents that should ideally be sitting on disk. So
there's a huge inefficiency here: 80% of the data in RAM is not something I
need frequent access to. In this example, I would need 5X the RAM to
accommodate my working set.
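Roughly, the arithmetic behind that 5X (assuming documents are uniformly
scattered, so each 4KB page is only ~20% "hot"):

# Working-set RAM amplification for the example above: 100GB of small
# documents, a 20GB working set spread uniformly across pages.
GB = 1024 ** 3
collection_size = 100 * GB
working_set = 20 * GB
hot_fraction_per_page = working_set / collection_size  # ~0.2

# Keeping the whole working set resident means caching the full pages it
# lives on, not just the hot bytes on them.
ram_needed = working_set / hot_fraction_per_page       # 100GB
print(f"RAM needed: {ram_needed / GB:.0f} GB "
      f"({ram_needed / working_set:.0f}x the working set)")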

Now, as a solution to this problem, you could separate your documents into two
(or even a few) collections, grouped by access frequency. The problem with
this is that your working set often changes as a function of time of day and
day of week. If your application is global, your working set at 12pm local
time in New York will look very different from 12pm local time in Tokyo. Even
more likely, the working set is constantly changing as new data is inserted
into the database. A document's popularity is often viral: a photo posted on a
social network may start out infrequently accessed, then after a few hundred
"likes" quickly become very frequently accessed and part of your working set.
You'd need to actively monitor your documents and manually move them from one
collection to the other, which is very inefficient.
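To make that burden concrete, the manual bookkeeping would look something like
this (a pymongo-style sketch; the "hot"/"cold" collection names, the schema,
and the like-count threshold are all made up for illustration):

# Manually promote a document from a "cold" to a "hot" collection once it
# crosses an arbitrary popularity threshold. Everything here is hypothetical.
from pymongo import MongoClient

HOT_THRESHOLD = 100  # likes; purely illustrative

client = MongoClient()
db = client.photos_db

def promote_if_popular(photo_id):
    doc = db.cold.find_one({"_id": photo_id})
    if doc and doc.get("likes", 0) >= HOT_THRESHOLD:
        db.hot.insert_one(doc)            # copy into the hot collection
        db.cold.delete_one({"_id": photo_id})
        # ...and now every reader has to know which collection to query,
        # and something has to demote documents when they cool off again.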

Quite frankly, this is not a burden that should be placed on the user anyway.
By punting memory management to the OS, MongoDB requires the user to
essentially do its job and group data in a way that patches over the
inefficiencies in its memory management. As far as I'm concerned, until
MongoDB steps up and takes control of memory management, it can't be taken
seriously for very large datasets made up of many small documents with
ever-changing working sets.
