I recently created a test database with about 400 million small records. The disk space consumed was about 30 GB, or 75 bytes per record.
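A quick back-of-the-envelope check of that per-record figure (a sketch only; the exact overhead depends on schema, compression, and sstable bookkeeping, and it treats "30 GB" as decimal gigabytes):

    # Rough average on-disk footprint per record for the test described above.
    records = 400_000_000
    disk_bytes = 30 * 1000**3     # "30 GB" read as decimal gigabytes (assumption)

    print(disk_bytes / records)   # -> 75.0 bytes per record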
From: onlinespending <onlinespend...@gmail.com>
Reply-To: <user@cassandra.apache.org>
Date: Monday, November 25, 2013 at 2:18 PM
To: <user@cassandra.apache.org>
Subject: Inefficiency with large set of small documents?

I'm trying to decide what noSQL database to use, and I've certainly decided against mongodb due to its use of mmap. I'm wondering if Cassandra would also suffer from a similar inefficiency with small documents.

In mongodb, if you have a large set of small documents (each much smaller than the 4KB page size) you will require far more RAM to fit your working set into memory, since a large percentage of each 4KB chunk can easily be infrequently accessed data outside of your working set. Cassandra doesn't use mmap, but when reading a small document into RAM it would still have to intelligently discard the excess data in the same allocation unit on the hard disk that does not pertain to that document.

As an example, let's say your cluster size is 4KB as well, and you have 1000 small 256-byte documents scattered on the disk that you want to fetch in a given query (the total number of documents is over 1 billion). I want to make sure it only consumes roughly 256,000 bytes for those 1000 documents and not 4,096,000 bytes. When it first fetches a cluster from disk it may consume 4KB of cache, but ideally it should ultimately consume only the relevant bytes in RAM. (A worked version of this arithmetic appears at the end of this message.)

If Cassandra just indiscriminately uses RAM in 4KB blocks, then that is unacceptable to me, because if my working set at any given time is just 20% of my huge collection of small documents, I don't want to have to use servers with 5X as much RAM. That's a huge expense.

Thanks,
Ben

P.S. Here's a detailed post I made this morning in the mongodb user group about this topic.

People have often complained that because mongodb memory maps everything and leaves memory management to the OS's virtual memory system, the swapping algorithm isn't optimized for database usage. I disagree with this. For the most part, the swapping or paging algorithm itself can't be much better than the sophisticated algorithms (such as LRU-based ones) that OSes have refined over many years. Why reinvent the wheel? Yes, you could ensure that certain data (such as the indexes) never gets swapped out to disk, because even if it hasn't been accessed recently, reading it back into memory will be too costly when it is in fact needed. But that's not the bigger issue.

It breaks down with small documents << page size

This is where using virtual memory for everything really becomes an issue. Suppose you've got a bunch of really tiny documents (e.g. ~256 bytes) that are much smaller than the virtual memory page size (e.g. 4KB). Now let's say you've determined your working set (e.g. those documents in your collection that constitute, say, 99% of those accessed in a given hour) to be 20GB, but your entire collection size is actually 100GB (it's just that 20% of your documents are much more likely to be accessed in a given time period; it's not uncommon for a small minority of documents to be accessed a large majority of the time). If your collection is randomly distributed (as would happen if you simply inserted new documents into your collection), then in this example only about 20% of the documents that fit onto a 4KB page will be part of the working set (i.e. the data that you need frequent access to at the moment).
The rest of the data will be made up of much less frequently accessed documents that should ideally be sitting on disk. So there's a huge inefficiency here: 80% of the data that is in RAM is not even something I need to access frequently. In this example, I would need 5X the amount of RAM to accommodate my working set.

Now, as a solution to this problem, you could separate your documents into two (or even a few) collections, with the grouping done by access frequency. The problem with this is that your working set can often change as a function of time of day and day of week. If your application is global, your working set will be far different at 12pm local in NY vs 12pm local in Tokyo. But even more likely is that the working set is constantly changing as new data is inserted into the database. Popularity of a document is often viral: a photo posted on a social network may start off infrequently accessed but, after hundreds of "likes", quickly become very frequently accessed and part of your working set. You'd need to actively monitor your documents and manually move a document from one collection to the other, which is very inefficient. Quite frankly, this is not a burden that should be placed on the user anyway.

By punting the problem of memory management to the OS, mongodb requires the user to essentially do its job and group data in a way that patches the inefficiencies in its memory management. As far as I'm concerned, not until mongodb steps up and takes control of memory management can it be taken seriously for very large datasets that often require many small documents with ever-changing working sets.
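For concreteness, here is a small sketch of the arithmetic behind the two examples above (the 1000-document query and the 20GB working set inside a 100GB collection). It assumes page-granular caching of uniformly scattered documents; the names and constants are illustrative and say nothing about how Cassandra or mongodb actually lay out or cache data.

    # Illustrative model: tiny documents scattered uniformly across fixed-size
    # pages, with only a fraction of them "hot". If caching is page-granular,
    # every page holding at least one hot document stays resident.
    PAGE_SIZE = 4 * 1024                    # bytes per page / allocation unit
    DOC_SIZE = 256                          # bytes per document
    DOCS_PER_PAGE = PAGE_SIZE // DOC_SIZE   # 16

    # The 1000-document query: useful bytes vs. page-granular bytes.
    docs_fetched = 1000
    print(docs_fetched * DOC_SIZE)          # 256,000 bytes actually needed
    print(docs_fetched * PAGE_SIZE)         # 4,096,000 bytes if each doc pins a page

    # The 20GB working set inside a 100GB collection (20% of documents hot).
    hot_fraction = 20 / 100
    # Chance that a given page holds at least one hot document:
    page_is_hot = 1 - (1 - hot_fraction) ** DOCS_PER_PAGE
    print(page_is_hot)                      # ~0.97: nearly every page stays resident
    # RAM needed is therefore close to the whole collection, not the working set.
    print(page_is_hot * 100 / 20)           # ~4.9x the 20GB working set

With 16 documents per 4KB page and a randomly scattered 20% working set, almost every page contains at least one hot document, which is why the RAM required approaches the full 100GB rather than the 20GB working set.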