SocketException means this is coming from the network, not the sstables. Knowing the full error message would be nice, but just about any problem on that end should be fixed by adding connection pooling to your client.
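Something along these lines is what I mean by pooling: open a fixed
number of Thrift connections up front and hand them out per request,
instead of opening a socket per call. This is just an illustrative
sketch against the raw thrift client; the class name and pool size are
made up, and depending on your server config you may need to wrap the
socket in TFramedTransport.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;

    // Minimal fixed-size client pool: N long-lived connections, borrowed
    // and returned per request.  Illustrative only, not from the Cassandra tree.
    public class PooledClients
    {
        private final BlockingQueue<Cassandra.Client> pool;

        public PooledClients(String host, int port, int size) throws Exception
        {
            pool = new ArrayBlockingQueue<Cassandra.Client>(size);
            for (int i = 0; i < size; i++)
            {
                TSocket socket = new TSocket(host, port);
                socket.open();  // one long-lived socket per slot
                pool.add(new Cassandra.Client(new TBinaryProtocol(socket)));
            }
        }

        // borrow a connection, blocking until one is free
        public Cassandra.Client acquire() throws InterruptedException
        {
            return pool.take();
        }

        // return it when the request finishes (call this from a finally block)
        public void release(Cassandra.Client client)
        {
            pool.offer(client);
        }
    }

With something like that in front of the thrift calls, a retry after a
timeout reuses an already-open connection instead of the retry loop
opening (and leaking) more sockets.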
(moving to user@)

On Wed, Jul 14, 2010 at 5:09 AM, Thomas Downing
<tdown...@proteus-technologies.com> wrote:
> On 7/13/2010 9:20 AM, Jonathan Ellis wrote:
>>
>> On Tue, Jul 13, 2010 at 4:19 AM, Thomas Downing
>> <tdown...@proteus-technologies.com> wrote:
>>>
>>> On a related note: I am running some feasibility tests looking for
>>> high ingest rate capabilities. While testing Cassandra the problem
>>> I've encountered is that it runs out of file handles during compaction.
>>>
>>
>> This usually just means "increase the allowed fh via ulimit/etc."
>>
>> Increasing the memtable thresholds so that you create fewer sstables,
>> but larger ones, is also a good idea. The defaults are small so
>> Cassandra can work on a 1GB heap, which is much smaller than most
>> production ones. Reasonable rule of thumb: if you have a heap of N
>> GB, increase both the throughput and count thresholds by N times.
>>
>
> Thanks for the suggestion. I gave it a whirl, but no go. The file
> handles in use stayed at around 500 for the first 30M or so mutates,
> then within 4 seconds they jumped to about 800, stayed there for about
> 30 seconds, then within 5 seconds went over 2022, at which point the
> server entered the cycle of "SocketException: Too many open files".
> The interesting thing is that the file limit for this process is
> 32768. Note the numbers below as well.
>
> If there is anything specific you would like me to try, let me know.
>
> Seems like there's some sort of non-linear behavior here. This
> behavior is the same as before I multiplied the Cassandra params by 4
> (number of GB), which leads me to think that increasing limits,
> whether files or Cassandra parameters, is likely to be a tail-chasing
> exercise.
>
> This causes time-out exceptions at the client. On this exception, my
> client closes the connection, waits a bit, then retries. After a few
> hours of this the server still had not recovered.
>
> I killed the clients, and watched the server after that. The open
> file handles dropped by 8, and have stayed there. The server is, of
> course, not throwing SocketException any more. On the other hand, the
> server is not doing anything at all.
>
> When there is no client activity, and the server is idle, there are
> 155 threads running in the JVM. They are all in one of three states:
> almost all blocked at futex(), a few blocked at accept(), and a few
> cycling over timeout on futex(), gettimeofday(), futex() ... None are
> blocked at IO. I can't attach a debugger (I get IO exceptions trying
> either socket or local connections, no surprise), so I don't know of a
> way to get at the Java code where the threads are blocking.
>
> More than one fd can be open on a given file, and many of the open
> fd's are on files that have been deleted. The stale fd's are all on
> Data.db files in the data directory, which I have separate from the
> commit log directory.
>
> I haven't had a chance to look at the code handling files, and I am
> not any sort of Java expert, but could this be due to Java's lazy
> resource clean up? I wonder if, when considering writing your own file
> handling classes for O_DIRECT or posix_fadvise or whatever, an
> explicit close(2) might help.
>
> A restart of the client causes immediate SocketExceptions at the
> server and timeouts at the client.
>
> I noted on the restart that the open fd's jumped by 32, despite only
> making 4 connections. At this point, there were 2028 open files, more
> than there were when the exceptions began at 2002 open files.
> So it seems like the exception is not caused by the OS returning
> EMFILE, unless it was returning EMFILE for some strange reason and the
> bump in open files is due to an increase in duplicate open files.
> (BTW, it's not ENFILE!)
>
> I also noted that although the TimeoutExceptions did not occur
> immediately on the client, the SocketExceptions began immediately on
> the server. This does not seem to match up. I am using the
> org.apache.cassandra.thrift API directly, not any higher level
> wrapper.
>
> Finally, this jump to 2028 on the restart caused a new symptom. I only
> had the client running a few seconds, but after 15 minutes the server
> is still throwing exceptions, even though the open file handles
> immediately dropped from 2028 down to 1967.
>
> Thanks for your attention, and all your work,
>
> Thomas Downing

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
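P.S. If you want to confirm which kind of fd is actually piling up
(sockets vs. sstable handles), and what limit the JVM is really running
under, something like the following with standard tools should show it;
<pid> here stands for the Cassandra JVM's pid:

    # count open fds by TYPE (REG for files, IPv4/sock for sockets)
    lsof -p <pid> | awk '{print $5}' | sort | uniq -c

    # fds still held on sstables that have already been deleted
    lsof -p <pid> | grep -c '(deleted)'

    # the limit actually applied to the running process, not just the shell's ulimit
    grep 'open files' /proc/<pid>/limits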