To better understand the problem I tried some variations, but first my setup:

1. hmaster: runs the Hadoop namenode, jobtracker, a tasktracker and a datanode; it also runs Cassandra and is the first node in the seed list of the client configuration (CassandraStorage for Pig).
2. hslave02 and hslave03: each runs a Hadoop tasktracker and a datanode, plus Cassandra, and both are in the seed list of the client configuration.

So what I tried is the following:

1. Start all 3 Cassandra nodes, but let Hadoop run only on hmaster: no errors at all and the job completes correctly.
2. Same as 1, but with hslave02 running another tasktracker: "connection refused" errors start showing up, the map progress becomes strange (after 5 hours I was at 1487.50% completion, as seen in [1]), and the job never finished.
3. Same as 1, but with the Cassandra node on hmaster (the first in the client seed list) stopped: the job fails immediately, complaining that it can't create the splits.

From 3 my guess is that CassandraStorage does not even try to connect to the other nodes in the seed list, which is bad for reliability and possibly for load balancing, since pumping data from 3 nodes would probably be faster than from just a single one.

And before you ask: yes, I did check whether I can connect via telnet to all 3 ports on all 3 machines:

- 7000: internal cluster communication port
- 9160: Thrift port
- 10036 (was 8080): JMX port
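For completeness, here is the same reachability check as a tiny Java program, so it can be run from each Hadoop node instead of telnetting by hand. It is just a plain TCP connect probe with my hostnames and ports hard-coded; nothing Thrift-specific:

    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Probes the three Cassandra-related ports on every node.
    // A "FAILED" line would point to a per-host network/firewall
    // problem rather than anything inside Cassandra itself.
    public class PortProbe {
        public static void main(String[] args) throws Exception {
            String[] hosts = { "hmaster", "hslave02", "hslave03" };
            int[] ports = { 7000, 9160, 10036 };
            for (String host : hosts) {
                for (int port : ports) {
                    Socket socket = new Socket();
                    try {
                        socket.connect(new InetSocketAddress(host, port), 2000); // 2s timeout
                        System.out.println(host + ":" + port + " OK");
                    } catch (Exception e) {
                        System.out.println(host + ":" + port + " FAILED: " + e);
                    } finally {
                        socket.close();
                    }
                }
            }
        }
    }

If basic connectivity were the problem, something here would show FAILED; the telnet test already confirms it does not.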
While this is no longer a show stopper (I can test my Pig scripts using setup 1), it is still important to me to figure out what happens, for the production system.

Regards,
Chris

[1] http://snyke.net/tmp/screenshot_004.png

--
Christian Decker
Software Architect
http://blog.snyke.net

On Wed, Aug 18, 2010 at 2:17 PM, Christian Decker <decker.christ...@gmail.com> wrote:

> Hi all,
> I'm trying to get Pig scripts to work on data in Cassandra, and right now I
> want to simply run the example-script.pig on a different Keyspace/CF
> containing ~6'000'000 entries. I got it running, but then the job aborts
> after quite some time, and when I look at the logs I see hundreds of these:
>
>> java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:133)
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:224)
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:101)
>>         at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>>         at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:95)
>>         at org.apache.cassandra.hadoop.pig.CassandraStorage.getNext(Unknown Source)
>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
>>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>>         at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
>> Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
>>         at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:129)
>>         ... 13 more
>> Caused by: java.net.ConnectException: Connection refused
>>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
>>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
>>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
>>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:381)
>>         at java.net.Socket.connect(Socket.java:537)
>>         at java.net.Socket.connect(Socket.java:487)
>>         at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
>>         ... 14 more
>
> and
>
>> java.lang.RuntimeException: TimedOutException()
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:174)
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:224)
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:101)
>>         at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>>         at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:95)
>>         at org.apache.cassandra.hadoop.pig.CassandraStorage.getNext(Unknown Source)
>>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
>>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
>>         at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
>> Caused by: TimedOutException()
>>         at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11030)
>>         at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:623)
>>         at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:597)
>>         at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:151)
>>         ... 13 more
>
> I checked that the Cassandra cluster is running and all 3 of my nodes are up
> and working. As far as I can tell, the jobtracker retries when it gets those
> errors but aborts once a large portion have failed. Any idea why the
> cluster keeps dropping connections or timing out?
>
> Regards,
> Chris
>
> --
> Christian Decker
> Software Architect
> http://blog.snyke.net