Re: Mutation Rejected exception with server Error 1

mohit.kaushik Wed, 23 Dec 2015 05:21:51 -0800

Thanks for the beautiful explanation, Eric. So this means that if I get a MutationsRejectedException due to a tablet server failure, the batch writer will resend the mutations to some other server and I do not have to worry about them. Great...

But what about the case where we get a MutationsRejectedException and there is no server failure? Today I again faced MutationsRejectedExceptions with "server error 1", mainly for two reasons, while there is no related exception in the tablet server logs:

(1) Failed to connect to zookeeper (192.168.10.122) within 2x zookeeper timeout period 30000
(2) org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server orkash3:9997

org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: {}  # server errors 0 # exceptions 1
        at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
        at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
        at org.apache.accumulo.core.client.impl.MultiTableBatchWriterImpl$TableBatchWriter.addMutation(MultiTableBatchWriterImpl.java:64)
        at com.orkash.accumulo.IngestionWithoutServiceOnCondition.main(IngestionWithoutServiceOnCondition.java:235)
        at com.orkash.db.DBQuery.insertLookUpDB(DBQuery.java:570)
        at com.orkash.Crawling.CrawlerThread.run(CrawlerThread.java:145)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Failed to connect to zookeeper (192.168.10.122) within 2x zookeeper timeout period 30000
        at org.apache.accumulo.fate.zookeeper.ZooSession.connect(ZooSession.java:117)

org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: {}  # server errors 2 # exceptions 2
        at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
        at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
        at org.apache.accumulo.core.client.impl.MultiTableBatchWriterImpl$TableBatchWriter.addMutation(MultiTableBatchWriterImpl.java:64)
        at com.orkash.accumulo.IngestionWithoutServiceOnCondition.main(IngestionWithoutServiceOnCondition.java:235)
        at com.orkash.db.DBQuery.insertLookUpDB(DBQuery.java:570)
        at com.orkash.Crawling.CrawlerThread.run(CrawlerThread.java:145)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server orkash3:9997
        at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:937)
        at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.access$1600(TabletServerBatchWriter.java:616)

Both exceptions appear on the client side. I have three zookeeper nodes (version 3.4.6) deployed on the same nodes on which the tservers run. I got these exceptions more than 12000 times, which I can see on the kibana dashboard.
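
When counting these exceptions on a dashboard, it can help to separate the ones that report server errors from the ones that only report client-side exceptions. The sketch below pulls the three counters out of the summary line shown in the traces above. It is only an illustration: RejectionSummary and its parse method are hypothetical names, not part of Accumulo, and the pattern assumes the message layout seen in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an Accumulo class): reads the counters from a MutationsRejectedException message. */
class RejectionSummary {
    final int constraintViolations;
    final int serverErrors;
    final int exceptions;

    // Assumes the layout seen in the traces above; whitespace around '#' and ':' is treated as optional.
    private static final Pattern SUMMARY = Pattern.compile(
        "#\\s*constraint violations\\s*:\\s*(\\d+)\\s+security codes:\\s*\\{[^}]*\\}\\s*"
        + "#\\s*server errors\\s*(\\d+)\\s*#\\s*exceptions\\s*(\\d+)");

    RejectionSummary(int constraintViolations, int serverErrors, int exceptions) {
        this.constraintViolations = constraintViolations;
        this.serverErrors = serverErrors;
        this.exceptions = exceptions;
    }

    /** Parses the first summary found in the message, or throws if none is present. */
    static RejectionSummary parse(String message) {
        Matcher m = SUMMARY.matcher(message);
        if (!m.find()) {
            throw new IllegalArgumentException("no summary found in: " + message);
        }
        return new RejectionSummary(
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)));
    }
}
```

For the first trace above, parse would report 0 server errors and 1 exception; for the second, 2 server errors and 2 exceptions.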


Thanks
Mohit Kaushik


On 12/23/2015 06:22 PM, Eric Newton wrote:

The accumulo batch writer will re-send mutations if a tablet server fails, or rejects the mutations because the tablet has moved. There's nothing you have to do to recover from fail-overs and re-balancing.

I'm not a kernel expert, but I believe that a swappiness setting of "1" is equivalent to "0".

The error you are seeing is part of the failing tablet server scenario. This is a bit complicated, so I'm going to name your three tablet servers A, B and C.


Tablet server A is hosting a tablet, let's call it a-tablet.
Tablet server B is hosting a metadata tablet, let's call it m-tablet.
m-tablet records the information about a-tablet:

  * where it is hosted
  * what files it has, and their approximate sizes
  * book-keeping related to bulk ingest
  * etc. I think the O'Reilly Accumulo book has some great details

Now when A ingests some data, it eventually flushes the updates from memory to a file. Tablet server A then writes this new information to m-tablet, on tablet server B.


Now for the failure:

Tablet server A does a Java memory garbage collection, and starts pulling data from swap. That makes it go really slow, and it loses its zookeeper session.

But it's running so slowly that it takes a moment to realize it should die.

In the meantime, the thread that is flushing memory attempts to update m-tablet with the new file information.

Fortunately there's a constraint on m-tablet. The constraint is that mutations must contain a valid zookeeper session. This prevents tablet server A from making updates to m-tablet when it no longer has the right to host the tablet.

Your initial error is from tablet server A making an update to tablet server B's m-tablet. It's getting a constraint violation: tablet server A has lost its zookeeper session, and will fail momentarily.
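
The constraint described above can be pictured with a toy model. This is a sketch under stated assumptions, not Accumulo's metadata constraint code: SessionConstraint, its method names and the numeric session ids are all made up for illustration. The only idea it carries over is that an update is accepted only while the writer's zookeeper session is still valid.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model (not Accumulo code): accepts metadata updates only from writers with a live session. */
class SessionConstraint {
    // Session ids currently considered valid; in the real system this knowledge comes from zookeeper.
    private final Set<Long> liveSessions = new HashSet<>();

    void sessionOpened(long sessionId) {
        liveSessions.add(sessionId);
    }

    void sessionExpired(long sessionId) {
        liveSessions.remove(sessionId);
    }

    /** Returns true if the update is accepted, false if it would be a constraint violation. */
    boolean accept(long writerSessionId) {
        return liveSessions.contains(writerSessionId);
    }
}
```

In this model, tablet server A's update is accepted while its session is open and rejected as soon as the session has expired, even though A itself has not yet noticed that it should die.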


To make this extra confusing: A and B might be the same server.

-Eric

On Tue, Dec 22, 2015 at 11:31 PM, mohit.kaushik wrote:



    I have 3 tablet servers hosting around 1.4K tablets. If a tablet
    server loses its session with zookeeper and kills itself, the
    system takes some time to move all hosted tablets to other servers.

    In this case, if an ingest is in progress, what should happen to
    the mutations going to tablets hosted by that tablet server?
    Is this the reason for the first exception? Should they not be
    redirected to other servers?
    And I had set the system swappiness to 1. Should I keep it at 0 in
    this case? I will check further.

    Thanks for the reply

    -Mohit Kaushik


    On 12/22/2015 08:17 PM, Eric Newton wrote:

    A tablet server is given the rights to manage a tablet.

    To maintain consistency, it is critical that no other server
    uses the tablet.

    To maintain the right to access a tablet, it must maintain a
    zookeeper session. The zookeeper session periodically exchanges
    keep-alive messages. If either party fails to get a keep-alive,
    zookeeper will close the connection. The client can attempt to
    reconnect, but if it fails to do so, the session will time out.
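
As a rough illustration of the timeout behaviour just described, here is a toy model of a session that expires when keep-alives stop arriving. It is not ZooKeeper's implementation: SessionClock and its methods are hypothetical, time is passed in explicitly as milliseconds, and the 30000 ms value in the usage note is simply the timeout period quoted in the error message earlier in this thread.

```java
/** Toy model (not ZooKeeper's implementation) of a session that expires when keep-alives stop. */
class SessionClock {
    private final long timeoutMs;
    private long lastKeepAliveMs;

    SessionClock(long timeoutMs, long nowMs) {
        this.timeoutMs = timeoutMs;
        this.lastKeepAliveMs = nowMs;
    }

    /** True once more than timeoutMs has passed since the last accepted keep-alive. */
    boolean isExpired(long nowMs) {
        return nowMs - lastKeepAliveMs > timeoutMs;
    }

    /** Records a keep-alive; returns false (and records nothing) if the session has already expired. */
    boolean keepAlive(long nowMs) {
        if (isExpired(nowMs)) {
            return false;
        }
        lastKeepAliveMs = nowMs;
        return true;
    }
}
```

With a 30000 ms timeout and a last keep-alive at 10000 ms, a process that stalls until 45000 ms finds its session expired, and a late keep-alive no longer revives it.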

    If the tablet server loses its session with zookeeper, the rest
    of the system can take over its tablets.

    When a tablet server detects that it has lost its zookeeper
    session, it kills itself to avoid doing anything with the
    tablets it no longer has the right to host.

    What you are seeing here is the first step in that process, and
    it is probably due to the tablet server not sending a keep-alive
    message to zookeeper in time.

    There are many reasons for a tablet server to be delayed in
    sending a keep-alive message. By far the most common is that your
    system is over-subscribed for memory, and part of the tablet
    server's memory swapped out. Once the java garbage collection
    cycle swapped it back in, there was a considerable delay.

    However, there can be other things going on. This is just a best
    guess.  Monitor swap usage, as a first diagnostic step.
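
One simple way to watch swap usage on Linux is to read SwapTotal and SwapFree from /proc/meminfo. The sketch below only parses text in that format; SwapUsage and its method names are hypothetical, and it assumes the usual "Key:   value kB" layout of that file.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: computes swap in use from /proc/meminfo-style text. */
class SwapUsage {
    /** Parses "Key:   value kB" lines into a map from key to value in kB; malformed lines are skipped. */
    static Map<String, Long> parse(String meminfo) {
        Map<String, Long> values = new HashMap<>();
        for (String line : meminfo.split("\n")) {
            String[] keyValue = line.split(":", 2);
            if (keyValue.length != 2) {
                continue;
            }
            String[] parts = keyValue[1].trim().split("\\s+");
            if (parts[0].isEmpty()) {
                continue;
            }
            try {
                values.put(keyValue[0].trim(), Long.parseLong(parts[0]));
            } catch (NumberFormatException e) {
                // Not a numeric entry; ignore it.
            }
        }
        return values;
    }

    /** Swap in use, in kB: SwapTotal minus SwapFree (missing entries count as 0). */
    static long swapUsedKb(String meminfo) {
        Map<String, Long> values = parse(meminfo);
        return values.getOrDefault("SwapTotal", 0L) - values.getOrDefault("SwapFree", 0L);
    }
}
```

On a tablet server host, the input could be the contents of /proc/meminfo read with java.nio.file.Files; a value that keeps growing during ingest would support the swapping explanation above.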

    -Eric



    On Tue, Dec 22, 2015 at 8:30 AM, mohit.kaushik wrote:

        Dear All,

        The mutations rejected exception can be seen at client side
        with server error 1.
        org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0  security codes: {}  # server errors 1 # exceptions 1
                at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
                at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
                at org.apache.accumulo.core.client.impl.MultiTableBatchWriterImpl$TableBatchWriter.addMutation(MultiTableBatchWriterImpl.java:64)
                at com.orkash.accumulo.IngestionWithoutServiceOnCondition.main(IngestionWithoutServiceOnCondition.java:235)
                at com.orkash.db.DBQuery.insertLookUpDB(DBQuery.java:570)
                at com.orkash.Crawling.CrawlerThread.run(CrawlerThread.java:145)
                at java.lang.Thread.run(Thread.java:745)
        Caused by: org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server orkash1:9997

        I also found exceptions in Monitor related to Tracing.

        Tracing spans are being dropped because there are already 5000 spans queued for delivery.
        This does not affect performance, security or data integrity, but distributed tracing information is being lost.

        and, 6458 times:

        Got an IOException in internalRead!
        java.io.IOException: Connection reset by peer
                at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
                at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
                at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
                at sun.nio.ch.IOUtil.read(IOUtil.java:197)
                at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
                at org.apache.thrift.transport.TNonblockingSocket.read(TNonblockingSocket.java:141)
                at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.internalRead(AbstractNonblockingServer.java:537)
                at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.read(AbstractNonblockingServer.java:338)
                at org.apache.thrift.server.AbstractNonblockingServer$AbstractSelectThread.handleRead(AbstractNonblockingServer.java:203)
                at org.apache.accumulo.server.rpc.CustomNonBlockingServer$SelectAcceptThread.select(CustomNonBlockingServer.java:228)
                at org.apache.accumulo.server.rpc.CustomNonBlockingServer$SelectAcceptThread.run



        I am facing the following exceptions in tserver logs and one
        tserver goes dead.

        2015-12-22 09:37:27,173 [zookeeper.ZooCache] WARN : Saw (possibly) transient exception communicating with ZooKeeper, will retry
        org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/f8708e0d-9238-41f5-b948-8f435fd01207/tables/16/conf/table.split.threshold
                at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
                at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
                at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
                at org.apache.accumulo.fate.zookeeper.ZooCache$2.run(ZooCache.java:264)
                at org.apache.accumulo.fate.zookeeper.ZooCache.retry(ZooCache.java:162)
                at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:289)
                at org.apache.accumulo.fate.zookeeper.ZooCache.get(ZooCache.java:238)
                at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:117)
                at org.apache.accumulo.server.conf.ZooCachePropertyAccessor.get(ZooCachePropertyAccessor.java:103)
                at org.apache.accumulo.server.conf.TableConfiguration.get(TableConfiguration.java:99)
                at org.apache.accumulo.core.conf.AccumuloConfiguration.getMemoryInBytes(AccumuloConfiguration.java:197)
                at org.apache.accumulo.tserver.tablet.Tablet.findSplitRow(Tablet.java:1604)
                at org.apache.accumulo.tserver.tablet.Tablet.needsSplit(Tablet.java:1772)
                at org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:1853)
                at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
                at java.lang.Thread.run(Thread.java:745)

        These are creating problems with continuously ingesting data,
        and I also experienced some delay in queries and table create
        commands.
        Could you please comment on what the cause of these exceptions
        might be?

        Thanks
        Mohit Kaushik
