The admin stop command issues a graceful shutdown to Accumulo for that
tserver. There is a force option you can try (-f / --force) that will
remove the lock. Either is more graceful than a Linux kill -9, which you
may have to use if the admin command doesn't stop the process entirely.

On Wed, Aug 18, 2021 at 7:31 AM Ligade, Shailesh [USA] <[email protected]> wrote:

> Thank you for the good explanation! I really appreciate it.
>
> Yes, I need to remove the hardware, meaning I need to stop everything on
> the server (tserver and datanode).
>
> One quick question:
>
> What is the difference between accumulo admin stop <tserver>:9997 and
> stopping the tserver Linux service?
>
> When I issue admin stop, I can see from the monitor that the hosted
> tablet count for the tserver in question goes down to 0; however, it
> doesn't stop the tserver process or service.
>
> In your steps, you stop the datanode service first (adding it to the
> exclude file, then running refreshNodes, and then stopping the service).
> I was thinking of stopping the Accumulo tserver and letting it hand off
> its hosted tablets first, before touching the datanode. Will there be any
> difference? Just trying to understand the relationship between Accumulo
> and Hadoop.
>
> Thank you!
>
> -S
> ------------------------------
> *From:* [email protected] <[email protected]>
> *Sent:* Tuesday, August 17, 2021 2:39 PM
> *To:* [email protected] <[email protected]>
> *Subject:* [External] RE: how to decommission tablet server
>
> Maybe you could clarify. Decommissioning tablet servers and HDFS
> replication are separate and distinct issues. Accumulo is generally
> unaware of HDFS replication, and tablet assignment does not change the
> HDFS replication. You can set the replication factor for a table, but
> that is used on writes to HDFS; Accumulo assumes that on any successful
> write, HDFS is managing the details.
>
> When a tablet is assigned / migrated, the underlying files in HDFS are
> not changed. The file references are reassigned in a metadata operation,
> but the files themselves are not modified. They will keep whatever
> replication factor was assigned and whatever the namenode decides.
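The stop escalation described at the top of this reply might be scripted roughly as below. This is a dry-run sketch: the host:port is a placeholder, and the script only builds the command lines for review rather than talking to a real cluster.

```shell
#!/bin/sh
# Dry-run sketch of the tserver stop escalation. Placeholder host:port --
# nothing here is executed against a cluster; we only assemble the commands.
TSERVER="tserver1.example.com:9997"

graceful="accumulo admin stop $TSERVER"      # graceful: unloads hosted tablets
forced="accumulo admin stop -f $TSERVER"     # force option: also removes the lock
last_resort="kill -9 <tserver-jvm-pid>"      # only if the process still lingers

printf '%s\n' "$graceful" "$forced" "$last_resort"
```

Per the thread, the graceful form drains hosted tablets to 0 but may leave the JVM running, which is why the force option and, finally, kill -9 are listed as fallbacks.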
>
> If you are removing servers that run both datanode and tserver processes:
>
> If you stop / kill the tserver, the tablets assigned to that server will
> be reassigned rather quickly; it is only a metadata update. The exact
> timing will depend on your ZooKeeper timeout setting, but the "dead"
> tserver should be detected and its tablets reassigned in short order. The
> reassignment may cause some churn of assignments if the cluster becomes
> unbalanced. The manager (master) will select tablets from tservers that
> are over-subscribed and assign them to tservers that have fewer tablets;
> you can monitor the manager (master) debug log to see the migration
> progress. If you want to be gentle, stop a tserver, wait for the number
> of unassigned tablets to hit zero and migration to settle, and then
> repeat.
>
> If you want to stop the datanodes, you can do that independently of
> Accumulo; just follow the Hadoop datanode decommission process. Hadoop
> will move the data blocks assigned to the datanode so that it is "safe"
> to then stop the datanode process. This is independent of Accumulo, and
> Accumulo will not be aware that the blocks are moving. If you are running
> compactions, Accumulo may try to write blocks locally, but if the
> datanode is rejecting new block assignments (which I rather assume it
> would when in decommission mode) then Accumulo still would not care. If
> somehow new blocks were written, it may just delay the Hadoop datanode
> decommissioning.
>
> If you are running ingest while killing tservers, things should mostly
> work. There may be ingest failures, but normally things would get retried
> and the subsequent attempt should succeed. The issue is that if, by bad
> luck, the work keeps getting assigned to tservers that are then killed,
> you could exceed the number of retries and the ingest would fail
> outright. If you can pause ingest, that limits the chance.
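The Hadoop datanode decommission flow mentioned above can be sketched as a dry run. The exclude-file path and hostname are placeholders (the real file is whatever dfs.hosts.exclude points at in hdfs-site.xml); the script only prints the commands so nothing is applied to a live namenode.

```shell
#!/bin/sh
# Dry-run sketch of the Hadoop datanode decommission steps. Placeholder
# path and hostname; the commands are assembled and printed, not executed.
EXCLUDE_FILE="/etc/hadoop/conf/dfs.exclude"
NODE="dn1.example.com"

step1="echo $NODE >> $EXCLUDE_FILE"     # 1. list the node in the exclude file
step2="hdfs dfsadmin -refreshNodes"     # 2. have the namenode re-read the file
step3="hdfs dfsadmin -report"           # 3. poll until the node shows Decommissioned

printf '%s\n' "$step1" "$step2" "$step3"
```

Once the report shows the node as decommissioned, its blocks have been re-replicated elsewhere and the datanode process can be stopped, exactly as the thread describes: Accumulo never sees the block movement.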
> If you can monitor your ingest and know when an ingest failed, you could
> just reschedule the ingest (for bulk import). If you are doing continuous
> ingest, it may be harder to determine whether a specific ingest failed,
> so you may need to select an appropriate range for replay. Overall it may
> mostly work; it will depend on your processes and your tolerance for any
> particular data loss on an ingest.
>
> The modest approach (if you can accept transient errors):
>
> 1 Start the datanode decommission process.
> 2 Pause ingest and cancel any running user compactions.
> 3 Stop a tserver and wait for unassigned tablets to go back to 0. Wait
> for the tablet migration (if any) to quiet down.
> 4 Repeat 3 until all tserver processes have been stopped on the nodes
> you are removing.
> 5 Restart ingest; rerun any user compactions if you stopped any.
> 6 Wait for the HDFS decommission process to finish moving / replicating
> blocks.
> 7 Stop the datanode process.
> 8 Do what you want with the node.
>
> You do not need to schedule downtime if you can accept transient errors.
> Say a client scan is running and that tserver is stopped; the client may
> receive an error for the scan. If the scan is resubmitted and the tablet
> has been reassigned, it should work. It may pause for the reassignment
> and / or time out if the assignment takes some time. You are basically
> playing a numbers game here: the number of tablets, the number of
> unassigned tablets, the odds that a scan would be using a particular
> tablet for the duration that it is unavailable. It's not guaranteed that
> it will fail; it's just that there is a greater-than-zero chance that it
> could. If that is unacceptable, then:
>
> 1 Stop ingest; wait for all to finish, or mark which ones will need to
> be rescheduled.
> 2 Stop Accumulo.
> 3 Remove the tservers from the servers list.
> 4 Start Accumulo without starting the decommissioned tserver nodes.
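Steps 3-4 of the modest approach above, stopping tservers one at a time, could be scripted roughly as follows. The hostnames are placeholders and the commands are only echoed, since the wait condition (unassigned tablets back to 0, migrations quiet) has to be checked against your own monitor or manager debug log.

```shell
#!/bin/sh
# Dry-run sketch of stopping tservers one at a time (steps 3-4 above).
# Placeholder hostnames; echo keeps this a preview rather than a live run.
NODES="node1.example.com node2.example.com"

for node in $NODES; do
    echo "accumulo admin stop ${node}:9997"
    # In a real run, poll the monitor (or manager/master debug log) here and
    # proceed only once unassigned tablets are 0 and migration has settled.
done
```

Going one node at a time keeps the number of simultaneously unassigned tablets small, which is the "numbers game" the reply describes.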
>
> Do what you want with the datanode decommissioning.
>
> The latter approach removes possible transient issues. It is up to you to
> weigh your tolerance for possible transient issues while tservers are
> being stopped against a complete outage for the duration that Accumulo is
> down. If it is a large cluster and just a few tservers, the odds of a
> specific tablet being offline for a short duration may be very low. If it
> is a small cluster, or the percentage of tservers you are stopping is
> large, then the odds increase, but the issues will still be transient.
> You need to decide which is acceptable to you and your circumstances.
>
> *From:* Shailesh Ligade <[email protected]>
> *Sent:* Tuesday, August 17, 2021 11:26 AM
> *To:* [email protected]
> *Subject:* RE: how to decommission tablet server
>
> It would be helpful to know: when you are decommissioning tablets (one at
> a time, for the underlying HDFS to replicate), do we need Accumulo
> downtime? Can Accumulo be ingesting while we are decommissioning tablets?
>
> Thanks
>
> -S
>
> *From:* Shailesh Ligade <[email protected]>
> *Sent:* Tuesday, August 17, 2021 8:52 AM
> *To:* [email protected]
> *Subject:* [EXTERNAL EMAIL] - how to decommission tablet server
>
> Hello,
>
> I am using Accumulo 1.10 and want to remove a few tablet servers.
>
> I saw in the documentation that I need to run
>
> accumulo admin stop <tserver>:9997
>
> That command comes back quickly. I am not sure how long, if at all, I
> have to wait before I stop the tserver service. When is the time to stop
> the datanode service (running on the same tablet server)? And when should
> I update the slaves files (for Accumulo and HDFS)?
>
> Any guidelines on this?
>
> Thanks
>
> -S
