To amplify Christopher's comments: 33k+ tablets is a lot of work for a single 
tserver to manage, and may be causing some of your stability issues. Besides 
consolidating into fewer files / tablets, you should probably look into 
running multiple tservers per node. You will need to adjust the memory 
allocations so that everything fits.

33k+ is a really high number. I would suggest that you take an 
all-of-the-above approach to reduce it: consolidate tablets to the extent 
possible, use larger split sizes, and run multiple tservers on a node.
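
As a rough illustration of how these measures combine, here is a minimal 
Python sketch. Only the 33k tablets per tserver figure comes from this 
thread; the node count, tservers per node, and merge factor are assumed 
inputs, and the function name is hypothetical.

```python
import math

def tablets_per_tserver(total_tablets, nodes, tservers_per_node, merge_factor=1.0):
    """Estimate the average number of tablets hosted per tserver.

    merge_factor is the assumed ratio of tablets before merging to
    tablets after merging (2.0 means merging halves the tablet count).
    """
    if nodes < 1 or tservers_per_node < 1 or merge_factor < 1.0:
        raise ValueError("nodes, tservers_per_node >= 1 and merge_factor >= 1.0 required")
    remaining = math.ceil(total_tablets / merge_factor)
    return math.ceil(remaining / (nodes * tservers_per_node))

# Assumed example: 8 nodes, each currently hosting about 33,000 tablets.
total = 8 * 33_000
print(tablets_per_tserver(total, nodes=8, tservers_per_node=1))   # 33000 (today)
print(tablets_per_tserver(total, nodes=8, tservers_per_node=4))   # 8250 (more tservers only)
print(tablets_per_tserver(total, nodes=8, tservers_per_node=4, merge_factor=10))  # 825
```

With these assumed inputs, neither measure alone gets close to a 
comfortable count; it takes both together.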

Ed Coleman

-----Original Message-----
From: Christopher <ctubb...@apache.org> 
Sent: Friday, March 4, 2022 9:14 AM
To: accumulo-user <user@accumulo.apache.org>
Subject: Re: [External] Re: accumulo 1.10.0 unassigned tablets issue

On Fri, Mar 4, 2022 at 8:36 AM Ligade, Shailesh [USA] <ligade_shail...@bah.com> 
wrote:
>
> Thanks Chris[topher],
>
> Appreciate your support!
>
> Not sure why instance.volumes.replacements was set, especially since we 
> have an HA namenode and that’s the only HDFS targeted. The replacement 
> was set to the same URL, though, e.g. nameservice/accumulo, 
> nameservice:8020/accumulo

That explains the relocation messages.

>
> Regardless, when a tserver went down, even though we had set 
> table.suspend.duration=15m, I was seeing volume replacement messages in the 
> master log for every hosted tablet, and that takes a long time (hours for 
> 33k tablets/tserver). So how best to remove these volumes? There is no 
> delete-volumes; I see only add-volumes under accumulo init. Is there anything 
> I need to do after I remove the entire instance.volumes.replacements section 
> from accumulo-site.xml?

Just restart any server that had that replacements config, so it doesn't try to 
unnecessarily update metadata that is already correct.
Updating volume references using the replacements config is just a metadata 
update, though, not a lot of I/O. I'm not sure it would explain things taking a 
long time. It's possible that it's contributing to the slowness, I suppose: 
perhaps the tserver hosting the metadata tablet for the tablet whose metadata 
is being updated is also managing 33k other tablets.

In the past, I think we've recommended around 100 up to 1K tablets per server. 
I'm not sure if that's still a good recommendation or not. In any case, you 
can't reduce the number of tablets you have without doing merges, or deleting 
entire ranges, or compacting and bulk importing into a new table with more 
reasonable split points. And you probably shouldn't try that until you have 
your current situation under control. But, that's sorta why I was previously 
suggesting to examine your whole config. Maybe think about your whole 
architecture, to figure out where you want to go, and compare with where you 
are now, so you can figure out how to get to your target setup from your 
current setup.
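
To put the gap in numbers, here is a small Python sketch of how many 
existing tablets would, on average, have to be merged into one to get from 
33k per server down to the 100 to 1K range mentioned above. It is only 
arithmetic; the function name is hypothetical, and it assumes the number of 
servers stays the same.

```python
import math

def merge_factor_needed(current_per_server, target_per_server):
    """Average number of existing tablets that must collapse into one."""
    if current_per_server <= 0 or target_per_server <= 0:
        raise ValueError("tablet counts must be positive")
    return max(1, math.ceil(current_per_server / target_per_server))

current = 33_000  # tablets per tserver, as reported in this thread
for target in (1_000, 100):
    factor = merge_factor_needed(current, target)
    print(f"target {target}/server: merge about {factor} tablets into 1")
# target 1000/server: merge about 33 tablets into 1
# target 100/server: merge about 330 tablets into 1
```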

>
> I will definitely have to look at each and every property to ensure it 
> makes sense.
>
> Thanks
>
> -S
>
> -----Original Message-----
> From: Christopher <ctubb...@apache.org>
> Sent: Wednesday, March 2, 2022 3:09 PM
> To: accumulo-user <user@accumulo.apache.org>
> Subject: Re: [External] Re: accumulo 1.10.0 unassigned tablets issue
>
> On Wed, Mar 2, 2022 at 1:51 PM Ligade, Shailesh [USA] 
> <ligade_shail...@bah.com> wrote:
> >
> > Thanks Chris[topher],
> >
> > I do have instance.volumes.replacements overridden.
> >
> > Does that mean it will not work with the table.suspend.duration property?
>
> No. It's just that's where the RecoveryManager message is coming from.
>
> >
> > Hmm, thinking about it, I am not sure why we set that, as we have only one 
> > HDFS and fewer than 10 beefy nodes...
> >
> > Maybe I can remove this property after I set table.suspend.duration and 
> > stop/reboot the tserver. After I am done, I can restore the property. Please 
> > advise.
>
> I have no idea why you would set that if you're not replacing one volume with 
> another. I think you would probably benefit from reviewing all of your 
> configuration. Please check the documentation for an explanation of each 
> property. If you have a specific question regarding them, you can ask here, 
> but I would start by reviewing your configs against the docs.
>
> >
> > Thanks
> >
> > -S
> >
> >
> > ________________________________
> > From: Christopher <ctubb...@apache.org>
> > Sent: Wednesday, March 2, 2022 1:32 PM
> > To: accumulo-user <user@accumulo.apache.org>
> > Subject: [External] Re: accumulo 1.10.0 unassigned tablets issue
> >
> > The replacements message should only appear if you have 
> > instance.volumes.replacements set in your configuration.
> >
> > On Wed, Mar 2, 2022 at 11:02 AM Ligade, Shailesh [USA] 
> > <ligade_shail...@bah.com> wrote:
> > >
> > > Hello,
> > >
> > > I need to reboot a tserver with 34k hosted tablets.
> > >
> > > I set table.suspend.duration to 15 min, stopped the tserver, and rebooted 
> > > the machine.
> > >
> > > As soon as the tablet server came online, its hosted tablet count went 
> > > from 0 to 34k; however, on the master I see 34k unassigned tablets. 
> > > Although the count is going down, it is taking hours.
> > > Not sure why the master is reporting unassigned tablets when the tablet 
> > > server has the correct hosted tablet count?
> > >
> > > Also, in the master log I see
> > >
> > > recovery.RecoveryManager INFO: Volume replaced hdfs://xxxx -> hdfs://xxxx 
> > >   The issue is that both the from and to HDFS URLs are identical, so why 
> > > is the master trying to do that?
> > >
> > > Is the cluster safe to use? Can I reboot another tablet server before 
> > > this unassigned tablet count goes to 0? I can reboot the entire cluster if 
> > > I have to; will that help?
> > >
> > > Thanks in advance.
> > >
> > > -S
