> On Jul 23, 2018, at 4:35 PM, Gav <ipv6g...@gmail.com> wrote:
> 
> Thanks Allen,
> 
> Some of our nodes are only 364GB in total size, so you can see that this is
> an issue.

        Ugh.

> For the H0-H12 nodes we are pretty fine currently with 2.4/2.6TB disks -
> therefore  the  urgency is on the Hadoop nodes H13 - H18 and the non Hadoop 
> nodes.
> 
> I propose therefore that H0-H12 be trimmed on a monthly basis for mtime +31 in
> the workspace, and that H13-H18 plus the remaining nodes with 500GB disks or
> less be done weekly.
> 
> Sounds reasonable ?

        Disclosure: I’m not really doing much with the Hadoop project anymore 
so someone from that community would need to step forward. 

        But If I Were King:

        For the small nodes in the Hadoop queue, I’d request they either get
pulled out or put into ‘Hadoop-small’ or some other similar name.  Doing a
quick pass over the directory structure via Jenkins, everything there is
‘reasonable’ with only one or two outliers; i.e., 400G drives are simply
under-spec’ed for the full workload that the ‘Hadoop’ nodes are expected to do
these days.  7 days isn’t going to do it.  Putting JUST the nightly jobs on them
(hadoop qbt, hbase nightly, maybe a handful of other jobs) would eat plenty of
disk space.

        Purging the workspace dir after 7 days is probably reasonable for the
other nodes, though.  But it looks to me like there are jobs running on the
non-Hadoop nodes that probably should be in the Hadoop queue (Ambari, HBase,
Ranger, Zookeeper, probably others).  Vice versa is probably also true.  It
might also be worthwhile to bug some of the vendors involved to see if they can
pony up some machines/cash for build server upgrades like Y!/Oath did/does.
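
        Roughly what I mean by the simple case, sketched in Python rather than
whatever the actual purge job would use.  The workspace root and the 7-day
cutoff here are my assumptions, not anything already decided, and the mtime
caveat further down still applies:

    # naive purge sketch (hypothetical) -- dry run only
    import os
    import shutil
    import time

    WORKSPACE_ROOT = "/home/jenkins/workspace"   # assumption
    CUTOFF = time.time() - 7 * 24 * 3600         # "7 days then it goes away"

    for name in os.listdir(WORKSPACE_ROOT):
        path = os.path.join(WORKSPACE_ROOT, name)
        if os.path.isdir(path) and os.path.getmtime(path) < CUTOFF:
            print("would remove", path)
            # shutil.rmtree(path, ignore_errors=True)  # once we trust it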

        That said, I can potentially see some changes that the Apache Yetus
project could make to lessen the disk space load for those projects that use
it.  I’ll need to experiment a bit first to be sure.  I’m looking at tens of
gigabytes freed up if my hypotheses are correct.  That might be enough to avoid
moving nodes around in the Hadoop queue, but I can’t see that lasting long.

        Jenkins allegedly has the ability to show compressed log files.  It
might be worthwhile to investigate doing something in this space at a global
level: just gzip up every foo.log in workspace dirs after 24 hours or
something.
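
        A rough sketch of that, again in Python rather than a real cron job.
The workspace root and the 24-hour window are assumptions, and whether Jenkins
actually serves the resulting .gz files nicely is the "allegedly" part that
someone would need to verify first:

    # gzip-old-logs sketch (hypothetical)
    import gzip
    import os
    import shutil
    import time

    WORKSPACE_ROOT = "/home/jenkins/workspace"   # assumption
    CUTOFF = time.time() - 24 * 3600             # "after 24 hours or something"

    for dirpath, _dirs, files in os.walk(WORKSPACE_ROOT):
        for fname in files:
            if not fname.endswith(".log"):
                continue
            src = os.path.join(dirpath, fname)
            if os.path.getmtime(src) > CUTOFF:
                continue                         # still fresh; leave it alone
            with open(src, "rb") as f_in, gzip.open(src + ".gz", "wb") as f_out:
                shutil.copyfileobj(f_in, f_out)
            os.remove(src)                       # keep only the compressed copy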

        One other thing to keep in mind: the modification time on a directory
only changes if a direct child of that directory changes.  There are likely
many jobs with a directory structure such that the parent workspace directory’s
mtime never gets updated even while the job is actively writing deeper in the
tree.  Any sort of purge job is going to need to be careful not to nuke a
directory structure like this that is still being used. :)
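
        To be concrete about that: the safer check is the newest mtime anywhere
under the workspace, not the mtime of the workspace dir itself.  A hypothetical
helper for the purge sketch above:

    # newest-mtime sketch (hypothetical)
    import os

    def newest_mtime(root):
        """Most recent mtime of root or anything underneath it."""
        newest = os.path.getmtime(root)
        for dirpath, dirs, files in os.walk(root):
            for name in dirs + files:
                try:
                    newest = max(newest,
                                 os.path.getmtime(os.path.join(dirpath, name)))
                except OSError:
                    pass                         # vanished mid-walk; ignore it
        return newest

    # e.g. purge only if nothing in the tree was touched in the last 7 days:
    # if newest_mtime(path) < time.time() - 7 * 24 * 3600:
    #     shutil.rmtree(path)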

HTH.
