> On Jul 23, 2018, at 4:35 PM, Gav <ipv6g...@gmail.com> wrote:
>
> Thanks Allen,
>
> Some of our nodes are only 364GB in total size, so you can see that this is
> an issue.
Ugh.

> For the H0-H12 nodes we are pretty fine currently with 2.4/2.6TB disks -
> therefore the urgency is on the Hadoop nodes H13 - H18 and the non Hadoop
> nodes.
>
> I propose therefore H0-H12 be trimmed on a monthly basis for mtime +31 in
> the workspace, and H13-H18 plus the remaining nodes with 500GB disks or less
> be done weekly.
>
> Sounds reasonable ?

Disclosure: I'm not really doing much with the Hadoop project anymore, so someone from that community would need to step forward. But If I Were King:

For the small nodes in the Hadoop queue, I'd request they either get pulled out or get put into 'Hadoop-small' or some other similar label. Doing a quick pass over the directory structures via Jenkins, with only one or two outliers everything there looks 'reasonable', i.e., 400G drives are simply under-spec'ed for the full workload the 'Hadoop' nodes are expected to handle these days, and 7 days isn't going to fix that. Putting JUST the nightly jobs on them (hadoop qbt, hbase nightly, maybe a handful of other jobs) would still eat plenty of disk space. For the other nodes, though, 7 days and then the workspace dir goes away is probably reasonable.

But it looks to me like there are jobs running on the non-Hadoop nodes that probably should be in the Hadoop queue (Ambari, HBase, Ranger, Zookeeper, probably others). Vice versa is probably also true.

It might also be worthwhile to bug some of the vendors involved to see if they can pony up some machines/cash for build server upgrades, like Y!/Oath did/does.

That said, I see some potential changes the Apache Yetus project can make to lessen the disk space load for the projects that use it. I'll need to experiment a bit first to be sure, but I'm looking at tens of GB freed up if my hypotheses are correct. That might be enough to avoid moving nodes around in the Hadoop queue, but I can't see that lasting long.

Jenkins allegedly has the ability to show compressed log files. It might be worthwhile investigating doing something in this space at a global level: just gzip up every foo.log in the workspace dirs after 24 hours or something.

One other thing to keep in mind: the modification time on a directory only changes when a direct child of that directory changes. There are likely many jobs whose directory structure is such that the top-level workspace directory's mtime never gets touched even while a build is writing deeper in the tree. Any sort of purge job is going to need to be careful not to nuke a directory structure like that while it is still being used. (Rough sketch of what I mean at the bottom of this mail.) :)

HTH.
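For what it's worth, here's roughly the kind of thing I'm picturing for the purge + gzip pieces. Completely untested, written off the top of my head; the workspace path, the 31-day/24-hour thresholds, and the dry-run default are all placeholders rather than a proposal for the actual Jenkins setup, so treat it as a sketch and not something to drop straight into cron.

#!/usr/bin/env python3
# Sketch only: WORKSPACE_ROOT, the thresholds, and DRY_RUN are placeholders.
import gzip
import os
import shutil
import time

WORKSPACE_ROOT = "/home/jenkins/jenkins-slave/workspace"  # hypothetical path
PURGE_AFTER_DAYS = 31        # 7 for the small nodes
GZIP_LOGS_AFTER_HOURS = 24
DRY_RUN = True               # flip once the output looks sane


def newest_mtime(path):
    """Newest mtime anywhere under `path`, not just on the top directory.

    A job writing ten levels deep never touches the top-level dir's mtime,
    so trusting os.path.getmtime(path) alone would purge a live workspace.
    """
    newest = os.path.getmtime(path)
    for dirpath, dirnames, filenames in os.walk(path):
        for name in dirnames + filenames:
            try:
                newest = max(newest,
                             os.lstat(os.path.join(dirpath, name)).st_mtime)
            except OSError:
                pass  # file vanished mid-walk; likely an active build
    return newest


def purge_stale_workspaces(root, days):
    """Remove per-job workspace dirs whose entire tree is older than `days`."""
    cutoff = time.time() - days * 86400
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        if newest_mtime(entry.path) < cutoff:
            print("purge: %s" % entry.path)
            if not DRY_RUN:
                shutil.rmtree(entry.path, ignore_errors=True)


def gzip_old_logs(root, hours):
    """Compress any *.log under `root` that hasn't been written in `hours`."""
    cutoff = time.time() - hours * 3600
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".log"):
                continue
            src = os.path.join(dirpath, name)
            if os.path.getmtime(src) >= cutoff:
                continue
            print("gzip: %s" % src)
            if not DRY_RUN:
                with open(src, "rb") as fin, gzip.open(src + ".gz", "wb") as fout:
                    shutil.copyfileobj(fin, fout)
                os.unlink(src)


if __name__ == "__main__":
    purge_stale_workspaces(WORKSPACE_ROOT, PURGE_AFTER_DAYS)
    gzip_old_logs(WORKSPACE_ROOT, GZIP_LOGS_AFTER_HOURS)

The important bit is newest_mtime(): it walks the whole tree instead of trusting the mtime on the top-level workspace dir, which is what keeps it from nuking a job that is actively writing several directories down.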