I can confirm the same behavior exists in non-Docker environments. I went back to my trusty dev VM running in VirtualBox and see the same behavior. Here is the config of the VM:

deploy@app1[local]:~$ uname -a
Linux app1 4.4.0-169-generic #198-Ubuntu SMP Tue Nov 12 10:38:00 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
deploy@app1[local]:~$ cat /etc/issue
Ubuntu 16.04.6 LTS \n \l

And monitoring the filesystem behavior of /dev/sda1:

Starting a conversion:

Every 1.0s: df -H /; ls -ltr /tmp                  Thu Dec  5 03:49:26 2019

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        26G  8.9G   18G  35% /
total 166384
drwxr-xr-x 3 www-data www-data      4096 Dec  5 03:41 resources
drwxr-xr-x 6 root     root          4096 Dec  5 03:41 vagrant-chef
-rw-rw-r-- 1 deploy   deploy   158002235 Dec  5 03:46 k42tn4xv.csv
-rw------- 1 deploy   deploy           0 Dec  5 03:48 90ctcu3h.csv
drwx------ 2 deploy   deploy        4096 Dec  5 03:48 pspp92Vei8
-rw-rw-r-- 1 deploy   deploy    12357632 Dec  5 03:48 90ctcu3h.csvtmphbUHkN

At end of conversion:

Every 1.0s: df -H /; ls -ltr /tmp                  Thu Dec  5 03:49:26 2019

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        26G   16G   11G  60% /
total 270324
drwxr-xr-x 3 www-data www-data      4096 Dec  5 03:41 resources
drwxr-xr-x 6 root     root          4096 Dec  5 03:41 vagrant-chef
-rw-rw-r-- 1 deploy   deploy   158002235 Dec  5 03:46 k42tn4xv.csv
-rw-rw-r-- 1 deploy   deploy   118785163 Dec  5 03:48 90ctcu3h.csv
drwx------ 2 deploy   deploy        4096 Dec  5 03:48 pspp92Vei8

You’ll see the free space on the file system dropped by about 7 GB, but only about 118 MB was written.

I will try pspp-convert as Ben suggested and report back.

Cheers
Dave

On Dec 4, 2019, 2:06 PM -0600, Dave Trollope <d...@knowledgehound.com>, wrote:
> Once the conversion is complete the space is returned so it's not a long term
> problem - only during the conversion. This became an issue because in
> kubernetes you control your resources much more tightly and that’s why this
> was highlighted.
>
> I’m not sure there is anything special about the SAV files, so yes I would
> expect it to be easily reproducible - but at this point I don’t know what I
> don’t know that might be relevant ;-)
>
> I will try running the same thing on a regular ec2 vs docker as mentioned in
> my earlier email and verify if this is truly unique to docker based
> environments - but my gut tells me it is not, we just didn’t notice before
> because we had lots of space on the machine.
>
> Cheers
> Dave
> On Dec 4, 2019, 11:15 AM -0600, Alan Mead <am...@alanmead.org>, wrote:
> > I'm curious to see what the devs say. I think they use Debian, but I don't
> > know about docker.
> >
> > So is the excessive disk space used and then returned when pspp is
> > done, so only 150MB are consumed? Or is it that many GB of storage seem to
> > disappear (so maybe the file shows a CSV file size of 150MB but the docker
> > container is 7GB bigger)?
> >
> > If I wanted to replicate the behavior, are there any special aspects to the
> > datafiles? I'd create a SAV file with a few columns and enough rows of
> > random data to make a 1GB SAV file. Right?
> > Then I'd run your script to create the CSV. Right? And if I did this on a
> > stock Linux host without docker/ramfs/etc., I wouldn't see 7GB of space
> > consumed during the conversion, but if I then arranged to do the same test
> > using docker or ramfs, I would? Is that correct?
> >
> > If so, that seems to indicate something to do with docker/ramfs, right? Or,
> > you're saying this would affect a physical linux host equally?
> >
> > -Alan
> >
> >
> > On 12/4/2019 9:24 AM, Dave Trollope wrote:
> > > Hi Alan,
> > >
> > > Sorry, yes I forgot to mention this is linux, Debian GNU/Linux 9
> > > Linux e1e6db1d8408 4.9.184-linuxkit #1 SMP Tue Jul 2 22:58:16 UTC 2019
> > > x86_64 GNU/Linux
> > >
> > > I’ve reproduced this behavior in kubernetes and outside kubernetes in a
> > > raw docker container so it's not kubernetes specific but may be related to
> > > the way the containerized image is built in docker.
> > >
> > > We haven’t observed this on our standard ec2, but to be honest we haven’t
> > > monitored in the same way - I can try that and see. We have enough space
> > > there that it could have gone unnoticed. I will try.
> > >
> > > What I'm doing is watching the filesystem as the SAVE TRANSLATE command
> > > is running, using watch -n 0.5 "df -H; ls -ltr /tmp"
> > >
> > > The only file being written is the csv but the filesystem free space is
> > > dropping at a much higher rate than data is being written. No other temp
> > > files are being placed in /tmp
> > >
> > > I also reproduced this using a ram based fs - if you watch the usage it
> > > behaves the same so I don't think it's specific to dockerized filesystems,
> > > but I might yet be wrong on that.
> > >
> > > The link you shared is a common problem when starting out with containers
> > > where the build process creates lots of images. As you build lots of
> > > images, you have to clean up. It's one of the first things you learn as you
> > > step into the container world!
> > >
> > > Appreciate the quick reply. It certainly was a shocking observation when
> > > I found it :-)
> > >
> > > Cheers
> > > Dave
> > >
> > >
> > > On Dec 4, 2019, 8:29 AM -0600, Alan Mead <am...@alanmead.org>, wrote:
> > > > Wow, that's a lot. Do you mean that 7GB of space are needed (for, I
> > > > guess temporary files)? And you did not observe that previously?
> > > >
> > > > Maybe the devs are familiar with kubernetes; I only know the name.
> > > > Can you describe the environment (e.g., OS)? And pspp version? In how
> > > > many conversions have you observed this behavior?
> > > >
> > > > And you're sure this isn't a kubernetes problem (like it's making
> > > > snapshots as it writes the file or something)? I ask because when I
> > > > google about this, it looks like there are sharp edges; glancing
> > > > through, these don't seem to directly and specifically address the
> > > > behavior you're seeing, but it looks like there could be these kinds of
> > > > issues with kubernetes and the PSPP devs wouldn't be able to help
> > > > unless they knew kubernetes:
> > > >
> > > > https://cntnr.io/whats-eating-my-disk-docker-system-commands-explained-d778178f96f1
> > > > https://softwareengineeringdaily.com/2019/01/11/why-is-storage-on-kubernetes-is-so-hard/
> > > >
> > > > -Alan
> > > >
> > > >
> > > > On 12/4/2019 6:40 AM, Dave Trollope wrote:
> > > > > We just moved Pspp to Kubernetes containers where we use it to
> > > > > extract csvs from sav files. The sav files are about 1gb and each csv
> > > > > is about 150mb.
> > > > >
> > > > > We’ve watched the file system as it does it and over 7gb of the file
> > > > > system is used while writing 150mb. I assume the SAVE command is
> > > > > doing lots of seeks and insertions in the file magnifying the file
> > > > > system usage. Any options to limit this behavior?
> > > > >
> > > > > Here is the script we are using
> > > > > GET FILE = "{}"
> > > > >
> > > > > SAVE TRANSLATE
> > > > > /OUTFILE="{}"
> > > > > /TYPE=CSV
> > > > > /FIELDNAMES
> > > > > /REPLACE
> > > > > /KEEP={}
> > > > > /MISSING=RECODE
> > > > > /CELLS=LABELS.
> > > > > Cheers
> > > > > Dave
> > > > >
> > > >
> > > > --
> > > >
> > > > Alan D. Mead, Ph.D.
> > > > President, Talent Algorithms Inc.
> > > >
> > > > science + technology = better workers
> > > >
> > > > http://www.alanmead.org
> > > >
> > > > The irony of this ... is that the Internet is
> > > > both almost-infinitely expandable, while at the
> > > > same time constrained within its own pre-defined
> > > > box. And if that makes no sense to you, just
> > > > reflect on the existence of Facebook. We have
> > > > the vastness of the internet and yet billions
> > > > of people decided to spend most of their time
> > > > within a horribly designed, fake-news emporium
> > > > of a website that sucks every possible piece of
> > > > personal information out of you so it can sell it
> > > > to others. And they see nothing wrong with that.
> > > >
> > > > -- Kieren McCarthy, commenting on why we are not
> > > > all using IPv6
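
For anyone trying to reproduce the measurement, the manual check described in this thread (watching df and ls while SAVE TRANSLATE runs) can be approximated with a short script. This is only a sketch: the function names, the default mount point "/" and the sampling interval are assumptions made for illustration, and nothing here is part of PSPP. It compares how much the filesystem's used space has grown with how large the output file currently is.

```python
import os
import shutil
import time


def fs_used_bytes(path):
    """Bytes currently used on the filesystem that contains `path`."""
    return shutil.disk_usage(path).used


def file_size_bytes(path):
    """Size of `path` in bytes, or 0 if it does not exist (yet)."""
    try:
        return os.path.getsize(path)
    except OSError:
        return 0


def amplification(fs_before, fs_after, bytes_written):
    """Growth in used filesystem space divided by bytes in the output file."""
    if bytes_written <= 0:
        # Nothing written yet: report infinity only if space was consumed.
        return float("inf") if fs_after > fs_before else 0.0
    return (fs_after - fs_before) / bytes_written


def monitor(outfile, mount="/", interval=1.0, samples=10):
    """Print filesystem growth and output-file size once per interval.

    `outfile` is the CSV being written; `mount` is any path on the
    filesystem to watch. Both are supplied by the caller (hypothetical
    defaults, not taken from the thread).
    """
    start = fs_used_bytes(mount)
    for _ in range(samples):
        used = fs_used_bytes(mount)
        written = file_size_bytes(outfile)
        print(f"fs grew {used - start:>14,d} B  "
              f"file is {written:>14,d} B  "
              f"ratio {amplification(start, used, written):.1f}")
        time.sleep(interval)
```

Calling monitor("/tmp/90ctcu3h.csv") from a second terminal while the conversion runs would print one line per second. With the rounded df figures from the VM above (used space going from 8.9G to 16G, and a CSV of 118785163 bytes), amplification(8.9e9, 16e9, 118785163) comes out at roughly 60. The used-space figure covers the whole filesystem, so any other process writing to the same disk will inflate the ratio.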