The short version is that, given the non-deterministic time it takes for package-dependency updates submitted to ci-management to actually arrive in live images, someone who needs to add a dependency within VPP is faced with waiting N days (where N is typically in the double digits) before even submitting their work.
Or, to preserve forward momentum, they end up just submitting it anyway and letting it install the missing packages, waiting for ci-management to catch up. The problem, of course, is that since it "works", not everyone does the ci-management part.

Chris.

From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] On Behalf Of Thanh Ha
Sent: Thursday, January 19, 2017 10:45 PM
To: Dave Wallace <dwallac...@gmail.com>
Cc: Vanessa Valderrama <vvalderr...@linuxfoundation.org>; vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] [FD.io Helpdesk #35687] Re: Verify job failure(s)

FWIW, in OpenDaylight we don't typically run yum update or apt-get update in our init-scripts on VM spin-up. At the job level we only install the dependencies needed by the build. I'm not sure why fd.io is running upgrades, but it was already in the script when I looked at it. System upgrades during VM spin-up are not something the OpenDaylight project does, at least.

Regards,
Thanh

On Thu, Jan 19, 2017 at 10:38 PM, Dave Wallace <dwallac...@gmail.com> wrote:

Ed, Thanh, Vanessa,

IMHO, updating the Ubuntu packages every time a VM is spun up is a bug wrt. being able to reproduce some (hopefully rare) build/test issues. Since every VM is potentially running with different versions of OS components, when a failure occurs (e.g. in "make test") it may be necessary to recreate the exact run-time environment in order to reproduce it. Unless the complete package list is archived for every VM instance that is spun up, this may not be possible.

My experience is that in those rare cases where a tool or environment issue causes a failure, the cost of finding the issue is extraordinarily high if you do not have the ability to recreate the EXACT build/run-time environment. This is why CSIT does not update OS components in the VM initialization scripts, and why its VM images are built from a specific package list instead of pulling the latest versions from the apt repositories.

My recommendation is that the VM images be updated periodically (weekly, or whenever a new security update is released) and that the package lists be archived for each VM image version. Each VM image should also be verified against a known-good VPP commit, as is done with CSIT branches.

Ideally we should build a fully automated continuous-deployment model that reduces the work of updating the VM images to running a Jenkins job that builds/tests/deploys a new VM image from the latest package versions. With that automation in place, this mechanism could be extended for use by CSIT as well as "make test", ensuring that all of our testing is done with the same OS component versions. Ideally, all projects should be using the same OS components so that everything is tested in the same run-time environment.

Thanks,
-daw-
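A minimal sketch of the package-list archiving Dave describes, assuming it runs as a final step of a Debian/Ubuntu image build; the image-version naming scheme and manifest path are illustrative, not part of any existing fd.io tooling:

    #!/bin/bash
    # Record the exact package set baked into a VM image so that a
    # failing run-time environment can be recreated later.
    IMAGE_VERSION="ubuntu1604-$(date +%Y%m%d)"        # illustrative naming
    MANIFEST="/var/tmp/${IMAGE_VERSION}-packages.txt"

    # One "package=version" line per installed package.
    dpkg-query -W -f '${Package}=${Version}\n' | sort > "${MANIFEST}"

    # A replacement VM could later be pinned to the same versions with,
    # e.g.: xargs -a packages.txt apt-get install -y --allow-downgrades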
On 1/19/2017 8:31 PM, Thanh Ha via RT wrote:

The issue with the 16.04 Ubuntu image is fixed now (though we may require some additional actions, which I'll send to Vanessa in case this issue comes up again). We fixed it tonight by rebuilding ubuntu1604 and deploying the new image. I'm going to close this ticket as resolved, and we'll take the additional task of finding a way to ensure this doesn't appear again off of this ticket.

If you're not interested in the detailed analysis you can stop reading now. For those interested, I suspect that the lock issue will appear again (although I could be wrong).

The reason I believe so is that our VM init script runs "apt-get update" as an initialization step when the VM boots at creation time, via this script [0]. Ed mentioned that we didn't see this in the past and that it only started to appear recently, after we deployed another patch to disable Ubuntu's unattended upgrades. I believe a possible reason we will see this issue again due to [0] is that we switched from the JClouds Jenkins plugin to the OpenStack Jenkins plugin for node spin-up, and there is a difference in how the init-script is executed depending on which plugin is used.

JClouds Plugin:

1) boot VM
2) wait for ssh access
3) copy the init-script into the VM via ssh
4) execute the init-script, and do not continue processing until the script is complete
5) once the init-script is complete, pass the VM over to the job, and the job starts

OpenStack Plugin:

1) boot VM and pass the init-script in as User Data
2) init-script runs inside the VM without Jenkins intervention, i.e. it is non-blocking
3) in parallel, Jenkins waits for ssh access to the VM
4) Jenkins ssh's into the VM and passes it over to the job, and the job starts running

In the OpenStack plugin case, step 4 can execute while step 2 is still running apt-get update in the background, because step 2 is non-blocking.

A few ideas I have to get around this:

a) Allow the init-script to continue running apt-get update, but have a shell script at the start of Ubuntu jobs that waits for the lock to be released before allowing the job to start (see the lock-wait sketch at the end of this thread)
b) Remove apt-get update from the init-script and make the job run apt-get update at the beginning of its execution
c) Regularly update the VM images to ensure that apt-get update always runs quickly

Regards,
Thanh

[0] https://git.fd.io/ci-management/tree/jenkins-scripts/basic_settings.sh#n14

On Thu Jan 19 19:23:59 2017, hagbard wrote:

FYI... helpdesk is on it, and it's being worked in #fdio-infra on IRC.

Ed

On Thu, Jan 19, 2017 at 4:31 PM, Ed Warnicke <hagb...@gmail.com> wrote:

Looping in help desk.

On Thu, Jan 19, 2017 at 4:16 PM Dave Barach (dbarach) <dbar...@cisco.com> wrote:

Folks,

See https://jenkins.fd.io/job/vpp-verify-master-ubuntu1604/3378/console

11:00:46 E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
11:00:46 E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?

I recognize this failure from my own Ubuntu 16.04 system: a cron job starts "apt-get -q", which for whatever reason does not terminate. As a workaround, run "sudo killall apt-get || true" before trying to acquire build dependencies...

HTH... Dave
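As referenced in Thanh's option (a) above, a minimal sketch of a lock-wait guard that could run at the start of Ubuntu jobs, assuming the job user has passwordless sudo; the timeout and polling interval are illustrative:

    #!/bin/bash
    # Wait for whatever process holds the dpkg lock (e.g. a still-running
    # "apt-get update" from the init-script) to finish before the job
    # proceeds.  Gives up after roughly 5 minutes.
    for i in $(seq 1 60); do
        # fuser exits 0 while some process still has the lock file open
        if ! sudo fuser /var/lib/dpkg/lock >/dev/null 2>&1; then
            exit 0
        fi
        echo "dpkg lock held by another process, waiting (${i}/60)..."
        sleep 5
    done
    echo "Timed out waiting for dpkg lock" >&2
    exit 1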
_______________________________________________
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev