[Ubuntu-phone] Catching CPU run-aways on Touch

2013-09-04 Thread Evan Dandrea
Hi folks,

In another discussion, James Hunt raised the possibility of
periodically checking for runaway processes on Touch, killing those
consuming 100% CPU while creating a report to be sent to
https://errors.ubuntu.com.

I've summarised the key points of that discussion here into a
proposal, in the hope that this gives everyone a chance to provide
input.

== Examples ==

There are a few examples of this problem biting us already.

The original bug James ran into was:
https://bugs.launchpad.net/ubuntu/+source/bluetooth-touch/+bug/1217865

Martin Pitt also raised one where two rogue system service processes
constantly used 150% CPU (i.e. 1.5 cores):
https://launchpad.net/bugs/1188404

A few weeks ago there was a nasty timing bug which caused ueventd to
use 100% CPU:
https://bugs.launchpad.net/touch-preview-images/+bug/1190792

Whoopsie also had a memory corruption bug which caused 100% CPU usage
around the same time as the ueventd bug:
https://bugs.launchpad.net/touch-preview-images/+bug/1211417

Note that this is not really about power consumption. Colin King has
done analysis of power consumption on Touch devices and the biggest
bang for the buck is ensuring that sensors are turned off when they
are not needed, not minimising CPU usage. Instead, please consider
this proposal an attempt to better ensure the stability and
performance of Touch systems out in the wild.

== Implementation ==

We will enable the sampling and reporting of high CPU usage in
background processes on Touch devices when the device is not in
developer mode.

Foreground processes will be ignored by this check. They will instead
be handled by an "application not responding" (ANR) implementation in
Mir. They will be allowed to use 100% CPU unless they block the UI
thread for an unreasonable amount of time.

With the application lifecycle work, background applications will be
suspended and get no CPU time at all, so this check will only apply to
system processes.

Each background process will be periodically sampled for its CPU
usage. If the process is using a large amount of CPU consistently
across several of these samplings, it will be killed and an apport
report will be created.
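
A minimal sketch of the "consistently across several samplings" rule,
in Python purely for illustration. The 90% threshold and the
three-sample window are placeholders (the threshold is still an open
question), and RunawayDetector is a hypothetical name, not existing
code:

```python
from collections import defaultdict, deque


class RunawayDetector:
    """Flag a process only when every one of the last `window` samples is high."""

    def __init__(self, threshold=90.0, window=3):
        self.threshold = threshold  # placeholder: percent of one core
        self.window = window        # placeholder: consecutive samples required
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, pid, cpu_percent):
        """Store one sample; return True if the process now looks like a runaway."""
        samples = self.history[pid]
        samples.append(cpu_percent)
        return (len(samples) == self.window
                and all(s >= self.threshold for s in samples))

    def forget(self, pid):
        """Drop the history of a process that has exited or been killed."""
        self.history.pop(pid, None)
```

A single low sample resets the verdict, so a process that is only
briefly busy is never flagged.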

An outstanding question is what the threshold should be for high CPU usage.

== Where will this check live? ==

It has been suggested that the task of periodically checking for
runaway processes live inside a long-running and lightweight C
process. Whoopsie was suggested as a potential candidate.

libprocps was raised as potentially helpful, but James pointed out
that CPU percentage needed to be calculated by the caller, so another
approach may prove easier.

== How will we group reports of the same underlying problem? ==

https://errors.ubuntu.com will need to receive a string that
represents the problem (a signature) to which this instance of a
runaway process belongs. This lets the website group together the
instances of a problem onto a single page and increment the count for
the problem on the front page leaderboard.

Whoopsie, or whatever process holds this check, will use the ptrace
system call to generate a stack trace of the runaway process which
apport can then use to generate a crash signature:
http://bazaar.launchpad.net/~apport-hackers/apport/trunk/view/head:/apport/report.py#L1199

Martin suggested we could take three stack traces, each about 1 s
apart, and then chop away the differing part at the top, so that we
only keep the common bit.
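
As a rough illustration of Martin's idea (not apport code), the common
part of several traces could be computed like this. It assumes each
trace is a list of function names with the innermost frame first;
common_stack is a hypothetical helper:

```python
def common_stack(traces):
    """Return the run of outermost frames shared by all traces.

    Each trace is a list of function names, innermost frame first, so the
    differing part is at the start of each list and the shared part at the end.
    """
    if not traces:
        return []
    common = []
    # Walk from the outermost frame (end of each list) towards the top.
    for frames in zip(*(reversed(t) for t in traces)):
        if len(set(frames)) != 1:
            break
        common.append(frames[0])
    common.reverse()
    return common
```

The result could then be fed into the signature generation in place of
a single stack trace.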

Since we would be generating multiple stack traces, we cannot just
build the report and stack trace through the traditional means of
triggering apport's kernel core pipe handler by sending SIGABRT.

== Bringing the check to the Ubuntu desktop ==

It was suggested that we could also have this check on the Ubuntu
desktop, but it was quickly pointed out that great care would need to
be taken to prevent reporting when gcc or Firefox uses 100% CPU.

This would be particularly annoying since the desktop currently
presents a dialog whenever an error occurs. There are plans underway
to group errors that do not need your immediate attention
(non-application crashes, e.g. package installation failures) into a
single dialog with the next error that does require your attention
(Firefox crashing); however, a quicker solution would be to only
report these desktop runaway processes on systems that have automatic
error reporting enabled.

We could then create a blacklist of processes that are known to be
intensive but safe using the data gathered from Touch and automatic
reporting systems and eventually bring reporting of runaway processes
to all Ubuntu systems (save servers).

A whitelist was considered, but determined to not save us from
problems like the ueventd bug.

Thanks,
Evan

-- 
Mailing list: https://launchpad.net/~ubuntu-phone
Post to : ubuntu-phone@lists.launchpad.net
Unsubscribe : https://launchpad.net/~ubuntu-phone
More help   : https://help.launchpad.net/ListHelp


Re: [Ubuntu-phone] Catching CPU run-aways on Touch

2013-09-04 Thread Evan Dandrea
On 4 September 2013 12:25, Thomas Voß wrote:
> +1, the respective grace/timeout period would need to be determined
> from empirical data, too.

Agreed.

>> With the application lifecycle work, background applications will be
>> suspended and get no CPU time at all, so this check will only apply to
>> system processes.
>>
>
> True, but to determine the CPU percentage, we would need to have the
> CPU usage of all processes for a certain amount of time available.
> That is, essentially parsing all of proc iiuc. I'm hopefully wrong,
> but if not, could we resort to an approach that just considers the
> per-process user and system CPU time consumed in a given time
> interval?

Yes, I didn't mean to imply that we filter them out from any
calculation, but rather that we do not consider reporting them because
they'll either be foregrounded or suspended.
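
For what it's worth, the per-process approach Thomas describes might
look something like the following sketch. It reads only the utime and
stime fields of a single /proc/<pid>/stat line rather than all of
/proc. The function names are made up for illustration:

```python
import os


def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the last ')' first.
    fields = stat_line.rsplit(")", 1)[1].split()
    # After the command name: state is index 0, utime index 11, stime index 12.
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before, ticks_after, interval_seconds, ticks_per_second=None):
    """CPU used over the interval, as a percentage of one core."""
    if ticks_per_second is None:
        ticks_per_second = os.sysconf("SC_CLK_TCK")
    used_seconds = (ticks_after - ticks_before) / ticks_per_second
    return 100.0 * used_seconds / interval_seconds


def read_ticks(pid):
    """Read the current CPU tick count for one process."""
    with open("/proc/%d/stat" % pid) as stat_file:
        return cpu_ticks(stat_file.read())
```

Calling read_ticks() twice, a known interval apart, and passing both
values to cpu_percent() gives the figure for that one process without
touching any other entry in /proc.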

> Yup, but we should start over with measuring before we start
> classifying CPU usage.

Yes, definitely. It absolutely makes sense to run this in a
measure-only mode before we flip on the out-of-control killer.

> I would rather want to keep it out of whoopsie and integrate it with
> the component implementing the lifecycle policy for two reasons.
> However, if we are only considering to monitor system/session
> services, it would be ok to start over with whoopsie. However, I would
> prefer an implementation that is easily reusable by other components
> (Mir/Unity8).

Yes, and I believe this was suggested in the prior discussion by
Steve. I forgot to include it in the summary; sorry.

I am definitely behind this living in the lifecycle policy system.
While whoopsie is long-running and does operate in the area of error
reporting, it does not already have code to poke at the brains of
processes. That's all handled by apport via the kernel core pipe. So
it felt like scope creep to me.

> Indeed. I think we would need to consider visibility (in terms of UI)
> and ANR in the policy and the component implementing the lifecycle
> policy, i.e., Mir and Unity8, is a better place to implement the
> behavior.

Definitely, though care still would need to be taken for things like
GCC on the desktop.



Re: [Ubuntu-phone] Catching CPU run-aways on Touch

2013-09-04 Thread Evan Dandrea
On 4 September 2013 16:22, Ted Gould wrote:
> We should probably start with reporting bugs, and then next step start
> killing.  "Would have been killed" bugs might be an interesting metric :-)

Just to be pedantic: send error reports, not file bugs. Bugs can most
certainly be created from a problem set off https://errors.ubuntu.com
with the click of a button, but if we point the firehose of crash data
at Launchpad bugs again, I'm going to be very sad.



Re: [Ubuntu-phone] Catching CPU run-aways on Touch

2013-09-05 Thread Evan Dandrea
Stuart Langridge brought up an interesting idea for this on IRC, which
I'm copying here rather than having more side discussions:

(10:13:20) aquarius: ev, just reading your thread about runaway
processes on the touch mailing list (I'm not subscribed to the list,
so there's no good way of replying), and I had a thought: it seems to
be primarily about *accidental* runaways (I put an infinite loop in my
code by mistake), and all the discussion is "how do we know if this
process is actually accidental, or if it's deliberately using 100% CPU
because it's properly
(10:13:20) aquarius: busy doing a lot? The thought is: provide an API
call yesIAmReallyBusy which you can call every few seconds while
you're busy and then the runaway-killer will know you're not a runaway
and will ignore you. This is how screensaver stuff works -- your
movie-playing app calls "ignoreTheScreenSaverForAMinute" every minute,
and then you don't have to worry about holding locks or anything, and
if your app crashes the
(10:13:20) aquarius: screensaver doesn't stay disabled.
(10:14:22) Evan: aquarius: did you see cking's reply? It seems to be
going down this route of "can we tell between accidental runaway and
purposeful"
(10:15:55) aquarius: ev, I did. The discussion is about being clever
around trying to identify from outside the runaway process whether it
is runaway or not, which is a useful thing to have if you can do it
certainly. What I'm suggesting is making it explicit -- a system
service should throw a yesIReallyAmBusy() into its high-CPU processing
loop, and then we don't *have* to dwim it; we'll know.
(10:20:32) aquarius: ev, I brought it up in case clever people like
you and cking might say "we can't do that because $REASONS"; if you
think it's worth bringing it up for discussion then I think that's a
good idea. Two reasons I can see against it: the first is that if I
drop yesIReallyAmBusy() into my main processing loop and that loop has
a bug which makes it run infinitely by accident then I'm now immune
from being killed, which
(10:20:32) aquarius: is bad. Secondly, it suggests that if this comes
to the desktop that things like Firefox would need to drop a
yesIReallyAmBusy() all over the place, which they ought to do (because
they want to be a good citizen on Ubuntu) but probably won't (because
being a good citizen on Ubuntu isn't a big enough deal to them).
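
To make the idea concrete, here is a hedged sketch of the bookkeeping
the runaway-killer side might do. Nothing like this exists yet: the
class, the method names and the ten second grace period are all
invented for illustration.

```python
import time


class BusyRegistry:
    """Remember which processes recently declared themselves busy on purpose."""

    def __init__(self, grace_seconds=10.0, clock=time.monotonic):
        self.grace_seconds = grace_seconds  # placeholder value
        self.clock = clock
        self.last_seen = {}

    def yes_i_really_am_busy(self, pid):
        """Called (e.g. over IPC) by a process every few seconds while busy."""
        self.last_seen[pid] = self.clock()

    def is_exempt(self, pid):
        """True while the most recent declaration is still within the grace period."""
        stamp = self.last_seen.get(pid)
        return stamp is not None and self.clock() - stamp <= self.grace_seconds
```

As with the screensaver case, the exemption expires on its own, so a
process that crashes or stops calling in becomes eligible for the
check again. It does nothing about the first objection above: a buggy
loop that keeps calling in stays exempt.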



Re: [Ubuntu-phone] Catching CPU run-aways on Touch

2013-09-05 Thread Evan Dandrea
On 5 September 2013 18:35, Steve Langasek wrote:
>> Is this a proposal for 13.10?
>
> I think it's unrealistic to think anything discussed here would land for
> 13.10.  We already have plenty of other things on our plate that are on the
> critical path for 13.10. :)

Agreed :)

> Anyway, point taken that we shouldn't deploy something that could cause
> processes that were previously perfectly reliable to suddenly be killed by
> some other process which arbitrarily decides they're "misbehaving", thus
> sending the whole system into turmoil.  If we're going to go around killing
> system processes, we should be sure that the cure isn't worse than the
> disease.

We should also be careful to not spin out of control ourselves, trying
to play whack-a-mole with an out of control process. Presumably this
is handled by upstart's respawn stanza?

> Certainly; any salient examples are going to be bugs we already know about,
> and thus which are likely to be fixed or in progress.
>
> The question I have is: would a monitor/killer for runaway processes have
> improved our response to these bugs?  Would it have resulted in earlier
> detection?  Easier diagnosis?  Faster fixing?  Would such monitoring tell us
> about other such bugs that we are currently unaware of and need to be?

I could not agree more with the data-driven approach here. I think
you're absolutely spot on to suggest this needs to prove its value
with some concrete numbers.

> I'm not convinced that the answer is "yes" to any of these.  Obviously, the
> only way to know if it would tell us about bugs we're unaware of is to try
> it and see :), but I think the fact that we are currently unaware of them is
> already a strong indicator that they should not be a high priority, because
> if they were high-impact they would organically rise to our attention.

So knowing what problems are out there is half of what something like
this gives us. https://errors.ubuntu.com has discovered lots of
serious problems not caught by our pre-release QA.

The other half is knowing how critical each problem is. Some subset of
the problems out there may rise to our attention, but we won't know how
important they are because we won't have a clear picture of how many
systems they affect. Engineering resource is finite. We have to make
tough decisions about which fixes are going to get the most bang for
the buck.



Re: [Ubuntu-phone] Flashing a nexus device from Mac OS X using phablet-flash works.

2013-09-12 Thread Evan Dandrea
On 12 September 2013 20:55, Mike McCracken wrote:
> In my experience, flashing the device with phablet-flash is not simple from
> within VBox because the VM will drop USB connection when the device reboots,
> then phablet-flash will time out.

You can add the device under USB Device Filters in the settings for
the VM so that VirtualBox automatically connects the device to the
guest.



[Ubuntu-phone] Manually retracing crashes on Touch images

2013-10-09 Thread Evan Dandrea
Hi everyone,

Loïc asked me to clarify how you can retrace crashes on Touch devices
manually, since we're still waiting on armhf retracers to appear on
Prodstack.

You'll first need some apt sources for apport to download -dbg and
-dbgsym packages from. The configurations we use for the production
retracers can be found in the Daisy project, so grab those:

bzr branch lp:daisy

You will need to add your PPA to the 13.10 sources if you have
debugging symbols enabled.

Next, tell apport to collect some additional information for the
crash. This adds the packaging information and runs any hooks:

apport-cli /var/crash/your_crash_file.crash
Hit view (V), then hit keep (K).

Finally, retrace your crash:

apport-retrace -S ~/bzr/daisy/retracer/config \
    /var/crash/your_crash_file.crash -o updated_crash_file.crash

You can pass the -C option if you're going to run this over a number
of crash files and want to cache the unpacked ddebs between runs.
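
If you do have a pile of crash files, a small wrapper along these lines
may save some typing. It is only a sketch: the helper names and the
"retraced_" output prefix are made up, and it assumes -C takes the
cache directory as its argument, so check apport-retrace --help first.

```python
import os
import subprocess


def retrace_command(crash_file, config_dir, output_dir, cache_dir=None):
    """Build the apport-retrace command line for one crash file."""
    name = os.path.basename(crash_file)
    cmd = ["apport-retrace", "-S", config_dir]
    if cache_dir:
        # Assumption: -C is followed by the directory used to cache ddebs.
        cmd += ["-C", cache_dir]
    cmd += [crash_file, "-o", os.path.join(output_dir, "retraced_" + name)]
    return cmd


def retrace_all(crash_files, config_dir, output_dir, cache_dir=None):
    """Retrace each crash file in turn, stopping at the first failure."""
    for crash_file in crash_files:
        subprocess.check_call(
            retrace_command(crash_file, config_dir, output_dir, cache_dir))
```

For example, retrace_all(glob.glob("/var/crash/*.crash"),
os.path.expanduser("~/bzr/daisy/retracer/config"), ".") would write the
retraced copies into the current directory.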

updated_crash_file.crash should now have full Stacktrace and
ThreadStacktrace fields.

You can throw that at Launchpad by running apport-cli
updated_crash_file.crash, then hitting S to send the report.

Let me know if you require further assistance.



Re: [Ubuntu-phone] Manually retracing crashes on Touch images

2013-10-10 Thread Evan Dandrea
On 9 October 2013 19:45, Michał Sawicz wrote:
> On 09.10.2013 19:04, Evan Dandrea wrote:
>>
>> bzr branch lp:daisy
>
>
> FYI, I had to comment out saucy-updates and -security from ddebs in
> retracer/config/Ubuntu 13.10/armhf, and add corresponding deb-src for
> apport-retrace to play nice with me.

This is fixed now, thanks to Steve Langasek and Martin Pitt.



[Ubuntu-phone] CI vanguard rotation now in place

2013-10-30 Thread Evan Dandrea
This past Monday the Continuous Integration team started a rotation of
shifts as the vanguard in #ubuntu-ci-eng on Freenode and #ci on
Canonical IRC.

The person listed as the vanguard in the channel topic is on the hook
for answering your immediate questions about the CI infrastructure. If
you have a pressing issue, please ping them in the channel and they
will coordinate a fix.

They may not solve the problem directly themselves, but they will
ensure that you can carry on with your work while they work with those
responsible for specific pieces to bring resolution.

The full schedule can be found in the CI team calendar:
http://bit.ly/ubuntu-ci-vanguard

Please note that while we have continuous coverage from the start of
the UK morning, there are some spans of time where we have no one
available to help you. We will work to decrease the size of these
spans as we gain experience with this process.

More information about what our vanguard does can be found here:
https://wiki.canonical.com/UbuntuEngineering/CI#Vanguard

Thanks!



[Ubuntu-phone] Working with the CI team

2013-11-06 Thread Evan Dandrea
Hi,

You may from time to time need things from the Continuous Integration
team. You have a few options:

1) For critical issues needing immediate attention, please continue to
ask the vanguard in #ci and #ubuntu-ci-eng.

2) For specific issues or needs from our infrastructure, please file a
bug here: https://bugs.launchpad.net/ubuntu-ci-services-itself/+filebug

3) For work requiring a number of tasks or more of a conversation
between the engineers on our teams, let me know and I'll stand up a
"CI and Your Team" project in Asana. We already have these in place
for the Kernel, QA, and Apps teams.

Thanks,
Evan



[Ubuntu-phone] Fwd: Update on the lab

2013-11-11 Thread Evan Dandrea
Happy Monday to those of you in green parts of not-America. As
cautioned by Larry on Thursday ("CI/QA Lab consolidation update"), the
1SS datacenter move is continuing today.

We are not at the point where we can provide an estimated time at
which specific services will be back online, but rest assured we are
working as hard as we can to get everything back up as quickly as possible.
Larry and Rick have been working all day and all night all weekend
long.

I will continue to provide periodic updates as we have progress to
report. As before, I ask that you leave Larry and Rick to it. If you
have questions, I'll happily hear them.

Thanks!

-- Forwarded message --
From: Larry Works 
Date: 11 November 2013 08:12
Subject: Update on the lab
To: Evan Dandrea 


Evan,

As you may have noticed (or very soon will) the lab is still down. The
day went far slower than we had thought. At any rate here is where we are:

 - We have power to all of the racks
 - 95% of the hardware from Lexington has been moved to 1SS (the
phablets and bootspeed machines are still at Lex)
 - 80% of the server style hardware has been racked (or re-racked as
the case may be)
 - 100% of the desktop systems (that can be) have had their internals
transferred into the rack mount cases
 - 0% of the desktops (in rack mount cases) have been racked.
 - We still have a lot of cable (power and network) to run.


Since it's shortly after 3:00 AM I am going to head over to the hotel to
get a few hours of sleep and will be back at it later this morning. We
will get the remaining few servers racked then start on the desktops in
rack mount clothing. We will also get power run to the servers and get
the network up.



[Ubuntu-phone] Fwd: Lab update from 11 November

2013-11-12 Thread Evan Dandrea
Good morning, everyone. The 1SS move continues. The nitty gritty details
can be found below. Larry's goal is to have everything powered on and
networked by the end of business today (US Eastern).

-- Forwarded message --
From: *Larry Works*
Date: Tuesday, November 12, 2013
Subject: Lab update from 11 November
To: Evan Dandrea 


Evan,

The networking configuration stopped shortly after the switches were
stacked and the link in was connected. Brian didn't like the way the
racks were populated and thought all of the CDUs needed to be flipped
endwise so the input power cables were going straight up to the top of
the racks instead of being looped up from the bottom. I uninstalled and
reinstalled all of the CDUs while he and Rick started shifting some of
the servers around within the various racks. After a bite of lunch I
headed back to Lex to pick up the last of the hardware (the 4 new Intel
SDPs for the kernel team, our 2 DMZ systems and IS' gateway server for
the DMZ). When I got back from that a more extreme shifting of hardware
was ongoing. I couldn't stay to see how extensive as I had to get the
rental back to the airport. When I got back from the rental turn in most
of the re-racks had been completed. I helped finish up the last 18 to 20
systems and then proceeded to rack most of the hardware that was just
brought in from Lex.

The end result of the day was:
  - The switches are all in place and the Cisco switches are stacked.
  - The network link for our internal network has been run.
  - All of the big hardware is at 1SS.
  - All of the server and test systems at 1SS are now racked.
  - The only hardware remaining at Lex are the phones/tablets and the 4
Veritons used for bootspeed testing (those probably won't get brought
over until Wednesday).
  - Brian was going to run the network link for the openstack suite
(since he will not be onsite on Tuesday) and maybe the DMZ link (if he
gets their gateway server racked).

For Tuesday, and I made this clear, networking in the systems is the
first order of business and the primary operation for the day. I will
get all of the infrastructure re-IP'd and connected to the switches,
Rick will ensure they are on the appropriate VLANs. We will start
bringing up systems (the internal DNS/DHCP server initially) and we will
test connectivity to them through the VLAN. Once we are satisfied we can
reach all of the required network segments we will start bringing up
additional servers then test clients. Once they are all on and can be
reached over the VPN link (and can communicate to each other internally)
we will start enabling jenkins instances. At that point the remote hands
will be called upon to start running basic tests and verifying
connectivity from remote locations but that probably won't be until late
afternoon/early evening our time. My goal is to have everything powered
on and on the network before I end my day tomorrow. I'm not going to
make plans for what tasks I will work on Wednesday until after I see
where we are tomorrow (well, later today really).

Now to get some sleep. 10 hours over the past three days isn't quite
cutting it; might see if I can squeeze in 5 tonight/this morning.

~w


[Ubuntu-phone] Fwd: Lab update from 12 November

2013-11-13 Thread Evan Dandrea
Hello everyone,

Apologies for missing the update yesterday. The 1SS move work is
continuing today. Rick, Larry, and Brian are continuing to work long
shifts in the lab and we're very close to having all the hardware
complete with some software services already online. The aim is to
start bringing some Jenkins systems online today, such as autopkgtest.
To be clear, we will not have everything online by the end of the day,
but tomorrow's update will include a list of what services we already
have operational and what we anticipate to be running shortly.

-- Forwarded message --
From: Larry Works 
Date: 13 November 2013 14:22
Subject: Lab update from 12 November
To: Evan Dandrea 


Greetings,

We have hardware up on the network. And, even better, it is reachable
via the VPN. We do still have a minor network issue getting the HP
managed switches tied into the rest of the Cisco network but I imagine
we'll get that resolved this morning.

 - We have 100% of the racks KVM cabled
 - We have 80% of the racks network cabled, the remaining 20% I held off
on until we get the HP to Cisco piece straight.
 - We have 60% of the hardware in the racks connected to power
 - We have the KVM switches tiered so all systems should be accessible
from a single URL


What we are going to do today:

 - Complete patching in network to the remaining 20% of systems
 - Complete connecting the remaining 40% of systems to power
 - Start bringing systems up

~w



[Ubuntu-phone] 1SS move update

2013-11-14 Thread Evan Dandrea
Good morning everyone,

We are nearly there.

The Jenkins instance that used to reside at http://10.98.0.1:8080 is
now back up and running under http://d-jenkins.ubuntu-ci:8080. Some of
the other Jenkins systems are up (s-jenkins, q-jenkins, m-jenkins),
but without Jenkins running yet. We're still working on ensuring the
slaves are up and the systems themselves are functioning without
issue.

You'll have to update /etc/openvpn/update-cilab-conf for this to work.
dhcp-option DNS is now 10.98.3.6. I've updated the wiki to reflect
this:

https://wiki.canonical.com/UbuntuEngineering/QA/VPN

The Dell, HP, and Lenovo systems used by the server, cloud, and MaaS
teams should be operational again. Please let me know if this is not
the case.

We will have most things up later today, including the phones. We
should have what remains operational by Friday. We will prioritise
bringing up the most used services first.

Quoting Larry:

One note to put out there; the CDU that all of the kernel SRU/smoke
test systems (alkaid, phact, rukbah, tarf, onza, onibi, zmeu and
zuijin) are connected to for power is not our normal ServerTech CDU.
The only outlets above the rack those systems are hosted in were 30amp
outlets which we cannot use with our regular CDUs. Instead that rack
is supported by an APC PDU. We did not get to configure those two PDUs
as they take a different type (6 pin, probably rj-25) of connector
than the normal 8 pin rj-45 that our normal console cables have. Once
we track down one of those we will configure those PDUs so that they
are network accessible. It will possibly also involve modifying the
jenkins jobs or cobbler profiles to use the correct connection method.

The Dell stack, HP stack and Lenovo stacks used by various factions
(server team, cloud, MaaS) should be online for those folks to start
back into their testing. I still need to turn the kernel team systems
back on and make sure their jenkins instance is starting up and
reachable over the network.

I will work on getting the other jenkins instances (s-jenkins,
q-jenkins, m-jenkins and dev-jenkins) up and running [this] afternoon
but probably won't have them all up until Friday. Also note, we have
two different cobbler instances from the original labs; one from
jiufeng (replaced by tachash) and the other from magners-orchestra (to
be replaced by jatayu). I will consolidate that all into a single
cobbler instance in the very near future (migrating all of the systems
and profiles) so we can just use the DNS alias cobbler.ubuntu-ci.

I also need to consolidate the two nagios and munin servers into a
single server, move m-jenkins to a different server, move dev-jenkins
to a physical server and Rick and I need to fix naartjie. Phones and
tablets will be moved [today] as should the four Acer Veritons used
for bootspeed testing.



[Ubuntu-phone] 1SS move evening update

2013-11-14 Thread Evan Dandrea
Hello,

I just wanted to provide you with a quick update as Europe winds down
for the evening.

The phones are being delivered to 1SS by Rick as we speak and he's
going to spend the day wiring them up. As he needs to leave promptly
at 6pm today, there is a slight risk that all of the phones may not be
in place for the AM. We'll just have to work with what we have for the
landing task force in the morning, if that ends up being the case.

Meanwhile, the CI team is working really hard to get all of the
remaining services back online, especially those that support the
landing task force. There's some slight rearranging that needs to be
done to support the phones in the DC and the team is laying the
groundwork for that as I type. Larry will be online shortly to further
coordinate the efforts and I ask that as before, you leave him to it.
If you have critical questions, please direct them through the
vanguard in #ci.

Phoenix, ashes, and okiku are in need of some replacement hardware.
Rick is on this.

I'll be providing another update in the morning. Until then, thanks
for your continued patience while we finish off the last pieces.



[Ubuntu-phone] Ubuntu CI infrastructure scheduled outages

2013-12-12 Thread Evan Dandrea
In order to provide you with more predictable service from the CI
infrastructure, we're going to be sending notices of planned service
interruptions two days in advance of any work. The hope is that this
will give you adequate time to prepare and an opportunity to let us
know if the date overlaps with a critical event for you.

We will do our best to provide enough information for you to assess
what impact this has on your own work, and will endeavour to target
these mails to only the mailing lists of the teams we've determined to
be affected. As this latter point is an imprecise science, we will err
on the side of notifying more people than necessary. Over time we'll
try to make this more and more accurate.

If you wish to always receive these notifications of service
interruption from us, I encourage you to subscribe to the following
mailing list:

https://launchpad.net/~canonical-ci-announce

Consider this mailing list the safety net. Please have at least one
person from your team subscribed to it.

Thank you.



[Ubuntu-phone] Vanguard shifts during the CI/Juju sprint (EU hours)

2013-12-17 Thread Evan Dandrea
This Tuesday, Wednesday, and Thursday, the EU-based members of the CI
team are meeting in Bluefin to train up on Juju.

We will continue to try to address your issues as quickly as is
feasible, but we will not have true vanguard coverage prior to 1300
UTC each of these days. If there is an emergency situation, please
ping me directly on IRC (ev).

Thanks!



Re: [Ubuntu-phone] Touch image 120 results

2014-01-10 Thread Evan Dandrea
On 10 January 2014 09:15, Alexander Sack wrote:
> On Fri, Jan 10, 2014 at 6:23 AM, Paul Larson wrote:
>> = Mako =
>> 100% pass (no reruns for anything)
>> But we saw several crashes - dialer-app (which has been going on for a
>> while) as well as unity8 crash in default and in click-image-tests.
>> Default tests also saw a crash in whoopsie:
>> http://ci.ubuntu.com/smokeng/trusty/touch/mako/120:20140109.1:20140107.1/5973/click_image_tests/
>
>
> Why do we see whoopsie crashes? Thought we disabled it to not auto
> process crashes on the phone sometimes last cycle.

I'm being pedantic, but they're technically apport crashes. Whoopsie
is just the daemon that shovels .crash files to
https://daisy.ubuntu.com.

The phone does not presently do second-phase processing of crash
files (adding package information, hooks, etc.), nor does it feed the
crash files to whoopsie (using whoopsie-upload-all), because Steve
established that the upstart job that ran whoopsie-upload-all was broken:

https://bugs.launchpad.net/ubuntu/+source/apport/+bug/1235436

However, Brian Murray has fixed the bug in question, so we should be
largely ready to go on accepting crash reports from phones. The last
remaining piece is getting armhf retracers online as part of the move
of the retracing infrastructure to Prodstack. I've asked Brian to take
this task from me and finish it up. All that's needed is working with
webops to verify that the stagingstack deployment is functional:

https://rt.admin.canonical.com//Ticket/Display.html?id=58019

Now, to your question of why we're seeing whoopsie-upload-all crashes
collected in the CI infrastructure. As Michał points out, that script
is being run over a corrupted crash file. I've filed this bug to
better deal with that particular case:

https://bugs.launchpad.net/ubuntu/+source/apport/+bug/1267774
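
To illustrate the kind of guard I have in mind, here is a minimal
sketch that checks whether a crash file looks well formed before
anything tries to process it. It is not the apport or
whoopsie-upload-all code. It assumes a report is a series of
"Key: value" fields whose continuation lines start with a space, and
the names looks_well_formed and partition_reports are hypothetical.

```python
import re

# Assumption: a crash report is a sequence of "Key: value" fields, and the
# continuation lines of a multi-line value start with a single space.
_KEY_LINE = re.compile(r"^[A-Za-z0-9_.-]+:( .*)?$")


def looks_well_formed(text):
    """Return True if every line is a field header or a continuation line."""
    seen_key = False
    for line in text.splitlines():
        if line.startswith(" "):
            if not seen_key:
                return False  # continuation line before any field
        elif _KEY_LINE.match(line):
            seen_key = True
        else:
            return False  # neither a field header nor a continuation
    return seen_key  # an empty file is treated as corrupt


def partition_reports(reports):
    """Split a {name: text} mapping into (uploadable, skipped) name lists."""
    good, bad = [], []
    for name, text in sorted(reports.items()):
        (good if looks_well_formed(text) else bad).append(name)
    return good, bad
```

A caller could log the skipped names and carry on with the rest,
rather than crashing on the first file it cannot parse.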

There's a deeper problem here. Didier informs me that they were seeing
a lot of crashes in unity8 with a smashed stacktrace. They realised
the dying unity process was getting reaped and restarted by upstart
while still being processed by apport because it was taking a long
time to collect and process the core file. They set a timeout of 30s
(data/unity8.override in unity8-autopilot), which seemed to work
around the problem, but perhaps that value is being exceeded.

We need a better solution than increasing a timeout. James, does
upstart provide us with a better mechanism for telling it to not kill
a process in this state? Can we add one if not? :)

Thanks everyone.



[Ubuntu-phone] CI Train requests

2014-02-25 Thread Evan Dandrea
We're nearly finished transitioning the CI Train to a production
service. When and if issues arise, please direct them as follows:

- For the day-to-day operations of the CI Train, your point of contact
is the Landing Task Force. For example: if you need someone to
reconfigure a silo.

- For service issues with the CI Train, your point of contact is the
CI Engineering team. For example: if ci-train.ubuntu.com appears to be
down.

Thanks!



Re: [Ubuntu-phone] whoopsie-upload-all

2014-06-23 Thread Evan Dandrea
On 23 June 2014 13:26, Alan Pope wrote:
> On Mon, Jun 23, 2014 at 12:10 PM, Alexander Sack wrote:
>> good that you ask this. We don't have that hooked into our process
>> yet, but are working on this as we speak. We basically have to first
>> get a view worked into errors.ubuntu.com that allows us to more
>> effectively use that tracker for touch images.
>
> Related:- https://bugs.launchpad.net/ubuntu/+source/whoopsie/+bug/1332925
>
> Whoopsie has never uploaded crash reports as far as I can see. You
> *have* to manually upload them, at least on mobile.

This hasn't been our experience, but Brian is now looking into the bug
to find out why this is happening for you.

It is a bit of a moot point as all the stacktraces are corrupt on
Touch right now. Matthias and Brian have been looking into this as it
appears to be a low-level bug.



Re: [Ubuntu-phone] Planned Maintenance Advisory: CI Lab outages the week of September 9th

2014-08-21 Thread Evan Dandrea
On 21 August 2014 17:19, Evan Dandrea wrote:
> The DCEs have been working hard on some improvements to our CI Lab
> network infrastructure. Most notably, they'll be normalising the VLAN
> numbering, moving inter-VLAN routing to batuan, and taking control of
> the remaining pieces of our network (KVMs, etc).
>
> The work will take place on the 9th through the 11th, causing about
> two hours of downtime each day. During these small windows, CI-driven
> testing *will not* progress. We will work to get back up and running
> quickly, and no action on your part will be necessary to resume any
> jobs.
>
> As always, you can check the topic in #ci on the day or ask the
> channel vanguard directly.
>
> I'll provide additional details as they become available. If you have
> any questions, please don't hesitate to reach out.
>
> With these changes in place, we'll be able to put more focus on your
> testing needs, having moved the Kernel and Server team lab
> environments to GSA control. We'll be able to more easily expand the
> pool of testbed hardware in 1SS, and we'll be able to complete the
> migration of our own systems into Prodstack and MAAS.
>
> The nitty gritty can be found in the RTs linked to from these two:
> https://rt.admin.canonical.com/Ticket/Display.html?id=74144
> https://rt.admin.canonical.com/Ticket/Display.html?id=69254
