Hi Harry,


The usual clue I look for is:

Mar 18 17:21:56 opensol genunix: [ID 672855 kern.notice] syncing file systems...
Mar 18 17:21:57 opensol genunix: [ID 904073 kern.notice]  done
Mar 18 17:22:50 opensol genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 Version snv_110 64-bit

This usually means the kernel was instructed to quit, and did so relatively cleanly (at least syncing filesystems before resetting the host and rebooting). This could be due to halt, reboot, shutdown, or init, but also the uadmin command (or the uadmin(2) syscall into the kernel). It can also be due to a panic.

Look at the lines immediately above it, and you will hopefully find something like this:

Mar 18 17:21:50 opensol syslogd: going down on signal 15
Mar 18 17:21:51 opensol rpcbind: [ID 564983 daemon.error] rpcbind terminating on signal.

Again, this is a clue that a clean shutdown was issued rather than a panic or power loss. Other clues include lines like:

Apr 1 09:49:35 lab662 reboot: [ID 330035 auth.crit] initiated by root on /dev/console

This was the /usr/sbin/reboot command. A similar line is printed for /usr/sbin/halt.
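
If you would rather not eyeball the whole file, the clues above can be picked out with a small script. This is only a rough sketch in Python: the patterns are copied from the example lines in this mail, the labels and the function names (find_shutdown_clues, scan_log) are made up for illustration, and I am assuming the halt line carries a "halt:" tag in the same way the reboot line carries "reboot:".

```python
import re

# Patterns lifted from the example log lines above; labels are illustrative only.
CLUES = [
    ("kernel synced file systems", re.compile(r"genunix: .*syncing file systems")),
    ("daemon stopped by signal", re.compile(r"going down on signal|terminating on signal")),
    # Assumption: /usr/sbin/halt logs under a "halt:" tag, like "reboot:" does.
    ("reboot/halt command", re.compile(r"\b(reboot|halt): .*initiated by \S+ on \S+")),
    ("kernel booted", re.compile(r"genunix: .*SunOS Release")),
]


def find_shutdown_clues(lines):
    """Return (label, line) pairs for each line that matches a known clue."""
    hits = []
    for line in lines:
        for label, pattern in CLUES:
            if pattern.search(line):
                hits.append((label, line.rstrip("\n")))
                break  # first matching label wins
    return hits


def scan_log(path="/var/adm/messages"):
    """Print every clue found in the given log file."""
    with open(path, errors="replace") as log:
        for label, line in find_shutdown_clues(log):
            print(f"[{label}] {line}")
```

Calling scan_log() with no argument reads /var/adm/messages. A "kernel booted" hit with none of the other labels shortly before it would point away from a clean shutdown.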


Also, the output from "last -10 reboot" will show roughly what time the system went down (in case you didn't already know). This should be accurate to within 60 seconds in the case of an unclean shutdown (e.g. a crash, power loss, or reset button). Sadly, it cannot distinguish between a clean and an unclean shutdown at present.

This isn't an exhaustive list of things, but hopefully a starting point.

Other big clues are warnings in /var/adm/messages relating to temperature, or the string "panic", usually accompanied by a line on reboot along the lines of "reboot after panic...", and the generation of a large vmcore.X file in /var/crash/<hostname>.
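
The panic and temperature checks can be scripted in the same spirit. Again, this is a minimal sketch rather than a proper tool: it only looks for the two keywords mentioned above ("panic" and "temperature"), the function names are hypothetical, and you have to pass in the crash directory yourself (for example /var/crash/ followed by your hostname).

```python
import glob
import os
import re

PANIC_RE = re.compile(r"panic", re.IGNORECASE)
# Only the keyword named above; vendor-specific thermal wording is not covered.
TEMP_RE = re.compile(r"temperature", re.IGNORECASE)


def find_crash_clues(lines):
    """Split log lines into panic-related and temperature-related hits."""
    clues = {"panic": [], "temperature": []}
    for line in lines:
        text = line.rstrip("\n")
        if PANIC_RE.search(text):
            clues["panic"].append(text)
        if TEMP_RE.search(text):
            clues["temperature"].append(text)
    return clues


def list_vmcores(crash_dir):
    """Return the vmcore.* files in crash_dir, sorted by name (empty if none)."""
    return sorted(glob.glob(os.path.join(crash_dir, "vmcore.*")))
```

Feed find_crash_clues() the open /var/adm/messages file. An empty result from both functions does not rule out a hardware problem; it just means the kernel left no trace of one.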


The time-slider messages are probably not relevant to this issue (but there have been bugs against time-slider in the past...).

Hope that helps move you forward.
Brian



Harry Putnam wrote:
setup:
Athlon64 2.2ghz 3400+ - AK86-L Aopen mobo (Topped out at 3gb ram)
4 500gb IDE drives on IDE controllers
2 750gb SATA drives on PCI sata controller (adaptec 1205sa [Sil3112a chip])
Currently: osol-2008.11 build 110
=====     *     =====     *     =====     *     =====
First, this is not something that I can say is related to build 110.
It was going on before I upgraded from 109.

I'm experiencing spontaneous shutdowns and am not finding anything in
the logs /var/adm/messages or /var/log/syslog that I recognize as
being a clue to why.

I can post an extract including the time frame of shutdown but to me
it looks totally normal... (I'm not experienced in debugging though)

Also I'm not really sure where to look for clues beyond
/var/log/syslog and /var/adm/messages.

I've got a hunch this may be about hdd overheating, but that is only
because I feel what seems to me to be abnormal heat when I touch the
drives, especially the 2 sata drives on an
  Adaptec 1205sa (Sil3112a chip).

It may be normal heat... I'm not sure... but I really have no idea
what else might provoke a shutdown ... not really sure overheating of
hdd would do that (force a shutdown).

The biggest change I've made most recently was to upgrade the size of
a mirrored 200gb pool to 750gb.  Those drives are on the sata
controller referenced above.  But the 200gb had been running on that
controller for some time.

I also had to flash the bios of that controller during the upgrade, to
make it recognize the new 750gb Sata II drives.

I don't remember seeing a spontaneous shutdown before making those
changes.
However, I am getting some errors from something to do with the
timeslider mechanism.  I see them on boot up from the `startd'
service.  Where

  svc:/application/time-slider:default

is moved to maintenance by request of time-slider `frequent' and
`hourly' services.

Attempting to restart the time-slider service results in it being moved
to `Maintenance mode' again.   The `frequent' timeslider service is not
finding a crontab, according to that service's log.
That sounds like some kind of permissions problem and not something
that would invoke a shutdown.

I guess that might be related to the shutdowns though, so I have inlined
the output of `svcs -vx' below:

If that isn't it, where else should I look for clues, and are there
other logs I should be examining?

svcs -vx:
svc:/system/filesystem/zfs/auto-snapshot:frequent (ZFS auto snap..)
 State: maintenance since Tue Mar 31 12:20:19 2009
Reason: Maintenance requested by
        "svc:/system/filesystem/zfs/auto-snapshot:frequent"
   See: /var/svc/log/system-filesystem-zfs-auto-sn..:frequent.log
   See: http://sun.com/msg/SMF-8000-R4
   See: /var/svc/log/system-filesystem-zfs-auto-sn..:frequent.log
Impact: 1 dependent service is not running:
        svc:/application/time-slider:default

svc:/system/filesystem/zfs/auto-snapshot:hourly (ZFS auto sn..)
 State: maintenance since Tue Mar 31 12:20:17 2009
Reason: Maintenance requested by
        "svc:/system/filesystem/zfs/auto-snapshot:hourly"
   See: /var/svc/log/system-filesystem-zfs-auto-snapshot:hourly.log
   See: http://sun.com/msg/SMF-8000-R4
   See: /var/svc/log/system-filesystem-zfs-auto-snapshot:hourly.log
Impact: 1 dependent service is not running:
        svc:/application/time-slider:default

_______________________________________________
indiana-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/indiana-discuss

--
Brian Ruthven                                        Sun Microsystems UK
Solaris Revenue Product Engineering             Tel: +44 (0)1252 422 312
Sparc House, Guillemont Park, Camberley, GU17 9QG
