Package: boinc-client
Version: 6.13.1+dfsg-2
Severity: normal

I hope someone can help me track this down. The relevant information is:

- I use World Community Grid.

- I have an Intel Core i7 2.8GHz (i.e. 4 cores, each with
  hyperthreading, so I can run 8 WUs in parallel).

- I've symlinked /var/lib/boinc-client into a directory on a
  1.8-terabyte ext4 filesystem. My backup system (backuppc), running
  on the same machine, also writes backups (of my home network) to
  that same filesystem (in a different directory).

- The occurrence of this bug *appears* to coincide with significant
  I/O load on the system - in particular, all the boinc WUs appear to
  crash (simultaneously) roughly 10-15 minutes after the start of
  backuppc's nightly pool cleanup (BackupPC_nightly). (A while ago I
  also often saw the WUs crash when using aptitude to upgrade
  packages, but this no longer seems to trigger it; perhaps a kernel
  upgrade or something similar made a difference there.)

- Obviously, boinc keeps my CPU permanently at its thermal limit, with
  my syslog full of MCE messages about automatic throttling.

I won't rule out a hardware failure, but if that were the cause, I'd
expect to see other things fail as well (which doesn't seem to happen),
or the crashes to happen less predictably than they do. If the problem
isn't in userspace, an ext4 bug or something similar seems more likely.

Anyway, recently, after starting to see a pattern, I attached strace to
some of the processes before the nightly backup cleanup started. It
showed sudden SIGSEGVs with nothing extraordinary before them, so the
next night I attached gdb to a process and waited for it to crash. When
it did, I saw that the stack pointer (%esp) was outside the bounds of
what appeared to be a 16K thread stack. It appeared that the stack had
overflowed, but since the WCG applications don't have debug symbols, it
wasn't clear why.

It'd be interesting to try increasing the stack size, but I'm not sure
how to tell boinc to do that.
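
The comparison described above (the %esp value from gdb against the bounds of the thread stack) can be repeated mechanically on later crashes. What follows is only a minimal sketch under stated assumptions: it assumes the contents of /proc/<pid>/maps were saved while gdb had the process stopped, the helper names (parse_maps, check_stack_pointer) are made up for illustration, and every address in SAMPLE_MAPS is invented rather than taken from the actual crash.

```python
from typing import List, NamedTuple, Optional, Tuple


class Region(NamedTuple):
    start: int   # first address of the mapping
    end: int     # first address past the mapping (exclusive)
    perms: str   # e.g. "rw-p"
    name: str    # pathname column, may be empty


# Invented example data in /proc/<pid>/maps format, NOT from the real crash.
# The third line is a 16K writable region with a guard page just below it.
SAMPLE_MAPS = """\
08048000-08100000 r-xp 00000000 08:01 1234       /hypothetical/wcg_app
b76ff000-b7700000 ---p 00000000 00:00 0
b7700000-b7704000 rw-p 00000000 00:00 0
bffdf000-c0000000 rw-p 00000000 00:00 0          [stack]
"""


def parse_maps(text: str) -> List[Region]:
    """Parse the text of /proc/<pid>/maps into a list of regions."""
    regions = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        name = fields[5].strip() if len(fields) > 5 else ""
        regions.append(Region(int(start_s, 16), int(end_s, 16), fields[1], name))
    return regions


def check_stack_pointer(
    regions: List[Region], sp: int
) -> Tuple[str, Optional[Region], int]:
    """Classify a stack pointer value against the writable mappings.

    Returns ("inside", region, 0) if sp lies in a writable mapping,
    ("below", region, gap) if the nearest writable mapping starts gap
    bytes above sp (what a downward stack overflow would look like),
    or ("unmapped", None, 0) if no writable mapping lies above sp.
    """
    for r in regions:
        if r.start <= sp < r.end and "w" in r.perms:
            return ("inside", r, 0)
    above = [r for r in regions if r.start > sp and "w" in r.perms]
    if not above:
        return ("unmapped", None, 0)
    nearest = min(above, key=lambda r: r.start)
    return ("below", nearest, nearest.start - sp)


if __name__ == "__main__":
    regions = parse_maps(SAMPLE_MAPS)
    for sp in (0xB7701000, 0xB76FFFF0):  # invented %esp values
        status, region, gap = check_stack_pointer(regions, sp)
        print(hex(sp), status, region, gap)
```

A "below" result with a small gap under a 16K writable region would be consistent with the overflow seen in gdb, although it cannot say which code used up the stack.
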

Besides, since the crash happens in all running WUs simultaneously,
regardless of application (they wouldn't all use the stack in exactly
the same way, would they?), perhaps it wouldn't help much. Perhaps
there is something like an infinite recursion problem common to all WCG
applications, though?

Any ideas on how to proceed?

-- Package-specific info:
-- Contents of /etc/default/boinc-client:
# This file is /etc/default/boinc-client, it is a configuration file for the
# /etc/init.d/boinc-client init script.

# Set this to 1 to enable and to 0 to disable the init script.
ENABLED="1"

# Set this to 1 to enable advanced scheduling of the BOINC core client and
# all its sub-processes (reduces the impact of BOINC on the system's
# performance).
SCHEDULE="1"

# The BOINC core client will be started with the permissions of this user.
BOINC_USER="boinc"

# This is the data directory of the BOINC core client.
BOINC_DIR="/var/lib/boinc-client"

# This is the location of the BOINC core client, that the init script uses.
# If you do not want to use the client program provided by the boinc-client
# package, you can specify here an alternative client program.
#BOINC_CLIENT="/usr/local/bin/boinc"
BOINC_CLIENT="/usr/bin/boinc"

# Here you can specify additional options to pass to the BOINC core client.
# Type 'boinc --help' or 'man boinc' for a full summary of allowed options.
#BOINC_OPTS="--allow_remote_gui_rpc"
BOINC_OPTS=""

-- System Information:
Debian Release: wheezy/sid
  APT prefers testing
  APT policy: (900, 'testing'), (600, 'stable'), (1, 'unstable')
Architecture: i386 (i686)

Kernel: Linux 3.0.0-1-686-pae (SMP w/8 CPU cores)
Locale: LANG=nb_NO.utf8, LC_CTYPE=nb_NO.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages boinc-client depends on:
ii  adduser                 3.113
ii  ca-certificates         20110502+nmu1
ii  debconf [debconf-2.0]   1.5.40
ii  libc6                   2.13-21
ii  libcurl3                7.21.7-3
ii  libgcc1                 1:4.6.1-4
ii  libssl1.0.0             1.0.0e-2
ii  libstdc++6              4.6.1-4
ii  python                  2.6.7-3
ii  zlib1g                  1:1.2.3.4.dfsg-3

boinc-client recommends no packages.

Versions of packages boinc-client suggests:
ii  boinc-app-seti      <none>
ii  boinc-manager       6.12.33+dfsg-1.1
ii  x11-xserver-utils   7.6+3

-- Configuration Files:
/etc/boinc-client/global_prefs_override.xml changed:
<global_preferences>
  <run_on_batteries>0</run_on_batteries>
  <run_if_user_active>1</run_if_user_active>
  <run_gpu_if_user_active>0</run_gpu_if_user_active>
  <idle_time_to_run>0.000000</idle_time_to_run>
  <suspend_cpu_usage>0.000000</suspend_cpu_usage>
  <start_hour>0.000000</start_hour>
  <end_hour>0.000000</end_hour>
  <net_start_hour>0.000000</net_start_hour>
  <net_end_hour>0.000000</net_end_hour>
  <leave_apps_in_memory>1</leave_apps_in_memory>
  <confirm_before_connecting>0</confirm_before_connecting>
  <hangup_if_dialed>0</hangup_if_dialed>
  <dont_verify_images>0</dont_verify_images>
  <work_buf_min_days>0.000000</work_buf_min_days>
  <work_buf_additional_days>0.250000</work_buf_additional_days>
  <max_ncpus_pct>100.000000</max_ncpus_pct>
  <cpu_scheduling_period_minutes>60.000000</cpu_scheduling_period_minutes>
  <disk_interval>30.000000</disk_interval>
  <disk_max_used_gb>64.000000</disk_max_used_gb>
  <disk_max_used_pct>80.000000</disk_max_used_pct>
  <disk_min_free_gb>0.500000</disk_min_free_gb>
  <vm_max_used_pct>75.000000</vm_max_used_pct>
  <ram_max_used_busy_pct>75.000000</ram_max_used_busy_pct>
  <ram_max_used_idle_pct>75.000000</ram_max_used_idle_pct>
  <max_bytes_sec_up>0.000000</max_bytes_sec_up>
  <max_bytes_sec_down>0.000000</max_bytes_sec_down>
  <cpu_usage_limit>100.000000</cpu_usage_limit>
  <daily_xfer_limit_mb>0.000000</daily_xfer_limit_mb>
  <daily_xfer_period_days>0</daily_xfer_period_days>
</global_preferences>

/etc/boinc-client/gui_rpc_auth.cfg [Errno 13] Permission denied: u'/etc/boinc-client/gui_rpc_auth.cfg'
/etc/boinc-client/remote_hosts.cfg changed:
192.168.1.3
192.168.1.4


-- debconf information:
  boinc-client/remove_boinc_dir: false