Re: FreeBSD 13.2-RELEASE started failing to load i915kms.ko after upgrade from RC5

2023-04-09 Thread Peter 'PMc' Much
On 2023-04-09, Yoshihiro Ota  wrote:
> Hi,
>
> I've been following releng/13.2 since it was branched.
> I use amd64 arch for this.
>
> I had built kernel modules during BETA/RC period.
> The above i915kms had worked until RC5.
> I had not built RC6 locally and picked up RELEASE on releng/13.2.
[...]
> Any hints or same experiences?

This is a bit strange - I had that same issue between RC2 and RC3.

And there the commit logs show that somebody did explicitly change
the version number (from 1302000 to 1302001), for reasons
I didn't fully grok.

It seems the modules do not like such a change, and have to be rebuilt.

That version number is visible with pkg info:
> Annotations:
>FreeBSD_version: 1302000

It is also present in the base installation ...

$ grep FreeBSD_version /usr/include/sys/param.h
#define __FreeBSD_version 1302000   /* Master, propagated to newvers */

 ... and in the kernel source:
$ grep FreeBSD_version /usr/src/sys/sys/param.h 
#define __FreeBSD_version 1302000   /* Master, propagated to newvers */

And somewhere it lingers also in the kernel itself:
$ strings /boot/kernel/kernel | grep 13020
1302000
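So the check boils down to comparing two numbers. A hedged sketch (not from the original mail, and the concrete values below are stand-ins): on a real system you would obtain the module's baked-in version e.g. with strings(1) on the .ko file, and the kernel's with `sysctl -n kern.osreldate`.

```shell
# Stand-in values; on a live system something like
#   built=$(strings /boot/modules/i915kms.ko | grep -Eo '13020[0-9]{2}' | head -1)
#   running=$(sysctl -n kern.osreldate)
# would fill these in (paths and patterns are assumptions, check them).
built=1302000
running=1302001

# A module built against a different __FreeBSD_version needs a rebuild.
if [ "$built" != "$running" ]; then
    echo "mismatch: module built for $built, kernel is $running - rebuild the module"
else
    echo "versions match"
fi
```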




Re: Camcontrol question related to Seagate disks

2023-06-06 Thread Peter 'PMc' Much
On 2023-06-06, Karl Denninger  wrote:
> Certain "newer" Seagate drives have an EPC profile that doesn't interact 
> as expected with the camcontrol "standard" way to tell spinning disks to 
> go into an idle state.
>
> Specifically those that support this: 
> https://www.seagate.com/files/docs/pdf/en-GB/whitepaper/tp608-powerchoice-tech-provides-gb.pdf

"This" whole lengthy babble sounds just like EPC.

> The usual has been "camcontrol idle da{x} -t 600" has typically resulted 
> in a 10 minute timeout, after which it goes into low power

I entertain quite a zoo of disks of various brands and ages, and my
impression is that about a third of them behave as "typically"
expected in that regard. And practically every model behaves
differently, some of them in obscure and unexpected ways.

Sadly I don't have actual SAS devices available (and it is not
clear to me whether Yours is SAS or SATA).
With SATA You can send low-level commands to the disk via
'camcontrol cmd' - *IF* you manage to figure out what these commands
should read.
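As a sketch I haven't verified on real hardware: per the ATA spec, STANDBY is opcode E2h and takes its timeout in the count register in units of 5 seconds. The arithmetic for the timer byte is at least simple; where exactly that byte goes in camcontrol's "-a" string is something to check against camcontrol(8) before sending anything.

```shell
# Assumption: ATA STANDBY (opcode E2h) takes a timer in 5-second units
# (values 1..240) in the count register. Compute the byte for 10 minutes.
timeout_s=600
count=$(( timeout_s / 5 ))           # 600 s / 5 s per unit = 120
printf 'STANDBY opcode: E2, timer byte: %02X\n' "$count"
```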

> (sometimes 
> you want "standby" rather than "idle" depending on the recovery time and 
> power mode you're after and the specifics of the drive in question.)

And which one would You want?

Short abstract of the crap:
 - deskstar/ultrastar older models may have a two-level timer variable
   with separate values for idle and stop.
 - WD (whatever rainbow) may have the timer value hidden behind a "vendor
   specific" gate, as described on truenas ("hacking green and red...").
 - ultrastar (newer) may have EPC, but the timer values bear only a
   vague resemblance to real time.
 - seagate (consumer) may be configured to kill themselves with an
   incredible number of almost-immediate head parkings.
 - ...

I for my part got tired of the whole business. There is a little tool
in ports, sysutils/gstopd, that can easily be extended to handle SATA;
then the machine (and not the disk) controls when the disk is
to stop. (Ask me for the patch.)

> The reason is that /it appears //these drives, on power-up, do not 
> enable the timers /until and unless you send a SSU "START" with the 
> correct power conditioning bits. Specifically, the power conditioning 
> value of "7h" has to be specified.  If its not then the EPC timers are 
> present but, it appears, they are not used.

If You can figure out what this command should actually look like
byte-wise, You can probably send it.

> Does anyone know the proper camcontrol command to do this?  The "start" 
> command sent when the system spins up does not appear to do so.  If I 
> send an "idle" or "standby" to the drive with a timeout it takes
> effect

That stays unclear to me. "start" is SCSI; "idle" and "standby" are
SATA.

> immediately but any access to it spins it up (as expected) and it does 
> not re-enter the lower-power mode on the timer, implying that the SSU 
> command did not enable the timers, and thus they remain inactive even 
> though they ARE set and camcontrol does report them.

Hm, what is SSU? Staggered Spin-Up? I'm trying to configure delayed
spin-up with this one on plain SATA:
# camcontrol cmd /dev/ada2 -a "EF 06 00 00 00 00 00 00 00 00 00 00"
but that doesn't really work, because the device driver has its own
ideas about when to taste the device...

No, wait...

> To allow unlimited flexibility in controlling the drive’s
> PowerChoice technology feature, the Start/Stop Unit (SSU) SCSI
> command can be used.

Hm... so this would be the *SCSI* command 0x1B... then this one should
work (here is the 'STOP' incantation; replace with Your proper bit values
according to that seagate paper):
# /sbin/camcontrol cmd /dev/xxx -c "1B 01 00 00 00 00"
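For the START variant with the power condition Karl mentions, a hedged sketch (unverified against real hardware): in the SCSI START STOP UNIT CDB, byte 4 carries POWER CONDITION in bits 7:4 and the START bit in bit 0, so power condition 7h plus START would come out as 71h.

```shell
# Assumption: START STOP UNIT CDB byte 4 = (POWER CONDITION << 4) | START.
pc=7            # power condition value, per that Seagate paper
start=1         # START bit set
byte4=$(printf '%02X' $(( (pc << 4) | start )))
echo "camcontrol cmd /dev/daX -c \"1B 00 00 00 $byte4 00\""
```

Whether the drive then actually arms its EPC timers is exactly the open question - but at least the bytes would match the paper.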

HTH
PMc



Re: EARLY_AP_STARTUP now (effectively) mandatory?

2023-08-08 Thread Peter 'PMc' Much
On 2023-08-07, Garrett Wollman  wrote:

> This option was apparently added in 2016 by jhb@, and in his
> PHabricator description, he wrote:
>
>   As a transition aid, the new behavior is moved under a new
>   kernel option (EARLY_AP_STARTUP). This will allow the option
>   to be turned off if need be during initial testing. I hope to
>   enable this on x86 by default in a followup commit and to have
>   all platforms moved over before 11.0. Once the transition is
>   complete, the option will be removed along with the
>   !EARLY_AP_STARTUP code.
>

I remember reading that statement, so I probably ran into this one
at some point, too.

It seems we who build our own custom kernels are becoming
a minority. :(



Re: Interesting (Open)ZFS issue

2023-08-17 Thread Peter 'PMc' Much
On 2023-08-13, Garrett Wollman  wrote:

> This seems to me like a bug: `zpool scrub` correctly identified the
> damaged parts of the disk, so ZFS knows that those regions of the pool
> are bad in some way -- they should cause an error rather than a panic!

Yes, but it does. On seriously inconsistent data - and zero-fill is
seriously inconsistent - it can behave badly. I think one can hardly
code for and catch every possible exception while still providing
excellent performance.

OTOH, I adopted ZFS very early for my database, and I am usually
running on scrap hardware, but it never gave me a real data loss issue.

cheers,
PMc



Re: vfs.zfs.compressed_arc_enabled=0 is INCOMPATIBLE with L2ARC at least in FreeBSD 13 (Was: Crash on adding L2ARC to raidz1 pool)

2024-01-13 Thread Peter 'PMc' Much
On 2024-01-13, Alexander Burke  wrote:
> Hello,
>
> It looks like the issue is fixed in OpenZFS 2.2 (and thus in FreeBSD 
> 14-RELEASE):
>
> https://github.com/openzfs/zfs/issues/15764#issuecomment-1890491789
>
> Cheers,
> Alex
> 
>
> Jan 13, 2024 12:26:50 Lev Serebryakov :
>
>> On 08.01.2024 18:34, Lev Serebryakov wrote:
>> 
>>    I've found that all my L2ARC problems (live-locks and crashes) are result 
>> of OpenZFS bug which can not support L2ARC with un-compressed ARC 
>> (vfs.zfs.compressed_arc_enabled=0).
>> 
>>    It is NOT hardware-depended (and my NVMe is perfectly Ok and healthy) and 
>> could be easily reproduced under VM with all-virtual disks.
>> 
>>    I've opened the ticket in OpenZFS project 
>> (https://github.com/openzfs/zfs/issues/15764).
>> 
>>    Maybe, FreeBSD need ERRATA entry?
>> 
>> 
>>    Previous threads:
>> 
>>     [1] ZFS pool hangs (live-locks?) after adding L2ARC
>>     [2] Crash on adding L2ARC to raidz1 pool
>> 
>> -- 
>> // Lev Serebryakov
>
>

Just for the record, there is a note in my loader.conf:

> vfs.zfs.compressed_arc_enabled="1" # 27.7.17: since R11.1 l2_cksum_errs if 0

Apparently I didn't bother to open a ticket, since the general stance
was that one shouldn't mess with the defaults.

The more interesting question might be how we managed to improve from
checksum errors (that were otherwise harmless) to "live-locks and
crashes" ;)



Re: git log - how to find out latest stable/14 breakage

2024-01-20 Thread Peter 'PMc' Much
On 2024-01-20, Harry Schmalzbauer  wrote:

> How can you all manage your daily jobs with git?!?!  For me as a 

daily jobs? Hm, maybe that's the mistake. This is FreeBSD, this ought
to be fun! (I don't have a job, I don't get a job, I'm just normal
unemployed trash :/ ). But to answer your question:

> part-time RCS user, git is a huge regression.  Never had anything to 
> lookup/read twice with subversion or cvs in the past, but never
> found

What I did, is put this into /usr/local/etc/gitconfig:
[alias]
dir = log --topo-order --compact-summary --pretty=fuller

This is slow, but it gives a log output that is more exhaustive,
and similar to the one I could get from SVN.
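To see what the alias expands to without touching a real tree, here is a self-contained demo in a throwaway repository (assumes only that git is installed; names are made up for the demo):

```shell
# Create a scratch repo with one commit and run the expanded alias.
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "demo commit"

# This is what "git dir" runs, given the [alias] section above:
git log --topo-order --compact-summary --pretty=fuller
```

The --pretty=fuller output shows separate Author/AuthorDate and Commit/CommitDate fields, which is the extra detail missing from the default log.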

But then, self-generating source files are always a bit difficult.
At least people here are quite strict with the naming, so the
appearance of a second dot in a source file name should give
an alert that something unkosher is going on.

Cheerio!



Re: gpart device permissions security hole (/dev/geom.ctl)

2024-02-24 Thread Peter 'PMc' Much
On 2024-02-24, Miroslav Lachman <000.f...@quip.cz> wrote:
> On 24/02/2024 21:00, Vincent Stemen wrote:
>> On Sat, Feb 24, 2024 at 04:40:00PM +0100, Miroslav Lachman wrote:
>>> I agree with this security problem. Just a small note - there are
>>> backups of partitions (/var/backups/gpart.*) created by periodic script
>>> /etc/periodic/daily/221.backup-gpart (if you have
>>> daily_backup_gpart_enable="YES" in your /etc/periodic.conf or in a
>>> /etc/defaults/periodic.conf which is the default). That way you can get
>>> back the number plate on you house in some cases.
>> 
>> Thanks.  That's good to know.  I was not aware of those features of
>> periodic.
>
> Almost nobody knows.

Oh, now I see why there is a problem.

Actually I found the partition tables missing when I planned for
disaster recovery, and thought it would be helpful to have a copy
of them. So I implemented such a periodic backup long before
it was officially provided.

I think there are many ways things can go wrong, and
evil action is only one of them. So my first imperative is to get
the data safely into backup (and then the backup offsite). That
accomplished, we can, in a relaxed mood, think about what we will
do to the person who actually manages to delete the partition
table...

cheerio,
PMc



Re: 13-STABLE high idprio load gives poor responsiveness and excessive CPU time per task

2024-02-29 Thread Peter 'PMc' Much
On 2024-02-27, Edward Sanford Sutton, III  wrote:
>More recently looked and see top showing threads+system processes 
> shows I have one core getting 100% cpu for kernel{arc_prune} which has 
> 21.2 hours over a 2 hour 23 minute uptime.

Ack.

> I started looking to see if 
> https://www.freebsd.org/security/advisories/FreeBSD-EN-23:18.openzfs.asc 
> was available as a fix for 13 but it is not (and doesn't quite sound 
> like it was supposed to apply to this issue). Would a kernel thread time 
> at 100% cpu for only 1 core explain the system becoming unusually 
> unresponsive?

That depends. This arc_prune issue usually goes along with some
other kernel thread (vm-something) also blocking, so you have two cores
busy. How many remain?

There is an updated patch in PR 275594 (5 pieces) that works for
13.3; I have it installed, and only with it am I able to build gcc12
 - otherwise the system would just OOM-crash (vm.pageout_oom_seq=5120
does not help with this).

I didn't see any lagging behaviour, but then, I have 20 vCores.

cheerio,
PMc