Re: [gpsd-dev] refclock 28 gone wacky on me

2016-06-19 Thread Hal Murray

g...@rellim.com said:
> Yes, that is expected.  You need to tell the Skytraq to force the top of
> the second, and save to flash.

What does that mean?


bellyac...@gmail.com said:
> Did that which is why I didn't understand the delivery coming near the
> end of the second.  It appears, though, that the firmware *slowly* brings
> the delivery back to where it should be.

I've lost track of the details of this discussion.

Many GPS chips seem to have a 100 ms timer that drifts slowly so the offset 
within the second when the NMEA strings come out will slowly drift then jump 
back and start over.
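
To make the pattern concrete, here is a toy model of that sawtooth. The 100 ms period comes from the description above; the starting offset and drift rate are made-up numbers, and nmea_offset_ms is a hypothetical helper, not anything from gpsd or ntpd.

```python
def nmea_offset_ms(t_seconds, start_ms=30.0, drift_ms_per_s=0.01, period_ms=100.0):
    """Offset within the timer period at which the NMEA output starts, in ms.

    The offset grows linearly with the drift and wraps when it reaches the
    timer period, which gives the slow-drift-then-jump behaviour.
    """
    return (start_ms + drift_ms_per_s * t_seconds) % period_ms


# Sample the model once before the wrap, twice near it, and once after it.
for t in (0, 3600, 6900, 7200):
    print(t, round(nmea_offset_ms(t), 3))
```

With these illustrative numbers the offset creeps from 30 ms toward 100 ms over a bit less than two hours, then wraps back to near 0.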

-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Fix for startup bug - please test

2016-06-19 Thread Hal Murray

Details in
  https://gitlab.com/NTPsec/ntpsec/issues/68


-- 
These are my opinions.  I hate spam.





Fedora kernels missing hardpps

2016-06-20 Thread Hal Murray

Mumble.  Long story.

There are two parts of PPS processing in the kernel.  One is RFC 2783 which 
describes an API for capturing the time when a pulse happens.  The other is 
RFC 1589 which describes a PLL which basically moves all the timekeeping work 
into the kernel.   If you turn on flag3 with the PPS driver, it tries to 
activate the kernel PLL.

I hadn't tried the kernel PLL for a while, maybe because I turned it off ages 
ago when the Linux PPS area was being rewritten.  So I tried it.  It gives a 
not-implemented error.  But I thought the code was there.

Poking around, that chunk of code in the kernel depends upon NTP_PPS which 
says:

config NTP_PPS
        bool "PPS kernel consumer support"
        depends on !NO_HZ
        help
          This option adds support for direct in-kernel time
          synchronization using an external PPS signal.

          It doesn't work on tickless systems at the moment.

The kernels shipped with Fedora, Debian, and Ubuntu all have NO_HZ turned on.
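
A quick way to check a given kernel is to look at its config text for the two options involved. This is only a sketch: hardpps_config_status is a hypothetical helper, and it assumes you already have the config as text (many distributions ship it as /boot/config-<release>, some kernels expose /proc/config.gz).

```python
def hardpps_config_status(config_text):
    """Return the values of the two relevant options from kernel config text."""
    wanted = ("CONFIG_NTP_PPS", "CONFIG_NO_HZ")
    status = {name: "not set" for name in wanted}
    for line in config_text.splitlines():
        line = line.strip()
        # "# CONFIG_FOO is not set" lines are comments; skip them.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name in status:
            status[name] = value
    return status


# Made-up sample in the usual kernel .config format.
sample = """\
CONFIG_NO_HZ=y
# CONFIG_NTP_PPS is not set
CONFIG_PPS=y
"""
print(hardpps_config_status(sample))
```

If CONFIG_NO_HZ is y and CONFIG_NTP_PPS is not set, flag3 on the PPS driver should be expected to give the not-implemented error.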

I haven't found the config file for any of the ARM kernels.  One try got the 
same result.

I guess I'll try building a kernel and/or running on FreeBSD or NetBSD.


-- 
These are my opinions.  I hate spam.





Re: Documenting some progress - magic refclock addresses are almost gone

2016-06-25 Thread Hal Murray

e...@thyrsus.com said:
> Does anyone on the list understand mode 6 well enough to answer questions?
> My main one is: if I add a field to a mode 6 response, is it going to break
> old ntpqs or will they silently ignore it? 

I think they ignore it, but try it to be sure.
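
For what it's worth, this is the kind of reader that would not care about an extra field. It is not ntpq's code and says nothing about what old ntpq binaries actually do; it assumes the response is a plain "name=value, name=value" list with no commas inside values, and the field name clockname in the example is hypothetical.

```python
KNOWN_FIELDS = ("srcadr", "stratum", "offset")  # illustrative subset


def parse_varlist(payload):
    """Split a 'name=value, name=value' list into a dict."""
    fields = {}
    for item in payload.split(","):
        name, sep, value = item.strip().partition("=")
        if sep:  # skip fragments that are not name=value
            fields[name] = value.strip('"')
    return fields


def peer_summary(payload):
    """Pick only the fields this reader knows about; extras are ignored."""
    parsed = parse_varlist(payload)
    return {name: parsed.get(name) for name in KNOWN_FIELDS}


old = "srcadr=192.0.2.1, stratum=2, offset=-0.209"
new = old + ', clockname="SHM"'  # hypothetical added field
print(peer_summary(old) == peer_summary(new))
```

A reader that walks the list positionally, or rejects unknown names, would behave differently, which is why trying it against an old ntpq is still the real test.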

> (The response field I intend to add, of course, is to the peer query and is
> a refclock type name - empty for real servers.) 

Beware.  There are 4 variations on the peers command in ntpq.

The "o" types print the dispersion rather than jitter.
The "l" types print the local IP address rather than the refid.
At least I thought that was what should happen.  lpeers doesn't work that way.

There is also apeers.


-- 
These are my opinions.  I hate spam.





Re: Use of pool servers reveals unacceptable crash rate in async DNS

2016-06-25 Thread Hal Murray

e...@thyrsus.com said:
> 1. Apply Classic's workaround for the problem, which I don't remember the
> details of but involved some dodgy nonstandard linker hacks done through the
> build system.  *However, I did not trust this method when I understood it.*
> It seemed sure to cause porting difficulties and is inherently fragile. 

k...@roeckx.be said:
> If it's the one I'm thinking about, I think the solution is to remove the
> locking of memory. 

We may be confusing several bugs.

There was a problem with locking stuff into memory.  Some library needed by 
end of thread processing wasn't loaded yet and things worked out such that 
with the default memory 32 bit systems worked but 64 bit systems didn't have 
enough room.

I think one solution was to create a dummy thread early on to get that module 
loaded.  Or disable memory locking, or tell it to use more memory, or ...


> 2. Fix the actual problem. Well, that'd be nice, but Hal looked into it
> months ago and said he understood it but couldn't generate a fix. IIRC, he
> said it needed a full rewrite.  That tells me the code is probably not
> salvageable. 

I don't remember that part.  I use the pool command on several systems.  I 
haven't seen a crash in ages.

There was another interesting problem in this area.  It was a bug in 
FreeBSD's trap handler.  ntpd managed to trigger it consistently.

.

> I favor #4.

I favor understanding things more.

Can you get a stack trace?


-- 
These are my opinions.  I hate spam.





Re: Use of pool servers reveals unacceptable crash rate in async DNS

2016-06-25 Thread Hal Murray

e...@thyrsus.com said:
> I think the hack is to force libgcc_s to be loaded early. I don't know how
> to do that in waf. 

There are two problems in this area.  One is the end-of-thread code not 
getting locked into memory.  I think that is what you are running into.

The other is a tangle of error handling on out-of-memory issues by things 
like pthread_create and DNS lookup.  I think the latter end up with a retry 
error code.  I think I fixed some/many of them to crash rather than retry on 
the assumption that memory wasn't going to get freed and I didn't know of any 
other reason to retry.  But that was a long time ago (maybe pre fork) and I 
don't remember the details.


I think we should copy the warmup code from ntp classic.  It's basically an 
upstream bug.  Warmup seems like a reasonable work around.

It's in ntpd/ntpd.c.  Search for NEED_PTHREAD_WARMUP and back up over the long 
comment which describes what's going on.

There is a note about not working on FreeBSD.  I haven't sorted that out.  It 
may refer to the linker hack.

Here are the bugs I remember:
  https://bugs.ntp.org/show_bug.cgi?id=2831
FreeBSD page fault story, morphs into lock discussion
  https://bugs.ntp.org/show_bug.cgi?id=2905
rlimit/memlock discussion

There is more info in various bugs:
  https://bugs.ntp.org/show_bug.cgi?id=2332
  https://bugs.ntp.org/show_bug.cgi?id=2954
  https://bugs.ntp.org/show_bug.cgi?id=2817
The signal/noise may not be good.



-- 
These are my opinions.  I hate spam.





Re: Use of pool servers reveals unacceptable crash rate in async DNS

2016-06-25 Thread Hal Murray

e...@thyrsus.com said:
> In this case, we have two possible complexity-reducing fixes.  One is to
> drop the memlock feature entirely.  The other is to drop the buggy homebrew
> asynchronous-DNS lookup from Classic and use libc's.

Dropping memlock is an interesting idea.  I can't think of any place where it 
is required today but my crystal ball for what we will need tomorrow has 
never been very good.

What would you do if we discovered a case where we wanted it?

We could try simplifying things to only supporting lock-everything-I-need 
rather than specifying how much.  There might be a slippery slope if 
something like a thread stack needs a sane size specified.

Is there a simple way to count page faults for a process?  Or measure swapped 
out data and/or code that isn't swapped in?


I don't think your use-libc approach will be as simple as you would like.  It's 
not available on NetBSD or FreeBSD.  Maybe I just didn't look in the right 
place.  It's not in netdb.h where it is for Linux.



-- 
These are my opinions.  I hate spam.





Re: My first positive structural change to NTP

2016-06-25 Thread Hal Murray
> Here's how I think it should look:

> --
> refclock shm unit 0 refid GPS
> refclock shm unit 1 prefer refid PPS
> --

I think you should start a list of that sort of change.

Currently, we can switch between our code and ntpd classic.  The same 
ntp.conf works for both.

I think we should preserve that until we make an explicit decision that it's 
the right time to make the break.

---

> Oh well...almost everyone disables remote querying anyway.  

It may be disabled for general IP addresses, but it's used all the time for 
monitoring your own servers.


-- 
These are my opinions.  I hate spam.





Re: My first positive structural change to NTP

2016-06-26 Thread Hal Murray

strom...@nexgo.de said:
> I think that's still perpetuating a mistake.  This whole business of having
> to specify two servers (or refclocks) for the same thing should go away.

There is a fundamental issue.  With a PPS, there really are two sources of 
time.  Internally, ntpd needs two different handles so you can see both sets 
of info in ntpq -p and clockstats.

Normally, each PPS has an associated serial stream.  It would be good if 
there were a clean way to specify that rather than using the prefer kludge.


strom...@nexgo.de said:
> It's easy enough these days to tell udev what each device should be named,
> so in principle there wouldn't even be a need to use anything but the
> default names. 

Is there a udev equivalent on other OSes?

I don't think it is necessary.  A boot time script can set up symbolic links.
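
Roughly like this. It is a sketch under assumptions: link_devices is a hypothetical helper, the gps0 -> ttyUSB0 mapping is only an example, and a real script would need to run as root at boot.

```python
import os


def link_devices(links, dev_dir="/dev"):
    """Create or refresh symlinks such as gps0 -> ttyUSB0 inside dev_dir."""
    for link_name, target in links.items():
        path = os.path.join(dev_dir, link_name)
        if os.path.islink(path):
            os.remove(path)  # replace a stale link left from a previous boot
        os.symlink(target, path)


# Hypothetical mapping; the right targets depend on the machine.
LINKS = {"gps0": "ttyUSB0"}

# A boot script would then call:  link_devices(LINKS)
```

Nothing here depends on udev, so the same idea should carry over to other systems that have symbolic links.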


-- 
These are my opinions.  I hate spam.





Re: Use of pool servers reveals unacceptable crash rate in async DNS

2016-06-26 Thread Hal Murray

e...@thyrsus.com said:
>> Is getaddrinfo_a() in RTEMS?  QNX?   BSD?
> It's not an OS thing, it's a toolchain thing.  getaddrinfo_a() is
> implemented using standard C and POSIX threads, it doesn't need OS-specific
> support.

Or it's in an optional extra library.

> Linux has it because Linux uses libc whether you're compiling with gcc or
> clang.  Any of those other platforms will have it *if* they have (gcc ||
> clang) && glibc. 

My Linux man page says:
   #define _GNU_SOURCE /* See feature_test_macros(7) */
   Link with -lanl.

I couldn't find it in /usr/include/ on NetBSD or FreeBSD.  On Linux, it's in 
netdb.h.

--

If it uses threads, we still have the problem of not being able to load the 
thread cleanup code.


-- 
These are my opinions.  I hate spam.





Re: Use of pool servers reveals unacceptable crash rate in async DNS

2016-06-26 Thread Hal Murray

e...@thyrsus.com said:
>> We could try simplifying things to only supporting lock-everything-I-need 
>> rather than specifying how much.  There might be a slippery slope if 
>> something like a thread stack needs a sane size specified.

> I'm not intimate with mlockall, but it looks like it works that way now. 

There is a back door way to specify a limit.  Part of it is the total.  Part 
of it is the stack size for new threads.

[way to count page faults]
> I don't know.  I can do some research, but I'm not sure "enough page faults
> to merit memory locking" would be a well-defined threshold even if I knew
> how to count them. 

If the answer was 0 then we wouldn't have to discuss the threshold.

--


> I believe you're right that these platforms don't have it.  The question is,
> how important is that fact?  Is the performance hit from synchronous DNS
> really a showstopper?  I don't know the answer. 

There are two cases I know of where ntpd does a DNS lookup after it gets 
started.

One is the try again when DNS for the normal server case doesn't work during 
initialization.  It will try again occasionally until it gets an answer. 
(which might be negative)

The main one is the pool code trying for a new server.  I think we should be 
extending this rather than dropping it.  There are several possibilities in this 
area.  The main one would be to verify that a server you are using is still 
in the pool.  (There isn't a way to do that yet - the pool doesn't have any 
DNS support for that.)  The other would be to try replacing the poorest 
server rather than only replacing dead servers.

DNS lookups can take a LONG time.  I think I've seen 40 seconds on a failing 
case.

If we get the recv time stamp from the OS, I think the DNS delays won't 
introduce any lies on the normal path.  We could test that by putting a sleep 
in the main loop.  (There is a filter to reject packets that take too long, 
but I think that's time-in-flight and excludes time sitting on the server.)

There are two cases I can think of where a pause in ntpd would cause 
troubles.  One is that it would mess up refclocks.  The other is that packets 
will get dropped if too many of them arrive.

I think that means we could use the pool command on a system without 
refclocks.  That covers end nodes and maybe lightly loaded servers.

---

It's worth checking out the input buffering side of things.  There may be 
some code there that we don't need.  I think there is a pool of buffers.  
Where can a buffer sit other than on the free queue?  Why do we need a pool?



-- 
These are my opinions.  I hate spam.





Re: Use of pool servers reveals unacceptable crash rate in async DNS

2016-06-26 Thread Hal Murray

Possible crazy idea...

How about we never kill the DNS helper thread.  Just let it sit there in case 
it gets more work to do.  The only cost is a bit of memory.

Or maybe only do that if we are locking stuff into memory.
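
The shape of the idea, sketched in Python rather than ntpd's C: one worker thread is created once and then blocks on a queue until a lookup arrives. DnsHelper and its methods are hypothetical names, and port 123 is just the NTP port passed to the resolver.

```python
import queue
import socket
import threading


class DnsHelper:
    """One long-lived worker thread that resolves names handed to it."""

    def __init__(self, resolver=socket.getaddrinfo):
        self._resolver = resolver
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()  # created once, never torn down

    def _run(self):
        while True:
            name, callback = self._requests.get()  # sleeps while idle
            try:
                result = self._resolver(name, 123)
            except OSError as err:
                result = err  # report the failure, keep the thread alive
            callback(name, result)

    def lookup(self, name, callback):
        """Queue a lookup; callback(name, result) runs on the helper thread."""
        self._requests.put((name, callback))
```

Since the thread never exits, the end-of-thread code path is never reached, so that is the part this would sidestep. The cost is the idle thread's stack.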



-- 
These are my opinions.  I hate spam.





Re: Use of pool servers reveals unacceptable crash rate in async DNS

2016-06-26 Thread Hal Murray

e...@thyrsus.com said:
> Ugh.  Our options have just narrowed.  I've just seen
> libgcc_s.so.1 must be installed for pthread_cancel to work Aborted (core
> dumped)

> with memlock off in the build.

Can you reproduce it?

My guess is that you didn't really get memlock turned off.  How about putting 
a break on mlockall or the call to it.  (There is only one in ntpd.c)


-- 
These are my opinions.  I hate spam.





Re: Wonky NTP startup and the incremental-configuration problem

2016-06-26 Thread Hal Murray

An alternative option would be to implement rereading ntp.conf.

For each line in ntp.conf, there are 3 possibilities.  It's new or the value 
has changed, nothing has changed, or the item was dropped.  The latter is the 
tricky case.

The idea is to save a parsed copy of the old ntp.conf.  As the new file is read 
in, kick out the old items (if any) as they get replaced.  (Actually, move them 
to what will be the new saved info.)  Anything left on the old saved-list needs 
to be set back to the default.
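
For the simple set-a-parameter case, the bookkeeping could look something like this. It is a sketch: plan_reload is a hypothetical name, the settings are treated as a flat name -> value table, and the parameter names and values in the example are only illustrative.

```python
def plan_reload(old, new, defaults):
    """Return (to_apply, to_reset) for moving from old settings to new ones."""
    leftover = dict(old)  # the old saved-list
    to_apply = {}
    for key, value in new.items():
        # pop() kicks the old item out as its replacement is read in.
        if leftover.pop(key, None) != value:
            to_apply[key] = value  # new item, or the value changed
    # Whatever is still on the old list was dropped from the file.
    to_reset = {key: defaults.get(key) for key in leftover}
    return to_apply, to_reset


old = {"minclock": 4, "maxclock": 6, "minsane": 2}
new = {"minclock": 4, "maxclock": 8}
defaults = {"minclock": 3, "maxclock": 10, "minsane": 1}
print(plan_reload(old, new, defaults))
```

Here maxclock is re-applied, minclock is left alone, and minsane goes back to its default because it no longer appears in the file.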

That works for simple things like setting a parameter.  It gets more 
complicated for things like server/pool/refclock.

It feels like something that's reasonably clean with the appropriate table.

We would need a way to test things.  I wonder if we could do that from a script 
driving the debugger?


-- 
These are my opinions.  I hate spam.





head broken if no refclocks

2016-06-26 Thread Hal Murray
after a simple ./waf configure

[murray@fed raw]$ ./waf build
--- building host --- 
Waf: Entering directory `/home/murray/ntpsec/raw/build/host'
[1/5] Processing ntpd/ntp_parser.y
[2/5] Compiling build/host/ntpd/ntp_parser.tab.c
/home/murray/ntpsec/raw/ntpd/ntp_parser.y: In function ‘yyparse’:
/home/murray/ntpsec/raw/ntpd/ntp_parser.y:996:33: error: 
‘num_refclock_conf’ undeclared (first use in this function)
for (dtype = 1; dtype < (int)num_refclock_conf; dtype++)
 ^
/home/murray/ntpsec/raw/ntpd/ntp_parser.y:996:33: note: each undeclared 
identifier is reported only once for each function it appears in
/home/murray/ntpsec/raw/ntpd/ntp_parser.y:997:12: error: ‘refclock_conf’ 
undeclared (first use in this function)
if (refclock_conf[dtype]->basename != NULL && 
!strcasecmp(refclock_conf[dtype]->basename, $2) == 0)
^

Waf: Leaving directory `/home/murray/ntpsec/raw/build/host'


-- 
These are my opinions.  I hate spam.





Our testing sucks

2016-06-26 Thread Hal Murray

 1007  ./waf configure --refclock=20,22 --enable-debug-gdb
 1008  ./waf build
 1009  gdb ./build/main/ntpq/ntpq

(gdb) run -p
Starting program: /home/murray/ntpsec/raw/build/main/ntpq/ntpq -p
Missing separate debuginfos, use: dnf debuginfo-install 
glibc-2.21-13.fc22.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================

Program received signal SIGSEGV, Segmentation fault.
0x00413d68 in strlcpy (dst=0x7fffd700 "", 
src=0x4f , siz=1025)
at ../../libntp/strl_obsd.c:36
36  if ((*d++ = *s++) == '\0')
Missing separate debuginfos, use: dnf debuginfo-install 
ncurses-libs-5.9-18.20150214.fc22.x86_64
(gdb) bt
#0  0x00413d68 in strlcpy (dst=0x7fffd700 "", 
src=0x4f , siz=1025)
at ../../libntp/strl_obsd.c:36
#1  0x0040a561 in doprintpeers (pvl=0x625460 , 
associd=1947, rstatus=37914, datalen=2, 
data=0x62880d  "\r\n", fp=0x7725b620 <_IO_2_1_stdout_>, 
af=0) at ../../ntpq/ntpq-subs.c:1795
#2  0x0040a8c8 in dogetpeers (pvl=0x625460 , 
associd=1947, fp=0x7725b620 <_IO_2_1_stdout_>, af=0)
at ../../ntpq/ntpq-subs.c:1877
#3  0x0040aae1 in dopeers (showall=0, 
fp=0x7725b620 <_IO_2_1_stdout_>, af=0) at ../../ntpq/ntpq-subs.c:1928
#4  0x0040ad9a in peers (pcmd=0x7fffe220, 
fp=0x7725b620 <_IO_2_1_stdout_>) at ../../ntpq/ntpq-subs.c:2008
#5  0x00404bc2 in docmd (cmdline=0x419d08 "peers")
at ../../ntpq/ntpq.c:1649
#6  0x00402cda in ntpqmain (argc=0, argv=0x7fffe478)
at ../../ntpq/ntpq.c:658
#7  0x00402426 in main (argc=2, argv=0x7fffe468)
at ../../ntpq/ntpq.c:442
(gdb) 


-- 
These are my opinions.  I hate spam.





waf list shouldn't need to be configured

2016-06-26 Thread Hal Murray
$ ./waf --list
--- building host --- 
The cache directory is empty: reconfigure the project
$

-- 
These are my opinions.  I hate spam.





Re: Our testing sucks

2016-06-26 Thread Hal Murray
 1010  ./waf configure
 1011  ./waf build


[ 74/206] Compiling ntpd/ntp_intercept.c
../../ntpd/ntp_control.c: In function ‘ctl_putpeer’:
../../ntpd/ntp_control.c:2319:8: error: ‘struct peer’ has no member named 
‘procptr’
   if (p->procptr != NULL) {
^
../../ntpd/ntp_control.c:2322:10: error: ‘struct peer’ has no member 
named ‘procptr’
 p->procptr->clockname, p->refclkunit);
  ^
../../ntpd/ntp_control.c:2322:33: error: ‘struct peer’ has no member 
named ‘refclkunit’
 p->procptr->clockname, p->refclkunit);
 ^

Waf: Leaving directory `/home/murray/ntpsec/foo/build/main'


-- 
These are my opinions.  I hate spam.





waf --list needs to show old numbers as well as new names

2016-06-27 Thread Hal Murray

It's handy if you are updating a script.



-- 
These are my opinions.  I hate spam.





New ntpq peers chops refclocks to 6 characters

2016-06-27 Thread Hal Murray
But there is lots more room in that column.  I think it will hold a worst 
case IPv4 numerical address.


     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 HP5850          .GPS.            0 l    7   64    1    0.000    0.000   0.000
 PPS(0)          .PPS.            0 l    -   64    0    0.000    0.000   0.000
 SHM(0)          .SHM.            0 l    5   64    1    0.000  -218.14   0.001
 SHM(1)          .SHM.            0 l    4   64    1    0.000   -1.094   0.001
 GPSD(0          .GPSD.           0 l    3   64    1    0.000  -308.94   0.001
 GPSD(1          .GPSD.           0 l    2   64    1    0.000   -0.209   0.001


-- 
These are my opinions.  I hate spam.





Re: The new refclock directive is implemented and documented

2016-06-27 Thread Hal Murray

e...@thyrsus.com said:
> and the
>   noun/verb "fudge" is reserved for the two time offset options.

Why?  What's the difference between a flag that gets set to 0 or 1 and a time 
that gets set to a number?



> There will be a *limited* open period for bikeshedding about the driver
> names.

hp58503a should probably be hpgps.  It works for several devices.

--

You need a plan for testing this stuff.  I won't be helping since I think 
it's important to be able to run ntp classic.


-- 
These are my opinions.  I hate spam.





Re: waf --list needs to show old numbers as well as new names

2016-06-27 Thread Hal Murray
> Can you show me an example of this sort of script?

How do you build things for your collection of systems?  Do you really type 
the configuration in by hand each time?  Do you use --refclock=all?

Here is a fragment that I translated by hand:

--refclock=irig,nmea,pps,hp58503a,shm,gpsd


-- 
These are my opinions.  I hate spam.





Re: New ntpq peers chops refclocks to 6 characters

2016-06-27 Thread Hal Murray

How does your new stuff handle multiple instances of a refclock type?

For a test case, I suggest a USB driver in addition to a HAT.  Try both 
NMEA/PPS as well as both SHM and various combinations.


The JSON driver uses the high bit of the unit to enable/disable the PPS.


The NMEA and HP drivers use the mode/ttl slot to select the baud rate.  There 
are probably others that do something similar.   As long as you are changing 
things, you might as well clean up how the baud rate gets passed in.  It 
needs that before it opens the /dev/tty.  The old fudge stuff is too late.  
(Or was at one point.)

-- 
These are my opinions.  I hate spam.





Re: The new refclock directive is implemented and documented

2016-06-27 Thread Hal Murray

e...@thyrsus.com said:
> I don't think shm needs to change at all. It says what it is - data coming
> over System V shm, which defines its own format by the shared structure 

I like SHM.  I think there are non-gpsd sources of SHM data.

I have no strong preferences for gpsd vs json.


-- 
These are my opinions.  I hate spam.





Re: waf --list needs to show old numbers as well as new names

2016-06-27 Thread Hal Murray

e...@thyrsus.com said:
>> --refclock=irig,nmea,pps,hp58503a,shm,gpsd
> I'm not seeing a problem here.  Isn't it trivial to get those names from,
> e.g., https://docs.ntpsec.org/latest/refclock.html ? 

The problem is not to "get the names", it's to translate an old number to the 
new name.  You may have forgotten which driver you tossed into that setup or 
why, and maybe now is not the time to clean up that sort of thing.
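
What I have in mind is a lookup table like the one below. It only covers the six drivers in the fragment quoted above, and the numbers are the classic driver types as I remember them, so treat them as something to check against the classic driver list. translate_refclock_option is a hypothetical helper, not part of waf.

```python
# Classic driver type numbers for the names in the fragment above
# (from memory of the classic driver list; verify before relying on it).
OLD_TO_NEW = {6: "irig", 20: "nmea", 22: "pps", 26: "hp58503a", 28: "shm", 46: "gpsd"}


def translate_refclock_option(value):
    """Turn '--refclock=20,22' style numbers into names; names pass through."""
    out = []
    for item in value.split(","):
        item = item.strip()
        # An unknown number raises KeyError on purpose: better than guessing.
        out.append(OLD_TO_NEW[int(item)] if item.isdigit() else item)
    return ",".join(out)


print(translate_refclock_option("6,20,22,26,28,46"))
```

Having waf --list print the same pairs would let people update a script without digging through the documentation.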


-- 
These are my opinions.  I hate spam.





Re: New ntpq peers chops refclocks to 6 characters

2016-06-27 Thread Hal Murray

e...@thyrsus.com said:
> More suggestions like this, please.

bps may not be enough.  There is also the parity and stop bits, but I don't 
think they are fiddled much.

The HP driver uses one mode bit to switch from whatever the default is to a 
different baud rate and parity.  It may be simpler to use a device-type 
keyword rather than require the user to know about bps and such.

The NMEA driver uses a chunk of the mode field to select the type of sentence 
to use.  Something symbolic would be nicer.
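
As an illustration of what "symbolic" could replace, here is a decoder for the NMEA mode value. The bit layout is my recollection of the classic NMEA driver documentation (low four bits select sentences, bits 4 to 6 select the line speed); treat it as an assumption and check it against the driver page. decode_nmea_mode is a hypothetical helper.

```python
# Assumed layout, from memory of the classic NMEA driver docs.
SENTENCES = {1: "RMC", 2: "GGA", 4: "GLL", 8: "ZDA"}
BAUD_RATES = {0: 4800, 16: 9600, 32: 19200, 48: 38400, 64: 57600, 80: 115200}


def decode_nmea_mode(mode):
    """Split a numeric mode into sentence names and a line speed."""
    sentences = [name for bit, name in SENTENCES.items() if mode & bit]
    baud = BAUD_RATES.get(mode & 0x70)  # None for values not in the table
    return {"sentences": sentences, "baud": baud}


print(decode_nmea_mode(17))
```

Under that layout, mode 17 means "RMC sentences at 9600", which is a lot easier to read as two keywords than as a number.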

The palisade driver has an option that selects from several modes.

There are probably others.

It's probably worth a pass through the code and/or documentation before you 
do anything.


-- 
These are my opinions.  I hate spam.





offset: time1 or time2

2016-06-27 Thread Hal Murray

e...@thyrsus.com said:
> Which reminds me: an addition I'm considering is adding "offset" as a
> synonym for time1 or time2, whichever one usually sets an offset for time
> reported from the unit. Only, I'm not clear which it should be; either it
> varies by driver or I'm not understanding the documentation properly. Can
> you shed any light on this? 

The problem arises when you have something like the NMEA driver that tries to 
handle the PPS by itself.  That needs two offsets, one for the serial port 
and one for the PPS.  My suggestion would be offset and pps-offset.

I think the only way to be sure what is going on would be to go through the 
drivers one by one and make a chart of their usage.  That might be handy for 
other uses.  If you make one, consider adding it to docs/.  Just put a date on 
it.  It's probably worth grepping the drivers to make sure the code agrees 
with the documentation.  I think there was a reasonable attempt to keep the 
usage common across drivers but I won't be surprised by any differences.

-- 
These are my opinions.  I hate spam.





Re: The new refclock directive is implemented and documented

2016-06-27 Thread Hal Murray

e...@thyrsus.com said:
>> hp58503a should probably be hpgps.  It works for several devices.
> OK.  Can you enumerate some other devices so I can list them in the header
> comment and on the driver page? 

The documentation already mentions the Z3801A.  There are a lot of them in 
the ham/hacker community courtesy of the cell phone industry many years ago.  
I think these were the first GPSDOs available at recycled prices.  The manual 
is available and makes a good read for background info on GPSDOs.  There is 
lots of non-HP info available on the web.

They are really old (GPS software says COPYRIGHT 1991-1995 MOTOROLA) so the 
GPS units aren't very sensitive.

There are several other Z38xx versions available.  There are usually a few of 
them on eBay.

Recently, a batch of new Z3811/3812 two unit pairs appeared, also known as 
KS-24361.  Lucent unloaded their stockpile.

-- 
These are my opinions.  I hate spam.





Re: The new refclock directive is implemented and documented

2016-06-27 Thread Hal Murray

e...@thyrsus.com said:
> An argument for "json", maybe.  But not a really compelling one, because
> GPSD defined the protocol and anything else emitting it would probably be
> emulating GPSD deliberately.

I think I prefer JSON for the same reason I like SHM.

I think the real question is does the current driver depend on any GPS(d) 
cruft.  If all I wanted was time, could I send just the offset and would the 
current driver take it.

Maybe gpsd should have a time-only mode.




-- 
These are my opinions.  I hate spam.





Re: Use of pool servers reveals unacceptable crash rate in async DNS

2016-06-27 Thread Hal Murray

cbwie...@gmail.com said:
> How are pool entries added when the service decides it needs more? 

There is some background stuff that roughly says "need more?", and if so 
fires off the DNS lookup.


> Would it be possible to leverage this code for adding all servers specified
> by name?

Probably not directly, but it wouldn't be hard for the server code to use 
more than one address if that was desired.  Maybe it should be "servers" 
rather than "server".  Do you have an example where that would be useful?

If you don't have lots of servers, you probably don't want to switch to using 
"pool" since that path will probably keep banging away at the DNS looking for 
more servers.

 



-- 
These are my opinions.  I hate spam.





Re: Use of pool servers reveals unacceptable crash rate in async DNS

2016-06-27 Thread Hal Murray

cbwie...@gmail.com said:
> I was thinking of setting up associations using the DNS lookup code.  If the
> mechanism for adding new pool servers was blocking on the DNS call but
> asynchronous to the rest of the daemon, I was figuring to call the lookup
> with the name provided by the server directive.  The only real difference
> between a specified server and a pool server is that you don't delete the
> specified server. 

The DNS lookup for server and pool both take the same general path of using 
another thread to do the lookup.

If all goes well, the server stuff could do the lookup during startup.  But 
there are all sorts of ways for DNS to not work.


-- 
These are my opinions.  I hate spam.





ntpq mrulist: cpu hog

2016-06-28 Thread Hal Murray
I have a pool server.  mru maxmem is set big enough to capture a whole day.  
Each midnight, a cron job fires off to capture everything to a file.  The 
file is 100 megabytes.

While that is going on, ntpq is using 95% of the cpu.

If anybody is looking for a nice distraction, it would be interesting to 
understand what's going on and see if it could be fixed.  (I haven't looked 
at the code yet.)



-- 
These are my opinions.  I hate spam.





Re: Use of pool servers reveals unacceptable crash rate in async DNS

2016-06-28 Thread Hal Murray

e...@thyrsus.com said:
> After discussion with Daniel about the performance and security issues I
> deleted the memlock code. As the comment explains:

I think changes like that are worthy of a general announcement.

> on modern systems, which swap so seldom
> that many people don't bother with swap partitions

I think you have extrapolated from some modern systems to our whole target 
environment.  I don't remember any discussion supporting memlock not being 
interesting/important.

I'd be a lot happier if you had a plan for what to do if it turned out to be 
a problem and/or a way to verify that we don't need it or detect that it 
causes trouble.

Consider ntpd running on an old system that is mostly lightly loaded and 
doesn't have a lot of memory.  I could easily imagine ntpd getting swapped 
out when some load did come along.  I don't know how to evaluate if that will 
cause problems and I don't think we have a test environment that is likely to 
blunder into it.

I poked around a bit.  Linux and NetBSD and FreeBSD all have getrusage().  I 
didn't notice any differences.  It covers page faults and CPU usage.  When 
I'm in the right mood, I'll add another file parallel to sysstats to collect 
that sort of data.  The CPU usage will probably be interesting even if page 
faults are boring.
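
A sketch of the kind of data that would go in such a file. ntpd is C and would call getrusage() directly; Python's resource module wraps the same call, so this shows the fields rather than the eventual implementation. usage_snapshot and usage_delta are hypothetical names.

```python
import resource


def usage_snapshot():
    """Collect CPU time and page-fault counters for this process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "user_cpu_s": ru.ru_utime,
        "system_cpu_s": ru.ru_stime,
        "minor_faults": ru.ru_minflt,  # serviced without disk I/O
        "major_faults": ru.ru_majflt,  # required disk I/O
    }


def usage_delta(before, after):
    """Difference between two snapshots, e.g. one logging interval apart."""
    return {key: after[key] - before[key] for key in before}


start = usage_snapshot()
print(usage_delta(start, usage_snapshot()))
```

Major faults are the interesting number for the memlock question: if they stay at 0 over long runs on a loaded box, nothing ntpd needed was paged out to disk.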

-- 
These are my opinions.  I hate spam.





Re: Device driver mode bits and other skulduggery.

2016-06-28 Thread Hal Murray

e...@thyrsus.com said:
> One thing that jumps out at me is that several drivers have a clockstats
> verbosity option, always flag4 (which, alas, is used for other things too). 

There may have been a general idea that flag4 would be used to enable 
clockstats from an individual driver instance.

That's how refclock_shm uses it.  I'd be happy if you nuked that test, aka 
always write it.  (If clockstats isn't enabled it won't go anywhere.)

Same for refclock_gpsd

There may be others.  I didn't check the drivers I'm not familiar with.

It's probably not worth a lot of work in this area.

> hpgps:
>time1: PPS time offset

The HP driver doesn't know anything about PPS.  I assume that is a typo.

> nmea:
>flag3: clock discipline selection
> pps:
>flag3: PPS discipline select

I would say "kernel PLL" in there.  "discipline select" doesn't tell me 
anything.



-- 
These are my opinions.  I hate spam.





Re: Use of pool servers reveals unacceptable crash rate in async DNS

2016-06-28 Thread Hal Murray

matthew.sel...@twosigma.com said:
> "rlimit memlock 0" using Classic causes ntpd to die after 3 minutes with
> this error 2016-06-29T00:13:21.903+00:00 host.example.com ntpd[27206]:
> libgcc_s.so.1 must be installed for pthread_cancel to work 

What version of Classic are you running?  I thought they had fixed that.


> I've attached 15 minute graphs for "rlimit memlock -1" and "rlimit memlock
> 128" using Classic.  Locking memory seems to result in more stable graphs
> over the time period that I was able to collect quickly. 

What are you plotting?

-- 
These are my opinions.  I hate spam.





Kernel PPS processing

2016-06-29 Thread Hal Murray
http://users.megapathdsl.net/~hmurray/ntpsec/glypnod-pps-kernel.png

If you turn on flag3 for a PPS driver on a Linux system, you get this error 
message:
06-20T12:25:32 ntpd[988]: refclock_params: kernel PLL (hardpps, RFC 1589) not 
implemented

I poked around a bit.  Those options are in drivers/pps/Kconfig  Here is the 
key chunk:
config NTP_PPS
bool "PPS kernel consumer support"
depends on !NO_HZ
help
  This option adds support for direct in-kernel time
  synchronization using an external PPS signal.

  It doesn't work on tickless systems at the moment.

So I pulled over the sources and built a kernel with NO_HZ turned off and 
NTP_PPS turned on.
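
A rough sketch of how one might check a kernel config for this before trying flag3. It assumes the config uses the same symbol names as the Kconfig chunk above (NTP_PPS and NO_HZ). The helper name hardpps_possible and the two sample config strings are made up for illustration.

```python
def hardpps_possible(config_text):
    """Report whether a kernel config enables NTP_PPS and leaves NO_HZ off."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Enabled booleans look like CONFIG_FOO=y; disabled ones are comments.
        if line.startswith("CONFIG_") and line.endswith("=y"):
            enabled.add(line[len("CONFIG_"):-len("=y")])
    return "NTP_PPS" in enabled and "NO_HZ" not in enabled

# Made-up config fragments: a tickless distro-style kernel and a custom build.
stock = "CONFIG_NO_HZ=y\n# CONFIG_NTP_PPS is not set\nCONFIG_PPS=y\n"
custom = "# CONFIG_NO_HZ is not set\nCONFIG_NTP_PPS=y\nCONFIG_PPS=y\n"

print(hardpps_possible(stock))
print(hardpps_possible(custom))
```

On a real system the text would come from wherever the distro keeps the running kernel's config.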

The next project is to figure out why it works so much better, or rather why 
the normal ntpd can't do a lot better.



-- 
These are my opinions.  I hate spam.





Re: adns is looking plausible

2016-06-29 Thread Hal Murray

e...@thyrsus.com said:
> I haven't looked at the code itself yet, but from reading the C header file
> and the website, adns is looking like a plausible replacement for our
> homebrew async-DNS.

Good find!

One feature that pushes me in that direction is being able to get at the TTL.


> and reducing our KLOC ...

It's not obvious to me that reducing our code at the cost of dragging in 
another library is progress.  Do we even have a list of the libraries that we 
now depend upon?  How do we evaluate their risks and/or trustworthiness?

It's available as a package in Fedora and NetBSD and FreeBSD.  I take that as 
a vote of confidence, but I don't know how strong.  Does anybody know a 
simple way to find out how many other packages depend upon a given package?



Where does this fit in our overall priorities?  What can I do to help get 
you back to working on TESTFRAME?  Should we rip out the current intercept 
stuff (few KLOC) and start over when we are done with what seems to be 
turning into a long series of other projects?




-- 
These are my opinions.  I hate spam.





Re: Kernel PPS processing

2016-06-29 Thread Hal Murray
> Can you quantify the better?  I would have expected identical...

Did you look at the graph?
  http://users.megapathdsl.net/~hmurray/ntpsec/glypnod-pps-kernel.png

I'm not sure why you would expect performance to be identical.  Dave Mills 
and crew went to a lot of effort to get code into various kernels, including 
writing a RFC.  I'd be very surprised if it wasn't a significant improvement.

The question in my mind is have things changed enough since 1994 so that we 
can do as well without that code in the kernel?


-- 
These are my opinions.  I hate spam.





Re: Technical strategy and performance

2016-06-29 Thread Hal Murray

fallenpega...@gmail.com said:
> Thank you Eric.  Have read, am pondering, and welcome other people to weigh
> in. 

The big picture question that comes to mind is why did we start by forking 
ntp classic?  Why not start from scratch?  Did anybody consider chrony?  What 
other options are/were there?

Where would I look to find a crisp statement of the goals of the project?


-- 
These are my opinions.  I hate spam.





Re: Kernel PPS processing

2016-06-29 Thread Hal Murray
matthew.sel...@twosigma.com said:
> We tested booting with "nohz=off intel_idle.max_cstate=0" and it made a
> difference in our production clocks. 

Interesting.  Thanks.

How did you decide to go there?

Did you try those 2 changes separately?  Was that with PPS or just a typical 
system?

Are you using the kernel from a distro or building your own?





-- 
These are my opinions.  I hate spam.





Re: Technical strategy and performance

2016-06-29 Thread Hal Murray
Thanks.

I didn't see any surprises.  I'm happy with the general idea, it's the 
details that get interesting.

Removing cruft is good.  Removing features is not.  There is a trade off 
between the cruftiness of the code and the importance of any features it 
includes.

This example gets tangled up in several issues.

I didn't see much discussion.

I seem to be the only one who occasionally pushes back when you hint at 
removing stuff.  I can't tell if I'm making the right amount of noise or not 
enough or too much.  Most of the cruft you remove looks like progress to me, 
but I can't tell if/when you are going too far.  It's a judgment call.  
Sometimes I don't care much.  Sometimes I do.

One of the complications for this case is that we don't have a good way to 
test things.  This feels like the sort of problem that might come back as a 
hard to debug example way off in a far away datacenter where it would be even 
harder to debug.  I don't like that sort of problem so I'm probably willing 
to put up with a bit of cruft in the code in order to reduce the risk.

You haven't convinced me that modern hardware will make this problem go away. 
 Yes, it will reduce it, but that also makes it harder to test.  Your comment 
about no swap space was timely.  I lost a cron job a few days ago because it 
ran out of memory.  I don't know enough about modern data center operations.  
On VM systems, they charge for memory.  ...

Did you consider simplifying things rather than removing everything?  (Sorry 
for not suggesting this sooner.)  Most of the cruft was in figuring out how 
much to lock.  Would locking everything be simple enough?

---

I thought there was a command line switch to use the real-time scheduler but 
I can't find it.  If it's there, it might be cruft to clean up.  If it's not 
there, it might be a good feature.  There would be complications with lots of 
traffic locking up the CPU.

---

There is another interesting consideration when using old hardware.  They 
take a lot of power.  At some point, it's cheaper to buy new gear that 
doesn't use as much power and has more memory while you are at it.  I 
computed the pay back time once, and it seemed like a good excuse to get some 
new toys.  The next time I did the calculations, I got an answer I didn't 
like as much.

-- 
These are my opinions.  I hate spam.





Re: Kernel PPS processing

2016-06-29 Thread Hal Murray

g...@rellim.com said:
>> I'm not sure why you would expect performance to be identical.
> Because they use the same kernel generated time stamp and PLL algorithm. 

There are two chunks of PPS code in the kernel with separate RFCs.  One is 
getting the time stamp.  The other is doing the PLL.  The in-kernel PLL is 
totally different from anything in ntpd.


-- 
These are my opinions.  I hate spam.





Re: Kernel PPS processing

2016-06-29 Thread Hal Murray

g...@rellim.com said:
> Wow.  I thought something was wrong.  My local clock offset (peerstats file)
> has always been hanging around 100ppm.  Stable to  ±1ppm so I figured
> that was normal.

> After reboot the local clock offset started at 9ppm and has been slowly
> going down, now under 2ppm. 

Your units don't make sense.

Offsets would be in units of seconds.  I assume you mean some fraction of a 
second.  A guess would be microseconds.


-- 
These are my opinions.  I hate spam.





Re: Kernel PPS processing

2016-06-29 Thread Hal Murray
> Local clock frequency offset, as opposed to local clock time offset.

Most NTP documentation calls that drift.  Its magnitude is not very 
interesting when discussing quality of time.  Changes over time can be 
interesting.  It's usually much more interesting to look at the clock offset.

There are two sources for drift.  One is crystal error.  That part often 
makes a good thermometer.

The other is software.  If somebody gets the arithmetic a bit wrong, ntpd can 
correct just like it does for the initial hardware error.

For many years, Linux had a not-good measurement of the system clock 
frequency at boot time.  If you rebooted, you got a different answer.  It was 
close, just not good enough in the low bits if you wanted good timekeeping.

Jun  2 10:34:25 fed kernel: tsc: Detected 1596.750 MHz processor
Jun  9 11:06:24 fed kernel: tsc: Detected 1596.966 MHz processor
Jun 19 11:42:22 fed kernel: tsc: Detected 1596.978 MHz processor
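
To put those numbers in NTP terms, here is a quick sketch of the arithmetic. It converts the change between successive boot-time calibrations into PPM. The helper name ppm_difference is just for illustration.

```python
def ppm_difference(reference_mhz, measured_mhz):
    """Relative difference between two frequency estimates, in parts per million."""
    return (measured_mhz - reference_mhz) / reference_mhz * 1e6

# The three calibration results from the kernel log lines above.
boots = [1596.750, 1596.966, 1596.978]

for earlier, later in zip(boots, boots[1:]):
    change = ppm_difference(earlier, later)
    print(f"{earlier:.3f} -> {later:.3f} MHz: {change:+.1f} ppm")
```

The first pair differs by roughly 135 PPM and the second by roughly 7.5 PPM.  If the kernel's idea of the clock frequency moves by that much across a reboot, I would expect a saved drift value to be off by a similar amount until ntpd relearns it.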


-- 
These are my opinions.  I hate spam.





Re: My task list

2016-06-30 Thread Hal Murray
> 1. Try replacing our buggy async-DNS code with the c-ares library.

You keep calling the existing code "buggy".  Is that correct, or are you just 
being sloppy since you don't like it (perhaps justifiably) and it has 
triggered bugs/quirks in other parts of the system?

As far as I can tell, our code is innocent.  The recent troubles are some 
combination of libc/mlockall and pthreads not working well together.  We 
just happened to trigger it reliably enough to cause troubles but not 
reliably enough to make testing simple.

> 2. If that succeeds, reinstate memlocking long enough to check if the
>crash bug recurs.  If it doesn't, leave memlocking in.

The old memlock code, or a simplified lock-everything (no parameters) version?

If any new code uses threads, it's going to have the same problem.  I'd vote 
against restoring the old code until you have figured out how to test it.


> 3. Collect the results from my first profiling runs, now about 14 days of
> data
>Learn how to graph and interpret them. 

You might do that first since you will probably want to tweak something and 
collect more data.

Data for a day will tell you most of what you will ever get.  If you have 
lots of data, then you have to scan it looking for glitches.

Consider bumping the clock and watching it recover.  (util/bumpclock)  There 
are two interesting cases.  One is a big bump so it will "step" the clock to 
recover.  The other is a small bump so it will slew (slowly) to recover.  The 
split is 128 ms.  So I'd try 200 ms and 100 ms.
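
A small sketch of what to expect from the two test bumps.  The 128 ms split is from above.  The 500 PPM maximum slew rate is my assumption about the usual limit, so the time it gives is only a lower bound on how long a slew takes.  The names are hypothetical.

```python
STEP_THRESHOLD_S = 0.128  # the 128 ms split mentioned above
MAX_SLEW_PPM = 500.0      # assumed maximum slew rate

def recovery_mode(offset_s):
    """Say whether an offset of this size would be stepped or slewed."""
    return "step" if abs(offset_s) > STEP_THRESHOLD_S else "slew"

def min_slew_seconds(offset_s):
    """Lower bound on the time needed to slew away an offset."""
    return abs(offset_s) / (MAX_SLEW_PPM * 1e-6)

for bump in (0.200, 0.100):
    print(f"{bump * 1000:.0f} ms bump: {recovery_mode(bump)}")

print(f"100 ms needs at least {min_slew_seconds(0.100):.0f} s to slew away")
```

So the small bump case needs a few minutes of data at the very least before there is anything interesting to look at.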


> 5. Do the cleanup required to get the code compiling under -std=c99. 

What does that involve?



TESTFRAME is missing.  How about we both clear our schedules and desks and 
give it another try?  How about next Wed?


-- 
These are my opinions.  I hate spam.





Re: Technical strategy and performance

2016-06-30 Thread Hal Murray

e...@thyrsus.com said:
> In many cases, especially in government, they *can't* -- they have lengthy
> certification requirements for new infrastructure components.

If they are on the ball, they will have to do almost as much work to 
(re)certify after all the changes we have made.


>> Where would I look to find a crisp statement of the goals of the project?
> On the project website.
> https://www.ntpsec.org/announcement.html
> https://www.ntpsec.org/plans.html 

I don't see anything in either page that I would call "crisp".

Yes, there is lots of good stuff.  If you know what the answer is, you can 
find lots of supporting info.

The plans are all down in the weeds, details rather than big picture.  The 
announcement is background and handwaving.  If I asked you, does "X" fit in 
the scope of the project, you can scan the plans to see if you find a match 
but if not, good luck trying to figure it out from the announcement.  Take 
the memlock discussion.  Is there anything in the announcement that says we 
focus on modern systems with lots of memory?



-- 
These are my opinions.  I hate spam.





Re: Kernel PPS processing

2016-06-30 Thread Hal Murray

g...@rellim.com said:
> I took another look, and realized I misunderstood the y axis.  And that you
> are plotting loopstats and I'm looking at offsets.  So not as bad as I
> thought. 

I can't figure out what that means.

I was plotting the offset column from loopstats.

> To get apples and apples, can you send me your gnuplot formula?  I'll add it
> to my chrony graph.

From another handy graph:

set ylabel "Offset ms"
set y2label "Drift PPM"

plot \
"glypnod-loop" \
  using ($2/3600):($3*1000) \
  title "Offset" with lines lt 1, \
"glypnod-loop" \
  using ($2/3600):($4) \
  axes x1y2 \
  title "Drift" with lines lt 3
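
For anybody feeding the same data to something other than gnuplot, here is a sketch of the column arithmetic the plot command does: column 2 (seconds) becomes hours, column 3 (offset in seconds) becomes milliseconds, and column 4 is the drift in PPM.  The function name and the sample record are made up.

```python
def loopstats_points(lines):
    """Yield (hours, offset_ms, drift_ppm) for each loopstats record."""
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        seconds = float(fields[1])    # column 2: seconds past midnight
        offset_s = float(fields[2])   # column 3: offset in seconds
        drift_ppm = float(fields[3])  # column 4: frequency in PPM
        yield seconds / 3600, offset_s * 1000, drift_ppm

# Made-up record in loopstats layout, not real data.
sample = ["57570 3600.000 0.000012500 -7.250 0.000003 0.001 6"]

for hours, offset_ms, drift_ppm in loopstats_points(sample):
    print(f"{hours:.2f} h  offset {offset_ms:.4f} ms  drift {drift_ppm:.3f} ppm")
```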


-- 
These are my opinions.  I hate spam.





Re: Master Does Not Compile on Centos 6

2016-06-30 Thread Hal Murray

j...@rtems.org said:
> This likely fails on other platforms since it is a mismatched brace: 

Yes.  Anything without a refclock.

Amar:  Buildbot needs a few more build runs with various configurations to 
catch things like this.  They don't need to be run on all systems but they 
should be run on at least one.  My straw man to start with would be one with 
minimal features and one with everything.

We should run them on everything when getting ready for a release.


-- 
These are my opinions.  I hate spam.





Re: Master Does Not Compile on Centos 6

2016-06-30 Thread Hal Murray
>> This likely fails on other platforms since it is a mismatched brace: 
> Yes.  Anything without a refclock.

Fix pushed.


-- 
These are my opinions.  I hate spam.





Re: Technical strategy and performance

2016-06-30 Thread Hal Murray

ja...@azze.org said:
> This is why I try to make noise when things are broken on RHEL/CentOS 6.x. I
> don't see a builder for that OS on buildbot.ntpsec.org. The Red Hat
> Enterprise family (RHEL, CentOS, Scientific Linux, Oracle Enterprise Linux)
> and SuSE Linux Enterprise Server are where we boring, conservative sysadmins
> like to live. There are a lot of us who haven't moved off of RHEL 6
> (supported through 2020) for critical infrastructure because RHEL 7 went
> systemd on us. 

Is CentOS reasonable coverage for the Red Hat side?  What versions do we need?

Is Scientific Linux enough different that it's worth running it too?  If so, 
what versions?

Is openSUSE reasonable coverage for SUSE?  If so, what versions do we need?

Is there a free version of Oracle?


-- 
These are my opinions.  I hate spam.





Re: Technical strategy and performance

2016-06-30 Thread Hal Murray

e...@thyrsus.com said:
> There are some prerequisites.  Libraries need the library installed to run
> and in addition, the development headers installed to build.

> Python 2.x, x >= 5
> bison
> libevent 2.x
> libcap
> OpenSSL
> GNU readline
> BSD libedit
> sys/timepps.h
> asciidoc, a2x

It's (much?) more complicated than that.

Python and bison are needed to build and install.  Python may be needed by 
some utilities.

libcap and timepps.h are optional.  libcap on Linux is needed for drop root.  
timepps is needed for PPS support.  ntpd will run fine without them.

For the crypto stuff, we need libcrypto.  On Fedora, that comes from the 
cryptopp package.  I'm not sure how that's tangled up with OpenSSL.  We can 
build without it but you won't get the crypto stuff.

asciidoc is only needed to build the documentation.  If necessary, you could 
build it on a different system and copy it over.

I think readline and libedit are only needed by utilities like ntpq.

I think a tarball avoids some of the build requirements.  I'd guess bison and 
asciidoc, but I'm not sure.

We should probably set up a cross-compile example.


e...@thyrsus.com said:
> H. Daniel, shouldn't that OpenSSL be replaced by libsodium?  Please
> write up an entry on that. 

We have a local copy of whatever we need from libsodium.  We need libcrypto 
for the crypto stuff.

Amar:  Buildbot should probably include a system that doesn't have any of the 
optional stuff installed - just to make sure we really can build on it.



Eric:  There is code for md5 and sha1 (I think) that gets built if you don't 
have the appropriate library.  I don't remember when it gets included.  Have 
you considered nuking it?

Or do we want to retain a private copy of the basic crypto routines so we 
don't depend on another package?  If so, we will probably need our copy of 
sha256.


-- 
These are my opinions.  I hate spam.





asciidoc tables

2016-07-01 Thread Hal Murray

The table widths have things like:
   [width="100%",cols="<34%,<33%,<33%"]

I find that makes a table that is ugly and hard to read.

I could tune the widths, but I don't know how wide the viewer's display will 
be.

Is there a better way to do things?  I'd like to say "make this column as 
wide as necessary to hold the widest content", and mark the last column as 
the one to wrap if the total doesn't fit.



This isn't a big deal, but it's annoying enough that I would put in some 
effort to fix things if I knew how to do it.


-- 
These are my opinions.  I hate spam.





Add usestats to collect resource usage statistics

2016-07-01 Thread Hal Murray
I just pushed the code.

You will get things like this:

57570 76638.360 3600 19.221 29.499 1541 0 0 0 2984 288288 2123 0 8428
57570 80238.357 3600 20.812 25.956 1062 0 0 0 3024 246608 2274 0 12652
57570 83838.357 3600 23.353 26.497 833 0 0 0 2992 255329 2556 0 16164
57571 1038.358 3600 31.154 31.335 1027 0 0 0 3088 310802 2393 0 20280
57571 4638.357 3600 31.467 28.972 859 0 0 0 3120 266748 2469 0 23676
57571 8238.357 3600 43.700 38.214 1525 0 0 0 2976 369410 3247 0 29748
57571 11838.357 3600 35.270 24.945 644 0 0 0 3112 226024 3155 0 32384
57571 15438.357 3600 46.356 29.400 1439 0 0 0 2856 278092 1971 0 37928

The data is from getrusage().

The last column is the high water mark for the "resident set size" in 
kilobytes.  If anybody figures out exactly what that means, please clue me in.

The above is from a pool server setup with the mrulist limit big enough to 
hold a whole day.  It's still ramping up.

The floating point numbers are user and system CPU usage.  The last line 
shows a total usage of 2.104%.
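
Here is a sketch of that arithmetic.  It assumes the third field is the interval in seconds and that the two floating point fields are user CPU then system CPU, in that order.  The order doesn't change the total.  The helper name cpu_percent is just for illustration.

```python
def cpu_percent(usestats_line):
    """Total CPU usage over the interval covered by one usestats line."""
    fields = usestats_line.split()
    interval_s = float(fields[2])  # length of the sample interval
    user_s = float(fields[3])      # assumed: user CPU seconds
    system_s = float(fields[4])    # assumed: system CPU seconds
    return 100.0 * (user_s + system_s) / interval_s

# The last line from the sample output above.
last = "57571 15438.357 3600 46.356 29.400 1439 0 0 0 2856 278092 1971 0 37928"
print(f"{cpu_percent(last):.3f}%")
```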

One of the 0s is page faults.

More info in ./host/docs/monopt.html
or docs/includes/mon-commands.txt





-- 
These are my opinions.  I hate spam.





Re: Avoiding merge bubbles

2016-07-02 Thread Hal Murray
Thanks.  I hate that crap as much as anybody.

> git pull --rebase

I missed the --rebase part.

Is there any way to set things up so --rebase is the default with pull?

Is there any way to recover after I forget?

Can we fix the push process to reject pushes if they have that type of 
comment?  (I think it already rejects pushes that don't build, so the 
mechanism is there.)


-- 
These are my opinions.  I hate spam.





Re: Avoiding merge bubbles

2016-07-02 Thread Hal Murray

e...@thyrsus.com said:
>> Is there any way to set things up so --rebase is the default with pull?
> Yes.  If you look in your .git/config, adding the "rebase = true" line will
> set --rebase for all pulls from master. 

Thanks.

Where should that be documented?  I think I set that when you sent out a 
similar message a long time ago but I lost it when making a new clone.

Are there other git quirks that should be documented?


>> Is there any way to recover after I forget?
> Not short of repository surgery.  Remember the hash chain - git is actually
> designed to make it difficult to modify old commits. 

If the crap is in my local copy, I can move the whole directory to the side, 
get a new clone, and merge my edits back in.  Recovering my edits could be 
ugly, but in the no-collision case it's not that hard.  Diff the directories 
to find the files you have edited, copy them over and commit...


>> Can we fix the push process to reject pushes if they have that
>> type of comment?
> Theoretically possible, but probably a bad idea.  We will probably have to
> do real branch merges occasionally. 

I was thinking of looking for an exact match on the default commit message.  
If there was a real collision the message should say something interesting.
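
A rough sketch of the kind of test I have in mind.  The regular expression is my approximation of the subject lines git generates for an unedited merge, and the function name is hypothetical.  How this would get wired into the push process is not shown.

```python
import re

# Approximation of git's auto-generated merge subjects, for example
# "Merge branch 'master' of <url>".
DEFAULT_MERGE_RE = re.compile(
    r"^Merge (remote-tracking )?branch '[^']+'( of \S+)?( into \S+)?$"
)

def is_default_merge_subject(subject):
    """True if a commit subject looks like an unedited merge message."""
    return DEFAULT_MERGE_RE.match(subject.strip()) is not None

subjects = [
    "Merge branch 'master' of gitlab.com:NTPsec/ntpsec",
    "Merge branch 'feature' into master: resolve refclock conflicts",
    "Fix startup bug",
]
for subject in subjects:
    print(is_default_merge_subject(subject), subject)
```

Only the first subject is flagged.  A real merge with an edited message gets through, which is the point.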


-- 
These are my opinions.  I hate spam.





memory locking

2016-07-02 Thread Hal Murray

e...@thyrsus.com said:
> BTW, I think I've knocked the mlockall/threads/async bug on the head. I
> swiped some code from chrony that does memlocking after telling ntpd it can
> have as much memory as it wants - ntpd's worst-case memory requirement ain't
> much. I've had that version running continuously for about 14 hours merrily
> swapping pool servers in and out with no crash.

Thanks.

ntpd can take a lot of memory if you collect the statistics for who is 
talking to you on a busy machine, for example a pool server. It depends on 
how busy and how long you want to keep the data.

Is there any way to turn that off?

I don't see any mention in the documentation.  Where should that go?


-- 
These are my opinions.  I hate spam.





Re: memory locking

2016-07-02 Thread Hal Murray

e...@thyrsus.com said:
> I'm not sure what the referent of "that" is. The statistics-gathering I've
> seen seems to be all about writing line-at-a-time records to various stats
> files; I can't see that generating a lot of memory pressure.

> If there's somewhere in the code that is allocating memory proportional to
> the size of saved statistics, yes that could be a problem.  Do you have some
> specific case in mind? 

Different context for statistics.  The MRU list keeps track of traffic from 
each IP Address.  It's in the misc options page under "mru".  You can see it 
with ntpq -c mrulist.


-- 
These are my opinions.  I hate spam.





Is digest mode working for mailing lists?

2016-07-02 Thread Hal Murray
A few weeks ago, I signed up for bugs and vc in digest mode.  I thought I got 
one message, maybe one each list, but I haven't seen anything since.

I see stuff in the archives for vc but the archives for bugs are empty.



-- 
These are my opinions.  I hate spam.





Refclock quirk

2016-07-03 Thread Hal Murray

I'm seeing things like this:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+fe80::21e:c9ff: .PPS.            1 u   86 1024  377    0.483   -3.643   0.330
+fe80::226:2dff: .PPS.            1 u  758 1024  377    0.582   -3.652   0.392
+fe80::226:2dff: 192.168.1.33     2 u  802 1024  377    0.591   -3.632   0.362
*GPS_NMEA(0)     .GPS.            0 l   21   64  377    0.000   74.403   5.791
oPPS(0)          .PPS.            0 l   84 1024  377    0.000   -3.750   0.332
+glypnod         .PPS.            1 u   45   64  377    0.361   -3.683   0.012
+shuksan         .PPS.            1 u   60   64  377    0.313   -3.656   0.021
+mon             .PPS.            1 u   39   64  377    0.552   -3.801   0.034
+tom             .PPS.            1 u   31   64  377    0.506   -3.770   0.037
+cent            .PPS.            1 u   31   64  377    0.442   -3.729   0.014

That's on a Raspberry Pi with a GPS HAT.  The glitch is the 1024 polling 
interval for the PPS.  That shouldn't be there.  Note that the NMEA driver is 
at 64.

I don't remember seeing anything like that before your recent refclock 
changes.

There is a maxpoll 6 on the 5 servers after the PPS.  Nothing on the first 3 
or either refclock.
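
For anybody not used to the units: minpoll and maxpoll are exponents, so "maxpoll 6" caps the interval at 2^6 = 64 seconds, and the 1024 in the poll column corresponds to an exponent of 10.  A tiny sketch of the conversion, with made-up helper names:

```python
def poll_seconds(exponent):
    """Convert a minpoll/maxpoll exponent to a polling interval in seconds."""
    return 2 ** exponent

def poll_exponent(seconds):
    """Convert a power-of-two interval from the poll column to its exponent."""
    if seconds <= 0 or seconds & (seconds - 1):
        raise ValueError("poll interval must be a positive power of two")
    return seconds.bit_length() - 1

print(poll_seconds(6))      # 64
print(poll_exponent(1024))  # 10
```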

I thought the refclocks used to set their own polling interval, but I can't 
find that code.  (I remember changing something many years ago.)



-- 
These are my opinions.  I hate spam.





Re: memory locking

2016-07-03 Thread Hal Murray
> esr@snark:~/software/ntp-rescue/ntpsec$ ntpq -c mrulist
> ***Command `mrulist' unknown

I don't know what's wrong on your end.  When I cut/paste that line, I get 
things like this:

Ctrl-C will stop MRU retrieval and display partial results.
Retrieved 5 unique MRU entries and 0 updates.
lstint avgint rstr r m v  count rport remote address
====================================================
     1     33    0 . 3 4   3833   123 shuksan
    21     37    0 . 3 4   3394   123 mon
    27     52    0 . 4 4   2399   123 glypnod
    31     65    0 . 4 4   1916   123 cent
    64     44    0 . 4 4   2826   123 tom


e...@thyrsus.com said:
> Name must have changed. But I remember seeing the code for that. Looking in
> ntp_control.c...looks like memory used is O(n) in the number of peers. 

Where "peers" includes clients rather than just "peers" from ntp.conf


-- 
These are my opinions.  I hate spam.





Re: Refclock quirk

2016-07-03 Thread Hal Murray

e...@thyrsus.com said:
> Doesn't show up in bisection, and now doesn't reproduce with the head
> revision either.

There is no point in bisecting unless you have a test case that fails on head.

Please try using NMEA and PPS rather than SHM.

You have to wait a while.  I'm not sure how long.  I think it's the normal 
ramp-up on polling interval.


e...@thyrsus.com said:
> Try nuking the build directory, re-configuring, and rebuilding. You might
> have a stale binary somewhere. 

I have a script that does that.

I'll poke around...


-- 
These are my opinions.  I hate spam.





Re: Refclock quirk

2016-07-03 Thread Hal Murray

hmur...@megapathdsl.net said:
> You have to wait a while.  I'm not sure how long.  I think it's the normal
> ramp-up on polling interval.

It takes about 6 minutes.

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 ntp.mcast.net   .ACST.          16 a    -   64    0    0.000    0.000   0.004
+glypnod         .PPS.            1 u    7   64   77    0.406    0.650   0.133
-shuksan         .PPS.            1 u   10   64   77    0.288    0.356   0.355
+mon             .PPS.            1 u    6   64   77    0.531    0.637   0.165
-tom             .PPS.            1 u    9   64   77    0.515    0.600   0.161
+cent            .PPS.            1 u    5   64   77    0.448    0.625   0.143
*NMEA(0)         .GPS.            0 l   12   64   77    0.000  -14.045   1.068
oPPS(0)          .PPS.            0 l   11  128   77    0.000    0.070   0.092
+fed             192.168.1.3      2 u    8   64   77    0.521    0.669   0.140
+fed2            192.168.1.3      2 u    7   64   77    0.537    0.694   0.169
+deb             213.74.106.159   2 b   14   64   76    0.501    0.551   0.159
+deb2            192.168.1.3      2 b   36   64   36    0.563    0.643   0.162
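
The reach column gives a cross-check on the 6 minutes.  It is an 8 bit shift register printed in octal, one bit per poll, so 77 means the last six polls were answered.  A sketch of the decoding, with hypothetical helper names and only a rough estimate of elapsed time:

```python
def successful_polls(reach_octal):
    """Count the answered polls recorded in an octal reach value."""
    register = int(reach_octal, 8) & 0xFF
    return bin(register).count("1")

def approx_minutes_running(reach_octal, poll_s):
    """Rough elapsed time implied by how far the register has filled."""
    register = int(reach_octal, 8) & 0xFF
    return register.bit_length() * poll_s / 60

print(successful_polls("77"))
print(approx_minutes_running("77", 64))
print(successful_polls("377"))
```

Six polls at 64 seconds each is roughly 6.4 minutes.  Once the register fills up to 377 this trick stops telling you anything.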


-- 
These are my opinions.  I hate spam.





Re: Zero-configuration ntpd

2016-07-03 Thread Hal Murray
> Default servers should be the global NTP pool.

In general, it's a very bad idea to wire names or addresses into code, 
especially if you don't own/control the resource being used.  This case is 
less-bad than many others since it is possible (maybe even easy) to 
change/fix.

The problem is that you need an exit strategy.  What are you going to do if 
your usage puts too much load on the pool (or its DNS servers) or the pool 
goes out of business?

You could make it a build time option so each distro could set things up to 
use their NTP servers.  But they can already do that with a config file.  
Wiring in a default seems too complicated.

You might be able to pick up some servers via DHCP.  That probably belongs in 
the startup scripts rather than in ntpd itself.

> Default for statistics can be no stats gathering.

That's the current default.


> There is currently no default drift file location.  This is where I am not
> sure of my ground - should there be one?  If not, why not?   

It will work without one.  It should be a bit slower to get started.  We 
should test that case.


-- 
These are my opinions.  I hate spam.





Anti-DDoS

2016-07-03 Thread Hal Murray

Is there consensus on what we should be doing?  Actually, I'm looking for a 
bigger picture of what all UDP services should be doing.  DNS is the other 
obvious example.

If you had asked me a year or two ago, I would have said "rate limiting" and 
thought that solved the problem.  It does solve the reflection attack, but it 
opens things up to a different type of attack.

A bad guy can deny service to Bob at selected servers by sending forged 
packets to those servers so they start rate limiting him.  That doesn't take 
a lot of traffic so it won't stand out and most of the infrastructure won't 
even know there is a problem.  (That does require that you can figure out 
what servers Bob is talking to.)
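
A toy model of that attack, to make it concrete.  This is not how ntpd does rate limiting.  It is a made-up per-source limiter with a fixed window, and the class name, limits and address are all invented for the example.

```python
class PerSourceLimiter:
    """Allow at most max_packets per source address within window_s seconds."""

    def __init__(self, max_packets, window_s):
        self.max_packets = max_packets
        self.window_s = window_s
        self.history = {}  # source address -> recent arrival times

    def allow(self, source, now):
        recent = [t for t in self.history.get(source, [])
                  if now - t < self.window_s]
        recent.append(now)  # denied packets still count against the source
        self.history[source] = recent
        return len(recent) <= self.max_packets

limiter = PerSourceLimiter(max_packets=3, window_s=10.0)
bob = "192.0.2.10"  # documentation address standing in for Bob

# The bad guy sends three packets with Bob's address forged as the source.
forged = [limiter.allow(bob, now=t) for t in (0.0, 0.1, 0.2)]

# Bob's one legitimate request arrives a second later and is refused.
bobs_request = limiter.allow(bob, now=1.0)
print(forged, bobs_request)
```

Three small packets are enough to lock Bob out for the rest of the window, which is why the attack doesn't stand out in the traffic.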


Is there any good writeup on why BCP-38 is so hard to implement and/or why it 
isn't implemented more often?  I assume it's money.  Is the problem routers 
can't do it?  (fast enough)  Or maybe ISPs don't have their act together?


-- 
These are my opinions.  I hate spam.





Re: Zero-configuration ntpd

2016-07-03 Thread Hal Murray

g...@rellim.com said:
> Default for statistics can be no stats gathering.
> Agreed.  They just grow forever.  Ditto ntp.log that should default to the
> system syslog.

The main log file does default to syslog

> Off-topic: ntpd should have a max number of saved logs.

The default is no log files.  I don't think ntpd should get involved with 
deleting anything.  If nothing else, it's an insecurity opportunity.  Debian 
has a cron job to do it.  (I have to kill it since I want them saved forever.)




-- 
These are my opinions.  I hate spam.





Re: asciidoc tables

2016-07-03 Thread Hal Murray

e...@thyrsus.com said:
> Sadly, proportional is all you can do in the table model of XML-DocBook
> (which is what asciidoc uses as a back end). 

Can I specify the total width in characters?

Can we assume the width is appropriate for a man page?  That might look ugly 
with narrow or wide web pages but it will probably be better for the typical 
case.  (at least the way I read web pages)




-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: question about upgrading from Classic to NTPsec (packaging issue)

2016-07-04 Thread Hal Murray

j...@systemsartisans.com said:
> I am in the process of trying to create an RPM package from the repo's
> current head.  Given that I would expect this to be used by sysadmins, etc.
> who might already have installed the Classic version (very possibly from
> their distro's package sources), how would you all suggest I treat the
> presence of a ''conflicting'' version?  Save the config files, and overwrite/
> remove all the old executables?  Move everything aside, and put our entire
> "best practices" package in place??  Ask for manual intervention???  Halt,
> catch fire, burn it all to the ground?  ;) 

So far, the config files are compatible so it makes sense to leave the old 
one alone if it has been edited.  (The admin might have picked some good 
servers or set up logging.)

If there is a conflict, my suggestion would be to rename the old stuff to 
classic-xxx and install the new stuff as ntpsec-xxx and set up links and 
provide a script to swing the links.  Or something like that.  There probably 
needs to be a script to uninstall the classic version and undo the links.
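
A rough Python sketch of the "swing the links" idea follows.  The file names classic-ntpd and ntpsec-ntpd and the helper swing_link are hypothetical, and the demo runs in a scratch directory rather than a real install location.

```python
import os
import tempfile

def swing_link(link_path, target):
    """Point link_path at target, replacing any existing link atomically."""
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link_path)    # rename() over the old link

# Demonstration in a scratch directory instead of a system directory.
demo = tempfile.mkdtemp()
for name in ("classic-ntpd", "ntpsec-ntpd"):
    with open(os.path.join(demo, name), "w") as f:
        f.write(name + "\n")

link = os.path.join(demo, "ntpd")
swing_link(link, "ntpsec-ntpd")       # install: use the new version
first = os.readlink(link)
swing_link(link, "classic-ntpd")      # switch back
second = os.readlink(link)
print(first, second)
```

Building the new link under a temporary name and renaming it over the old one means there is never a moment when the ntpd name is missing.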

Where would you document what happened and/or how to switch back?


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Kernel PPS processing

2016-07-04 Thread Hal Murray

strom...@nexgo.de said:
> On another tangent back to NTP, I'm wondering if it wouldn't make sense to
> offload the timestamp filtering at least to the VC4.  Most NTP boxes would
> run headless anyway, so there'd be 16 processors sitting idle for that sort
> of thing. 

Not likely.  That sort of work doesn't take enough CPU cycles to worry about.  
If you put it someplace else, it just makes things harder to maintain and 
debug.


>> Yeah, I'm wondering why the delay in Linux kernel for 64 bit A8?
> It wouldn't buy anyone anything of immediate use except having a complete
> additional distro to build and maintain and more memory pressure to deal
> with.  I suspect that eventually a 64bit port will be added anyway to tick a
> checkbox somewhere. 

If you don't need them, 64 bit pointers just take up more memory.  That shows 
up as cache misses.

There are 2 reasons that I know of for needing 64 bit pointers.  The first is 
that you are using more than 32 bits of virtual memory.  The second is that 
your hardware can't address all of physical memory when running in 32 bit 
mode.  If your system uses the top bit for I/O, then you can only use 2 GB of 
physical memory.



-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Kernel PPS processing

2016-07-05 Thread Hal Murray

g...@rellim.com said:
> The big thing for NTP and gpsd would be the 64 bit math.  Both do a lot of
> 64 bit math. 

You can do 64 bit arithmetic without using 64 bit pointers.

Somebody mentioned that the plan is to have one boot file that runs on all 
Raspberry Pis.  Are things set up so that user code builds that way too?  If 
so, it might take a magic option to the compiler to get it to use the 64 bit 
instructions.
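
As an illustration that arithmetic width is independent of pointer width, here is a small Python sketch that adds two 64 bit values held as pairs of 32 bit words.  This is roughly what a compiler generates for 64 bit integers on a 32 bit target.  The function name and the sample timestamp are made up.

```python
MASK32 = 0xFFFFFFFF

def add64(a_hi, a_lo, b_hi, b_lo):
    """Add two 64-bit values held as (high, low) 32-bit words."""
    lo = a_lo + b_lo
    carry = lo >> 32                      # 1 if the low words overflowed
    lo &= MASK32
    hi = (a_hi + b_hi + carry) & MASK32   # wraps like unsigned 64-bit
    return hi, lo

# An NTP-style timestamp: 32 bits of seconds, 32 bits of fraction.
t_hi, t_lo = 3676000000, 0xC0000000       # some second, plus 0.75 s
d_hi, d_lo = 0, 0x80000000                # add 0.5 s
result = add64(t_hi, t_lo, d_hi, d_lo)
print(result)
```

The fraction wraps around and the carry bumps the seconds word, all with 32 bit operations.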

-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Requesting code review on possible fix for nopeer/pool conflict

2016-07-05 Thread Hal Murray

dfoxfra...@gmail.com said:
> The whole receive() function you're looking at is about to get blown away in
> my ntp_proto refactor. Can you hold off on touching it until next week? 

Please don't push any big changes until Eric and/or I get the polling tangle 
fixed.


dfoxfra...@gmail.com said:
> One more reason I need to get my ACL language implemented and restrict needs
> to die. 

If you kill restrict, we are taking a major step toward making the ntp.conf 
file no longer compatible.

Would it be possible for your new code to support the old restrict stuff?  
(Similar to the way Eric's new refclock stuff still works with the old stuff.)

---

> I think this is working as designed. 'restrict nopeer' means "Don't establish
> unauthenticated ephemeral associations with this IP address", which is
> exactly what pool does. I agree this is stupid design but I don't think it's
> a bug.

I think the current setup is buggy, but maybe that whole area is more 
complicated than I currently understand.

Maybe restrict needs a nopool tag so we don't get confused by peer vs pool.

There is a fundamental problem here.  What should happen if server/pool and 
restrict lines conflict?  Is server DNS different from server IP Address?  My 
straw man is that a restrict line with explicit IP Address(es) should block 
server/pool addresses but the default restrict should not.

I'd also vote for conflicts to generate error messages at startup time and/or 
at DNS lookup time.  The DNS lookup could try the next address and/or try 
again later (after TTL).
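
Here is a Python sketch of that straw man, just to pin down what I mean.  The function name, the (network, is_default) representation and the addresses are hypothetical; none of this is how ntpd stores restrict entries.

```python
import ipaddress

def blocked_by_restrict(server_addr, restrict_nets):
    """Straw man: only explicit restrict entries block a server/pool address.

    restrict_nets is a list of (network, is_default) pairs.  Entries that
    came from "restrict default" are ignored for configured servers.
    """
    addr = ipaddress.ip_address(server_addr)
    for net, is_default in restrict_nets:
        if is_default:
            continue
        if addr in ipaddress.ip_network(net):
            return True
    return False

restricts = [
    ("0.0.0.0/0", True),          # from a "restrict default ..." line
    ("203.0.113.0/24", False),    # explicit restrict line
]

# Addresses as they might come back from a pool DNS lookup.
candidates = ["198.51.100.20", "203.0.113.5"]
usable = [a for a in candidates if not blocked_by_restrict(a, restricts)]
for a in candidates:
    if a not in usable:
        print("conflict: %s matches an explicit restrict, trying next" % a)
print(usable)
```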

-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Requesting code review on possible fix for nopeer/pool conflict

2016-07-05 Thread Hal Murray

dfoxfra...@gmail.com said:
> What exactly is the "polling tangle" you're referring to? I talked to Eric
> about this earlier today, and he mentioned something about the polling
> interval drifting to 1024 seconds on a consistently reachable server. But
> AFAIK, nothing has changed and that's always been exactly the intended
> behavior, as set by the "NTP_MAXDPOLL" constant. 

The problem is that the ramp up on polling interval is happening on 
refclocks.  Maybe only on PPS refclocks.


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Refclock polling ramp up

2016-07-06 Thread Hal Murray
I just pushed a fix.  Would you please sanity check...


For servers, minpoll and maxpoll default to 6 and 10.  There is also a check 
to make sure that minpoll isn't greater than maxpoll.

For refclocks, minpoll defaults to 6 and maxpoll defaults to minpoll.

The problem is that you were storing maxpoll and friends in a peer struct and 
by the time you were checking to see if it needed a default, it had already 
been defaulted by newpeer so the test was testing clobbered data.
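
In Python-ish form, the intended defaulting looks something like the sketch below.  The UNSET marker and the function name are made up, and clamping maxpoll up to minpoll is my guess at what the sanity check should do.  The point is that the test has to see the values as written in the config file, before anything else fills them in.

```python
NTP_MINDPOLL = 6      # default minpoll (64 seconds)
NTP_MAXDPOLL = 10     # default maxpoll (1024 seconds)
UNSET = None          # hypothetical "not given in ntp.conf" marker

def resolve_poll(minpoll, maxpoll, is_refclock):
    """Fill in poll defaults from the values as written in the config."""
    if minpoll is UNSET:
        minpoll = NTP_MINDPOLL
    if maxpoll is UNSET:
        # Refclocks stay at minpoll; network servers may ramp up.
        maxpoll = minpoll if is_refclock else NTP_MAXDPOLL
    if minpoll > maxpoll:
        maxpoll = minpoll     # assumed behavior of the sanity check
    return minpoll, maxpoll

print(resolve_poll(UNSET, UNSET, is_refclock=False))
print(resolve_poll(UNSET, UNSET, is_refclock=True))
print(resolve_poll(4, UNSET, is_refclock=True))
```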

-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Do we have a list of user visible changes from ntp classic?

2016-07-06 Thread Hal Murray

It's probably all in NEWS (or should be), but that's chronological and seems 
hard to read.  For example, the deleted refclocks are scattered all over the 
place.

I think I'm suggesting something like CHANGES-from-ntp-classic


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Removing interleaved mode

2016-07-06 Thread Hal Murray

dfoxfra...@gmail.com said:
> With Eric's permission, I have removed support for interleaved mode in my
> proto-refactor branch. Here is its commit-message eulogy: 

Seems fine with me.  I've never used it.

We should test things to make sure nothing strange happens.  I think that 
requires 4 systems: 2 new and 2 old.  (I guess it could be done with 2 if you 
are willing to run the tests serially.)

This might be a good candidate for Pis with GPS HATs.  It might take the 
kernel PLL to notice the difference.

-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Linux capabilities check broken on NetBSD

2016-07-06 Thread Hal Murray

On NetBSD:
07-06T15:42:17 ntpd[4940]: root can't be dropped due to missing capabilities.


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Weirdest bug yet.

2016-07-06 Thread Hal Murray

No rawstats or protostats either.

e...@thyrsus.com said:
> It was adding "subtype" as an alias for "mode" in the lexical analyzer. This
> somehow confuses the crap out of the parser's FSM. ...

Remember the saveconfigandquit stuff you ripped out?  That would have caught 
this.  (if we had used it)

How are we going to test the parser?

Can we replace ntp_config (and most of ntpd) with a skeleton that catches all 
the call-outs and prints stuff?

--

Are there any other aliases in the grammar?

Ahh.  I see this in ntpd/keyword-gen.c
{ "subtype",T_Mode, FOLLBY_TOKEN },

How about making a T_Subtype and adding it to ntpd/ntp_parser.y so that it 
does the same thing?



-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Linux capabilities check broken on NetBSD

2016-07-07 Thread Hal Murray

matthew.sel...@twosigma.com said:
> NetBSD should be using the clockctl interface:
> http://netbsd.gw.com/cgi-bin/man-cgi?clockctl+4.i386+NetBSD-7.0 

Thanks.

Eric, I should probably fix it since I have a test case.

Should we add HAVE_SYS_CLOCKCTL to waf, or just test for __NetBSD__?


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Linux capabilities check broken on NetBSD

2016-07-07 Thread Hal Murray
> Attempted port fix pushed. Please test.

Missing the _H on HAVE_SYS_CLOCKCTL

Fix pushed.  More testing in the pipeline.


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Requesting review of "Eliminate some pointless gymnastics in the config parser."

2016-07-08 Thread Hal Murray

e...@thyrsus.com said:
> About eight hours ago I removed some code that looked so stupid that I now
> wonder if it was serving some purpose I don't understand. 

I don't know of any reason for the old code.

Your change looks sane to me.  I don't see how to fully test it.  The iburst 
case seems to still work.


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Any important bugs/quirks?

2016-07-08 Thread Hal Murray
Things have been a bit, well, "interesting" the past few days.  I think 
everything has been put back together.  Is there anything that needs fixing 
that I/we have missed?  (I'm not looking for new features we haven't 
implemented yet, just things that we broke or things we changed that don't 
work right yet.)

--

Is this a good time for a release?  (before we break anything again)


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


SIGHUP catcher, issue #78

2016-07-10 Thread Hal Murray

I just pushed code that catches SIGHUP and reopens the log file if it has 
changed and checks for a new leapseconds file.  You can poke it by hand with
  killall -HUP ntpd
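
For anybody who wants to see the shape of it, here is a Python sketch of the same pattern (the real code is C inside ntpd).  The class and function names are made up, and it only covers the log file half, not the leapseconds check.

```python
import os
import signal
import tempfile
import time

got_sighup = False

def on_sighup(signum, frame):
    # Do as little as possible in the handler; just note the request.
    global got_sighup
    got_sighup = True

signal.signal(signal.SIGHUP, on_sighup)

class LogFile:
    """Log file that can be reopened after logrotate renames it."""

    def __init__(self, path):
        self.path = path
        self.fh = open(path, "a")

    def changed(self):
        # True if the path is missing or now names a different inode.
        try:
            on_disk = os.stat(self.path)
        except FileNotFoundError:
            return True
        ours = os.fstat(self.fh.fileno())
        return (on_disk.st_dev, on_disk.st_ino) != (ours.st_dev, ours.st_ino)

    def reopen_if_changed(self):
        if self.changed():
            self.fh.close()
            self.fh = open(self.path, "a")
            return True
        return False

def service_sighup(log):
    """Call from the main loop; returns True if the log was reopened."""
    global got_sighup
    if not got_sighup:
        return False
    got_sighup = False
    return log.reopen_if_changed()

# Demo: simulate logrotate renaming the file, then "killall -HUP".
demo_dir = tempfile.mkdtemp()
log = LogFile(os.path.join(demo_dir, "ntpd.log"))
os.rename(log.path, log.path + ".1")
os.kill(os.getpid(), signal.SIGHUP)
time.sleep(0.01)      # give the interpreter a chance to run the handler
reopened = service_sighup(log)
print(reopened)
```

Comparing device and inode numbers is one way to tell "has changed"; if nothing was rotated, the HUP is harmless.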

We should get a chance to test the new leap file stuff soon.  It's time for a 
new one.  Besides, a day or two ago, the news said they announced one for the 
end of this year so we'll get to test the run time too.

I think there is a script to fetch a new leap file.  This could be added to 
it.

For logrotate on Linux, you can put a fragment like
this in /etc/logrotate.d/ntpd

/var/log/ntp/ntpd.log {
monthly
postrotate
  /usr/bin/killall -HUP ntpd
endscript
rotate 
}


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Sandboxing: How important is seccomp?

2016-07-11 Thread Hal Murray
[For those not familiar with it, seccomp gives the kernel a list of syscalls 
that the program is allowed to use.  All others become illegal.  So if a bad 
guy finds a stack overflow there is (hopefully) a good chance that any code 
he tries to run will crash.]

I've got it working on Intel.  It doesn't work on ARM.  More below.

This is likely to have a long tail of cases that I can't test.  There is an 
interesting tangle of which syscalls are used by various combinations of 32 
vs 64 bit and the age of libc and kernels.  I've been adding things to the 
list as I discover them.  There are probably obscure combinations used by 
distros that I can't test and/or refclocks that I can't test may use strange 
calls.

There is also the chance that things will change out from under us.  For 
example, getrandom was added to the 3.17 kernel.  So if you updated your 
kernel and glibc our ntpd would crash until we updated it or you disabled 
seccomp at runtime.


Assuming the environment supports them...

Should we set things up so that droproot and seccomp are required at build 
time unless you explicitly disable them?  That just requires installing the 
required packages.

Should we set things up so that droproot is required at runtime?

I assume we add options to disable any runtime checks.
(Can we use -u root:root to say no-thanks?)

Both work on Linux.  NetBSD supports droproot.

-

On ARM, it dies before it gets to our code.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Cannot access memory at address 0x0

Program received signal SIGILL, Illegal instruction.
0x76dabde8 in ?? () from /usr/lib/arm-linux-gnueabihf/libcrypto.so.1.0.0
(gdb) bt
#0  0x76dabde8 in ?? () from /usr/lib/arm-linux-gnueabihf/libcrypto.so.1.0.0
#1  0x76da84b4 in OPENSSL_cpuid_setup ()
   from /usr/lib/arm-linux-gnueabihf/libcrypto.so.1.0.0
#2  0x76fdeffc in call_init (l=, argc=5, argv=0x7efffd34, 
env=0x7efffd4c) at dl-init.c:78
#3  0x76fdf0d8 in _dl_init (main_map=0x76fff958, argc=5, argv=0x7efffd34, 
env=0x7efffd4c) at dl-init.c:126
#4  0x76fcfd84 in _dl_start_user () from /lib/ld-linux-armhf.so.3
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 



-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Anybody know how to debug things like this?

2016-07-14 Thread Hal Murray
I'm working on seccomp.  I'm at the stage where things mostly work and I'm 
trying to find obscure code paths that use a syscall that isn't yet on the OK 
list.

The SIGSYS means it tried to call something that wasn't on the list.  
Normally, a simple backtrace will let me figure out what it is and add 
the appropriate call to the list.

It might be in the magic part of creating a new thread, but that has been 
working for months.

Program received signal SIGSYS, Bad system call.
0x41d810d8 in clone () from /lib/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.16-34.fc18.i686 
libattr-2.4.46-7.fc18.i686 libcap-2.22-5.fc18.i686 libgcc-4.7.2-8.fc18.i686 
libseccomp-1.0.1-0.fc18.i686
(gdb) bt
#0  0x41d810d8 in clone () from /lib/libc.so.6
#1  0x0001 in ?? ()
#2  0xb7fcbb40 in ?? ()
#3  0x in ?? ()
(gdb) info threads
  Id   Target Id Frame 
* 1Thread 0xb7fcc6c0 (LWP 18519) "ntpd" 0x41d810d8 in clone ()
   from /lib/libc.so.6
(gdb) 



-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Anybody know how to debug things like this?

2016-07-14 Thread Hal Murray
> Seems like a situation made for investigating with Mozilla rr.

Could you please say a bit more?  I don't know anything about Mozilla rr.  
Why is that likely to help me in this case?


I think I have tracked down the problem.  It's trying to start a new thread.  
The clone syscall wasn't on the list, but we have started new threads before.  
The catch is that the sandboxing doesn't get called until late in the 
initialization procedure.  So the first thread gets created before sandboxing 
turns off the clone syscall.  If you trigger a second thread, that one dies.  
(My test case for that is closing the lid on a laptop for a short time.)
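
Here is a toy Python model of the ordering problem.  It is not seccomp, just an allow-list with an "active" switch, and all the names are made up.  It shows why the first thread works and the second one dies unless clone is on the list.

```python
class Sandbox:
    """Toy stand-in for a seccomp filter: an allow-list of syscall names."""

    def __init__(self, allowed):
        self.allowed = set(allowed)
        self.active = False

    def syscall(self, name):
        if self.active and name not in self.allowed:
            raise RuntimeError("SIGSYS: %s not on the list" % name)
        return name

def start_thread(sandbox):
    # Creating a thread ends up in clone() underneath pthread_create.
    return sandbox.syscall("clone")

def run(allowed):
    sb = Sandbox(allowed)
    events = []
    events.append(start_thread(sb))       # first DNS helper, during config
    sb.active = True                      # sandbox turned on late in init
    try:
        events.append(start_thread(sb))   # second helper, started later
    except RuntimeError as err:
        events.append(str(err))
    return events

without_clone = run({"read", "write", "recvmsg"})
with_clone = run({"read", "write", "recvmsg", "clone"})
print(without_clone)
print(with_clone)
```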

It seemed like a good idea to move the sandboxing up earlier.  The catch is 
that most of the initialization will get done as ntp rather than root, so 
permissions on devices for refclocks and files like ntp.keys need to be fixed.

Anybody see any problems with that?  I think old versions and ntp classic 
will still work running as root.

---

I figured out the problem with gdb on Raspberry Pi.  It got an illegal 
instruction from SSL before calling main.  It's also got a signal handler.  I 
assume it's a run time test for something.  It works if you continue.

The seccomp stuff doesn't work on Raspberry Pi.  I haven't figured out why.


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Anybody know how to debug things like this?

2016-07-14 Thread Hal Murray

e...@thyrsus.com said:
> It's like a symbolic debugger that keeps an execution trace and lets you
> step backwards in time. Under rr you could induce the crash, then step back
> to the last syscall. 

I don't think that's going to help.  I'm in a signal handler from the current 
attempted syscall.  What I need is to get the call number.  Things are 
confused since the low level thread stuff is magic.  (at least to me)  
Normally I can just get it from the name of the next frame on the stack.  
Sometimes glibc uses old or different syscalls.  So far, I have always been 
able to figure things out.  For some of the thread stuff, I had to look at 
the source for pthread_create, but google found it on github.


> I'm worried about it.  Not for any specific reason, it just trips my
> "Danger! Danger!" sensors.  I've got a bad feeling that it might be one of
> those 'innocuous' changes that come back to bite us in the ass.

Me too, but I can't see any solid reason.

The lock in memory needs to be moved up too.  (Only root can ask for 
everything, or we need another capability or ...)

I'll keep testing it.  It makes finding syscalls used by threads/DNS easier, 
and we'll need that in case the pool stuff decides it needs more servers.

> I want to ask a different question: why the early thread launch? Can we move
> that?   

It's getting called from config_peers in ntp_config.  I don't know of any 
reason why we couldn't move it.


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Anybody know how to debug things like this?

2016-07-15 Thread Hal Murray

e...@thyrsus.com said:
> The only safe alternative would be to force the initial DNS lookups to be
> synchronous. 

That doesn't work for the pool case.  It wants to get more servers if some of 
the ones it is using stop responding.

> A: get configuration (that's the early thread launch)

We could split the DNS lookups out of that.  It would add a new step to your 
list.

> E. initializing = false
> Moving F and B is no problem, but the others worry me - especially the
> setting of initialize, which does things I don't understand to the protocol
> machine.  My spider sense is tingling.

The DNS lookups may take a long time, so starting them a bit later should be 
OK.

It wouldn't surprise me if we discovered problems.  That's a mixed blessing.  
Finding problems is always good, but we would also have to fix them.



I've found 2 more quirks associated with early sandbox.

The PID file needs to be writable by ntp.  That should be easy to fix - just 
some scripting before starting ntpd.  systemd doesn't have that scripting, 
but it doesn't use a pid file.

The other is that NetBSD won't let ntp open wildcard sockets.  It may be 
sockets with port# less than 1024.  I don't have a solution but I haven't 
looked very hard.



-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Anybody know how to debug things like this?

2016-07-16 Thread Hal Murray

e...@thyrsus.com said:
> I'm in favor of cleaning up and fixing some of these order dependencies, but
> I'd rather get us to a safe and functioning state first.  Accordingly,
> splitting out seccomp() implementation to do it early and keeping droproot
> late is looking better and better. 

I found another case that doesn't work with early drop root:  opening the 
first 2 SHM slots.

I'll add an option to use early drop root and push what I have.



When it gets to the top of the list, we should clean up the SHM handshake.  If 
we make the handshake use 2 counters rather than a ready flag, the read side 
can be read-only and this sort of problem will go away.  That lets us have 
multiple users so we can run debugging/monitoring code in parallel with ntpd.  
That will either take a command line switch to gpsd or it will have to set up 
duplicate SHM slots (with different names).
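
A rough Python sketch of a two-counter handshake is below.  The field names begin/end and the helper functions are made up and this is not the current SHM layout.  Real C code would also need memory barriers around the counter accesses.

```python
class ShmSlot:
    """Stand-in for one shared-memory segment."""

    def __init__(self):
        self.begin = 0    # bumped by the writer before it touches data
        self.end = 0      # set equal to begin after the data is complete
        self.sec = 0
        self.nsec = 0

def write_sample(slot, sec, nsec):
    slot.begin += 1
    slot.sec = sec
    slot.nsec = nsec
    slot.end = slot.begin

def read_sample(slot, tries=3):
    """Return (count, sec, nsec) or None; never writes to the slot."""
    for _ in range(tries):
        end = slot.end                    # read "end" first
        sec, nsec = slot.sec, slot.nsec
        begin = slot.begin                # read "begin" last
        if begin == end:
            return end, sec, nsec         # no write overlapped the copy
    return None

slot = ShmSlot()
write_sample(slot, 1467590400, 5)
ntpd_view = read_sample(slot)             # e.g. ntpd
monitor_view = read_sample(slot)          # e.g. a debugging tool
print(ntpd_view, monitor_view)
```

Each reader remembers the last count it saw to tell whether a sample is new.  Since nobody on the read side clears a flag, any number of readers can share the slot and it can be mapped read-only.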

-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Odds and ends...

2016-07-17 Thread Hal Murray

There is some ugly code in ntp_loopfilter that's setting up a signal handler 
in case ntp_adjtime doesn't work.  It's the sort of stuff Eric loves to rip 
out.

I can't figure out why that code would be useful.  I expect we should figure 
that out at build time.

I've commented it out.  It's still working on all of the systems I have 
access to.
If it blows up for you, please tell us what type of system you are running on.



Does anybody have any experience with seccomp?  How useful is it?

I think it's Linux only.  The idea is to tell the kernel what system calls the 
program uses so that if a bad guy finds something like a stack overflow, the 
program will die if his exploit uses any other syscall.

We inherited code from ntp classic, but it didn't work.  I poked around.  It 
needs a library.  I've got it working on Intel.  It builds on ARM, but gets an 
Invalid argument error at runtime.

I've pushed two configure time options that let us test seccomp.  The catch is 
that it's not simple to figure out which syscalls are actually used.  A lot of 
that stuff is hidden in libc and friends.  I've been adding them as I discover 
them.  It's working on all the systems I can test on.
  --enable-seccomp
turns on that code.

Testing encouraged.  If it crashes for you, please run it from gdb and send me 
a backtrace.

DNS lookup uses a blizzard of syscalls, many involved with threads.  Normally, 
the DNS helper thread gets started before seccomp is activated.  That makes it 
hard to test the syscalls needed by pthread_create.  You have to wait for the 
thread to time out and then for the pool logic to try again which starts a new 
thread.
  --enable-early-droproot
does the drop root before reading the config file.  That turns on seccomp early 
enough to catch creating the first DNS helper thread.  User ntp has to be able 
to access refclocks and append to existing log/stats files which may be owned 
by root.  That's probably a good idea anyway.

There are two known cases where early drop root doesn't work.  One is on NetBSD.  
Opening sockets doesn't work.  I haven't checked carefully.  I assume it's 
checking for port numbers less than 1024 or such.

The other is SHM.  It can't access the first two slots.  That can be fixed, but 
I don't know of a quick workaround.



-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: adev.py

2016-07-21 Thread Hal Murray
> Are you saying the unix time stamp result in the output is wrong?

I didn't look that far.



-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Removing the worst cruft

2016-07-23 Thread Hal Murray

e...@thyrsus.com said:
> But AUSTRON/IRIG/CHU...I think there's a good (though not absolutely
> dispositive) case for simply dropping them all.

The Austron driver uses Loran.   It was unplugged in the US several years 
ago.  I think it's still used in Northern Europe.  It may come back in the US 
as a backup for GPS.



Is this a good time to set up a procedure for second class refclocks?  Or 
think about how to do it?

I haven't given this a lot of thought.  The idea is to make it easy to add 
drivers for hardware we don't support directly.

I think there are two things we would have to do.  One is to keep track of 
names.  The other is to set up and document a recipe for adding a driver.

Handwave.  My straw man is that the keep-track of numbers part means that we 
maintain everything but the code and documentation for a driver.  I think 
that's just a table entry in pylib/refclock.py and another in 
ntpd/refclock_conf.c.  It would be nice to teach waf to make man/web pages for 
optional drivers.

It's possible that some git magic would simplify most of that, but I don't 
know how to do it.  Maybe we just maintain a comment that git can replace.  ??

If you use dumbclock as an example, I'll adopt the IRIG driver as a sanity 
check.



How many of the current drivers can we test?  Maybe we should move all the 
others to second class status.

We need a web page with a status slot for each driver.

We should pick one driver to use as an example.  Simpler is better.

If you do dump drivers that have survived this far, I think it would be 
better to move them to another git repository where they are still visible 
but don't clutter up our main world.

Plan B is to not waste time on any of that and save our energy for the great 
SHM cleanup, or whatever.


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Removing the worst cruft

2016-07-23 Thread Hal Murray

g...@rellim.com said:
> Several commercial NTP products do it, we want them to convert from NTP
> Classic to NTPsec. 

Are they sending IRIG or listening to it?


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Removing the worst cruft

2016-07-23 Thread Hal Murray

e...@thyrsus.com said:
> No, you were right the first time - and it's something I should have
> noticed. That driver is designed for an obsolete class of sound card. 

I don't know much about audio.

What is the right API to use?  All we need is a batch of samples and the time 
they arrived.


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Removing the worst cruft

2016-07-23 Thread Hal Murray

fallenpega...@gmail.com said:
> What I am wishing for, would be for someone to write a standalone in its own
> daemon process IRIG driver, that then speaks GPSD or SHM to NTPsec. But
> testing such a beast would be a specialized task. 

I think things are much more complicated than it seems.  That doesn't 
actually solve any problems, just pushes them over where they aren't visible 
unless you go looking for them.

It requires a public API and all the version support/coordination problems.  
We should be trying to make life simpler for sysadmins, not more complicated.

That isn't to say that cleaning up the refclock interface isn't a good idea, 
just that "for someone to write..." is probably going to cause more 
trouble than it solves.  This area needs some serious thought before we 
start writing code.

-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Removing the worst cruft

2016-07-24 Thread Hal Murray

e...@thyrsus.com said:
> According to Wikipedia LORAN is dead. The principal station chains shut down
> in 1979-1980. Last live use was in China in the 1990s.

> What you are probably thinking of is DECCA, which was a hyperbolic radio
> navigation system (very similar operating principle to LORAN but better
> accuracy) deployed out of Great Britain with several station groups
> elsewhere in Northern Europe.  It shut down in 2000.  A Japanese station
> group continued operation until 2001. 

I think DECCA is/was way way old.

I was thinking of eLoran.  Looks like Britain pulled the plug on their work at 
the end of last year after France and Germany (and Norway?) bailed.


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Possible cleanup

2016-07-24 Thread Hal Murray

There is a SAVE_ERRNO macro that wraps around some code to preserve errno.  
It's only used in a few places.  The first place I saw was calling msyslog.  
That would make sense if following code did something that depended on the 
error, but I checked all 4 cases and they never looked at errno.  They are 
all in ntp_refclock.

Looks like we can rip it out.  But that's not the sort of code that's easy to 
test so we should be careful.

SAVE_ERRNO uses a socket_errno() macro to get errno.  That looks like it's a 
hook to work with Windows.  The answer gets put back into errno.

Ahhh.  The SAVE_ERRNO macro isn't being used to save errno but rather to copy 
the Windows error over to errno where %m in msyslog can get it.

I wonder if we can push that into msyslog.  Looks like it's already there.

So either we can rip it out, or we have to fix all the other places that use 
%m.

(Or I haven't analyzed things correctly.)



-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Removing the worst cruft

2016-07-24 Thread Hal Murray

e...@thyrsus.com said:
> Drivers that very well might fail the ten-year test: truetime, magnavox,
> palisade, oncore, jupiter. 

Palisade is in use.  It covers Trimble TSIP which includes the Thunderbolt 
which was widely available surplus only a few years ago and is popular with 
time-nuts.  It should probably be renamed to Trimble.

There are several sub-drivers/modes to cover different Trimble models which 
use different subsets of the full TSIP protocol.

There is also some TSIP code in the generic/parse driver.


The oncore driver covers Motorola which is/was very common.  They put out a 
long sequence of chips.  Many of them are way old, but probably don't add 
much to the driver.  The M12/M12+ was popular and state of the art only a few 
years ago.

Motorola sold off their GPS business.  I forget who bought it.  Somebody in 
Asia.  It may have been sold again.  I think the M12+ is still in production 
but the name may have changed and/or they may have updated it again.


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Re: Removing the worst cruft

2016-07-31 Thread Hal Murray
> Can the palisade/trimble driver be replaced with a parse driver?

I doubt it, but I'm far from familiar with the parse driver.

Based on Eric's previous comments, the parse driver handles devices that 
provide the time in an easy to parse format.  TSIP might fit that if all goes 
well.

But there are many variations of TSIP.  One covers reversing the normal PPS 
operation.  Instead of needing kernel support to time stamp the PPS pulse, 
you send it a pulse by flapping one of the modem control signals and it tells 
you the time that happened.

My vote would be to not rock that boat.  There are more important things to 
work on.


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
http://lists.ntpsec.org/mailman/listinfo/devel


Kernel PLL graphs

2016-08-01 Thread Hal Murray

There are two parts to PPS processing in the kernel.  RFC 2783 describes an 
API for capturing time stamps.  RFC 1589 describes a PLL that lives in the 
kernel.

Most Linux distros don't support RFC 1589.  The code is in the kernel source, 
but it isn't enabled in the kernels they ship.  It requires !NO_HZ, and most 
distros build with NO_HZ.
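
One way to check a given kernel is to look at its build config.  This is a
minimal sketch that assumes you have the .config text in hand and that the
option is spelled CONFIG_NTP_PPS, as in the Kconfig entry quoted earlier in
this thread.  The helper names are made up.

```python
def config_enabled(config_text, option):
    """Return True if a kernel .config text sets CONFIG_<option> to y."""
    wanted = "CONFIG_%s=y" % option
    return any(line.strip() == wanted for line in config_text.splitlines())


def kernel_pll_available(config_text):
    """True when the in-kernel PPS consumer (NTP_PPS) was built in."""
    return config_enabled(config_text, "NTP_PPS")
```

Where the config text lives varies by distro (a file under /boot or
/proc/config.gz are common places), so reading it is left to the caller.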

I pulled over the sources and built my own kernel.

Here are the before and after graphs:
  http://users.megapathdsl.net/~hmurray/ntpsec/PPS-kernel.png
The data is from two separate days so this isn't a clean comparison.  I don't 
know what that machine was doing on either day.

Here is a zoom in on the Kernel PLL day.
  http://users.megapathdsl.net/~hmurray/ntpsec/PPS-kernel2.png
Note that the peak offset is less than a microsecond.



We should see if we can get similar results on a Raspberry Pi.  I haven't tried 
building an ARM kernel.

I think we should be able to run the PLL code outside the kernel.  The PPS time 
stamp is key.  The PLL calculations don't need to be run in the kernel.  They 
need to be run soon after the PPS, but not immediately at interrupt level.  The 
API has an option to wake up on PPS.  I don't know if it is implemented on Linux.
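
To make the idea concrete, here is a toy user-space loop.  It is a plain
proportional-integral controller, not the RFC 1589 algorithm, and the gains
are arbitrary values picked so the simulation settles.  It takes one PPS
offset per second and returns a frequency trim.

```python
class PiLoop:
    """Proportional-integral loop: PPS phase offset in, frequency trim out."""

    def __init__(self, kp=0.5, ki=0.1):
        self.kp = kp
        self.ki = ki
        self.integral = 0.0   # accumulated estimate of the frequency error

    def update(self, offset):
        """offset: clock minus PPS, in seconds.  Returns a fractional trim."""
        self.integral += self.ki * offset
        return -(self.kp * offset + self.integral)


def simulate(freq_error=10e-6, start_offset=1e-3, seconds=200):
    """Run the loop against a clock with a fixed frequency error."""
    loop = PiLoop()
    offset = start_offset
    for _ in range(seconds):          # one PPS per simulated second
        trim = loop.update(offset)
        offset += freq_error + trim   # phase drift over the next second
    return offset, loop.integral
```

In the simulation the offset decays toward zero and the integral term settles
at the clock's frequency error.  A real implementation would also have to
cope with jitter, missing pulses and the latency between the pulse and the
calculation.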

The no-PLL test was run at the default maxpoll of 6.  I should try faster.  I 
also need a standard test load.

I remember various FreeBSD-is-better type comments from many years ago.  I 
don't know if the PLL was working in Linux at the time.  I should set up a test 
case.


-- 
These are my opinions.  I hate spam.





Re: drift

2016-08-03 Thread Hal Murray

g...@rellim.com said:
> 1.  On startup chronyd checks the time stamp on the drift file.
> if the timestamp > sysclock, the sysclock is set to the timestamp 

I vote that we don't do anything, not even make it optional behind a command 
line switch.

We have more important things to do.

The OS should be doing that sort of thing, probably using the root directory.
Why stop with the drift file?  Should we check the log files too?

It's the sort of code that is hard to test and likely to have subtle problems.

I think it's a good item to put on the what-do-customers-want list.
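
For reference, the check being proposed amounts to something like the sketch
below.  It only measures how far the system clock is behind a file's
modification time.  It does not set the clock, and the function name is made
up.  This is not chronyd's code.

```python
import os
import time


def seconds_behind_file(path, now=None):
    """Return how many seconds the system clock lags the file's mtime.

    A result <= 0 means the clock is not behind the file.
    """
    if now is None:
        now = time.time()
    return os.path.getmtime(path) - now
```

Even this small version shows the testing problem: the result depends on the
filesystem, on who last touched the file, and on the clock it is supposed to
be checking.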



> 2.  ntpd stores the frequency ppm offset in the driftfile. 
> chronyd stores the frequency ppm offset and the 'skew'
> (estimated accuracy of the existing frequency value). 

> I can see that saving the 'skew' is a nice touch, but I suspect much of the
> good chronyd startup behavior is explained elsewhere. 

I'm not sure that ntpd has a parameter equivalent to skew.

Again, I vote that we don't do anything now.  The current startup stuff is 
broken.  There is no point in working on things like this until we understand 
and fix the current problems.
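
For comparison, a reader that accepts both layouts is small.  This sketch
assumes the ntpd file holds a single number (frequency in ppm) and the
chronyd file holds two whitespace-separated numbers (frequency, then skew),
as described in the quoted text.  The function name is made up.

```python
def parse_driftfile(text):
    """Parse drift file contents into (frequency_ppm, skew_ppm_or_None)."""
    fields = text.split()
    if not fields:
        raise ValueError("empty drift file")
    freq = float(fields[0])
    # A second field, if present, is taken to be the skew estimate.
    skew = float(fields[1]) if len(fields) > 1 else None
    return freq, skew
```

The skew comes back as None for a one-number file, so a caller can tell
"not recorded" apart from a recorded value of zero.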


g...@rellim.com said:
> In a related topic, it would be nice (maybe an option) for ntpd to hold off
> logging the initial awful data until after the -g option has set the system
> clock.  And a bit longer, so the wonky startup data is masked. 

But that is when you really really want the logging.

I might agree to put it someplace other than the normal place.


-- 
These are my opinions.  I hate spam.





Re: Kernel PLL graphs

2016-08-03 Thread Hal Murray

matthew.sel...@twosigma.com said:
> I'm using maxpoll of 1 on my stratum 1 servers.  And I have !NO_HZ set.  My
> offsets stay below 1 microsecond as reported by ntpq.  If we switched the
> units to nanoseconds, that might be interesting.

Time to make sure I've got the right number of negatives...  "I have !NO_HZ 
set" means you have unset NO_HZ which probably means you had to build your 
own kernel.

Do you have flag3 turned on?  If so, the kernel does all the work and maxpoll 
is essentially ignored.

I thought there was a min to maxpoll so I'm a bit surprised you could set it 
to 1.

> I don't have !NO_HZ set on my stratum 2 servers, but I'm looking at the
> ramifications of that.

At least for the effect I'm discussing, it only matters if you have a PPS.

> I'm curious what your results are. 



-- 
These are my opinions.  I hate spam.




