Hi,

I've created a tracing tool several years ago for my master's thesis
against the 2.2 kernel and onto the 2.4 kernel. I'm currently using this
in the 2.6 kernel to debug some customizations against Ingo's RT kernel.

What this is, is basically a device that allocates a large number of
single pages to be used in conjunction as a ring buffer. The number of
pages defaults to 256 (1 meg for 4k pages) but this can be changed
through config.

This utility was made to debug large number of packets being
transferred, as well as how the scheduler was handling them. So it had
to be able to work very fast and in any kernel context (process or
interrupt). So I came up with this utility and it has been very helpful
over the years in debugging kernels that I've worked on.  Now I feel
like I should share this, so I'm now posting it out to everyone to get
some feedback from other kernel developers. Its good for any type of
debugging, and not just scheduling or packets. 

As I've mentioned, this is something that I wrote and rewrote through
the years and has been through 2.2, 2.4 and now the 2.6 kernel. I've
tried to clean it up as I went, but there might be relics from years
past still in the code, as well as some bugs :-)

I'm putting the code and the patch on my website at:

http://www.kihontech.com/logdev/

The tar ball contains userland tools to read the device, and the patch
is against the 2.6.10 kernel.

I'll post the README here so you don't need to download it to have a
better look. 

If anyone out there thinks this is a helpful patch, let myself and the
rest of the world know, and maybe we can get it into the mainline.
Otherwise it will sit lonely on my server forever :(


Oh, and everything is under GPL version 2 license (of course).



               LOGDEV device and utilities.

             Copyright: Steven Rostedt, Kihon Technologies Inc. 2005


Introduction:
====================

The logdev device is an add on to the Linux Kernel.  It is useful where
printk is not. For example in the scheduler, debugging the console
itself or network traffic, or simply anything else that can happen
hundreds of times a second. 

The advantage that logdev has is that it is built into the kernel 
and uses memory to store information to be reviewed at a later time.
The memory used is an array of single pages (for most archs 4k). The
array is just an array of pointers to those pages.  These pages are
allocated when the driver is loaded, either as a module or built in,
and then no more allocation is required of the device. This allows
recording to take place in any context (process or interrupt). 
Spinlocks are used to keep this SMP and preemption safe.

The memory is used as a ring buffer that can store anything that a
developer would like.  There's some built in routines that are
useful. Especially for what kinds of things that were mentioned 
earlier.

There are really two parts to the logdev device.  If you plan on
using it as either a module or a device, you must compile in the
LOGDEV_HOOKS.  This creates hooks for the module when it's loaded
such that macros can be used to call the module.  If the LOGDEV is
compiled into the kernel, then the calls are just made directly to
the logdev device and it doesn't go through the hooks.



Application Programming Interface (API):
=========================================

Here's some of the things that can be used with this device whether
it was compiled in or inserted as a module.


  LOGSWITCH(prev,next)

      This is patched into the scheduler just before switch_to, to
      see what and when processes are being scheduled in and out.
      Here's an example of the retrieved output from this:

      sec: 1109013989:238322  CPU:0 (syslogd:1828) -->> (swapper:0)
      sec: 1109013989:318253  CPU:0 (swapper:0) -->> (ntpd:2003)
      sec: 1109013989:318294  CPU:0 (ntpd:2003) -->> (swapper:0)
      sec: 1109013989:515641  CPU:0 (swapper:0) -->> (sshd:2129)
      sec: 1109013989:515752  CPU:0 (sshd:2129) -->> (bash:2134)
      sec: 1109013989:515823  CPU:0 (bash:2134) -->> (sshd:2129)
      sec: 1109013989:515960  CPU:0 (sshd:2129) -->> (swapper:0)


      The seconds are from do_gettimeofday, seconds followed by microseconds.

  LOGPKT(skb,direction)

      This may be added somewhere in the networking code to see what
      packets are going through a location.  The direction is just an integer
      to determine the direction of the packet if it is used in both
      incoming and outgoing packet code. 

      Here's an example of the retrieved output when this was placed
      in net/ipv4/ip_input.c:ip_rcv_finish as LOGPKT(skb,0);

      sec: 1109016309:002573  CPU:0 192.168.23.9:36448 <== 192.168.23.88:22 
seq:4168259885 ack:4168180295 (---PA-) len:48 win:2372 end_seq:4168259933
      sec: 1109016309:002976  CPU:0 192.168.23.9:36448 <== 192.168.23.88:22 
seq:4168259933 ack:4168180343 (----A-) len:0 win:2372 end_seq:4168259933
      sec: 1109016309:065440  CPU:0 192.168.23.9:36448 <== 192.168.23.88:22 
seq:4168259933 ack:4168180343 (---PA-) len:48 win:2372 end_seq:4168259981
      sec: 1109016309:065848  CPU:0 192.168.23.9:36448 <== 192.168.23.88:22 
seq:4168259981 ack:4168180391 (----A-) len:0 win:2372 end_seq:4168259981
      
      Here we see the packets time, cpu it arrived on (this was run on
      a UP so it would always be zero), the source ip and port, the 
      direction was received (don't get confused that the arrow points
      to the source, it just means it was received... this may be changed
      in the future), and it also shows the destination, sequence number,
      ack number, flags, length of packet, window size and the end sequence.


  LOGPRINT(x,...)

      This acts just like a printk, except that the output is written
      to the logdev device and not to output. So you can put any type
      of prints that you would usually do with printk in this, in 
      areas of the kernel that would just show too much data over the 
      console, and then retrieve it later.


  LOGTPRINT(x,...)

      This acts just like LOGPRINT but it also prepends the time stamp
      before the print in case you are interested in the timings of 
      some code.  This simplifies doing it yourself.


  LOGDUMP()
  LOGDUMPNET()

     These two macros are best used (but not necessarily) in the 
     oops section of the kernel.  I usually place these in the 
     kernel panic code and/or the kernel page fault code on a kernel
     access of a bad address.

     The first one dumps what it can from the logdev device to the
     console (using printk), and the later uses the netconsole code
     to dump to the network.  But to use the later you must set up
     the logdev device like you would set up the netconsole device.
     You could just use netconsole and LOGDUMP() to achieve the same
     effect, but I thought it was fun to put it in anyway.

     I've also thought about putting this into the sysreq key, to
     cause the dump.


The /proc system
=======================

   The logdev module adds a few items into the proc filesystem


   /proc/logdev

      This just shows the status of the logdev device, with the following:

    Logdev:
      len:    256
      size:   7244
      start:  2
      end:    3
      corrupted:0

      Entries:
        0:        4095 :     4095     size: 0
        1:        4095 :     4095     size: 0
        2:        3218 :     3217     size: 4095
        3:           0 :     3149     size: 3149

        .
        .
        .

     Here it shows that the number of pages is 256 used.
     The amount of bytes stored in the device is indicated by size.
     The ring buffer starts at array location 2 and ends at 3.
     The ring buffer still has integrity wrt. keeping it's records straight.
     And the list of what's in the pages that is stored.
     The first two start and stop at 4095, and are empty.
     Page 2 starts at 3218 and stops at 3217 so it's size is 4095
       (note that we are missing a byte to differentiate between a full
       and empty page).
     And the list goes on.


  /proc/logdev_minor
 
     The logdev device is registered as a miscellaneous device with
     a dynmic minor. Don't worry, the utilities use this to create the
     device for you.

  /proc/logdev_print

     This must be set to 1 (or just non-zero) to allow for the LOGPRINT
     and LOGTPRINT to write to the logdev buffer.  The default state
     is set by CONFIG_LOGDEV_PRINT_ENABLED, which itself is defaulted to
     yes.

  /proc/logdev_switch

     When this is set to 1 (or just non-zero) it triggers the recording
     of the context switches in the scheduler.

  /proc/logdev_level

     This is not directly used by the logdev device, but it allows for
     developers to trigger certain tracings written to the logdev device
     with different numbers assigned to the logdev_level. 

  Note: All the above are set as atomic_t variables.


  

The Utilities:
=====================


  The logdev device would not be very valuable if you could not
  get the data out from it.  So I have also provided a few utilities
  to retrieve the data. 

  mklogdev  - creates a /dev/logdev entry from the /proc/logdev_minor.
     It deletes /dev/logdev if it is already there but contains 
     a wrong minor. I don't have udev (don't care for it), so if
     someone has a better method as well, let me know.

  logread - reads /dev/logdev and dumps out what it knows about. 
     Basically, if you don't make your own records, and use just 
     what was given by the logdev device, it will print out everything.

  logwrite - writes to the logdev device from userland makeing it look
     like a LOGPRINT. You can just write to the device from userland
     with a cat > /dev/logdev, but this corrupts the device, and 
     may have logread miss some data.

  Note: both logread and logwrite will automatically create the /dev/logdev
      device, so you don't need to use mklogdev before using either of the
      above two.

  libdumplog.a - this is a library that contains the functions used
    by the above utilities such that you can write your own readers.

  All of the above is under GPL so you can look at the code if 
  you want to know more. They are quite simple.



The Ring Buffer:
========================

   The Ring Buffer of the logdev device is made up of an array of
   descriptors that represents each page that was allocated to
   the device. This allows for each page to be allocated separately
   and not to waste a large continuous allocation. This lets the 
   kernel memory management get the best allocated memory for the 
   device and keeps the device from being too much of a hog.

   Each descriptor keeps track of the first and last 
   entry on the page. So in fact, each page in of itself is also
   a ring buffer.  When the page is full, then the logdev device
   continues onto the next page.  When the logdev device, itself, becomes
   full, then it starts to overwrite what was written earliest to the
   device.  If the device is not corrupted (that is, only the 
   logdev device records, or LOGDEV_CUSTOM was used), then it 
   removes the entire earliest records until there's enough room to
   place the next record. If the device is flagged as corrupted, then
   it only removes just enough to fit what it is writing.

   Corruption sounds really bad, when here it is not. It just means that
   a search for the next record is required using magic ids of the 
   record to find them. It doesn't really mean that the logdev device
   itself is corrupt. In fact, as soon as the logdev device becomes empty,
   it then clears the corrupt flag, and starts fresh again.
   So you may never ever notice if the logdev device is corrupt or not.



Thanks,

-- Steve

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to