Re: [linux-pm] [RFC] sleepy linux

2007-12-26 Thread Igor Stoppa
Hi,
On Wed, 2007-12-26 at 00:07 +0100, ext Pavel Machek wrote:
> This is RFC. It does not even work for me... it sleeps but it will not
> wake up, because SATA wakeup code is missing. Code attached for illustration.
> 
> I wonder if this is the right approach? What is right interface to the
> drivers?
> 
> 
>   Sleepy Linux
>   
> 
> Copyright 2007 Pavel Machek <[EMAIL PROTECTED]>
> GPLv2
> 
> Current Linux versions can enter suspend-to-RAM just fine, but can
> only do it on explicit request. Suspend-to-RAM is important, though:
> it eats something like 10% of the power needed by an idle system.
> Starting suspend manually is not too convenient; it is not an option
> on a multiuser machine, and even on a single-user machine, some things
> are not easy:
> 
> 1) Download this big chunk in mozilla, then go to sleep
> 
> 2) Compile this, then go to sleep

Why can't these cases be based on CPUIdle?

> 3) You can sleep now, but wake me up in 8:30 with mp3 player

This is about properly setting up the wakeup sources, which means one of
the following:

- the wakeup source is really capable of generating wakeups from the
  target idle state

- the wakeup source is not actually capable of generating wakeups from
  the target idle state, which can be solved in two ways:

  - if the duration of the activity is known, set up an alarm
    (assuming alarms are proper wakeup sources) so that the
    system is ready just in time, in a less efficient but more
    responsive power saving state (see the sketch after this list)

  - if the duration of the activity is unknown, choose the more
    efficient of the following solutions:

    - go to a deep sleep state and periodically wake up and
      poll, with a period compatible with the timing
      of the event source

    - prevent overly deep sleep states until the event happens
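
A minimal userspace sketch of the alarm-based option above, assuming a
standard /dev/rtc0 that supports wakeup alarms via RTC_WKALM_SET (the
target time is only illustrative):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/rtc.h>

/* Arm the RTC so that the system is woken up at hour:min; afterwards the
 * usual "echo mem > /sys/power/state" can be issued to enter suspend-to-RAM.
 */
static int arm_rtc_wakeup(int hour, int min)
{
	struct rtc_wkalrm alarm;
	int fd = open("/dev/rtc0", O_RDONLY);

	if (fd < 0)
		return -1;
	memset(&alarm, 0, sizeof(alarm));
	/* reuse today's date, only override the time of day */
	if (ioctl(fd, RTC_RD_TIME, &alarm.time) < 0)
		goto err;
	alarm.time.tm_hour = hour;
	alarm.time.tm_min = min;
	alarm.time.tm_sec = 0;
	alarm.enabled = 1;
	if (ioctl(fd, RTC_WKALM_SET, &alarm) < 0)
		goto err;
	close(fd);
	return 0;
err:
	close(fd);
	return -1;
}

Whether the alarm actually brings the machine out of the chosen idle state
still depends on the wakeup capability discussed above.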

> Today's hardware is mostly capable of doing better: with correctly set
> up wakeups, the machine can sleep and successfully pretend it is not
> sleeping -- by waking up whenever something interesting happens. Of
> course, it is easier on machines not connected to the network, and on
> notebook computers.

It might be that some hw doesn't provide a deep power saving state for
some devices, but if the only missing feature is the wakeup capability,
it could be effectively replaced by some HW timer.


> Requirements:
> 
> 0) Working suspend-to-RAM, with kernel being able to bring video back.
> 
> 1) RTC clock that can wake up system
> 
> 2) Lid that can wake up a system,
>    or keyboard that can wake up system and does not lose keypresses
>    or special screensaver setup
> 
> 3) Network card that is either down
>    or can wake up system on any packet (and not lose too many packets)

These are just a few system-specific cases; the situation is going to
get quite complicated very soon if you explicitly include specific HW
devices (USB, for example) in your model.


-- 
Cheers, Igor

Igor Stoppa <[EMAIL PROTECTED]>
(Nokia Multimedia - CP - OSSO / Helsinki, Finland)


Re: [linux-pm] Power Management framework proposal

2007-07-22 Thread Igor Stoppa
Hi,
On Sat, 2007-07-21 at 23:49 -0700, ext
[EMAIL PROTECTED] wrote:
> I'm deliberately breaking the threading on this so that people who have 
> tuned out the hibernation thread can take a look at this.
> 
> below is the proposal that I made at the bottom of one of the posts on the 
> hibernation thread.

I have the impression that you are trying to describe a mix of the clock
and latency frameworks.

Could you elaborate on how your proposal is incompatible with enhancing
the clock framework? 

It looks like you are proposing a brand new shiny thing that frankly I
would be happy to leave alone, unless it is crystal clear that the clock
fw cannot be improved.

The clock fw is used for OMAP and other architectures (including SH,
IIRC) and so far it has provided very good support for our power
management needs (Nokia 770 and N800).
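
For reference, a minimal sketch of how a driver consumes the clock fw
through the generic <linux/clk.h> interface; the "fck" clock name is only
an example:

#include <linux/clk.h>
#include <linux/device.h>
#include <linux/err.h>

/* Typical consumer-side usage: get, enable, use, disable, put.
 * The framework keeps the usecount, so the clock is gated again
 * as soon as the last user calls clk_disable().
 */
static int example_use_clock(struct device *dev)
{
	struct clk *fck = clk_get(dev, "fck");	/* "fck" is illustrative */
	int ret;

	if (IS_ERR(fck))
		return PTR_ERR(fck);
	ret = clk_enable(fck);		/* ungated if this is the first user */
	if (ret) {
		clk_put(fck);
		return ret;
	}

	/* ... access the device ... */

	clk_disable(fck);		/* gated again when usecount hits 0 */
	clk_put(fck);
	return 0;
}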

Currently we are working on DVFS for OMAP2 (see the slides presented at
the linux-pm summit for OLS 2007, http://tinyurl.com/28tact ) and even if
the current prototype does not actively involve the clock fw, our final
goal is to make it capable of supporting atomic transactions for changing
the core parameters.

OMAP3 will require a suspend-to-RAM implementation where the content of
system memory is retained, while parts of the SoC, or all of it, are
switched off. The plan is still to have a clock fw based implementation
(plus interaction with the power rails, of course).

I think these are good examples of the non-ACPI systems you are
mentioning.

To make any proposal that has some chance of being accepted, you have to
compare it against the existing solution, explaining:

- what it brings in terms of new functionality
- how it is different
- why the current implementation cannot simply be enhanced

You can refer to the linux-pm archives for examples of failed attempts
over the last year or so, just search for "framework" in the subject.

-- 
Cheers, Igor

Igor Stoppa <[EMAIL PROTECTED]>
(Nokia Multimedia - CP - OSSO / Helsinki, Finland)


Re: [linux-pm] Power Management framework proposal

2007-07-22 Thread Igor Stoppa
On Sun, 2007-07-22 at 01:58 -0700, ext [EMAIL PROTECTED] wrote:
> On Sun, 22 Jul 2007, Igor Stoppa wrote:

[snip]

> > Could you elaborate on how your proposal is incompatible with enhancing
> > the clock framework?
> 
> It's not that I think it's incompatible with any existing powersaving 
> tools (in fact I hope it's not)
> 
> it's that I think that this (or something similar) could be made to cover 
> all the various power options instead of CPUs having one interface, ACPI 
> capable drivers having another, embedded devices presenting a third, etc
> 
> this was triggered by the mess of different function calls for different 
> purposes that are used for the suspend functions, where you have a bunch of 
> different functions that are each supposed to be called at a specific time 
> from a specific mode during the suspend process. with all these different 
> functions driver writers tend to not bother implementing any of them, and 
> it seems like there is a fairly steady stream of new functions that end up 
> being needed. the initial intent was to just change this into a generic 
> set of calls that every driver writer would implement the minimum set of, 
> and make it trivially extensible to future capabilities of hardware.

Every now and then there is some attempt to find One solution to bind
them all: x86, SoC, ACPI ... you name it.

Unfortunately, while it's true that there are significant similarities,
there are also notable differences; as far as I know the USB subsystem
is the one that comes closest to what we have in the embedded arena, since
it can have complex cases of parent-child powering and wakeup.

> one other effect of this is that driver writers would see the mode 
> interface from day one rather than just completely ignoring it. right now 
> device driver authors tend to think "why worry about figuring out how to 
> implement 'prepare to suspend', 'late suspend', 'suspend', 'quiesce but 
> don't suspend', etc" if they aren't really interested in working on 
> suspend; it's not really clear what each of these should do even after 
> reading the docs on it. however listing the power modes that a device can 
> be in, documenting the cost of switching between them, and implementing 
> the transitions is something very straightforward for the device driver 
> author to do (and they don't have to worry about the details of how and 
> when the various modes get used, that's up to the suspend/powersaving 
> software to figure out). as such I expect the driver support for 
> powersaving modes to improve. in fact, I expect that some driver writers 
> will implement a whole bunch of modes, just to show off the features of 
> the hardware. and even if nothing uses the modes right now, at least they 
> are implemented and documented for future use (and it should be trivial to 
> have a test routine that just runs every driver you have hardware for 
> through every mode transition to make sure that they all work, so the less 
> commonly used modes shouldn't bitrot too badly)

What you are saying can be summarised as making the driver model more
expressive.

> while I was describing the issues to my roommates over dinner I realized 
> that the same type of functions are needed for the CPU clocks.
> 
> if you have an accepted framework in place there that can do what I 
> described, please consider extending it to cover other types of devices 
> and drivers.

That is not part of the fw: the fw simply expresses parent-child clock
distribution and keeps usecounts so that unused clocks are automatically
gated.

The actual clock tree description is platform/arch/board specific and
doesn't affect the framework. You can just roll your own version for x86
by providing a description of the methods used to switch on/off every
individual clock on your board.

So what you are asking for is that somebody writes an x86 version of the
clock fw.
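
A very rough sketch of what such a board-specific description could look
like; the structure and field names below are invented for illustration
(the real ones are arch-specific), but they capture what the fw needs:
the tree topology plus per-clock gate methods.

/* Hypothetical board clock description, illustrative only. */
struct board_clk {
	const char *name;
	struct board_clk *parent;		/* parent-child distribution */
	int usecount;				/* maintained by the framework */
	int (*enable)(struct board_clk *clk);	/* ungate this clock */
	void (*disable)(struct board_clk *clk);	/* gate this clock */
};

static int x86_foo_enable(struct board_clk *clk)
{
	/* poke whatever register gates this clock on the board */
	return 0;
}

static void x86_foo_disable(struct board_clk *clk)
{
	/* gate it again */
}

static struct board_clk foo_ck = {
	.name		= "foo_ck",
	.parent		= NULL,
	.enable		= x86_foo_enable,
	.disable	= x86_foo_disable,
};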

As for latencies, well, only a few clocks really have a significant
impact, most notably the main system oscillator. Everything else has
essentially zero latency, since it amounts to opening/closing a clock
gate.

Powering a device on/off will certainly introduce more latency, but
either the powering is supported by the hw, making it quick, or it has
to go through most, if not all, of the usual initialisation sequence; in
that case it probably makes sense to avoid controlling it from
kernelspace, since it will be slow and won't require decisions made with
microsecond precision.

> I want sanity and functionality far more than credit :-)

I want to avoid redesigning the wheel: the current version is not round
yet, but re-starting from a triangle every time is far less appealing.

> thanks for the link. I've read through it, and it looks 

Re: [linux-pm] Power Management framework proposal

2007-07-23 Thread Igor Stoppa
wakeup 
> signals in this mode'
> 
> but this API would just provide this info to the decision making code, 
> that code would have to actually enforce the limits

Again, you are going into a field that belongs to hw abstraction and
already has a standard tool to deal with it - HAL

> >> for some subsystems this would be little more than renaming existing
> >> functions, for others it would be converting several independent functions
> >> into one, discoverable api
> >
> > if you check cpufreq, you will find out that it already covers the
> > multiple cores case (but nothing prevents using the same logic on
> > something that is not really a cpu) and also has some simple concept of
> > latency for frequency transitions, a concept that could be enhanced to
> > handle latencies that depend on the current operating point and
> > target operating point.
> 
> does it provide a full matrix of latencies, or just mode 1->mode2=x, 
> mode2->mode3=y so mode1->mode3=x+y?

IIRC it's just 1 value
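
For context, a sketch of where that single value lives: a cpufreq driver
fills in one worst-case transition latency (in nanoseconds) per policy,
rather than a per-transition matrix; the number below is illustrative.

#include <linux/cpufreq.h>

static int example_cpufreq_init(struct cpufreq_policy *policy)
{
	/* one scalar latency for any frequency transition on this policy */
	policy->cpuinfo.transition_latency = 100 * 1000;	/* 100 us */
	return 0;
}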

> >>> -why the current implementation cannot simply be enhanced
> >>
> >> which current implementation should be enhanced? and with the massive
> >> broadening of functionality should it retain the same name, or should it
> >> get renamed to something more generic?
> >
> > cpufreq could be renamed to anything that makes sense, but i see _no_
> > massive broadening of functionality.
> 
> what I'm talking about would provide an API to devices that you are 
> ignoring because they should be managed from userspace.

again, HAL / OHM / Mobilin

> >> the cpufreq implementation is very close to what I'm proposing, it would
> >> need to get broadened to cover other devices (like disk drives, wireless
> >> cards, etc), is this really the right thing to do or should the more
> >> generic API go in for external use and then the existing cpufreq be called
> >> from the set_mode() call?
> >
> > No, that doesn't make sense, as general approach.
> > You want to manage from kernel only those parts of the system where the
> > latency is so low that userspace wouldn't be able to keep up.
> >
> > Your examples (wireless, disk drive) can be easily controlled from
> > userspace, with a timeout.
> 
> absolutely, and they should be (at least most of the time). this was not 
> intended as a kernelspace only api. it is intended to be available to both 
> kernelspace and userspace.
> 
> > In both cases there are significant delays (change of rotation speed /
> > sync with the access point).
> 
> correct, and these delays should be reflected in the transition cost 
> matrix
> 
> > All this is hand waving unless it is backed up by numbers.
> > Real cases are required in order to establish a list of priorities for
> > latency/power consumption.
> 
> this isn't attempting to establish a list of priorities, simply to give the 
> software that is trying to establish such a list the info to make its 
> decisions, and the interface to use to issue the resulting instructions.

What I'm saying is that sw is implemented to fulfill certain needs. I'd
rather see a detailed description of the need and, based on that, debate
the actual API / implementation.

-- 
Cheers, Igor

Igor Stoppa <[EMAIL PROTECTED]>
(Nokia Multimedia - CP - OSSO / Helsinki, Finland)


Re: [linux-pm] Power Management framework proposal

2007-07-24 Thread Igor Stoppa
On Tue, 2007-07-24 at 10:43 +0200, ext Jerome Glisse wrote:

> I believe a central place where users can set/change hw state to save
> power or to increase computational power is definitely a goal to pursue.
> But I truly think that the OHM approach is the best one, i.e. using plugins
> so that one can make a plugin specific to each device. The point is that
> I believe there is no way to do an abstract interface for this, and trying
> to do so will end up producing ugly code; any interface would fail to
> encompass all the possible tweaks that might exist for all devices.
> 
> For instance on a graphics card you could do the following (maybe more):
> -change GPU clock
> -change memory clock
> -disable part of engine
> -disable unit
> I truly don't think you can make a common interface for all this; moreover
> there might be constraints on how you can change things (GPU &
> memory clock might need to follow a given ratio). So you definitely
> need knowledge in the user space program to handle this.

Even simpler case: LCD backlights come in many flavors, both in terms of
brightness levels and of the fixed amount of current required to keep
them ON.

Trying to abstract such details away from the decision-making makes
little sense.
Isolating them into a separate module, instead, brings the best of both
worlds:
- containment of the HW-specific code
- leveraging every possible power saving mode available, no matter how
  exotic.
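
As an illustration of that isolation, a hypothetical plugin interface
could look like the sketch below; the names are invented for this example
and do not reproduce OHM's actual plugin API.

/* Hypothetical split between policy code and HW-specific plugin. */
struct backlight_plugin {
	const char *name;
	int max_level;				/* panel-specific */
	int (*set_level)(int level);		/* HW-specific knowledge */
	int (*current_ua)(int level);		/* cost of keeping it ON */
};

/* The decision-making side only reasons about abstract levels and costs. */
static int policy_dim_for_idle(struct backlight_plugin *bl)
{
	return bl->set_level(bl->max_level / 4);
}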


-- 
Cheers, Igor

Igor Stoppa <[EMAIL PROTECTED]>
(Nokia Multimedia - CP - OSSO / Helsinki, Finland)


Re: [linux-pm] [RFC][PATCH] PM: Document requirements for basic PM support in drivers

2007-02-14 Thread Igor Stoppa
s the power button to make the system
> > +resume.  If that works, you are ready to test the STD with the new driver
> > +loaded.  Otherwise, you have to identify what is wrong.
> > +
> > +a) To verify if there are any drivers that cause problems you can run the 
> > STD
> > +in the test mode:
> > +
> > +# echo test > /sys/power/disk
> > +# echo disk > /sys/power/state
> > +
> > +in which case the system should freeze tasks, suspend devices, disable 
> > nonboot
> > +CPUs (if any), wait for 5 seconds, enable nonboot CPUs, resume devices, 
> > thaw
> > +tasks and return to your command prompt.  If that fails, most likely there 
> > is
> > +a driver that fails to either suspend or resume (in the latter case the 
> > system
> > +may hang or be unstable after the test, so please take that into 
> > consideration).
> > +To find this driver, you can carry out a binary search according to the 
> > rules:
> > +- if the test fails, unload a half of the drivers currently loaded and 
> > repeat
> > +(that would probably involve rebooting the system, so always note what 
> > drivers
> > +have been loaded before the test),
> > +- if the test succeeds, load a half of the drivers you have unloaded most
> > +recently and repeat.
> > +
> > +Once you have found the failing driver (there can be more than just one of
> > +them), you have to unload it every time before the STD transition.  In 
> > that case
> > +please make sure to report the problem with the driver.
> 
> It is also possible that a cycle can still fail after you have unloaded
> all modules. In that case, you would want to look in your kernel
> configuration for possibilities that can be modularised (testing again
> with them as modules), and possibly also try boot time options such as
> noapic or noacpi.

The first step, imho, would be to identify all the peripherals required
for a barebones configuration to run (e.g. serial console) and verify
that at least those can reliably go through several suspend/resume
cycles. Then the dichotomic approach can be used.
> 
> > +
> > +b) If the test mode of STD works, you can boot the system with 
> > "init=/bin/bash"
> > +and attempt to suspend in the "reboot", "shutdown" and "platform" modes.  
> > If
> > +that does not work, there probably is a problem with one of the low level
> > +drivers and you generally cannot do much about it except for reporting it
> > +(fortunately, that does not happen very often these days).  Otherwise, 
> > there is
> 
> Oh. Perhaps some of the suggestions from above belong here?
> 
> > +a problem with a modular driver and you can find it by loading a half of 
> > the
> > +modules you normally use and binary searching in accordance with the 
> > algorithm:
> > +- if there are n modules loaded and the attempt to suspend and resume 
> > fails,
> > +unload n/2 of the modules and try again (that would probably involve 
> > rebooting
> > +the system),
> > +- if there are n modules loaded and the attempt to suspend and resume 
> > succeeds,
> > +load n/2 modules more and try again.
> > +
> > +Again, if you find the offending module(s), it(they) must be unloaded 
> > every time
> > +before the STD transition, and please report the problem with it(them).
> > +
> > +2. To verify that the STR works, it is generally more convenient to use the
> > +s2ram tool available from http://suspend.sf.net and documented at
> > +http://en.opensuse.org/s2ram .  However, before doing that it is 
> > recommended to
> > +carry out the procedure described in section 1.
> > +
> > +Assume you have resolved the problems with the STD and you have found some
> > +failing drivers.  These drivers are also likely to fail during the STR or
> > +during the resume, so it is better to unload them every time before the STR
> > +transition.  Now, you can follow the instructions at
> > +http://en.opensuse.org/s2ram to test the system, but if it does not work
> > +"out of the box", you may need to boot it with "init=/bin/bash" and test
> > +s2ram in the minimal configuration.  In that case, you may be able to 
> > search
> > +for failing drivers by following the procedure analogous to the one 
> > described in
> > +1b).  If you find some failing drivers, you will have to unload them every 
> > time
> > +before the STR transition (ie. before you run s2ram), and please report the
> > +problem with them.
> > +
> > +II. Testing the driver
> > +
> > +On

[RFC] memory allocations in genalloc

2017-08-17 Thread Igor Stoppa
Foreword:
If I should direct this message to someone else, please let me know.
I couldn't get a clear idea by looking at both MAINTAINERS and git blame.



Hi,

I'm currently trying to convert the SELinux policy db to use a
protectable memory allocator (pmalloc) that I have developed.

This allocator is based on genalloc: I had come up with an
implementation that was pretty similar to what genalloc already does, so
it was pointed out to me that I could have a look at it.

And, indeed, it seemed a perfect choice.

But ... when freeing memory, genalloc requires that the caller also state
how large each specific memory allocation is.

This, per se, is not an issue, although genalloc doesn't seem to check
whether the memory being freed really matches a previous allocation
request.
However, this design doesn't sit well with the use case I have in mind.

In particular, when the SELinux policy db is populated, the creation of
one or more specific entries of the db might fail.
In this case, the memory previously allocated for said entry is
released with kfree, which doesn't need to know the size of the chunk
being freed.
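
To make the asymmetry concrete, a small sketch of the two free paths,
using the existing prototypes from <linux/genalloc.h> and <linux/slab.h>:

#include <linux/genalloc.h>
#include <linux/slab.h>

static void example_free_paths(struct gen_pool *pool, unsigned long addr,
			       void *obj, size_t size)
{
	/* genalloc: the caller must remember how big the allocation was */
	gen_pool_free(pool, addr, size);

	/* kmalloc/kfree: the allocator tracks the size internally */
	kfree(obj);
}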

I would like to add similar capability to genalloc.

genalloc already uses bitmaps to track which words are allocated (1) and
which are free (0).

What I would like to do is to add another bitmap, which would track the
beginning of each individual allocation (1 on the first allocation unit
of each allocation, 0 otherwise).

Such an enhancement would also enable the detection of calls to free with
incorrect / misaligned addresses - right now it is possible to
successfully free a memory area that overlaps the boundary of two
adjacent allocations, without fully covering either of them.

Would this change be acceptable?
Is there any better way to achieve what I want?


---

I also have a question wrt the use of spinlocks in genalloc.
Why a spinlock?

Freeing a chunk of memory previously allocated with vmalloc requires
invoking vfree_atomic, instead of vfree, because the list of chunks is
walked with the spinlock held, and vfree can sleep.

Why not use a mutex?


--
TIA, igor


[RFC PATCH v11 0/6] mm: security: ro protection for dynamic data

2018-01-24 Thread Igor Stoppa
This patch-set introduces the possibility of protecting memory that has
been allocated dynamically.

The memory is managed in pools: when a memory pool is turned into R/O,
all the memory that is part of it will become R/O.

An R/O pool can be destroyed, to recover its memory, but it cannot be
turned back into R/W mode.

This is intentional. This feature is meant for data that doesn't need
further modifications after initialization.

However the data might need to be released, for example as part of module
unloading.
To do this, the memory must first be freed, then the pool can be destroyed.

An example is provided, in the form of self-testing.
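
For convenience, a condensed sketch of the intended lifecycle, using the
calls introduced by this series (error handling omitted for brevity):

#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/pmalloc.h>

static struct gen_pool *pool;
static int *data;

static int __init example_init(void)
{
	pool = pmalloc_create_pool("example", 0);	/* 1. create the pool */
	data = pmalloc(pool, 64 * sizeof(*data), GFP_KERNEL); /* 2. allocate */
	data[0] = 42;					/* 3. initialize */
	pmalloc_protect_pool(pool);			/* 4. make it R/O */
	return 0;
}

static void __exit example_exit(void)
{
	pmalloc_destroy_pool(pool);	/* recover the memory on unload */
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");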

Changes since the v10 version:

Initially I tried to provide support for hardening the LSM hooks, but the
LSM code was too much in flux to have a chance of being merged.

Several drop-in replacements for kmalloc-based functions were added, for
example kzalloc.

From this perspective I have also modified genalloc, to make its free
functionality follow kfree more closely, since kfree doesn't need to be
told the size of the allocation being released. This was sent out for
review twice, but it has not received any feedback so far.

Also genalloc now comes with self-testing.

The latest can be found also here:

https://www.spinics.net/lists/kernel/msg2696152.html

The need to integrate with hardened user copy has driven an optimization
in the management of vmap_areas, where each struct page in a vmalloc area
has a reference to it, saving the search through the various areas.

I was planning - and can still do it - to provide hardening for some IMA
data, but in the meanwhile it seems that the XFS developers might be
interested in this functionality:

http://www.openwall.com/lists/kernel-hardening/2018/01/24/1

So I'm sending it out as a preview.


Igor Stoppa (6):
  genalloc: track beginning of allocations
  genalloc: selftest
  struct page: add field for vm_struct
  Protectable Memory
  Documentation for Pmalloc
  Pmalloc: self-test

 Documentation/core-api/pmalloc.txt | 104 
 include/linux/genalloc-selftest.h  |  30 +++
 include/linux/genalloc.h   |   6 +-
 include/linux/mm_types.h   |   1 +
 include/linux/pmalloc.h| 215 
 include/linux/vmalloc.h|   1 +
 init/main.c|   2 +
 lib/Kconfig|  15 ++
 lib/Makefile   |   1 +
 lib/genalloc-selftest.c| 402 +
 lib/genalloc.c | 444 +--
 mm/Kconfig |   7 +
 mm/Makefile|   2 +
 mm/pmalloc-selftest.c  |  65 +
 mm/pmalloc-selftest.h  |  30 +++
 mm/pmalloc.c   | 516 +
 mm/usercopy.c  |  25 +-
 mm/vmalloc.c   |  18 +-
 18 files changed, 1744 insertions(+), 140 deletions(-)
 create mode 100644 Documentation/core-api/pmalloc.txt
 create mode 100644 include/linux/genalloc-selftest.h
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 lib/genalloc-selftest.c
 create mode 100644 mm/pmalloc-selftest.c
 create mode 100644 mm/pmalloc-selftest.h
 create mode 100644 mm/pmalloc.c

-- 
2.9.3



[PATCH 1/6] genalloc: track beginning of allocations

2018-01-24 Thread Igor Stoppa
The genalloc library is only capable of tracking if a certain unit of
allocation is in use or not.

It is not capable of discerning where the memory associated with an
allocation request begins and where it ends.

The reason is that units of allocations are tracked by using a bitmap,
where each bit represents that the unit is either allocated (1) or
available (0).

The user of the API must keep track of how much space was requested, if
it ever needs to be freed.

This can cause errors to go undetected.
Examples:
* Only a subset of the memory provided to an allocation request is freed
* The memory from a subsequent allocation is freed
* The memory being freed doesn't start at the beginning of an
  allocation.

The bitmap is used because it allows lockless read/write access, where
this is supported by hw through cmpxchg.
Similarly, it is possible to scan the bitmap for a sufficiently long
sequence of zeros, to identify zones available for allocation.
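
The lockless pattern referred to here is, roughly, the following; this is
a simplified sketch in the spirit of genalloc's internal bit-setting
helper, not a verbatim copy of it:

#include <linux/atomic.h>
#include <linux/errno.h>

/* Set the bits selected by mask in *addr without holding a lock:
 * retry the cmpxchg until no other CPU has raced with this update.
 */
static int set_bits_lockless(unsigned long *addr, unsigned long mask)
{
	unsigned long old, new;

	do {
		old = READ_ONCE(*addr);
		if (old & mask)		/* raced: bits already allocated */
			return -EBUSY;
		new = old | mask;
	} while (cmpxchg(addr, old, new) != old);

	return 0;
}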

--

This patch doubles the space reserved in the bitmap for each allocation.
By using 2 bits per allocation, it is possible to encode also the
information of where the allocation starts:
(msb to the left, lsb to the right, in the following "dictionary")

11: first allocation unit in the allocation
10: any subsequent allocation unit (if any) in the allocation
00: available allocation unit
01: invalid

Ex, with the same notation as above - MSb...LSb:

 ...100010101011   <-- Read in this direction.
\__|\__|\|\|\__|
   |   | | |   \___ 4 used allocation units
   |   | | \___ 3 empty allocation units
   |   | \_ 1 used allocation unit
   |   \___ 2 used allocation units
   \___ 2 empty allocation units

Because of the encoding, the previous lockless operations are still
possible. The only caveat is to change the parameter of the zero-finding
function which establishes the alignment at which to perform the test
for the first zero.
The original value of the parameter is 0, meaning that an allocation can
start at any point in the bitmap, while the new value is 1, meaning that
allocations can start only at even places (bit 0, bit 2, etc.).
The number of zeroes to look for must therefore be doubled.

When it's time to free the memory associated to an allocation request,
it's a matter of checking if the corresponding allocation unit is really
the beginning of an allocation (both bits are set to 1).
Looking for the ending can also be performed locklessly.
It's sufficient to identify the first mapped allocation unit
that is represented either as free (00) or busy (11).
Even if the allocation status should change in the meanwhile, it doesn't
matter, since it can only transition between free (00) and
first-allocated (11).

The parameter indicating to the *_free() function the size of the space
that should be freed is not currently removed, to facilitate the
transition, but it is verified whenever it is not zero.
If it is set to zero, then the free function will autonomously decide the
size to be freed, by scanning the bitmap.

About the implementation: the patch introduces the concept of "bitmap
entry", which has a 1:1 mapping with allocation units, while the code
being patched has a 1:1 mapping between allocation units and bits.

This means that, now, the bitmap can be extended (by following powers of
2), to track also other properties of the allocations, if ever needed.

Signed-off-by: Igor Stoppa 
---
 include/linux/genalloc.h |   3 +-
 lib/genalloc.c   | 417 ---
 2 files changed, 289 insertions(+), 131 deletions(-)

diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index 6dfec4d..a8fdabf 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -32,6 +32,7 @@
 
 #include 
 #include 
+#include 
 
 struct device;
 struct device_node;
@@ -75,7 +76,7 @@ struct gen_pool_chunk {
phys_addr_t phys_addr;  /* physical starting address of memory 
chunk */
unsigned long start_addr;   /* start address of memory chunk */
unsigned long end_addr; /* end address of memory chunk 
(inclusive) */
-   unsigned long bits[0];  /* bitmap for allocating memory chunk */
+   unsigned long entries[0];   /* bitmap for allocating memory chunk */
 };
 
 /*
diff --git a/lib/genalloc.c b/lib/genalloc.c
index 144fe6b..13bc8cf 100644
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -36,114 +36,221 @@
 #include 
 #include 
 
+#define ENTRY_ORDER 1UL
+#define ENTRY_MASK ((1UL << ((ENTRY_ORDER) + 1UL)) - 1UL)
+#define ENTRY_HEAD ENTRY_MASK
+#define ENTRY_UNUSED 0UL
+#define BITS_PER_ENTRY (1U << ENTRY_ORDER)
+#define BITS_DIV_ENTRIES(x) ((x) >> ENTRY_ORDER)
+#define ENTRIES_TO_BITS(x) ((x) << ENTRY_ORDER)
+#define BITS_DIV_LONGS(x) ((x) / BITS_PER_LONG)
+#define ENTRIES_DIV_L

[PATCH 2/6] genalloc: selftest

2018-01-24 Thread Igor Stoppa
Introduce a set of macros for writing concise test cases for genalloc.

The test cases are meant to provide regression testing, when working on
new functionality for genalloc.

Primarily they are meant to confirm that the various allocation strategies
will continue to work as expected.

The execution of the self testing is controlled through a Kconfig option.

Signed-off-by: Igor Stoppa 
---
 include/linux/genalloc-selftest.h |  30 +++
 init/main.c   |   2 +
 lib/Kconfig   |  15 ++
 lib/Makefile  |   1 +
 lib/genalloc-selftest.c   | 402 ++
 5 files changed, 450 insertions(+)
 create mode 100644 include/linux/genalloc-selftest.h
 create mode 100644 lib/genalloc-selftest.c

diff --git a/include/linux/genalloc-selftest.h 
b/include/linux/genalloc-selftest.h
new file mode 100644
index 000..7af1901
--- /dev/null
+++ b/include/linux/genalloc-selftest.h
@@ -0,0 +1,30 @@
+/*
+ * genalloc-selftest.h
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+
+#ifndef __GENALLOC_SELFTEST_H__
+#define __GENALLOC_SELFTEST_H__
+
+
+#ifdef CONFIG_GENERIC_ALLOCATOR_SELFTEST
+
+#include 
+
+void genalloc_selftest(void);
+
+#else
+
+static inline void genalloc_selftest(void){};
+
+#endif
+
+#endif
diff --git a/init/main.c b/init/main.c
index 0e4d39c..2bdacb9 100644
--- a/init/main.c
+++ b/init/main.c
@@ -87,6 +87,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -649,6 +650,7 @@ asmlinkage __visible void __init start_kernel(void)
 */
mem_encrypt_init();
 
+   genalloc_selftest();
 #ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
diff --git a/lib/Kconfig b/lib/Kconfig
index b1445b2..1c535f4 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -291,6 +291,21 @@ config DECOMPRESS_LZ4
 config GENERIC_ALLOCATOR
bool
 
+config GENERIC_ALLOCATOR_SELFTEST
+   bool "genalloc tester"
+   default n
+   select GENERIC_ALLOCATOR
+   help
+ Enable automated testing of the generic allocator.
+ The testing is primarily for the tracking of allocated space.
+
+config GENERIC_ALLOCATOR_SELFTEST_VERBOSE
+   bool "make the genalloc tester more verbose"
+   default n
+   select GENERIC_ALLOCATOR_SELFTEST
+   help
+ More information will be displayed during the self-testing.
+
 #
 # reed solomon support is select'ed if needed
 #
diff --git a/lib/Makefile b/lib/Makefile
index b8f2c16..ff7ad5f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_LIBCRC32C) += libcrc32c.o
 obj-$(CONFIG_CRC8) += crc8.o
 obj-$(CONFIG_XXHASH)   += xxhash.o
 obj-$(CONFIG_GENERIC_ALLOCATOR) += genalloc.o
+obj-$(CONFIG_GENERIC_ALLOCATOR_SELFTEST) += genalloc-selftest.o
 
 obj-$(CONFIG_842_COMPRESS) += 842/
 obj-$(CONFIG_842_DECOMPRESS) += 842/
diff --git a/lib/genalloc-selftest.c b/lib/genalloc-selftest.c
new file mode 100644
index 000..007a0cf
--- /dev/null
+++ b/lib/genalloc-selftest.c
@@ -0,0 +1,402 @@
+/*
+ * genalloc-selftest.c
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+
+
+/* Keep the bitmap small, while including case of cross-ulong mapping.
+ * For simplicity, the test cases use only 1 chunk of memory.
+ */
+#define BITMAP_SIZE_C 16
+#define ALLOC_ORDER 0
+
+#define ULONG_SIZE (sizeof(unsigned long))
+#define BITMAP_SIZE_UL (BITMAP_SIZE_C / ULONG_SIZE)
+#define MIN_ALLOC_SIZE (1 << ALLOC_ORDER)
+#define ENTRIES (BITMAP_SIZE_C * 8)
+#define CHUNK_SIZE  (MIN_ALLOC_SIZE * ENTRIES)
+
+#ifndef CONFIG_GENERIC_ALLOCATOR_SELFTEST_VERBOSE
+
+static inline void print_first_chunk_bitmap(struct gen_pool *pool) {}
+
+#else
+
+static void print_first_chunk_bitmap(struct gen_pool *pool)
+{
+   struct gen_pool_chunk *chunk;
+   char bitmap[BITMAP_SIZE_C * 2 + 1];
+   unsigned long i;
+   char *bm = bitmap;
+   char *entry;
+
+   if (unlikely(pool == NULL || pool->chunks.next == NULL))
+   return;
+
+   chunk = container_of(pool->chunks.next, struct gen_pool_chunk,
+next_chunk);
+   entry = (void *)chunk->entries;
+   for (i = 1; i <= BITMAP_SIZE_C; i++)
+   bm += snprintf

[PATCH 3/6] struct page: add field for vm_struct

2018-01-24 Thread Igor Stoppa
When a page is used for virtual memory, it is often necessary to obtain
a handle to the corresponding vm_struct, which refers to the virtually
contiguous area generated when invoking vmalloc.

The struct page has a "mapping" field, which can be re-used to store a
pointer to the parent area. This will avoid more expensive searches.

As an example, the function find_vm_area is reimplemented to take
advantage of the newly introduced field.

Signed-off-by: Igor Stoppa 
---
 include/linux/mm_types.h |  1 +
 mm/vmalloc.c | 18 +-
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd1af6b..c3a4825 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -84,6 +84,7 @@ struct page {
void *s_mem;/* slab first object */
atomic_t compound_mapcount; /* first tail page */
/* page_deferred_list().next -- second tail page */
+   struct vm_struct *area;
};
 
/* Second double word */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6739420..44c5dfc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1466,13 +1466,16 @@ struct vm_struct *get_vm_area_caller(unsigned long 
size, unsigned long flags,
  */
 struct vm_struct *find_vm_area(const void *addr)
 {
-   struct vmap_area *va;
+   struct page *page;
 
-   va = find_vmap_area((unsigned long)addr);
-   if (va && va->flags & VM_VM_AREA)
-   return va->vm;
+   if (unlikely(!is_vmalloc_addr(addr)))
+   return NULL;
 
-   return NULL;
+   page = vmalloc_to_page(addr);
+   if (unlikely(!page))
+   return NULL;
+
+   return page->area;
 }
 
 /**
@@ -1536,6 +1539,7 @@ static void __vunmap(const void *addr, int 
deallocate_pages)
struct page *page = area->pages[i];
 
BUG_ON(!page);
+   page->area = NULL;
__free_pages(page, 0);
}
 
@@ -1744,6 +1748,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned 
long align,
const void *caller)
 {
struct vm_struct *area;
+   unsigned int page_counter;
void *addr;
unsigned long real_size = size;
 
@@ -1769,6 +1774,9 @@ void *__vmalloc_node_range(unsigned long size, unsigned 
long align,
 
kmemleak_vmalloc(area, size, gfp_mask);
 
+   for (page_counter = 0; page_counter < area->nr_pages; page_counter++)
+   area->pages[page_counter]->area = area;
+
return addr;
 
 fail:
-- 
2.9.3



[PATCH 4/6] Protectable Memory

2018-01-24 Thread Igor Stoppa
The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.

However, the MMU-based protection works efficiently only when said pages
contain exclusively data that will not need further modifications.

Statically allocated variables can be segregated into a dedicated
section, but this does not sit very well with dynamically allocated
ones.

Dynamic allocation does not provide, currently, any means for grouping
variables in memory pages that would contain exclusively data suitable
for conversion to read only access mode.

The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.

A module can request a pool and then refer any allocation request to the
pool handle it has received.

Once all the chunks of memory associated with a specific pool are
initialized, the pool can be protected.

After this point, the pool can only be destroyed (it is up to the module
to avoid any further references to the memory from the pool, after
the destruction is invoked).

The latter case is mainly meant for releasing memory, when a module is
unloaded.

A module can have as many pools as needed, for example to support the
protection of data that is initialized in sufficiently distinct phases.

Signed-off-by: Igor Stoppa 
---
 include/linux/genalloc.h |   3 +
 include/linux/pmalloc.h  | 215 
 include/linux/vmalloc.h  |   1 +
 lib/genalloc.c   |  27 +++
 mm/Makefile  |   1 +
 mm/pmalloc.c | 513 +++
 mm/usercopy.c|  25 ++-
 7 files changed, 781 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index a8fdabf..9f2974f 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -121,6 +121,9 @@ extern unsigned long gen_pool_alloc_algo(struct gen_pool *, 
size_t,
 extern void *gen_pool_dma_alloc(struct gen_pool *pool, size_t size,
dma_addr_t *dma);
 extern void gen_pool_free(struct gen_pool *, unsigned long, size_t);
+
+extern void gen_pool_flush_chunk(struct gen_pool *pool,
+struct gen_pool_chunk *chunk);
 extern void gen_pool_for_each_chunk(struct gen_pool *,
void (*)(struct gen_pool *, struct gen_pool_chunk *, void *), void *);
 extern size_t gen_pool_avail(struct gen_pool *);
diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index 000..cb18739
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,215 @@
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifndef _PMALLOC_H
+#define _PMALLOC_H
+
+
+#include 
+#include 
+
+#define PMALLOC_DEFAULT_ALLOC_ORDER (-1)
+
+/*
+ * Library for dynamic allocation of pools of memory that can be,
+ * after initialization, marked as read-only.
+ *
+ * This is intended to complement __read_only_after_init, for those cases
+ * where either it is not possible to know the initialization value before
+ * init is completed, or the amount of data is variable and can be
+ * determined only at run-time.
+ *
+ * ***WARNING***
+ * The user of the API is expected to synchronize:
+ * 1) allocation,
+ * 2) writes to the allocated memory,
+ * 3) write protection of the pool,
+ * 4) freeing of the allocated memory, and
+ * 5) destruction of the pool.
+ *
+ * For a non-threaded scenario, this type of locking is not even required.
+ *
+ * Even if the library were to provide support for locking, point 2)
+ * would still depend on the user taking the lock.
+ */
+
+
+/**
+ * pmalloc_create_pool - create a new protectable memory pool -
+ * @name: the name of the pool, must be unique
+ * @min_alloc_order: log2 of the minimum allocation size obtainable
+ *   from the pool
+ *
+ * Creates a new (empty) memory pool for allocation of protectable
+ * memory. Memory will be allocated upon request (through pmalloc).
+ *
+ * Returns a pointer to the new pool upon success, otherwise a NULL.
+ */
+struct gen_pool *pmalloc_create_pool(const char *name,
+int min_alloc_order);
+
+
+int is_pmalloc_object(const void *ptr, const unsigned long n);
+
+/**
+ * pmalloc_prealloc - tries to allocate a memory chunk of the requested size
+ * @pool: handler to the pool to be used for memory allocation
+ * @size: amount of memory (in bytes) requested
+ *
+ * Prepares a chunk of the requested size.
+ * This is intended to both minimize latency in later memory requests and
+ * avoid sleeping during allocation.
+ * Memory allocated with

[PATCH 5/6] Documentation for Pmalloc

2018-01-24 Thread Igor Stoppa
Detailed documentation about the protectable memory allocator.

Signed-off-by: Igor Stoppa 
---
 Documentation/core-api/pmalloc.txt | 104 +
 1 file changed, 104 insertions(+)
 create mode 100644 Documentation/core-api/pmalloc.txt

diff --git a/Documentation/core-api/pmalloc.txt 
b/Documentation/core-api/pmalloc.txt
new file mode 100644
index 000..9c39672
--- /dev/null
+++ b/Documentation/core-api/pmalloc.txt
@@ -0,0 +1,104 @@
+
+Protectable memory allocator
+
+
+Introduction
+
+
+When trying to perform an attack toward a system, the attacker typically
+wants to alter the execution flow, in a way that allows actions which
+would otherwise be forbidden.
+
+In recent years there has been lots of effort in preventing the execution
+of arbitrary code, so the attacker is progressively pushed to look for
+alternatives.
+
+If code changes are either detected or even prevented, what is left is to
+alter kernel data.
+
+As countermeasure, constant data is collected in a section which is then
+marked as readonly.
+To expand on this, also statically allocated variables which are tagged
+as __ro_after_init will receive a similar treatment.
+The difference from constant data is that such variables can be still
+altered freely during the kernel init phase.
+
+However, such solution does not address those variables which could be
+treated essentially as read-only, but whose size is not known at compile
+time or cannot be fully initialized during the init phase.
+
+
+Design
+--
+
+pmalloc builds on top of genalloc, using the same concept of memory pools.
+A pool is a handle to a group of chunks of memory of various sizes.
+When created, a pool is empty. It will be populated by allocating chunks
+of memory, either when the first memory allocation request is received, or
+when a pre-allocation is performed.
+
+Either way, one or more memory pages will be obtained from vmalloc and
+registered in the pool as a chunk. Subsequent requests will be satisfied by
+either using any available free space from the current chunks, or by
+allocating more vmalloc pages, should the current free space not suffice.
+
+This is the key point of pmalloc: it groups data that must be protected
+into a set of pages. The protection is performed through the mmu, which
+is a prerequisite and has a minimum granularity of one page.
+
+If the relevant variables were not grouped, there would be a problem of
+allowing writes to other variables that might happen to share the same
+page, but require further alterations over time.
+
+A pool is a group of pages that are write protected at the same time.
+Ideally, they have some high level correlation (ex: they belong to the
+same module), which justifies write protecting them all together.
+
+To keep it to a minimum, locking is left to the user of the API, in
+those cases where it's not strictly needed.
+Ideally, no further locking is required, since each module can have its own
+pool (or pools), which should, for example, avoid the need for cross
+module or cross thread synchronization about write protecting a pool.
+
+The overhead of creating an additional pool is minimal: a handful of bytes
+from kmalloc space for the metadata and then what is left unused from the
+page(s) registered as chunks.
+
+Compared to plain use of vmalloc, genalloc has the advantage of tightly
+packing the allocations, reducing the number of pages used and therefore
+the pressure on the TLB. The slight overhead in execution time of the
+allocation should be mostly irrelevant, because pmalloc memory is not
+meant to be allocated/freed in tight loops. Rather it ought to be taken
+in use, initialized and write protected. Possibly destroyed.
+
+Considering that not much data is supposed to be dynamically allocated
+and then marked as read-only, it shouldn't be an issue that the address
+range for pmalloc is limited, on 32-bit systems.
+
+Regarding SMP systems, the allocations are expected to happen mostly
+during an initial transient, after which there should be no more need to
+perform cross-processor synchronizations of page tables.
+
+
+Use
+---
+
+The typical sequence, when using pmalloc, is:
+
+1. create a pool
+2. [optional] pre-allocate some memory in the pool
+3. issue one or more allocation requests to the pool
+4. initialize the memory obtained
+   - iterate over points 3 & 4 as needed -
+5. write protect the pool
+6. use in read-only mode the handlers obtained through the allocations
+7. [optional] destroy the pool
+
+
+In a scenario where, for example due to some error, part or all of the
+allocations performed at point 3 must be reverted, it is possible to free
+them, as long as point 5 has not been executed, and the pool is still
+modifiable. Such freed memory can be re-used.
+Performing a free operation on a write-protected pool will, instead,
+simply release the corresponding memory from the accountin

[PATCH 6/6] Pmalloc: self-test

2018-01-24 Thread Igor Stoppa
Add basic self-test functionality for pmalloc.

Signed-off-by: Igor Stoppa 
---
 mm/Kconfig|  7 ++
 mm/Makefile   |  1 +
 mm/pmalloc-selftest.c | 65 +++
 mm/pmalloc-selftest.h | 30 
 mm/pmalloc.c  |  3 +++
 5 files changed, 106 insertions(+)
 create mode 100644 mm/pmalloc-selftest.c
 create mode 100644 mm/pmalloc-selftest.h

diff --git a/mm/Kconfig b/mm/Kconfig
index c782e8f..1de6ea6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -760,3 +760,10 @@ config GUP_BENCHMARK
  performance of get_user_pages_fast().
 
  See tools/testing/selftests/vm/gup_benchmark.c
+
+config PROTECTABLE_MEMORY_SELFTEST
+   bool "Run self test for pmalloc memory allocator"
+   default n
+   help
+ Tries to verify that pmalloc works correctly and that the memory
+ is effectively protected.
diff --git a/mm/Makefile b/mm/Makefile
index a6a47e1..1e76a9b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_ARCH_HAS_SET_MEMORY) += pmalloc.o
+obj-$(CONFIG_PROTECTABLE_MEMORY_SELFTEST) += pmalloc-selftest.o
 obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += page_poison.o
 obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/pmalloc-selftest.c b/mm/pmalloc-selftest.c
new file mode 100644
index 000..1c025f3
--- /dev/null
+++ b/mm/pmalloc-selftest.c
@@ -0,0 +1,65 @@
+/*
+ * pmalloc-selftest.c
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include 
+#include 
+
+
+#define SIZE_1 (PAGE_SIZE * 3)
+#define SIZE_2 1000
+
+#define validate_alloc(expected, variable, size)   \
+   pr_notice("must be " expected ": %s",   \
+ is_pmalloc_object(variable, size) > 0 ? "ok" : "no")
+
+#define is_alloc_ok(variable, size)\
+   validate_alloc("ok", variable, size)
+
+#define is_alloc_no(variable, size)\
+   validate_alloc("no", variable, size)
+
+void pmalloc_selftest(void)
+{
+   struct gen_pool *pool_unprot;
+   struct gen_pool *pool_prot;
+   void *var_prot, *var_unprot, *var_vmall;
+
+   pr_notice("pmalloc self-test");
+   pool_unprot = pmalloc_create_pool("unprotected", 0);
+   pool_prot = pmalloc_create_pool("protected", 0);
+   BUG_ON(!(pool_unprot && pool_prot));
+
+   var_unprot = pmalloc(pool_unprot,  SIZE_1 - 1, GFP_KERNEL);
+   var_prot = pmalloc(pool_prot,  SIZE_1, GFP_KERNEL);
+   var_vmall = vmalloc(SIZE_2);
+   is_alloc_ok(var_unprot, 10);
+   is_alloc_ok(var_unprot, SIZE_1);
+   is_alloc_ok(var_unprot, PAGE_SIZE);
+   is_alloc_no(var_unprot, SIZE_1 + 1);
+   is_alloc_no(var_vmall, 10);
+
+
+   pfree(pool_unprot, var_unprot);
+   vfree(var_vmall);
+
+   pmalloc_protect_pool(pool_prot);
+
+   /* This will intentionally trigger a WARN because the pool being
+* destroyed is not protected, which is unusual and should happen
+* on error paths only, where probably other warnings are already
+* displayed.
+*/
+   pmalloc_destroy_pool(pool_unprot);
+
+   /* This must not cause WARNings */
+   pmalloc_destroy_pool(pool_prot);
+}
diff --git a/mm/pmalloc-selftest.h b/mm/pmalloc-selftest.h
new file mode 100644
index 000..3673d23
--- /dev/null
+++ b/mm/pmalloc-selftest.h
@@ -0,0 +1,30 @@
+/*
+ * pmalloc-selftest.h
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+
+#ifndef __PMALLOC_SELFTEST_H__
+#define __PMALLOC_SELFTEST_H__
+
+
+#ifdef CONFIG_PROTECTABLE_MEMORY_SELFTEST
+
+#include 
+
+void pmalloc_selftest(void);
+
+#else
+
+static inline void pmalloc_selftest(void){};
+
+#endif
+
+#endif
diff --git a/mm/pmalloc.c b/mm/pmalloc.c
index a64ac49..a722d7b 100644
--- a/mm/pmalloc.c
+++ b/mm/pmalloc.c
@@ -25,6 +25,8 @@
 #include 
 #include 
 
+#include "pmalloc-selftest.h"
+
 /**
  * pmalloc_data contains the data specific to a pmalloc pool,
  * in a format compatible with the design of gen_alloc.
@@ -508,6 +510,7 @@ static int __init pmalloc_late_init(void)
}
}
mutex_unlock(&pmalloc_mutex);
+   pmalloc_selftest();
return 0;
 }
 late_initcall(pmalloc_late_init);
-- 
2.9.3



Re: [PATCH 5/6] Documentation for Pmalloc

2018-01-24 Thread Igor Stoppa


On 24/01/18 21:14, Ralph Campbell wrote:
> 2 Minor typos inline below:

thanks for proof-reading, will fix accordingly.

--
igor


Re: [kernel-hardening] [PATCH 4/6] Protectable Memory

2018-01-25 Thread Igor Stoppa
Hi,

thanks for the review. My reply below.

On 24/01/18 21:10, Jann Horn wrote:

> I'm not entirely convinced by the approach of marking small parts of
> kernel memory as readonly for hardening.

Because of the physmap you mention later?

Regarding small parts vs big parts (what is big enough?), I did propose
the use of a custom zone at the very beginning, however I met 2 objections:

1. It's not a special case and there was no will to reserve another zone.
   This might be mitigated by aliasing with a zone that is already
   defined, but not in use, for example DMA or DMA32.
   But it looks like a good way to replicate the confusion that is
   struct page. Anyway, I found the next objection more convincing.

2. What would be the size of this zone? It would become something that
   is really application specific. At the very least it should become a
   command line parameter. A distro would have to allocate a lot of
   memory for it, because it cannot really know upfront what its users
   will do. But, most likely, the vast majority of users would never
   need that much.

If you have some idea of how to address these objections without using
vmalloc, or at least without using the same page provider that vmalloc
is using now, I'd be interested to hear it.

Besides the double mapping problem, the major benefit I can see from
having a contiguous area is that it simplifies the hardened user copy
verification, because there is a fixed range to test for overlap.

> Comments on some details are inline.

thank you

>> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
>> index 1e5d8c3..116d280 100644
>> --- a/include/linux/vmalloc.h
>> +++ b/include/linux/vmalloc.h
>> @@ -20,6 +20,7 @@ struct notifier_block;/* in notifier.h */
>>  #define VM_UNINITIALIZED   0x0020  /* vm_struct is not fully 
>> initialized */
>>  #define VM_NO_GUARD0x0040  /* don't add guard page */
>>  #define VM_KASAN   0x0080  /* has allocated kasan 
>> shadow memory */
>> +#define VM_PMALLOC 0x0100  /* pmalloc area - see docs */
> 
> Is "see docs" specific enough to actually guide the reader to the
> right documentation?

The doc file is named pmalloc.txt, but I can be more explicit.

>> +#define pmalloc_attr_init(data, attr_name) \
>> +do { \
>> +   sysfs_attr_init(&data->attr_##attr_name.attr); \
>> +   data->attr_##attr_name.attr.name = #attr_name; \
>> +   data->attr_##attr_name.attr.mode = VERIFY_OCTAL_PERMISSIONS(0444); \
>> +   data->attr_##attr_name.show = pmalloc_pool_show_##attr_name; \
>> +} while (0)
> 
> Is there a good reason for making all these files mode 0444 (as
> opposed to setting them to 0400 and then allowing userspace to make
> them accessible if desired)? /proc/slabinfo contains vaguely similar
> data and is mode 0400 (or mode 0600, depending on the kernel config)
> AFAICS.

ok, you do have a point, so far I have been mostly focusing on the
"drop-in replacement for kmalloc" aspect.

>> +void *pmalloc(struct gen_pool *pool, size_t size, gfp_t gfp)
>> +{
> [...]
>> +   /* Expand pool */
>> +   chunk_size = roundup(size, PAGE_SIZE);
>> +   chunk = vmalloc(chunk_size);
> 
> You're allocating with vmalloc(), which, as far as I know, establishes
> a second mapping in the vmalloc area for pages that are already mapped
> as RW through the physmap. AFAICS, later, when you're trying to make
> pages readonly, you're only changing the protections on the second
> mapping in the vmalloc area, therefore leaving the memory writable
> through the physmap. Is that correct? If so, please either document
> the reasoning why this is okay or change it.

About why vmalloc as backend for pmalloc, please refer to this:

http://www.openwall.com/lists/kernel-hardening/2018/01/24/11

I tried to give a short summary of what took me toward vmalloc.
vmalloc is also a convenient way of obtaining arbitrarily (within
reason) large amounts of virtually contiguous memory.

Your objection is toward the unprotected access, through the alternate
mapping, rather than to the idea of having pools that can be protected
individually, right?

In the mail I linked, I explained that I could not use kmalloc because
of the problem of splitting huge pages on ARM.

kmalloc does require the physmap, for performance reasons.

However, vmalloc is already doing mapping of individual pages, because
it must ensure that they are virtually contiguous, so would it be
possible to have vmalloc _always_ outside of the physmap?

If I have understood correctly, the actual extent of the physmap is
highly architecture- and platform-dependent, so it might be (but I have
not checked) that in some cases (like some 32bit systems) vmalloc is
typically outside of the physmap, but probably that is not the case on 64bit?

Also, I need to understand how physmap works against vmalloc vs how it
works against kernel text and const/__ro_after_init sections.

Can they also be acc

[PATCH 2/2] genalloc: selftest

2018-01-11 Thread Igor Stoppa
Introduce a set of macros for writing concise test cases for genalloc.

The test cases are meant to provide regression testing, when working on
new functionality for genalloc.

Primarily they are meant to confirm that the various allocation strategies
will continue to work as expected.

The execution of the self testing is controlled through a Kconfig option.

Signed-off-by: Igor Stoppa 
---
 include/linux/genalloc-selftest.h |  30 +++
 init/main.c   |   2 +
 lib/Kconfig   |  14 ++
 lib/Makefile  |   1 +
 lib/genalloc-selftest.c   | 402 ++
 5 files changed, 449 insertions(+)
 create mode 100644 include/linux/genalloc-selftest.h
 create mode 100644 lib/genalloc-selftest.c

diff --git a/include/linux/genalloc-selftest.h 
b/include/linux/genalloc-selftest.h
new file mode 100644
index 000..7af1901
--- /dev/null
+++ b/include/linux/genalloc-selftest.h
@@ -0,0 +1,30 @@
+/*
+ * genalloc-selftest.h
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+
+#ifndef __GENALLOC_SELFTEST_H__
+#define __GENALLOC_SELFTEST_H__
+
+
+#ifdef CONFIG_GENERIC_ALLOCATOR_SELFTEST
+
+#include 
+
+void genalloc_selftest(void);
+
+#else
+
+static inline void genalloc_selftest(void){};
+
+#endif
+
+#endif
diff --git a/init/main.c b/init/main.c
index 0e4d39c..2bdacb9 100644
--- a/init/main.c
+++ b/init/main.c
@@ -87,6 +87,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -649,6 +650,7 @@ asmlinkage __visible void __init start_kernel(void)
 */
mem_encrypt_init();
 
+   genalloc_selftest();
 #ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
diff --git a/lib/Kconfig b/lib/Kconfig
index b1445b2..89fa195 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -291,6 +291,20 @@ config DECOMPRESS_LZ4
 config GENERIC_ALLOCATOR
bool
 
+config GENERIC_ALLOCATOR_SELFTEST
+   bool "genalloc tester"
+   default n
+   select GENERIC_ALLOCATOR
+   help
+ Enable automated testing of the generic allocator.
+
+config GENERIC_ALLOCATOR_SELFTEST_VERBOSE
+   bool "make the genalloc tester more verbose"
+   default n
+   select GENERIC_ALLOCATOR_SELFTEST
+   help
+ More information will be displayed during the self-testing.
+
 #
 # reed solomon support is select'ed if needed
 #
diff --git a/lib/Makefile b/lib/Makefile
index b8f2c16..ff7ad5f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_LIBCRC32C) += libcrc32c.o
 obj-$(CONFIG_CRC8) += crc8.o
 obj-$(CONFIG_XXHASH)   += xxhash.o
 obj-$(CONFIG_GENERIC_ALLOCATOR) += genalloc.o
+obj-$(CONFIG_GENERIC_ALLOCATOR_SELFTEST) += genalloc-selftest.o
 
 obj-$(CONFIG_842_COMPRESS) += 842/
 obj-$(CONFIG_842_DECOMPRESS) += 842/
diff --git a/lib/genalloc-selftest.c b/lib/genalloc-selftest.c
new file mode 100644
index 000..007a0cf
--- /dev/null
+++ b/lib/genalloc-selftest.c
@@ -0,0 +1,402 @@
+/*
+ * genalloc-selftest.c
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+
+
+/* Keep the bitmap small, while including case of cross-ulong mapping.
+ * For simplicity, the test cases use only 1 chunk of memory.
+ */
+#define BITMAP_SIZE_C 16
+#define ALLOC_ORDER 0
+
+#define ULONG_SIZE (sizeof(unsigned long))
+#define BITMAP_SIZE_UL (BITMAP_SIZE_C / ULONG_SIZE)
+#define MIN_ALLOC_SIZE (1 << ALLOC_ORDER)
+#define ENTRIES (BITMAP_SIZE_C * 8)
+#define CHUNK_SIZE  (MIN_ALLOC_SIZE * ENTRIES)
+
+#ifndef CONFIG_GENERIC_ALLOCATOR_SELFTEST_VERBOSE
+
+static inline void print_first_chunk_bitmap(struct gen_pool *pool) {}
+
+#else
+
+static void print_first_chunk_bitmap(struct gen_pool *pool)
+{
+   struct gen_pool_chunk *chunk;
+   char bitmap[BITMAP_SIZE_C * 2 + 1];
+   unsigned long i;
+   char *bm = bitmap;
+   char *entry;
+
+   if (unlikely(pool == NULL || pool->chunks.next == NULL))
+   return;
+
+   chunk = container_of(pool->chunks.next, struct gen_pool_chunk,
+next_chunk);
+   entry = (void *)chunk->entries;
+   for (i = 1; i <= BITMAP_SIZE_C; i++)
+   bm += snprintf(bm, 3, "%02hhx", entry[BITMAP_SIZE_C - i]);
+   *bm = '\0'

[RESEND PATCH v2 0/2] mm: genalloc - track beginning of allocations

2018-01-11 Thread Igor Stoppa
This is a partial resend:
- the primary functionality (PATCH 1/2) is unmodified
- while waiting for review, I added selftest capability for genalloc (2/2)


During the effort of introducing into the kernel an allocator for
protectable memory (pmalloc), it was noticed that genalloc could be
improved, so that it can tell apart the memory used by adjacent
allocations.

However, it seems that the functionality could have a value of its own.

It can:
- verify that the freeing of memory is consistent with previous allocations
- relieve the user of the API from tracking the size of each allocation
- enable use cases where generic code can free memory allocations received
  through a pointer (provided that the reference pool is known)

Details about the implementation are provided in the comment for the patch.

I mentioned this idea a few months ago, as part of the pmalloc discussion,
but then I did not have time to follow up immediately, as I had hoped.

This is an implementation of what I had in mind.
It seems to withstand the several test cases I put together, in the form of
a self-test, but it definitely would need thorough review.


I hope I have added as reviewer all the relevant people.
If I missed someone, please include them to the recipients.



Igor Stoppa (2):
  genalloc: track beginning of allocations
  genalloc: selftest

 include/linux/genalloc-selftest.h |  30 +++
 include/linux/genalloc.h  |   3 +-
 init/main.c   |   2 +
 lib/Kconfig   |  14 ++
 lib/Makefile  |   1 +
 lib/genalloc-selftest.c   | 402 
 lib/genalloc.c| 417 ++
 7 files changed, 738 insertions(+), 131 deletions(-)
 create mode 100644 include/linux/genalloc-selftest.h
 create mode 100644 lib/genalloc-selftest.c

-- 
2.9.3



[PATCH 1/2] genalloc: track beginning of allocations

2018-01-11 Thread Igor Stoppa
The genalloc library is only capable of tracking if a certain unit of
allocation is in use or not.

It is not capable of discerning where the memory associated with an
allocation request begins and where it ends.

The reason is that units of allocations are tracked by using a bitmap,
where each bit represents that the unit is either allocated (1) or
available (0).

The user of the API must keep track of how much space was requested, if
it ever needs to be freed.

This can cause errors to go undetected.
For example:
* Only a subset of the memory provided to an allocation request is freed
* The memory from a subsequent allocation is freed
* The memory being freed doesn't start at the beginning of an
  allocation.

The bitmap is used because it allows lockless read/write access, where
this is supported by the hardware through cmpxchg.
Similarly, it is possible to scan the bitmap for a sufficiently long
sequence of zeros, to identify zones available for allocation.

--

This patch doubles the space reserved in the bitmap for each allocation.
By using 2 bits per allocation, it is possible to encode also the
information of where the allocation starts:
(msb to the left, lsb to the right, in the following "dictionary")

11: first allocation unit in the allocation
10: any subsequent allocation unit (if any) in the allocation
00: available allocation unit
01: invalid

Ex, with the same notation as above - MSb...LSb:

 ...100010101011   <-- Read in this direction.
\__|\__|\|\|\__|
   |   | | |   \___ 4 used allocation units
   |   | | \___ 3 empty allocation units
   |   | \_ 1 used allocation unit
   |   \___ 2 used allocation units
   \___ 2 empty allocation units

Because of the encoding, the previous lockless operations are still
possible. The only caveat is to change the parameter of the zero-finding
function which establishes the alignment at which to perform the test
for first zero.
The original value of the parameter is 0, meaning that an allocation can
start at any point in the bitmap, while the new value is 1, meaning that
allocations can start only at even places (bit 0, bit 2, etc.).
The number of zeroes to look for must therefore be doubled.

When it's time to free the memory associated to an allocation request,
it's a matter of checking if the corresponding allocation unit is really
the beginning of an allocation (both bits are set to 1).
Looking for the ending can also be performed locklessly.
It's sufficient to identify the first mapped allocation unit
that is represented either as free (00) or busy (11).
Even if the allocation status should change in the meanwhile, it doesn't
matter, since it can only transition between free (00) and
first-allocated (11).
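
As an illustration of this encoding (editorial sketch, not part of the
patch; the helper names and constants below are invented for the example),
reading an entry back and measuring the length of an allocation could look
roughly like this:

#define UNIT_HEAD  0x3UL  /* 11: first unit of an allocation      */
#define UNIT_BODY  0x2UL  /* 10: subsequent unit of an allocation */
#define UNIT_FREE  0x0UL  /* 00: available unit                   */

/* Fetch the 2-bit entry describing allocation unit 'idx'. */
static unsigned long unit_entry(const unsigned long *map, unsigned long idx)
{
	unsigned long bit = idx * 2;

	return (map[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 0x3UL;
}

/* Return the length, in units, of the allocation whose head is 'head'. */
static unsigned long alloc_units(const unsigned long *map,
				 unsigned long head, unsigned long nentries)
{
	unsigned long i = head + 1;

	if (unit_entry(map, head) != UNIT_HEAD)
		return 0;	/* not the beginning of an allocation */
	/* the allocation ends at the first entry that is not "10" */
	while (i < nentries && unit_entry(map, i) == UNIT_BODY)
		i++;
	return i - head;
}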

The parameter indicating to the *_free() function the size of the space
that should be freed is not currently removed, to facilitate the
transition, but it is verified, whenever it is not zero.
If it is set to zero, then the free function will autonomously decide the
size to be free, by scanning the bitmap.

About the implementation: the patch introduces the concept of "bitmap
entry", which has a 1:1 mapping with allocation units, while the code
being patched has a 1:1 mapping between allocation units and bits.

This means that, now, the bitmap can be extended (by following powers of
2), to track also other properties of the allocations, if ever needed.

Signed-off-by: Igor Stoppa 
---
 include/linux/genalloc.h |   3 +-
 lib/genalloc.c   | 417 ---
 2 files changed, 289 insertions(+), 131 deletions(-)

diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index 6dfec4d..a8fdabf 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -32,6 +32,7 @@
 
 #include 
 #include 
+#include 
 
 struct device;
 struct device_node;
@@ -75,7 +76,7 @@ struct gen_pool_chunk {
phys_addr_t phys_addr;  /* physical starting address of memory 
chunk */
unsigned long start_addr;   /* start address of memory chunk */
unsigned long end_addr; /* end address of memory chunk 
(inclusive) */
-   unsigned long bits[0];  /* bitmap for allocating memory chunk */
+   unsigned long entries[0];   /* bitmap for allocating memory chunk */
 };
 
 /*
diff --git a/lib/genalloc.c b/lib/genalloc.c
index 144fe6b..13bc8cf 100644
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -36,114 +36,221 @@
 #include 
 #include 
 
+#define ENTRY_ORDER 1UL
+#define ENTRY_MASK ((1UL << ((ENTRY_ORDER) + 1UL)) - 1UL)
+#define ENTRY_HEAD ENTRY_MASK
+#define ENTRY_UNUSED 0UL
+#define BITS_PER_ENTRY (1U << ENTRY_ORDER)
+#define BITS_DIV_ENTRIES(x) ((x) >> ENTRY_ORDER)
+#define ENTRIES_TO_BITS(x) ((x) << ENTRY_ORDER)
+#define BITS_DIV_LONGS(x) ((x) / BITS_PER_LONG)
+#define ENTRIES_DIV_L

Re: [kernel-hardening] [PATCH 4/6] Protectable Memory

2018-01-30 Thread Igor Stoppa
On 26/01/18 18:36, Boris Lukashev wrote:
> I like the idea of making the verification call optional for consumers
> allowing for fast/slow+hard paths depending on their needs.
> Cant see any additional vectors for abuse (other than the original
> ones effecting out-of-band modification) introduced by having
> verify/normal callers, but i've not had enough coffee yet. Any access
> races or things like that come to mind for anyone?

Well, the devil is in the details.
In this case, the question is how to perform the verification in a way
that is sufficiently robust against races.

After thinking about it for a while, I doubt it can be done reliably.
It might work for some small data types, but the typical use case I have
found myself dealing with is protecting data structures.

That also brings up a separate problem: what would be the size of the data
to hash? At one extreme there is a page, but that is probably too much, so
what is the correct size? It cannot be smaller than a specific
allocation; however, that would imply looking up the hash related to the
data being accessed, with extra overhead.

And the data being accessed might be a field in a struct, for which we
would not have any hash.
There would be a hash only for the containing struct that was allocated ...


Overall, it seems a good idea in theory, but when I think about its
implementation, it seems like the overhead is so big that it would
discourage its use for almost any practical purpose.

If one really wants to be paranoid, one could, OTOH, have redundancy in a
different pool.

--
igor



[RFC PATCH v12 0/6] mm: security: ro protection for dynamic data

2018-01-30 Thread Igor Stoppa
This patch-set introduces the possibility of protecting memory that has
been allocated dynamically.

The memory is managed in pools: when a memory pool is turned into R/O,
all the memory that is part of it will become R/O.

A R/O pool can be destroyed, to recover its memory, but it cannot be
turned back into R/W mode.

This is intentional. This feature is meant for data that doesn't need
further modifications after initialization.

However, the data might need to be released, for example as part of module
unloading.
To do this, the memory must first be freed, then the pool can be destroyed.

An example is provided, in the form of self-testing.
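
For orientation, here is also a minimal usage sketch (an editorial
addition, not taken from the self-test; the pool name and data are
invented and error handling is reduced to the bare minimum), based on
the API introduced in patch 4/6:

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/pmalloc.h>

static struct gen_pool *conf_pool;
static long *conf_data;

static int __init conf_init(void)
{
	/* create a pool, with the default minimum allocation order */
	conf_pool = pmalloc_create_pool("conf", PMALLOC_DEFAULT_ALLOC_ORDER);
	if (!conf_pool)
		return -ENOMEM;

	/* allocate from the pool while it is still writable ... */
	conf_data = pzalloc(conf_pool, 100 * sizeof(*conf_data), GFP_KERNEL);
	if (!conf_data) {
		pmalloc_destroy_pool(conf_pool);
		return -ENOMEM;
	}

	/* ... initialize the memory obtained ... */
	conf_data[0] = 42;

	/* ... and seal the pool: conf_data is read-only from now on */
	pmalloc_protect_pool(conf_pool);
	return 0;
}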

Changes since the v11 version:
[http://www.openwall.com/lists/kernel-hardening/2018/01/24/4]

- restricted access to sysfs entries created (444 -> 400)
- more explicit reference to documentation
- couple of typos

Igor Stoppa (6):
  genalloc: track beginning of allocations
  genalloc: selftest
  struct page: add field for vm_struct
  Protectable Memory
  Documentation for Pmalloc
  Pmalloc: self-test

 Documentation/core-api/pmalloc.txt | 104 
 include/linux/genalloc-selftest.h  |  30 +++
 include/linux/genalloc.h   |   5 +-
 include/linux/mm_types.h   |   1 +
 include/linux/pmalloc.h| 216 
 include/linux/vmalloc.h|   1 +
 init/main.c|   2 +
 lib/Kconfig|  15 ++
 lib/Makefile   |   1 +
 lib/genalloc-selftest.c| 402 +
 lib/genalloc.c | 444 +--
 mm/Kconfig |   7 +
 mm/Makefile|   2 +
 mm/pmalloc-selftest.c  |  65 +
 mm/pmalloc-selftest.h  |  30 +++
 mm/pmalloc.c   | 516 +
 mm/usercopy.c  |  25 +-
 mm/vmalloc.c   |  18 +-
 18 files changed, 1744 insertions(+), 140 deletions(-)
 create mode 100644 Documentation/core-api/pmalloc.txt
 create mode 100644 include/linux/genalloc-selftest.h
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 lib/genalloc-selftest.c
 create mode 100644 mm/pmalloc-selftest.c
 create mode 100644 mm/pmalloc-selftest.h
 create mode 100644 mm/pmalloc.c

-- 
2.9.3



[PATCH 1/6] genalloc: track beginning of allocations

2018-01-30 Thread Igor Stoppa
The genalloc library is only capable of tracking if a certain unit of
allocation is in use or not.

It is not capable of discerning where the memory associated with an
allocation request begins and where it ends.

The reason is that units of allocations are tracked by using a bitmap,
where each bit represents that the unit is either allocated (1) or
available (0).

The user of the API must keep track of how much space was requested, if
it ever needs to be freed.

This can cause errors to go undetected.
For example:
* Only a subset of the memory provided to an allocation request is freed
* The memory from a subsequent allocation is freed
* The memory being freed doesn't start at the beginning of an
  allocation.

The bitmap is used because it allows lockless read/write access, where
this is supported by the hardware through cmpxchg.
Similarly, it is possible to scan the bitmap for a sufficiently long
sequence of zeros, to identify zones available for allocation.

--

This patch doubles the space reserved in the bitmap for each allocation.
By using 2 bits per allocation, it is possible to encode also the
information of where the allocation starts:
(msb to the left, lsb to the right, in the following "dictionary")

11: first allocation unit in the allocation
10: any subsequent allocation unit (if any) in the allocation
00: available allocation unit
01: invalid

Ex, with the same notation as above - MSb...LSb:

 ...100010101011   <-- Read in this direction.
\__|\__|\|\|\__|
   |   | | |   \___ 4 used allocation units
   |   | | \___ 3 empty allocation units
   |   | \_ 1 used allocation unit
   |   \___ 2 used allocation units
   \___ 2 empty allocation units

Because of the encoding, the previous lockless operations are still
possible. The only caveat is to change the parameter of the zero-finding
function which establishes the alignment at which to perform the test
for first zero.
The original value of the parameter is 0, meaning that an allocation can
start at any point in the bitmap, while the new value is 1, meaning that
allocations can start only at even places (bit 0, bit 2, etc.).
The number of zeroes to look for must therefore be doubled.

When it's time to free the memory associated to an allocation request,
it's a matter of checking if the corresponding allocation unit is really
the beginning of an allocation (both bits are set to 1).
Looking for the ending can also be performed locklessly.
It's sufficient to identify the first mapped allocation unit
that is represented either as free (00) or busy (11).
Even if the allocation status should change in the meanwhile, it doesn't
matter, since it can only transition between free (00) and
first-allocated (11).

The parameter indicating to the *_free() function the size of the space
that should be freed is not currently removed, to facilitate the
transition, but it is verified, whenever it is not zero.
If it is set to zero, then the free function will autonomously decide the
size to be free, by scanning the bitmap.

About the implementation: the patch introduces the concept of "bitmap
entry", which has a 1:1 mapping with allocation units, while the code
being patched has a 1:1 mapping between allocation units and bits.

This means that, now, the bitmap can be extended (by following powers of
2), to track also other properties of the allocations, if ever needed.

Signed-off-by: Igor Stoppa 
---
 include/linux/genalloc.h |   2 +-
 lib/genalloc.c   | 417 ---
 2 files changed, 288 insertions(+), 131 deletions(-)

diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index 872f930..0377681 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -76,7 +76,7 @@ struct gen_pool_chunk {
phys_addr_t phys_addr;  /* physical starting address of memory 
chunk */
unsigned long start_addr;   /* start address of memory chunk */
unsigned long end_addr; /* end address of memory chunk 
(inclusive) */
-   unsigned long bits[0];  /* bitmap for allocating memory chunk */
+   unsigned long entries[0];   /* bitmap for allocating memory chunk */
 };
 
 /*
diff --git a/lib/genalloc.c b/lib/genalloc.c
index ca06adc..dde7830 100644
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -36,114 +36,221 @@
 #include 
 #include 
 
+#define ENTRY_ORDER 1UL
+#define ENTRY_MASK ((1UL << ((ENTRY_ORDER) + 1UL)) - 1UL)
+#define ENTRY_HEAD ENTRY_MASK
+#define ENTRY_UNUSED 0UL
+#define BITS_PER_ENTRY (1U << ENTRY_ORDER)
+#define BITS_DIV_ENTRIES(x) ((x) >> ENTRY_ORDER)
+#define ENTRIES_TO_BITS(x) ((x) << ENTRY_ORDER)
+#define BITS_DIV_LONGS(x) ((x) / BITS_PER_LONG)
+#define ENTRIES_DIV_LONGS(x) (BITS_DIV_LONGS(ENTRIES_TO_BITS(x)))
+
+#define ENTRIES_PER_LONG BITS_DIV_ENTRIES(BITS_PE

[PATCH 2/6] genalloc: selftest

2018-01-30 Thread Igor Stoppa
Introduce a set of macros for writing concise test cases for genalloc.

The test cases are meant to provide regression testing, when working on
new functionality for genalloc.

Primarily they are meant to confirm that the various allocation strategies
will continue to work as expected.

The execution of the self testing is controlled through a Kconfig option.

Signed-off-by: Igor Stoppa 
---
 include/linux/genalloc-selftest.h |  30 +++
 init/main.c   |   2 +
 lib/Kconfig   |  15 ++
 lib/Makefile  |   1 +
 lib/genalloc-selftest.c   | 402 ++
 5 files changed, 450 insertions(+)
 create mode 100644 include/linux/genalloc-selftest.h
 create mode 100644 lib/genalloc-selftest.c

diff --git a/include/linux/genalloc-selftest.h 
b/include/linux/genalloc-selftest.h
new file mode 100644
index 000..7af1901
--- /dev/null
+++ b/include/linux/genalloc-selftest.h
@@ -0,0 +1,30 @@
+/*
+ * genalloc-selftest.h
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+
+#ifndef __GENALLOC_SELFTEST_H__
+#define __GENALLOC_SELFTEST_H__
+
+
+#ifdef CONFIG_GENERIC_ALLOCATOR_SELFTEST
+
+#include 
+
+void genalloc_selftest(void);
+
+#else
+
+static inline void genalloc_selftest(void){};
+
+#endif
+
+#endif
diff --git a/init/main.c b/init/main.c
index a8100b9..fb844aa 100644
--- a/init/main.c
+++ b/init/main.c
@@ -89,6 +89,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -660,6 +661,7 @@ asmlinkage __visible void __init start_kernel(void)
 */
mem_encrypt_init();
 
+   genalloc_selftest();
 #ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
diff --git a/lib/Kconfig b/lib/Kconfig
index c5e84fb..430026d0 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -287,6 +287,21 @@ config DECOMPRESS_LZ4
 config GENERIC_ALLOCATOR
bool
 
+config GENERIC_ALLOCATOR_SELFTEST
+   bool "genalloc tester"
+   default n
+   select GENERIC_ALLOCATOR
+   help
+ Enable automated testing of the generic allocator.
+ The testing is primarily for the tracking of allocated space.
+
+config GENERIC_ALLOCATOR_SELFTEST_VERBOSE
+   bool "make the genalloc tester more verbose"
+   default n
+   select GENERIC_ALLOCATOR_SELFTEST
+   help
+ More information will be displayed during the self-testing.
+
 #
 # reed solomon support is select'ed if needed
 #
diff --git a/lib/Makefile b/lib/Makefile
index d11c48e..ba06e83 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -108,6 +108,7 @@ obj-$(CONFIG_LIBCRC32C) += libcrc32c.o
 obj-$(CONFIG_CRC8) += crc8.o
 obj-$(CONFIG_XXHASH)   += xxhash.o
 obj-$(CONFIG_GENERIC_ALLOCATOR) += genalloc.o
+obj-$(CONFIG_GENERIC_ALLOCATOR_SELFTEST) += genalloc-selftest.o
 
 obj-$(CONFIG_842_COMPRESS) += 842/
 obj-$(CONFIG_842_DECOMPRESS) += 842/
diff --git a/lib/genalloc-selftest.c b/lib/genalloc-selftest.c
new file mode 100644
index 000..007a0cf
--- /dev/null
+++ b/lib/genalloc-selftest.c
@@ -0,0 +1,402 @@
+/*
+ * genalloc-selftest.c
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+
+
+/* Keep the bitmap small, while including case of cross-ulong mapping.
+ * For simplicity, the test cases use only 1 chunk of memory.
+ */
+#define BITMAP_SIZE_C 16
+#define ALLOC_ORDER 0
+
+#define ULONG_SIZE (sizeof(unsigned long))
+#define BITMAP_SIZE_UL (BITMAP_SIZE_C / ULONG_SIZE)
+#define MIN_ALLOC_SIZE (1 << ALLOC_ORDER)
+#define ENTRIES (BITMAP_SIZE_C * 8)
+#define CHUNK_SIZE  (MIN_ALLOC_SIZE * ENTRIES)
+
+#ifndef CONFIG_GENERIC_ALLOCATOR_SELFTEST_VERBOSE
+
+static inline void print_first_chunk_bitmap(struct gen_pool *pool) {}
+
+#else
+
+static void print_first_chunk_bitmap(struct gen_pool *pool)
+{
+   struct gen_pool_chunk *chunk;
+   char bitmap[BITMAP_SIZE_C * 2 + 1];
+   unsigned long i;
+   char *bm = bitmap;
+   char *entry;
+
+   if (unlikely(pool == NULL || pool->chunks.next == NULL))
+   return;
+
+   chunk = container_of(pool->chunks.next, struct gen_pool_chunk,
+next_chunk);
+   entry = (void *)chunk->entries;
+   for (i = 1; i <= BITMAP_SIZE_C; i++)
+   bm += snprintf

[PATCH 3/6] struct page: add field for vm_struct

2018-01-30 Thread Igor Stoppa
When a page is used for virtual memory, it is often necessary to obtain
a handle to the corresponding vm_struct, which refers to the virtually
contiguous area generated when invoking vmalloc.

The struct page has a "mapping" field, which can be re-used to store a
pointer to the parent area. This avoids more expensive searches.

As an example, the function find_vm_area is reimplemented to take advantage
of the newly introduced field.

Signed-off-by: Igor Stoppa 
---
 include/linux/mm_types.h |  1 +
 mm/vmalloc.c | 18 +-
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cfd0ac4..2abd540 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -56,6 +56,7 @@ struct page {
void *s_mem;/* slab first object */
atomic_t compound_mapcount; /* first tail page */
/* page_deferred_list().next -- second tail page */
+   struct vm_struct *area;
};
 
/* Second double word */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6739420..44c5dfc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1466,13 +1466,16 @@ struct vm_struct *get_vm_area_caller(unsigned long 
size, unsigned long flags,
  */
 struct vm_struct *find_vm_area(const void *addr)
 {
-   struct vmap_area *va;
+   struct page *page;
 
-   va = find_vmap_area((unsigned long)addr);
-   if (va && va->flags & VM_VM_AREA)
-   return va->vm;
+   if (unlikely(!is_vmalloc_addr(addr)))
+   return NULL;
 
-   return NULL;
+   page = vmalloc_to_page(addr);
+   if (unlikely(!page))
+   return NULL;
+
+   return page->area;
 }
 
 /**
@@ -1536,6 +1539,7 @@ static void __vunmap(const void *addr, int 
deallocate_pages)
struct page *page = area->pages[i];
 
BUG_ON(!page);
+   page->area = NULL;
__free_pages(page, 0);
}
 
@@ -1744,6 +1748,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned 
long align,
const void *caller)
 {
struct vm_struct *area;
+   unsigned int page_counter;
void *addr;
unsigned long real_size = size;
 
@@ -1769,6 +1774,9 @@ void *__vmalloc_node_range(unsigned long size, unsigned 
long align,
 
kmemleak_vmalloc(area, size, gfp_mask);
 
+   for (page_counter = 0; page_counter < area->nr_pages; page_counter++)
+   area->pages[page_counter]->area = area;
+
return addr;
 
 fail:
-- 
2.9.3



[PATCH 4/6] Protectable Memory

2018-01-30 Thread Igor Stoppa
The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.

However, the MMU-based protection works efficiently only when said pages
contain exclusively data that will not need further modifications.

Statically allocated variables can be segregated into a dedicated
section, but this does not sit very well with dynamically allocated
ones.

Dynamic allocation does not currently provide any means for grouping
variables in memory pages that would contain exclusively data suitable
for conversion to read-only access mode.

The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.

A module can request a pool and then refer any allocation request to the
pool handler it has received.

Once all the chunks of memory associated to a specific pool are
initialized, the pool can be protected.

After this point, the pool can only be destroyed (it is up to the module
to avoid any further references to the memory from the pool, after
the destruction is invoked).

The latter case is mainly meant for releasing memory, when a module is
unloaded.

A module can have as many pools as needed, for example to support the
protection of data that is initialized in sufficiently distinct phases.

Signed-off-by: Igor Stoppa 
---
 include/linux/genalloc.h |   3 +
 include/linux/pmalloc.h  | 216 
 include/linux/vmalloc.h  |   1 +
 lib/genalloc.c   |  27 +++
 mm/Makefile  |   1 +
 mm/pmalloc.c | 513 +++
 mm/usercopy.c|  25 ++-
 7 files changed, 782 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index 0377681..a486a26 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -121,6 +121,9 @@ extern unsigned long gen_pool_alloc_algo(struct gen_pool *, 
size_t,
 extern void *gen_pool_dma_alloc(struct gen_pool *pool, size_t size,
dma_addr_t *dma);
 extern void gen_pool_free(struct gen_pool *, unsigned long, size_t);
+
+extern void gen_pool_flush_chunk(struct gen_pool *pool,
+struct gen_pool_chunk *chunk);
 extern void gen_pool_for_each_chunk(struct gen_pool *,
void (*)(struct gen_pool *, struct gen_pool_chunk *, void *), void *);
 extern size_t gen_pool_avail(struct gen_pool *);
diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index 000..ad7d557
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,216 @@
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifndef _PMALLOC_H
+#define _PMALLOC_H
+
+
+#include 
+#include 
+#include 
+
+#define PMALLOC_DEFAULT_ALLOC_ORDER (-1)
+
+/*
+ * Library for dynamic allocation of pools of memory that can be,
+ * after initialization, marked as read-only.
+ *
+ * This is intended to complement __read_only_after_init, for those cases
+ * where either it is not possible to know the initialization value before
+ * init is completed, or the amount of data is variable and can be
+ * determined only at run-time.
+ *
+ * ***WARNING***
+ * The user of the API is expected to synchronize:
+ * 1) allocation,
+ * 2) writes to the allocated memory,
+ * 3) write protection of the pool,
+ * 4) freeing of the allocated memory, and
+ * 5) destruction of the pool.
+ *
+ * For a non-threaded scenario, this type of locking is not even required.
+ *
+ * Even if the library were to provide support for locking, point 2)
+ * would still depend on the user taking the lock.
+ */
+
+
+/**
+ * pmalloc_create_pool - create a new protectable memory pool -
+ * @name: the name of the pool, must be unique
+ * @min_alloc_order: log2 of the minimum allocation size obtainable
+ *   from the pool
+ *
+ * Creates a new (empty) memory pool for allocation of protectable
+ * memory. Memory will be allocated upon request (through pmalloc).
+ *
+ * Returns a pointer to the new pool upon success, otherwise a NULL.
+ */
+struct gen_pool *pmalloc_create_pool(const char *name,
+int min_alloc_order);
+
+
+int is_pmalloc_object(const void *ptr, const unsigned long n);
+
+/**
+ * pmalloc_prealloc - tries to allocate a memory chunk of the requested size
+ * @pool: handler to the pool to be used for memory allocation
+ * @size: amount of memory (in bytes) requested
+ *
+ * Prepares a chunk of the requested size.
+ * This is intended to both minimize latency in later memory requests and
+ * avoid sleeping during allocation.
+ * Memory

[PATCH 5/6] Documentation for Pmalloc

2018-01-30 Thread Igor Stoppa
Detailed documentation about the protectable memory allocator.

Signed-off-by: Igor Stoppa 
---
 Documentation/core-api/pmalloc.txt | 104 +
 1 file changed, 104 insertions(+)
 create mode 100644 Documentation/core-api/pmalloc.txt

diff --git a/Documentation/core-api/pmalloc.txt 
b/Documentation/core-api/pmalloc.txt
new file mode 100644
index 000..934d356
--- /dev/null
+++ b/Documentation/core-api/pmalloc.txt
@@ -0,0 +1,104 @@
+
+Protectable memory allocator
+
+
+Introduction
+
+
+When trying to perform an attack against a system, the attacker typically
+wants to alter the execution flow, in a way that allows actions which
+would otherwise be forbidden.
+
+In recent years there has been a lot of effort put into preventing the
+execution of arbitrary code, so the attacker is progressively pushed to
+look for alternatives.
+
+If code changes are either detected or even prevented, what is left is to
+alter kernel data.
+
+As countermeasure, constant data is collected in a section which is then
+marked as readonly.
+To expand on this, also statically allocated variables which are tagged
+as __ro_after_init will receive a similar treatment.
+The difference from constant data is that such variables can be still
+altered freely during the kernel init phase.
+
+However, such solution does not address those variables which could be
+treated essentially as read-only, but whose size is not known at compile
+time or cannot be fully initialized during the init phase.
+
+
+Design
+--
+
+pmalloc builds on top of genalloc, using the same concept of memory pools.
+A pool is a handle to a group of chunks of memory of various sizes.
+When created, a pool is empty. It will be populated by allocating chunks
+of memory, either when the first memory allocation request is received, or
+when a pre-allocation is performed.
+
+Either way, one or more memory pages will be obtained from vmalloc and
+registered in the pool as chunk. Subsequent requests will be satisfied by
+either using any available free space from the current chunks, or by
+allocating more vmalloc pages, should the current free space not suffice.
+
+This is the key point of pmalloc: it groups data that must be protected
+into a set of pages. The protection is performed through the MMU, which
+is a prerequisite and has a minimum granularity of one page.
+
+If the relevant variables were not grouped, write protecting one of them
+would also lock down other variables that happen to share the same page
+but still require further alterations over time.
+
+A pool is a group of pages that are write protected at the same time.
+Ideally, they have some high level correlation (ex: they belong to the
+same module), which justifies write protecting them all together.
+
+To keep overhead to a minimum, locking is left to the user of the API, in
+those cases where it is not strictly needed.
+Ideally, no further locking is required, since each module can have its own
+pool (or pools), which should, for example, avoid the need for cross-module
+or cross-thread synchronization when write protecting a pool.
+
+The overhead of creating an additional pool is minimal: a handful of bytes
+from kmalloc space for the metadata and then what is left unused from the
+page(s) registered as chunks.
+
+Compared to plain use of vmalloc, genalloc has the advantage of tightly
+packing the allocations, reducing the number of pages used and therefore
+the pressure on the TLB. The slight overhead in execution time of the
+allocation should be mostly irrelevant, because pmalloc memory is not
+meant to be allocated/freed in tight loops. Rather, it ought to be taken
+into use, initialized and write protected, and possibly destroyed later.
+
+Considering that not much data is supposed to be dynamically allocated
+and then marked as read-only, it shouldn't be an issue that the address
+range for pmalloc is limited, on 32-bit systems.
+
+Regarding SMP systems, the allocations are expected to happen mostly
+during an initial transient, after which there should be no more need to
+perform cross-processor synchronizations of page tables.
+
+
+Use
+---
+
+The typical sequence, when using pmalloc, is:
+
+1. create a pool
+2. [optional] pre-allocate some memory in the pool
+3. issue one or more allocation requests to the pool
+4. initialize the memory obtained
+   - iterate over points 3 & 4 as needed -
+5. write protect the pool
+6. use in read-only mode the handlers obtained through the allocations
+7. [optional] destroy the pool
+
+
+In a scenario where, for example due to some error, part or all of the
+allocations performed at point 3 must be reverted, it is possible to free
+them, as long as point 5 has not been executed, and the pool is still
+modifiable. Such freed memory can be re-used.
+Performing a free operation on a write-protected pool will, instead,
+simply release the corresponding memory from the accountin

[PATCH 6/6] Pmalloc: self-test

2018-01-30 Thread Igor Stoppa
Add basic self-test functionality for pmalloc.

Signed-off-by: Igor Stoppa 
---
 lib/genalloc.c|  2 +-
 mm/Kconfig|  7 ++
 mm/Makefile   |  1 +
 mm/pmalloc-selftest.c | 65 +++
 mm/pmalloc-selftest.h | 30 
 mm/pmalloc.c  |  9 ---
 6 files changed, 110 insertions(+), 4 deletions(-)
 create mode 100644 mm/pmalloc-selftest.c
 create mode 100644 mm/pmalloc-selftest.h

diff --git a/lib/genalloc.c b/lib/genalloc.c
index 62f69b3..7ba2ec9 100644
--- a/lib/genalloc.c
+++ b/lib/genalloc.c
@@ -542,7 +542,7 @@ void gen_pool_flush_chunk(struct gen_pool *pool,
memset(chunk->entries, 0,
   DIV_ROUND_UP(size >> pool->min_alloc_order * BITS_PER_ENTRY,
BITS_PER_BYTE));
-   atomic_set(&chunk->avail, size);
+   atomic_long_set(&chunk->avail, size);
 }
 
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 03ff770..f0c960e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -765,3 +765,10 @@ config GUP_BENCHMARK
  performance of get_user_pages_fast().
 
  See tools/testing/selftests/vm/gup_benchmark.c
+
+config PROTECTABLE_MEMORY_SELFTEST
+   bool "Run self test for pmalloc memory allocator"
+   default n
+   help
+ Tries to verify that pmalloc works correctly and that the memory
+ is effectively protected.
diff --git a/mm/Makefile b/mm/Makefile
index a6a47e1..1e76a9b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_ARCH_HAS_SET_MEMORY) += pmalloc.o
+obj-$(CONFIG_PROTECTABLE_MEMORY_SELFTEST) += pmalloc-selftest.o
 obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += page_poison.o
 obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/pmalloc-selftest.c b/mm/pmalloc-selftest.c
new file mode 100644
index 000..1c025f3
--- /dev/null
+++ b/mm/pmalloc-selftest.c
@@ -0,0 +1,65 @@
+/*
+ * pmalloc-selftest.c
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include 
+#include 
+
+
+#define SIZE_1 (PAGE_SIZE * 3)
+#define SIZE_2 1000
+
+#define validate_alloc(expected, variable, size)   \
+   pr_notice("must be " expected ": %s",   \
+ is_pmalloc_object(variable, size) > 0 ? "ok" : "no")
+
+#define is_alloc_ok(variable, size)\
+   validate_alloc("ok", variable, size)
+
+#define is_alloc_no(variable, size)\
+   validate_alloc("no", variable, size)
+
+void pmalloc_selftest(void)
+{
+   struct gen_pool *pool_unprot;
+   struct gen_pool *pool_prot;
+   void *var_prot, *var_unprot, *var_vmall;
+
+   pr_notice("pmalloc self-test");
+   pool_unprot = pmalloc_create_pool("unprotected", 0);
+   pool_prot = pmalloc_create_pool("protected", 0);
+   BUG_ON(!(pool_unprot && pool_prot));
+
+   var_unprot = pmalloc(pool_unprot,  SIZE_1 - 1, GFP_KERNEL);
+   var_prot = pmalloc(pool_prot,  SIZE_1, GFP_KERNEL);
+   var_vmall = vmalloc(SIZE_2);
+   is_alloc_ok(var_unprot, 10);
+   is_alloc_ok(var_unprot, SIZE_1);
+   is_alloc_ok(var_unprot, PAGE_SIZE);
+   is_alloc_no(var_unprot, SIZE_1 + 1);
+   is_alloc_no(var_vmall, 10);
+
+
+   pfree(pool_unprot, var_unprot);
+   vfree(var_vmall);
+
+   pmalloc_protect_pool(pool_prot);
+
+   /* This will intentionally trigger a WARN because the pool being
+* destroyed is not protected, which is unusual and should happen
+* on error paths only, where probably other warnings are already
+* displayed.
+*/
+   pmalloc_destroy_pool(pool_unprot);
+
+   /* This must not cause WARNings */
+   pmalloc_destroy_pool(pool_prot);
+}
diff --git a/mm/pmalloc-selftest.h b/mm/pmalloc-selftest.h
new file mode 100644
index 000..3673d23
--- /dev/null
+++ b/mm/pmalloc-selftest.h
@@ -0,0 +1,30 @@
+/*
+ * pmalloc-selftest.h
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+
+#ifndef __PMALLOC_SELFTEST_H__
+#define __PMALLOC_SELFTEST_H__
+
+
+#ifdef CONFIG_PROTECTABLE_MEMORY_SELFTEST
+
+#include 
+
+void pmalloc_selftest(void);
+
+#else
+
+static inline void pmalloc_selftest(void){};
+
+#endif
+
+#endif
diff --git a/mm/pmalloc.c b/mm/pmalloc.c
index a64ac49..73387d7 100644
--- a/mm/pmallo

Re: [PATCH 3/6] struct page: add field for vm_struct

2018-02-01 Thread Igor Stoppa


On 01/02/18 02:00, Christopher Lameter wrote:
> On Tue, 30 Jan 2018, Igor Stoppa wrote:
> 
>> @@ -1769,6 +1774,9 @@ void *__vmalloc_node_range(unsigned long size, 
>> unsigned long align,
>>
>>  kmemleak_vmalloc(area, size, gfp_mask);
>>
>> +for (page_counter = 0; page_counter < area->nr_pages; page_counter++)
>> +area->pages[page_counter]->area = area;
>> +
>>  return addr;
> 
> Well this introduces significant overhead for large sized allocation. Does
> this not matter because the areas are small?

Relatively significant?
I do not object to your comment, but in practice I see that:

- vmalloc is used relatively little
- allocations do not seem to be huge
- there seem to be way larger overheads in the handling of virtual pages
  (see my proposal for the LFS/m summit, about collapsing struct
   vm_struct and struct vmap_area)


> Would it not be better to use compound page allocations here?
> page_head(whatever) gets you the head page where you can store all sorts
> of information about the chunk of memory.

Can you please point me to this function/macro? I don't seem to be able
to find it, at least not in 4.15

During hardened user copy permission check, I need to confirm if the
memory range that would be exposed to userspace is a legitimate
sub-range of a pmalloc allocation.


So, I start with the pair (address, size) and I must end up to something
I can compare it against.
The idea here is to pass through struct_page and then the related
vm_struct/vmap_area, which already has the information about the
specific chunk of virtual memory.

I cannot comment on your proposal because I do not know where to find
the reference you made, or maybe I do not understand what you mean :-(

--
igor


Re: [kernel-hardening] [PATCH 4/6] Protectable Memory

2018-01-26 Thread Igor Stoppa
On 26/01/18 07:35, Matthew Wilcox wrote:
> On Wed, Jan 24, 2018 at 08:10:53PM +0100, Jann Horn wrote:
>> I'm not entirely convinced by the approach of marking small parts of
>> kernel memory as readonly for hardening.
> 
> It depends how significant the data stored in there are.  For example,
> storing function pointers in read-only memory provides significant
> hardening.
> 
>> You're allocating with vmalloc(), which, as far as I know, establishes
>> a second mapping in the vmalloc area for pages that are already mapped
>> as RW through the physmap. AFAICS, later, when you're trying to make
>> pages readonly, you're only changing the protections on the second
>> mapping in the vmalloc area, therefore leaving the memory writable
>> through the physmap. Is that correct? If so, please either document
>> the reasoning why this is okay or change it.
> 
> Yes, this is still vulnerable to attacks through the physmap.  That's also
> true for marking structs as const.  We should probably fix that at some
> point, but at least they're not vulnerable to heap overruns by small
> amounts ... you have to be able to overrun some other array by terabytes.

Actually, I think there is something to say in favor of using a vmalloc
based approach, precisely because of the physmap :-P

If I understood correctly, the physmap is primarily meant to speed up
access to physical memory through the TLB. In particular, for kmalloc
based allocations.

Which means that, to perform a physmap-based attack to a kmalloced
allocation, one needs to know:

- the address of the target variable in the kmalloc range
- the randomized offset of the kernel
- the location of the physmap

But, for a vmalloc based allocation, there is one extra hoop: since the
mapping is really per page, the attacker actually has to walk the
page table, to figure out where to poke in the physmap.


One more thought about physmap: does it map also code?
Because, if it does, and one wants to use it for an attack, isn't it
easier to look for some security test and replace a bne with be or
equivalent?


> It's worth having a discussion about whether we want the pmalloc API
> or whether we want a slab-based API.

pmalloc is meant to be useful where the attack surface is made up of
lots of small allocations - my first use case was the SE Linux policy
DB, where a variety of elements is being allocated, in large amounts -
to the point where having ready-made caches would be wasteful.


Then there is the issue I already mentioned about arm/arm64, which would
require breaking down large mappings - something that seems to be against
current policy, as described in my previous mail:

http://www.openwall.com/lists/kernel-hardening/2018/01/24/11


I do not know exactly what you have in mind wrt slab, but my impression
is that it will most likely gravitate toward the pmalloc implementation.
It will need:

- "pools" or anyway some means to lock only a certain group of pages,
related to a specific kernel user

- (mostly) lockless allocation

- a way to manage granularity (or order of allocation)

Most of this is already provided by genalloc, which is what I ended up
almost re-implementing, before being pointed to it :-)

I only had to add the tracking of end of allocations, which is what the
patch 1/6 does - as a side note, is anybody maintaining it?
I could not find an entry in MAINTAINERS

As I mentioned above, using vmalloc adds even an extra layer of protection.

The major downside is the increased TLB use, however this is not so
relevant for the volumes of data that I had to deal with so far:
only few 4K pages.

But you might have in mind something else.
I'd be interested to know what and what would be an obstacle in using
pmalloc. Maybe it can be solved.

--
igor


Re: [kernel-hardening] [PATCH 4/6] Protectable Memory

2018-01-26 Thread Igor Stoppa
On 25/01/18 17:38, Jerome Glisse wrote:
> On Thu, Jan 25, 2018 at 10:14:28AM -0500, Boris Lukashev wrote:
>> On Thu, Jan 25, 2018 at 6:59 AM, Igor Stoppa  wrote:
> 
> [...]
> 
>> DMA/physmap access coupled with a knowledge of which virtual mappings
>> are in the physical space should be enough for an attacker to bypass
>> the gating mechanism this work imposes. Not trivial, but not
>> impossible. Since there's no way to prevent that sort of access in
>> current hardware (especially something like a NIC or GPU working
>> independently of the CPU altogether)

[...]

> I am not saying that this can not happen but that we are trying our best
> to avoid it.

How about an opt-in verification, similar to what proposed by Boris
Lukashev?

When reading back the data, one could access the pointer directly and
bypass the verification, or could use a function that explicitly checks
the integrity of the data.

Starting from an unprotected kmalloc allocation, even just turning the
data into R/O is an improvement, but if one can afford the overhead of
performing the verification, why not?

It would still be better if the service were provided by the library,
rather than implemented by individual users, I think.

--
igor


Re: [PATCH 4/6] Protectable Memory

2018-01-26 Thread Igor Stoppa
On 24/01/18 19:56, Igor Stoppa wrote:

[...]

> +bool pmalloc_prealloc(struct gen_pool *pool, size_t size)
> +{

[...]

> +abort:
> + vfree(chunk);

this should be vfree_atomic()

[...]

> +void *pmalloc(struct gen_pool *pool, size_t size, gfp_t gfp)
> +{

[...]

> +free:
> + vfree(chunk);

and this one too

I will fix them in the next iteration.
I am waiting to see if any more comments arrive.
Otherwise, I'll send it out probably next Tuesday.

--
igor


Re: [PATCH 4/7] Protectable Memory

2018-03-12 Thread Igor Stoppa


On 12/03/18 21:13, Matthew Wilcox wrote:
> On Wed, Feb 28, 2018 at 10:06:17PM +0200, Igor Stoppa wrote:
>> struct gen_pool *pmalloc_create_pool(const char *name,
>>   int min_alloc_order);
>> int is_pmalloc_object(const void *ptr, const unsigned long n);
>> bool pmalloc_prealloc(struct gen_pool *pool, size_t size);
>> void *pmalloc(struct gen_pool *pool, size_t size, gfp_t gfp);
>> static inline void *pzalloc(struct gen_pool *pool, size_t size, gfp_t gfp)
>> static inline void *pmalloc_array(struct gen_pool *pool, size_t n,
>>size_t size, gfp_t flags)
>> static inline void *pcalloc(struct gen_pool *pool, size_t n,
>>  size_t size, gfp_t flags)
>> static inline char *pstrdup(struct gen_pool *pool, const char *s, gfp_t gfp)
>> int pmalloc_protect_pool(struct gen_pool *pool);
>> static inline void pfree(struct gen_pool *pool, const void *addr)
>> int pmalloc_destroy_pool(struct gen_pool *pool);
> 
> Do you have users for all these functions?  I'm particularly sceptical of
> pfree().

The typical case is when rolling back allocations, on an error path.
For example, with SELinux, the userspace provides the policy, which gets
processed and converted into a policyDB, where every policy maps to
several structures allocated dynamically.

The allocation is not transactional. In case a policy turns out to be
bad/broken while being interpreted, those structures that were
initially allocated for that policy must be freed.

Since pmalloc is meant to be a drop-in replacement for k/vmalloc, it
also needs to provide pfree.

>  To my mind, a user wants to:
> 
> pmalloc_create();
> pmalloc(); * N
> pmalloc_protect();
> ...
> pmalloc_destroy();

This is the simplest case, but the error path must also be supported,
as in the sketch below.
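
Something along these lines (an editorial sketch with invented names -
struct item and build_table are made up for illustration, this is not
actual SELinux code - assuming <linux/pmalloc.h> is included):

struct item {
	unsigned long key;
	unsigned long value;
};

static int build_table(struct gen_pool *pool, struct item **items, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		items[i] = pzalloc(pool, sizeof(**items), GFP_KERNEL);
		if (!items[i])
			goto rollback;
	}
	pmalloc_protect_pool(pool);	/* protect only once fully populated */
	return 0;

rollback:
	/* the pool has not been protected yet, so pfree() is still legal */
	while (i-- > 0)
		pfree(pool, items[i]);
	return -ENOMEM;
}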

> I don't mind the pstrdup, pcalloc, pmalloc_array, pzalloc variations, but

All those functions turned out to be necessary when converting SELinux
to pmalloc.
Yes, I haven't published this code yet, but I was hoping to first be
done with pmalloc and then move on to SELinux, which I suspect will be
harder to chew :-/

> I don't know why you need is_pmalloc_object().

Because of hardened usercopy [1]:


On 23/05/17 00:38, Kees Cook wrote:

[...]

> I'd like hardened usercopy to grow knowledge of these
> allocations so we can bounds-check objects. Right now, mm/usercopy.c
> just looks at PageSlab(page) to decide if it should do slab checks. I
> think adding a check for this type of object would be very important
> there.



[1] http://www.openwall.com/lists/kernel-hardening/2017/05/23/17


--
igor


Re: [PATCH 1/7] genalloc: track beginning of allocations

2018-03-06 Thread Igor Stoppa
On 06/03/2018 16:10, Matthew Wilcox wrote:
> On Wed, Feb 28, 2018 at 10:06:14PM +0200, Igor Stoppa wrote:
>> + * Encoding of the bitmap tracking the allocations
>> + * ---
>> + *
>> + * The bitmap is composed of units of allocations.
>> + *
>> + * Each unit of allocation is represented using 2 consecutive bits.
>> + *
>> + * This makes it possible to encode, for each unit of allocation,
>> + * information about:
>> + *  - allocation status (busy/free)
>> + *  - beginning of a sequennce of allocation units (first / successive)
>> + *
>> + *
>> + * Dictionary of allocation units (msb to the left, lsb to the right):
>> + *
>> + * 11: first allocation unit in the allocation
>> + * 10: any subsequent allocation unit (if any) in the allocation
>> + * 00: available allocation unit
>> + * 01: invalid
>> + *
>> + * Example, using the same notation as above - MSb...LSb:
>> + *
>> + *  ...100010101011   <-- Read in this direction.
>> + * \__|\__|\|\|\__|
>> + *|   | | |   \___ 4 used allocation units
>> + *|   | | \___ 3 empty allocation units
>> + *|   | \_ 1 used allocation unit
>> + *|   \___ 2 used allocation units
>> + *\___ 2 empty allocation units
>> + *
>> + * The encoding allows for lockless operations, such as:
>> + * - search for a sufficiently large range of allocation units
>> + * - reservation of a selected range of allocation units
>> + * - release of a specific allocation
>> + *
>> + * The alignment at which to perform the research for sequence of empty
>> + * allocation units (marked as zeros in the bitmap) is 2^1.
>> + *
>> + * This means that an allocation can start only at even places
>> + * (bit 0, bit 2, etc.) in the bitmap.
>> + *
>> + * Therefore, the number of zeroes to look for must be twice the number
>> + * of desired allocation units.
>> + *
>> + * When it's time to free the memory associated to an allocation request,
>> + * it's a matter of checking if the corresponding allocation unit is
>> + * really the beginning of an allocation (both bits are set to 1).
>> + *
>> + * Looking for the ending can also be performed locklessly.
>> + * It's sufficient to identify the first mapped allocation unit
>> + * that is represented either as free (00) or busy (11).
>> + * Even if the allocation status should change in the meanwhile, it
>> + * doesn't matter, since it can only transition between free (00) and
>> + * first-allocated (11).
> 
> This seems unnecessarily complicated.

TBH it seemed to me a natural extension of the existing encoding :-)

>  Why not handle it like this:
> 
>  - Double the bitmap in size (as you have done) but
>  - The first half of the bits are unchanged from the existing implementation
>  - The second half of the bits are used for determining the length

Wouldn't that mean a less tight loop and less localized data?
The implementation from this patch does not have to jump elsewhere, when
(un)marking the allocation units and the start.

> On allocation, you look for a sufficiently-large string of 0 bits in
> the first-half.  When you find it, you set all of them to 1, and set one
> bit in the second-half to indicate where the tail of the allocation is
> (you might actually want to use an rbtree or something to handle this ...
> using all these bits seems pretty inefficient).

1 bit maps to 1 unit of allocation, which is very seldom 1 byte.
For pmalloc use, I expect that the average allocation is likely to be
2-4 units, where 1 unit equals either a 32-bit or a 64-bit word.
So it is likely that, for every couple of allocation units, one
is marked as start-of-allocation.

In other cases where genalloc is used, like the tracking of uncached
pages, 1 unit of allocation equals to 1 page.

I would expect the rbtree to end up generating a far larger footprint.

For the same reasons, since the bitmap is implemented using unsigned
longs, chances are high that one allocation will fit in one bitmap
"word", which means that if the "beginning" bit and the "occupied" bit
are adjacent, one write is sufficient.

In the case you describe, it would almost always be at least 2.

I do not have factual evidence to back my reasoning, but it seems more
likely to be the case, from my analysis of data types that could belong
to pools (both existing users of genalloc and my experiments with
SELinux data structures and pmalloc).

Even in the XFS case, if I understood correctly, it was about protecting
1 or 2 pages at a time, which seems to fit what I have empirically observed.

What makes you think otherwise?

--
igor


Re: [PATCH 1/7] genalloc: track beginning of allocations

2018-03-06 Thread Igor Stoppa


On 05/03/2018 21:00, J Freyensee wrote:
> .
> .
> 
> 
> On 2/28/18 12:06 PM, Igor Stoppa wrote:
>> +
>> +/**
>> + * gen_pool_dma_alloc() - allocate special memory from the pool for DMA 
>> usage
>> + * @pool: pool to allocate from
>> + * @size: number of bytes to allocate from the pool
>> + * @dma: dma-view physical address return value.  Use NULL if unneeded.
>> + *
>> + * Allocate the requested number of bytes from the specified pool.
>> + * Uses the pool allocation function (with first-fit algorithm by default).
>> + * Can not be used in NMI handler on architectures without
>> + * NMI-safe cmpxchg implementation.
>> + *
>> + * Return:
>> + * * address of the memory allocated- success
>> + * * NULL   - error
>> + */
>> +void *gen_pool_dma_alloc(struct gen_pool *pool, size_t size, dma_addr_t 
>> *dma);
>> +
> 
> OK, so gen_pool_dma_alloc() is defined here, which believe is the API 
> line being drawn for this series.
> 
> so,
> .
> .
> .
>>
>>   
>>   /**
>> - * gen_pool_dma_alloc - allocate special memory from the pool for DMA usage
>> + * gen_pool_dma_alloc() - allocate special memory from the pool for DMA 
>> usage
>>* @pool: pool to allocate from
>>* @size: number of bytes to allocate from the pool
>>* @dma: dma-view physical address return value.  Use NULL if unneeded.
>> @@ -342,14 +566,15 @@ EXPORT_SYMBOL(gen_pool_alloc_algo);
>>* Uses the pool allocation function (with first-fit algorithm by default).
>>* Can not be used in NMI handler on architectures without
>>* NMI-safe cmpxchg implementation.
>> + *
>> + * Return:
>> + * * address of the memory allocated- success
>> + * * NULL   - error
>>*/
>>   void *gen_pool_dma_alloc(struct gen_pool *pool, size_t size, dma_addr_t 
>> *dma)
>>   {
>>  unsigned long vaddr;
>>   
>> -if (!pool)
>> -return NULL;
>> -
> why is this being removed?  I don't believe this code was getting 
> removed from your v17 series patches.

Because, as Matthew Wilcox pointed out [1] (well, that's how I
understood it), de-referencing a NULL pointer will cause the kernel to
complain loudly.

Where is the NULL pointer coming from?

a) from a bug in the user of the API - in that case it will be noticed,
reported and fixed, which is how other in-kernel APIs work too

b) from an attacker - it will still trigger an error from the kernel,
but the attacker cannot really achieve much else, besides crashing
repeatedly and causing a DoS. However, there are so many other places
that could come under a similar attack that a check only here doesn't
seem to make a difference.

If the value were coming from userspace, that would be a completely
different case and some sort of sanitization would be mandatory.

> Otherwise, looks good,
> 
> Reviewed-by: Jay Freyensee 

thanks


[1] http://www.openwall.com/lists/kernel-hardening/2018/02/26/16


--
igor



Re: [PATCH 1/7] genalloc: track beginning of allocations

2018-03-07 Thread Igor Stoppa


On 06/03/18 18:05, Igor Stoppa wrote:
> On 06/03/2018 16:10, Matthew Wilcox wrote:

[...]

>> This seems unnecessarily complicated.
> 
> TBH it seemed to me a natural extension of the existing encoding :-)

BTW, to provide some background, this is where it begun:

http://www.openwall.com/lists/kernel-hardening/2017/08/18/4

Probably that comment about "keeping existing behavior and managing two
bitmaps locklessly" is what made me think of growing the 1-bit-per-unit
into a 1-word-per-unit.

--
igor


Re: [PATCH 6/7] lkdtm: crash on overwriting protected pmalloc var

2018-03-07 Thread Igor Stoppa


On 06/03/18 19:20, J Freyensee wrote:

> On 2/28/18 12:06 PM, Igor Stoppa wrote:

[...]

>>   void __init lkdtm_perms_init(void);
>>   void lkdtm_WRITE_RO(void);
>>   void lkdtm_WRITE_RO_AFTER_INIT(void);
>> +void lkdtm_WRITE_RO_PMALLOC(void);
> 
> Does this need some sort of #ifdef too?

Not strictly. It's just a function declaration.
As long as it is not used, the linker will not complain.
The #ifdef placed around the use and the definition is sufficient from a
correctness perspective.

Whether there is any standard in Linux about hiding the declaration as
well is a different question.

I am not very fond of #ifdefs, so I try to avoid them when I can.
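
To make it concrete, the layout I have in mind is simply this (sketch):

/* lkdtm.h: bare declaration, always visible; harmless while unused */
void lkdtm_WRITE_RO_PMALLOC(void);

/* lkdtm_perms.c: definition (and any use) guarded by the config option,
 * so the symbol is never referenced when the feature is disabled and
 * the linker has nothing to complain about.
 */
#ifdef CONFIG_PROTECTABLE_MEMORY
void lkdtm_WRITE_RO_PMALLOC(void)
{
	/* ... test body ... */
}
#endif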

>> +pr_info("attempting bad pmalloc write at %p\n", i);
>> +*i = 0;
> 
> OK, now I'm on the right version of this patch series, same comment 
> applies.  I don't get the local *i assignment at the end of the 
> function, but seems harmless.


Because that's the whole point of the function: prove that pmalloc
protection works (see the message in the pr_info one line above).

The function is supposed to do:

* create a pool
* allocate memory from it
* protect it
* try to alter it (and crash)

*i = 0; performs the last step

--
igor


Re: [PATCH 4/7] Protectable Memory

2018-03-07 Thread Igor Stoppa
On 06/03/18 05:59, J Freyensee wrote:

[...]

>> +config PROTECTABLE_MEMORY
>> +bool
>> +depends on MMU
> 
> 
> Curious, would you also want to depend on "SECURITY" as well, as this is 
> being advertised as a compliment to __read_only_after_init, per the file 
> header comments, as I'm assuming ro_after_init would be disabled if the 
> SECURITY Kconfig selection is *NOT* selected?

__ro_after_init is configured like this:

#if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_STRICT_MODULE_RWX)
bool rodata_enabled __ro_after_init = true;

But even if __ro_after_init and pmalloc are conceptually similar, in
practice they have - potentially - different constraints.

1) the __ro_after_init segment belongs to linear kernel memory
2) the pmalloc pools belong to vmalloc memory

There is one extra layer of indirection in pmalloc.
I am not an expert on MMUs, but I suppose there might be types where it
is possible to mark pages as RO but not possible to have virtual
memory.

If (and this is a big "if") such MMUs exist and are supported by Linux,
then __ro_after_init would be possible, while pmalloc would not be.

So it seemed more correct to focus specifically on the enablers required
by pmalloc to perform correctly.

Open Question:

Is it ok that the API disappears in case the enablers are missing?
Or should it fall back to something else?

Dealing with the lack of read-only support would be pretty simple: it
would be enough to make the write protection conditional.

But what to do if virtual mapping is not supported?

kmalloc might not be able to satisfy large requests made to pmalloc, and
this could cause runtime failures.
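
For the first point (missing read-only capability), the conditional
protection could look roughly like the sketch below; protect_all_areas()
is a made-up helper standing in for the loop that calls set_memory_ro()
on each area, it is not part of the patch set:

void pmalloc_protect_pool(struct pmalloc_pool *pool)
{
#ifdef CONFIG_ARCH_HAS_SET_MEMORY
	protect_all_areas(pool);	/* set_memory_ro() on every area */
#else
	/* no way to change page permissions: degrade to a plain allocator */
	pr_warn_once("pmalloc: write protection not available\n");
#endif
}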

--
igor


Re: [PATCH 1/7] genalloc: track beginning of allocations

2018-03-07 Thread Igor Stoppa


On 06/03/18 15:19, Mike Rapoport wrote:
> On Wed, Feb 28, 2018 at 10:06:14PM +0200, Igor Stoppa wrote:

[...]

> If I'm not mistaken, several kernel-doc descriptions are duplicated now.
> Can you please keep a single copy? ;-)

What's the preferred approach?
Document the functions that are API in the .h file and leave in the .c
those which are not API?

[...]

>> + * The alignment at which to perform the research for sequence of empty
> 
>^ search?

yes

>> + * get_boundary() - verifies address, then measure length.
> 
> There's some lack of consistency between the name and implementation and
> the description.
> It seems that it would be simpler to actually make it get_length() and
> return the length of the allocation or nentries if the latter is smaller.
> Then in gen_pool_free() there will be no need to recalculate nentries
> again.

There is an error in the documentation. I'll explain below.

> 
>>   * @map: pointer to a bitmap
>> - * @start: a bit position in @map
>> - * @nr: number of bits to set
>> + * @start_entry: the index of the first entry in the bitmap
>> + * @nentries: number of entries to alter
> 
> Maybe: "maximal number of entries to check"?

No, it's actually the total number of entries in the chunk.

[...]

>> +return nentries - start_entry;
> 
> Shouldn't it be "nentries + start_entry"?

And in light of the corrected description, what I am doing should now
be clearer:

* start_entry is the index of the initial entry
* nentries is the number of entries in the chunk

If I iterate over the rest of the chunk:

(i = start_entry + 1; i < nentries; i++)

without finding either another HEAD or an empty slot, then it means I
was measuring the length of the last allocation in the chunk, which was
taking up all the space to the end.

Simple example:

- chunk with 7 entries -> nentries is 7
- start_entry is 2, meaning that the last allocation starts from the 3rd
element, i.e. it occupies indexes from 2 to 6, for a total of 5 entries
- so the length is (nentries - start_entry) = (7 - 2) = 5


But yeah, the kerneldoc was wrong.
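
Putting the measurement described above into code form (illustrative,
not the exact patch; entry_is_head() and entry_is_free() are made-up
helpers that test the two bits of an entry):

static int measure_length(const unsigned long *map, int start_entry,
			  int nentries)
{
	int i;

	for (i = start_entry + 1; i < nentries; i++)
		if (entry_is_head(map, i) || entry_is_free(map, i))
			return i - start_entry;

	/* last allocation in the chunk: it runs up to the end */
	return nentries - start_entry;
}

With the example above: start_entry = 2, nentries = 7, no other head or
free entry is found, so the function returns 7 - 2 = 5.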

[...]

>> - * gen_pool_alloc_algo - allocate special memory from the pool
>> + * gen_pool_alloc_algo() - allocate special memory from the pool
> 
> + using specified algorithm

ok

> 
>>   * @pool: pool to allocate from
>>   * @size: number of bytes to allocate from the pool
>>   * @algo: algorithm passed from caller
>> @@ -285,14 +502,18 @@ EXPORT_SYMBOL(gen_pool_alloc);
>>   * Uses the pool allocation function (with first-fit algorithm by default).
> 
> "uses the provided @algo function to find room for the allocation"

ok

--
igor


Re: [PATCH 1/7] genalloc: track beginning of allocations

2018-03-07 Thread Igor Stoppa


On 07/03/18 16:48, Igor Stoppa wrote:
> 
> 
> On 06/03/18 15:19, Mike Rapoport wrote:
>> On Wed, Feb 28, 2018 at 10:06:14PM +0200, Igor Stoppa wrote:

[...]

>>> + * get_boundary() - verifies address, then measure length.
>>
>> There's some lack of consistency between the name and implementation and
>> the description.
>> It seems that it would be simpler to actually make it get_length() and
>> return the length of the allocation or nentries if the latter is smaller.
>> Then in gen_pool_free() there will be no need to recalculate nentries
>> again.
> 
> There is an error in the documentation. I'll explain below.

Argh, I do not know why I came up with that.

Yes, your comment is correct. I've modified the function accordingly and
it is simpler.

I will post it in the next revision.

--
igor


Re: [PATCH 6/8] Pmalloc selftest

2018-03-24 Thread Igor Stoppa


On 14/03/18 14:25, Matthew Wilcox wrote:
> On Tue, Mar 13, 2018 at 11:45:52PM +0200, Igor Stoppa wrote:
>> Add basic self-test functionality for pmalloc.
> 
> Here're some additional tests for your test-suite:
> 
>   for (i = 1; i; i *= 2)
>   pzalloc(pool, i - 1, GFP_KERNEL);
> 

Ok, I have almost finished the rewrite.
I still have to address this comment.

When I run the test, the system eventually runs out of memory: it keeps
getting allocation errors from vmalloc, until i finally overflows and
becomes 0.

Am I supposed to do something about it?
If pmalloc receives a request that the vmalloc backend cannot satisfy, I
would prefer that vmalloc itself produces the warning and pmalloc
returns NULL.

This doesn't look like a test case that one can leave permanently
enabled in a build, but maybe I'm missing the point.
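
If it helps, the loop could also be capped, so that it can stay enabled
in a normal build; the bound below is a made-up value and the reworked
pzalloc(pool, size) signature (without gfp) is assumed:

#define PMALLOC_TEST_MAX_SIZE	(PAGE_SIZE * 64)	/* arbitrary cap */

static void test_growing_allocations(struct pmalloc_pool *pool)
{
	size_t i;

	/* powers of two minus one, stopping before vmalloc is exhausted */
	for (i = 1; i && i - 1 <= PMALLOC_TEST_MAX_SIZE; i *= 2)
		pzalloc(pool, i - 1);
}

But that would no longer exercise the overflow of i, which may well be
part of the point of the original suggestion.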

--
igor



[RFC PATCH v20 0/6] mm: security: ro protection for dynamic data

2018-03-26 Thread Igor Stoppa
This patch-set introduces the possibility of protecting memory that has
been allocated dynamically.

The memory is managed in pools: when a memory pool is protected, all the
memory that is currently part of it will become R/O.

A R/O pool can be expanded (adding more protectable memory).
It can also be destroyed, to recover its memory, but it cannot be
turned back into R/W mode.

This is intentional. This feature is meant for data that doesn't need
further modifications after initialization.

However the data might need to be released, for example as part of module
unloading. The pool, therefore, can be destroyed.

An example is provided, in the form of self-testing.

Changes since v19:

[http://www.openwall.com/lists/kernel-hardening/2018/03/13/68]

* dropped genalloc as allocator
* first attempt at rewriting pmalloc, as discussed with Matthew Wilcox:
  [http://www.openwall.com/lists/kernel-hardening/2018/03/14/20]
* removed free function from the API
* removed distinction between protected and unprotected pools: a pool can
  contain both protected and unprotected areas.
* removed gfp parameter, as it didn't seem too useful (or not?)
* added option to specify alignment of allocations
* added parameter for specifying size of a refill
* removed option to pre-allocate memory for a pool (is it a bad idea?)
* changed vmap_area to allow chaining them, for tracking them in a pool
* made public the previously private find_vmap_area function

Igor Stoppa (6):
  struct page: add field for vm_struct
  vmalloc: rename llist field in vmap_area
  Protectable Memory
  Pmalloc selftest
  lkdtm: crash on overwriting protected pmalloc var
  Documentation for Pmalloc

 Documentation/core-api/index.rst   |   1 +
 Documentation/core-api/pmalloc.rst | 101 
 drivers/misc/lkdtm.h   |   1 +
 drivers/misc/lkdtm_core.c  |   3 +
 drivers/misc/lkdtm_perms.c |  28 
 include/linux/mm_types.h   |   1 +
 include/linux/pmalloc.h| 281 
 include/linux/test_pmalloc.h   |  24 +++
 include/linux/vmalloc.h|   5 +-
 init/main.c|   2 +
 mm/Kconfig |  16 ++
 mm/Makefile|   2 +
 mm/pmalloc.c   | 321 +
 mm/test_pmalloc.c  | 136 
 mm/usercopy.c  |  33 
 mm/vmalloc.c   |  10 +-
 16 files changed, 960 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/core-api/pmalloc.rst
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 include/linux/test_pmalloc.h
 create mode 100644 mm/pmalloc.c
 create mode 100644 mm/test_pmalloc.c

-- 
2.14.1



[PATCH 1/6] struct page: add field for vm_struct

2018-03-26 Thread Igor Stoppa
When a page is used for virtual memory, it is often necessary to obtain
a handle to the corresponding vm_struct, which refers to the virtually
contiguous area generated when invoking vmalloc.

The struct page has a "mapping" field, which can be re-used to store a
pointer to the parent area.

This will avoid more expensive searches later on.

Signed-off-by: Igor Stoppa 
Reviewed-by: Jay Freyensee 
Reviewed-by: Matthew Wilcox 
---
 include/linux/mm_types.h | 1 +
 mm/vmalloc.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd1af6b9591d..c3a4825e10c0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -84,6 +84,7 @@ struct page {
void *s_mem;/* slab first object */
atomic_t compound_mapcount; /* first tail page */
/* page_deferred_list().next -- second tail page */
+   struct vm_struct *area;
};
 
/* Second double word */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ebff729cc956..61a1ca22b0f6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1536,6 +1536,7 @@ static void __vunmap(const void *addr, int 
deallocate_pages)
struct page *page = area->pages[i];
 
BUG_ON(!page);
+   page->area = NULL;
__free_pages(page, 0);
}
 
@@ -1705,6 +1706,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, 
gfp_t gfp_mask,
area->nr_pages = i;
goto fail;
}
+   page->area = area;
area->pages[i] = page;
if (gfpflags_allow_blocking(gfp_mask|highmem_mask))
cond_resched();
-- 
2.14.1
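
To illustrate the kind of lookup the back-pointer added by the patch
above avoids, compare the two variants below (sketch only; owner_slow()
and owner_fast() are made-up names, while vmalloc_to_page() and
find_vmap_area() - the latter made public by this series - are existing
helpers):

/* before: search the address-sorted rbtree of vmap areas */
static struct vm_struct *owner_slow(const void *addr)
{
	struct vmap_area *va = find_vmap_area((unsigned long)addr);

	return va ? va->vm : NULL;
}

/* after: the page itself carries a pointer to its parent area */
static struct vm_struct *owner_fast(const void *addr)
{
	return vmalloc_to_page(addr)->area;
}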



[PATCH 2/6] vmalloc: rename llist field in vmap_area

2018-03-26 Thread Igor Stoppa
The vmap_area structure has a field of type struct llist_node, named
purge_list, which is used when performing lazy purging of the area.

This field is left unused during the actual utilization of the
structure.

This patch renames the field to a more generic "area_list", to allow for
utilization outside of the purging phase.

Since the purging happens after the vmap_area is dismissed, its use is
mutually exclusive with any use performed while the area is allocated.

Signed-off-by: Igor Stoppa 
---
 include/linux/vmalloc.h | 2 +-
 mm/vmalloc.c| 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 1e5d8c392f15..2d07dfef3cfd 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -47,7 +47,7 @@ struct vmap_area {
unsigned long flags;
struct rb_node rb_node; /* address sorted rbtree */
struct list_head list;  /* address sorted list */
-   struct llist_node purge_list;/* "lazy purge" list */
+   struct llist_node area_list;/* generic list of areas */
struct vm_struct *vm;
struct rcu_head rcu_head;
 };
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 61a1ca22b0f6..1bb2233bb262 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -682,7 +682,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, 
unsigned long end)
lockdep_assert_held(&vmap_purge_lock);
 
valist = llist_del_all(&vmap_purge_list);
-   llist_for_each_entry(va, valist, purge_list) {
+   llist_for_each_entry(va, valist, area_list) {
if (va->va_start < start)
start = va->va_start;
if (va->va_end > end)
@@ -696,7 +696,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, 
unsigned long end)
flush_tlb_kernel_range(start, end);
 
spin_lock(&vmap_area_lock);
-   llist_for_each_entry_safe(va, n_va, valist, purge_list) {
+   llist_for_each_entry_safe(va, n_va, valist, area_list) {
int nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
 
__free_vmap_area(va);
@@ -743,7 +743,7 @@ static void free_vmap_area_noflush(struct vmap_area *va)
&vmap_lazy_nr);
 
/* After this point, we may free va at any time */
-   llist_add(&va->purge_list, &vmap_purge_list);
+   llist_add(&va->area_list, &vmap_purge_list);
 
if (unlikely(nr_lazy > lazy_max_pages()))
try_purge_vmap_area_lazy();
-- 
2.14.1



[PATCH 3/6] Protectable Memory

2018-03-26 Thread Igor Stoppa
The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.

However, the MMU-based protection works efficiently only when said pages
contain exclusively data that will not need further modifications.

Statically allocated variables can be segregated into a dedicated
section, but this does not sit very well with dynamically allocated
ones.

Dynamic allocation does not provide, currently, any means for grouping
variables in memory pages that would contain exclusively data suitable
for conversion to read only access mode.

The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.

A module can request a pool and then refer any allocation request to the
pool handle it has received.

A pool is organized in areas of virtually contiguous memory.
Whenever the protection functionality is invoked on a pool, all the
areas it contains are marked as read-only.

The process of growing and protecting the pool can be iterated at will.

The pool can only be destroyed (it is up to its user to avoid any further
references to the memory from the pool, after the destruction is invoked).

The latter case is mainly meant for releasing memory, when a module is
unloaded.

A module can have as many pools as needed, for example to support the
protection of data that is initialized in sufficiently distinct phases.

Since pmalloc memory is obtained from vmalloc, an attacker that has
gained access to the physical mapping still has to identify where the
target of the attack is actually located.

At the same time, being also based on genalloc, pmalloc does not
generate as much TLB thrashing as would be caused by using vmalloc
directly.

Signed-off-by: Igor Stoppa 
---
 include/linux/pmalloc.h | 281 ++
 include/linux/vmalloc.h |   3 +
 mm/Kconfig  |   6 +
 mm/Makefile |   1 +
 mm/pmalloc.c| 321 
 mm/usercopy.c   |  33 +
 mm/vmalloc.c|   2 +-
 7 files changed, 646 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index ..1d71fb73bb5b
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,281 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017-18 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+#ifndef _LINUX_PMALLOC_H
+#define _LINUX_PMALLOC_H
+
+
+#include 
+
+/*
+ * Library for dynamic allocation of pools of protectable memory.
+ * A pool is a single linked list of vmap_area structures.
+ * Whenever a pool is protected, all the areas it contains at that point
+ * are write protected.
+ * More areas can be added and protected, in the same way.
+ * Memory in a pool cannot be individually unprotected, but the pool can
+ * be destroyed.
+ * Upon destruction of a certain pool, all the related memory is released,
+ * including its metadata.
+ *
+ * Pmalloc memory is intended to complement __ro_after_init.
+ * It can be used, for example, where there is a write-once variable, for
+ * which it is not possible to know the initialization value before init
+ * is completed (which is what __ro_after_init requires).
+ *
+ * It can be useful also where the amount of data to protect is not known
+ * at compile time and the memory can only be allocated dynamically.
+ *
+ * Finally, it can be useful also when it is desirable to control
+ * dynamically (for example through the command line) if something ought
+ * to be protected or not, without having to rebuild the kernel (like in
+ * the build used for a linux distro).
+ */
+
+
+#define PMALLOC_REFILL_DEFAULT (0)
+#define PMALLOC_ALIGN_DEFAULT (-1)
+
+struct pmalloc_pool *pmalloc_create_custom_pool(unsigned long int refill,
+   short int align_order);
+
+/**
+ * pmalloc_create_pool() - create a protectable memory pool
+ *
+ * Shorthand for pmalloc_create_custom_pool() with default arguments:
+ * * refill is set to PMALLOC_REFILL_DEFAULT, which is one memory page
+ * * align_order is set to PMALLOC_ALIGN_DEFAULT, which is sizeof(size_t)
+ *
+ * Return:
+ * * pointer to the new pool   - success
+ * * NULL  - error
+ */
+static inline struct pmalloc_pool *pmalloc_create_pool(void)
+{
+   return pmalloc_create_custom_pool(PMALLOC_REFILL_DEFAULT,
+ PMALLOC_ALIGN_DEFAULT);
+}
+
+
+//bool pmalloc_expand_pool(struct gen_pool *pool, size_t size);
+
+
+void *pmalloc_align(struct pmalloc_pool *pool, size_t size,
+   short int align_order);
+
+
+/**
+ * pmalloc() - allocates protectable memory from a pool
+ * @pool: handle to the pool to be used for memory

[PATCH 5/6] lkdtm: crash on overwriting protected pmalloc var

2018-03-26 Thread Igor Stoppa
Verify that pmalloc read-only protection is in place: trying to
overwrite a protected variable will crash the kernel.

Signed-off-by: Igor Stoppa 
---
 drivers/misc/lkdtm.h   |  1 +
 drivers/misc/lkdtm_core.c  |  3 +++
 drivers/misc/lkdtm_perms.c | 28 
 3 files changed, 32 insertions(+)

diff --git a/drivers/misc/lkdtm.h b/drivers/misc/lkdtm.h
index 9e513dcfd809..dcda3ae76ceb 100644
--- a/drivers/misc/lkdtm.h
+++ b/drivers/misc/lkdtm.h
@@ -38,6 +38,7 @@ void lkdtm_READ_BUDDY_AFTER_FREE(void);
 void __init lkdtm_perms_init(void);
 void lkdtm_WRITE_RO(void);
 void lkdtm_WRITE_RO_AFTER_INIT(void);
+void lkdtm_WRITE_RO_PMALLOC(void);
 void lkdtm_WRITE_KERN(void);
 void lkdtm_EXEC_DATA(void);
 void lkdtm_EXEC_STACK(void);
diff --git a/drivers/misc/lkdtm_core.c b/drivers/misc/lkdtm_core.c
index 2154d1bfd18b..c9fd42bda6ee 100644
--- a/drivers/misc/lkdtm_core.c
+++ b/drivers/misc/lkdtm_core.c
@@ -155,6 +155,9 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(ACCESS_USERSPACE),
CRASHTYPE(WRITE_RO),
CRASHTYPE(WRITE_RO_AFTER_INIT),
+#ifdef CONFIG_PROTECTABLE_MEMORY
+   CRASHTYPE(WRITE_RO_PMALLOC),
+#endif
CRASHTYPE(WRITE_KERN),
CRASHTYPE(REFCOUNT_INC_OVERFLOW),
CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
diff --git a/drivers/misc/lkdtm_perms.c b/drivers/misc/lkdtm_perms.c
index 53b85c9d16b8..0ac9023fd2b0 100644
--- a/drivers/misc/lkdtm_perms.c
+++ b/drivers/misc/lkdtm_perms.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Whether or not to fill the target memory area with do_nothing(). */
@@ -104,6 +105,33 @@ void lkdtm_WRITE_RO_AFTER_INIT(void)
*ptr ^= 0xabcd1234;
 }
 
+#ifdef CONFIG_PROTECTABLE_MEMORY
+void lkdtm_WRITE_RO_PMALLOC(void)
+{
+   struct gen_pool *pool;
+   int *i;
+
+   pool = pmalloc_create_pool("pool", 0);
+   if (unlikely(!pool)) {
+   pr_info("Failed preparing pool for pmalloc test.");
+   return;
+   }
+
+   i = (int *)pmalloc(pool, sizeof(int), GFP_KERNEL);
+   if (unlikely(!i)) {
+   pr_info("Failed allocating memory for pmalloc test.");
+   pmalloc_destroy_pool(pool);
+   return;
+   }
+
+   *i = INT_MAX;
+   pmalloc_protect_pool(pool);
+
+   pr_info("attempting bad pmalloc write at %p\n", i);
+   *i = 0;
+}
+#endif
+
 void lkdtm_WRITE_KERN(void)
 {
size_t size;
-- 
2.14.1



[PATCH 6/6] Documentation for Pmalloc

2018-03-26 Thread Igor Stoppa
Detailed documentation about the protectable memory allocator.

Signed-off-by: Igor Stoppa 
---
 Documentation/core-api/index.rst   |   1 +
 Documentation/core-api/pmalloc.rst | 101 +
 2 files changed, 102 insertions(+)
 create mode 100644 Documentation/core-api/pmalloc.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index c670a8031786..8f5de42d6571 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -25,6 +25,7 @@ Core utilities
genalloc
errseq
printk-formats
+   pmalloc
 
 Interfaces for kernel debugging
 ===
diff --git a/Documentation/core-api/pmalloc.rst 
b/Documentation/core-api/pmalloc.rst
new file mode 100644
index ..3d2c19e5deaf
--- /dev/null
+++ b/Documentation/core-api/pmalloc.rst
@@ -0,0 +1,101 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _pmalloc:
+
+Protectable memory allocator
+
+
+Purpose
+---
+
+The pmalloc library is meant to provide read-only status to data that,
+for some reason, could neither be declared as constant, nor could it take
+advantage of the qualifier __ro_after_init, but is write-once and
+read-only in spirit.
+It protects data from both accidental and malicious overwrites.
+
+Example: A policy that is loaded from userspace.
+
+
+Concept
+---
+
+The MMU available in the system can be used to write protect memory pages.
+Unfortunately this feature cannot be used as-it-is, to protect sensitive
+data, because it is typically interleaved with data that must stay
+writeable.
+
+pmalloc introduces the concept of protectable memory pools.
+Each pool contains a list of areas of virtually contiguous pages of
+memory. An area is the minimum amount of memory that pmalloc allows to
+protect, because the data it contains can be larger than a single page.
+
+When an allocation is performed, if there is not enough memory already
+available in the pool, a new area of suitable size is allocated.
+The size chosen is the largest between the roundup (to PAGE_SIZE) of
+the request from pmalloc and friends and the refill parameter specified
+when creating the pool.
+
+When a pool is created, it is possible to specify two parameters:
+- refill size: the minimum size of the memory area to allocate when needed
+- align_order: the default alignment to use when reserving memory
+
+Caveats
+---
+
+- To facilitate the conversion of existing code to pmalloc pools, several
+  helper functions are provided, mirroring their k/vmalloc counterparts.
+  In particular, pfree(), which is mostly meant for error paths, when one
+  or more previous allocations must be rolled back.
+
+- Whatever memory was still available in the previous area (where
+  applicable) is relinquished.
+
+- Freeing of memory is not supported. Pages will be returned to the
+  system upon destruction of the memory pool.
+
+- Considering that not much data is supposed to be dynamically allocated
+  and then marked as read-only, it shouldn't be an issue that the address
+  range for pmalloc is limited, on 32-bit systems.
+
+- Regarding SMP systems, the allocations are expected to happen mostly
+  during an initial transient, after which there should be no more need to
+  perform cross-processor synchronizations of page tables.
+
+
+Use
+---
+
+The typical sequence, when using pmalloc, is:
+
+#. create a pool
+
+   :c:func:`pmalloc_create_pool`
+
+#. [optional] pre-allocate some memory in the pool
+
+   :c:func:`pmalloc_prealloc`
+
+#. issue one or more allocation requests to the pool with locking as needed
+
+   :c:func:`pmalloc`
+
+   :c:func:`pzalloc`
+
+#. initialize the memory obtained with desired values
+
+#. write-protect the memory so far allocated
+
+   :c:func:`pmalloc_protect_pool`
+
+#. iterate over the last 3 points as needed
+
+#. [optional] destroy the pool
+
+   :c:func:`pmalloc_destroy_pool`
+
+API
+---
+
+.. kernel-doc:: include/linux/pmalloc.h
+.. kernel-doc:: mm/pmalloc.c
-- 
2.14.1



[PATCH 4/6] Pmalloc selftest

2018-03-26 Thread Igor Stoppa
Add basic self-test functionality for pmalloc.

The testing is introduced as early as possible, right after the main
dependency, genalloc, has passed successfully, so that it can help
diagnose failures in pmalloc users.

Signed-off-by: Igor Stoppa 
---
 include/linux/test_pmalloc.h |  24 
 init/main.c  |   2 +
 mm/Kconfig   |  10 
 mm/Makefile  |   1 +
 mm/test_pmalloc.c| 136 +++
 5 files changed, 173 insertions(+)
 create mode 100644 include/linux/test_pmalloc.h
 create mode 100644 mm/test_pmalloc.c

diff --git a/include/linux/test_pmalloc.h b/include/linux/test_pmalloc.h
new file mode 100644
index ..c7e2e451c17c
--- /dev/null
+++ b/include/linux/test_pmalloc.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * test_pmalloc.h
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+
+#ifndef __LINUX_TEST_PMALLOC_H
+#define __LINUX_TEST_PMALLOC_H
+
+
+#ifdef CONFIG_TEST_PROTECTABLE_MEMORY
+
+void test_pmalloc(void);
+
+#else
+
+static inline void test_pmalloc(void){};
+
+#endif
+
+#endif
diff --git a/init/main.c b/init/main.c
index 21efbf6ace93..c63c41a33c9b 100644
--- a/init/main.c
+++ b/init/main.c
@@ -90,6 +90,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -661,6 +662,7 @@ asmlinkage __visible void __init start_kernel(void)
 */
mem_encrypt_init();
 
+   test_pmalloc();
 #ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
diff --git a/mm/Kconfig b/mm/Kconfig
index 1ac1dfc60c22..246f66c7e694 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -766,3 +766,13 @@ config PROTECTABLE_MEMORY
 depends on MMU
 depends on ARCH_HAS_SET_MEMORY
 default y
+
+config TEST_PROTECTABLE_MEMORY
+   bool "Run self test for pmalloc memory allocator"
+depends on MMU
+   depends on ARCH_HAS_SET_MEMORY
+   select PROTECTABLE_MEMORY
+   default n
+   help
+ Tries to verify that pmalloc works correctly and that the memory
+ is effectively protected.
diff --git a/mm/Makefile b/mm/Makefile
index 959fdbdac118..1de4be5fd0bc 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_PROTECTABLE_MEMORY) += pmalloc.o
+obj-$(CONFIG_TEST_PROTECTABLE_MEMORY) += test_pmalloc.o
 obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += page_poison.o
 obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/test_pmalloc.c b/mm/test_pmalloc.c
new file mode 100644
index ..08274b0324f9
--- /dev/null
+++ b/mm/test_pmalloc.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * test_pmalloc.c
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#define SIZE_1 (PAGE_SIZE * 3)
+#define SIZE_2 1000
+
+
+/* wrapper for is_pmalloc_object() with messages */
+static inline bool validate_alloc(bool expected, void *addr,
+ unsigned long size)
+{
+   bool test;
+
+   test = is_pmalloc_object(addr, size) > 0;
+   pr_notice("must be %s: %s",
+ expected ? "ok" : "no", test ? "ok" : "no");
+   return test == expected;
+}
+
+
+#define is_alloc_ok(variable, size)\
+   validate_alloc(true, variable, size)
+
+
+#define is_alloc_no(variable, size)\
+   validate_alloc(false, variable, size)
+
+/* tests the basic life-cycle of a pool */
+static bool create_and_destroy_pool(void)
+{
+   static struct pmalloc_pool *pool;
+
+   pr_notice("Testing pool creation and destruction capability");
+
+   pool = pmalloc_create_pool();
+   if (WARN(!pool, "Cannot allocate memory for pmalloc selftest."))
+   return false;
+   pmalloc_destroy_pool(pool);
+   return true;
+}
+
+
+/*  verifies that it's possible to allocate from the pool */
+static bool test_alloc(void)
+{
+   static struct pmalloc_pool *pool;
+   static void *p;
+
+   pr_notice("Testing allocation capability");
+   pool = pmalloc_create_pool();
+   if (WARN(!pool, "Unable to allocate memory for pmalloc selftest."))
+   return false;
+   p = pmalloc(pool,  SIZE_1 - 1);
+   pmalloc_protect_pool(pool);
+   pmalloc_destroy_pool(pool);
+   if (WARN(!p, "Failed to allocate memory from the pool"))
+   return false;
+   return true;
+}
+
+
+/* tests the identification of pmalloc ranges */
+static bool test_is_pmalloc_object(void)
+{
+   struct pmalloc_pool *pool;
+   void *pmalloc_p;
+   void *vma

Re: [PATCH 3/6] Protectable Memory

2018-03-27 Thread Igor Stoppa


On 27/03/18 05:31, Matthew Wilcox wrote:
> On Tue, Mar 27, 2018 at 04:55:21AM +0300, Igor Stoppa wrote:
>> +static inline void *pmalloc_array_align(struct pmalloc_pool *pool,
>> +size_t n, size_t size,
>> +short int align_order)
>> +{
> 
> You're missing:
> 
> if (size != 0 && n > SIZE_MAX / size)
> return NULL;


ACK

>> +return pmalloc_align(pool, n * size, align_order);
>> +}
> 
>> +static inline void *pcalloc_align(struct pmalloc_pool *pool, size_t n,
>> +  size_t size, short int align_order)
>> +{
>> +return pzalloc_align(pool, n * size, align_order);
>> +}
> 
> Ditto.

ok

>> +static inline void *pcalloc(struct pmalloc_pool *pool, size_t n,
>> +size_t size)
>> +{
>> +return pzalloc_align(pool, n * size, PMALLOC_ALIGN_DEFAULT);
>> +}
> 
> If you make this one:
> 
>   return pcalloc_align(pool, n, size, PMALLOC_ALIGN_DEFAULT)

ok

> then you don't need the check in this function.
> 
> Also, do we really need 'align' as a parameter to the allocator functions
> rather than to the pool?

I actually wrote it first without, but then I wondered how to deal with
the case where one needs to allocate both small structures and something
larger that is page-aligned.

However, it's just speculation; I do not have any real example.

> I'd just reuse ARCH_KMALLOC_MINALIGN from slab.h as the alignment, and
> then add the special alignment options when we have a real user for them.

ok

--
thanks, igor


[RFC PATCH v21 0/6] mm: security: ro protection for dynamic data

2018-03-27 Thread Igor Stoppa
This patch-set introduces the possibility of protecting memory that has
been allocated dynamically.

The memory is managed in pools: when a memory pool is protected, all the
memory that is currently part of it will become R/O.

A R/O pool can be expanded (adding more protectable memory).
It can also be destroyed, to recover its memory, but it cannot be
turned back into R/W mode.

This is intentional. This feature is meant for data that doesn't need
further modifications after initialization.

However the data might need to be released, for example as part of module
unloading. The pool, therefore, can be destroyed.

An example is provided, in the form of self-testing.

Changes since v20:

[http://www.openwall.com/lists/kernel-hardening/2018/03/27/2]

* removed the align_order parameter from allocation functions
* improved documentation with more explanation
* fixed lkdtm test
* reworked the destroy function, removing a possible race with
  use-after-free code.


Igor Stoppa (6):
  struct page: add field for vm_struct
  vmalloc: rename llist field in vmap_area
  Protectable Memory
  Pmalloc selftest
  lkdtm: crash on overwriting protected pmalloc var
  Documentation for Pmalloc

 Documentation/core-api/index.rst   |   1 +
 Documentation/core-api/pmalloc.rst | 107 +++
 drivers/misc/lkdtm.h   |   1 +
 drivers/misc/lkdtm_core.c  |   3 +
 drivers/misc/lkdtm_perms.c |  25 
 include/linux/mm_types.h   |   1 +
 include/linux/pmalloc.h| 166 +++
 include/linux/test_pmalloc.h   |  24 
 include/linux/vmalloc.h|   5 +-
 init/main.c|   2 +
 mm/Kconfig |  16 +++
 mm/Makefile|   2 +
 mm/pmalloc.c   | 264 +
 mm/test_pmalloc.c  | 136 +++
 mm/usercopy.c  |  33 +
 mm/vmalloc.c   |  10 +-
 16 files changed, 791 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/core-api/pmalloc.rst
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 include/linux/test_pmalloc.h
 create mode 100644 mm/pmalloc.c
 create mode 100644 mm/test_pmalloc.c

-- 
2.14.1



[PATCH 1/6] struct page: add field for vm_struct

2018-03-27 Thread Igor Stoppa
When a page is used for virtual memory, it is often necessary to obtain
a handle to the corresponding vm_struct, which refers to the virtually
contiguous area generated when invoking vmalloc.

The struct page has a "mapping" field, which can be re-used to store a
pointer to the parent area.

This will avoid more expensive searches later on.

Signed-off-by: Igor Stoppa 
Reviewed-by: Jay Freyensee 
Reviewed-by: Matthew Wilcox 
---
 include/linux/mm_types.h | 1 +
 mm/vmalloc.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd1af6b9591d..c3a4825e10c0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -84,6 +84,7 @@ struct page {
void *s_mem;/* slab first object */
atomic_t compound_mapcount; /* first tail page */
/* page_deferred_list().next -- second tail page */
+   struct vm_struct *area;
};
 
/* Second double word */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ebff729cc956..61a1ca22b0f6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1536,6 +1536,7 @@ static void __vunmap(const void *addr, int 
deallocate_pages)
struct page *page = area->pages[i];
 
BUG_ON(!page);
+   page->area = NULL;
__free_pages(page, 0);
}
 
@@ -1705,6 +1706,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, 
gfp_t gfp_mask,
area->nr_pages = i;
goto fail;
}
+   page->area = area;
area->pages[i] = page;
if (gfpflags_allow_blocking(gfp_mask|highmem_mask))
cond_resched();
-- 
2.14.1



[PATCH 2/6] vmalloc: rename llist field in vmap_area

2018-03-27 Thread Igor Stoppa
The vmap_area structure has a field of type struct llist_node, named
purge_list, which is used when performing lazy purging of the area.

This field is left unused during the actual utilization of the
structure.

This patch renames the field to a more generic "area_list", to allow for
utilization outside of the purging phase.

Since the purging happens after the vmap_area is dismissed, its use is
mutually exclusive with any use performed while the area is allocated.

Signed-off-by: Igor Stoppa 
---
 include/linux/vmalloc.h | 2 +-
 mm/vmalloc.c| 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 1e5d8c392f15..2d07dfef3cfd 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -47,7 +47,7 @@ struct vmap_area {
unsigned long flags;
struct rb_node rb_node; /* address sorted rbtree */
struct list_head list;  /* address sorted list */
-   struct llist_node purge_list;/* "lazy purge" list */
+   struct llist_node area_list;/* generic list of areas */
struct vm_struct *vm;
struct rcu_head rcu_head;
 };
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 61a1ca22b0f6..1bb2233bb262 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -682,7 +682,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, 
unsigned long end)
lockdep_assert_held(&vmap_purge_lock);
 
valist = llist_del_all(&vmap_purge_list);
-   llist_for_each_entry(va, valist, purge_list) {
+   llist_for_each_entry(va, valist, area_list) {
if (va->va_start < start)
start = va->va_start;
if (va->va_end > end)
@@ -696,7 +696,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, 
unsigned long end)
flush_tlb_kernel_range(start, end);
 
spin_lock(&vmap_area_lock);
-   llist_for_each_entry_safe(va, n_va, valist, purge_list) {
+   llist_for_each_entry_safe(va, n_va, valist, area_list) {
int nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
 
__free_vmap_area(va);
@@ -743,7 +743,7 @@ static void free_vmap_area_noflush(struct vmap_area *va)
&vmap_lazy_nr);
 
/* After this point, we may free va at any time */
-   llist_add(&va->purge_list, &vmap_purge_list);
+   llist_add(&va->area_list, &vmap_purge_list);
 
if (unlikely(nr_lazy > lazy_max_pages()))
try_purge_vmap_area_lazy();
-- 
2.14.1



[PATCH 3/6] Protectable Memory

2018-03-27 Thread Igor Stoppa
The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.

However, the MMU-based protection works efficiently only when said pages
contain exclusively data that will not need further modifications.

Statically allocated variables can be segregated into a dedicated
section (that's how __ro_after_init works), but this does not sit very
well with dynamically allocated ones.

Dynamic allocation does not provide, currently, any means for grouping
variables in memory pages that would contain exclusively data suitable
for conversion to read only access mode.

The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.

A module can instantiate a pool, and then refer any allocation request to
the pool handle it has received.

A pool is organized as a list of areas of virtually contiguous memory.
Whenever the protection functionality is invoked on a pool, all the
areas it contains that are not yet read-only are write-protected.

The process of growing and protecting the pool can be iterated at will.
Each iteration will prevent further allocation from the memory area
currently active, turn it into read-only mode and then proceed to
secure whatever other area might still be unprotected.

Write-protecting some part of a pool before completing all the
allocations can be wasteful; however, it will guarantee the minimum
window of vulnerability, since the data can be allocated, initialized
and protected in a single sweep.

There are pros and cons, depending on the allocation patterns, the size
of the areas being allocated, the time intervals between initialization
and protection.

Destroying a pool is the only way to claim back the associated memory.
It is up to its user to avoid any further references to the memory that
was allocated, once the destruction is invoked.

An example where it is desirable to destroy a pool and claim back its
memory is when unloading a kernel module.

A module can have as many pools as needed.

Since pmalloc memory is obtained from vmalloc, an attacker that has
gained access to the physical mapping still has to identify where the
target of the attack (in the virtually contiguous mapping) is located.

Compared to plain vmalloc, pmalloc does not generate as much TLB
thrashing, since it can host multiple allocations in the same page,
where present.

Signed-off-by: Igor Stoppa 
---
 include/linux/pmalloc.h | 166 ++
 include/linux/vmalloc.h |   3 +
 mm/Kconfig  |   6 ++
 mm/Makefile |   1 +
 mm/pmalloc.c| 264 
 mm/usercopy.c   |  33 ++
 mm/vmalloc.c|   2 +-
 7 files changed, 474 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index ..07d7838f7877
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,166 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017-18 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+#ifndef _LINUX_PMALLOC_H
+#define _LINUX_PMALLOC_H
+
+
+#include 
+#include 
+
+/*
+ * Library for dynamic allocation of pools of protectable memory.
+ * A pool is a single linked list of vmap_area structures.
+ * Whenever a pool is protected, all the areas it contains at that point
+ * are write protected.
+ * More areas can be added and protected, in the same way.
+ * Memory in a pool cannot be individually unprotected, but the pool can
+ * be destroyed.
+ * Upon destruction of a certain pool, all the related memory is released,
+ * including its metadata.
+ *
+ * Pmalloc memory is intended to complement __ro_after_init.
+ * It can be used, for example, where there is a write-once variable, for
+ * which it is not possible to know the initialization value before init
+ * is completed (which is what __ro_after_init requires).
+ *
+ * It can be useful also where the amount of data to protect is not known
+ * at compile time and the memory can only be allocated dynamically.
+ *
+ * Finally, it can be useful also when it is desirable to control
+ * dynamically (for example through the command line) if something ought
+ * to be protected or not, without having to rebuild the kernel (like in
+ * the build used for a linux distro).
+ */
+
+
+#define PMALLOC_REFILL_DEFAULT (0)
+#define PMALLOC_ALIGN_DEFAULT ARCH_KMALLOC_MINALIGN
+
+struct pmalloc_pool *pmalloc_create_custom_pool(unsigned long int refill,
+   unsigned short align_order);
+
+/**
+ * pmalloc_create_pool() - create a protectable memory pool
+ *
+ * Shorthand for pmalloc_create_custom_pool() with default argument:
+ * * refill is set to PMALLOC_REFILL_DEFAULT
+ * * align_order is s

[PATCH 4/6] Pmalloc selftest

2018-03-27 Thread Igor Stoppa
Add basic self-test functionality for pmalloc.

The testing is introduced as early as possible, right after the main
dependency, genalloc, has passed successfully, so that it can help
diagnose failures in pmalloc users.

Signed-off-by: Igor Stoppa 
---
 include/linux/test_pmalloc.h |  24 
 init/main.c  |   2 +
 mm/Kconfig   |  10 
 mm/Makefile  |   1 +
 mm/test_pmalloc.c| 136 +++
 5 files changed, 173 insertions(+)
 create mode 100644 include/linux/test_pmalloc.h
 create mode 100644 mm/test_pmalloc.c

diff --git a/include/linux/test_pmalloc.h b/include/linux/test_pmalloc.h
new file mode 100644
index ..c7e2e451c17c
--- /dev/null
+++ b/include/linux/test_pmalloc.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * test_pmalloc.h
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+
+#ifndef __LINUX_TEST_PMALLOC_H
+#define __LINUX_TEST_PMALLOC_H
+
+
+#ifdef CONFIG_TEST_PROTECTABLE_MEMORY
+
+void test_pmalloc(void);
+
+#else
+
+static inline void test_pmalloc(void){};
+
+#endif
+
+#endif
diff --git a/init/main.c b/init/main.c
index 21efbf6ace93..c63c41a33c9b 100644
--- a/init/main.c
+++ b/init/main.c
@@ -90,6 +90,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -661,6 +662,7 @@ asmlinkage __visible void __init start_kernel(void)
 */
mem_encrypt_init();
 
+   test_pmalloc();
 #ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
diff --git a/mm/Kconfig b/mm/Kconfig
index 1ac1dfc60c22..246f66c7e694 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -766,3 +766,13 @@ config PROTECTABLE_MEMORY
 depends on MMU
 depends on ARCH_HAS_SET_MEMORY
 default y
+
+config TEST_PROTECTABLE_MEMORY
+   bool "Run self test for pmalloc memory allocator"
+depends on MMU
+   depends on ARCH_HAS_SET_MEMORY
+   select PROTECTABLE_MEMORY
+   default n
+   help
+ Tries to verify that pmalloc works correctly and that the memory
+ is effectively protected.
diff --git a/mm/Makefile b/mm/Makefile
index 959fdbdac118..1de4be5fd0bc 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_PROTECTABLE_MEMORY) += pmalloc.o
+obj-$(CONFIG_TEST_PROTECTABLE_MEMORY) += test_pmalloc.o
 obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += page_poison.o
 obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/test_pmalloc.c b/mm/test_pmalloc.c
new file mode 100644
index ..08274b0324f9
--- /dev/null
+++ b/mm/test_pmalloc.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * test_pmalloc.c
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#define SIZE_1 (PAGE_SIZE * 3)
+#define SIZE_2 1000
+
+
+/* wrapper for is_pmalloc_object() with messages */
+static inline bool validate_alloc(bool expected, void *addr,
+ unsigned long size)
+{
+   bool test;
+
+   test = is_pmalloc_object(addr, size) > 0;
+   pr_notice("must be %s: %s",
+ expected ? "ok" : "no", test ? "ok" : "no");
+   return test == expected;
+}
+
+
+#define is_alloc_ok(variable, size)\
+   validate_alloc(true, variable, size)
+
+
+#define is_alloc_no(variable, size)\
+   validate_alloc(false, variable, size)
+
+/* tests the basic life-cycle of a pool */
+static bool create_and_destroy_pool(void)
+{
+   static struct pmalloc_pool *pool;
+
+   pr_notice("Testing pool creation and destruction capability");
+
+   pool = pmalloc_create_pool();
+   if (WARN(!pool, "Cannot allocate memory for pmalloc selftest."))
+   return false;
+   pmalloc_destroy_pool(pool);
+   return true;
+}
+
+
+/*  verifies that it's possible to allocate from the pool */
+static bool test_alloc(void)
+{
+   static struct pmalloc_pool *pool;
+   static void *p;
+
+   pr_notice("Testing allocation capability");
+   pool = pmalloc_create_pool();
+   if (WARN(!pool, "Unable to allocate memory for pmalloc selftest."))
+   return false;
+   p = pmalloc(pool,  SIZE_1 - 1);
+   pmalloc_protect_pool(pool);
+   pmalloc_destroy_pool(pool);
+   if (WARN(!p, "Failed to allocate memory from the pool"))
+   return false;
+   return true;
+}
+
+
+/* tests the identification of pmalloc ranges */
+static bool test_is_pmalloc_object(void)
+{
+   struct pmalloc_pool *pool;
+   void *pmalloc_p;
+   void *vma

[PATCH 5/6] lkdtm: crash on overwriting protected pmalloc var

2018-03-27 Thread Igor Stoppa
Verify that pmalloc read-only protection is in place: trying to
overwrite a protected variable will crash the kernel.

Signed-off-by: Igor Stoppa 
---
 drivers/misc/lkdtm.h   |  1 +
 drivers/misc/lkdtm_core.c  |  3 +++
 drivers/misc/lkdtm_perms.c | 25 +
 3 files changed, 29 insertions(+)

diff --git a/drivers/misc/lkdtm.h b/drivers/misc/lkdtm.h
index 9e513dcfd809..dcda3ae76ceb 100644
--- a/drivers/misc/lkdtm.h
+++ b/drivers/misc/lkdtm.h
@@ -38,6 +38,7 @@ void lkdtm_READ_BUDDY_AFTER_FREE(void);
 void __init lkdtm_perms_init(void);
 void lkdtm_WRITE_RO(void);
 void lkdtm_WRITE_RO_AFTER_INIT(void);
+void lkdtm_WRITE_RO_PMALLOC(void);
 void lkdtm_WRITE_KERN(void);
 void lkdtm_EXEC_DATA(void);
 void lkdtm_EXEC_STACK(void);
diff --git a/drivers/misc/lkdtm_core.c b/drivers/misc/lkdtm_core.c
index 2154d1bfd18b..c9fd42bda6ee 100644
--- a/drivers/misc/lkdtm_core.c
+++ b/drivers/misc/lkdtm_core.c
@@ -155,6 +155,9 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(ACCESS_USERSPACE),
CRASHTYPE(WRITE_RO),
CRASHTYPE(WRITE_RO_AFTER_INIT),
+#ifdef CONFIG_PROTECTABLE_MEMORY
+   CRASHTYPE(WRITE_RO_PMALLOC),
+#endif
CRASHTYPE(WRITE_KERN),
CRASHTYPE(REFCOUNT_INC_OVERFLOW),
CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
diff --git a/drivers/misc/lkdtm_perms.c b/drivers/misc/lkdtm_perms.c
index 53b85c9d16b8..4660ff0bfa44 100644
--- a/drivers/misc/lkdtm_perms.c
+++ b/drivers/misc/lkdtm_perms.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Whether or not to fill the target memory area with do_nothing(). */
@@ -104,6 +105,30 @@ void lkdtm_WRITE_RO_AFTER_INIT(void)
*ptr ^= 0xabcd1234;
 }
 
+#ifdef CONFIG_PROTECTABLE_MEMORY
+void lkdtm_WRITE_RO_PMALLOC(void)
+{
+   struct pmalloc_pool *pool;
+   int *i;
+
+   pool = pmalloc_create_pool();
+   if (WARN(!pool, "Failed preparing pool for pmalloc test."))
+   return;
+
+   i = (int *)pmalloc(pool, sizeof(int));
+   if (WARN(!i, "Failed allocating memory for pmalloc test.")) {
+   pmalloc_destroy_pool(pool);
+   return;
+   }
+
+   *i = INT_MAX;
+   pmalloc_protect_pool(pool);
+
+   pr_info("attempting bad pmalloc write at %p\n", i);
+   *i = 0;
+}
+#endif
+
 void lkdtm_WRITE_KERN(void)
 {
size_t size;
-- 
2.14.1



[PATCH 6/6] Documentation for Pmalloc

2018-03-27 Thread Igor Stoppa
Detailed documentation about the protectable memory allocator.

Signed-off-by: Igor Stoppa 
---
 Documentation/core-api/index.rst   |   1 +
 Documentation/core-api/pmalloc.rst | 107 +
 2 files changed, 108 insertions(+)
 create mode 100644 Documentation/core-api/pmalloc.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index c670a8031786..8f5de42d6571 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -25,6 +25,7 @@ Core utilities
genalloc
errseq
printk-formats
+   pmalloc
 
 Interfaces for kernel debugging
 ===
diff --git a/Documentation/core-api/pmalloc.rst 
b/Documentation/core-api/pmalloc.rst
new file mode 100644
index ..c14907485137
--- /dev/null
+++ b/Documentation/core-api/pmalloc.rst
@@ -0,0 +1,107 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _pmalloc:
+
+Protectable memory allocator
+
+
+Purpose
+---
+
+The pmalloc library is meant to provide read-only status to data that,
+for some reason, could neither be declared as constant, nor could it take
+advantage of the qualifier __ro_after_init, but is write-once and
+read-only in spirit. At least as long as it doesn't get torn down.
+It protects data from both accidental and malicious overwrites.
+
+Example: A policy that is loaded from userspace.
+
+
+Concept
+---
+
+The MMU available in the system can be used to write protect memory pages.
+Unfortunately this feature cannot be used as-it-is, to protect sensitive
+data, because this potentially read-only data is typically interleaved
+with other data, which must stay writeable.
+
+pmalloc introduces the concept of protectable memory pools.
+A pool contains a list of areas of virtually contiguous pages of
+memory. An area is the minimum amount of memory that pmalloc allows to
+protect, because the user might have allocated a memory range that
+crosses the boundary between pages.
+
+When an allocation is performed, if there is not enough memory already
+available in the pool, a new area of suitable size is grabbed.
+The size chosen is the largest between the roundup (to PAGE_SIZE) of
+the request from pmalloc and friends and the refill parameter specified
+when creating the pool.
+
+When a pool is created, it is possible to specify two parameters:
+- refill size: the minimum size of the memory area to allocate when needed
+- align_order: the default alignment to use when reserving memory
+
+To facilitate the conversion of existing code to pmalloc pools, several
+helper functions are provided, mirroring their k/vmalloc counterparts.
+However one is missing. There is no pfree() because the memory protected
+by a pool will be released exclusively when the pool is destroyed.
+
+
+
+Caveats
+---
+
+- When a pool is protected, whatever memory would be still available in
+  the current vmap_area (from which allocations are performed) is
+  relinquished.
+
+- As already explained, freeing of memory is not supported. Pages will be
+  returned to the system upon destruction of the memory pool that they
+  belong to.
+
+- The address range available for vmalloc (and thus for pmalloc too) is
+  limited on 32-bit systems. However, it shouldn't be an issue, since not
+  much data is expected to be dynamically allocated and turned into
+  read-only.
+
+- Regarding SMP systems, the allocations are expected to happen mostly
+  during an initial transient, after which there should be no more need
+  to perform cross-processor synchronizations of page tables.
+  Loading of kernel modules is an exception to this, but it's not expected
+  to happen with such high frequency to become a problem.
+
+
+Use
+---
+
+The typical sequence, when using pmalloc, is:
+
+#. create a pool
+
+   :c:func:`pmalloc_create_pool`
+
+#. issue one or more allocation requests to the pool
+
+   :c:func:`pmalloc`
+
+   or
+
+   :c:func:`pzalloc`
+
+#. initialize the memory obtained, with the desired values
+
+#. write-protect the memory so far allocated
+
+   :c:func:`pmalloc_protect_pool`
+
+#. iterate over the last 3 points as needed
+
+#. [optional] destroy the pool
+
+   :c:func:`pmalloc_destroy_pool`
+
+API
+---
+
+.. kernel-doc:: include/linux/pmalloc.h
+.. kernel-doc:: mm/pmalloc.c
-- 
2.14.1
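
As a quick reference, this is roughly what the sequence described in the
documentation above looks like in code; struct policy and the function
names are made up for the example, while the pmalloc calls are the ones
from this series:

#include <linux/errno.h>
#include <linux/pmalloc.h>

static struct pmalloc_pool *policy_pool;
static struct policy *policy;

static int load_policy(const struct policy *src)
{
	policy_pool = pmalloc_create_pool();
	if (!policy_pool)
		return -ENOMEM;

	policy = pmalloc(policy_pool, sizeof(*policy));
	if (!policy) {
		pmalloc_destroy_pool(policy_pool);
		return -ENOMEM;
	}

	*policy = *src;				/* initialize while still writable */
	pmalloc_protect_pool(policy_pool);	/* from now on the data is R/O */
	return 0;
}

static void unload_policy(void)
{
	/* destroying the pool is the only way to get the memory back */
	pmalloc_destroy_pool(policy_pool);
}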



[PATCH 2/6] vmalloc: rename llist field in vmap_area

2018-04-13 Thread Igor Stoppa
The vmap_area structure has a field of type struct llist_node, named
purge_list, which is used when performing lazy purging of the area.

This field is left unused during the actual utilization of the
structure.

This patch renames the field to a more generic "area_list", to allow for
utilization outside of the purging phase.

Since the purging happens after the vmap_area is dismissed, its use is
mutually exclusive with any use performed while the area is allocated.

Signed-off-by: Igor Stoppa 
---
 include/linux/vmalloc.h | 2 +-
 mm/vmalloc.c| 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 1e5d8c392f15..2d07dfef3cfd 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -47,7 +47,7 @@ struct vmap_area {
unsigned long flags;
struct rb_node rb_node; /* address sorted rbtree */
struct list_head list;  /* address sorted list */
-   struct llist_node purge_list;/* "lazy purge" list */
+   struct llist_node area_list;/* generic list of areas */
struct vm_struct *vm;
struct rcu_head rcu_head;
 };
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 61a1ca22b0f6..1bb2233bb262 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -682,7 +682,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, 
unsigned long end)
lockdep_assert_held(&vmap_purge_lock);
 
valist = llist_del_all(&vmap_purge_list);
-   llist_for_each_entry(va, valist, purge_list) {
+   llist_for_each_entry(va, valist, area_list) {
if (va->va_start < start)
start = va->va_start;
if (va->va_end > end)
@@ -696,7 +696,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, 
unsigned long end)
flush_tlb_kernel_range(start, end);
 
spin_lock(&vmap_area_lock);
-   llist_for_each_entry_safe(va, n_va, valist, purge_list) {
+   llist_for_each_entry_safe(va, n_va, valist, area_list) {
int nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
 
__free_vmap_area(va);
@@ -743,7 +743,7 @@ static void free_vmap_area_noflush(struct vmap_area *va)
&vmap_lazy_nr);
 
/* After this point, we may free va at any time */
-   llist_add(&va->purge_list, &vmap_purge_list);
+   llist_add(&va->area_list, &vmap_purge_list);
 
if (unlikely(nr_lazy > lazy_max_pages()))
try_purge_vmap_area_lazy();
-- 
2.14.1



[PATCH 1/6] struct page: add field for vm_struct

2018-04-13 Thread Igor Stoppa
When a page is used for virtual memory, it is often necessary to obtain
a handle to the corresponding vm_struct, which refers to the virtually
contiguous area generated when invoking vmalloc.

The struct page has a "mapping" field, which can be re-used to store a
pointer to the parent area.

This will avoid more expensive searches, later on.
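As an illustration (a sketch, not part of the patch): given an address in
a vmalloc'ed range, the owning area can now be reached through the page
itself, instead of searching the vmap rbtree:

	struct page *page = vmalloc_to_page(addr);
	struct vm_struct *area = page->area;	/* was: find_vm_area(addr) */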

Signed-off-by: Igor Stoppa 
Reviewed-by: Jay Freyensee 
Reviewed-by: Matthew Wilcox 
---
 include/linux/mm_types.h | 1 +
 mm/vmalloc.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 21612347d311..c74e2aa9a48b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -86,6 +86,7 @@ struct page {
void *s_mem;/* slab first object */
atomic_t compound_mapcount; /* first tail page */
/* page_deferred_list().next -- second tail page */
+   struct vm_struct *area;
};
 
/* Second double word */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ebff729cc956..61a1ca22b0f6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1536,6 +1536,7 @@ static void __vunmap(const void *addr, int 
deallocate_pages)
struct page *page = area->pages[i];
 
BUG_ON(!page);
+   page->area = NULL;
__free_pages(page, 0);
}
 
@@ -1705,6 +1706,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, 
gfp_t gfp_mask,
area->nr_pages = i;
goto fail;
}
+   page->area = area;
area->pages[i] = page;
if (gfpflags_allow_blocking(gfp_mask|highmem_mask))
cond_resched();
-- 
2.14.1



[RFC PATCH v22 0/6] mm: security: ro protection for dynamic data

2018-04-13 Thread Igor Stoppa
This patch-set introduces the possibility of protecting memory that has
been allocated dynamically.

The memory is managed in pools: when a memory pool is protected, all the
memory that is currently part of it will become R/O.

A R/O pool can be expanded (adding more protectable memory).
It can also be destroyed, to recover its memory, but it cannot be
turned back into R/W mode.

This is intentional. This feature is meant for data that doesn't need
further modifications after initialization.

However, the data might need to be released, for example as part of module
unloading. The pool, therefore, can be destroyed.
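For reference, a minimal sketch of the lifecycle described above, based on
the API as exercised by the lkdtm and self-test patches in this series
(error handling omitted):

	struct pmalloc_pool *pool;
	int *val;

	pool = pmalloc_create_pool();
	val = pmalloc(pool, sizeof(*val));  /* or pzalloc() for zeroed memory */
	*val = 42;                          /* initialize while still writable */
	pmalloc_protect_pool(pool);         /* from here on, *val is R/O */
	/* ... use *val ... */
	pmalloc_destroy_pool(pool);         /* e.g. on module unload */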

An example is provided, in the form of self-testing.

Since it was advised to give an example of protecting real kernel data [1],
a well-known vulnerability has been used to demo an effective use of
pmalloc.

[1] http://www.openwall.com/lists/kernel-hardening/2018/03/29/7

However, it turned out to be almost a how-to for attacking the kernel, so
it was first sent to secur...@kernel.org, to obtain clearance for its
publication.

Changes since v21:

[http://www.openwall.com/lists/kernel-hardening/2018/03/27/23]

* fixed type mismatch error in use of max(), detected by gcc 7.3
* converted internal types into size_t
* fixed leak of vmalloc memory in the self-test code

Igor Stoppa (6):
  struct page: add field for vm_struct
  vmalloc: rename llist field in vmap_area
  Protectable Memory
  Documentation for Pmalloc
  Pmalloc selftest
  lkdtm: crash on overwriting protected pmalloc var

 Documentation/core-api/index.rst   |   1 +
 Documentation/core-api/pmalloc.rst | 107 +++
 drivers/misc/lkdtm/core.c  |   3 +
 drivers/misc/lkdtm/lkdtm.h |   1 +
 drivers/misc/lkdtm/perms.c |  25 
 include/linux/mm_types.h   |   1 +
 include/linux/pmalloc.h| 166 +++
 include/linux/test_pmalloc.h   |  24 
 include/linux/vmalloc.h|   5 +-
 init/main.c|   2 +
 mm/Kconfig |  16 +++
 mm/Makefile|   2 +
 mm/pmalloc.c   | 265 +
 mm/test_pmalloc.c  | 137 +++
 mm/usercopy.c  |  33 +
 mm/vmalloc.c   |  10 +-
 16 files changed, 793 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/core-api/pmalloc.rst
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 include/linux/test_pmalloc.h
 create mode 100644 mm/pmalloc.c
 create mode 100644 mm/test_pmalloc.c

-- 
2.14.1



[PATCH 6/6] lkdtm: crash on overwriting protected pmalloc var

2018-04-13 Thread Igor Stoppa
Verify that pmalloc read-only protection is in place: trying to
overwrite a protected variable will crash the kernel.

Signed-off-by: Igor Stoppa 
---
 drivers/misc/lkdtm/core.c  |  3 +++
 drivers/misc/lkdtm/lkdtm.h |  1 +
 drivers/misc/lkdtm/perms.c | 25 +
 3 files changed, 29 insertions(+)

diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 2154d1bfd18b..c9fd42bda6ee 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -155,6 +155,9 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(ACCESS_USERSPACE),
CRASHTYPE(WRITE_RO),
CRASHTYPE(WRITE_RO_AFTER_INIT),
+#ifdef CONFIG_PROTECTABLE_MEMORY
+   CRASHTYPE(WRITE_RO_PMALLOC),
+#endif
CRASHTYPE(WRITE_KERN),
CRASHTYPE(REFCOUNT_INC_OVERFLOW),
CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 9e513dcfd809..dcda3ae76ceb 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -38,6 +38,7 @@ void lkdtm_READ_BUDDY_AFTER_FREE(void);
 void __init lkdtm_perms_init(void);
 void lkdtm_WRITE_RO(void);
 void lkdtm_WRITE_RO_AFTER_INIT(void);
+void lkdtm_WRITE_RO_PMALLOC(void);
 void lkdtm_WRITE_KERN(void);
 void lkdtm_EXEC_DATA(void);
 void lkdtm_EXEC_STACK(void);
diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 53b85c9d16b8..4660ff0bfa44 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Whether or not to fill the target memory area with do_nothing(). */
@@ -104,6 +105,30 @@ void lkdtm_WRITE_RO_AFTER_INIT(void)
*ptr ^= 0xabcd1234;
 }
 
+#ifdef CONFIG_PROTECTABLE_MEMORY
+void lkdtm_WRITE_RO_PMALLOC(void)
+{
+   struct pmalloc_pool *pool;
+   int *i;
+
+   pool = pmalloc_create_pool();
+   if (WARN(!pool, "Failed preparing pool for pmalloc test."))
+   return;
+
+   i = (int *)pmalloc(pool, sizeof(int));
+   if (WARN(!i, "Failed allocating memory for pmalloc test.")) {
+   pmalloc_destroy_pool(pool);
+   return;
+   }
+
+   *i = INT_MAX;
+   pmalloc_protect_pool(pool);
+
+   pr_info("attempting bad pmalloc write at %p\n", i);
+   *i = 0;
+}
+#endif
+
 void lkdtm_WRITE_KERN(void)
 {
size_t size;
-- 
2.14.1



[PATCH 3/6] Protectable Memory

2018-04-13 Thread Igor Stoppa
The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.

However, the MMU-based protection works efficiently only when said pages
contain exclusively data that will not need further modifications.

Statically allocated variables can be segregated into a dedicated
section (that's how __ro_after_init works), but this does not sit very
well with dynamically allocated ones.

Dynamic allocation does not provide, currently, any means for grouping
variables in memory pages that would contain exclusively data suitable
for conversion to read only access mode.

The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.

A module can instantiate a pool, and then refer any allocation request to
the pool handle it has received.

A pool is organized as a list of areas of virtually contiguous memory.
Whenever the protection functionality is invoked on a pool, all the
areas it contains that are not yet read-only are write-protected.

The process of growing and protecting the pool can be iterated at will.
Each iteration will prevent further allocation from the memory area
currently active, turn it into read-only mode and then proceed to
secure whatever other area might still be unprotected.

Write-protecting part of a pool before completing all the allocations
can be wasteful; however, it guarantees the minimum window of
vulnerability, since the data can be allocated, initialized and
protected in a single sweep.
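As a sketch of the iteration described above (init_first_batch() and
init_second_batch() stand for whatever initialization the caller performs;
illustrative only):

	p1 = pmalloc(pool, size1);
	init_first_batch(p1);
	pmalloc_protect_pool(pool);	/* p1 becomes R/O */

	p2 = pmalloc(pool, size2);	/* served from a new, writable area */
	init_second_batch(p2);
	pmalloc_protect_pool(pool);	/* p2 becomes R/O as well */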

There are pros and cons, depending on the allocation patterns, the size
of the areas being allocated, the time intervals between initialization
and protection.

Destroying a pool is the only way to claim back the associated memory.
It is up to its user to avoid any further references to the memory that
was allocated, once the destruction is invoked.

An example where it is desirable to destroy a pool and claim back its
memory is when unloading a kernel module.

A module can have as many pools as needed.

Since pmalloc memory is obtained from vmalloc, an attacker that has
gained access to the physical mapping, still has to identify where the
target of the attack (in virtually contiguous mapping) is located.

Compared to plain vmalloc, pmalloc does not generate as much TLB
thrashing, since it can host multiple allocations in the same page,
where possible.

Signed-off-by: Igor Stoppa 
---
 include/linux/pmalloc.h | 166 ++
 include/linux/vmalloc.h |   3 +
 mm/Kconfig  |   6 ++
 mm/Makefile |   1 +
 mm/pmalloc.c| 265 
 mm/usercopy.c   |  33 ++
 mm/vmalloc.c|   2 +-
 7 files changed, 475 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index ..1c24067eb167
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,166 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017-18 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+#ifndef _LINUX_PMALLOC_H
+#define _LINUX_PMALLOC_H
+
+
+#include 
+#include 
+
+/*
+ * Library for dynamic allocation of pools of protectable memory.
+ * A pool is a single linked list of vmap_area structures.
+ * Whenever a pool is protected, all the areas it contains at that point
+ * are write protected.
+ * More areas can be added and protected, in the same way.
+ * Memory in a pool cannot be individually unprotected, but the pool can
+ * be destroyed.
+ * Upon destruction of a certain pool, all the related memory is released,
+ * including its metadata.
+ *
+ * Pmalloc memory is intended to complement __ro_after_init.
+ * It can be used, for example, where there is a write-once variable, for
+ * which it is not possible to know the initialization value before init
+ * is completed (which is what __ro_after_init requires).
+ *
+ * It can be useful also where the amount of data to protect is not known
+ * at compile time and the memory can only be allocated dynamically.
+ *
+ * Finally, it can be useful also when it is desirable to control
+ * dynamically (for example through the command line) if something ought
+ * to be protected or not, without having to rebuild the kernel (like in
+ * the build used for a linux distro).
+ */
+
+
+#define PMALLOC_REFILL_DEFAULT (0)
+#define PMALLOC_ALIGN_DEFAULT ARCH_KMALLOC_MINALIGN
+
+struct pmalloc_pool *pmalloc_create_custom_pool(size_t refill,
+   unsigned short align_order);
+
+/**
+ * pmalloc_create_pool() - create a protectable memory pool
+ *
+ * Shorthand for pmalloc_create_custom_pool() with default argument:
+ * * refill is set to PMALLOC_REFILL_DEFAULT
+ * * align_order is set to PMALLOC_ALIGN_DEFAULT

[PATCH 4/6] Documentation for Pmalloc

2018-04-13 Thread Igor Stoppa
Detailed documentation about the protectable memory allocator.

Signed-off-by: Igor Stoppa 
---
 Documentation/core-api/index.rst   |   1 +
 Documentation/core-api/pmalloc.rst | 107 +
 2 files changed, 108 insertions(+)
 create mode 100644 Documentation/core-api/pmalloc.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index c670a8031786..8f5de42d6571 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -25,6 +25,7 @@ Core utilities
genalloc
errseq
printk-formats
+   pmalloc
 
 Interfaces for kernel debugging
 ===
diff --git a/Documentation/core-api/pmalloc.rst 
b/Documentation/core-api/pmalloc.rst
new file mode 100644
index ..c14907485137
--- /dev/null
+++ b/Documentation/core-api/pmalloc.rst
@@ -0,0 +1,107 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _pmalloc:
+
+Protectable memory allocator
+
+
+Purpose
+---
+
+The pmalloc library is meant to provide read-only status to data that,
+for some reason, could neither be declared as constant, nor take
+advantage of the qualifier __ro_after_init, but is write-once and
+read-only in spirit, at least until it gets torn down.
+It protects data from both accidental and malicious overwrites.
+
+Example: A policy that is loaded from userspace.
+
+
+Concept
+---
+
+The MMU available in the system can be used to write protect memory pages.
+Unfortunately this feature cannot be used as-is to protect sensitive
+data, because this potentially read-only data is typically interleaved
+with other data, which must stay writeable.
+
+pmalloc introduces the concept of protectable memory pools.
+A pool contains a list of areas of virtually contiguous pages of
+memory. An area is the minimum unit of memory that pmalloc protects,
+because the user might have allocated a memory range that crosses the
+boundary between pages.
+
+When an allocation is performed, if there is not enough memory already
+available in the pool, a new area of suitable size is grabbed.
+The size chosen is the larger of the request (rounded up to PAGE_SIZE)
+and the refill parameter specified when creating the pool.
+
+When a pool is created, it is possible to specify two parameters:
+- refill size: the minimum size of the memory area to allocate when needed
+- align_order: the default alignment to use when reserving memory
+
+To facilitate the conversion of existing code to pmalloc pools, several
+helper functions are provided, mirroring their k/vmalloc counterparts.
+However, one is missing: there is no pfree(), because the memory protected
+by a pool is released exclusively when the pool is destroyed.
+
+
+
+Caveats
+---
+
+- When a pool is protected, whatever memory would be still available in
+  the current vmap_area (from which allocations are performed) is
+  relinquished.
+
+- As already explained, freeing of memory is not supported. Pages will be
+  returned to the system upon destruction of the memory pool that they
+  belong to.
+
+- The address range available for vmalloc (and thus for pmalloc too) is
+  limited on 32-bit systems. However, this shouldn't be an issue, since
+  not much data is expected to be dynamically allocated and turned
+  read-only.
+
+- Regarding SMP systems, the allocations are expected to happen mostly
+  during an initial transient, after which there should be no more need
+  to perform cross-processor synchronizations of page tables.
+  Loading of kernel modules is an exception to this, but it's not expected
+  to happen frequently enough to become a problem.
+
+
+Use
+---
+
+The typical sequence, when using pmalloc, is:
+
+#. create a pool
+
+   :c:func:`pmalloc_create_pool`
+
+#. issue one or more allocation requests to the pool
+
+   :c:func:`pmalloc`
+
+   or
+
+   :c:func:`pzalloc`
+
+#. initialize the memory obtained, with the desired values
+
+#. write-protect the memory so far allocated
+
+   :c:func:`pmalloc_protect_pool`
+
+#. iterate over the last 3 points as needed
+
+#. [optional] destroy the pool
+
+   :c:func:`pmalloc_destroy_pool`
+
+API
+---
+
+.. kernel-doc:: include/linux/pmalloc.h
+.. kernel-doc:: mm/pmalloc.c
-- 
2.14.1



[PATCH 5/6] Pmalloc selftest

2018-04-13 Thread Igor Stoppa
Add basic self-test functionality for pmalloc.

The testing is introduced as early as possible, right after the main
dependency, genalloc, has passed successfully, so that it can help
diagnosing failures in pmalloc users.

Signed-off-by: Igor Stoppa 
---
 include/linux/test_pmalloc.h |  24 
 init/main.c  |   2 +
 mm/Kconfig   |  10 
 mm/Makefile  |   1 +
 mm/test_pmalloc.c| 137 +++
 5 files changed, 174 insertions(+)
 create mode 100644 include/linux/test_pmalloc.h
 create mode 100644 mm/test_pmalloc.c

diff --git a/include/linux/test_pmalloc.h b/include/linux/test_pmalloc.h
new file mode 100644
index ..c7e2e451c17c
--- /dev/null
+++ b/include/linux/test_pmalloc.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * test_pmalloc.h
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+
+#ifndef __LINUX_TEST_PMALLOC_H
+#define __LINUX_TEST_PMALLOC_H
+
+
+#ifdef CONFIG_TEST_PROTECTABLE_MEMORY
+
+void test_pmalloc(void);
+
+#else
+
+static inline void test_pmalloc(void){};
+
+#endif
+
+#endif
diff --git a/init/main.c b/init/main.c
index b795aa341a3a..27f8479c4578 100644
--- a/init/main.c
+++ b/init/main.c
@@ -91,6 +91,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -679,6 +680,7 @@ asmlinkage __visible void __init start_kernel(void)
 */
mem_encrypt_init();
 
+   test_pmalloc();
 #ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
diff --git a/mm/Kconfig b/mm/Kconfig
index d7ef40eaa4e8..f98b4c0aebce 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -758,3 +758,13 @@ config PROTECTABLE_MEMORY
 depends on MMU
 depends on ARCH_HAS_SET_MEMORY
 default y
+
+config TEST_PROTECTABLE_MEMORY
+   bool "Run self test for pmalloc memory allocator"
+depends on MMU
+   depends on ARCH_HAS_SET_MEMORY
+   select PROTECTABLE_MEMORY
+   default n
+   help
+ Tries to verify that pmalloc works correctly and that the memory
+ is effectively protected.
diff --git a/mm/Makefile b/mm/Makefile
index 6a6668f99799..802cba37013b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_PROTECTABLE_MEMORY) += pmalloc.o
+obj-$(CONFIG_TEST_PROTECTABLE_MEMORY) += test_pmalloc.o
 obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += page_poison.o
 obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/test_pmalloc.c b/mm/test_pmalloc.c
new file mode 100644
index ..b0e091bf6329
--- /dev/null
+++ b/mm/test_pmalloc.c
@@ -0,0 +1,137 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * test_pmalloc.c
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#define SIZE_1 (PAGE_SIZE * 3)
+#define SIZE_2 1000
+
+
+/* wrapper for is_pmalloc_object() with messages */
+static inline bool validate_alloc(bool expected, void *addr,
+ unsigned long size)
+{
+   bool test;
+
+   test = is_pmalloc_object(addr, size) > 0;
+   pr_notice("must be %s: %s",
+ expected ? "ok" : "no", test ? "ok" : "no");
+   return test == expected;
+}
+
+
+#define is_alloc_ok(variable, size)\
+   validate_alloc(true, variable, size)
+
+
+#define is_alloc_no(variable, size)\
+   validate_alloc(false, variable, size)
+
+/* tests the basic life-cycle of a pool */
+static bool create_and_destroy_pool(void)
+{
+   static struct pmalloc_pool *pool;
+
+   pr_notice("Testing pool creation and destruction capability");
+
+   pool = pmalloc_create_pool();
+   if (WARN(!pool, "Cannot allocate memory for pmalloc selftest."))
+   return false;
+   pmalloc_destroy_pool(pool);
+   return true;
+}
+
+
+/*  verifies that it's possible to allocate from the pool */
+static bool test_alloc(void)
+{
+   static struct pmalloc_pool *pool;
+   static void *p;
+
+   pr_notice("Testing allocation capability");
+   pool = pmalloc_create_pool();
+   if (WARN(!pool, "Unable to allocate memory for pmalloc selftest."))
+   return false;
+   p = pmalloc(pool,  SIZE_1 - 1);
+   pmalloc_protect_pool(pool);
+   pmalloc_destroy_pool(pool);
+   if (WARN(!p, "Failed to allocate memory from the pool"))
+   return false;
+   return true;
+}
+
+
+/* tests the identification of pmalloc ranges */
+static bool test_is_pmalloc_object(void)
+{
+   struct pmalloc_pool *pool;
+   void *pmalloc_p;
+   void *vma

[RFC PATCH v19 0/8] mm: security: ro protection for dynamic data

2018-03-13 Thread Igor Stoppa
This patch-set introduces the possibility of protecting memory that has
been allocated dynamically.

The memory is managed in pools: when a memory pool is turned into R/O,
all the memory that is part of it will become R/O.

A R/O pool can be destroyed, to recover its memory, but it cannot be
turned back into R/W mode.

This is intentional. This feature is meant for data that doesn't need
further modifications after initialization.

However, the data might need to be released, for example as part of module
unloading.
To do this, the memory must first be freed, then the pool can be destroyed.

An example is provided, in the form of self-testing.

Changes since v18:

[http://www.openwall.com/lists/kernel-hardening/2018/02/28/21]

* Code refactoring in pmalloc & genalloc:
  - simplify the logic for handling pools before and after sysf init
  - reduced the section holding the mutex on the pmalloc list, when adding a pool
  - reduced the steps in finding the length of an existing allocation
  - split various functions into smaller ones
* clarified in the comments the need for pfree()
* Various fixes to the documentation:
  - remove kerneldoc duplicates
  - added cross-reference labels
  - miscellaneous typos
* improved error notifications: use WARNs with specific messages
* added missing tests for possible error conditions


Discussion topics that are unclear if they are closed and would need
comment from those who initiated them, if my answers are accepted or not:

* @Kees Cook proposed to have first self testing for genalloc, to
  validate the following patch, adding tracing of allocations
  My answer was that such tests would also need patching, therefore they
  could not certify that the functionality is correct both before and
  after the genalloc bitmap modification.

* @Kees Cook proposed to turn the self testing into modules.
  My answer was that the functionality is intentionally tested very early
  in the boot phase, to prevent unexplainable errors, should the feature
  really fail.

* @Matthew Wilcox proposed to use a different mechanism for the genalloc
  bitmap: 2 bitmaps, one for occupation and one for start.
  And possibly use an rbtree for the starts.
  My answer was that this solution is less optimized, because it scatters
  the data of one allocation across multiple words/pages, and it is no
  longer a single atomic transaction. Also, the typical distribution of
  allocation sizes is likely to eat up much more memory than the bitmap.

Igor Stoppa (8):
  genalloc: track beginning of allocations
  Add label to genalloc.rst for cross reference
  genalloc: selftest
  struct page: add field for vm_struct
  Protectable Memory
  Pmalloc selftest
  lkdtm: crash on overwriting protected pmalloc var
  Documentation for Pmalloc

 Documentation/core-api/genalloc.rst |   2 +
 Documentation/core-api/index.rst|   1 +
 Documentation/core-api/pmalloc.rst  | 111 ++
 drivers/misc/lkdtm.h|   1 +
 drivers/misc/lkdtm_core.c   |   3 +
 drivers/misc/lkdtm_perms.c  |  28 ++
 include/linux/genalloc.h| 116 +++---
 include/linux/mm_types.h|   1 +
 include/linux/pmalloc.h | 163 
 include/linux/test_genalloc.h   |  26 ++
 include/linux/test_pmalloc.h|  24 ++
 include/linux/vmalloc.h |   1 +
 init/main.c |   4 +
 lib/Kconfig |  15 +
 lib/Makefile|   1 +
 lib/genalloc.c  | 765 ++--
 lib/test_genalloc.c | 410 +++
 mm/Kconfig  |  17 +
 mm/Makefile |   2 +
 mm/pmalloc.c| 643 ++
 mm/test_pmalloc.c   | 238 +++
 mm/usercopy.c   |  33 ++
 mm/vmalloc.c|   2 +
 23 files changed, 2352 insertions(+), 255 deletions(-)
 create mode 100644 Documentation/core-api/pmalloc.rst
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 include/linux/test_genalloc.h
 create mode 100644 include/linux/test_pmalloc.h
 create mode 100644 lib/test_genalloc.c
 create mode 100644 mm/pmalloc.c
 create mode 100644 mm/test_pmalloc.c

-- 
2.14.1



[PATCH 1/8] genalloc: track beginning of allocations

2018-03-13 Thread Igor Stoppa
The genalloc library is only capable of tracking if a certain unit of
allocation is in use or not.

It is not capable of discerning where the memory associated with an
allocation request begins and where it ends.

The reason is that units of allocations are tracked by using a bitmap,
where each bit represents that the unit is either allocated (1) or
available (0).

The user of the API must keep track of how much space was requested, if
it ever needs to be freed.

This can cause errors to go undetected.
Examples:
* Only a subset of the memory provided to an allocation request is freed
* The memory from a subsequent allocation is freed
* The memory being freed doesn't start at the beginning of an
  allocation.

The bitmap is used because it allows lockless read/write access, where
supported by the hardware through cmpxchg.
Similarly, it is possible to scan the bitmap for a sufficiently long
sequence of zeros, to identify zones available for allocation.

This patch doubles the space reserved in the bitmap for each unit of
allocation, so that the beginning of each allocation can also be tracked.

For details, see the documentation inside lib/genalloc.c
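Very roughly, and only as an illustration of the idea (the exact encoding
is defined in lib/genalloc.c): each unit of allocation is now tracked by a
pair of bits, one recording occupancy ("space") and one recording whether
the unit is the first of an allocation ("start"). For example:

	unit:   0  1  2  3  4  5
	space:  1  1  1  0  1  1
	start:  1  0  0  0  1  0

	-> a 3-unit allocation starting at unit 0,
	   a 2-unit allocation starting at unit 4,
	   unit 3 free.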

The primary effect of this patch is that code using the genalloc
library no longer needs to keep track of the size of the allocations
it makes.

Prior to this patch, it was necessary to keep track of the size of the
allocation, so that it would be possible, later on, to know how much
space should be freed.

Now, users of the API can choose either to keep specifying the size
explicitly, or to let the library determine it, by passing a value of 0.
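For instance (a sketch; with this patch the caller may keep passing the
original size, or pass 0 and let genalloc determine the length itself):

	unsigned long addr;

	addr = gen_pool_alloc(pool, 713);
	/* ... use the memory ... */
	gen_pool_free(pool, addr, 0);	/* 0: genalloc looks up the length */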

However, even when the value is specified, the library still uses its own
bookkeeping of the space associated with a certain allocation, to
confirm that the two are consistent.

This verification also confirms that the patch works correctly.

Eventually, the extra parameter (and the corresponding verification)
could be dropped, in favor of a simplified API.

Signed-off-by: Igor Stoppa 
---
 include/linux/genalloc.h | 112 +++
 lib/genalloc.c   | 742 ++-
 2 files changed, 599 insertions(+), 255 deletions(-)

diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index 872f930f1b06..ff7229520656 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -32,7 +32,7 @@
 
 #include 
 #include 
-#include 
+#include 
 
 struct device;
 struct device_node;
@@ -76,7 +76,7 @@ struct gen_pool_chunk {
phys_addr_t phys_addr;  /* physical starting address of memory 
chunk */
unsigned long start_addr;   /* start address of memory chunk */
unsigned long end_addr; /* end address of memory chunk 
(inclusive) */
-   unsigned long bits[0];  /* bitmap for allocating memory chunk */
+   unsigned long entries[0];   /* bitmap for allocating memory chunk */
 };
 
 /*
@@ -93,74 +93,82 @@ struct genpool_data_fixed {
unsigned long offset;   /* The offset of the specific region */
 };
 
-extern struct gen_pool *gen_pool_create(int, int);
-extern phys_addr_t gen_pool_virt_to_phys(struct gen_pool *pool, unsigned long);
-extern int gen_pool_add_virt(struct gen_pool *, unsigned long, phys_addr_t,
-size_t, int);
-/**
- * gen_pool_add - add a new chunk of special memory to the pool
- * @pool: pool to add new memory chunk to
- * @addr: starting address of memory chunk to add to pool
- * @size: size in bytes of the memory chunk to add to pool
- * @nid: node id of the node the chunk structure and bitmap should be
- *   allocated on, or -1
- *
- * Add a new chunk of special memory to the specified pool.
- *
- * Returns 0 on success or a -ve errno on failure.
- */
+struct gen_pool *gen_pool_create(int min_alloc_order, int nid);
+
+int gen_pool_add_virt(struct gen_pool *pool, unsigned long virt,
+ phys_addr_t phys, size_t size, int nid);
+
+
 static inline int gen_pool_add(struct gen_pool *pool, unsigned long addr,
   size_t size, int nid)
 {
return gen_pool_add_virt(pool, addr, -1, size, nid);
 }
-extern void gen_pool_destroy(struct gen_pool *);
-extern unsigned long gen_pool_alloc(struct gen_pool *, size_t);
-extern unsigned long gen_pool_alloc_algo(struct gen_pool *, size_t,
-   genpool_algo_t algo, void *data);
-extern void *gen_pool_dma_alloc(struct gen_pool *pool, size_t size,
-   dma_addr_t *dma);
-extern void gen_pool_free(struct gen_pool *, unsigned long, size_t);
-extern void gen_pool_for_each_chunk(struct gen_pool *,
-   void (*)(struct gen_pool *, struct gen_pool_chunk *, void *), void *);
-extern size_t gen_pool_avail(struct gen_pool *);
-extern size_t gen_pool_size(struct gen_pool *);
 
-extern void gen_pool_set_algo(struct gen_pool *pool, genpool_algo_t algo,
-   void *data);
+phys_addr_t gen_pool_virt_to_phys(struct gen_pool *pool, unsigned long addr);
 
-extern unsigned long gen_pool_firs

[PATCH 3/8] genalloc: selftest

2018-03-13 Thread Igor Stoppa
Introduce a set of macros for writing concise test cases for genalloc.

The test cases are meant to provide regression testing, when working on
new functionality for genalloc.

Primarily they are meant to confirm that the various allocation strategies
will continue to work as expected.

The execution of the self testing is controlled through a Kconfig option.

The testing takes place in the very early stages of main.c, to ensure
that failures in genalloc are caught before they can cause unexplained
erratic behavior in any of genalloc's users.

Therefore, it would not be advisable to implement it as module.

Signed-off-by: Igor Stoppa 
---
 include/linux/test_genalloc.h |  26 +++
 init/main.c   |   2 +
 lib/Kconfig   |  15 ++
 lib/Makefile  |   1 +
 lib/test_genalloc.c   | 410 ++
 5 files changed, 454 insertions(+)
 create mode 100644 include/linux/test_genalloc.h
 create mode 100644 lib/test_genalloc.c

diff --git a/include/linux/test_genalloc.h b/include/linux/test_genalloc.h
new file mode 100644
index ..cc45c6c859cf
--- /dev/null
+++ b/include/linux/test_genalloc.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * test_genalloc.h
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+
+#ifndef __LINUX_TEST_GENALLOC_H
+#define __LINUX_TEST_GENALLOC_H
+
+
+#ifdef CONFIG_TEST_GENERIC_ALLOCATOR
+
+#include 
+
+void test_genalloc(void);
+
+#else
+
+static inline void test_genalloc(void){};
+
+#endif
+
+#endif
diff --git a/init/main.c b/init/main.c
index 969eaf140ef0..2bf1312fd2fe 100644
--- a/init/main.c
+++ b/init/main.c
@@ -90,6 +90,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -661,6 +662,7 @@ asmlinkage __visible void __init start_kernel(void)
 */
mem_encrypt_init();
 
+   test_genalloc();
 #ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
diff --git a/lib/Kconfig b/lib/Kconfig
index e96089499371..361514324d64 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -287,6 +287,21 @@ config DECOMPRESS_LZ4
 config GENERIC_ALLOCATOR
bool
 
+config TEST_GENERIC_ALLOCATOR
+   bool "genalloc tester"
+   default n
+   select GENERIC_ALLOCATOR
+   help
+ Enable automated testing of the generic allocator.
+ The testing is primarily for the tracking of allocated space.
+
+config TEST_GENERIC_ALLOCATOR_VERBOSE
+   bool "make the genalloc tester more verbose"
+   default n
+   select TEST_GENERIC_ALLOCATOR
+   help
+ More information will be displayed during the self-testing.
+
 #
 # reed solomon support is select'ed if needed
 #
diff --git a/lib/Makefile b/lib/Makefile
index a90d4fcd748f..5b5ee8d8f6d6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -108,6 +108,7 @@ obj-$(CONFIG_LIBCRC32C) += libcrc32c.o
 obj-$(CONFIG_CRC8) += crc8.o
 obj-$(CONFIG_XXHASH)   += xxhash.o
 obj-$(CONFIG_GENERIC_ALLOCATOR) += genalloc.o
+obj-$(CONFIG_TEST_GENERIC_ALLOCATOR) += test_genalloc.o
 
 obj-$(CONFIG_842_COMPRESS) += 842/
 obj-$(CONFIG_842_DECOMPRESS) += 842/
diff --git a/lib/test_genalloc.c b/lib/test_genalloc.c
new file mode 100644
index ..12a61c9e7558
--- /dev/null
+++ b/lib/test_genalloc.c
@@ -0,0 +1,410 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * test_genalloc.c
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+
+/*
+ * In case of failure of any of these tests, memory corruption is almost
+ * guaranteed; allowing the boot to continue means risking corruption of
+ * any filesystem/block device accessed in write mode.
+ * Therefore, BUG_ON() is used, when testing.
+ */
+
+
+/*
+ * Keep the bitmap small, while including case of cross-ulong mapping.
+ * For simplicity, the test cases use only 1 chunk of memory.
+ */
+#define BITMAP_SIZE_C 16
+#define ALLOC_ORDER 0
+
+#define ULONG_SIZE (sizeof(unsigned long))
+#define BITMAP_SIZE_UL (BITMAP_SIZE_C / ULONG_SIZE)
+#define MIN_ALLOC_SIZE (1 << ALLOC_ORDER)
+#define ENTRIES (BITMAP_SIZE_C * 8)
+#define CHUNK_SIZE  (MIN_ALLOC_SIZE * ENTRIES)
+
+#ifndef CONFIG_TEST_GENERIC_ALLOCATOR_VERBOSE
+
+static inline void print_first_chunk_bitmap(struct gen_pool *pool) {}
+
+#else
+
+static void print_first_chunk_bitmap(struct gen_pool *pool)
+{
+   struct gen_pool_chunk *chunk;
+   char bitmap[BITMAP_SIZE_C * 2 + 1];
+   unsigned long i;
+   char *bm = bitmap;
+   char *entry;
+
+   if (unlikely(pool == NULL || pool->chunks.next == NULL))
+   return;
+
+   chunk = container_of(pool->chunks.next, struct gen_pool_chunk,
+next_chunk);
+

[PATCH 2/8] Add label to genalloc.rst for cross reference

2018-03-13 Thread Igor Stoppa
Put a label at the beginning of the genalloc.rst, to allow other
documents to cross-reference it.

Signed-off-by: Igor Stoppa 
---
 Documentation/core-api/genalloc.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/core-api/genalloc.rst 
b/Documentation/core-api/genalloc.rst
index 6b38a39fab24..39dba5bb7b05 100644
--- a/Documentation/core-api/genalloc.rst
+++ b/Documentation/core-api/genalloc.rst
@@ -1,3 +1,5 @@
+.. _genalloc:
+
 The genalloc/genpool subsystem
 ==
 
-- 
2.14.1



[PATCH 4/8] struct page: add field for vm_struct

2018-03-13 Thread Igor Stoppa
When a page is used for virtual memory, it is often necessary to obtain
a handle to the corresponding vm_struct, which refers to the virtually
contiguous area generated when invoking vmalloc.

The struct page has a "mapping" field, which can be re-used to store a
pointer to the parent area.

This will avoid more expensive searches, later on.

Signed-off-by: Igor Stoppa 
---
 include/linux/mm_types.h | 1 +
 mm/vmalloc.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd1af6b9591d..c3a4825e10c0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -84,6 +84,7 @@ struct page {
void *s_mem;/* slab first object */
atomic_t compound_mapcount; /* first tail page */
/* page_deferred_list().next -- second tail page */
+   struct vm_struct *area;
};
 
/* Second double word */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ebff729cc956..61a1ca22b0f6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1536,6 +1536,7 @@ static void __vunmap(const void *addr, int 
deallocate_pages)
struct page *page = area->pages[i];
 
BUG_ON(!page);
+   page->area = NULL;
__free_pages(page, 0);
}
 
@@ -1705,6 +1706,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, 
gfp_t gfp_mask,
area->nr_pages = i;
goto fail;
}
+   page->area = area;
area->pages[i] = page;
if (gfpflags_allow_blocking(gfp_mask|highmem_mask))
cond_resched();
-- 
2.14.1



[PATCH 5/8] Protectable Memory

2018-03-13 Thread Igor Stoppa
The MMU available in many systems running Linux can often provide R/O
protection to the memory pages it handles.

However, the MMU-based protection works efficiently only when said pages
contain exclusively data that will not need further modifications.

Statically allocated variables can be segregated into a dedicated
section, but this does not sit very well with dynamically allocated
ones.

Dynamic allocation does not provide, currently, any means for grouping
variables in memory pages that would contain exclusively data suitable
for conversion to read only access mode.

The allocator here provided (pmalloc - protectable memory allocator)
introduces the concept of pools of protectable memory.

A module can request a pool and then refer any allocation request to the
pool handle it has received.

Once all the chunks of memory associated to a specific pool are
initialized, the pool can be protected.

After this point, the pool can only be destroyed (it is up to the module
to avoid any further references to the memory from the pool, after
the destruction is invoked).

The latter case is mainly meant for releasing memory, when a module is
unloaded.

A module can have as many pools as needed, for example to support the
protection of data that is initialized in sufficiently distinct phases.

Since pmalloc memory is obtained from vmalloc, an attacker that has
gained access to the physical mapping, still has to identify where the
target of the attack is actually located.

At the same time, being also based on genalloc, pmalloc does not
generate as much thrashing of the TLB as would be caused by using
vmalloc directly.

Signed-off-by: Igor Stoppa 
---
 include/linux/genalloc.h |   4 +
 include/linux/pmalloc.h  | 163 
 include/linux/vmalloc.h  |   1 +
 lib/genalloc.c   |  23 ++
 mm/Kconfig   |   7 +
 mm/Makefile  |   1 +
 mm/pmalloc.c | 643 +++
 mm/usercopy.c|  33 +++
 8 files changed, 875 insertions(+)
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 mm/pmalloc.c

diff --git a/include/linux/genalloc.h b/include/linux/genalloc.h
index ff7229520656..9e98f3c991a8 100644
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@ -120,6 +120,10 @@ void *gen_pool_dma_alloc(struct gen_pool *pool, size_t 
size, dma_addr_t *dma);
 void gen_pool_free(struct gen_pool *pool, unsigned long addr, size_t size);
 
 
+void gen_pool_flush_chunk(struct gen_pool *pool,
+ struct gen_pool_chunk *chunk);
+
+
 void gen_pool_for_each_chunk(struct gen_pool *pool,
 void (*func)(struct gen_pool *pool,
  struct gen_pool_chunk *chunk,
diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h
new file mode 100644
index ..3c393069c9f1
--- /dev/null
+++ b/include/linux/pmalloc.h
@@ -0,0 +1,163 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * pmalloc.h: Header for Protectable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+#ifndef _LINUX_PMALLOC_H
+#define _LINUX_PMALLOC_H
+
+
+#include 
+#include 
+
+#define PMALLOC_DEFAULT_ALLOC_ORDER (-1)
+
+/*
+ * Library for dynamic allocation of pools of memory that can be,
+ * after initialization, marked as read-only.
+ *
+ * This is intended to complement __ro_after_init, for those cases
+ * where either it is not possible to know the initialization value before
+ * init is completed, or the amount of data is variable and can be
+ * determined only at run-time.
+ *
+ * ***WARNING***
+ * The user of the API is expected to synchronize:
+ * 1) allocation,
+ * 2) writes to the allocated memory,
+ * 3) write protection of the pool,
+ * 4) freeing of the allocated memory, and
+ * 5) destruction of the pool.
+ *
+ * For a non-threaded scenario, this type of locking is not even required.
+ *
+ * Even if the library were to provide support for locking, point 2)
+ * would still depend on the user taking the lock.
+ */
+
+
+struct gen_pool *pmalloc_create_pool(const char *name,
+int min_alloc_order);
+
+
+int is_pmalloc_object(const void *ptr, const unsigned long n);
+
+
+bool pmalloc_expand_pool(struct gen_pool *pool, size_t size);
+
+
+void *pmalloc(struct gen_pool *pool, size_t size, gfp_t gfp);
+
+
+/**
+ * pzalloc() - zero-initialized version of pmalloc
+ * @pool: handle to the pool to be used for memory allocation
+ * @size: amount of memory (in bytes) requested
+ * @gfp: flags for page allocation
+ *
+ * Executes pmalloc, initializing the memory requested to 0,
+ * before returning the pointer to it.
+ *
+ * Return:
+ * * pointer to the memory requested   - success
+ * * NULL  - either no memory available or
+ *   pool already read-only
+ */
+static inline void *pzalloc(struct gen_pool *pool

[PATCH 8/8] Documentation for Pmalloc

2018-03-13 Thread Igor Stoppa
Detailed documentation about the protectable memory allocator.

Signed-off-by: Igor Stoppa 
---
 Documentation/core-api/index.rst   |   1 +
 Documentation/core-api/pmalloc.rst | 111 +
 2 files changed, 112 insertions(+)
 create mode 100644 Documentation/core-api/pmalloc.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index c670a8031786..8f5de42d6571 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -25,6 +25,7 @@ Core utilities
genalloc
errseq
printk-formats
+   pmalloc
 
 Interfaces for kernel debugging
 ===
diff --git a/Documentation/core-api/pmalloc.rst 
b/Documentation/core-api/pmalloc.rst
new file mode 100644
index ..10e01187d049
--- /dev/null
+++ b/Documentation/core-api/pmalloc.rst
@@ -0,0 +1,111 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _pmalloc:
+
+Protectable memory allocator
+
+
+Purpose
+---
+
+The pmalloc library is meant to provide read-only status to data that,
+for some reason, could neither be declared as constant, nor could it take
+advantage of the qualifier __ro_after_init, but is write-once and
+read-only in spirit.
+It protects data from both accidental and malicious overwrites.
+
+Example: A policy that is loaded from userspace.
+
+
+Concept
+---
+
+pmalloc builds on top of :ref:`genalloc `, using the same
+concept of memory pools.
+
+The value added by pmalloc is that now the memory contained in a pool can
+become read-only, for the rest of the life of the pool.
+
+Different kernel drivers and threads can use different pools, for finer
+control of what becomes read-only and when, and for improved lockless
+concurrency.
+
+
+Caveats
+---
+
+- To facilitate the conversion of existing code to pmalloc pools, several
+  helper functions are provided, mirroring their k/vmalloc counterparts.
+  In particular, pfree() is provided; it is mostly meant for error paths,
+  when one or more previous allocations must be rolled back.
+
+- Memory freed while a pool is not yet protected will be reused.
+
+- Once a pool is protected, it's not possible to allocate any more memory
+  from it.
+
+- Memory "freed" from a protected pool indicates that such memory is not
+  in use anymore by the requester; however, it will not become available
+  for further use, until the pool is destroyed.
+
+- pmalloc does not provide locking support with respect to allocating vs
+  protecting an individual pool, for performance reasons.
+  It is recommended not to share the same pool between unrelated functions.
+  Should sharing be a necessity, the user of the shared pool is expected
+  to implement locking for that pool.
+
+- pmalloc uses genalloc to optimize the use of the space it allocates
+  through vmalloc. Some more TLB entries will be used, however fewer than
+  when using vmalloc directly. The exact number depends on the
+  size of each allocation request and possible slack.
+
+- Considering that not much data is supposed to be dynamically allocated
+  and then marked as read-only, it shouldn't be an issue that the address
+  range for pmalloc is limited, on 32-bit systems.
+
+- Regarding SMP systems, the allocations are expected to happen mostly
+  during an initial transient, after which there should be no more need to
+  perform cross-processor synchronizations of page tables.
+
+
+Use
+---
+
+The typical sequence, when using pmalloc, is:
+
+#. create a pool
+
+   :c:func:`pmalloc_create_pool`
+
+#. [optional] pre-allocate some memory in the pool
+
+   :c:func:`pmalloc_prealloc`
+
+#. issue one or more allocation requests to the pool with locking as needed
+
+   :c:func:`pmalloc`
+
+   :c:func:`pzalloc`
+
+#. initialize the memory obtained with desired values
+
+#. [optional] iterate over points 3 & 4 as needed
+
+#. write-protect the pool
+
+   :c:func:`pmalloc_protect_pool`
+
+#. use in read-only mode the handles obtained through the allocations
+
+#. [optional] release all the memory allocated
+
+   :c:func:`pfree`
+
+#. [optional, but depends on point 8] destroy the pool
+
+   :c:func:`pmalloc_destroy_pool`
+
+API
+---
+
+.. kernel-doc:: include/linux/pmalloc.h
+.. kernel-doc:: mm/pmalloc.c
-- 
2.14.1



[PATCH 6/8] Pmalloc selftest

2018-03-13 Thread Igor Stoppa
Add basic self-test functionality for pmalloc.

The testing is introduced as early as possible, right after the main
dependency, genalloc, has passed successfully, so that it can help
diagnosing failures in pmalloc users.

Signed-off-by: Igor Stoppa 
---
 include/linux/test_pmalloc.h |  24 +
 init/main.c  |   2 +
 mm/Kconfig   |  10 ++
 mm/Makefile  |   1 +
 mm/test_pmalloc.c| 238 +++
 5 files changed, 275 insertions(+)
 create mode 100644 include/linux/test_pmalloc.h
 create mode 100644 mm/test_pmalloc.c

diff --git a/include/linux/test_pmalloc.h b/include/linux/test_pmalloc.h
new file mode 100644
index ..c7e2e451c17c
--- /dev/null
+++ b/include/linux/test_pmalloc.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * test_pmalloc.h
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+
+#ifndef __LINUX_TEST_PMALLOC_H
+#define __LINUX_TEST_PMALLOC_H
+
+
+#ifdef CONFIG_TEST_PROTECTABLE_MEMORY
+
+void test_pmalloc(void);
+
+#else
+
+static inline void test_pmalloc(void){};
+
+#endif
+
+#endif
diff --git a/init/main.c b/init/main.c
index 2bf1312fd2fe..ea44c940070a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -91,6 +91,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -663,6 +664,7 @@ asmlinkage __visible void __init start_kernel(void)
mem_encrypt_init();
 
test_genalloc();
+   test_pmalloc();
 #ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
diff --git a/mm/Kconfig b/mm/Kconfig
index 016d29b9400b..47b0843b02d2 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -767,3 +767,13 @@ config PROTECTABLE_MEMORY
 depends on ARCH_HAS_SET_MEMORY
 select GENERIC_ALLOCATOR
 default y
+
+config TEST_PROTECTABLE_MEMORY
+   bool "Run self test for pmalloc memory allocator"
+depends on MMU
+   depends on ARCH_HAS_SET_MEMORY
+   select PROTECTABLE_MEMORY
+   default n
+   help
+ Tries to verify that pmalloc works correctly and that the memory
+ is effectively protected.
diff --git a/mm/Makefile b/mm/Makefile
index 959fdbdac118..1de4be5fd0bc 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -66,6 +66,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_PROTECTABLE_MEMORY) += pmalloc.o
+obj-$(CONFIG_TEST_PROTECTABLE_MEMORY) += test_pmalloc.o
 obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += page_poison.o
 obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/test_pmalloc.c b/mm/test_pmalloc.c
new file mode 100644
index ..598119ffb0ed
--- /dev/null
+++ b/mm/test_pmalloc.c
@@ -0,0 +1,238 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * test_pmalloc.c
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#define SIZE_1 (PAGE_SIZE * 3)
+#define SIZE_2 1000
+
+static struct gen_pool *pool_unprot;
+static struct gen_pool *pool_prot;
+static struct gen_pool *pool_pre;
+
+static void *var_prot;
+static void *var_unprot;
+static void *var_vmall;
+
+/**
+ * validate_alloc() - wrapper for is_pmalloc_object with messages
+ * @expected: whether the test is supposed to succeed or not
+ * @addr: base address of the range to test
+ * @size: length of the range to test
+ */
+static inline bool validate_alloc(bool expected, void *addr,
+ unsigned long size)
+{
+   bool test;
+
+   test = is_pmalloc_object(addr, size) > 0;
+   pr_notice("must be %s: %s",
+ expected ? "ok" : "no", test ? "ok" : "no");
+   return test == expected;
+}
+
+
+#define is_alloc_ok(variable, size)\
+   validate_alloc(true, variable, size)
+
+
+#define is_alloc_no(variable, size)\
+   validate_alloc(false, variable, size)
+
+/**
+ * create_pools() - tries to instantiate the pools needed for the test
+ *
+ * Creates the respective instances for each pool used in the test.
+ * In case of error, it rolls back whatever previous step passed
+ * successfully.
+ *
+ * Return:
+ * * true  - success
+ * * false - something failed
+ */
+static bool create_pools(void)
+{
+   pr_notice("Testing pool creation capability");
+
+   pool_pre = pmalloc_create_pool("preallocated", 0);
+   if (unlikely(!pool_pre))
+   goto err_pre;
+
+   pool_unprot = pmalloc_create_pool("unprotected", 0);
+   if (unlikely(!pool_unprot))
+   goto err_unprot;
+
+   pool_prot = pmalloc_create_pool("protected", 0);
+   if (unlikely(!(pool_prot)))
+   goto err_prot;
+   return true;
+err_prot:
+

[PATCH 7/8] lkdtm: crash on overwriting protected pmalloc var

2018-03-13 Thread Igor Stoppa
Verify that pmalloc read-only protection is in place: trying to
overwrite a protected variable will crash the kernel.

Signed-off-by: Igor Stoppa 
---
 drivers/misc/lkdtm.h   |  1 +
 drivers/misc/lkdtm_core.c  |  3 +++
 drivers/misc/lkdtm_perms.c | 28 
 3 files changed, 32 insertions(+)

diff --git a/drivers/misc/lkdtm.h b/drivers/misc/lkdtm.h
index 9e513dcfd809..dcda3ae76ceb 100644
--- a/drivers/misc/lkdtm.h
+++ b/drivers/misc/lkdtm.h
@@ -38,6 +38,7 @@ void lkdtm_READ_BUDDY_AFTER_FREE(void);
 void __init lkdtm_perms_init(void);
 void lkdtm_WRITE_RO(void);
 void lkdtm_WRITE_RO_AFTER_INIT(void);
+void lkdtm_WRITE_RO_PMALLOC(void);
 void lkdtm_WRITE_KERN(void);
 void lkdtm_EXEC_DATA(void);
 void lkdtm_EXEC_STACK(void);
diff --git a/drivers/misc/lkdtm_core.c b/drivers/misc/lkdtm_core.c
index 2154d1bfd18b..c9fd42bda6ee 100644
--- a/drivers/misc/lkdtm_core.c
+++ b/drivers/misc/lkdtm_core.c
@@ -155,6 +155,9 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(ACCESS_USERSPACE),
CRASHTYPE(WRITE_RO),
CRASHTYPE(WRITE_RO_AFTER_INIT),
+#ifdef CONFIG_PROTECTABLE_MEMORY
+   CRASHTYPE(WRITE_RO_PMALLOC),
+#endif
CRASHTYPE(WRITE_KERN),
CRASHTYPE(REFCOUNT_INC_OVERFLOW),
CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
diff --git a/drivers/misc/lkdtm_perms.c b/drivers/misc/lkdtm_perms.c
index 53b85c9d16b8..0ac9023fd2b0 100644
--- a/drivers/misc/lkdtm_perms.c
+++ b/drivers/misc/lkdtm_perms.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Whether or not to fill the target memory area with do_nothing(). */
@@ -104,6 +105,33 @@ void lkdtm_WRITE_RO_AFTER_INIT(void)
*ptr ^= 0xabcd1234;
 }
 
+#ifdef CONFIG_PROTECTABLE_MEMORY
+void lkdtm_WRITE_RO_PMALLOC(void)
+{
+   struct gen_pool *pool;
+   int *i;
+
+   pool = pmalloc_create_pool("pool", 0);
+   if (unlikely(!pool)) {
+   pr_info("Failed preparing pool for pmalloc test.");
+   return;
+   }
+
+   i = (int *)pmalloc(pool, sizeof(int), GFP_KERNEL);
+   if (unlikely(!i)) {
+   pr_info("Failed allocating memory for pmalloc test.");
+   pmalloc_destroy_pool(pool);
+   return;
+   }
+
+   *i = INT_MAX;
+   pmalloc_protect_pool(pool);
+
+   pr_info("attempting bad pmalloc write at %p\n", i);
+   *i = 0;
+}
+#endif
+
 void lkdtm_WRITE_KERN(void)
 {
size_t size;
-- 
2.14.1



Re: [RFC PATCH v19 0/8] mm: security: ro protection for dynamic data

2018-03-14 Thread Igor Stoppa
On 13/03/18 23:45, Igor Stoppa wrote:

[...]

Some more thoughts about the open topics:

> Discussion topics that are unclear if they are closed and would need
> comment from those who initiated them, if my answers are accepted or not:
> 
> * @Kees Cook proposed to have first self testing for genalloc, to
>   validate the following patch, adding tracing of allocations
>   My answer was that such tests would also need patching, therefore they 
>   could not certify that the functionality is correct both before and
>   after the genalloc bitmap modification.

This is the only one where I still couldn't find a solution.
If Matthew Wilcox's proposal about implementing the the genalloc bitmap
would work, then this too could be done, but I think this alternate
bitmap proposed has problems. More on it below.

> * @Kees Cook proposed to turn the self testing into modules.
>   My answer was that the functionality is intentionally tested very early
>   in the boot phase, to prevent unexplainable errors, should the feature
>   really fail.

This could be workable, if it's acceptable that the early testing is
performed only when the module is compiled in.
I do not expect the module-based testing to bring much value, but it
doesn't do harm. Is this acceptable?

> * @Matthew Wilcox proposed to use a different mechanism for the genalloc
>   bitmap: 2 bitmaps, one for occupation and one for start.
>   And possibly use an rbtree for the starts.
>   My answer was that this solution is less optimized, because it scatters
>   the data of one allocation across multiple words/pages, plus is not
>   a transaction anymore. And the particular distribution of sizes of
>   allocation is likely to eat up much more memory than the bitmap.

I think I can describe a scenario where the split bitmaps would not work
(based on my understanding of the proposal), but I would appreciate a
review. Here it is:

* One allocation (let's call it allocation A) is already present in both
bitmaps:
  - its units of allocation are marked in the "space" bitmap
  - its starting bit is marked in the "starts" bitmap

* Another allocation (let's call it allocation B) is undergoing:
  - some of its units of allocation (starting from the beginning) are
marked in the "space" bitmap
  - the starting bit is *not* yet marked in the "starts" bitmap

* B occupies the space immediately after A

* While B is being written, A is freed

* To determine the length of A, the "space" bitmap will be
  searched, then the "starts" bitmap


The space initially allocated for B will be wrongly accounted to A,
because there is no empty gap in between and the beginning of B is not
yet marked.

The implementation which interleaves "space" and "start" does not suffer
from this sort of race, because the alteration of the interleaved
bitmap is atomic.

However, at the very least, some more explanation is needed in the
documentation/code, because this scenario is not exactly obvious.

Does this justification for the use of interleaved bitmaps (iow the
current implementation) make sense?

--
igor


Re: [RFC PATCH v19 0/8] mm: security: ro protection for dynamic data

2018-03-14 Thread Igor Stoppa


On 14/03/18 13:56, Matthew Wilcox wrote:
> On Wed, Mar 14, 2018 at 01:21:54PM +0200, Igor Stoppa wrote:

[...]

> You misread my proposal.  I did not suggest storing the 'start', but the
> 'end'.

Ok, but doesn't that only change the race scenario?

Attempting to free an allocation while it is still in progress (all the
"space" bits written, but the "end" bit not yet written) will also eat
up the following, complete allocation, if there is no locking in place.

[...]

>> The implementation which interleaves "space" and "start" does not suffer
>> from this sort of races, because the alteration of the interleaved
>> bitmaps is atomic.
> 
> This would be a bug in the allocator implementation.  Obviously it has to
> maintain the integrity of its own data structures.

But I cannot imagine how to do it, with the split bitmaps, without a
lock :-/
And genalloc is supposed to be lockless.

>> Does this justification for the use of interleaved bitmaps (iow the
>> current implementation) make sense?
> 
> I think you're making a mistake by basing the pmalloc allocator on
> genalloc.

It was recommended to me because it was a close match to the allocator
that I was writing from scratch and, when I looked at it, I could only
agree that it was very close.

But I have no particular reason for preferring it, if something better
is available. It was just never brought up before.
At least not that I noticed.

>  The page_frag allocator seems like a much better place to
> start than genalloc.  It has a significantly lower overhead and is
> much more suited to the kind of probably-identical-lifespan that the
> pmalloc API is going to persuade its users to have.


Could you please provide me a pointer?
I did a quick search on 4.16-rc5 and found the definition of page_frag
and sk_page_frag(). Is this what you are referring to?

--
igor


Re: [PATCH 5/8] Protectable Memory

2018-03-14 Thread Igor Stoppa


On 14/03/18 14:15, Matthew Wilcox wrote:
> On Tue, Mar 13, 2018 at 11:45:51PM +0200, Igor Stoppa wrote:
>> +static inline void *pmalloc_array(struct gen_pool *pool, size_t n,
>> +  size_t size, gfp_t flags)
>> +{
>> +if (unlikely(!(pool && n && size)))
>> +return NULL;
> 
> Why not use the same formula as kvmalloc_array here?  You've failed to
> protect against integer overflow, which is the whole point of pmalloc_array.
> 
>   if (size != 0 && n > SIZE_MAX / size)
>   return NULL;


oops :-(
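
For reference, a sketch of the fixed helper, using the overflow check
quoted above (and assuming the pmalloc(pool, size, gfp) call from this
series underneath):

static inline void *pmalloc_array(struct gen_pool *pool, size_t n,
				  size_t size, gfp_t flags)
{
	/* Same formula as kvmalloc_array(): reject n * size overflow. */
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;
	return pmalloc(pool, n * size, flags);
}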

>> +static inline char *pstrdup(struct gen_pool *pool, const char *s, gfp_t gfp)
>> +{
>> +size_t len;
>> +char *buf;
>> +
>> +if (unlikely(pool == NULL || s == NULL))
>> +return NULL;
> 
> No, delete these checks.  They'll mask real bugs.

I thought I got rid of all of them, but some have escaped me

>> +static inline void pfree(struct gen_pool *pool, const void *addr)
>> +{
>> +gen_pool_free(pool, (unsigned long)addr, 0);
>> +}
> 
> It's poor form to use a different subsystem's type here.  It ties you
> to genpool, so if somebody wants to replace it, you have to go through
> all the users and change them.  If you use your own type, it's a much
> easier task.

I thought about it, but typedef came to my mind and knowing it's usually
frowned upon, I restrained myself.

> struct pmalloc_pool {
>   struct gen_pool g;
> }

I didn't think this could be acceptable either. But if it is, then ok.

> then:
> 
> static inline void pfree(struct pmalloc_pool *pool, const void *addr)
> {
>   gen_pool_free(&pool->g, (unsigned long)addr, 0);
> }
> 
> Looking further down, you could (should) move the contents of pmalloc_data
> into pmalloc_pool; that's one fewer object to keep track of.
> 
>> +struct pmalloc_data {
>> +struct gen_pool *pool;  /* Link back to the associated pool. */
>> +bool protected; /* Status of the pool: RO or RW. */
>> +struct kobj_attribute attr_protected; /* Sysfs attribute. */
>> +struct kobj_attribute attr_avail; /* Sysfs attribute. */
>> +struct kobj_attribute attr_size;  /* Sysfs attribute. */
>> +struct kobj_attribute attr_chunks;/* Sysfs attribute. */
>> +struct kobject *pool_kobject;
>> +struct list_head node; /* list of pools */
>> +};
> 
> sysfs attributes aren't free, you know.  I appreciate you want something
> to help debug / analyse, but having one file for the whole subsystem or
> at least one per pool would be a better idea.

Which means that it should not be normal sysfs, but rather debugfs, if I
understand correctly, since in sysfs 1 value -> 1 file.
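
Coming back to the type question above, a sketch of the merged structure
(field names copied from pmalloc_data; whether embedding struct gen_pool
like this is really workable still needs to be verified):

struct pmalloc_pool {
	struct gen_pool gp;	/* embedded, instead of a pointer */
	bool protected;		/* Status of the pool: RO or RW. */
	struct kobj_attribute attr_protected;
	struct kobj_attribute attr_avail;
	struct kobj_attribute attr_size;
	struct kobj_attribute attr_chunks;
	struct kobject *pool_kobject;
	struct list_head node;	/* list of pools */
};

static inline void pfree(struct pmalloc_pool *pool, const void *addr)
{
	gen_pool_free(&pool->gp, (unsigned long)addr, 0);
}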

--
igor



Re: [RFC PATCH v19 0/8] mm: security: ro protection for dynamic data

2018-03-14 Thread Igor Stoppa


On 14/03/18 15:04, Matthew Wilcox wrote:

> I don't necessarily think you should use it as-is,

I think I simply cannot use it as-is, because it seems to use linear
memory, while I need virtual. This reason alone would require a rewrite
of several parts.

> but the principle it uses
> seems like a better match to me than the rather complex genalloc.

It uses metadata in a different way than genalloc.
There is probably a tipping point where one implementation becomes more
space-efficient than the other.

Probably page_frag does well with relatively large allocations, while
genalloc seems to be better for small (few allocation units) allocations.

Also, in case of high variance in the size of the allocations, genalloc
requires the allocation unit to be small enough to fit the smallest
request (otherwise one must accept some slack), while page_frag doesn't
care if the allocation is small or large.

page_frag, otoh, seems to not support the reuse of space that was freed,
since there is only a single offset tracking the current allocation point.

But could you please explain what you are referring to, when you say
that page_frag has "significantly lower overhead"?

Is it because it doesn't try to reclaim space that was freed, until the
whole page is empty?

I see different trade-offs, but I am probably either missing or
underestimating the main reason why you think this is better.

And probably I am missing the capability of judging what is acceptable
in certain cases.

Ex: if the pfree is called only on error paths, is it ok to not claim
back the memory released, if it's less than one page?

To be clear: I do not want to hold to genalloc just because I have
already implemented it. I can at least sketch a version with page_frag,
but I would like to understand why its trade-offs are better :-)

> Just allocate some pages and track the offset within those pages that
> is the current allocation point.
> It's less than 100 lines of code!


Strictly speaking it is true, but it all relies on other functions,
which must be rewritten, because they use linear addresses, while this
must work with virtual (vmalloc) addresses.

Also, I see that the code relies a lot on order of allocation.
I think we had similar discussion wrt compound pages.

It seems to me wasteful, if I have a request of, say, 5 pages, and I end
up allocating 8.

I do not recall anyone giving a justification like:
"yeah, it uses extra pages, but it's preferable, for reasons X, Y and Z,
so it's a good trade-off"

Could it be that normal RAM is considered less precious than the
special memory genalloc is written for, so normal RAM is not really
proactively reused, while special memory is treated as a more valuable
resource that should not be wasted?


--
igor


Re: [PATCH 10/17] prmem: documentation

2018-11-21 Thread Igor Stoppa

Hi,

On 13/11/2018 20:36, Andy Lutomirski wrote:

On Tue, Nov 13, 2018 at 10:33 AM Igor Stoppa  wrote:


I forgot one sentence :-(

On 13/11/2018 20:31, Igor Stoppa wrote:

On 13/11/2018 19:47, Andy Lutomirski wrote:


For general rare-writish stuff, I don't think we want IRQs running
with them mapped anywhere for write.  For AVC and IMA, I'm less sure.


Why would these be less sensitive?

But I see a big difference between my initial implementation and this one.

In my case, by using a shared mapping, visible to all cores, freezing
the core that is performing the write would have exposed the writable
mapping to a potential attack run from another core.

If the mapping is private to the core performing the write, even if it
is frozen, it's much harder to figure out what it had mapped and where,
from another core.

To access that mapping, the attack should be performed from the ISR, I
think.


Unless the secondary mapping is also available to other cores, through
the shared mm_struct ?



I don't think this matters much.  The other cores will only be able to
use that mapping when they're doing a rare write.



I'm still mulling over this.
There might be other reasons for replicating the mm_struct.

If I understand correctly how the text patching works, it happens 
sequentially, because of the text_mutex used by arch_jump_label_transform


Which might be fine for this specific case, but I think I shouldn't 
introduce a global mutex, when it comes to data.
Most likely, if two or more cores want to perform a write rare 
operation, there is no correlation between them, they could proceed in 
parallel. And if there really is, then the user of the API should 
introduce its own locking, for that specific case.


A somewhat unrelated question about text patching: I see that each
patching operation is validated, but wouldn't it be more robust to first
validate all of them and, only after they are all found to be
compliant, proceed with the actual modifications?


And about the actual implementation of the write rare for the statically 
allocated variables, is it expected that I use Nadav's function?

Or that I refactor the code?

The name, referring to text would definitely not be ok for data.
And I would have to also generalize it, to deal with larger amounts of data.

I would find it easier, as a first cut, to replicate its behavior and
refactor only later, once it has stabilized and possibly Nadav's patches
have been acked.


--
igor


Re: [PATCH 10/17] prmem: documentation

2018-11-13 Thread Igor Stoppa
On 13/11/2018 19:16, Andy Lutomirski wrote:
> On Tue, Nov 13, 2018 at 6:25 AM Igor Stoppa  wrote:

[...]

>> How about having one mm_struct for each writer (core or thread)?
>>
> 
> I don't think that helps anything.  I think the mm_struct used for
> prmem (or rare_write or whatever you want to call it)

write_rare / rarely can be shortened to wr_, which is kinda less
confusing than rare_write, since the latter would become rw_ and be
easier to confuse with R/W

Any advice for better naming is welcome.

> should be
> entirely abstracted away by an appropriate API, so neither SELinux nor
> IMA need to be aware that there's an mm_struct involved.

Yes, that is fine. In my proposal I was thinking about tying it to the
core/thread that performs the actual write.

The high level API could be something like:

wr_memcpy(void *src, void *dst, uint_t size)

>  It's also
> entirely possible that some architectures won't even use an mm_struct
> behind the scenes -- x86, for example, could have avoided it if there
> were a kernel equivalent of PKRU.  Sadly, there isn't.

The mm_struct - or whatever is the means to do the write on that
architecture - can be kept hidden from the API.

But the reason why I was proposing to have one mm_struct per writer is
that, iiuc, the secondary mapping is created in the secondary mm_struct
for each writer using it.

So the updating of IMA measurements would have, theoretically, also
write access to the SELinux AVC. Which I was trying to avoid.
And similarly any other write rare updater. Is this correct?

>> 2) Iiuc, the purpose of the 2 pages being remapped is that the target of
>> the patch might spill across the page boundary, however if I deal with
>> the modification of generic data, I shouldn't (shouldn't I?) assume that
>> the data will not span across multiple pages.
> 
> The reason for the particular architecture of text_poke() is to avoid
> memory allocation to get it working.  i think that prmem/rare_write
> should have each rare-writable kernel address map to a unique user
> address, possibly just by offsetting everything by a constant.  For
> rare_write, you don't actually need it to work as such until fairly
> late in boot, since the rare_writable data will just be writable early
> on.

Yes, that is true. I think it's safe to assume, from an attack pattern,
that as long as user space is not started, the system can be considered
ok. Even user-space code run from initrd should be ok, since it can be
bundled (and signed) as a single binary with the kernel.

Modules loaded from a regular filesystem are a bit more risky, because
an attack might inject a rogue key in the key-ring and use it to load
malicious modules.

>> If the data spans across multiple pages, in unknown amount, I suppose
>> that I should not keep interrupts disabled for an unknown time, as it
>> would hurt preemption.
>>
>> What I thought, in my initial patch-set, was to iterate over each page
>> that must be written to, in a loop, re-enabling interrupts in-between
>> iterations, to give pending interrupts a chance to be served.
>>
>> This would mean that the data being written to would not be consistent,
>> but it's a problem that would have to be addressed anyways, since it can
>> be still read by other cores, while the write is ongoing.
> 
> This probably makes sense, except that enabling and disabling
> interrupts means you also need to restore the original mm_struct (most
> likely), which is slow.  I don't think there's a generic way to check
> whether in interrupt is pending without turning interrupts on.

The only "excuse" I have is that write_rare is opt-in and is "rare".
Maybe the enabling/disabling of interrupts - and the consequent switch
of mm_struct - could be somehow tied to the latency configuration?

If preemption is disabled, the expectations on the system latency are
anyway more relaxed.

But I'm not sure how it would work against I/O.

--
igor


Re: [PATCH 2/6] __wr_after_init: write rare for static allocation

2018-12-09 Thread Igor Stoppa

On 06/12/2018 01:13, Andy Lutomirski wrote:


+   kasan_disable_current();
+   if (op == WR_MEMCPY)
+   memcpy((void *)wr_poking_addr, (void *)src, len);
+   else if (op == WR_MEMSET)
+   memset((u8 *)wr_poking_addr, (u8)src, len);
+   else if (op == WR_RCU_ASSIGN_PTR)
+   /* generic version of rcu_assign_pointer */
+   smp_store_release((void **)wr_poking_addr,
+ RCU_INITIALIZER((void **)src));
+   kasan_enable_current();


Hmm.  I suspect this will explode quite badly on sane architectures
like s390.  (In my book, despite how weird s390 is, it has a vastly
nicer model of "user" memory than any other architecture I know
of...).


I see. I can try to also set up a qemu target for s390, for my tests.
There seems to be a Debian image, to have a fully bootable system.


I think you should use copy_to_user(), etc, instead.


I'm having trouble with the "etc" part: as far as I can see, there is
both generic and architecture-specific support for copying and clearing
user-space memory from the kernel, however I couldn't find anything that
looks like a memset_user().


I can of course roll my own, for example by iterating copy_to_user() with
the support of a pre-allocated static buffer (1 page should be enough).


But, before I go down this path, I wanted to confirm that there's really 
nothing better that I could use.


If that's really the case, the static buffer instance should be 
replicated for each core, I think, since each core could be performing 
its own memset_user() at the same time.
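
Something along these lines is what I have in mind; only a sketch, with
a made-up name for the per-CPU buffer and minimal error handling:

#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/string.h>
#include <linux/uaccess.h>

static DEFINE_PER_CPU(u8 [PAGE_SIZE], wr_scratch_page);

/* Fill 'len' bytes at the user-space alias 'to' with the byte 'c'. */
static unsigned long wr_memset_user(void __user *to, int c, unsigned long len)
{
	u8 *buf = get_cpu_var(wr_scratch_page);
	unsigned long uncopied = 0;

	memset(buf, c, min_t(unsigned long, len, PAGE_SIZE));
	while (len && !uncopied) {
		unsigned long chunk = min_t(unsigned long, len, PAGE_SIZE);

		uncopied = copy_to_user(to, buf, chunk);
		to += chunk - uncopied;
		len -= chunk - uncopied;
	}
	put_cpu_var(wr_scratch_page);
	return len;	/* bytes left unset, 0 on success */
}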


Alternatively, I could do a loop of WRITE_ONCE(); however, I'm not sure
how that would work with (lack of) alignment, and it might also require a
preamble/epilogue to deal with unaligned data.



 I'm not
entirely sure what the best smp_store_release() replacement is.
Making this change may also mean you can get rid of the
kasan_disable_current().


+
+   barrier(); /* XXX redundant? */


I think it's redundant.  If unuse_temporary_mm() allows earlier stores
to hit the wrong address space, then something is very very wrong, and
something is also very very wrong if the optimizer starts moving
stores across a function call that is most definitely a barrier.


ok, thanks


+
+   unuse_temporary_mm(prev);
+   /* XXX make the verification optional? */
+   if (op == WR_MEMCPY)
+   BUG_ON(memcmp((void *)dst, (void *)src, len));
+   else if (op == WR_MEMSET)
+   BUG_ON(memtst((void *)dst, (u8)src, len));
+   else if (op == WR_RCU_ASSIGN_PTR)
+   BUG_ON(*(unsigned long *)dst != src);


Hmm.  If you allowed cmpxchg or even plain xchg, then these bug_ons
would be thoroughly buggy, but maybe they're okay.  But they should,
at most, be WARN_ON_ONCE(), 


I have to confess that I do not understand why Nadav's patchset was
required to use BUG_ON(), while here it's not correct, not even for
memcpy or memset.


Is it because it is single-threaded?
Or is it because text_poke() is patching code, instead of data?
I can turn to WARN_ON_ONCE(), but I'd like to understand the reason.


given that you can trigger them by writing
the same addresses from two threads at once, and this isn't even
entirely obviously bogus given the presence of smp_store_release().


True; however, would it be reasonable to require the use of an explicit
writer lock from the user?


This operation is not exactly fast and should happen seldom; I'm not 
sure if it's worth supporting cmpxchg. The speedup would be minimal.


I'd rather not implement the locking implicitly, even if it would be 
possible to detect simultaneous writes, because it might lead to overall 
inconsistent data.


--
igor


Re: [PATCH 2/6] __wr_after_init: write rare for static allocation

2018-12-09 Thread Igor Stoppa




On 06/12/2018 06:44, Matthew Wilcox wrote:

On Tue, Dec 04, 2018 at 02:18:01PM +0200, Igor Stoppa wrote:

+void *__wr_op(unsigned long dst, unsigned long src, __kernel_size_t len,
+ enum wr_op_type op)
+{
+   temporary_mm_state_t prev;
+   unsigned long flags;
+   unsigned long offset;
+   unsigned long wr_poking_addr;
+
+   /* Confirm that the writable mapping exists. */
+   BUG_ON(!wr_ready);
+
+   if (WARN_ONCE(op >= WR_OPS_NUMBER, "Invalid WR operation.") ||
+   WARN_ONCE(!is_wr_after_init(dst, len), "Invalid WR range."))
+   return (void *)dst;
+
+   offset = dst - (unsigned long)&__start_wr_after_init;
+   wr_poking_addr = wr_poking_base + offset;
+   local_irq_save(flags);


Why not local_irq_disable()?  Do we have a use-case for wanting to access
this from interrupt context?


No, not that I can think of. It was "just in case", but I can remove it.


+   /* XXX make the verification optional? */


Well, yes.  It seems like debug code to me.


Ok, I was not sure about this, because text_poke() does it as part of 
its normal operations.



+   /* Randomize the poking address base*/
+   wr_poking_base = TASK_UNMAPPED_BASE +
+   (kaslr_get_random_long("Write Rare Poking") & PAGE_MASK) %
+   (TASK_SIZE - (TASK_UNMAPPED_BASE + wr_range));


I don't think this is a great idea.  We want to use the same mm for both
static and dynamic wr memory, yes?  So we should have enough space for
all of ram, not splatter the static section all over the address space.

On x86-64 (4 level page tables), we have a 64TB space for all of physmem
and 128TB of user space, so we can place the base anywhere in a 64TB
range.


I was actually wondering about the dynamic part.
It's still not clear to me if it's possible to write the code in a 
sufficiently generic way that it could work on all 64 bit architectures.

I'll start with x86-64 as you suggest.

--
igor



Re: [PATCH 2/6] __wr_after_init: write rare for static allocation

2018-12-09 Thread Igor Stoppa




On 06/12/2018 11:44, Peter Zijlstra wrote:

On Wed, Dec 05, 2018 at 03:13:56PM -0800, Andy Lutomirski wrote:


+   if (op == WR_MEMCPY)
+   memcpy((void *)wr_poking_addr, (void *)src, len);
+   else if (op == WR_MEMSET)
+   memset((u8 *)wr_poking_addr, (u8)src, len);
+   else if (op == WR_RCU_ASSIGN_PTR)
+   /* generic version of rcu_assign_pointer */
+   smp_store_release((void **)wr_poking_addr,
+ RCU_INITIALIZER((void **)src));
+   kasan_enable_current();


Hmm.  I suspect this will explode quite badly on sane architectures
like s390.  (In my book, despite how weird s390 is, it has a vastly
nicer model of "user" memory than any other architecture I know
of...).  I think you should use copy_to_user(), etc, instead.  I'm not
entirely sure what the best smp_store_release() replacement is.
Making this change may also mean you can get rid of the
kasan_disable_current().


If you make the MEMCPY one guarantee single-copy atomicity for native
words then you're basically done.

smp_store_release() can be implemented with:

smp_mb();
WRITE_ONCE();

So if we make MEMCPY provide the WRITE_ONCE(), all we need is that
barrier, which we can easily place at the call site and not overly
complicate our interface with this.


Ok, so the 3rd case (WR_RCU_ASSIGN_PTR) could be handled outside of this 
function.
But, since now memcpy() will be replaced by copy_to_user(), can I assume 
that also copy_to_user() will be atomic, if the destination is properly 
aligned? On x86_64 it seems yes, however it's not clear to me if this is 
the outcome of an optimization or if I can expect it to be always true.
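
IOW, if wr_memcpy() guarantees a WRITE_ONCE()-like, single-copy atomic
store for an aligned, pointer-sized destination, the RCU case could be
reduced to something like this (sketch only, naming and exact form still
open):

/* Assumes wr_memcpy() does the WRITE_ONCE() part for pointer-sized dst. */
#define wr_rcu_assign_pointer(p, v) ({					\
	typeof(v) __val = (v);						\
									\
	/* order the payload writes before publishing the pointer */	\
	smp_mb();							\
	wr_memcpy(&(p), &__val, sizeof(p));				\
	__val;								\
})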



--
igor


[PATCH 3/6] rodata_test: refactor tests

2018-12-04 Thread Igor Stoppa
Refactor the test cases, in preparation for using them also for testing
__wr_after_init memory.

Signed-off-by: Igor Stoppa 

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
---
 mm/rodata_test.c | 48 
 1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/mm/rodata_test.c b/mm/rodata_test.c
index d908c8769b48..3c1e515ca9b1 100644
--- a/mm/rodata_test.c
+++ b/mm/rodata_test.c
@@ -14,44 +14,52 @@
 #include 
 #include 
 
-static const int rodata_test_data = 0xC3;
+#define INIT_TEST_VAL 0xC3
 
-void rodata_test(void)
+static const int rodata_test_data = INIT_TEST_VAL;
+
+static bool test_data(char *data_type, const int *data,
+ unsigned long start, unsigned long end)
 {
-   unsigned long start, end;
int zero = 0;
 
/* test 1: read the value */
/* If this test fails, some previous testrun has clobbered the state */
-   if (!rodata_test_data) {
-   pr_err("test 1 fails (start data)\n");
-   return;
+   if (*data != INIT_TEST_VAL) {
+   pr_err("%s: test 1 fails (init data value)\n", data_type);
+   return false;
}
 
/* test 2: write to the variable; this should fault */
-   if (!probe_kernel_write((void *)&rodata_test_data,
-   (void *)&zero, sizeof(zero))) {
-   pr_err("test data was not read only\n");
-   return;
+   if (!probe_kernel_write((void *)data, (void *)&zero, sizeof(zero))) {
+   pr_err("%s: test data was not read only\n", data_type);
+   return false;
}
 
/* test 3: check the value hasn't changed */
-   if (rodata_test_data == zero) {
-   pr_err("test data was changed\n");
-   return;
+   if (*data != INIT_TEST_VAL) {
+   pr_err("%s: test data was changed\n", data_type);
+   return false;
}
 
/* test 4: check if the rodata section is PAGE_SIZE aligned */
-   start = (unsigned long)__start_rodata;
-   end = (unsigned long)__end_rodata;
if (start & (PAGE_SIZE - 1)) {
-   pr_err("start of .rodata is not page size aligned\n");
-   return;
+   pr_err("%s: start of data is not page size aligned\n",
+  data_type);
+   return false;
}
if (end & (PAGE_SIZE - 1)) {
-   pr_err("end of .rodata is not page size aligned\n");
-   return;
+   pr_err("%s: end of data is not page size aligned\n",
+  data_type);
+   return false;
}
+   return true;
+}
 
-   pr_info("all tests were successful\n");
+void rodata_test(void)
+{
+   if (test_data("rodata", &rodata_test_data,
+ (unsigned long)&__start_rodata,
+ (unsigned long)&__end_rodata))
+   pr_info("all tests were successful\n");
 }
-- 
2.19.1



[RFC v1 PATCH 0/6] hardening: statically allocated protected memory

2018-12-04 Thread Igor Stoppa
This patch-set is the first-cut implementation of write-rare memory
protection, as previously agreed [1]
Its purpose is to write-protect kernel data which is seldom
modified.
There is no read overhead, however writing requires special operations that
are probably unsuitable for often-changing data.
The use is opt-in, by applying the modifier __wr_after_init to a variable
declaration.

As the name implies, the write protection kicks in only after init() is
completed; before that moment, the data is modifiable in the usual way.

Current Limitations:
* supports only data which is allocated statically, at build time.
* supports only x86_64
* might not work for very large amounts of data, since it relies on the
  assumption that said data can be entirely remapped, at init.


Some notes:
- even if the code is only for x86_64, it is placed in the generic
  locations, with the intention of extending it also to arm64
- the current section used for collecting wr-after-init data might need to
  be moved, to work with arm64 MMU
- the functionality is in its own c and h files, for now, to ease the
  introduction (and refactoring) of code dealing with dynamic allocation
- recently some updated patches were posted for live-patch on arm64 [2],
  they might help with adding arm64 support here
- to avoid the risk of weakening __ro_after_init, __wr_after_init data is
  in a separate set of pages, and any invocation will confirm that the
  memory affected falls within this range.
  I have modified rodata_test accordingly, to check also this case.
- to avoid replicating the code which does the change of mapping, there is
  only one function performing multiple, selectable, operations, such as
  memcpy(), memset(). I have added also rcu_assign_pointer() as further
  example. But I'm not too fond of this implementation either. I just
  couldn't think of any that I would like significantly better.
- I have left out the patchset from Nadav that these patches depend on,
  but it can be found here [3] (Should I have resubmitted it?)
- I am not sure what is the correct form for giving proper credit wrt the
  authoring of the wr_after_init mechanism, guidance would be appreciated
- In an attempt to spam less people, I have curbed the list of recipients.
  If I have omitted someone who should have been kept/added, please
  add them to the thread.


[1] https://www.openwall.com/lists/kernel-hardening/2018/11/22/8
[2] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793199.html
[3] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1810245.html

Signed-off-by: Igor Stoppa 

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org



Igor Stoppa (6):
[PATCH 1/6] __wr_after_init: linker section and label
[PATCH 2/6] __wr_after_init: write rare for static allocation
[PATCH 3/6] rodata_test: refactor tests
[PATCH 4/6] rodata_test: add verification for __wr_after_init
[PATCH 5/6] __wr_after_init: test write rare functionality
[PATCH 6/6] __wr_after_init: lkdtm test

drivers/misc/lkdtm/core.c |   3 +
drivers/misc/lkdtm/lkdtm.h|   3 +
drivers/misc/lkdtm/perms.c|  29 
include/asm-generic/vmlinux.lds.h |  20 ++
include/linux/cache.h |  17 +
include/linux/prmem.h | 134 +
init/main.c   |   2 +
mm/Kconfig|   4 ++
mm/Kconfig.debug  |   9 +++
mm/Makefile   |   2 +
mm/prmem.c| 124 ++
mm/rodata_test.c  |  63 --
mm/test_write_rare.c  | 135 ++
13 files changed, 525 insertions(+), 20 deletions(-)





[PATCH 1/6] __wr_after_init: linker section and label

2018-12-04 Thread Igor Stoppa
Introduce a section and a label for statically allocated write rare
data. The label is named "__wr_after_init".
As the name implies, after the init phase is completed, this section
will be modifiable only by invoking write rare functions.
The section must take up a set of full pages.

Signed-off-by: Igor Stoppa 

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
---
 include/asm-generic/vmlinux.lds.h | 20 
 include/linux/cache.h | 17 +
 2 files changed, 37 insertions(+)

diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index 3d7a6a9c2370..b711dbe6999f 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -311,6 +311,25 @@
KEEP(*(__jump_table))   \
__stop___jump_table = .;
 
+/*
+ * Allow architectures to handle wr_after_init data on their
+ * own by defining an empty WR_AFTER_INIT_DATA.
+ * However, it's important that pages containing WR_RARE data do not
+ * hold anything else, to avoid both accidentally unprotecting something
+ * that is supposed to stay read-only all the time and also to protect
+ * something else that is supposed to be writeable all the time.
+ */
+#ifndef WR_AFTER_INIT_DATA
+#define WR_AFTER_INIT_DATA(align)  \
+   . = ALIGN(PAGE_SIZE);   \
+   __start_wr_after_init = .;  \
+   . = ALIGN(align);   \
+   *(.data..wr_after_init) \
+   . = ALIGN(PAGE_SIZE);   \
+   __end_wr_after_init = .;\
+   . = ALIGN(align);
+#endif
+
 /*
  * Allow architectures to handle ro_after_init data on their
  * own by defining an empty RO_AFTER_INIT_DATA.
@@ -332,6 +351,7 @@
__start_rodata = .; \
*(.rodata) *(.rodata.*) \
RO_AFTER_INIT_DATA  /* Read only after init */  \
+   WR_AFTER_INIT_DATA(align) /* wr after init */   \
KEEP(*(__vermagic)) /* Kernel version magic */  \
. = ALIGN(8);   \
__start___tracepoints_ptrs = .; \
diff --git a/include/linux/cache.h b/include/linux/cache.h
index 750621e41d1c..9a7e7134b887 100644
--- a/include/linux/cache.h
+++ b/include/linux/cache.h
@@ -31,6 +31,23 @@
 #define __ro_after_init __attribute__((__section__(".data..ro_after_init")))
 #endif
 
+/*
+ * __wr_after_init is used to mark objects that cannot be modified
+ * directly after init (i.e. after mark_rodata_ro() has been called).
+ * These objects become effectively read-only, from the perspective of
+ * performing a direct write, like a variable assignment.
+ * However, they can be altered through a dedicated function.
+ * It is intended for those objects which are occasionally modified after
+ * init, however they are modified so seldom, that the extra cost from
+ * the indirect modification is either negligible or worth paying, for the
+ * sake of the protection gained.
+ */
+#ifndef __wr_after_init
+#define __wr_after_init \
+   __attribute__((__section__(".data..wr_after_init")))
+#endif
+
+
 #ifndef cacheline_aligned
 #define cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))
 #endif
-- 
2.19.1



[PATCH 2/6] __wr_after_init: write rare for static allocation

2018-12-04 Thread Igor Stoppa
Implementation of write rare for statically allocated data, located in a
specific memory section through the use of the __write_rare label.

The basic functions are:
- wr_memset(): write rare counterpart of memset()
- wr_memcpy(): write rare counterpart of memcpy()
- wr_assign(): write rare counterpart of the assignment ('=') operator
- wr_rcu_assign_pointer(): write rare counterpart of rcu_assign_pointer()

The implementation is based on code from Andy Lutomirski and Nadav Amit
for patching the text on x86 [here goes reference to commits, once merged]

The modification of write protected data is done through an alternate
mapping of the same pages, as writable.
This mapping is local to each core and is active only for the duration
of each write operation.
Local interrupts are disabled, while the alternate mapping is active.

In theory, it could introduce a non-predictable delay, in a preemptible
system, however the amount of data to be altered is likely to be far
smaller than a page.

Signed-off-by: Igor Stoppa 

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
---
 include/linux/prmem.h | 133 ++
 init/main.c   |   2 +
 mm/Kconfig|   4 ++
 mm/Makefile   |   1 +
 mm/prmem.c| 124 +++
 5 files changed, 264 insertions(+)
 create mode 100644 include/linux/prmem.h
 create mode 100644 mm/prmem.c

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
new file mode 100644
index ..b0131c1f5dc0
--- /dev/null
+++ b/include/linux/prmem.h
@@ -0,0 +1,133 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * prmem.h: Header for memory protection library
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ *
+ * Support for:
+ * - statically allocated write rare data
+ */
+
+#ifndef _LINUX_PRMEM_H
+#define _LINUX_PRMEM_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/**
+ * memtst() - test n bytes of the source to match the c value
+ * @p: beginning of the memory to test
+ * @c: byte to compare against
+ * @len: amount of bytes to test
+ *
+ * Returns 0 on success, non-zero otherwise.
+ */
+static inline int memtst(void *p, int c, __kernel_size_t len)
+{
+   __kernel_size_t i;
+
+   for (i = 0; i < len; i++) {
+   u8 d =  *(i + (u8 *)p) - (u8)c;
+
+   if (unlikely(d))
+   return d;
+   }
+   return 0;
+}
+
+
+#ifndef CONFIG_PRMEM
+
+static inline void *wr_memset(void *p, int c, __kernel_size_t len)
+{
+   return memset(p, c, len);
+}
+
+static inline void *wr_memcpy(void *p, const void *q, __kernel_size_t size)
+{
+   return memcpy(p, q, size);
+}
+
+#define wr_assign(var, val)((var) = (val))
+
+#define wr_rcu_assign_pointer(p, v)\
+   rcu_assign_pointer(p, v)
+
+#else
+
+enum wr_op_type {
+   WR_MEMCPY,
+   WR_MEMSET,
+   WR_RCU_ASSIGN_PTR,
+   WR_OPS_NUMBER,
+};
+
+void *__wr_op(unsigned long dst, unsigned long src, __kernel_size_t len,
+ enum wr_op_type op);
+
+/**
+ * wr_memset() - sets n bytes of the destination to the c value
+ * @p: beginning of the memory to write to
+ * @c: byte to replicate
+ * @len: amount of bytes to set
+ *
+ * Returns pointer to the destination
+ */
+static inline void *wr_memset(void *p, int c, __kernel_size_t len)
+{
+   return __wr_op((unsigned long)p, (unsigned long)c, len, WR_MEMSET);
+}
+
+/**
+ * wr_memcpy() - copies n bytes from source to destination
+ * @p: beginning of the memory to write to
+ * @q: beginning of the memory to read from
+ * @size: amount of bytes to copy
+ *
+ * Returns pointer to the destination
+ */
+static inline void *wr_memcpy(void *p, const void *q, __kernel_size_t size)
+{
+   return __wr_op((unsigned long)p, (unsigned long)q, size, WR_MEMCPY);
+}
+
+/**
+ * wr_assign() - sets a write-rare variable to a specified value
+ * @var: the variable to set
+ * @val: the new value
+ *
+ * Returns: the variable
+ *
+ * Note: it might be possible to optimize this, to use wr_memset in some
+ * cases (maybe with NULL?).
+ */
+
+#define wr_assign(var, val) ({ \
+   typeof(var) tmp = (typeof(var))val; \
+   \
+   wr_memcpy(&var, &tmp, sizeof(var)); \
+   var;\
+})
+
+/**
+ * wr_rcu_assign_pointer() - initialize a pointer in rcu mode
+ * @p: the rcu pointer
+ * @v: the new value
+ *
+ * Returns the value assigned to the rcu pointer.
+ *
+ * It is provided as macro, to match rcu_assign_pointer()
+ */
+#define wr_rcu_assign_pointer(p, v) ({ \
+   __wr_op((unsigned long)&p, v, siz

[PATCH 4/6] rodata_test: add verification for __wr_after_init

2018-12-04 Thread Igor Stoppa
The write protection of the __wr_after_init data can be verified with the
same methodology used for const data.

Signed-off-by: Igor Stoppa 

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
---
 mm/rodata_test.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/rodata_test.c b/mm/rodata_test.c
index 3c1e515ca9b1..a98d088ad9cc 100644
--- a/mm/rodata_test.c
+++ b/mm/rodata_test.c
@@ -16,7 +16,19 @@
 
 #define INIT_TEST_VAL 0xC3
 
+/*
+ * Note: __ro_after_init data is, for all practical purposes, equivalent to
+ * const data, since it is even write protected at the same time; there
+ * is no need for separate testing.
+ * __wr_after_init data, otoh, is altered also after the write protection
+ * takes place and it cannot be exploitable for altering more permanent
+ * data.
+ */
+
 static const int rodata_test_data = INIT_TEST_VAL;
+static int wr_after_init_test_data __wr_after_init = INIT_TEST_VAL;
+extern long __start_wr_after_init;
+extern long __end_wr_after_init;
 
 static bool test_data(char *data_type, const int *data,
  unsigned long start, unsigned long end)
@@ -60,6 +72,9 @@ void rodata_test(void)
 {
if (test_data("rodata", &rodata_test_data,
  (unsigned long)&__start_rodata,
- (unsigned long)&__end_rodata))
+ (unsigned long)&__end_rodata) &&
+   test_data("wr after init data", &wr_after_init_test_data,
+ (unsigned long)&__start_wr_after_init,
+ (unsigned long)&__end_wr_after_init))
pr_info("all tests were successful\n");
 }
-- 
2.19.1



[PATCH 5/6] __wr_after_init: test write rare functionality

2018-12-04 Thread Igor Stoppa
Set of test cases meant to confirm that the write rare functionality
works as expected.

Signed-off-by: Igor Stoppa 

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
---
 include/linux/prmem.h |   7 ++-
 mm/Kconfig.debug  |   9 +++
 mm/Makefile   |   1 +
 mm/test_write_rare.c  | 135 ++
 4 files changed, 149 insertions(+), 3 deletions(-)
 create mode 100644 mm/test_write_rare.c

diff --git a/include/linux/prmem.h b/include/linux/prmem.h
index b0131c1f5dc0..d2492ec24c8c 100644
--- a/include/linux/prmem.h
+++ b/include/linux/prmem.h
@@ -125,9 +125,10 @@ static inline void *wr_memcpy(void *p, const void *q, 
__kernel_size_t size)
  *
  * It is provided as macro, to match rcu_assign_pointer()
  */
-#define wr_rcu_assign_pointer(p, v) ({ \
-   __wr_op((unsigned long)&p, v, sizeof(p), WR_RCU_ASSIGN_PTR);\
-   p;  \
+#define wr_rcu_assign_pointer(p, v) ({ \
+   __wr_op((unsigned long)&p, (unsigned long)v, sizeof(p), \
+   WR_RCU_ASSIGN_PTR); \
+   p;  \
 })
 #endif
 #endif
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 9a7b8b049d04..a26ecbd27aea 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -94,3 +94,12 @@ config DEBUG_RODATA_TEST
 depends on STRICT_KERNEL_RWX
 ---help---
   This option enables a testcase for the setting rodata read-only.
+
+config DEBUG_PRMEM_TEST
+tristate "Run self test for statically allocated protected memory"
+depends on STRICT_KERNEL_RWX
+select PRMEM
+default n
+help
+  Tries to verify that the protection for statically allocated memory
+  works correctly and that the memory is effectively protected.
diff --git a/mm/Makefile b/mm/Makefile
index ef3867c16ce0..8de1d468f4e7 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -59,6 +59,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_PRMEM) += prmem.o
+obj-$(CONFIG_DEBUG_PRMEM_TEST) += test_write_rare.o
 obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += page_poison.o
 obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/test_write_rare.c b/mm/test_write_rare.c
new file mode 100644
index ..240cc43793d1
--- /dev/null
+++ b/mm/test_write_rare.c
@@ -0,0 +1,135 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * test_write_rare.c
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifdef pr_fmt
+#undef pr_fmt
+#endif
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+extern long __start_wr_after_init;
+extern long __end_wr_after_init;
+
+static __wr_after_init int scalar = '0';
+static __wr_after_init u8 array[PAGE_SIZE * 3] __aligned(PAGE_SIZE);
+
+/* The section must occupy a non-zero number of whole pages */
+static bool test_alignment(void)
+{
+   unsigned long pstart = (unsigned long)&__start_wr_after_init;
+   unsigned long pend = (unsigned long)&__end_wr_after_init;
+
+   if (WARN((pstart & ~PAGE_MASK) || (pend & ~PAGE_MASK) ||
+(pstart >= pend), "Boundaries test failed."))
+   return false;
+   pr_info("Boundaries test passed.");
+   return true;
+}
+
+static inline bool test_pattern(void)
+{
+   return (memtst(array, '0', PAGE_SIZE / 2) ||
+   memtst(array + PAGE_SIZE / 2, '1', PAGE_SIZE * 3 / 4) ||
+   memtst(array + PAGE_SIZE * 5 / 4, '0', PAGE_SIZE / 2) ||
+   memtst(array + PAGE_SIZE * 7 / 4, '1', PAGE_SIZE * 3 / 4) ||
+   memtst(array + PAGE_SIZE * 5 / 2, '0', PAGE_SIZE / 2));
+}
+
+static bool test_wr_memset(void)
+{
+   int new_val = '1';
+
+   wr_memset(&scalar, new_val, sizeof(scalar));
+   if (WARN(memtst(&scalar, new_val, sizeof(scalar)),
+"Scalar write rare memset test failed."))
+   return false;
+
+   pr_info("Scalar write rare memset test passed.");
+
+   wr_memset(array, '0', PAGE_SIZE * 3);
+   if (WARN(memtst(array, '0', PAGE_SIZE * 3),
+"Array write rare memset test failed."))
+   return false;
+
+   wr_memset(array + PAGE_SIZE / 2, '1', PAGE_SIZE * 2);
+   if (WARN(memtst(array + PAGE_SIZE / 2, '1', PAGE_SIZE * 2),
+"Array write rare memset test failed."))
+   return fa

[PATCH 6/6] __wr_after_init: lkdtm test

2018-12-04 Thread Igor Stoppa
Verify that trying to modify a variable with the __wr_after_init
modifier will cause a crash.

Signed-off-by: Igor Stoppa 

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
---
 drivers/misc/lkdtm/core.c  |  3 +++
 drivers/misc/lkdtm/lkdtm.h |  3 +++
 drivers/misc/lkdtm/perms.c | 29 +
 3 files changed, 35 insertions(+)

diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 2837dc77478e..73c34b17c433 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -155,6 +155,9 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(ACCESS_USERSPACE),
CRASHTYPE(WRITE_RO),
CRASHTYPE(WRITE_RO_AFTER_INIT),
+#ifdef CONFIG_PRMEM
+   CRASHTYPE(WRITE_WR_AFTER_INIT),
+#endif
CRASHTYPE(WRITE_KERN),
CRASHTYPE(REFCOUNT_INC_OVERFLOW),
CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 3c6fd327e166..abba2f52ffa6 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -38,6 +38,9 @@ void lkdtm_READ_BUDDY_AFTER_FREE(void);
 void __init lkdtm_perms_init(void);
 void lkdtm_WRITE_RO(void);
 void lkdtm_WRITE_RO_AFTER_INIT(void);
+#ifdef CONFIG_PRMEM
+void lkdtm_WRITE_WR_AFTER_INIT(void);
+#endif
 void lkdtm_WRITE_KERN(void);
 void lkdtm_EXEC_DATA(void);
 void lkdtm_EXEC_STACK(void);
diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 53b85c9d16b8..f681730aa652 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Whether or not to fill the target memory area with do_nothing(). */
@@ -27,6 +28,10 @@ static const unsigned long rodata = 0xAA55AA55;
 /* This is marked __ro_after_init, so it should ultimately be .rodata. */
 static unsigned long ro_after_init __ro_after_init = 0x55AA5500;
 
+/* This is marked __wr_after_init, so it should be in .rodata. */
+static
+unsigned long wr_after_init __wr_after_init = 0x55AA5500;
+
 /*
  * This just returns to the caller. It is designed to be copied into
  * non-executable memory regions.
@@ -104,6 +109,28 @@ void lkdtm_WRITE_RO_AFTER_INIT(void)
*ptr ^= 0xabcd1234;
 }
 
+#ifdef CONFIG_PRMEM
+
+void lkdtm_WRITE_WR_AFTER_INIT(void)
+{
+   unsigned long *ptr = &wr_after_init;
+
+   /*
+* Verify we were written to during init. Since an Oops
+* is considered a "success", a failure is to just skip the
+* real test.
+*/
+   if ((*ptr & 0xAA) != 0xAA) {
+   pr_info("%p was NOT written during init!?\n", ptr);
+   return;
+   }
+
+   pr_info("attempting bad wr_after_init write at %p\n", ptr);
+   *ptr ^= 0xabcd1234;
+}
+
+#endif
+
 void lkdtm_WRITE_KERN(void)
 {
size_t size;
@@ -200,4 +227,6 @@ void __init lkdtm_perms_init(void)
/* Make sure we can write to __ro_after_init values during __init */
ro_after_init |= 0xAA;
 
+   /* Make sure we can write to __wr_after_init during __init */
+   wr_after_init |= 0xAA;
 }
-- 
2.19.1



[RFC: Coding Style] Best way to split a long function declaration with modifiers

2018-05-12 Thread Igor Stoppa

Hi,
I have been wondering if it's ok to break a long (function declaration) 
line in the following way:


static __always_inline
struct foo_bar *__get_foo_bar(type1 parm1, type2 parm2, type3 parm3)


instead of:

static __always_inline struct foo_bar *__get_foo_bar(type1 parm1,
 type2 parm2,
 type3 parm3)


I personally like the former more, not to mention that it also uses one
line less, but it seems less common in the sources.
The coding style references do not seem to say anything explicit about 
which style to prefer.


And not all the code in the kernel is of the same quality, so finding an 
example doesn't automatically mean that it's a good practice to follow :-)


--
thanks, igor


Re: [RFC: Coding Style] Best way to split a long function declaration with modifiers

2018-05-12 Thread Igor Stoppa

On 12/05/18 18:41, Joe Perches wrote:


I personally like more the former, not to mention that it uses also one
line less, but it seems less common in the sources.
The coding style references do not seem to say anything explicit about
which style to prefer.


thank you, I could provide a patch to the docs for this case, if it's 
not considered too much of a corner case.


--
igor


[PATCH] checkpatch.pl: Improve WARNING on Kconfig help

2018-12-19 Thread Igor Stoppa
The checkpatch.pl script complains when the help section of a Kconfig
entry is too short, but it doesn't really explain what it is looking
for. Instead, it gives a generic warning that one should consider writing
a paragraph.

But what it *really* checks is that the help section is at least
.$min_conf_desc_length lines long.

Since the definition of what is a paragraph is not really carved in
stone (and actually the primary description is "5 sentences"), make the
warning less ambiguous by stating explicitly the actual test condition, so
that one doesn't have to read the checkpatch.pl sources to figure out the
actual test.

Signed-off-by: Igor Stoppa 
CC: Andy Whitcroft 
CC: Joe Perches 
CC: linux-kernel@vger.kernel.org
---
 scripts/checkpatch.pl | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index c883ec55654f..e255f0423cca 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2931,7 +2931,8 @@ sub process {
}
if ($is_start && $is_end && $length < 
$min_conf_desc_length) {
WARN("CONFIG_DESCRIPTION",
-"please write a paragraph that describes 
the config symbol fully\n" . $herecurr);
+"please write a paragraph (" 
.$min_conf_desc_length . " lines)" .
+" that describes the config symbol 
fully\n" . $herecurr);
}
#print "is_start<$is_start> is_end<$is_end> 
length<$length>\n";
}
-- 
2.19.1



Re: [PATCH] checkpatch.pl: Improve WARNING on Kconfig help

2018-12-19 Thread Igor Stoppa




On 19/12/2018 14:29, Joe Perches wrote:

On Wed, 2018-12-19 at 11:59 +, Andy Whitcroft wrote:

On Wed, Dec 19, 2018 at 02:44:36AM -0800, Joe Perches wrote:



To cover both cases perhaps:

"please ensure that this config symbols is described fully (less than
 $min_conf_desc_length lines is quite brief)"


This is one of those checkpatch bleats I never
really thought was appropriate as some or many
Kconfig symbols are fully descriptive in even
with only a single line.

Also, it seems you are arguing for a checkpatch
--verbose-help output style rather than the
intentionally terse single line output that the
script produces today.


If I have to use --verbose, to understand that the warning is about me 
writing 3 lines when the script expects 4, I don't think it's 
particularly user friendly.


Let's write "Expected 4+ lines" or something equally clear.
It will fit on a single line and get the job done.


That is something Al Viro once suggested in this thread:
https://lore.kernel.org/patchwork/patch/775901/

On Sat, 2017-04-01 at 05:08 +0100, Al Viro wrote:

On Fri, Mar 31, 2017 at 08:52:50PM -0700, Joe Perches wrote:

checkpatch messages are single line.


Too bad... Incidentally, being able to get more detailed explanation of
a warning might be a serious improvement, especially if it contains
the rationale.  Hell, something like TeX handling of errors might be
a good idea - warning printed, offered actions include 'give more help',
'continue', 'exit', 'from now on suppress this kind of warning', 'from
now on just dump this kind of warning into log and keep going', 'from
now on dump all warnings into log and keep going'.


It's all good in general, but here the word "paragraph" is being abused, 
in the sense that it has been given an arbitrary meaning of "4 lines".
And the warning is even worse because it doesn't even acknowledge that I 
wrote something, even if it's a meager 1 or 2 lines.

Which is even more confusing.

As a user, if I'm running checkpatch.pl and I get a warning, I should
spend my time trying to decide if/how to fix it, not re-invoking it with 
extra options or reading its sources.


--
igor





[PATCH] checkpatch.pl: Improve WARNING on Kconfig help

2018-12-19 Thread Igor Stoppa
The checkpatch.pl script complains when the help section of a Kconfig
entry is too short, but it doesn't really explain what it is looking
for. Instead, it gives a generic warning that one should consider writing
a paragraph.

But what it *really* checks is that the help section is at least
.$min_conf_desc_length lines long.

Since the definition of what is a paragraph is not really carved in
stone (and actually the primary description is "5 sentences"), make the
warning less ambiguous by stating explicitly the actual test condition, so
that one doesn't have to read the checkpatch.pl sources to figure out the
actual test.

Signed-off-by: Igor Stoppa 
CC: Andy Whitcroft 
CC: Joe Perches 
CC: linux-kernel@vger.kernel.org
---
 scripts/checkpatch.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index c883ec55654f..33568d7e28d1 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2931,7 +2931,7 @@ sub process {
 			}
 			if ($is_start && $is_end && $length < $min_conf_desc_length) {
 				WARN("CONFIG_DESCRIPTION",
-				     "please write a paragraph that describes the config symbol fully\n" . $herecurr);
+				     "expecting a 'help' section of " .$min_conf_desc_length . "+ lines\n" . $herecurr);
 			}
 			#print "is_start<$is_start> is_end<$is_end> length<$length>\n";
}
-- 
2.19.1



[PATCH 01/12] x86_64: memset_user()

2018-12-19 Thread Igor Stoppa
Create x86_64 specific version of memset for user space, based on
clear_user().
This will be used for implementing wr_memset() in the __wr_after_init
scenario, where write-rare variables have an alternate mapping for
writing.

Signed-off-by: Igor Stoppa 

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: Mimi Zohar 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/uaccess_64.h |  6 
 arch/x86/lib/usercopy_64.c| 54 +++
 2 files changed, 60 insertions(+)

diff --git a/arch/x86/include/asm/uaccess_64.h 
b/arch/x86/include/asm/uaccess_64.h
index a9d637bc301d..f194bfce4866 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -213,4 +213,10 @@ copy_user_handle_tail(char *to, char *from, unsigned len);
 unsigned long
 mcsafe_handle_tail(char *to, char *from, unsigned len);
 
+unsigned long __must_check
+memset_user(void __user *mem, int c, unsigned long len);
+
+unsigned long __must_check
+__memset_user(void __user *mem, int c, unsigned long len);
+
 #endif /* _ASM_X86_UACCESS_64_H */
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 1bd837cdc4b1..84f8f8a20b30 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -9,6 +9,60 @@
 #include 
 #include 
 
+/*
+ * Memset Userspace
+ */
+
+unsigned long __memset_user(void __user *addr, int c, unsigned long size)
+{
+   long __d0;
+   unsigned long  pattern = 0;
+   int i;
+
+   for (i = 0; i < 8; i++)
+   pattern = (pattern << 8) | (0xFF & c);
+   might_fault();
+   /* no memory constraint: gcc doesn't know about this memory */
+   stac();
+   asm volatile(
+   "   movq %[val], %%rdx\n"
+   "   testq  %[size8],%[size8]\n"
+   "   jz 4f\n"
+   "0: mov %%rdx,(%[dst])\n"
+   "   addq   $8,%[dst]\n"
+   "   decl %%ecx ; jnz   0b\n"
+   "4: movq  %[size1],%%rcx\n"
+   "   testl %%ecx,%%ecx\n"
+   "   jz 2f\n"
+   "1: movb   %%dl,(%[dst])\n"
+   "   incq   %[dst]\n"
+   "   decl %%ecx ; jnz  1b\n"
+   "2:\n"
+   ".section .fixup,\"ax\"\n"
+   "3: lea 0(%[size1],%[size8],8),%[size8]\n"
+   "   jmp 2b\n"
+   ".previous\n"
+   _ASM_EXTABLE_UA(0b, 3b)
+   _ASM_EXTABLE_UA(1b, 2b)
+   : [size8] "=&c"(size), [dst] "=&D" (__d0)
+   : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr),
+ [val] "ri"(pattern)
+   : "rdx");
+
+   clac();
+   return size;
+}
+EXPORT_SYMBOL(__memset_user);
+
+unsigned long memset_user(void __user *to, int c, unsigned long n)
+{
+   if (access_ok(VERIFY_WRITE, to, n))
+   return __memset_user(to, c, n);
+   return n;
+}
+EXPORT_SYMBOL(memset_user);
+
+
 /*
  * Zero Userspace
  */
-- 
2.19.1



[PATCH 02/12] __wr_after_init: linker section and label

2018-12-19 Thread Igor Stoppa
Introduce a section and a label for statically allocated write rare
data. The label is named "__wr_after_init".
As the name implies, after the init phase is completed, this section
will be modifiable only by invoking write rare functions.
The section must take up a set of full pages.

To activate both section and label, the arch must set CONFIG_ARCH_HAS_PRMEM

Signed-off-by: Igor Stoppa 

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: Mimi Zohar 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
---
 arch/Kconfig  | 15 +++
 include/asm-generic/vmlinux.lds.h | 25 +
 include/linux/cache.h | 21 +
 init/main.c   |  2 ++
 4 files changed, 63 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index e1e540ffa979..8668ffec8098 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -802,6 +802,21 @@ config VMAP_STACK
  the stack to map directly to the KASAN shadow map using a formula
  that is incorrect if the stack is in vmalloc space.
 
+config ARCH_HAS_PRMEM
+   def_bool n
+   help
+ architecture specific symbol stating that the architecture provides
+ a back-end function for the write rare operation.
+
+config PRMEM
+   bool "Write protect critical data that doesn't need high write speed."
+   depends on ARCH_HAS_PRMEM
+   default y
+   help
+ If the architecture supports it, statically allocated data which
+ has been selected for hardening becomes (mostly) read-only.
+ The selection happens by labelling the data "__wr_after_init".
+
 config ARCH_OPTIONAL_KERNEL_RWX
def_bool n
 
diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index 3d7a6a9c2370..ddb1fd608490 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -311,6 +311,30 @@
KEEP(*(__jump_table))   \
__stop___jump_table = .;
 
+/*
+ * Allow architectures to handle wr_after_init data on their
+ * own by defining an empty WR_AFTER_INIT_DATA.
+ * However, it's important that pages containing WR_RARE data do not
+ * hold anything else, to avoid both accidentally unprotecting something
+ * that is supposed to stay read-only all the time and also to protect
+ * something else that is supposed to be writeable all the time.
+ */
+#ifndef WR_AFTER_INIT_DATA
+#ifdef CONFIG_PRMEM
+#define WR_AFTER_INIT_DATA(align)  \
+   . = ALIGN(PAGE_SIZE);   \
+   __start_wr_after_init = .;  \
+   . = ALIGN(align);   \
+   *(.data..wr_after_init) \
+   . = ALIGN(PAGE_SIZE);   \
+   __end_wr_after_init = .;\
+   . = ALIGN(align);
+#else
+#define WR_AFTER_INIT_DATA(align)  \
+   . = ALIGN(align);
+#endif
+#endif
+
 /*
  * Allow architectures to handle ro_after_init data on their
  * own by defining an empty RO_AFTER_INIT_DATA.
@@ -332,6 +356,7 @@
__start_rodata = .; \
*(.rodata) *(.rodata.*) \
RO_AFTER_INIT_DATA  /* Read only after init */  \
+   WR_AFTER_INIT_DATA(align) /* wr after init */   \
KEEP(*(__vermagic)) /* Kernel version magic */  \
. = ALIGN(8);   \
__start___tracepoints_ptrs = .; \
diff --git a/include/linux/cache.h b/include/linux/cache.h
index 750621e41d1c..09bd0b9284b6 100644
--- a/include/linux/cache.h
+++ b/include/linux/cache.h
@@ -31,6 +31,27 @@
 #define __ro_after_init __attribute__((__section__(".data..ro_after_init")))
 #endif
 
+/*
+ * __wr_after_init is used to mark objects that cannot be modified
+ * directly after init (i.e. after mark_rodata_ro() has been called).
+ * These objects become effectively read-only, from the perspective of
+ * performing a direct write, like a variable assignment.
+ * However, they can be altered through a dedicated function.
+ * It is intended for those objects which are occasionally modified after
+ * init, however they are modified so seldom, that the extra cost from
+ * the indirect modification is either negligible or worth paying, for the
+ * sake of the protection gained.
+ */
+#ifndef __wr_after_init
+#ifdef CONFIG_PRMEM
+#define __wr_after_init \
+   __attribute__((__section__(".data..wr_after_i

[RFC v2 PATCH 0/12] hardening: statically allocated protected memory

2018-12-19 Thread Igor Stoppa
Patch-set implementing write-rare memory protection for statically
allocated data.
Its purpose is to write-protect kernel data which is seldom
modified.
There is no read overhead, however writing requires special operations that
are probably unsuitable for often-changing data.
The use is opt-in, by applying the modifier __wr_after_init to a variable
declaration.

As the name implies, the write protection kicks in only after init() is
completed; before that moment, the data is modifiable in the usual way.
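
A hedged sketch of the opt-in usage, assuming the prmem.h header from this
series declares wr_assign() and wr_memset() (the variable and function names
below are made up for illustration and are not part of the series):

#include <linux/prmem.h>	/* __wr_after_init, wr_assign(), wr_memset() */

/* Placed in .data..wr_after_init by the linker script (patch 02). */
static int example_flags __wr_after_init;

static int __init example_init(void)
{
	/* Before mark_rodata_ro() a plain assignment still works. */
	example_flags = 0x1;
	return 0;
}

static void example_update(void)
{
	/* After init the page is read-only; a direct write would fault. */
	wr_assign(example_flags, example_flags | 0x2);
	wr_memset(&example_flags, 0, sizeof(example_flags));
}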

Current Limitations:
* supports only data which is allocated statically, at build time.
* supports only x86_64; other architectures need to provide their own backend

Some notes:
- there is a part of generic code which is basically a NOP, but it should
  allow using the write protection unconditionally. It will automatically
  fall back to non-protected functionality if the specific architecture
  doesn't support write-rare
- to avoid the risk of weakening __ro_after_init, __wr_after_init data is
  in a separate set of pages, and any invocation will confirm that the
  memory affected falls within this range (a sketch of such a check
  follows these notes). rodata_test is modified accordingly, to also
  check this case.
- for now, the patchset addresses only x86_64, as each architecture seems
  to have its own way of dealing with user space. Once a few are
  implemented, it should be more obvious what code can be refactored as
  common.
- the memset_user() assembly function seems to work, but I'm not too sure
  it's really ok
- I've added a simple example: the protection of ima_policy_flags
- the last patch is optional, but it seemed worth doing the refactoring
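
As a hedged sketch of the range check mentioned above (the helper name is
made up; the section symbols are the ones defined by the linker script in
patch 02 and also used by the tests):

extern long __start_wr_after_init;
extern long __end_wr_after_init;

/*
 * Reject any target that is not fully contained in the write-rare section,
 * so that the write-rare path cannot be abused to modify const or
 * __ro_after_init data.
 */
static inline bool wr_target_is_valid(const void *p, size_t size)
{
	unsigned long start = (unsigned long)&__start_wr_after_init;
	unsigned long end = (unsigned long)&__end_wr_after_init;
	unsigned long low = (unsigned long)p;

	return low >= start && low <= end && size <= end - low;
}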

Changelog:

v1->v2

* introduce cleaner split between generic and arch code
* add x86_64 specific memset_user()
* replace kernel-space memset() memcopy() with userspace counterpart
* randomize the base address for the alternate map across the entire
  available address range from user space (128TB - 64TB)
* convert BUG() to WARN()
* turn verification of written data into debugging option
* wr_rcu_assign_pointer() as special case of wr_assign()
* example with protection of ima_policy_flags
* documentation

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: Mimi Zohar 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org

Igor Stoppa (12):
[PATCH 01/12] x86_64: memset_user()
[PATCH 02/12] __wr_after_init: linker section and label
[PATCH 03/12] __wr_after_init: generic header
[PATCH 04/12] __wr_after_init: x86_64: __wr_op
[PATCH 05/12] __wr_after_init: x86_64: debug writes
[PATCH 06/12] __wr_after_init: Documentation: self-protection
[PATCH 07/12] __wr_after_init: lkdtm test
[PATCH 08/12] rodata_test: refactor tests
[PATCH 09/12] rodata_test: add verification for __wr_after_init
[PATCH 10/12] __wr_after_init: test write rare functionality
[PATCH 11/12] IMA: turn ima_policy_flags into __wr_after_init
[PATCH 12/12] x86_64: __clear_user as case of __memset_user


Documentation/security/self-protection.rst |  14 ++-
arch/Kconfig   |  15 +++
arch/x86/Kconfig   |   1 +
arch/x86/include/asm/uaccess_64.h  |   6 +
arch/x86/lib/usercopy_64.c |  41 +--
arch/x86/mm/Makefile   |   2 +
arch/x86/mm/prmem.c| 127 +
drivers/misc/lkdtm/core.c  |   3 +
drivers/misc/lkdtm/lkdtm.h |   3 +
drivers/misc/lkdtm/perms.c |  29 +
include/asm-generic/vmlinux.lds.h  |  25 +
include/linux/cache.h  |  21 
include/linux/prmem.h  | 142 
init/main.c|   2 +
mm/Kconfig.debug   |  16 +++
mm/Makefile|   1 +
mm/rodata_test.c   |  69 
mm/test_write_rare.c   | 135 ++
security/integrity/ima/ima.h   |   3 +-
security/integrity/ima/ima_init.c  |   5 +-
security/integrity/ima/ima_policy.c|   9 +-
21 files changed, 629 insertions(+), 40 deletions(-)



[PATCH 12/12] x86_64: __clear_user as case of __memset_user

2018-12-19 Thread Igor Stoppa
To avoid code duplication, re-use __memset_user() when clearing
user-space memory.

The overhead should be minimal (2 extra register assignments) and
outside of the writing loop.

Signed-off-by: Igor Stoppa 

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: Mimi Zohar 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
---
 arch/x86/lib/usercopy_64.c | 29 +
 1 file changed, 1 insertion(+), 28 deletions(-)

diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index 84f8f8a20b30..ab6aabb62055 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -69,34 +69,7 @@ EXPORT_SYMBOL(memset_user);
 
 unsigned long __clear_user(void __user *addr, unsigned long size)
 {
-   long __d0;
-   might_fault();
-   /* no memory constraint because it doesn't change any memory gcc knows
-  about */
-   stac();
-   asm volatile(
-   "   testq  %[size8],%[size8]\n"
-   "   jz 4f\n"
-   "0: movq $0,(%[dst])\n"
-   "   addq   $8,%[dst]\n"
-   "   decl %%ecx ; jnz   0b\n"
-   "4: movq  %[size1],%%rcx\n"
-   "   testl %%ecx,%%ecx\n"
-   "   jz 2f\n"
-   "1: movb   $0,(%[dst])\n"
-   "   incq   %[dst]\n"
-   "   decl %%ecx ; jnz  1b\n"
-   "2:\n"
-   ".section .fixup,\"ax\"\n"
-   "3: lea 0(%[size1],%[size8],8),%[size8]\n"
-   "   jmp 2b\n"
-   ".previous\n"
-   _ASM_EXTABLE_UA(0b, 3b)
-   _ASM_EXTABLE_UA(1b, 2b)
-   : [size8] "=&c"(size), [dst] "=&D" (__d0)
-   : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
-   clac();
-   return size;
+   return __memset_user(addr, 0, size);
 }
 EXPORT_SYMBOL(__clear_user);
 
-- 
2.19.1



[PATCH 06/12] __wr_after_init: Documentation: self-protection

2018-12-19 Thread Igor Stoppa
Update the self-protection documentation, to mention also the use of the
__wr_after_init attribute.

Signed-off-by: Igor Stoppa 

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: Mimi Zohar 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
---
 Documentation/security/self-protection.rst | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/Documentation/security/self-protection.rst 
b/Documentation/security/self-protection.rst
index f584fb74b4ff..df2614bc25b9 100644
--- a/Documentation/security/self-protection.rst
+++ b/Documentation/security/self-protection.rst
@@ -84,12 +84,14 @@ For variables that are initialized once at ``__init`` time, 
these can
 be marked with the (new and under development) ``__ro_after_init``
 attribute.
 
-What remains are variables that are updated rarely (e.g. GDT). These
-will need another infrastructure (similar to the temporary exceptions
-made to kernel code mentioned above) that allow them to spend the rest
-of their lifetime read-only. (For example, when being updated, only the
-CPU thread performing the update would be given uninterruptible write
-access to the memory.)
+Others, which are statically allocated but still need to be updated
+rarely, can be marked with the ``__wr_after_init`` attribute.
+
+The update mechanism must avoid exposing the data to rogue alterations
+during the update. For example, only the CPU thread performing the update
+would be given uninterruptible write access to the memory.
+
+Currently there is no protection available for data allocated dynamically.
 
 Segregation of kernel memory from userspace memory
 ~~
-- 
2.19.1
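
To complement the documentation text above, here is a deliberately naive,
hedged sketch of what a write-rare update amounts to semantically. It just
toggles the permissions of the live mapping with set_memory_rw() and
set_memory_ro(), which briefly exposes the page to every CPU; this series
avoids exactly that by writing through a randomized alternate mapping
instead (see the cover letter changelog). The function name is made up and
this is not the mechanism used by the patches:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/set_memory.h>
#include <linux/string.h>

/* Naive illustration only: NOT what this series does. */
static void naive_wr_memcpy(void *dst, const void *src, size_t size)
{
	unsigned long first_page = (unsigned long)dst & PAGE_MASK;
	int pages = DIV_ROUND_UP(((unsigned long)dst & ~PAGE_MASK) + size,
				 PAGE_SIZE);

	set_memory_rw(first_page, pages);	/* window: writable for everyone */
	memcpy(dst, src, size);
	set_memory_ro(first_page, pages);	/* restore the protection */
}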



[PATCH 10/12] __wr_after_init: test write rare functionality

2018-12-19 Thread Igor Stoppa
Set of test cases meant to confirm that the write rare functionality
works as expected.
It can optionally be compiled as a module.

Signed-off-by: Igor Stoppa 

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: Mimi Zohar 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
---
 mm/Kconfig.debug |   8 +++
 mm/Makefile  |   1 +
 mm/test_write_rare.c | 135 +++
 3 files changed, 144 insertions(+)
 create mode 100644 mm/test_write_rare.c

diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index b10305cfac3c..ae018e56c4e4 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -102,3 +102,11 @@ config DEBUG_PRMEM
 help
   After any write rare operation, compares the data written with the
   value provided by the caller.
+
+config DEBUG_PRMEM_TEST
+tristate "Run self test for statically allocated protected memory"
+depends on PRMEM
+default n
+help
+  Tries to verify that the protection for statically allocated memory
+  works correctly and that the memory is effectively protected.
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..62d719c0ee1e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -58,6 +58,7 @@ obj-$(CONFIG_SPARSEMEM)   += sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_DEBUG_PRMEM_TEST) += test_write_rare.o
 obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += page_poison.o
 obj-$(CONFIG_SLAB) += slab.o
diff --git a/mm/test_write_rare.c b/mm/test_write_rare.c
new file mode 100644
index ..30574bc34a20
--- /dev/null
+++ b/mm/test_write_rare.c
@@ -0,0 +1,135 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * test_write_rare.c
+ *
+ * (C) Copyright 2018 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifdef pr_fmt
+#undef pr_fmt
+#endif
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+extern long __start_wr_after_init;
+extern long __end_wr_after_init;
+
+static __wr_after_init int scalar = '0';
+static __wr_after_init u8 array[PAGE_SIZE * 3] __aligned(PAGE_SIZE);
+
+/* The section must occupy a non-zero number of whole pages */
+static bool test_alignment(void)
+{
+   unsigned long pstart = (unsigned long)&__start_wr_after_init;
+   unsigned long pend = (unsigned long)&__end_wr_after_init;
+
+   if (WARN((pstart & ~PAGE_MASK) || (pend & ~PAGE_MASK) ||
+(pstart >= pend), "Boundaries test failed."))
+   return false;
+   pr_info("Boundaries test passed.");
+   return true;
+}
+
+static bool test_pattern(void)
+{
+   return (memtst(array, '0', PAGE_SIZE / 2) ||
+   memtst(array + PAGE_SIZE / 2, '1', PAGE_SIZE * 3 / 4) ||
+   memtst(array + PAGE_SIZE * 5 / 4, '0', PAGE_SIZE / 2) ||
+   memtst(array + PAGE_SIZE * 7 / 4, '1', PAGE_SIZE * 3 / 4) ||
+   memtst(array + PAGE_SIZE * 5 / 2, '0', PAGE_SIZE / 2));
+}
+
+static bool test_wr_memset(void)
+{
+   int new_val = '1';
+
+   wr_memset(&scalar, new_val, sizeof(scalar));
+   if (WARN(memtst(&scalar, new_val, sizeof(scalar)),
+"Scalar write rare memset test failed."))
+   return false;
+
+   pr_info("Scalar write rare memset test passed.");
+
+   wr_memset(array, '0', PAGE_SIZE * 3);
+   if (WARN(memtst(array, '0', PAGE_SIZE * 3),
+"Array write rare memset test failed."))
+   return false;
+
+   wr_memset(array + PAGE_SIZE / 2, '1', PAGE_SIZE * 2);
+   if (WARN(memtst(array + PAGE_SIZE / 2, '1', PAGE_SIZE * 2),
+"Array write rare memset test failed."))
+   return false;
+
+   wr_memset(array + PAGE_SIZE * 5 / 4, '0', PAGE_SIZE / 2);
+   if (WARN(memtst(array + PAGE_SIZE * 5 / 4, '0', PAGE_SIZE / 2),
+"Array write rare memset test failed."))
+   return false;
+
+   if (WARN(test_pattern(), "Array write rare memset test failed."))
+   return false;
+
+   pr_info("Array write rare memset test passed.");
+   return true;
+}
+
+static u8 array_1[PAGE_SIZE * 2];
+static u8 array_2[PAGE_SIZE * 2];
+
+static bool test_wr_memcpy(void)
+{
+   int new_val = 0x12345678;
+
+   wr_assign(scalar, new_val);
+   if (WARN(memcmp(&scalar, &new_val, sizeof(scalar)),
+"Scalar write rare memcpy test failed."))
+   return false;
+   pr_info("Scalar write rare 

[PATCH 11/12] IMA: turn ima_policy_flags into __wr_after_init

2018-12-19 Thread Igor Stoppa
The policy flags could be targeted by an attacker aiming at disabling IMA,
so that there would be no trace of a file system modification in the
measurement list.

Since the flags can be altered at runtime, it is not possible to make
them fully read-only, for example with __ro_after_init.

__wr_after_init can still provide some protection, at least against
simple memory overwrite attacks.

Signed-off-by: Igor Stoppa 

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: Mimi Zohar 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
---
 security/integrity/ima/ima.h| 3 ++-
 security/integrity/ima/ima_init.c   | 5 +++--
 security/integrity/ima/ima_policy.c | 9 +
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index cc12f3449a72..297c25f5122e 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "../integrity.h"
@@ -50,7 +51,7 @@ enum tpm_pcrs { TPM_PCR0 = 0, TPM_PCR8 = 8 };
 #define IMA_TEMPLATE_IMA_FMT "d|n"
 
 /* current content of the policy */
-extern int ima_policy_flag;
+extern int ima_policy_flag __wr_after_init;
 
 /* set during initialization */
 extern int ima_hash_algo;
diff --git a/security/integrity/ima/ima_init.c 
b/security/integrity/ima/ima_init.c
index 59d834219cd6..5f4e13e671bf 100644
--- a/security/integrity/ima/ima_init.c
+++ b/security/integrity/ima/ima_init.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ima.h"
 
@@ -98,9 +99,9 @@ void __init ima_load_x509(void)
 {
int unset_flags = ima_policy_flag & IMA_APPRAISE;
 
-   ima_policy_flag &= ~unset_flags;
+   wr_assign(ima_policy_flag, ima_policy_flag & ~unset_flags);
integrity_load_x509(INTEGRITY_KEYRING_IMA, CONFIG_IMA_X509_PATH);
-   ima_policy_flag |= unset_flags;
+   wr_assign(ima_policy_flag, ima_policy_flag | unset_flags);
 }
 #endif
 
diff --git a/security/integrity/ima/ima_policy.c 
b/security/integrity/ima/ima_policy.c
index 7489cb7de6dc..2004de818d92 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -47,7 +47,7 @@
 #define INVALID_PCR(a) (((a) < 0) || \
(a) >= (FIELD_SIZEOF(struct integrity_iint_cache, measured_pcrs) * 8))
 
-int ima_policy_flag;
+int ima_policy_flag __wr_after_init;
 static int temp_ima_appraise;
 static int build_ima_appraise __ro_after_init;
 
@@ -452,12 +452,13 @@ void ima_update_policy_flag(void)
 
list_for_each_entry(entry, ima_rules, list) {
if (entry->action & IMA_DO_MASK)
-   ima_policy_flag |= entry->action;
+   wr_assign(ima_policy_flag,
+ ima_policy_flag | entry->action);
}
 
ima_appraise |= (build_ima_appraise | temp_ima_appraise);
if (!ima_appraise)
-   ima_policy_flag &= ~IMA_APPRAISE;
+   wr_assign(ima_policy_flag, ima_policy_flag & ~IMA_APPRAISE);
 }
 
 static int ima_appraise_flag(enum ima_hooks func)
@@ -574,7 +575,7 @@ void ima_update_policy(void)
list_splice_tail_init_rcu(&ima_temp_rules, policy, synchronize_rcu);
 
if (ima_rules != policy) {
-   ima_policy_flag = 0;
+   wr_assign(ima_policy_flag, 0);
ima_rules = policy;
}
ima_update_policy_flag();
-- 
2.19.1



[PATCH 09/12] rodata_test: add verification for __wr_after_init

2018-12-19 Thread Igor Stoppa
The write protection of the __wr_after_init data can be verified with the
same methodology used for const data.

Signed-off-by: Igor Stoppa 

CC: Andy Lutomirski 
CC: Nadav Amit 
CC: Matthew Wilcox 
CC: Peter Zijlstra 
CC: Kees Cook 
CC: Dave Hansen 
CC: Mimi Zohar 
CC: linux-integr...@vger.kernel.org
CC: kernel-harden...@lists.openwall.com
CC: linux...@kvack.org
CC: linux-kernel@vger.kernel.org
---
 mm/rodata_test.c | 27 ---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/mm/rodata_test.c b/mm/rodata_test.c
index e1349520b436..a669cf9f5a61 100644
--- a/mm/rodata_test.c
+++ b/mm/rodata_test.c
@@ -16,8 +16,23 @@
 
 #define INIT_TEST_VAL 0xC3
 
+/*
+ * Note: __ro_after_init data is, for every practical effect, equivalent to
+ * const data, since they are even write protected at the same time; there
+ * is no need for separate testing.
+ * __wr_after_init data, otoh, is altered also after the write protection
+ * takes place, so it needs separate testing, and it must not be exploitable
+ * for altering more permanent data.
+ */
+
 static const int rodata_test_data = INIT_TEST_VAL;
 
+#ifdef CONFIG_PRMEM
+static int wr_after_init_test_data __wr_after_init = INIT_TEST_VAL;
+extern long __start_wr_after_init;
+extern long __end_wr_after_init;
+#endif
+
 static bool test_data(char *data_type, const int *data,
  unsigned long start, unsigned long end)
 {
@@ -59,7 +74,13 @@ static bool test_data(char *data_type, const int *data,
 
 void rodata_test(void)
 {
-   test_data("rodata", &rodata_test_data,
- (unsigned long)&__start_rodata,
- (unsigned long)&__end_rodata);
+   if (!test_data("rodata", &rodata_test_data,
+  (unsigned long)&__start_rodata,
+  (unsigned long)&__end_rodata))
+   return;
+#ifdef CONFIG_PRMEM
+   test_data("wr after init data", &wr_after_init_test_data,
+ (unsigned long)&__start_wr_after_init,
+ (unsigned long)&__end_wr_after_init);
+#endif
 }
-- 
2.19.1


