Re: [ofa-general] Re: Demand paging for memory regions
--- Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Wed, 13 Feb 2008, Christian Bell wrote:
>
> > not always be in the thousands but you're still claiming scalability
> > for a mechanism that essentially logs who accesses the regions. Then
> > there's the fact that reclaim becomes a collective communication
> > operation over all region accessors. Makes me nervous.
>
> Well reclaim is not a very fast process (and we usually try to avoid it
> as much as possible for our HPC). Essentially it's only there to allow
> shifts of processing loads and to allow efficient caching of
> application data.
>
> > However, short of providing user-level notifications for pinned pages
> > that are inadvertently released to the O/S, I don't believe that the
> > patchset provides any significant added value for the HPC community
> > that can't optimistically do RDMA demand paging.
>
> We currently also run XPmem with pinning. It's great as long as you
> just run one load on the system. No reclaim ever occurs.
>
> However, if you do things that require lots of allocations etc. then
> the page pinning can easily lead to livelock if reclaim is finally
> triggered, and also to strange OOM situations since the VM cannot free
> any pages. So the main issue that is addressed here is reliability of
> pinned page operations. Better VM integration avoids these issues
> because we can unpin on request to deal with memory shortages.

I have a question on the basic need for the mmu notifier stuff wrt RDMA
hardware and pinning memory. It seems that the need is to solve potential
memory shortage and overcommit issues by being able to reclaim pages
pinned by the RDMA driver/hardware. Is my understanding correct?

If I do understand correctly, then why is RDMA page pinning any different
from e.g. mlock pinning? I imagine Oracle pins lots of memory (using
mlock); how come they do not run into VM overcommit issues? Are we up
against some kind of breaking c-o-w issue here that is different between
mlock and RDMA pinning?

Asked another way, why should effort be spent on a notifier scheme,
rather than on fixing any memory accounting problems and unifying how
pinned pages are accounted for, whether they get pinned via mlock() or
RDMA drivers?

Startup benefits are well understood with the notifier scheme (i.e., not
all pages need to be faulted in at memory region creation time),
especially when most of the memory region is not accessed at all. I would
imagine most of HPC does not work this way though. Then again, as RDMA
hardware is applied (increasingly?) towards apps with short-lived
connections, the notifier scheme will help with startup times.

Kanoj
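For readers comparing the two pinning styles: RDMA drivers conventionally
pin user memory with get_user_pages(), which elevates each page's
refcount. A minimal sketch of that path, assuming the 2.6.24-era
get_user_pages() signature; pin_user_region() is a hypothetical helper
and error handling is elided:

	#include <linux/mm.h>
	#include <linux/sched.h>

	/*
	 * Sketch: pin npages of user memory starting at 'start'.  The
	 * elevated refcounts are exactly what makes these pages look
	 * "temporarily busy" (rather than permanently pinned) to the VM.
	 */
	static int pin_user_region(unsigned long start, int npages,
				   struct page **pages)
	{
		int got;

		down_read(&current->mm->mmap_sem);
		got = get_user_pages(current, current->mm, start, npages,
				     1 /* write */, 0 /* force */,
				     pages, NULL);
		up_read(&current->mm->mmap_sem);

		return (got == npages) ? 0 : -EFAULT;
	}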
Re: [ofa-general] Re: Demand paging for memory regions
--- Christoph Lameter <[EMAIL PROTECTED]> wrote:
> On Wed, 13 Feb 2008, Kanoj Sarcar wrote:
>
> > It seems that the need is to solve potential memory shortage and
> > overcommit issues by being able to reclaim pages pinned by the RDMA
> > driver/hardware. Is my understanding correct?
>
> Correct.
>
> > If I do understand correctly, then why is RDMA page pinning any
> > different from e.g. mlock pinning? I imagine Oracle pins lots of
> > memory (using mlock); how come they do not run into VM overcommit
> > issues?
>
> Mlocked pages are not pinned. They are movable by e.g. page migration
> and will potentially be moved by future memory defrag approaches.
> Currently we have the same issues with mlocked pages as with pinned
> pages. There is work in progress to put mlocked pages onto a different
> lru so that reclaim exempts these pages, and more work on limiting the
> percentage of memory that can be mlocked.
>
> > Are we up against some kind of breaking c-o-w issue here that is
> > different between mlock and RDMA pinning?
>
> Not that I know.
>
> > Asked another way, why should effort be spent on a notifier scheme,
> > rather than on fixing any memory accounting problems and unifying how
> > pinned pages are accounted for, whether they get pinned via mlock()
> > or RDMA drivers?
>
> There are efforts underway to account for and limit mlocked pages as
> described above. Page pinning the way it is done by InfiniBand, through
> increasing the page refcount, is treated by the VM as a temporary
> condition, not as a permanent pin. The VM will continually try to
> reclaim these pages, thinking that the temporary usage of the page must
> cease soon. This is why the use of large amounts of pinned pages can
> lead to livelock situations.

Oh ok, yes, I did see the discussion on this; sorry I missed it. I do see
what notifiers bring to the table now (without endorsing it :-)).

An orthogonal question is this: is IB/rdma the only "culprit" that
elevates page refcounts? Are there no other subsystems which do a similar
thing? The example I am thinking about is rawio (Oracle's mlock'ed SHM
regions are handed to rawio, isn't it?). My understanding of how rawio
works in Linux is quite dated though ...

Kanoj

> If we want to have pinning behavior then we could mark pinned pages
> specially so that the VM will not continually try to evict these pages.
> We could manage them similar to mlocked pages but just not allow page
> migration, memory unplug and defrag to occur on pinned memory. All of
> these would have to fail. With the notifier scheme the device driver
> could be told to get rid of the pinned memory. This would make these 3
> techniques work despite having an RDMA memory section.
>
> > Startup benefits are well understood with the notifier scheme (i.e.,
> > not all pages need to be faulted in at memory region creation time),
> > especially when most of the memory region is not accessed at all. I
> > would imagine most of HPC does not work this way though.
>
> No, for optimal performance you would want to prefault all pages like
> it is now. The notifier scheme would only become relevant in memory
> shortage situations.
>
> > Then again, as RDMA hardware is applied (increasingly?) towards apps
> > with short-lived connections, the notifier scheme will help with
> > startup times.
>
> The main use of the notifier scheme is for stability and reliability.
> The "pinned" pages become unpinnable on request by the VM. So the VM
> can work itself out of memory shortage situations in cooperation with
> the RDMA logic instead of simply failing.
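A hedged sketch of what "unpin on request" looks like from the driver
side. The callback and registration names follow the mmu notifier
interface roughly as it was later merged; rdma_drop_mappings() is a
hypothetical driver function:

	#include <linux/mmu_notifier.h>

	/* Called by the VM before it unmaps or reclaims pages in
	 * [start, end). */
	static void rdma_invalidate_range_start(struct mmu_notifier *mn,
						struct mm_struct *mm,
						unsigned long start,
						unsigned long end)
	{
		/*
		 * Tear down the device mappings covering this range and
		 * drop the page refcounts, so reclaim can make progress
		 * instead of livelocking against the pins.
		 */
		rdma_drop_mappings(mn, start, end);	/* driver-specific */
	}

	static const struct mmu_notifier_ops rdma_mn_ops = {
		.invalidate_range_start	= rdma_invalidate_range_start,
	};

	/* One notifier is registered per mm that the device has mapped. */
	static int rdma_register_mn(struct mmu_notifier *mn,
				    struct mm_struct *mm)
	{
		mn->ops = &rdma_mn_ops;
		return mmu_notifier_register(mn, mm);
	}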
Re: [Lse-tech] Re: a quest for a better scheduler
> On Wed, 4 Apr 2001, Hubertus Franke wrote:
>
> > Another point to raise is that the current scheduler does an
> > exhaustive search for the "best" task to run. It touches every
> > process in the runqueue. This is ok if the runqueue length is limited
> > to a very small multiple of the #cpus. [...]
>
> indeed. The current scheduler handles UP and SMP systems, up to 32
> (perhaps 64) CPUs efficiently. Aggressively NUMA systems need a
> different approach anyway in many other subsystems too, Kanoj is doing
> some scheduler work in that area.

Actually, not _much_ work has been done in this area. Along with a bunch
of other people, I have some ideas about what needs to be done. For
example, for NUMA, we need to try hard to schedule a thread on the node
that has most of its memory (for no reason other than to decrease memory
latency). Independently, some NUMA machines build in multilevel caches
and local snoops, which also means that specific processors on the same
node as the last_processor are also good candidates to run the process
next.

To handle a single layer of shared caches, I have tried certain simple
things, mostly as hacks, but am not pleased with the results yet. More
testing needed.

Kanoj

> but the original claim was that the scheduling of thousands of runnable
> processes (which is not equal to having thousands of sleeping
> processes) must perform well - which is a completely different issue.
>
> Ingo
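For reference, the "exhaustive search" being discussed is the selection
loop in 2.4's schedule(); a condensed paraphrase (not a verbatim excerpt)
of the relevant part:

	/*
	 * Every runnable task is examined and scored, so the cost of
	 * picking "next" grows linearly with runqueue length, no
	 * matter how many CPUs the machine has.
	 */
	next = idle_task(this_cpu);
	c = -1000;			/* best goodness seen so far */

	list_for_each(tmp, &runqueue_head) {
		p = list_entry(tmp, struct task_struct, run_list);
		if (can_schedule(p, this_cpu)) {
			int weight = goodness(p, this_cpu, prev->active_mm);
			if (weight > c)
				c = weight, next = p;
		}
	}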
Re: [Lse-tech] Re: a quest for a better scheduler
> I didn't see anything from Kanoj but I did something myself for the
> wildfire:
>
> ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1
>
> this is mostly an userspace issue, not really intended as a kernel
> optimization (however it's also partly a kernel optimization).
> Basically it splits the load of the numa machine into per-node load;
> there can be unbalanced load across the nodes but fairness is
> guaranteed inside each node. It's not extremely well tested but
> benchmarks were ok and it is at least certainly stable.

Just a quick comment. Andrea, unless your machine has some hardware that
implies pernode runqueues will help (nodelevel caches etc), I fail to
understand how this is helping you ... here's a simple theory though. If
your system is lightly loaded, your pernode queues are actually
implementing some sort of affinity, making sure processes stick to cpus
on nodes where they have allocated most of their memory. I am not sure
what the situation will be under huge loads though.

As I have mentioned to some people before, percpu/pernode/percpuset/global
runqueues probably all have their advantages and disadvantages, and their
own sweet spots. Wouldn't it be really neat if a system administrator or
performance expert could pick and choose what scheduler behavior he
wants, based on how the system is going to be used?

Kanoj
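To picture the per-node split, a hypothetical sketch (not the wildfire
patch itself); task_home_node() is an assumed helper returning the node
holding most of the task's memory, or the node it last ran on:

	#define MAX_NUMNODES 64

	static struct list_head node_runqueue[MAX_NUMNODES];

	/*
	 * Selection and fairness stay node-local; moving load between
	 * nodes becomes a separate, rarer balancing decision.
	 */
	static inline struct list_head *task_runqueue(struct task_struct *p)
	{
		return &node_runqueue[task_home_node(p)];
	}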
Re: [Lse-tech] Re: a quest for a better scheduler
> Kanoj, our cpu-pooling + loadbalancing allows you to do that.
> The system administrator can specify at runtime, through a
> /proc filesystem interface, the cpu-pool-size and whether loadbalancing
> should take place.

Yes, I think this approach can support the various requirements put on
the scheduler. I think there are two degrees of freedom that are needed
in the scheduler. One, as you say, for the sysadmin to be able to specify
what overall scheduler behavior he wants. Secondly, from the kernel
standpoint, there need to be perarch hooks, to be able to utilize
nodelevel/multilevel caches, NUMA aspects etc.

Kanoj
Re: [Lse-tech] Re: a quest for a better scheduler
> It helps by keeping the task in the same node if it cannot keep it in
> the same cpu anymore.
>
> Assume task A is sleeping and it last ran on cpu 8, node 2. It gets a
> wakeup and it gets running, and for some reason cpu 8 is busy and there
> are other cpus idle in the system. Now with the current scheduler it
> can be moved to any cpu in the system; with the numa sched applied we
> will first try to reschedule it on the idle cpus of node 2, for
> example. The per-node runqueues are mainly necessary to implement the
> heuristic.

Yes. But this is not the best solution, if I can add on to the example
and make some assumptions. Imagine that most of the program's memory is
on node 1, and it was scheduled on node 2 cpu 8 momentarily (maybe
because kswapd ran on node 1, other higher priority processes took over
other cpus on node 1, etc). Then, your patch will try to keep the process
on node 2, which is not necessarily the best solution. Of course, as I
mentioned before, if you have a node local cache on node 2, that cache
might have been warmed enough to make scheduling on node 2 a good option.
I am not saying there is a wrong or right answer; there are so many
possibilities, everything probably works and breaks under different
circumstances.

Btw, while we are swapping patches, the patch at

http://oss.sgi.com/projects/numa/download/sched242.patch

tries to implement per-arch scheduling. The current scheduler behavior is
smp_arch_goodness() and smp_pick_cpu(), but the patch allows the
possibility for a specific platform to change that to something else.
Linus has seen this patch, and agrees to it in principle. He does not
consider this 2.4 material though. Of course, I am completely open to
Ingo (or someone else) coming up with a different way of providing the
same freedom to arch specific code.

Kanoj
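The hook idea can be pictured as a simple indirection (an illustrative
sketch, not the sched242 patch itself): generic code calls overridable
names that default to the existing SMP implementations.

	#ifndef arch_goodness
	#define arch_goodness(p, cpu, mm)	smp_arch_goodness(p, cpu, mm)
	#endif

	#ifndef arch_pick_cpu
	#define arch_pick_cpu(p)		smp_pick_cpu(p)
	#endif

	/*
	 * A NUMA platform could then supply in its own headers, e.g.:
	 *	#define arch_goodness(p, cpu, mm)  numa_goodness(p, cpu, mm)
	 * to bias selection toward the node holding the task's memory.
	 */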
Re: [Lse-tech] Bug in sys_sched_yield
> George, while this is needed as pointed out in a previous message,
> due to non-contiguous physical IDs, I think the current usage is
> pretty bad (at least looking from an x86 perspective). Maybe somebody
> can chime in from a different architecture.
>
> I think that all data accesses, particularly to __aligned_data,
> should be performed through logical ids. There's a lot of remapping
> going on, due to the mix of logical and physical IDs.

I _think_ cpu_logical_map() can be deleted from the kernel, and all
places that use it can just use the [0 ... (smp_num_cpus-1)] number. This
is for the generic kernel code. The only place that should need to
convert from this number space to a "physical" space would be the
intercpu interrupt code (arch specific code). Only a handful of
architectures (mips64, sparc*, alpha) do array lookups for
cpu_logical_map() anyway; those probably can be changed to the x86
definition of cpu_logical_map().

Kanoj
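For comparison, the two styles of cpu_logical_map() under discussion look
roughly like this in 2.4-era headers (paraphrased):

	/* x86: logical IDs are already dense, so the map is the
	 * identity. */
	#define cpu_logical_map(cpu)	(cpu)

	/*
	 * mips64/sparc/alpha style: a real table translating the dense
	 * logical space [0 ... (smp_num_cpus-1)] to sparse physical IDs.
	 */
	extern int __cpu_logical_map[NR_CPUS];
	#define cpu_logical_map(cpu)	__cpu_logical_map[cpu]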
Re: x86 ptep_get_and_clear question
> [Added Linus and linux-kernel as I think it's of general interest]
>
> Kanoj Sarcar wrote:
> > Whether Jamie was trying to illustrate a different problem, I am not
> > sure.
>
> Yes, I was talking about pte_test_and_clear_dirty in the earlier post.
>
> > Look in mm/mprotect.c. Look at the call sequence change_protection()
> > -> ... change_pte_range(). Specifically at the sequence:
> >
> >	entry = ptep_get_and_clear(pte);
> >	set_pte(pte, pte_modify(entry, newprot));
> >
> > Go ahead and pull your x86 specs, and prove to me that between the
> > ptep_get_and_clear(), which zeroes out the pte (specifically, when
> > the dirty bit is not set), processor 2 can not come in and set the
> > dirty bit on the in-memory pte. Which immediately gets overwritten by
> > the set_pte(). For an example of how this can happen, look at my
> > previous postings.

Now you are talking my language!

> Let's see. We'll assume processor 2 does a write between the
> ptep_get_and_clear and the set_pte, which are done on processor 1.
>
> Now, ptep_get_and_clear is atomic, so we can talk about "before" and
> "after". Before it, either processor 2 has a TLB entry with the dirty
> bit set, or it does not (it has either a clean TLB entry or no TLB
> entry at all).
>
> After ptep_get_and_clear, processor 2 does a write. If it already has a
> dirty TLB entry, then `entry' will also be dirty so the dirty bit is
> preserved. If processor 2 does not have a dirty TLB entry, then it will
> look up the pte. Processor 2 finds the pte is clear, so raises a page
> fault. Spinlocks etc. sort everything out in the page fault.
>
> Here's the important part: when processor 2 wants to set the pte's
> dirty bit, it *rereads* the pte and *rechecks* the permission bits
> again. Even though it has a non-dirty TLB entry for that pte.
>
> That is how I read Ben LaHaise's description, and his test program
> tests exactly this.

Okay, I asked Ben, he couldn't point me at specs and shut me up.

> If the processor worked by atomically setting the dirty bit in the pte
> without rechecking the permissions when it reads that pte bit, then
> this scheme would fail and you'd be right about the lost dirty bits.

Exactly. This is why I did not implement this scheme earlier when Alan
and I talked about this scenario, almost a couple of years back.

> I would have thought it would be simpler to implement a CPU this way,
> but clearly it is not as efficient for SMP OS design, so perhaps CPU
> designers thought about this.
>
> The only remaining question is: is the observed behaviour defined for
> x86 CPUs in general, or are we depending on the results of testing a
> few particular CPUs?

Exactly! So my claim still stands: ptep_get_and_clear() doesn't do what
it claims to do. I would be more than happy if someone can give me logic
to break this claim ... which would mean one longstanding data integrity
problem on Linux has been fixed satisfactorily.

Kanoj

> -- Jamie
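Laid out side by side, the window Kanoj is pointing at is this (an
annotated timeline of the two lines quoted above, not new kernel code):

	/*
	 *  CPU 1 (mprotect path)          CPU 2 (clean, writable TLB entry)
	 *  ---------------------          ---------------------------------
	 *  entry = ptep_get_and_clear(pte);
	 *    in-memory pte is now 0
	 *                                 writes to the page; if the MMU
	 *                                 sets the dirty bit directly in
	 *                                 memory without re-walking the
	 *                                 now-clear pte, the bit lands in
	 *                                 the zeroed pte
	 *  set_pte(pte, pte_modify(entry, newprot));
	 *    overwrites the pte -- a dirty bit set in the window is lost
	 */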
Re: x86 ptep_get_and_clear question
> [Added Linus and linux-kernel as I think it's of general interest]
>
> Kanoj Sarcar wrote:
> > Whether Jamie was trying to illustrate a different problem, I am not
> > sure.
>
> Yes, I was talking about pte_test_and_clear_dirty in the earlier post.
>
> > Look in mm/mprotect.c. Look at the call sequence change_protection()
> > -> ... change_pte_range(). Specifically at the sequence:
> >
> >	entry = ptep_get_and_clear(pte);
> >	set_pte(pte, pte_modify(entry, newprot));
> >
> > Go ahead and pull your x86 specs, and prove to me that between the
> > ptep_get_and_clear(), which zeroes out the pte (specifically, when
> > the dirty bit is not set), processor 2 can not come in and set the
> > dirty bit on the in-memory pte. Which immediately gets overwritten by
> > the set_pte(). For an example of how this can happen, look at my
> > previous postings.
>
> Let's see. We'll assume processor 2 does a write between the
> ptep_get_and_clear and the set_pte, which are done on processor 1.
>
> Now, ptep_get_and_clear is atomic, so we can talk about "before" and
> "after". Before it, either processor 2 has a TLB entry with the dirty
> bit set, or it does not (it has either a clean TLB entry or no TLB
> entry at all).
>
> After ptep_get_and_clear, processor 2 does a write. If it already has a
> dirty TLB entry, then `entry' will also be dirty so the dirty bit is
> preserved. If processor 2 does not have a dirty TLB entry, then it will
> look up the pte. Processor 2 finds the pte is clear, so raises a page
> fault. Spinlocks etc. sort everything out in the page fault.
>
> Here's the important part: when processor 2 wants to set the pte's
> dirty bit, it *rereads* the pte and *rechecks* the permission bits
> again. Even though it has a non-dirty TLB entry for that pte.
>
> That is how I read Ben LaHaise's description, and his test program
> tests exactly this.

Okay, I will quote from Intel Architecture Software Developer's Manual
Volume 3: System Programming Guide (1997 print), section 3.7, page 3-27:

"Bus cycles to the page directory and page tables in memory are performed
only when the TLBs do not contain the translation information for a
requested page."

And on the same page:

"Whenever a page directory or page table entry is changed (including when
the present flag is set to zero), the operating system must immediately
invalidate the corresponding entry in the TLB so that it can be updated
the next time the entry is referenced."

So, it looks highly unlikely to me that the basic assumption about how
x86 works wrt tlb/ptes in the ptep_get_and_clear() solution is correct.

Kanoj

> If the processor worked by atomically setting the dirty bit in the pte
> without rechecking the permissions when it reads that pte bit, then
> this scheme would fail and you'd be right about the lost dirty bits. I
> would have thought it would be simpler to implement a CPU this way, but
> clearly it is not as efficient for SMP OS design so perhaps CPU
> designers thought about this.
>
> The only remaining question is: is the observed behaviour defined for
> x86 CPUs in general, or are we depending on the results of testing a
> few particular CPUs?
>
> -- Jamie
Re: x86 ptep_get_and_clear question
> Kanoj Sarcar wrote:
> > > Here's the important part: when processor 2 wants to set the pte's
> > > dirty bit, it *rereads* the pte and *rechecks* the permission bits
> > > again. Even though it has a non-dirty TLB entry for that pte.
> > >
> > > That is how I read Ben LaHaise's description, and his test program
> > > tests exactly this.
> >
> > Okay, I will quote from Intel Architecture Software Developer's
> > Manual Volume 3: System Programming Guide (1997 print), section 3.7,
> > page 3-27:
> >
> > "Bus cycles to the page directory and page tables in memory are
> > performed only when the TLBs do not contain the translation
> > information for a requested page."
> >
> > And on the same page:
> >
> > "Whenever a page directory or page table entry is changed (including
> > when the present flag is set to zero), the operating system must
> > immediately invalidate the corresponding entry in the TLB so that it
> > can be updated the next time the entry is referenced."
> >
> > So, it looks highly unlikely to me that the basic assumption about
> > how x86 works wrt tlb/ptes in the ptep_get_and_clear() solution is
> > correct.
>
> To me those quotes don't address the question we're asking. We know
> that bus cycles _do_ occur when a TLB entry is switched from clean to
> dirty, and furthermore they are locked cycles. (Don't ask me how I know
> this though).
>
> Does that mean, in jargon, the TLB does not "contain the translation
> information" for a write?
>
> The second quote: sure, if we want the TLB updated we have to flush it.
> And eventually in mm/mprotect.c we do. But what before, it keeps on
> using the old TLB entry? That's ok. If the entry was already dirty
> then we don't mind if processor 2 continues with the old TLB entry for
> a while, until we do the big TLB range flush.
>
> In other words I don't think those two quotes address our question at
> all.

Agreed. But these are the only relevant quotes I could come up with. And
to me, these quotes make the ptep_get_and_clear() assumption look risky
at best ... even though they do not give clear answers either way.

> What worries me more is that this is quite a subtle requirement, and
> the code in mm/mprotect.c is not specific to one architecture. Do all
> SMP CPUs supported by Linux do the same thing on converting TLB entries
> from clean to dirty, or do they have a subtle, easily missed data
> integrity problem?

No. All architectures do not have this problem. For example, if the Linux
"dirty" (not the pte dirty) bit is managed by software, a fault will
actually be taken when processor 2 tries to do the write. The fault is
solely to make sure that the Linux "dirty" bit can be tracked. As long as
the fault handler grabs the right locks before updating the Linux "dirty"
bit, things should be okay. This is the case with mips, for example.

The problem with x86 is that we depend on the automatic x86 dirty bit
update to manage the Linux "dirty" bit (they are the same!). So
appropriate locks are not grabbed.

Kanoj

> -- Jamie
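A sketch of the software-managed scheme described above (hypothetical,
mips-flavoured; the fault-dispatch plumbing around it is assumed): the
hardware never sets the dirty bit itself, so the first write to a clean
pte always faults, and the update happens under the page table lock.

	static void write_fault(struct mm_struct *mm,
				struct vm_area_struct *vma,
				pte_t *pte, unsigned long address)
	{
		spin_lock(&mm->page_table_lock);
		if (pte_present(*pte) && pte_write(*pte) &&
		    !pte_dirty(*pte)) {
			set_pte(pte, pte_mkdirty(*pte));
			/* only now install a TLB entry permitting the
			 * write */
			update_mmu_cache(vma, address, *pte);
		}
		spin_unlock(&mm->page_table_lock);
	}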
Re: x86 ptep_get_and_clear question
> Kanoj Sarcar wrote:
> > Okay, I will quote from Intel Architecture Software Developer's
> > Manual Volume 3: System Programming Guide (1997 print), section 3.7,
> > page 3-27:
> >
> > "Bus cycles to the page directory and page tables in memory are
> > performed only when the TLBs do not contain the translation
> > information for a requested page."
> >
> > And on the same page:
> >
> > "Whenever a page directory or page table entry is changed (including
> > when the present flag is set to zero), the operating system must
> > immediately invalidate the corresponding entry in the TLB so that it
> > can be updated the next time the entry is referenced."
>
> But there is another paragraph that mentions that an OS may use lazy
> tlb shootdowns.
> [search for shootdown]
>
> You check the far too obvious chapters; remember that Intel wrote the
> documentation ;-)

:-) :-) The good part is, there are a lot of Intel folks now active on
Linux; I can go off and ask one of them, if we are sufficiently confused.
I am trying to see whether we are.

> I searched for 'dirty' through Vol 3 and found
>
> Chapter 7.1.2.1 Automatic locking.
>
> .. the processor uses locked cycles to set the accessed and dirty flag
> in the page-directory and page-table entries.
>
> But that obviously doesn't answer your question.
>
> Is the sequence
>
>	<< lock;
>	read pte
>	pte |= dirty
>	write pte
>	>> end lock;
>
> or
>
>	<< lock;
>	read pte
>	if (!present(pte))
>		do_page_fault();
>	pte |= dirty
>	write pte
>	>> end lock;

No, it is a little more complicated. You also have to include the tlb
state in this algorithm, since that is what we are talking about.
Specifically, what does the processor do when it has a tlb entry allowing
RW, the processor has only done reads using the translation, and the
in-memory pte is clear?

Kanoj

> --
> Manfred
Re: x86 ptep_get_and_clear question
> On Thu, 15 Feb 2001, Kanoj Sarcar wrote:
>
> > No. All architectures do not have this problem. For example, if the
> > Linux "dirty" (not the pte dirty) bit is managed by software, a fault
> > will actually be taken when processor 2 tries to do the write. The
> > fault is solely to make sure that the Linux "dirty" bit can be
> > tracked. As long as the fault handler grabs the right locks before
> > updating the Linux "dirty" bit, things should be okay. This is the
> > case with mips, for example.
> >
> > The problem with x86 is that we depend on the automatic x86 dirty bit
> > update to manage the Linux "dirty" bit (they are the same!). So
> > appropriate locks are not grabbed.
>
> Will you please go off and prove that this "problem" exists on some x86
> processor before continuing this rant? None of the PII, PIII, Athlon,

And will you please stop behaving like this is not an issue?

> K6-2 or 486s I checked exhibited the worrisome behaviour you're

And I maintain that this kind of race condition can not be tickled
deterministically. There might be some piece of logic (or absence of it)
that can show that your finding of a thousand runs is not relevant.

> speculating about, plus it is logically consistent with the statements
> the manual does make about updating ptes; otherwise how could an smp os

Don't say this anymore, especially if you can not point me to the specs.

> perform a reliable shootdown by doing an atomic bit clear on the
> present bit of a pte?

OS clears present bit, processors can keep using their TLBs and access
the page, no problems at all. That is why after clearing the present bit,
the processor must flush all tlbs before it can assume no one is using
the page.

Hardware updated access bit could also be a problem, but an error there
does not destroy data; it just leads the os to choosing the wrong page to
evict during memory pressure.

Kanoj

> -ben
Linux/mips64 on 64 node, 128p, 64G machine
Hi,

Just a quick note to mention that I was successful in booting up a 64
node, 128p, 64G mips64 machine on a 2.4.1 based kernel.

To be able to handle the amount of io devices connected, I had to make
some fixes in the arch/mips64 code, and a few to handle 128 cpus. A
couple of generic patches needed to be made on top of 2.4.1 (obviously,
the prime one was that NR_CPUS had to be bumped to 128). I will clean the
patches up and send them in to Linus.

For some output, visit

http://oss.sgi.com/projects/LinuxScalability/download/mips128.out

I omitted the bootup messages, since they are similar (just a lot
longer!) to the 32p bootup messages at

http://oss.sgi.com/projects/LinuxScalability/download/mips64.out

Kanoj
Re: Stuck at 1GB again
> Some time ago, the list was very helpful in solving my programs
> failing at the limit of real memory rather than expanding into
> swap under linux 2.2.

I can't say what your actual problem is, but in previous experiments, I
have seen these as the main causes:

1. Shortage of real memory (ram + swap). I don't think this is your
problem.

2. Resource limit problems: some resource limits were defined as
"int/long" instead of "unsigned int/long", but these should have been
fixed.

3. Inability of malloc to find a contiguous range of virtual space in
userland: this depends on libraries used etc. that eat up chunks of the
user space. This might be your problem. (Hint: code a while(1) loop
before any malloc happens in your program, then use "cat /proc/pid/maps",
where pid is the pid of your running program, to see the user space
virtual address allocation; you might not see a contiguous 3Gb chunk for
malloc.)

Kanoj
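A minimal standalone version of the hint in point 3 (assuming nothing
beyond a C compiler and Linux /proc):

	/*
	 * Park the process so its virtual address layout can be read
	 * from /proc/<pid>/maps, revealing whether a contiguous hole
	 * large enough for the intended malloc actually exists.
	 */
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		printf("run: cat /proc/%d/maps\n", (int)getpid());
		fflush(stdout);
		for (;;)
			pause();	/* wait here, before any malloc */
		return 0;
	}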
Re: [PATCH] 2.4.0-test10-pre6 TLB flush race in establish_pte
> So while there may be a more elegant solution down the road, I would
> like to see the simple fix put back into 2.4. Here is the patch to
> essentially put the code back to the way it was before the S/390 merge.
> Patch is against 2.4.0-test10pre6.
>
> --- linux/mm/memory.c	Fri Oct 27 15:26:14 2000
> +++ linux-2.4.0-test10patch/mm/memory.c	Fri Oct 27 15:45:54 2000
> @@ -781,8 +781,8 @@
>   */
>  static inline void establish_pte(struct vm_area_struct * vma, unsigned long address, pte_t *page_table, pte_t entry)
>  {
> -	flush_tlb_page(vma, address);
>  	set_pte(page_table, entry);
> +	flush_tlb_page(vma, address);
>  	update_mmu_cache(vma, address, entry);
>  }

Great, let's do it. Definitely solves one race.

Kanoj
Re: Updated Linux 2.4 Status/TODO List (from the ALS show)
> On Thu, 12 Oct 2000, David S. Miller wrote:
>
> > page_table_lock is supposed to protect normal page table activity
> > (like what's done in the page fault handler) from swapping out.
> > However, grabbing this lock in swap-out code is completely missing!
>
> Audrey, vmlist_access_{un,}lock == unlocking/locking page_table_lock.
>
> Yeah, it's an easy mistake to make.
>
> I've made it myself - grepping for page_table_lock and coming up empty
> in places where I expected it to be.
>
> In fact, if somebody sends me patches to remove the
> "vmlist_access_lock()" stuff completely, and replace them with explicit
> page_table_lock things, I'll apply it pretty much immediately. I don't
> like information hiding, and right now that's the only thing that the
> vmlist_access_lock() stuff is doing.

Linus,

I came up with the vmlist_access_lock/vmlist_modify_lock names early in
2.3. The reasoning behind them was that in most places where the "vmlist
lock" was being taken, it was to protect the vmlist chain, vma_t fields
or mm_struct fields. The fact that implementation wise this lock could be
the same as page_table_lock was a good idea that you suggested.
Nevertheless, the name was chosen to indicate what type of things it was
guarding.

For example, in the future, you might actually have a different (possibly
sleeping) lock to guard the vma chain etc., but still have a spin lock
for the page_table_lock. (No, I don't want to be drawn into a discussion
of why this might be needed right now.) Some of this is mentioned in
Documentation/vm/locking. Just thought I would mention it, in case you
don't recollect some of this history. Of course, I understand the
"information hiding" part.

Kanoj