In addition to the two patches, there are two more patches that I would like to get some feedback on.
These two patches are more radical: the 3rd addresses zone->lock contention on the free path by not doing any merging for order-0 pages, while the 4th addresses zone->lock contention on the allocation path by taking pcp->batch pages off the free_area order-0 list without iterating over the list. Both patches are based on the observation that "the most time consuming part of operations under zone->lock is cache misses on struct page". The 3rd patch may be controversial but does not have correctness problems; the 4th is at an early stage and serves only as a proof of concept. Your comments are appreciated, thanks.
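
To make the idea behind the 4th patch easier to picture, here is a minimal user-space sketch in C. It is not the patch itself and is not kernel code: struct fake_page, the cluster_tail field, add_cluster() and take_cluster() are hypothetical names used only for illustration, and the scheme shown (the first page of each batch remembers the last page of that batch) is an assumption about one way a batch could be detached in constant time. The point is that take_cluster() touches only the two end pages of the batch and their list neighbours, so the pages in the middle cause no cache misses while the lock would be held.

```c
#include <stddef.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *prev, *next;
};

/* Hypothetical stand-in for struct page. */
struct fake_page {
	struct list_node lru;
	/* Assumed field: valid only on the first page of a cluster. */
	struct fake_page *cluster_tail;
};

#define page_of(node) \
	((struct fake_page *)((char *)(node) - offsetof(struct fake_page, lru)))

static void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

static void list_add_tail_node(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Free side: append n pages as one cluster and record where it ends. */
static void add_cluster(struct list_node *head, struct fake_page *pages, size_t n)
{
	size_t i;

	if (n == 0)
		return;
	for (i = 0; i < n; i++)
		list_add_tail_node(&pages[i].lru, head);
	pages[0].cluster_tail = &pages[n - 1];
}

/*
 * Allocation side: move the first cluster from src to the tail of dst.
 * Only the first and last page of the cluster (plus list neighbours)
 * are touched; the pages in between are never dereferenced.
 * Returns 1 if a cluster was moved, 0 if src was empty.
 */
static int take_cluster(struct list_node *src, struct list_node *dst)
{
	struct list_node *first = src->next;
	struct list_node *last;

	if (first == src)
		return 0;
	last = &page_of(first)->cluster_tail->lru;

	/* Unlink [first, last] from src. */
	src->next = last->next;
	last->next->prev = src;

	/* Splice [first, last] onto the tail of dst. */
	first->prev = dst->prev;
	dst->prev->next = first;
	last->next = dst;
	dst->prev = last;
	return 1;
}
```

In this sketch the free side still walks the pages when building a cluster; only the allocation side avoids the walk.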