Hi Lipeng,
This patch tries to introduce an rwlock and split the read/write accesses to the unit_root tree and unit_cache, using the rwlock instead of the mutex to increase CPU efficiency. In the get_gfc_unit function, the percentage of calls that step into the insert_unit function is around 30%; in most instances, we can get the unit in the phase of reading the unit_cache or the unit_root tree. So splitting the read and write phases with an rwlock would be an approach to make it more parallel.
No comment on the code itself, as yet... but I'd like to know how thoroughly you tested it, using which tools, and on which programs. Did you use valgrind --tool=helgrind or --tool=drd? Since it is prone to race conditions, did you also test Fortran's asynchronous I/O?

Best regards

Thomas