This page describes adding support for Async Monitor Deflation to OpenJDK. The primary goal of this project is to reduce the time spent in safepoint cleanup operations.
RFE: 8153224 Monitor deflation prolong safepoints
https://bugs.openjdk.java.net/browse/JDK-8153224
Full Webrev: 16-for-jdk15+24.v2.13.full
Inc Webrev: 16-for-jdk15+24.v2.13.inc
This patch for Async Monitor Deflation is based on Carsten Varming's
http://cr.openjdk.java.net/~cvarming/monitor_deflate_conc/0/
which has been ported to work with monitor lists. Monitor lists were optional via the '-XX:+MonitorInUseLists' option in JDK8, the option became default 'true' in JDK9, the option became deprecated in JDK10 via JDK-8180768, and the option became obsolete in JDK12 via JDK-8211384. Carsten's webrev is based on JDK10 so there was a bit of porting work needed to merge his code and/or algorithms with jdk/jdk.
Carsten also submitted a JEP back in the JDK10 time frame:
JDK-8183909 Concurrent Monitor Deflation
https://bugs.openjdk.java.net/browse/JDK-8183909
The OpenJDK JEP process has evolved a bit since JDK10 and a JEP is no longer required for a project that is well defined to be within one area of responsibility. Async Monitor Deflation is clearly defined to be in the JVM Runtime team's area of responsibility so it is likely that the JEP (JDK-8183909) will be withdrawn and the work will proceed via the RFE (JDK-8153224).
The current idle monitor deflation mechanism executes at a safepoint during cleanup operations. Due to this execution environment, the current mechanism does not have to worry about interference from concurrently executing JavaThreads. Async Monitor Deflation uses the ServiceThread to deflate idle monitors so the new mechanism has to detect interference and adapt as appropriate. In other words, data races are a natural part of Async Monitor Deflation and the algorithms have to detect the races and react without data loss or corruption.
ObjectSynchronizer::deflate_monitor_using_JT() is the new counterpart to ObjectSynchronizer::deflate_monitor() and does the heavy lifting of asynchronously deflating a monitor using a three-part protocol:
If we lose any of the races, the monitor cannot be deflated at this time.
Once we know it is safe to deflate the monitor (which is mostly field resetting and monitor list management), we have to restore the object's header. That's another racy operation that is described below in "Restoring the Header With Interference Detection".
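To make that concrete, here is a minimal sketch of the three-part protocol written in the same abstract style as the race diagrams below. The function name and bare field accesses are illustrative, not quotes from the code; the real ObjectSynchronizer::deflate_monitor_using_JT() performs additional checks (for example, for waiters) plus the field resetting and monitor list management mentioned above.
// Hedged sketch only; abstract style matching the race diagrams below.
bool deflate_monitor_using_JT_sketch(ObjectMonitor* m) {
  // Part 1: claim an unowned monitor by installing the sentinel owner value.
  if (cmpxchg(DEFLATER_MARKER, &m->owner, NULL) != NULL) {
    return false;  // The monitor is (or just became) owned so it is not idle.
  }
  // Part 2: force a zero contentions field negative so that racing enter()
  // calls notice the in-progress deflation and retry.
  int prev = cmpxchg(-max_jint, &m->contentions, 0);
  // Part 3: we only win if contentions was still zero and no contending
  // thread has overwritten the sentinel owner value in the meantime.
  if (prev == 0 && m->owner == DEFLATER_MARKER) {
    // Won all the races: restore the object's header and finish the
    // deflation (field resetting and monitor list management, elided here).
    m->install_displaced_markword_in_object(m->object());
    return true;
  }
  // Lost a race: undo whatever we changed and bail out on this deflation.
  cmpxchg(NULL, &m->owner, DEFLATER_MARKER);   // undo Part 1 (if still ours)
  if (prev == 0) {
    atomic_add(&m->contentions, max_jint);     // undo Part 2
  }
  return false;
}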
ObjectMonitor::install_displaced_markword_in_object() is the new piece of code that handles all the racy situations with restoring an object's header asynchronously. The function is called from three places (deflation, ObjectMonitor::enter(), and FastHashCode). Only one of the possible racing scenarios can win and the losing scenarios all adapt to the winning scenario's object header value.
Various code paths have been updated to recognize an owner field equal to DEFLATER_MARKER or a negative contentions field and those code paths will retry their operation. This is the shortest "Key Part" description, but don't be fooled. See "Gory Details" below.
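As a hedged illustration of that recognition-and-retry pattern (not a quote from any single function; the real code paths pair the check with an operation-specific guard such as the contentions increment shown in the next section):
// Illustrative only: detect an in-progress async deflation and retry.
while (true) {
  ObjectMonitor* monitor = inflate(self, obj, cause);  // (re)inflate as needed
  if (monitor->owner() == DEFLATER_MARKER || monitor->contentions() < 0) {
    // T-deflate is working on (or has finished with) this monitor, so any
    // work done on it could be lost; loop around and get a fresh monitor.
    continue;
  }
  // ... proceed with the operation on 'monitor' ...
  break;
}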
ObjectMonitor::enter() can change an idle monitor into a busy monitor. ObjectSynchronizer::deflate_monitor_using_JT() is used to asynchronously deflate an idle monitor. enter() and deflate_monitor_using_JT() can interfere with each other. The thread calling enter() (T-enter) is potentially racing with another JavaThread (T-deflate) so both threads have to check the results of the races.
T-enter ObjectMonitor T-deflate
------------------------ +-----------------------+ ----------------------------------------
enter() { | owner=NULL | deflate_monitor_using_JT() {
1> atomic inc contentions | contentions=0 | 1> cmpxchg(DEFLATER_MARKER, &owner, NULL)
+-----------------------+
T-enter ObjectMonitor T-deflate
------------------------ +-----------------------+ --------------------------------------------
enter() { | owner=DEFLATER_MARKER | deflate_monitor_using_JT() {
1> atomic inc contentions | contentions=0 | cmpxchg(DEFLATER_MARKER, &owner, NULL)
+-----------------------+ :
1> prev = cmpxchg(-max_jint, &contentions, 0)
T-enter ObjectMonitor T-deflate
--------------------------------- +-------------------------+ --------------------------------------------
enter() { | owner=DEFLATER_MARKER | deflate_monitor_using_JT() {
atomic inc contentions | contentions=-max_jint+1 | cmpxchg(DEFLATER_MARKER, &owner, NULL)
1> if (owner == DEFLATER_MARKER && +-------------------------+ :
contentions <= 0) { || prev = cmpxchg(-max_jint, &contentions, 0)
restore obj header \/ 1> if (prev == 0 &&
atomic dec contentions +-------------------------+ owner == DEFLATER_MARKER) {
2> return false to force retry | owner=DEFLATER_MARKER | restore obj header
} | contentions=-max_jint | 2> finish the deflation
+-------------------------+ }
T-enter ObjectMonitor T-deflate
--------------------------------- +-------------------------+ --------------------------------------------
enter() { | owner=DEFLATER_MARKER. | deflate_monitor_using_JT() {
atomic inc contentions | contentions=1 | cmpxchg(DEFLATER_MARKER, &owner, NULL)
1> if (owner == DEFLATER_MARKER && +-------------------------+ :
contentions <= 0) { || prev = cmpxchg(-max_jint, &contentions, 0)
} \/ 1> if (prev == 0 &&
2> <continue contended enter> +-------------------------+ owner == DEFLATER_MARKER) {
| owner=NULL | } else {
| contentions=1 | cmpxchg(NULL, &owner, DEFLATER_MARKER)
+-------------------------+ 2> return
T-enter ObjectMonitor T-deflate
-------------------------------------------- +-------------------------+ --------------------------------------------
ObjectMonitor::enter() { | owner=DEFLATER_MARKER | deflate_monitor_using_JT() {
increment contentions | contentions=1 | cmpxchg(DEFLATER_MARKER, &owner, NULL)
1> EnterI() { +-------------------------+ 1> :
if (owner == DEFLATER_MARKER && || 2> : <thread_stalls>
cmpxchg(Self, &owner, \/ :
DEFLATER_MARKER) +-------------------------+ :
== DEFLATER_MARKER) { | owner=Self/T-enter | :
// EnterI is done | contentions=0 | : <thread_resumes>
return +-------------------------+ prev = cmpxchg(-max_jint, &contentions, 0)
} || if (prev == 0 &&
decrement contentions \/ 3> owner == DEFLATER_MARKER) {
} // enter() is done +-------------------------+ } else {
2> : <does app work> | owner=Self/T-enter|NULL | cmpxchg(NULL, &owner, DEFLATER_MARKER)
3> : | contentions=-max_jint | atomic add max_jint to contentions
exit() monitor +-------------------------+ 4> bailout on deflation
4> owner = NULL || }
\/
+-------------------------+
| owner=Self/T-enter|NULL |
| contentions=0 |
+-------------------------+
If the T-enter thread has managed to enter the monitor during the T-deflate stall (but has not yet exited it), then our owner field A-B-A transition is:
NULL → DEFLATER_MARKER → Self/T-enter
so we really have A1-B-A2, but the A-B-A principle still holds.
If the T-enter thread has managed to enter and exit the monitor during the T-deflate stall, then our owner field A-B-A transition is:
NULL → DEFLATER_MARKER → Self/T-enter → NULL
so we really have A-B1-B2-A, but the A-B-A principle still holds.
T-enter finished doing app work and is about to exit the monitor (or it has already exited the monitor).
The fourth ObjectMonitor box is showing the fields at this point and the "4>" markers are showing where each thread is at for that ObjectMonitor box.
After T-deflate has won the race for deflating an ObjectMonitor it has to restore the header in the associated object. Of course another thread can be trying to do something to the object's header at the same time. Isn't asynchronous work exciting?!?!
ObjectMonitor::install_displaced_markword_in_object() is called from two places so we can have a race between a T-enter thread and a T-deflate thread:
T-enter object T-deflate
----------------------------------------------- +-------------+ -----------------------------------------------
install_displaced_markword_in_object(oop obj) { | mark=om_ptr | install_displaced_markword_in_object(oop obj) {
dmw = header() +-------------+ dmw = header()
obj->cas_set_mark(dmw, this) obj->cas_set_mark(dmw, this) }
T-enter object T-deflate
----------------------------------------------- +-------------+ -----------------------------------------------
install_displaced_markword_in_object(oop obj) { | mark=dmw | install_displaced_markword_in_object(oop obj) {
dmw = header() +-------------+ dmw = header()
obj->cas_set_mark(dmw, this) obj->cas_set_mark(dmw, this)
Please notice that install_displaced_markword_in_object() does not do any retries on any code path; a sketch of that single-attempt update follows:
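In the abstract style of the diagrams above, the whole function boils down to a single CAS attempt (a simplified sketch; the real member function also deals with hashcodes in the displaced mark word and has many sanity checks):
// Simplified sketch, abstract style: one CAS, no retry loop.
void ObjectMonitor::install_displaced_markword_in_object(oop obj) {
  dmw = header();  // displaced mark word saved when the monitor was inflated
  // Either we install dmw over our own ObjectMonitor* mark, or a racing
  // thread (T-deflate, T-enter or T-hash) has already restored the header
  // and this CAS simply fails. Both outcomes leave the object with the
  // winning header value, so no retry is needed.
  obj->cas_set_mark(dmw, this);
}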
There are a few races that can occur between a T-deflate thread and a thread trying to get/set a hashcode (T-hash) in an ObjectMonitor:
The common fall thru code path (executed by T-hash) that inflates the ObjectMonitor in order to set the hashcode can race with an async deflation (T-deflate). After the hashcode has been stored in the ObjectMonitor, we (T-hash) check if the ObjectMonitor has been async deflated (by T-deflate). If it has, then we (T-hash) retry because we don't know if the hashcode was stored in the ObjectMonitor before the object's header was restored (by T-deflate). Retrying (by T-hash) will result in the hashcode being stored in either the object's header or in the re-inflated ObjectMonitor's header as appropriate.
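A hedged sketch of that T-hash retry, using illustrative accessor names (the real logic lives in ObjectSynchronizer::FastHashCode()):
// Sketch of the inflate-store-check-retry pattern used by T-hash.
while (true) {
  ObjectMonitor* monitor = inflate(self, obj, inflate_cause_hash_code);
  hash = get_next_hash(self, obj);
  monitor->set_header(monitor->header().copy_set_hash(hash));  // store hash in the monitor
  if (monitor->owner() == DEFLATER_MARKER || monitor->contentions() < 0) {
    // T-deflate may have restored the object's header before our store
    // landed in the monitor's header, so the hash could be lost; retry.
    // On retry the hash ends up in the object's header or in the
    // re-inflated monitor's header, as appropriate.
    continue;
  }
  return hash;
}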
Use of specialized measurement code with the CR5/v2.05/8-for-jdk13 bits revealed that the gListLock contention is responsible for much of the performance degradation observed with SPECjbb2015. Consequently, the primary focus of the next round of changes is/was on switching from coarse-grained Thread::muxAcquire(&gListLock) and Thread::muxRelease(&gListLock) pairs to spin-lock monitor list management. Of course, since the Java Monitor subsystem is full of special cases, the spin-lock list management code has to have a number of special cases which are described here.
The Spin-Lock Monitor List management code was pushed to JDK15 using the following bug id:
JDK-8235795 replace monitor list mux{Acquire,Release}(&gListLock) with spin locks
The Async Monitor Deflation project makes a few additional changes on top of what was pushed via JDK-8235795.
There is one simple case of spin-lock list management with the Java Monitor subsystem so we'll start with that code as a way to introduce the spin-lock concepts:
L1: while (true) {
L2: PaddedObjectMonitor* cur = Atomic::load(&g_block_list);
L3: Atomic::store(&new_blk[0]._next_om, cur);
L4: if (Atomic::cmpxchg(&g_block_list, cur, new_blk) == cur) {
L5: Atomic::add(&om_list_globals.population, _BLOCKSIZE - 1);
L6: break;
L7: }
L8: }
What the above block of code does is:
The above block of code can be called by multiple threads in parallel and must not lose track of any blocks. Of course, the "must not lose track of any blocks" part is where all the details come in:
At the point that cmpxchg has published the new 'g_block_list' value, 'new_blk' is now the first block in the list and the 0th element's next field is used to find the previous first block; all of the monitor list blocks are chained together via the next field in the block's 0th element. It is the use of cmpxchg to update 'g_block_list' and the checking of the return value from cmpxchg that ensures that we don't lose track of any blocks.
This example is considered to be the "simple case" because we only prepend to the list (no deletes) and we only use:
to achieve the safe update of the 'g_block_list' value; the atomic increment of the 'om_list_globals.population' counter is considered to be just accounting (pun intended).
The concepts introduced here are:
Note: The above code snippet comes from ObjectSynchronizer::prepend_block_to_lists(); see that function for more complete context (and comments).
Note: This subsection is talking about "Simple Take" and "Simple Prepend" in abstract terms. The purpose of this code and A-B-A example is to introduce the race concepts. The code shown here is not an exact match for the project code and the specific A-B-A example is not (currently) found in the project code.
The left hand column shows "T1" taking a node "A" from the front of the list and it shows the simple code that does that operation. The right hand column shows "T2" prepending a node "B" to the front of the list and it shows the simple code that does that operation. We have a third thread, "T3", that does a take followed by a prepend, but we don't show a column for "T3". Instead we have a column in the middle that shows the results of the interleaved operations of all three threads:
T1: Simple Take: | | T2: Simple Prepend:
---------------- | T1 and T3 see this initial list: | -------------------
+---+ +---+ +---+ | +---+ +---+ +---+ | +---+ +---+
head -> | A | -> | X | -> | Y | | head -> | A | -> | X | -> | Y | | head -> | X | -> | Y |
+---+ +---+ +---+ | +---+ +---+ +---+ | +---+ +---+
| T3 takes "A", T2 sees this list: |
Take a node || | +---+ +---+ | Prepend a node ||
from the front || | head -> | X | -> | Y | | to the front ||
of the list || | +---+ +---+ | of the list ||
\/ | T2 prepends "B": | \/
| +---+ +---+ +---+ |
+---+ +---+ | head -> | B | -> | X | -> | Y | | +---+ +---+ +---+
head -> | X | -> | Y | | +---+ +---+ +---+ | head -> | B | -> | X | -> | Y |
+---+ +---+ | T3 prepends "A": | +---+ +---+ +---+
+---+ | +---+ +---+ +---+ +---+ |
cur -> | A | | head -> | A | -> | B | -> | X | -> | Y | |
+---+ | +---+ +---+ +---+ +---+ |
| T1 takes "A", loses "B": |
// "take" a node: | +---+ | // "prepend" a node:
while (true) { | | B | ----+ | while (true) {
cur = head; | +---+ | | cur = head;
next = cur->next; | V | new->next = cur;
if (cmpxchg(next, &head, cur) == cur) { | +---+ +---+ | if (cmpxchg(new, &head, cur) == cur) {
break; // success changing head | head -> | X | -> | Y | | break; // success changing head
} | +---+ +---+ | }
} | +---+ | }
return cur; | cur -> | A | |
| +---+ |
The "Simple Take" and "Simple Prepend" algorithms are just fine by themselves. The "Simple Prepend" algorithm is almost identical to the algorithm in the "The Simple Case" and just like that algorithm, it works fine if we are only doing prepend operations on the list. Similarly, the "Simple Take" algorithm works just fine if we are only doing take operations on the list; the only thing missing is an empty list check, but that would have clouded the example.
When we allow simultaneous take and prepend operations on the same list, the simple algorithms are exposed to A-B-A races. An A-B-A race is a situation where the head of the list can change from node "A" to node "B" and back to node "A" again without the simple algorithm being aware that critical state has changed. In the middle column of the above diagram, we show what happens when T3 causes the head of the list to change from node "A" to node "B" (a take operation) and back to node "A" (a prepend operation). That A-B-A race causes T1 to lose node "B" when it updates the list head to node "X" instead of node "B" because T1 was unaware that its local 'next' value was stale.
Here's the diagram again with the code in T1 and T2 lined up with the effects of the A-B-A race executed by T3:
T1: Simple Take: | | T2: Simple Prepend:
---------------- | T1 and T3 see this initial list: | -------------------
while (true) { | +---+ +---+ +---+ | :
cur = head; | head -> | A | -> | X | -> | Y | | :
next = cur->next; | +---+ +---+ +---+ | :
: | T3 takes "A", T2 sees this list: | :
: | +---+ +---+ | :
: | head -> | X | -> | Y | | :
: | +---+ +---+ | while (true) {
: | T2 prepends "B": | cur = head;
: | +---+ +---+ +---+ | new->next = cur;
: | head -> | B | -> | X | -> | Y | | if (cmpxchg(new, &head, cur) == cur) {
: | +---+ +---+ +---+ | break;
: | T3 prepends "A": | }
: | +---+ +---+ +---+ +---+ | }
: | head -> | A | -> | B | -> | X | -> | Y | |
: | +---+ +---+ +---+ +---+ |
: | T1 takes "A", loses "B": |
: | +---+ |
: | | B | ----+ |
: | +---+ | |
: | V |
: | +---+ +---+ |
if (cmpxchg(next, &head, cur) == cur) { | head -> | X | -> | Y | |
} | +---+ +---+ |
} | +---+ |
return cur; | cur -> | A | |
| +---+ |
So the simple algorithms are not sufficient when we allow simultaneous take and prepend operations.
Note: This subsection is talking about "Spin-Locking" as a solution to the A-B-A race in abstract terms. The purpose of this spin-locking code and A-B-A example is to introduce the solution concepts. The code shown here is not an exact match for the project code.
One solution to the A-B-A race is to spin-lock the next field in a node to indicate that the node is busy. Only one thread can successfully spin-lock the next field in a node at a time and other threads must loop around and retry their spin-locking operation until they succeed. Each thread that spin-locks the next field in a node must unlock the next field when it is done with the node so that other threads can proceed.
Here's the take algorithm modified with spin-locking (still ignores the empty list for clarity):
// "take" a node with locking:
while (true) {
cur = head;
if (!try_om_lock(cur)) {
// could not lock cur so try again
continue;
}
if (head != cur) {
// head changed while locking cur so try again
om_unlock(cur);
continue;
}
next = unmarked_next(cur);
// list head is now locked so switch it to next which also makes list head unlocked
Atomic::store(&head, next);
om_unlock(cur); // unlock cur and return it
return cur;
}
The modified take algorithm does not change the list head pointer until it has successfully locked the list head node. Notice that after we lock the list head node we have to verify that the list head pointer hasn't changed in the mean time. Only after we have verified that the node we locked is still the list head is it safe to modify the list head pointer. The locking of the list head prevents the take algorithm from executing in parallel with a prepend algorithm and losing a node.
Also notice that we update the list head pointer with store instead of with cmpxchg. Since we have the list head locked, we are not racing with other threads to change the list head pointer so we can use a simple store instead of the heavy cmpxchg hammer.
Here's the prepend algorithm modified with locking (ignores the empty list for clarity):
// "prepend" a node with locking:
while (true) {
cur = head;
if (!try_om_lock(cur)) {
// could not lock cur so try again
continue;
}
if (head != cur) {
// head changed while locking cur so try again
om_unlock(cur);
continue;
}
new->next = cur;  // the new node now points to the locked list head
// list head is now locked so switch it to 'new' which also makes list head unlocked
Atomic::store(&head, new);
om_unlock(cur);  // unlock the previous list head
break;  // success changing head
}
The modified prepend algorithm does not change the list head pointer until it has successfully locked the list head node. Notice that after we lock the list head node we have to verify that the list head pointer hasn't changed in the mean time. Only after we have verified that the node we locked is still the list head is it safe to modify the list head pointer. The locking of the list head prevents the prepend algorithm from executing in parallel with the take algorithm and losing a node.
Also notice that we update the list head pointer with store instead of with cmpxchg for the same reasons as the previous algorithm.
The purpose of this subsection is to provide background information about how ObjectMonitors move between the various lists. This project changes the way these movements are implemented, but does not change the movements themselves. For example, newly allocated blocks of ObjectMonitors are always prepended to the global free list; this is true in the baseline and is true in this project. One exception is the addition of the global wait list (see below).
ObjectMonitors are deflated at a safepoint by:
ObjectSynchronizer::deflate_monitor_list() calling ObjectSynchronizer::deflate_monitor()
And when Async Monitor Deflation is enabled, they are deflated by:
ObjectSynchronizer::deflate_monitor_list_using_JT() calling ObjectSynchronizer::deflate_monitor_using_JT()
Idle ObjectMonitors are deflated by the ServiceThread when Async Monitor Deflation is enabled. They can also be deflated at a safepoint by the VMThread or by a task worker thread. Safepoint deflation is used when Async Monitor Deflation is disabled or when there is a special deflation request, e.g., System.gc().
An idle ObjectMonitor is deflated and extracted from its in-use list and prepended to the global wait list. The in-use list can be either the global in-use list or a per-thread in-use list. Deflated ObjectMonitors are always prepended to the global wait list.
It is now time to switch from algorithms to real snippets from the code.
The next case to consider for spin-lock list management with the Java Monitor subsystem is prepending to a list that also allows deletes. As you might imagine, the possibility of a prepend racing with a delete makes things more complicated. The solution is to lock the next field in the ObjectMonitor at the head of the list we're trying to prepend to. A successful lock tells other prependers or deleters that the locked ObjectMonitor is busy and they will need to retry their own lock operation.
L01: while (true) {
L02: om_lock(m); // Lock m so we can safely update its next field.
L03: ObjectMonitor* cur = NULL;
L04: // Lock the list head to guard against A-B-A race:
L05: if ((cur = get_list_head_locked(list_p)) != NULL) {
L06: // List head is now locked so we can safely switch it.
L07: m->set_next_om(cur); // m now points to cur (and unlocks m)
L08: Atomic::store(list_p, m); // Switch list head to unlocked m.
L09: om_unlock(cur);
L10: break;
L11: }
L12: // The list is empty so try to set the list head.
L13: assert(cur == NULL, "cur must be NULL: cur=" INTPTR_FORMAT, p2i(cur));
L14: m->set_next_om(cur); // m now points to NULL (and unlocks m)
L15: if (Atomic::cmpxchg(list_p, cur, m) == cur) {
L16: // List head is now unlocked m.
L17: break;
L18: }
L19: // Implied else: try it all again
L20: }
L21: Atomic::inc(count_p);
What the above block of code does is:
The above block of code can be called by multiple prependers in parallel or with deleters running in parallel and must not lose track of any ObjectMonitor. Of course, the "must not lose track of any ObjectMonitor" part is where all the details come in:
ObjectMonitor 'm' is safely on the list at the point that we have updated 'list_p' to refer to 'm'. In this subsection's block of code, we also called three new functions: om_lock(), get_list_head_locked() and set_next_om(), that are explained in the next few subsections about helper functions.
Note: The above code snippet comes from prepend_to_common(); see that function for more context and a few more comments.
Managing spin-locks on ObjectMonitors has been abstracted into a few helper functions. try_om_lock() is the first interesting one:
L1: static bool try_om_lock(ObjectMonitor* om) {
L2: // Get current next field without any OM_LOCK_BIT value.
L3: ObjectMonitor* next = unmarked_next(om);
L4: if (om->try_set_next_om(next, mark_om_ptr(next)) != next) {
L5: return false; // Cannot lock the ObjectMonitor.
L6: }
L7: return true;
L8: }
The above function tries to lock the ObjectMonitor:
The function can be called by multiple threads at the same time and only one thread will succeed in the locking operation (return == true) and all other threads will get return == false. Of course, the "only one thread will succeed" part is where all the details come in:
The try_om_lock() function calls another helper function, mark_om_ptr(), that needs a quick explanation:
L1: static ObjectMonitor* mark_om_ptr(ObjectMonitor* om) {
L2: return (ObjectMonitor*)((intptr_t)om | OM_LOCK_BIT);
L3: }
This function encapsulates the setting of the locking bit in an ObjectMonitor* for the purpose of hiding the details and making the calling code easier to read:
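Two tiny counterparts of mark_om_ptr() are used throughout the snippets on this page but are not quoted elsewhere. They are roughly as follows (a sketch, assuming OM_LOCK_BIT is the low bit of the next field):
static bool is_locked(ObjectMonitor* om) {
  // The next field carries OM_LOCK_BIT while the ObjectMonitor is locked.
  return ((intptr_t)om->next_om() & OM_LOCK_BIT) == OM_LOCK_BIT;
}

static ObjectMonitor* unmarked_next(ObjectMonitor* om) {
  // Return the next field with any OM_LOCK_BIT value stripped off.
  return (ObjectMonitor*)((intptr_t)om->next_om() & ~OM_LOCK_BIT);
}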
set_next_om() is the next interesting function and it also only needs a quick explanation:
L1: inline void ObjectMonitor::set_next_om(ObjectMonitor* value) {
L2: Atomic::store(&_next_om, value);
L3: }
This function encapsulates the setting of the next field in an ObjectMonitor for the purpose of hiding the details and making the calling code easier to read:
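set_next_om() has a cmpxchg-based sibling, try_set_next_om(), which try_om_lock() used above; it is roughly the following (sketch):
inline ObjectMonitor* ObjectMonitor::try_set_next_om(ObjectMonitor* old_value,
                                                     ObjectMonitor* new_value) {
  // Only update the next field if it still contains old_value; return the
  // witnessed value so the caller can tell whether the update happened.
  return Atomic::cmpxchg(&_next_om, old_value, new_value);
}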
om_lock() is the next interesting helper function:
L1: static void om_lock(ObjectMonitor* om) {
L2: while (true) {
L3: if (try_om_lock(om)) {
L4: return;
L5: }
L6: }
L7: }
The above function loops until it locks the target ObjectMonitor. There is nothing particularly special about this function so we don't need any line specific annotations.
Debugging Tip: If there's a bug where an ObjectMonitor's next field is not properly unlocked, then this function will loop forever and the caller will be stuck.
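om_lock()'s counterpart, om_unlock(), is not quoted elsewhere on this page; it is roughly the following (sketch):
static void om_unlock(ObjectMonitor* om) {
  ObjectMonitor* next = om->next_om();
  // The next field must currently carry OM_LOCK_BIT or the caller never
  // locked this ObjectMonitor (a sanity check in the real code).
  next = (ObjectMonitor*)((intptr_t)next & ~OM_LOCK_BIT);  // Clear OM_LOCK_BIT.
  om->set_next_om(next);  // Store the unlocked next field, releasing the lock.
}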
get_list_head_locked() is the next interesting helper function:
L01: static ObjectMonitor* get_list_head_locked(ObjectMonitor** list_p) {
L02: while (true) {
L03: ObjectMonitor* mid = Atomic::load(list_p);
L04: if (mid == NULL) {
L05: return NULL; // The list is empty.
L06: }
L07: if (try_om_lock(mid)) {
L08: if (Atomic::load(list_p) != mid) {
L09: // The list head changed so we have to retry.
L10: om_unlock(mid);
L11: continue;
L12: }
L13: return mid;
L14: }
L15: }
L16: }
The above function tries to lock the list head's ObjectMonitor:
The function can be called by more than one thread on the same 'list_p' at a time. NULL is only returned when 'list_p' refers to an empty list; otherwise only one thread at a time will return a (locked) ObjectMonitor*. Since the ObjectMonitor is locked, any parallel callers to get_list_head_locked() will loop until the list head's ObjectMonitor is no longer locked. That typically happens when the list head's ObjectMonitor is taken off the list and 'list_p' is advanced to the next ObjectMonitor on the list. Of course, making sure that only one thread at a time returns the list head is where all the details come in:
When this function returns a non-NULL ObjectMonitor*, the ObjectMonitor is locked and any parallel callers of get_list_head_locked() on the same list will be looping until the list head's ObjectMonitor is no longer locked. The caller that just got the ObjectMonitor* needs to finish up its work quickly.
Debugging Tip: If there's a bug where the list head ObjectMonitor is not properly unlocked, then this function will loop forever and the caller will be stuck.
The next case to consider for spin-lock list management with the Java Monitor subsystem is taking an ObjectMonitor from the start of a list. Taking an ObjectMonitor from the start of a list is a specialized form of delete that is guaranteed to interact with a thread that is prepending to the same list at the same time. Again, the core of the solution is to lock the ObjectMonitor at the head of the list we're trying to take the ObjectMonitor from, but we use slightly different code because we have fewer links to make than a prepend does.
L01: static ObjectMonitor* take_from_start_of_common(ObjectMonitor** list_p,
L02: int* count_p) {
L03: ObjectMonitor* take = NULL;
L04: // Lock the list head to guard against A-B-A race:
L05: if ((take = get_list_head_locked(list_p)) == NULL) {
L06: return NULL; // None are available.
L07: }
L08: ObjectMonitor* next = unmarked_next(take);
L09: // Switch locked list head to next (which unlocks the list head, but
L10: // leaves take locked):
L11: Atomic::store(list_p, next);
L12: Atomic::dec(count_p);
L13: // Unlock take, but leave the next value for any lagging list
L14: // walkers. It will get cleaned up when take is prepended to
L15: // the in-use list:
L16: om_unlock(take);
L17: return take;
L18: }
What the above function does is:
The function can be called by more than one thread at a time and each thread will take a unique ObjectMonitor from the start of the list (if one is available) without losing any other ObjectMonitors on the list. Of course, the "take a unique ObjectMonitor" and "without losing any other ObjectMonitors" parts are where all the details come in:
This last helper function exists for making life easier for list walker code. List walker code calls get_list_head_locked() to get the locked list head and then walks the list applying its particular logic to elements in the list. In order to safely walk to the 'next' ObjectMonitor in a list, the list walker code must lock the 'next' ObjectMonitor before unlocking the 'current' ObjectMonitor that it has locked. If a list walker unlocks 'current' before locking 'next', then there is a race where 'current' could be modified to refer to something other than the 'next' value that was in place when 'current' was locked. By locking 'next' first and then unlocking 'current', the list walker can safely advance to 'next'.
L01: static ObjectMonitor* lock_next_for_traversal(ObjectMonitor* cur) {
L02: assert(is_locked(cur), "cur=" INTPTR_FORMAT " must be locked", p2i(cur));
L03: ObjectMonitor* next = unmarked_next(cur);
L04: if (next == NULL) { // Reached the end of the list.
L05: om_unlock(cur);
L06: return NULL;
L07: }
L08: om_lock(next); // Lock next before unlocking current to keep
L09: om_unlock(cur); // from being by-passed by another thread.
L10: return next;
L11: }
This function is pretty straightforward so there are no detailed notes for it.
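For illustration, a typical list walker built on these helpers looks like this (a hedged sketch; do_something_with() is a hypothetical placeholder for the walker's per-monitor logic):
// Sketch of a list walk using get_list_head_locked() and lock_next_for_traversal().
ObjectMonitor* cur = get_list_head_locked(list_p);
while (cur != NULL) {
  // 'cur' is locked here so it cannot be taken out from under us and its
  // next field cannot change while we look at it.
  do_something_with(cur);              // hypothetical per-monitor logic
  cur = lock_next_for_traversal(cur);  // locks 'next', unlocks 'cur', returns 'next'
}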
ObjectSynchronizer::om_alloc() is responsible for allocating an ObjectMonitor and returning it to the caller. It has a three-step algorithm (a simplified sketch follows the three steps below):
1) Try to allocate from self's local free-list:
2) Try to allocate from the global free list (up to self->om_free_provision times):
3) Allocate a block of new ObjectMonitors:
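Here is a schematic sketch of those three steps; all helper names in this sketch are illustrative stand-ins, and the real om_alloc() also handles the accounting and retry details:
// Schematic sketch only: the three allocation steps described above.
ObjectMonitor* om_alloc_sketch(JavaThread* self) {
  while (true) {
    // 1) Cheapest path: take from self's local free list (no global traffic).
    ObjectMonitor* m = take_from_local_free_list(self);
    if (m != NULL) {
      return m;
    }
    // 2) Move up to self->om_free_provision ObjectMonitors from the global
    //    free list onto self's free list, then retry step 1.
    if (refill_from_global_free_list(self)) {
      continue;
    }
    // 3) Both free lists were empty: allocate a new block of _BLOCKSIZE
    //    ObjectMonitors, publish it (see prepend_block_to_lists() above),
    //    and retry step 2.
    allocate_and_publish_new_block();
  }
}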
ObjectSynchronizer::om_release() is responsible for putting an ObjectMonitor on self's free list. If 'from_per_thread_alloc' is true, then om_release() is also responsible for extracting the ObjectMonitor from self's in-use list. The extraction from self's in-use list must happen first:
L01: if (from_per_thread_alloc) {
L02: if ((mid = get_list_head_locked(&self->om_in_use_list)) == NULL) {
L03: fatal("thread=" INTPTR_FORMAT " in-use list must not be empty.", p2i(self));
L04: }
L05: next = unmarked_next(mid);
L06: if (m == mid) {
L07: Atomic::store(&self->om_in_use_list, next);
L08: } else if (m == next) {
L09: mid = next;
L10: om_lock(mid);
L11: next = unmarked_next(mid);
L12: self->om_in_use_list->set_next_om(next);
L13: } else {
L14: ObjectMonitor* anchor = next;
L15: om_lock(anchor);
L16: om_unlock(mid);
L17: while ((mid = unmarked_next(anchor)) != NULL) {
L18: if (m == mid) {
L19: next = unmarked_next(mid);
L20: anchor->set_next_om(next);
L21: break;
L22: } else {
L23: om_lock(mid);
L24: om_unlock(anchor);
L25: anchor = mid;
L26: }
L27: }
L28: }
L29: Atomic::dec(&self->om_in_use_count);
L30: om_unlock(mid);
L31: }
L32: prepend_to_om_free_list(self, m);
Most of the above code block extracts 'm' from self's in-use list; it is not an exact quote from om_release(), but it is the highlights:
The last line of the code block (L32) prepends 'm' to self's free list.
ObjectSynchronizer::om_flush() is responsible for flushing self's in-use list to the global in-use list and self's free list to the global free list during self's thread exit processing. om_flush() starts with self's in-use list:
L01: if ((in_use_list = get_list_head_locked(&self->om_in_use_list)) != NULL) {
L02: in_use_tail = in_use_list;
L03: in_use_count++;
L04: for (ObjectMonitor* cur_om = unmarked_next(in_use_list); cur_om != NULL;) {
L05: if (is_locked(cur_om)) {
L06: while (is_locked(cur_om)) {
L07: os::naked_short_sleep(1);
L08: }
L09: cur_om = unmarked_next(in_use_tail);
L10: continue;
L11: }
L12: if (cur_om->is_free()) {
L13: cur_om = unmarked_next(in_use_tail);
L14: continue;
L15: }
L16: in_use_tail = cur_om;
L17: in_use_count++;
L18: cur_om = unmarked_next(cur_om);
L19: }
L20: guarantee(in_use_tail != NULL, "invariant");
L21: int l_om_in_use_count = Atomic::load(&self->om_in_use_count);
L22: ADIM_guarantee(l_om_in_use_count == in_use_count, "in-use counts don't match: "
L23: "l_om_in_use_count=%d, in_use_count=%d", l_om_in_use_count, in_use_count);
L24: Atomic::store(&self->om_in_use_count, 0);
L25: Atomic::store(&self->om_in_use_list, (ObjectMonitor*)NULL);
L26: om_unlock(in_use_list);
L27: }
The above is not an exact copy of the code block from om_flush(), but it is the highlights. What the above code block needs to do is pretty simple:
However, in this case, there are a lot of details:
The code to process self's free list is much, much simpler because we don't have any races with an async deflater thread like we do with self's in-use list. The only interesting bits:
The last interesting bits for this function are prepending the local lists to the right global places:
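The global prepend helpers take an already chained local list along with its tail and count. A hedged sketch of the common logic they share (the function name here is illustrative; compare with prepend_to_common() shown earlier):
// Sketch: prepend a private, already-chained list ('list' .. 'tail', with
// 'count' elements) onto the global list referenced by 'list_p'.
static void prepend_list_to_global_sketch(ObjectMonitor* list, ObjectMonitor* tail,
                                          int count, ObjectMonitor** list_p,
                                          int* count_p) {
  while (true) {
    ObjectMonitor* cur = get_list_head_locked(list_p);
    if (cur == NULL) {
      // Empty global list: race other prependers with cmpxchg, just like
      // the empty-list path in prepend_to_common().
      tail->set_next_om(NULL);
      if (Atomic::cmpxchg(list_p, cur, list) == cur) {
        break;
      }
      continue;  // Another thread prepended first so try it all again.
    }
    // Non-empty global list: the head is locked so it is safe to link and
    // switch the list head.
    tail->set_next_om(cur);       // local tail now points to the old global head
    Atomic::store(list_p, list);  // publish 'list' as the new (unlocked) head
    om_unlock(cur);               // unlock the previous global head
    break;
  }
  Atomic::add(count_p, count);
}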
ObjectSynchronizer::deflate_monitor_list() is responsible for deflating idle ObjectMonitors at a safepoint. This function can use the simpler lock-mid-as-we-go protocol since there can be no parallel list deletions due to the safepoint:
L01: int ObjectSynchronizer::deflate_monitor_list(ObjectMonitor** list_p,
L02: int* count_p,
L03: ObjectMonitor** free_head_p,
L04: ObjectMonitor** free_tail_p) {
L05: ObjectMonitor* cur_mid_in_use = NULL;
L06: ObjectMonitor* mid = NULL;
L07: ObjectMonitor* next = NULL;
L08: int deflated_count = 0;
L09: if ((mid = get_list_head_locked(list_p)) == NULL) {
L10: return 0; // The list is empty so nothing to deflate.
L11: }
L12: next = unmarked_next(mid);
L13: while (true) {
L14: oop obj = (oop) mid->object();
L15: if (obj != NULL && deflate_monitor(mid, obj, free_head_p, free_tail_p)) {
L16: if (cur_mid_in_use == NULL) {
L17: Atomic::store(list_p, next);
L18: } else {
L19: cur_mid_in_use->set_next_om(next);
L20: }
L21: deflated_count++;
L22: Atomic::dec(count_p);
L23: mid->set_next_om(NULL);
L24: } else {
L25: om_unlock(mid);
L26: cur_mid_in_use = mid;
L27: }
L28: mid = next;
L29: if (mid == NULL) {
L30: break; // Reached end of the list so nothing more to deflate.
L31: }
L32: om_lock(mid);
L33: next = unmarked_next(mid);
L34: }
L35: return deflated_count;
L36: }
Note: The above version of deflate_monitor_list() uses locking, but those changes were dropped during the code review cycle for JDK-8235795. The locking is only needed when additional calls to audit_and_print_stats() are used during debugging so it was decided that the pushed version would be simpler.
The above is not an exact copy of the code block from deflate_monitor_list(), but it is the highlights. What the above code block needs to do is pretty simple:
Since we're using the simpler lock-mid-as-we-go protocol, there are not too many details:
ObjectSynchronizer::deflate_monitor_list_using_JT() is responsible for asynchronously deflating idle ObjectMonitors using a JavaThread. This function uses the more complicated lock-cur_mid_in_use-and-mid-as-we-go protocol because om_release() can do list deletions in parallel. We also lock-next-next-as-we-go to prevent an om_flush() that is behind this thread from passing us. Because this function can asynchronously interact with so many other functions, this is the largest clip of code:
L01: int ObjectSynchronizer::deflate_monitor_list_using_JT(ObjectMonitor** list_p,
L02: int* count_p,
L03: ObjectMonitor** free_head_p,
L04: ObjectMonitor** free_tail_p,
L05: ObjectMonitor** saved_mid_in_use_p) {
L06: JavaThread* self = JavaThread::current();
L07: ObjectMonitor* cur_mid_in_use = NULL;
L08: ObjectMonitor* mid = NULL;
L09: ObjectMonitor* next = NULL;
L10: ObjectMonitor* next_next = NULL;
L11: int deflated_count = 0;
L12: NoSafepointVerifier nsv;
L13: if (*saved_mid_in_use_p == NULL) {
L14: if ((mid = get_list_head_locked(list_p)) == NULL) {
L15: return 0; // The list is empty so nothing to deflate.
L16: }
L17: next = unmarked_next(mid);
L18: } else {
L19: cur_mid_in_use = *saved_mid_in_use_p;
L20: om_lock(cur_mid_in_use);
L21: mid = unmarked_next(cur_mid_in_use);
L22: if (mid == NULL) {
L23: om_unlock(cur_mid_in_use);
L24: *saved_mid_in_use_p = NULL;
L25: return 0; // The remainder is empty so nothing more to deflate.
L26: }
L27: om_lock(mid);
L28: next = unmarked_next(mid);
L29: }
L30: while (true) {
L31: if (next != NULL) {
L32: om_lock(next);
L33: next_next = unmarked_next(next);
L34: }
L35: if (mid->object() != NULL && mid->is_old() &&
L36: deflate_monitor_using_JT(mid, free_head_p, free_tail_p)) {
L37: if (cur_mid_in_use == NULL) {
L38: Atomic::store(list_p, next);
L39: } else {
L40: ObjectMonitor* locked_next = mark_om_ptr(next);
L41: cur_mid_in_use->set_next_om(locked_next);
L42: }
L43: deflated_count++;
L44: Atomic::dec(count_p);
L45: mid->set_next_om(NULL);
L46: mid = next; // mid keeps non-NULL next's locked state
L47: next = next_next;
L48: } else {
L49: if (cur_mid_in_use != NULL) {
L50: om_unlock(cur_mid_in_use);
L51: }
L52: cur_mid_in_use = mid;
L53: mid = next; // mid keeps non-NULL next's locked state
L54: next = next_next;
L55: if (SafepointMechanism::should_block(self) &&
L56: cur_mid_in_use != Atomic::load(list_p) && cur_mid_in_use->is_old()) {
L57: *saved_mid_in_use_p = cur_mid_in_use;
L58: om_unlock(cur_mid_in_use);
L59: if (mid != NULL) {
L60: om_unlock(mid);
L61: }
L62: return deflated_count;
L63: }
L64: }
L65: if (mid == NULL) {
L66: if (cur_mid_in_use != NULL) {
L67: om_unlock(cur_mid_in_use);
L68: }
L69: break; // Reached end of the list so nothing more to deflate.
L70: }
L71: }
L72: *saved_mid_in_use_p = NULL;
L73: return deflated_count;
L74: }
The above is not an exact copy of the code block from deflate_monitor_list_using_JT(), but it is the highlights. What the above code block needs to do is pretty simple:
Since we're using the more complicated lock-cur_mid_in_use-and-mid-as-we-go protocol and also the lock-next-next-as-we-go protocol, there is a mind-numbing amount of detail:
ObjectSynchronizer::deflate_idle_monitors() handles deflating idle monitors at a safepoint from the global in-use list using ObjectSynchronizer::deflate_monitor_list(). There are only a few things that are worth mentioning:
ObjectSynchronizer::deflate_common_idle_monitors_using_JT() handles asynchronously deflating idle monitors from either the global in-use list or a per-thread in-use list using ObjectSynchronizer::deflate_monitor_list_using_JT(). There are only a few things that are worth mentioning:
The devil is in the details! Housekeeping and administrative tasks are usually detailed, but necessary.
JDK-8181859 Monitor deflation is not checked in cleanup path
((om_list_globals.population - om_list_globals.free_count) / om_list_globals.population) > NN%
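In code form, that heuristic amounts to the following check (a hedged sketch; the real test is reached via ObjectSynchronizer::is_async_deflation_needed() and uses the MonitorUsedDeflationThreshold percentage flag from JDK-8181859 as the NN% value):
// Hedged sketch of the usage check implied by the formula above.
static bool monitors_used_above_threshold_sketch() {
  int population = Atomic::load(&om_list_globals.population);
  if (population == 0) {
    return false;  // Nothing allocated yet; also avoids dividing by zero.
  }
  int monitors_used = population - Atomic::load(&om_list_globals.free_count);
  int monitor_usage = (monitors_used * 100) / population;
  return monitor_usage > MonitorUsedDeflationThreshold;
}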
Changes to the ServiceThread mechanism by the Async Monitor Deflation project (when async deflation is enabled):
The ServiceThread will wake up every GuaranteedSafepointInterval to check for cleanup tasks.
This allows is_async_deflation_needed() to be checked at the same interval.
The ServiceThread handles deflating global idle monitors and deflating the per-thread idle monitors by calling ObjectSynchronizer::deflate_idle_monitors_using_JT().
Other invocation changes by the Async Monitor Deflation project (when async deflation is enabled):
VM_Exit::doit_prologue() will request a special cleanup to reduce the noise in 'monitorinflation' logging at VM exit time.
Before the final safepoint in a non-System.exit() end to the VM, we will request a special cleanup to reduce the noise in 'monitorinflation' logging at VM exit time.
WB_G1StartMarkCycle()