Async Monitor Deflation

Table of Contents:

Summary

This page describes adding support for Async Monitor Deflation to OpenJDK. The primary goal of this project is to reduce the time spent in safepoint cleanup operations.

RFE: 8153224 Monitor deflation prolong safepoints
https://bugs.openjdk.java.net/browse/JDK-8153224

Full Webrev: http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.full/

Inc Webrev: http://cr.openjdk.java.net/~dcubed/8153224-webrev/10-for-jdk14.v2.07.inc/

Background

This patch for Async Monitor Deflation is based on Carsten Varming's

http://cr.openjdk.java.net/~cvarming/monitor_deflate_conc/0/

which has been ported to work with monitor lists. Monitor lists were optional via the '-XX:+MonitorInUseLists' option in JDK8, the option became default 'true' in JDK9, the option became deprecated in JDK10 via JDK-8180768, and the option became obsolete in JDK12 via JDK-8211384. Carsten's webrev is based on JDK10 so there was a bit of porting work needed to merge his code and/or algorithms with jdk/jdk.

Carsten also submitted a JEP back in the JDK10 time frame:

JDK-8183909 Concurrent Monitor Deflation

https://bugs.openjdk.java.net/browse/JDK-8183909

The OpenJDK JEP process has evolved a bit since JDK10 and a JEP is no longer required for a project that is well defined to be within one area of responsibility. Async Monitor Deflation is clearly defined to be in the JVM Runtime team's area of responsibility so it is likely that the JEP (JDK-8183909) will be withdrawn and the work will proceed via the RFE (JDK-8153224).

Introduction

The current idle monitor deflation mechanism executes at a safepoint during cleanup operations. Due to this execution environment, the current mechanism does not have to worry about interference from concurrently executing JavaThreads. Async Monitor Deflation uses the ServiceThread to deflate idle monitors so the new mechanism has to detect interference and adapt as appropriate. In other words, data races are natural part of Async Monitor Deflation and the algorithms have to detect the races and react without data loss or corruption.

Key Parts of the Algorithm

1) Deflation With Interference Detection

ObjectSynchronizer::deflate_monitor_using_JT() is the new counterpart to ObjectSynchronizer::deflate_monitor() and does the heavy lifting of asynchronously deflating a monitor using a three part prototcol:

Setting a NULL owner field to DEFLATER_MARKER with cmpxchg() forces any contending thread through the slow path. A racing thread would be trying to set the owner field.
Making a zero ref_count field a large negative value with cmpxchg() forces racing threads to retry. A racing thread would would be trying to increment the ref_count field.
If the owner field is still equal to DEFLATER_MARKER, then we have won all the races and can deflate the monitor.

If we lose any of the races, the monitor cannot be deflated at this time.

Once we know it is safe to deflate the monitor (which is mostly field resetting and monitor list management), we have to restore the object's header. That's another racy operation that is described below in "Restoring the Header With Interference Detection".

2) Restoring the Header With Interference Detection

ObjectMonitor::install_displaced_markword_in_object() is the new piece of code that handles all the racy situations with restoring an object's header asynchronously. The function is called from two places (deflation and saving an ObjectMonitor* in an ObjectMonitorHandle). The restoration protocol for the object's header uses the mark bit along with the hash() value staying at zero to indicate that the object's header is being restored. Only one of the possible racing scenarios can win and the losing scenarios all adapt to the winning scenario's object header value.

3) Using "owner" or "ref_count" With Interference Detection

Various code paths have been updated to recognize an owner field equal to DEFLATER_MARKER or a negative ref_count field and those code paths will retry their operation. This is the shortest "Key Part" description, but don't be fooled. See "Gory Details" below.

An Example of ObjectMonitor Interference Detection

ObjectMonitor::save_om_ptr() is used to safely save an ObjectMonitor* in an ObjectMonitorHandle. ObjectSynchronizer::deflate_monitor_using_JT() is used to asynchronously deflate an idle monitor. save_om_ptr() and deflate_monitor_using_JT() can interfere with each other. The thread calling save_om_ptr() (T-save) is potentially racing with another JavaThread (T-deflate) so both threads have to check the results of the races.

Start of the Race

    T-save                  ObjectMonitor              T-deflate
    ----------------------  +-----------------------+  ----------------------------------------
    save_om_ptr() {         | owner=NULL            |  deflate_monitor_using_JT() {
   1> atomic inc ref_count  | ref_count=0           | 1> cmpxchg(DEFLATER_MARKER, &owner, NULL)
                            +-----------------------+

- The data fields are at their starting values.
- The "1>" markers are showing where each thread is at for the ObjectMonitor box:
  - T-deflate is about to execute cmpxchg().
  - T-save is about to increment the ref_count.

Racing Threads

    T-save                  ObjectMonitor              T-deflate
    ----------------------  +-----------------------+  ------------------------------------------
    save_om_ptr() {         | owner=DEFLATER_MARKER |  deflate_monitor_using_JT() {
   1> atomic inc ref_count  | ref_count=0           |    cmpxchg(DEFLATER_MARKER, &owner, NULL)
                            +-----------------------+    :
                                                      1> prev = cmpxchg(-max_jint, &ref_count, 0)

- T-deflate has executed cmpxchg() and set owner to DEFLATE_MARKER.
- T-save still hasn't done anything yet
- The "1>" markers are showing where each thread is at for the ObjectMonitor box:
  - T-save and T-deflate are racing to update the ref_count field.

T-deflate Wins

    T-save                             ObjectMonitor              T-deflate
    ---------------------------------  +-----------------------+  ------------------------------------------
    save_om_ptr() {                    | owner=DEFLATER_MARKER |  deflate_monitor_using_JT() {
      atomic inc ref_count             | ref_count=-max_jint+1 |    cmpxchg(DEFLATER_MARKER, &owner, NULL)
   1> if (owner == DEFLATER_MARKER &&  +-----------------------+    :
          ref_count <= 0) {                       ||                prev = cmpxchg(-max_jint, &ref_count, 0)
        restore obj header                        \/             1> if (prev == 0 &&
        atomic dec ref_count           +-----------------------+        owner == DEFLATER_MARKER) {
     2> return false to force retry    | owner=DEFLATER_MARKER |      restore obj header
      }                                | ref_count=-max_jint   |   2> finish the deflation
                                       +-----------------------+    }

- This diagram starts after "Racing Threads".
- The "1>" markers are showing where each thread is at for that ObjectMonitor box:
  - T-save and T-deflate both observe owner == DEFLATER_MARKER and a negative ref_count field.
- T-save has lost the race: it restores the obj header (not shown) and decrements the ref_count.
- T-deflate restores the obj header (not shown).
- The "2>" markers are showing where each thread is at for that ObjectMonitor box.
- T-save returns false to cause the caller to retry.
- T-deflate finishes the deflation.
- Note: As of CR5/v2.05/8-for-jdk13, the owner == DEFLATER_MARKER value is allowed to linger until a deflated ObjectMonitor is reused for an enter operation. This narrows the C2 ObjectMonitor enter optimization race window with async deflation.

T-save Wins

    T-save                             ObjectMonitor              T-deflate
    ---------------------------------  +-----------------------+  ------------------------------------------
    save_om_ptr() {                    | owner=DEFLATER_MARKER |  deflate_monitor_using_JT() {
      atomic inc ref_count             | ref_count=1           |    cmpxchg(DEFLATER_MARKER, &owner, NULL)
   1> if (owner == DEFLATER_MARKER &&  +-----------------------+    :
          ref_count <= 0) {                       ||                prev = cmpxchg(-max_jint, &ref_count, 0)
    } else {                                      \/               1> if (prev == 0 &&
      save om_ptr in the               +-----------------------+          owner == DEFLATER_MARKER) {
        ObjectMonitorHandle            | owner=NULL            |    } else {
   2> return true                      | ref_count=1           |      cmpxchg(NULL, &owner, DEFLATER_MARKER)
                                       +-----------------------+   2> return

- This diagram starts after "Racing Threads".
- The "1>" markers are showing where each thread is at for the ObjectMonitor box:
  - T-save and T-deflate both observe a ref_count field > 0.
- T-save has won the race and it saves the ObjectMonitor* in the ObjectMonitorHandle (not shown).
- T-deflate detects that it has lost the race (prev != 0) and bails out on deflating the ObjectMonitor:
  - Before bailing out T-deflate tries to restore the owner field to NULL if it is still DEFLATER_MARKER.
- The "2>" markers are showing where each thread is at for that ObjectMonitor box.

T-save Complication with C2

Sorry in advance for the sudden deep dive into really gory C2 details, but this is related to a majority of save_om_ptr() so this is the right place to talk about the complication.

As of CR7/v2.07/10-for-jdk14, we have added C2 inc_om_ref_count() on X64 to implement the ref_count management parts of save_om_ptr():

- inc_om_ref_count() does not implement the "restore obj header" part nor the "save om_ptr in the ObjectMonitorHandle" part mentioned in the previous two subsections.
- inc_om_ref_count() is used by C2 fast_lock(), C2 fast_unlock() and C2 rtm_inflated_locking() on the LP64 X64 platform.
- The v2.05 version of C2 fast_lock() has code to detect a deflated and recycled ObjectMonitor after acquiring ownership of the ObjectMonitor. The solution to the race was to drop ownership and take the slow enter path. We have spent a lot of time and energy analyzing this race and the solution to this race and have convinced ourselves that the solution introduces theoretical problems with succession. The proper solution is to switch to using inc_om_ref_count() to protect the ObjectMonitor* for the duration of C2 fast_lock().
- Robbin wrote a new test called MoCrazy that is targeted at the C2 optimizations. This test revealed a race in the baseline C2 fast_unlock() where ownership was reacquired in order to ensure proper succession. So baseline C2 fast_unlock() had a similar version of the race that we thought we fixed in C2 fast_lock(). The proper solution is to switch to using inc_om_ref_count() to protect the ObjectMonitor* for the duration of C2 fast_unlock().
- C2 rtm_inflated_locking() is similarly exposed to races with async deflation so inc_om_ref_count() is used to protect the ObjectMonitor* for the duration of C2 rtm_inflated_locking().

T-enter Wins By A-B-A

    T-enter                                       ObjectMonitor                T-deflate
    --------------------------------------------  +-------------------------+  ------------------------------------------
    ObjectMonitor::enter() {                      | owner=DEFLATER_MARKER   |  deflate_monitor_using_JT() {
      <owner is contended>                        | ref_count=1             |    cmpxchg(DEFLATER_MARKER, &owner, NULL)
   1> EnterI() {                                  +-------------------------+ 1> :
        if (owner == DEFLATER_MARKER &&                       ||              2> : <thread_stalls>
            cmpxchg(Self, &owner,                             \/                 :
                    DEFLATER_MARKER)              +-------------------------+    :
                    == DEFLATER_MARKER) {         | owner=Self/T-enter      |    :
          // EnterI is done                       | ref_count=0             |    : <thread_resumes>
          return                                  +-------------------------+    prev = cmpxchg(-max_jint, &ref_count, 0)
        }                                                     ||                 if (prev == 0 &&
    } // enter() is done                                      \/                3>   owner == DEFLATER_MARKER) {
    ~OMH: atomic dec ref_count                    +-------------------------+    } else {
 2> : <does app work>                             | owner=Self/T-enter|NULL |      cmpxchg(NULL, &owner, DEFLATER_MARKER)
 3> :                                             | ref_count=-max_jint     |      atomic add max_jint to ref_count
    exit() monitor                                +-------------------------+   4> bailout on deflation
   4> owner = NULL                                            ||                 }
                                                              \/
                                                  +-------------------------+
                                                  | owner=Self/T-enter|NULL |
                                                  | ref_count=0             |
                                                  +-------------------------+

- T-deflate has executed cmpxchg() and set owner to DEFLATE_MARKER.
- T-enter has called ObjectMonitor::enter() with "ref_count == 1", noticed that the owner is contended and is about to call ObjectMonitor::EnterI().
- The first ObjectMonitor box is showing the fields at this point and the "1>" markers are showing where each thread is at for that ObjectMonitor box.
- T-deflate stalls after setting the owner field to DEFLATER_MARKER.
- T-enter calls EnterI() to do the contended enter work:
  - EnterI() observes owner == DEFLATER_MARKER and uses cmpxchg() to set the owner field to Self/T-enter.
  - T-enter owns the monitor, returns from EnterI(), and returns from enter().
  - The ObjectMonitorHandle destructor decrements the ref_count.
- T-enter is now ready to do work that requires the monitor to be owned.
- The second ObjectMonitor box is showing the fields at this point and the "2>" markers are showing where each thread is at for that ObjectMonitor box.
- T-enter is doing app work (but it also could have finished and exited the monitor).
- T-deflate resumes, calls cmpxchg() to set the ref_count field to -max_jint, and passes the first part of the bailout expression because "prev == 0".
- The third ObjectMonitor box is showing the fields at this point and the "3>" markers are showing where each thread is at for that ObjectMonitor box.
- T-deflate performs the A-B-A check which observes that "owner != DEFLATE_MARKER" and bails out on deflation:
  - Depending on when T-deflate resumes after the stall, it will see "owner == T-enter" or "owner == NULL".
  - Both of those values will cause deflation to bailout so we have to conditionally undo work:
    - restore the owner field to NULL if it is still DEFLATER_MARKER (it's not DEFLATER_MARKER)
    - undo setting ref_count to -max_jint by atomically adding max_jint to ref_count which will restore ref_count to its proper value.
  - If the T-enter thread has managed to enter but not exit the monitor during the T-deflate stall, then our owner field A-B-A transition is:
    - NULL → DEFLATE_MARKER → Self/T-enter
  - so we really have A1-B-A2, but the A-B-A principal still holds.
  - If the T-enter thread has managed to enter and exit the monitor during the T-deflate stall, then our owner field A-B-A transition is:
    - NULL → DEFLATE_MARKER → Self/T-enter → NULL
  - so we really have A-B1-B2-A, but the A-B-A principal still holds.
- T-enter finished doing app work and is about to exit the monitor (or it has already exited the monitor).
- The fourth ObjectMonitor box is showing the fields at this point and the "4>" markers are showing where each thread is at for that ObjectMonitor box.

An Example of Object Header Interference

After T-deflate has won the race for deflating an ObjectMonitor it has to restore the header in the associated object. Of course another thread can be trying to do something to the object's header at the same time. Isn't asynchronous work exciting?!?!

ObjectMonitor::install_displaced_markword_in_object() is called from two places so we can have a race between a T-save thread and a T-deflate thread:

Start of the Race

    T-save                                       object           T-deflate
    -------------------------------------------  +-------------+  --------------------------------------------
    install_displaced_markword_in_object() {     | mark=om_ptr |  install_displaced_markword_in_object() {
      dmw = header()                             +-------------+    dmw = header()
      if (!dmw->is_marked() &&                                      if (!dmw->is_marked() &&
          dmw->hash() == 0) {                                           dmw->hash() == 0) {
        create marked_dmw                                              create marked_dmw
        dmw = cmpxchg(marked_dmw, &header, dmw)                        dmw = cmpxchg(marked_dmw, &header, dmw)
      }                                                              }

- The data field (mark) is at its starting value.
- 'dmw' and 'marked_dmw' are local copies in each thread.
- T-save and T-deflate are both calling install_displaced_markword_in_object() at the same time.
- Both threads are poised to call cmpxchg() at the same time.

T-deflate Wins First Race

    T-save                                       object            T-deflate
    -------------------------------------------  +-------------+   -------------------------------------------
    install_displaced_markword_in_object() {     | mark=om_ptr |   install_displaced_markword_in_object() {
      dmw = header()                             +-------------+     dmw = header()
      if (!dmw->is_marked() &&                                       if (!dmw->is_marked() &&
          dmw->hash() == 0) {                                            dmw->hash() == 0) {
        create marked_dmw                                              create marked_dmw
        dmw = cmpxchg(marked_dmw, &header, dmw)                        dmw = cmpxchg(marked_dmw, &header, dmw)
      }                                                              }
      // dmw == marked_dmw here                                      // dmw == original dmw here
      if (dmw->is_marked())                                          if (dmw->is_marked())
        unmark dmw                                                     unmark dmw
      obj = object()                                                 obj = object()
      obj->cas_set_mark(dmw, this)                                   obj->cas_set_mark(dmw, this)

- The return value from cmpxchg() in each thread will be different.
- Since T-deflate won the race, its 'dmw' variable contains the header/dmw from the ObjectMonitor.
- Since T-save lost the race, its 'dmw' variable contains the 'marked_dmw' set by T-deflate.
  - T-save will unmark its 'dmw' variable.
- Both threads are poised to call cas_set_mark() at the same time.

T-save Wins First Race

    T-save                                       object            T-deflate
    -------------------------------------------  +-------------+   -------------------------------------------
    install_displaced_markword_in_object() {     | mark=om_ptr |   install_displaced_markword_in_object() {
      dmw = header()                             +-------------+     dmw = header()
      if (!dmw->is_marked() &&                                       if (!dmw->is_marked() &&
          dmw->hash() == 0) {                                            dmw->hash() == 0) {
        create marked_dmw                                              create marked_dmw
        dmw = cmpxchg(marked_dmw, &header, dmw)                        dmw = cmpxchg(marked_dmw, &header, dmw)
      }                                                              }
      // dmw == original dmw here                                    // dmw == marked_dmw here
      if (dmw->is_marked())                                          if (dmw->is_marked())
        unmark dmw                                                     unmark dmw
      obj = object()                                                 obj = object()
      obj->cas_set_mark(dmw, this)                                   obj->cas_set_mark(dmw, this)

- This diagram is the same as "T-deflate Wins First Race" except we've swapped the post cmpxchg() comments.
- Since T-save won the race, its 'dmw' variable contains the header/dmw from the ObjectMonitor.
- Since T-deflate lost the race, its 'dmw' variable contains the 'marked_dmw' set by T-save.
  - T-deflate will unmark its 'dmw' variable.
- Both threads are poised to call cas_set_mark() at the same time.

Either Wins the Second Race

    T-save                                       object            T-deflate
    -------------------------------------------  +-------------+   -------------------------------------------
    install_displaced_markword_in_object() {     | mark=dmw    |   install_displaced_markword_in_object() {
      dmw = header()                             +-------------+     dmw = header()
      if (!dmw->is_marked() &&                                       if (!dmw->is_marked() &&
          dmw->hash() == 0) {                                            dmw->hash() == 0) {
        create marked_dmw                                              create marked_dmw
        dmw = cmpxchg(marked_dmw, &header, dmw)                        dmw = cmpxchg(marked_dmw, &header, dmw)
      }                                                              }
      // dmw == ...                                                  // dmw == ...
      if (dmw->is_marked())                                          if (dmw->is_marked())
        unmark dmw                                                     unmark dmw
      obj = object()                                                 obj = object()
      obj->cas_set_mark(dmw, this)                                   obj->cas_set_mark(dmw, this)

- It does not matter whether T-save or T-deflate won the cmpxchg() call so the comment does not say who won.
- It does not matter whether T-save or T-deflate won the cas_set_mark() call; in this scenario both were trying to restore the same value.
- The object's mark field has changed from 'om_ptr' → 'dmw'.

Please notice that install_displaced_markword_in_object() does not do any retries on any code path:

- Instead the code adapts to being the loser in a cmpxchg() by unmarking its copy of the dmw.
- In the second race, if a thread loses the cas_set_mark() race, there is also no need to retry because the object's header has been restored by the other thread.

Hashcodes and Object Header Interference

If we have a race between a T-deflate thread and a thread trying to get/set a hashcode (T-hash), then the race is between the ObjectMonitorHandle.save_om_ptr(obj, mark) call in T-hash and deflation protocol in T-deflate.

Start of the Race

    T-hash                  ObjectMonitor              T-deflate
    ----------------------  +-----------------------+  ----------------------------------------
    save_om_ptr() {         | owner=NULL            |  deflate_monitor_using_JT() {
      :                     | ref_count=0           | 1> cmpxchg(DEFLATER_MARKER, &owner, NULL)
   1> atomic inc ref_count  +-----------------------+

- The data fields are at their starting values.
- T-deflate is about to execute cmpxchg().
- T-hash is about to increment ref_count.
- The "1>" markers are showing where each thread is at for the ObjectMonitor box.

Racing Threads

    T-hash                  ObjectMonitor              T-deflate
    ----------------------  +-----------------------+  ------------------------------------------
    save_om_ptr() {         | owner=DEFLATER_MARKER |  deflate_monitor_using_JT() {
      :                     | ref_count=0           |    cmpxchg(DEFLATER_MARKER, &owner, NULL)
   1> atomic inc ref_count  +-----------------------+    if (contentions != 0 || waiters != 0) {
                                                         }
                                                      1> prev = cmpxchg(-max_jint, &ref_count, 0)

- T-deflate has set the owner field to DEFLATER_MARKER.
- The "1>" markers are showing where each thread is at for the ObjectMonitor box:
  - T-deflate is about to execute cmpxchg().
  - T-save is about to increment the ref_count.

T-deflate Wins

If T-deflate wins the race, then T-hash will have to retry at most once.

    T-hash                      ObjectMonitor              T-deflate
    -------------------------   +-----------------------+  ------------------------------------------
    save_om_ptr() {             | owner=DEFLATER_MARKER |  deflate_monitor_using_JT() {
   1> atomic inc ref_count      | ref_count=-max_jint   |    cmpxchg(DEFLATER_MARKER, &owner, NULL)
      if (owner ==              +-----------------------+    if (contentions != 0 || waiters != 0) {
          DEFLATER_MARKER &&               ||                }
          ref_count <= 0) {                \/                prev = cmpxchg(-max_jint, &ref_count, 0)
        restore obj header      +-----------------------+ 1> if (prev == 0 &&
        atomic dec ref_count    | owner=DEFLATER_MARKER |        owner == DEFLATER_MARKER) {
     2> return false to         | ref_count=-max_jint   |      restore obj header
          cause a retry         +-----------------------+   2> finish the deflation
      }                                                      }

- T-deflate made it past the cmpxchg() of ref_count before T-hash incremented it.
- T-deflate set the ref_count field to -max_jint and is about to make the last of the protocol checks.
- The first ObjectMonitor box is showing the fields at this point and the "1>" markers are showing where each thread is at for that ObjectMonitor box.
- T-deflate sees "prev == 0 && owner == DEFLATER_MARKER" so it knows that it has won the race.
- T-deflate restores obj header (not shown).
- T-hash increments the ref_count.
- T-hash observes "owner == DEFLATER_MARKER && ref_count <= 0" so it restores obj header (not shown) and decrements ref_count.
- The second ObjectMonitor box is showing the fields at this point and the "2>" markers are showing where each thread is at for that ObjectMonitor box.
- T-deflate finishes the deflation work.
- T-hash returns false to cause a retry and when T-hash retries:
  - it observes the restored object header (done by T-hash or T-deflate):
    - if the object's header does not have a hash, then generate a hash and merge it with the object's header.
    - Otherwise, extract the hash from the object's header and return it.

T-hash Wins

If T-hash wins the race, then the ref_count will cause T-deflate to bail out on deflating the monitor.

Note: header is not mentioned in any of the previous sections for simplicity.

    T-hash                      ObjectMonitor              T-deflate
    -------------------------   +-----------------------+  ------------------------------------------
    save_om_ptr() {             | header=dmw_no_hash    |  deflate_monitor_using_JT() {
      atomic inc ref_count      | owner=DEFLATER_MARKER |    cmpxchg(DEFLATER_MARKER, &owner, NULL)
   1> if (owner ==              | ref_count=1           |    if (contentions != 0 || waiters != 0) {
          DEFLATER_MARKER &&    +-----------------------+    }
          ref_count <= 0) {                ||             1> prev = cmpxchg(-max_jint, &ref_count, 0)
      } else {                             \/                if (prev == 0 &&
     2> save om_ptr in the      +-----------------------+        owner == DEFLATER_MARKER) {
          ObjectMonitorHandle   | header=dmw_no_hash    |    } else {
        return true             | owner=NULL            |      cmpxchg(NULL, &owner, DEFLATER_MARKER)
      }                         | ref_count=1           |   2> bailout on deflation
    }                           +-----------------------+    }
    if save_om_ptr() {                     ||
      if no hash                           \/
        gen hash & merge        +-----------------------+
      hash = hash(header)       | header=dmw_hash       |
    }                           | owner=NULL            |
 3> atomic dec ref_count        | ref_count=1           |
    return hash                 +-----------------------+

- T-deflate has set the owner field to DEFLATER_MARKER.
- T-hash has incremented ref_count before T-deflate made it to cmpxchg().
- The first ObjectMonitor box is showing the fields at this point and the "1>" markers are showing where each thread is at for that ObjectMonitor box.
- T-deflate bails out on deflation, but first it tries to restore the owner field:
  - The return value of cmpxchg() is not checked here.
  - If T-deflate cannot restore the owner field to NULL, then another thread has managed to enter the monitor (or enter and exit the monitor) and we don't want to overwrite that information.
- T-hash observes:
  - "owner == DEFLATER_MARKER && ref_count > 0" or
  - "owner == NULL && ref_count > 0" so it gets ready to save the ObjectMonitor*.
- The second ObjectMonitor box is showing the fields at this point and the "2>" markers are showing where each thread is at for that ObjectMonitor box.
- T-hash saves the ObjectMonitor* in the ObjectMonitorHandle (not shown) and returns to the caller.
- save_om_ptr() returns true since the ObjectMonitor is safe:
  - if ObjectMonitor's 'header/dmw' field does not have a hash, then generate a hash and merge it with the 'header/dmw' field.
  - Otherwise, extract the hash from the ObjectMonitor's 'header/dmw' field.
- The third ObjectMonitor box is showing the fields at this point and the "3>" marker is showing where T-hash is at for that ObjectMonitor box.
- T-hash decrements the ref_count field.
- T-hash returns the hash value.

Please note that in Carsten's original prototype, there was another race in ObjectSynchronizer::FastHashCode() when the object's monitor had to be inflated. The setting of the hashcode in the ObjectMonitor's header/dmw could race with T-deflate. That race is resolved in this version by the use of an ObjectMonitorHandle in the call to ObjectSynchronizer::inflate(). The ObjectMonitor* returned by ObjectMonitorHandle.om_ptr() has a non-zero ref_count so no additional races with T-deflate are possible.

Lock-Free Monitor List Management In Theory

Use of specialized measurement code with the CR5/v2.05/8-for-jdk13 bits revealed that the gListLock contention is responsible for much of the performance degradation observed with SPECjbb2015. Consequently the primary focus of the next round of changes is/was on switching to lock-free monitor list management. Of course, since the Java Monitor subsystem is full of special cases, the lock-free list management code has to have a number of special cases which are described here.

The Simple Case

There is one simple case of lock-free list management with the Java Monitor subsystem so we'll start with that code as a way to introduce the lock-free concepts:

    L1:  while (true) {
    L2:    PaddedObjectMonitor* cur = g_block_list;
    L3:    new_blk[0]._next_om = cur;
    L4:    if (Atomic::cmpxchg(new_blk, &g_block_list, cur) == cur) {
    L5:      Atomic::add(_BLOCKSIZE - 1, &g_om_population);
    L6:      break;
    L7:    }
    L8:  }

What the above block of code does is:

- prepends a 'new_blk' to the front of 'g_block_list'
- increments the 'g_om_population' counter to include the number of new elements

The above block of code can be called by multiple threads in parallel and must not lose track of any blocks. Of course, the "must not lose track of any blocks" part is where all the details come in:

- L2 loads the current 'g_block_list' value into 'cur'; g_block_list is only modified by cmpxchg so a load (instead of a load-acquire) is sufficient.
- L3 stores 'cur' into the 0th element's next field for 'new_blk'; g_block_list is only modified by cmpxchg so a store (instead of a release-store) is sufficient.
- L4 is the critical decision point for this lock-free update. cmpxchg will change 'g_block_list' to 'new_blk' iff 'g_block_list' == 'cur' (publish it).
  - if the cmpxchg return value is 'cur', then we succeeded with the lock-free update and we atomically update 'g_om_population' to match.
  - Otherwise we loop around and do everything again from L2.

At the point that cmpxchg has published the new 'g_block_list' value, 'new_blk' is now first block in the list and the 0th element's next field is used to find the previous first block; all of the monitor list blocks are chained together via the next field in the block's 0th element. It is the use of cmpxchg to update 'g_block_list' and the checking of the return value from cmpxchg that insures that we don't lose track of any blocks.

This example is considered to be the "simple case" because we only prepend to the list (no deletes) and we only use:

- one load
- one store and
- one cmpxchg

to achieve the safe update of the 'g_block_list' value; the atomic increment of the 'g_om_population' counter is considered to be just accounting (pun intended).

The concepts introduced here are:

- update the new thing to refer to head of existing list
- try to update the head of the existing list to refer to the new thing
- retry as needed

Note: The above code snippet comes from ObjectSynchronizer::prepend_block_to_lists(); see that function for more complete context (and comments).

The Not So Simple Case or Taking and Prepending on the Same List Leads to A-B-A Races

Note: This subsection is talking about "Simple Take" and "Simple Prepend" in abstract terms. The purpose of this code and A-B-A example is to introduce the race concepts. The code shown here is not an exact match for the project code and the specific A-B-A example is not found in the project code.

The left hand column shows "Thread1" taking a node "A" from the front of the list and it shows the simple code that does that operation. The right hand column shows "Thread2" prepending a node "B" to the front of the list and it shows the simple code that does that operation. We have a third thread, "Thread3", that does a take followed by a prepend, but we don't show a column for "Thread3". Instead we have a column in the middle that shows the results of the interleaved operations of all three threads:

    T1: Simple Take:                           |                                            | T2: Simple Prepend:
    ----------------                           |  T1 and T3 see this initial list:          | -------------------
            +---+    +---+    +---+            |          +---+    +---+    +---+           |          +---+    +---+
    head -> | A | -> | X | -> | Y |            |  head -> | A | -> | X | -> | Y |           |  head -> | X | -> | Y |
            +---+    +---+    +---+            |          +---+    +---+    +---+           |          +---+    +---+
                                               |  T3 takes "A", T2 sees this list:          |
    Take a node     ||                         |          +---+    +---+                    |  Prepend a node  ||
    from the front  ||                         |  head -> | X | -> | Y |                    |  to the front    ||
    of the list     ||                         |          +---+    +---+                    |  of the list     ||
                    \/                         |  T2 prepends "B":                          |                  \/
                                               |          +---+    +---+    +---+           |
            +---+    +---+                     |  head -> | B | -> | X | -> | Y |           |          +---+    +---+    +---+
    head -> | X | -> | Y |                     |          +---+    +---+    +---+           |  head -> | B | -> | X | -> | Y |
            +---+    +---+                     |  T3 prepends "A":                          |          +---+    +---+    +---+
           +---+                               |          +---+    +---+    +---+    +---+  |
    cur -> | A |                               |  head -> | A | -> | B | -> | X | -> | Y |  |
           +---+                               |          +---+    +---+    +---+    +---+  |
                                               |  T1 takes "A", loses "B":                  |
    // "take" a node:                          |  +---+                                     |  // "prepend" a node:
    while (true) {                             |  | B | ----+                               |  while (true) {
      cur = head;                              |  +---+     |                               |    cur = head;
      next = cur->next;                        |            V                               |    new->next = cur;
      if (cmpxchg(next, &head, cur) == cur) {  |          +---+    +---+                    |    if (cmpxchg(new, &head, cur) == cur) {
        break;  // success changing head       |  head -> | X | -> | Y |                    |      break; // success changing head
      }                                        |          +---+    +---+                    |    }
    }                                          |         +---+                              |  }
    return cur;                                |  cur -> | A |                              |
                                               |         +---+                              |

The "Simple Take" and "Simple Prepend" algorithms are just fine by themselves. The "Simple Prepend" algorithm is almost identical to the algorithm in the "The Simple Case" and just like that algorithm, it works fine if we are only doing prepend operations on the list. Similarly, the "Simple Take" algorithm works just fine if we are only doing take operations on the list; the only thing missing is an empty list check, but that would have clouded the example.

When we allow simultaneous take and prepend operations on the same list, the simple algorithms are exposed to A-B-A races. An A-B-A race is a situation where the head of the list can change from node "A" to node "B" and back to node "A" again without the simple algorithm being aware that critical state has changed. In the middle column of the above diagram, we show what happens when Thread3 causes the head of the list to change from node "A" to node "B" (a take operation) and back to node "A" (a prepend operation). That A-B-A race causes Thread1 to lose node "B" when it updates the list head to node "X" instead of node "B" because Thread1 was unaware that its local 'next' value was stale.

Here's the diagram again with the code in Thread1 and Thread2 lined up with the effects of the A-B-A race executed by Thread3:

    T1: Simple Take:                           |                                            | T2: Simple Prepend:
    ----------------                           |  T1 and T3 see this initial list:          | -------------------
    while (true) {                             |          +---+    +---+    +---+           |  :
      cur = head;                              |  head -> | A | -> | X | -> | Y |           |  :
      next = cur->next;                        |          +---+    +---+    +---+           |  :
      :                                        |  T3 takes "A", T2 sees this list:          |  :
      :                                        |          +---+    +---+                    |  :
      :                                        |  head -> | X | -> | Y |                    |  :
      :                                        |          +---+    +---+                    |  while (true) {
      :                                        |  T2 prepends "B":                          |    cur = head; 
      :                                        |          +---+    +---+    +---+           |    new->next = cur;
      :                                        |  head -> | B | -> | X | -> | Y |           |    if (cmpxchg(new, &head, cur) == cur) {
      :                                        |          +---+    +---+    +---+           |      break;
      :                                        |  T3 prepends "A":                          |    }
      :                                        |          +---+    +---+    +---+    +---+  |  }
      :                                        |  head -> | A | -> | B | -> | X | -> | Y |  |
      :                                        |          +---+    +---+    +---+    +---+  |
      :                                        |  T1 takes "A", loses "B":                  |
      :                                        |  +---+                                     |
      :                                        |  | B | ----+                               |
      :                                        |  +---+     |                               |
      :                                        |            V                               |
      :                                        |          +---+    +---+                    |
      if (cmpxchg(next, &head, cur) == cur) {  |  head -> | X | -> | Y |                    |
      }                                        |          +---+    +---+                    |
    }                                          |         +---+                              |
    return cur;                                |  cur -> | A |                              |
                                               |         +---+                              |

So the simple algorithms are not sufficient when we allow simultaneous take and prepend operations.

Marking to Solve the A-B-A Race

Note: This subsection is talking about "Marking" as a solution to the A-B-A race in abstract terms. The purpose of this marking code and A-B-A example is to introduce the solution concepts. The code shown here is not an exact match for the project code.

One solution to the A-B-A race is to mark the next field in a node to indicate that the node is busy. Only one thread can successfully mark the next field in a node at a time and other threads must loop around and retry their marking operation until they succeed. Each thread that marks the next field in a node must unmark the next field when it is done with the node so that other threads can proceed.

Here's the take algorithm modified with marking (still ignores the empty list for clarity):

    // "take" a node with marking:
    while (true) {
      cur = head;
      if (!mark_next(cur, &next)) {
        // could not mark cur so try again
        continue;
      }
      if (head != cur) {
        // head changed while marking cur so try again
        unmark_next(cur);
        continue;
      }
      // list head is now marked so switch it to next which also makes list head unmarked
      OrderAccess::release_store(&head, next);
      unmark_next(cur);  // unmark cur and return it
      return cur;
    }

The modified take algorithm does not change the list head pointer until it has successfully marked the list head node. Notice that after we mark the list head node we have to verify that the list head pointer hasn't changed in the mean time. Only after we have verified that the node we marked is still the list head is it safe to modify the list head pointer. The marking of the list head prevents the take algorithm from executing in parallel with a prepend algorithm and losing a node.

Also notice that we update the list head pointer with release-store instead of with cmpxchg. Since we have the list head marked, we are not racing with other threads to change the list head pointer so we can use the smaller release-store hammer instead of the heavier cmpxchg hammer.

Here's the prepend algorithm modified with marking (ignores the empty list for clarity):

    // "prepend" a node with marking:
    while (true) {
      cur = head;
      if (!mark_next(cur, &next)) {
        // could not mark cur so try again
        continue;
      }
      if (head != cur) {
        // head changed while marking cur so try again
        unmark_next(cur);
        continue;
      }
      // list head is now marked so switch it to 'new' which also makes list head unmarked
      Atomic::release_store(&head, new);
      unmark_next(cur);  // unmark the previous list head
    }

The modified prepend algorithm does not change the list head pointer until it has successfully marked the list head node. Notice that after we mark the list head node we have to verify that the list head pointer hasn't changed in the mean time. Only after we have verified that the node we marked is still the list head is it safe to modify the list head pointer. The marking of the list head prevents the prepend algorithm from executing in parallel with the take algorithm and losing a node.

Also notice that we update the list head pointer with release-store instead of with cmpxchg for the same reasons as the previous algorithm.

Background: ObjectMonitor Movement Between the Lists

The purpose of this subsection is to provide background information about how ObjectMonitors move between the various lists. This project changes the way these movements are implemented, but does not change the movements themselves. For example, newly allocated blocks of ObjectMonitors are always prepending to the global free list; this is true in the baseline and is true in this project. One exception is the optional addition of the global wait list (see below).

ObjectMonitor Allocation Path

- ObjectMonitors are allocated by ObjectSynchronizer::om_alloc().
- Assume that the calling JavaThread has an empty free list and the global free list is also empty:
  - A block of ObjectMonitors is allocated by the calling JavaThread and prepended to the global free list.
  - ObjectMonitors are taken from the front of the global free list by the calling JavaThread and prepended to the JavaThread's free list by ObjectSynchronizer::om_release().
  - An ObjectMonitor is taken from the front of the JavaThread's free list and prepended to the JavaThread's in-use list (optimistically).

ObjectMonitor Deflation Path

- ObjectMonitors are deflated at a safepoint by:
  ObjectSynchronizer::deflate_monitor_list() calling ObjectSynchronizer::deflate_monitor()
  And when Async Monitor Deflation is enabled, they are deflated by:
  ObjectSynchronizer::deflate_monitor_list_using_JT() calling ObjectSynchronizer::deflate_monitor_using_JT()
- Idle ObjectMonitors are deflated by the ServiceThread when Async Monitor Deflation is enabled. They can also be deflated at a safepoint by the VMThread or by a task worker thread. Safepoint deflation is used when Async Monitor Deflation is disabled or when there is a special deflation request, e.g., System.gc().
- An idle ObjectMonitor is deflated and extracted from its in-use list and prepended to the global free list. The in-use list can be either the global in-use list or a per-thread in-use list. Deflated ObjectMonitors are always prepended to the global free list.
  - In CR7/v2.07/10-for-jdk14, the HandshakeAfterDeflateIdleMonitors diagnostic option is added to enable a new g_wait_list that tracks deflated ObjectMonitors until after a handshake/safepoint with all JavaThreads.
  - The g_wait_list allows ObjectMonitors to be safely deflated on platforms that do not have C2 inc_om_ref_count() implemented. See the "T-save Complication with C2" subsection above for the gory C2 details.
  - So when the option is enabled, idle ObjectMonitors are deflated and extracted from an in-use list and prepended to g_wait_list; after the handshake/safepoint with all JavaThreads, the ObjectMonitors on the g_wait_list are prepended to the global free list.

ObjectMonitor Flush Path

- ObjectMonitors are flushed by ObjectSynchronizer::om_flush().
- When a JavaThread exits, the ObjectMonitors on its in-use list are prepended on the global in-use list and the ObjectMonitors on its free list are prepended on the global free list.

ObjectMonitor Linkage Path

- ObjectMonitors are linked with objects by ObjectSynchronizer::inflate().
- An inflate() call by one JavaThread can race with an inflate() call by another JavaThread for the same object.
- When inflate() realizes that it failed to link an ObjectMonitor with the target object, it calls ObjectSynchronizer::om_release() to extract the ObjectMonitor from the JavaThread's in-use list and prepends it on the JavaThread's free list.
  Note: Remember that ObjectSynchronizer::om_alloc() optimistically added the newly allocated ObjectMonitor to the JavaThread's in-use list.
- When inflate() successfully links an ObjectMonitor with the target object, that ObjectMonitor stays on the JavaThread's in-use list.

The Lists and Which Threads Touch Them

- global free list:
  - prepended to by JavaThreads that allocated a new block of ObjectMonitors (malloc time)
  - prepended to by JavaThreads that are exiting (and have a non-empty per-thread free list)
  - taken from the head by JavaThreads that need to allocate ObjectMonitor(s) for their per-thread free list (reprovision)
  - prepended to by deflation done by:
    - either the VMThread or a worker thread for safepoint based
    - or the ServiceThread for async monitor deflation
- global in-use list:
  - prepended to by JavaThreads that are exiting (and have a non-empty per-thread free list)
  - extracted from by deflation done by:
    - either the VMThread or a worker thread for safepoint based
    - or the ServiceThread for async monitor deflation
- global wait list:
  - only used when HandshakeAfterDeflateIdleMonitors == true
  - prepended by the ServiceThread during async deflation
  - entire list detached and prepended to the global free list by the ServiceThread during async deflation
  - Note: The global wait list serves the same function as Carsten's gFreeListNextSafepoint list in his prototype.
- per-thread free list:
  - prepended to by a JavaThread when it needs to allocate new ObjectMonitor(s) (reprovision)
  - taken from the head by a JavaThread when it needs to allocate a new ObjectMonitor (inflation)
  - prepended to by a JavaThread when it isn't able to link the object to the ObjectMonitor (failed inflation)
  - entire list detached and prepended to the global free list when the JavaThread is exiting
- per-thread in-use list:
  - prepended to by a JavaThread when it allocates a new ObjectMonitor (inflation, optimistically in-use)
  - extracted from by deflation done by:
    - either the VMThread or a worker thread for safepoint based
    - or the ServiceThread for async monitor deflation
  - entire list detached and prepended to the global in-use list when the JavaThread is exiting

Lock-Free Monitor List Management In Reality

Prepending To A List That Also Allows Deletes

It is now time to switch from algorithms to real snippets from the code.

The next case to consider for lock-free list management with the Java Monitor subsystem is prepending to a list that also allows deletes. As you might imagine, the possibility of a prepend racing with a delete makes things more complicated. The solution is to "mark" the next field in the ObjectMonitor at the head of the list we're trying to prepend to. A successful mark tells other prependers or deleters that the marked ObjectMonitor is busy and they will need to retry their own mark operation.

    L01:  while (true) {
    L02:    (void)mark_next_loop(m);  // mark m so we can safely update its next field
    L03:    ObjectMonitor* cur = NULL;
    L04:    ObjectMonitor* next = NULL;
    L05:    // Mark the list head to guard against A-B-A race:
    L06:    if (mark_list_head(list_p, &cur, &next)) {
    L07:      // List head is now marked so we can safely switch it.
    L08:      set_next(m, cur);  // m now points to cur (and unmarks m)
    L09:      OrderAccess::release_store(list_p, m);  // Switch list head to unmarked m.
    L10:      set_next(cur, next);  // Unmark the previous list head.
    L11:      break;
    L12:    }
    L13:    // The list is empty so try to set the list head.
    L14:    assert(cur == NULL, "cur must be NULL: cur=" INTPTR_FORMAT, p2i(cur));
    L15:    set_next(m, cur);  // m now points to NULL (and unmarks m)
    L16:    if (Atomic::cmpxchg(m, list_p, cur) == cur) {
    L17:      // List head is now unmarked m.
    L18:      break;
    L19:    }
    L20:    // Implied else: try it all again
    L21:  }
    L22:  Atomic::inc(count_p);

What the above block of code does is:

- prepends an ObjectMonitor 'm' to the front of the list referred to by list_p
  - mark 'm's next field
  - mark the list head's next field
  - update 'm' to refer to the list head
  - update 'list_p' to refer to 'm'
  - unmark the next field in the previous list head
- increments the counter referred to by 'count_p' by one

The above block of code can be called by multiple prependers in parallel or with deleters running in parallel and must not lose track of any ObjectMonitor. Of course, the "must not lose track of any ObjectMonitor" part is where all the details come in:

- L02 loops to mark 'm's next field; we have to loop because another thread (T2) might have 'm' marked and we try again until we have marked it.
  You might be asking yourself: why does T2 have 'm' marked?
  - Before T1 was trying to prepend 'm' to an in-use list, T1 and T2 were racing to take an ObjectMonitor off the free list.
  - T1 won the race, marked 'm', removed 'm' from the free list and unmarked 'm'; T2 stalled before trying to mark 'm'.
  - T2 resumed and marked 'm', realized that 'm' was no longer the head of the free list, unmarked 'm' and tried it all again.
  - If our thread (T1) does not mark 'm' before it tries to prepend it to an in-use list, then T2's umarking of 'm' could erase the next value that T1 wants to put in 'm'.
- L06 tries to mark the list head 'list_p'; if mark_list_head() returns true, we have the list head marked and can safely update it:
  - L08: Update 'm's next field to point to the current list head (which unmarks 'm').
  - L09: release-store 'm' into 'list_p' which switches the list head to an unmarked 'm'.
  - L10: We unmark the previous list head.
- If mark_list_head() returned false, we have an empty list:
  - L15: Update 'm's next field to NULL (which unmarks 'm').
  - L16: Try to cmpxchg 'list_p' to 'm':
    - if cmpxchg works, then we're done.
    - Otherwise, another prepender won the race to update the list head so we have to try again.
- The counter referred to by 'count_p' is incremented by one.

ObjectMonitor 'm' is safely on the list at the point that we have updated 'list_p' to refer to 'm'. In this subsection's block of code, we also called three new functions, mark_next_loop(), mark_list_head() and set_next(), that are explained in the next few subsections about helper functions.

Note: The above code snippet comes from prepend_to_common(); see that function for more context and a few more comments.

mark_next(), mark_om_ptr(), and set_next() Helper Functions

Managing marks on ObjectMonitors has been abstracted into a few helper functions. mark_next() is the first interesting one:

    L01:  static bool mark_next(ObjectMonitor* om, ObjectMonitor** next_p) {
    L02:    // Get current next field without any marking value.
    L03:    ObjectMonitor* next = (ObjectMonitor*)
    L04:        ((intptr_t)OrderAccess::load_acquire(&om->_next_om) & ~0x1);
    L05:    if (Atomic::cmpxchg(mark_om_ptr(next), &om->_next_om, next) != next) {
    L06:      return false;  // Could not mark the next field or it was already marked.
    L07:    }  
    L08:    *next_p = next;
    L09:    return true;
    L10:  }

The above function tries to mark the next field in an ObjectMonitor:

- If marking is successful, then the unmarked next field is returned via parameter and true is returned.
- Otherwise, false is returned.

The function can be called by multiple threads at the same time and only one thread will succeed in the marking operation (return == true) and all other threads will get return == false. Of course, the "only one thread will succeed" part is where all the details come in:

- L0[34] load-acquires the ObjectMonitor's next field and strips the marking bit:
  - The unmarked value is saved in 'next'.
  - We need the unmarked next value in order to properly detect if the next field was already marked.
- L05 tries to cmpxchg a marked 'next' value into the ObjectMonitor's next field:
  - if cmpxchg does not work, then we return false:
    - The cmpxchg will not work if the next field changes after we load-acquired the value on L04.
    - The cmpxchg will not work if the next field is already marked.
  - Otherwise, we return the unmarked 'next' value via 'next_p' and return true.

The mark_next() function calls another helper function, mark_om_ptr(), that needs a quick explanation:

    L1:  static ObjectMonitor* mark_om_ptr(ObjectMonitor* om) {
    L2:    return (ObjectMonitor*)((intptr_t)om | 0x1);
    L3:  }

This function encapsulates the setting of the marking bit in an ObjectMonitor* for the purpose of hiding the details and making the calling code easier to read:

- L2 casts the ObjectMonitor* into a type that will allow the '|' operator to be used.
- We use the 0x1 bit as our marking value because ObjectMonitors are aligned on a cache line so the low order bit is not used by the normal addressing of an ObjectMonitor*.

set_next() is the next interesting function and it also only needs a quick explanation:

    L1:  static void set_next(ObjectMonitor* om, ObjectMonitor* value) {
    L2:    OrderAccess::release_store(&om->_next_om, value);
    L3:  }

This function encapsulates the setting of the next field in an ObjectMonitor for the purpose of hiding the details and making the calling code easier to read:

- This function is simply a wrapper around a release-store of an ObjectMonitor* into the next field in an ObjectMonitor.
- The typical "set_next(cur, next)" call sequence is easier to read than "OrderAccess::release_store(&cur→_next_om, next)".

mark_next_loop() Helper Function

mark_next_loop() is the next interesting helper function:

    L1:  static ObjectMonitor* mark_next_loop(ObjectMonitor* om) {
    L2:    ObjectMonitor* next;
    L3:    while (true) {
    L4:      if (mark_next(om, &next)) {
    L5:        // Marked om's next field so return the unmarked value.
    L6:        return next;
    L7:      }
    L8:    }
    L9:  }

The above function loops until it marks the next field of the target ObjectMonitor. The unmarked value of the next field is returned by the function. There is nothing particularly special about this function so we don't need any line specific annotations.

Debugging Tip: If there's a bug where an ObjectMonitor's next field is not properly unmarked, then this function will loop forever and the caller will be stuck.

mark_list_head() Helper Function

mark_list_head() is the next interesting helper function:

    L01:  static bool mark_list_head(ObjectMonitor* volatile * list_p,
    L02:                             ObjectMonitor** mid_p, ObjectMonitor** next_p) {
    L03:    while (true) {
    L04:      ObjectMonitor* mid = OrderAccess::load_acquire(list_p);
    L05:      if (mid == NULL) {
    L06:        return false;  // The list is empty so nothing to mark.
    L07:      }
    L08:      if (mark_next(mid, next_p)) {
    L09:        if (OrderAccess::load_acquire(list_p) != mid) {
    L10:          // The list head changed so we have to retry.
    L11:          set_next(mid, *next_p);  // unmark mid
    L12:          continue;
    L13:        }
    L14:        // We marked next field to guard against races.
    L15:        *mid_p = mid;
    L16:        return true;
    L17:      }
    L18:    }
    L19:  }

The above function tries to mark the next field in the list head's ObjectMonitor:

- If the list is empty, false is returned.
- Otherwise, the list head's ObjectMonitor* is returned via parameter (mid_p), the unmarked next field is returned via parameter (next_p) and true is returned.

The function can be called by more than one thread on the same 'list_p' at a time. False is only returned when 'list_p' refers to an empty list. Otherwise only one thread will return true at a time with the 'mid_p' and 'next_p' return parameters set. Since the next field in 'mid_p' is marked, any parallel callers to mark_list_head() will loop until the next field in the list head's ObjectMonitor is no longer marked. That typically happens when the list head's ObjectMonitor is taken off the list and 'list_p' is advanced to the next ObjectMonitor on the list. Of course, making sure that "only one thread will return true at a time" is where all the details come in:

- L04 load-acquires the current 'list_p' value into 'mid'; the use of load-acquire is necessary to get the latest value release-stored by another thread; the current 'list_p' is updated by either a release-store or a cmpxchg depending on the algorithm that made the update; only the release-store needs to match up with a load-acquire, but this code doesn't know whether release-store or cmpxchg was used.
- L0[56] is the empty list check and the only time that false is returned by this function.
- L08 tries to mark the next field in 'mid':
  - If marking is not successful, we loop around to try it all again.
  - If marking is successful, then 'next_p' contains mid's unmarked next field value.
  - L09 load-acquires the current 'list_p' value to see if it still matches 'mid':
    - If the list head has changed, then we unmark mid on L11 and try it all again.
    - Otherwise, 'mid' is returned via 'mid_p' and we return true.

When this function returns true, the next field in 'mid_p' is marked and any parallel callers of mark_list_head() on the same list will be looping until the next field in the list head's ObjectMonitor is no longer marked. The caller that just got the 'true' return needs to finish up its work with 'mid_p' quickly.

Debugging Tip: If there's a bug where the list head ObjectMonitor's next field is not properly unmarked, then this function will loop forever and the caller will be stuck.

Taking From The Start Of A List

The next case to consider for lock-free list management with the Java Monitor subsystem is taking an ObjectMonitor from the start of a list. Taking an ObjectMonitor from the start of a list is a specialized form of delete that is guaranteed to interact with a thread that is prepending to the same list at the same time. Again, the core of the solution is to "mark" the next field in the ObjectMonitor at the head of the list we're trying to take the ObjectMonitor from, but we use slightly different code because we have less linkages to make than a prepend.

    L01:  static ObjectMonitor* take_from_start_of_common(ObjectMonitor* volatile * list_p,
    L02:                                                  int volatile * count_p) {
    L03:    ObjectMonitor* next = NULL;
    L04:    ObjectMonitor* take = NULL;
    L05:    // Mark the list head to guard against A-B-A race:
    L06:    if (!mark_list_head(list_p, &take, &next)) {
    L07:      return NULL;  // None are available.
    L08:    }
    L09:    // Switch marked list head to next (which unmarks the list head, but
    L10:    // leaves take marked):
    L11:    OrderAccess::release_store(list_p, next);
    L12:    Atomic::dec(count_p);
    L13:    // Unmark take, but leave the next value for any lagging list
    L14:    // walkers. It will get cleaned up when take is prepended to
    L15:    // the in-use list:
    L16:    set_next(take, next);
    L17:    return take;
    L18:  }

What the above function does is:

- Tries to mark the ObjectMonitor at the head of the list:
  - Marking will only fail if the list is empty so that NULL can be returned.
  - Otherwise mark_list_head() will loop until the ObjectMonitor at the list head has been marked.
- Updates the list head to refer to the next ObjectMonitor.
- Decrements the counter referred to by 'count_p'.
- Unmarks the next field in the taken ObjectMonitor.

The function can be called by more than one thread at a time and each thread will take a unique ObjectMonitor from the start of the list (if one is available) without losing any other ObjectMonitors on the list. Of course, the "take a unique ObjectMonitor" and "without losing any other ObjectMonitors" parts are where all the details come in:

- L06 tries to mark the list head:
  - mark_list_head() returns false if the list is empty so we return NULL on L07.
  - Otherwise, 'take' is a pointer to the marked list head and 'next' is the unmarked next field in the list head.
- L11 release-stores 'next' into 'list_p'.
  You might be asking yourself: Why release-store instead of cmpxchg?
  - mark_list_head() only returns to the caller when it has marked the next field in the ObjectMonitor at the head of the list.
  - Because of that guarantee, any prepender or deleter thread that is running in parallel must loop until we have release-stored 'next' into 'list_p' which unmarks the list head.
- L12 decrements the counter referred to by 'count_p'.
- L16 unmarks 'take' using the unmarked 'next' value we got from mark_list_head():
  - Keeping the 'next' value in take's next field allows any lagging list walker to get to the next ObjectMonitor on that list.
  - take's next field will get cleaned up when take is prepended to its target in-use list.
- L17 returns 'take' to the caller.

Using The New Lock-Free Monitor List Functions

ObjectSynchronizer::om_alloc(Thread* self, ...)

ObjectSynchronizer::om_alloc() is responsible for allocating an ObjectMonitor and returning it to the caller. It has a three step algorithm:

1) Try to allocate from self's local free-list:

- take_from_start_of_om_free_list(self) takes an ObjectMonitor from self's free list (if possible).
- prepend_to_om_in_use_list(self, m) prepends the newly taken ObjectMonitor to self's in-use list.

2) Try to allocate from the global free list (up to self→om_free_provision times):

- take_from_start_of_g_free_list() takes an ObjectMonitor from the global free list (if possible).
- om_release(self, take, false) prepends the newly taken ObjectMonitor to self's free list.
- Retry the allocation from step 1.

3) Allocate a block of new ObjectMonitors:

- prepend_block_to_lists() prepends the newly allocated block to 'g_block_list' and to the global free list.
- Retry the allocation from step 1.

ObjectSynchronizer::om_release(Thread* self, ObjectMonitor* m, bool from_per_thread_alloc)

ObjectSynchronizer::om_release() is responsible for putting an ObjectMonitor on self's free list. If 'from_per_thread_alloc' is true, then om_release() is also responsible for extracting the ObjectMonitor from self's in-use list. The extraction from self's in-use list must happen first:

    L01:  if (from_per_thread_alloc) {
    L02:    mark_list_head(&self->om_in_use_list, &mid, &next);
    L03:    while (true) {
    L04:      if (m == mid) {
    L05:        if (cur_mid_in_use == NULL) {
    L06:          OrderAccess::release_store(&self->om_in_use_list, next);
    L07:        } else {
    L08:          OrderAccess::release_store(&cur_mid_in_use->_next_om, next);
    L09:        }
    L10:        extracted = true;
    L11:        Atomic::dec(&self->om_in_use_count);
    L12:        set_next(mid, next);
    L13:        break;
    L14:      }
    L15:      if (cur_mid_in_use != NULL) {
    L16:        set_next(cur_mid_in_use, mid);  // umark cur_mid_in_use
    L17:      }
    L18:      cur_mid_in_use = mid;
    L19:      mid = next;
    L20:      next = mark_next_loop(mid);
    L21:    }
    L22:  }
    L23:  prepend_to_om_free_list(self, m);

Most of the above code block extracts 'm' from self's in-use list; it is not an exact quote from om_release(), but it is the highlights:

- L02 is used to mark self's in-use list head:
  - 'mid' is self's in-use list head and its next field is marked.
  - 'next' is the unmarked next field from 'mid'.
- L03 → L21: self's in-use list is traversed looking for the target ObjectMonitor 'm':
  - L04: if the current 'mid' matches 'm':
    - L05: if cur_mid_in_use is NULL, we're still processing the head of the thread's in-use list so...
      - L06: we cmpxchg the list head's next field from 'marked_mid' to 'next'.
    - else
      - L08: we cmpxchg the cur_mid_in_use's next field from 'marked_mid' to 'next'.
    - L10 → L13: we've successfully extracted 'm' from self's in-use list so we decrement self's in-use counter, unmark the next field in 'mid' and we're done.
  - L1[56]: if cur_mid_in_use != NULL, then unmark its next field.
  - L18: set 'cur_mid_in_use' to 'mid'
    Note: cur_mid_in_use keeps the marked next field so that it remains stable for a possible next field change. It cannot be deflated while it is marked.
  - L19: set 'mid' to 'next'.
  - L20: mark next field in the new 'mid' and update 'next'; loop around and do it all again.

The last line of the code block (L23) prepends 'm' to self's free list.

ObjectSynchronizer::om_flush(Thread* self)

ObjectSynchronizer::om_flush() is reponsible for flushing self's in-use list to the global in-use list and self's free list to the global free list during self's thread exit processing. om_flush() starts with self's in-use list:

    L01:  if (mark_list_head(&self->om_in_use_list, &in_use_list, &next)) {
    L02:    in_use_tail = in_use_list;
    L03:    in_use_count++;
    L04:    for (ObjectMonitor* cur_om = unmarked_next(in_use_list); cur_om != NULL;) {
    L05:      if (is_next_marked(cur_om)) {
    L06:        while (is_next_marked(cur_om)) {
    L07:          os::naked_short_sleep(1);
    L08:        }
    L09:        cur_om = unmarked_next(in_use_tail);
    L10:        continue;
    L11:      }
    L12:      if (!cur_om->is_active()) {
    L13:        cur_om = unmarked_next(in_use_tail);
    L14:        continue;
    L15:      }
    L16:      in_use_tail = cur_om;
    L17:      in_use_count++;
    L18:      cur_om = unmarked_next(cur_om);
    L19:    }
    L20:    OrderAccess::release_store(&self->om_in_use_count, 0);
    L21:    OrderAccess::release_store(&self->om_in_use_list, (ObjectMonitor*)NULL);
    L22:    set_next(in_use_list, next);
    L23:  }

The above is not an exact copy of the code block from om_flush(), but it is the highlights. What the above code block needs to do is pretty simple:

- Count the number of ObjectMonitors on the in-use list using 'in_use_count'.
- Point to the last ObjectMonitor on the in-use list using 'in_use_tail'.
- Set self's in-use count to zero.
- Set self's in-use llist to NULL.

However, in this case, there are a lot of details:

- L01 marks the in-use list head (if it is not empty):
  - 'in_use_list' is self's in-use list head and its next field is marked.
  - 'next' is the unmarked next field from 'in_use_list'.
  - The in-use list head is kept marked to prevent an async deflation thread from entering the list behind this thread.
    Note: An async deflation thread does check to see if the target thread is exiting, but if it has made it past that check before this thread started exiting, then we're racing.
- L04-L18: loops over the in-use list counting and advancing 'in_use_tail'.
  - L05-L10: 'cur_om's next field is marked so there must be an async deflater thread ahead of us so we delay to give it a chance to finish and refetch 'in_use_tail's (possibly changed) next field and try again.
  - L12-L14: 'cur_om' was deflated and its allocation state was changed to Free while it was marked. We just happened to be lucky enough to see it just after it was unmarked (and added to the free list). We refetch 'in_use_tail's (possibly changed) next field and try again.
  - L1[67]: finally 'cur_om' has been completely vetted so we can update 'in_use_tail' and increment 'in_use_count'.
  - L18: advance 'cur_om' to the next ObjectMonitor and do it all again.
- L20: release-store self's in-use count to zero.
  Note: We clear self's in-use count before umarking self's in-use list head to avoid races.
- L21: release-store self's in-use list head to NULL.
- L22: unmark the disconnected list head.
  Note: Yes, the next field in self's in-use list head was kept marked for the whole loop to keep any racing async deflater thread out of the in-use list. After L21, the racing async deflater thread will loop around and see self's in-use list is empty and bail out.

The code to process self's free list is much, much simpler because we don't have any races with an async deflater thread like self's in-use list. The only interesting bits:

- load-acquire self's free list head.
- release-store self's free list head to NULL.
- release-store self's free list count to zero.

The last interesting bits for this function are prepending the local lists to the right global places:

- prepend_list_to_g_free_list(free_list, free_tail, free_count);
- prepend_list_to_g_om_in_use_list(in_use_list, in_use_tail, in_use_count);

ObjectSynchronizer::deflate_monitor_list(ObjectMonitor* volatile * list_p, int volatile * count_p, ObjectMonitor free_head_p, ObjectMonitor free_tail_p)

ObjectSynchronizer::deflate_monitor_list() is responsible for deflating idle ObjectMonitors at a safepoint. This function can use the simpler mark-mid-as-we-go protocol since there can be no parallel list deletions due to the safepoint:

    L01:  int ObjectSynchronizer::deflate_monitor_list(ObjectMonitor* volatile * list_p,
    L02:                                               int volatile * count_p,
    L03:                                               ObjectMonitor** free_head_p,
    L04:                                               ObjectMonitor** free_tail_p) {
    L05:    ObjectMonitor* cur_mid_in_use = NULL;
    L06:    ObjectMonitor* mid = NULL;
    L07:    ObjectMonitor* next = NULL;
    L08:    int deflated_count = 0;
    L09:    if (!mark_list_head(list_p, &mid, &next)) {
    L10:      return 0;  // The list is empty so nothing to deflate.
    L11:    }
    L12:    while (true) {
    L13:      oop obj = (oop) mid->object();
    L14:      if (obj != NULL && deflate_monitor(mid, obj, free_head_p, free_tail_p)) {
    L15:        if (cur_mid_in_use == NULL) {
    L16:          OrderAccess::release_store(list_p, next);
    L17:        } else {
    L18:          OrderAccess::release_store(&cur_mid_in_use->_next_om, next);
    L19:        }
    L20:        deflated_count++;
    L21:        Atomic::dec(count_p);
    L22:        set_next(mid, NULL);
    L23:        mid = next;
    L24:      } else {
    L25:        set_next(mid, next);  // unmark next field
    L26:        cur_mid_in_use = mid;
    L27:        mid = next;
    L28:      }
    L29:      if (mid == NULL) {
    L30:        break;  // Reached end of the list so nothing more to deflate.
    L31:      }
    L32:      next = mark_next_loop(mid);
    L33:    }
    L34:    return deflated_count;
    L35:  }

The above is not an exact copy of the code block from deflate_monitor_list(), but it is the highlights. What the above code block needs to do is pretty simple:

- Walk the list and deflate any idle ObjectMonitor that is associated with an object.
- 'free_head_p' and 'free_tail_p' track the list of deflated ObjectMonitors.
- 'deflated_count' is the number of deflated ObjectMonitors on the 'free_head_p' list.

Since we're using the simpler mark-mid-as-we-go protocol, there are not too many details:

- L09 marks the 'list_p' head (if it is not empty):
  - 'mid' is 'list_p's head and its next field is marked.
  - 'next' is the unmarked next field from 'mid'.
- L12-L34: We walk each 'mid' in the list and determine if it can be deflated:
  - L14: if 'mid' is associated with an object and can be deflated:
    - L15: if cur_mid_in_use is NULL, we're still processing the head of the in-use list so...
      - L16: we cmpxchg the list head's next field from 'marked_mid' to 'next'.
    - else
      - L18: we cmpxchg the cur_mid_in_use's next field from 'marked_mid' to 'next'.
    - L20 → L23: we've successfully extracted 'mid' from 'list_p's list so we increment 'deflated_count', decrement the counter referred to by 'count_p', set 'mid's next field to NULL and we're done.
      Note: 'mid' is the current tail in the 'free_head_p' list so we have to NULL terminate it (which also unmarks it).

- - L2[4-7]: 'mid' can't be deflated so unmark mid's next field and advance both 'cur_mid_in_use' and 'mid'.
  - L29 → L30]: we reached the end of the list so break out of the loop.
  - L32: mark next field in the new 'mid' and update 'next'; loop around and do it all again.
- L34: all done so return 'deflated_count'.

ObjectSynchronizer::deflate_monitor_list_using_JT(ObjectMonitor* volatile * list_p, int volatile * count_p, ObjectMonitor free_head_p, ObjectMonitor free_tail_p, ObjectMonitor** saved_mid_p)

ObjectSynchronizer::deflate_monitor_list_using_JT() is responsible for asynchronously deflating idle ObjectMonitors using a JavaThread. This function uses the more complicated mark-cur_mid_in_use-and-mid-as-we-go protocol because om_release() can do list deletions in parallel. We also mark-next-next-as-we-go to prevent an om_flush() that is behind this thread from passing us. Because this function can asynchronously interact with so many other functions, this is the largest clip of code:

    L01:  int ObjectSynchronizer::deflate_monitor_list_using_JT(ObjectMonitor* volatile * list_p,
    L02:                                                        int volatile * count_p,
    L03:                                                        ObjectMonitor** free_head_p,
    L04:                                                        ObjectMonitor** free_tail_p,
    L05:                                                        ObjectMonitor** saved_mid_in_use_p) {
    L06:    ObjectMonitor* cur_mid_in_use = NULL;
    L07:    ObjectMonitor* mid = NULL;
    L08:    ObjectMonitor* next = NULL;
    L09:    ObjectMonitor* next_next = NULL;
    L10:    int deflated_count = 0;
    L11:    if (*saved_mid_in_use_p == NULL) {
    L12:      if (!mark_list_head(list_p, &mid, &next)) {
    L13:        return 0;  // The list is empty so nothing to deflate.
    L14:      }
    L15:    } else {
    L16:      cur_mid_in_use = *saved_mid_in_use_p;
    L17:      mid = mark_next_loop(cur_mid_in_use);
    L18:      if (mid == NULL) {
    L19:        set_next(cur_mid_in_use, NULL);  // unmark next field
    L20:        *saved_mid_in_use_p = NULL;
    L21:        return 0;  // The remainder is empty so nothing more to deflate.
    L22:      }
    L23:      next = mark_next_loop(mid);
    L24:    }
    L25:    while (true) {
    L26:      if (next != NULL) {
    L27:        next_next = mark_next_loop(next);
    L28:      }
    L29:      if (mid->object() != NULL && mid->is_old() &&
    L30:          deflate_monitor_using_JT(mid, free_head_p, free_tail_p)) {
    L31:        if (cur_mid_in_use == NULL) {
    L32:          OrderAccess::release_store(list_p, next);
    L33:        } else {
    L34:          ObjectMonitor* marked_next = mark_om_ptr(next);
    L35:          OrderAccess::release_store(&cur_mid_in_use->_next_om, marked_next);
    L36:        }
    L37:        deflated_count++;
    L38:        Atomic::dec(count_p);
    L39:        set_next(mid, NULL);
    L40:        mid = next;  // mid keeps non-NULL next's marked next field
    L41:        next = next_next;
    L42:      } else {
    L43:        if (cur_mid_in_use != NULL) {
    L44:          set_next(cur_mid_in_use, mid);  // umark cur_mid_in_use
    L45:        }
    L46:        cur_mid_in_use = mid;
    L47:        mid = next;  // mid keeps non-NULL next's marked next field
    L48:        next = next_next;
    L49:        if (SafepointSynchronize::is_synchronizing() &&
    L50:            cur_mid_in_use != OrderAccess::load_acquire(list_p) &&
    L51:            cur_mid_in_use->is_old()) {
    L52:          *saved_mid_in_use_p = cur_mid_in_use;
    L53:          set_next(cur_mid_in_use, mid);  // umark cur_mid_in_use
    L54:          if (mid != NULL) {
    L55:            set_next(mid, next);  // umark mid
    L56:          }
    L57:          return deflated_count;
    L58:        }
    L59:      }
    L60:      if (mid == NULL) {
    L61:        if (cur_mid_in_use != NULL) {
    L62:          set_next(cur_mid_in_use, mid);  // umark cur_mid_in_use
    L63:        }
    L64:        break;  // Reached end of the list so nothing more to deflate.
    L65:      }
    L66:    }
    L67:    *saved_mid_in_use_p = NULL;
    L68:    return deflated_count;
    L69:  }

The line numbers in the analysis below are still for the v2.06 version and will be updated when we roll the project forward to v2.07.

The above is not an exact copy of the code block from deflate_monitor_list_using_JT(), but it is the highlights. What the above code block needs to do is pretty simple:

- Walk the list and deflate any idle ObjectMonitor that is associated with an object.
- 'free_head_p' and 'free_tail_p' track the list of deflated ObjectMonitors.
- 'deflated_count' is the number of deflated ObjectMonitors on the 'free_head_p' list.

Since we're using the more complicated mark-cur_mid_in_use-and-mid-as-we-go protocol and also the mark-next-next-as-we-go protocol, there is a mind numbing amount of detail:

- L1[1-3]: Handle the initial setup if we are not resuming after a safepoint:
  - L12 marks the 'list_p' head (if it is not empty):
    - 'mid' is 'list_p's head and its next field is marked.
    - 'next' is the unmarked next field from 'mid'.
- L15-L23: Handle the initial setup if we are resuming after a safepoint:
  - L17: mark next field in 'cur_mid_in_use' and update 'mid'
  - L18-L21: If 'mid' == NULL, then we've resumed context at the end of the list so we're done.
  - L23: mark next field in 'mid' and update 'next'
- L25-L64: We walk each 'mid' in the list and determine if it can be deflated:
  - L2[67]: if next != NULL, then mark next field in 'next' and update 'next_next'
  - L29-L41: if 'mid' is associated with an object, 'mid' is old, and can be deflated:
    - L31: if cur_mid_in_use is NULL, we're still processing the head of the in-use list so...
      - L32: we cmpxchg the list head's next field from 'marked_mid' to 'next'.
    - else
      - L34: make a marked copy of 'next'
      - L35: we cmpxchg the cur_mid_in_use's next field from 'marked_mid' to 'next'.
    - L37 → L39: we've successfully extracted 'mid' from 'list_p's list so we increment 'deflated_count', decrement the counter referred to by 'count_p', set 'mid's next field to NULL and we're done.
      Note: 'mid' is the current tail in the 'free_head_p' list so we have to NULL terminate it (which also unmarks it).
    - L40: advance 'mid' to 'next'.
      Note: 'mid' keeps non-NULL 'next's marked next field.
    - L41: advance 'next' to 'next_next'.
  - L42-L57: 'mid' can't be deflated so we have to carefully advance the list pointers:
    - L4[34]: if cur_mid_in_use != NULL, then unmark next field in 'cur_mid_in_use'.
    - L46: advance 'cur_mid_in_use' to 'mid'.
      Note: The next field in 'mid' is still marked and 'cur_mid_in_use' keeps that.
    - L47: advance 'mid' to 'next'.
      Note: The next field in a non-NULL 'next' is still marked and 'mid' keeps that.
    - L48: advance 'next' to 'next_next'.
    - L49-L57: Handle a safepoint if one has started and it is safe to do so.
  - L60-L64: we reached the end of the list:
    - L6[12]: if cur_mid_in_use != NULL, then unmark next field in 'cur_mid_in_use'.
    - L64: break out of the loop because we are done
- L67: not pausing for a safepoint so clear saved state.
- L68: all done so return 'deflated_count'.

ObjectSynchronizer::deflate_idle_monitors(...)

ObjectSynchronizer::deflate_idle_monitors() handles deflating idle monitors at a safepoint from the global in-use list using ObjectSynchronizer::deflate_monitor_list(). There are only a few things that are worth mentioning:

- OrderAccess::load_acquire(&g_om_in_use_list) is used to get the latest global in-use list.
- OrderAccess::load_acquire(&g_om_in_use_count) is used to get the latest global in-use count.
- prepend_list_to_g_free_list(free_head_p, free_tail_p, deflated_count) is used to prepend the deflated ObjectMonitors on the global free list.

ObjectSynchronizer::deflate_common_idle_monitors_using_JT(bool is_global, JavaThread* target)

ObjectSynchronizer::deflate_common_idle_monitors_using_JT() handles asynchronously deflating idle monitors from either the global in-use list or a per-thread in-use list using ObjectSynchronizer::deflate_monitor_list_using_JT(). There are only a few things that are worth mentioning:

- OrderAccess::load_acquire(&g_om_in_use_count) is used to get the latest global in-use count.
- OrderAccess::load_acquire(&target→om_in_use_count) is used to get the latest per-thread in-use count.
- prepend_list_to_g_free_list(free_head_p, free_tail_p, local_deflated_count) is used to prepend the deflated ObjectMonitors on the global free list.

Housekeeping Parts of the Algorithm

The devil is in the details! Housekeeping or administrative stuff are usually detailed, but necessary.

New diagnostic option '-XX:AsyncDeflateIdleMonitors' that is default 'true' so that the new mechanism is used by default, but it can be disabled for potential failure diagnosis.
ObjectMonitor deflation is still initiated or signaled as needed at a safepoint. When Async Monitor Deflation is in use, flags are set so that the work is done by the ServiceThread which offloads the safepoint cleanup mechanism.
- Having the ServiceThread deflate a potentially long list of in-use monitors could potentially delay the start of a safepoint. This is detected in ObjectSynchronizer::deflate_monitor_list_using_JT() which will save the current state when it is safe to do so and return to its caller to drop locks as needed before honoring the safepoint request.
New diagnostic option '-XX:AsyncDeflationInterval' that is default 250 millis; this this option controls how frequently we async default idle monitors when MonitorUsedDeflationThreshold is exceeded.
New diagnostic option '-XX:HandshakeAfterDeflateIdleMonitors' that is default false on the LP64 X64 platform and default true on other platforms that implement C2.

Everything else is just monitor list management, infrastructure, logging, debugging and the like. :-)

Monitor Deflation Invocation Details

The existing safepoint deflation mechanism is still invoked at safepoint "cleanup" time when '-XX:AsyncDeflateIdleMonitors' is false or when a special cleanup request is made.
SafepointSynchronize::do_cleanup_tasks() calls:
- ObjectSynchronizer::prepare_deflate_idle_monitors()
- A ParallelSPCleanupTask is used to perform the tasks (possibly using parallel tasks):
  - A ParallelSPCleanupThreadClosure is used to perform the per-thread tasks:
    - ObjectSynchronizer::deflate_thread_local_monitors() to deflate per-thread idle monitors
  - ObjectSynchronizer::deflate_idle_monitors() to deflate global idle monitors
- ObjectSynchronizer::finish_deflate_idle_monitors()

If MonitorUsedDeflationThreshold is exceeded (default is 90%, 0 means off), then the ServiceThread will invoke a cleanup safepoint when '-XX:AsyncDeflateIdleMonitors' is false. When '-XX:AsyncDeflateIdleMonitors' is true, the ServiceThread will call ObjectSynchronizer::deflate_idle_monitors_using_JT().
- This experimental flag was added in JDK10 via:

JDK-8181859 Monitor deflation is not checked in cleanup path

- For this option, exceeded means:

((g_om_population - g_om_free_count) / g_om_population) > NN%

If MonitorBound is exceeded (default is 0 which means off), cleanup safepoint will be induced.
For this option, exceeded means:

(g_om_population - g_om_free_count) > MonitorBound

This is a very difficult option to use correctly as it does not scale.
The MonitorBound option has been deprecated via JDK-8230938.

Changes to the safepoint deflation mechanism by the Async Monitor Deflation project (when async deflation is enabled):
- If System.gc() is called, then a special deflation request is made which invokes the safepoint deflation mechanism.
- Added the AsyncDeflationInterval diagnostic option (default 250 millis, 0 means off) to prevent MonitorUsedDeflationThreshold requests from swamping the ServiceThread.
  - Description: Async deflate idle monitors every so many milliseconds when MonitorUsedDeflationThreshold is exceeded (0 is off).
  - A special deflation request can cause an async deflation to happen sooner than AsyncDeflationInterval.
- SafepointSynchronize::do_cleanup_tasks() now calls:
  - ObjectSynchronizer::is_safepoint_deflation_needed() instead of ObjectSynchronizer::is_cleanup_needed().
  - is_safepoint_deflation_needed() returns true only if a special deflation request is made (see above).
- ObjectSynchronizer::do_safepoint_work() only does the safepoint cleanup tasks if there is a special deflation request. Otherwise it just sets the is_async_deflation_requested flag and notifies the ServiceThread.
- ObjectSynchronizer::deflate_idle_monitors() and ObjectSynchronizer::deflate_thread_local_monitors() do nothing unless there is a special deflation request.

Changes to the ServiceThread mechanism by the Async Monitor Deflation project (when async deflation is enabled):
- The ServiceThread will wake up every GuaranteedSafepointInterval to check for cleanup tasks.
  - This allows is_async_deflation_needed() to be checked at the same interval.
- The ServiceThread handles deflating global idle monitors and deflating the per-thread idle monitors by calling ObjectSynchronizer::deflate_idle_monitors_using_JT().

Other invocation changes by the Async Monitor Deflation project (when async deflation is enabled):
- VM_Exit::doit_prologue() will request a special cleanup to reduce the noise in 'monitorinflation' logging at VM exit time.
- Before the final safepoint in a non-System.exit() end to the VM, we will request a special cleanup to reduce the noise in 'monitorinflation' logging at VM exit time.

Gory Details

Counterpart function mapping for those that know the existing code:
- ObjectSynchronizer class:
  - deflate_idle_monitors() has deflate_idle_monitors_using_JT(), deflate_global_idle_monitors_using_JT(), deflate_per_thread_idle_monitors_using_JT(), and deflate_common_idle_monitors_using_JT().
  - deflate_monitor_list() has deflate_monitor_list_using_JT()
  - deflate_monitor() has deflate_monitor_using_JT()
- ObjectMonitor class:
  - clear() has clear_using_JT()
These functions recognize the Async Monitor Deflation protocol and adapt their operations:
- ObjectMonitor::enter()
- ObjectMonitor::EnterI()
- ObjectMonitor::ReenterI()
- ObjectSynchronizer::quick_enter()
- ObjectSynchronizer::deflate_monitor()
- Note: These changes include handling the lingering owner == DEFLATER_MARKER value.
Also these functions had to adapt and retry their operations:
- ObjectSynchronizer::FastHashCode()
- ObjectSynchronizer::current_thread_holds_lock()
- ObjectSynchronizer::query_lock_ownership()
- ObjectSynchronizer::get_lock_owner()
- ObjectSynchronizer::monitors_iterate()
- ObjectSynchronizer::inflate_helper()
- ObjectSynchronizer::inflate()
Various assertions had to be modified to pass without their real check when AsyncDeflateIdleMonitors is true; this is due to the change in semantics for the ObjectMonitor owner field.
ObjectMonitor has a new allocation_state field that supports three states: 'Free', 'New', 'Old'. Async Monitor Deflation is only applied to ObjectMonitors that have reached the 'Old' state.
- Note: Prior to CR1/v2.01/4-for-jdk13, the allocation state was transitioned from 'New' to 'Old' in deflate_monitor_via_JT(). This meant that deflate_monitor_via_JT() had to see an ObjectMonitor twice before deflating it. This policy was intended to prevent oscillation from 'New' → 'Old' and back again.
- In CR1/v2.01/4-for-jdk13, the allocation state is transitioned from 'New' -> "Old" in inflate(). This makes ObjectMonitors available for deflation earlier. So far there has been no signs of oscillation from 'New' → 'Old' and back again.
ObjectMonitor has a new ref_count field that is used as part of the async deflation protocol and to indicate that an ObjectMonitor* is in use so the ObjectMonitor should not be deflated; this is needed for operations on non-busy monitors so that ObjectMonitor values don't change while they are being queried. There is a new ObjectMonitorHandle helper to manage the ref_count.
The ObjectMonitor::owner() accessor detects DEFLATER_MARKER and returns NULL in that case to minimize the places that need to understand the new DEFLATER_MARKER value.
System.gc()/JVM_GC() causes a special monitor list cleanup request which uses the safepoint based monitor list mechanism. So even if AsyncDeflateIdleMonitors is enabled, the safepoint based mechanism is still used by this special case.
- This is necessary for those tests that do something to cause an object's monitor to be inflated, clear the only reference to the object and then expect that enough System.gc() calls will eventually cause the object to be GC'ed even when the thread never inflates another object's monitor. Yes, we have several tests like that. :-)

Child pages

Async Monitor Deflation

Summary

Background

Introduction

Key Parts of the Algorithm

1) Deflation With Interference Detection

2) Restoring the Header With Interference Detection

3) Using "owner" or "ref_count" With Interference Detection

An Example of ObjectMonitor Interference Detection

Start of the Race

Racing Threads

T-deflate Wins

T-save Wins

T-save Complication with C2

T-enter Wins By A-B-A

An Example of Object Header Interference

Start of the Race

T-deflate Wins First Race

T-save Wins First Race

Either Wins the Second Race

Hashcodes and Object Header Interference

Start of the Race

Racing Threads

T-deflate Wins

T-hash Wins

Lock-Free Monitor List Management In Theory

The Simple Case

The Not So Simple Case or Taking and Prepending on the Same List Leads to A-B-A Races

Marking to Solve the A-B-A Race

Background: ObjectMonitor Movement Between the Lists

ObjectMonitor Allocation Path

ObjectMonitor Deflation Path

ObjectMonitor Flush Path

ObjectMonitor Linkage Path

The Lists and Which Threads Touch Them

Lock-Free Monitor List Management In Reality

Prepending To A List That Also Allows Deletes

mark_next(), mark_om_ptr(), and set_next() Helper Functions

mark_next_loop() Helper Function

mark_list_head() Helper Function

Taking From The Start Of A List

Using The New Lock-Free Monitor List Functions

ObjectSynchronizer::om_alloc(Thread* self, ...)

ObjectSynchronizer::om_release(Thread* self, ObjectMonitor* m, bool from_per_thread_alloc)

ObjectSynchronizer::om_flush(Thread* self)

ObjectSynchronizer::deflate_monitor_list(ObjectMonitor* volatile * list_p, int volatile * count_p, ObjectMonitor** free_head_p, ObjectMonitor** free_tail_p)

ObjectSynchronizer::deflate_monitor_list_using_JT(ObjectMonitor* volatile * list_p, int volatile * count_p, ObjectMonitor** free_head_p, ObjectMonitor** free_tail_p, ObjectMonitor** saved_mid_p)

ObjectSynchronizer::deflate_idle_monitors(...)

ObjectSynchronizer::deflate_common_idle_monitors_using_JT(bool is_global, JavaThread* target)

Housekeeping Parts of the Algorithm

Monitor Deflation Invocation Details

Gory Details

ObjectSynchronizer::deflate_monitor_list(ObjectMonitor* volatile * list_p, int volatile * count_p, ObjectMonitor free_head_p, ObjectMonitor free_tail_p)

ObjectSynchronizer::deflate_monitor_list_using_JT(ObjectMonitor* volatile * list_p, int volatile * count_p, ObjectMonitor free_head_p, ObjectMonitor free_tail_p, ObjectMonitor** saved_mid_p)