Problem
The current implementation of ParallelGC's Parallel Full GC is too rigid for Lilliput 2's Compact Identity Hash-Code. Specifically, it does not allow objects to be resized/expanded when they move, which Compact Identity Hash-Code requires. The reason is the compressor-style algorithm used by the Parallel Full GC. This algorithm precomputes forwarding addresses of objects in bulk (e.g. in 64-word blocks) and then only requires a few additional computations to get the true forwarding address of each object. The problem with this approach is that it is too rigid: it does not allow moving objects to expand, which is necessary for Compact Identity Hash-Code. Here I want to outline a parallel mark-compact algorithm that does not have this restriction.
The basic idea is to use an algorithm similar to the one used by G1 and Shenandoah. We assume that the first phase of the full GC identifies all reachable objects by marking them in a marking bitmap, as the current implementation already does. Unlike G1 and Shenandoah, the Parallel GC does not divide the heap into regions. This is a problem, because the division into regions is the key feature that facilitates parallelism in G1 and Shenandoah. However, the current full GC implementation provides machinery to partition the heap spaces into regions and to deal with objects that overlap region boundaries.
Forwarding Phase
The forwarding phase is the second phase of the full GC and starts after marking has completed. This phase scans the heap and computes the forwarding addresses of all reachable objects.
The work of calculating the forwarding addresses is carried out by multiple GC workers, so we need a way to divide the work between the GC threads. This is achieved as follows:
The heap is divided into N equal-sized regions, and we start out with a list of regions. This list comprises the whole heap.
...
lies in the fact that the algorithm parallelizes by dividing the heap into equal-sized regions (with overlapping objects) and pre-computes destination boundaries for each region by inspecting the size of each live object in that region. However, we cannot determine the size of an object until we know whether it will move at all. This is further complicated by the fact that we cannot even assume that only a dense prefix stays in place: the expansion of moved objects can lead to situations where a subsequent object would not move.
Proposed new Algorithm
The basic idea is to not make assumptions about object sizes, and instead determine the destination location more dynamically. We can adapt the algorithm that is used by G1 and Shenandoah GC. The difficulty is that in G1 and Shenandoah, regions are a bit more rigid in that they don't allow objects that cross region boundaries. That property makes parallelization much easier because worker threads can fully own a region without potential interference from other worker threads.
More flexible region sizes
Therefore we need regions of more flexible sizes. In the (single-threaded) summary phase that follows marking and precedes compaction, we set up our list of regions by starting out with equal-sized regions and then adjusting each region's bottom upwards to the first word of the region that does not belong to an overlapping object, and adjusting its end upwards to the first word that does not belong to an overlapping object (which will also be the bottom of the subsequent region).
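The boundary adjustment above can be sketched as follows. This is an illustrative model, not HotSpot code: objects are `(start, size)` pairs in heap words, and `build_regions` is a hypothetical helper name.

```python
def build_regions(objects, heap_words, region_words):
    """Sketch: start from equal-sized region boundaries and move each
    boundary upwards past any object that overlaps it, so that no region
    begins in the middle of an object.

    objects: list of (start, size) pairs in heap words, sorted, non-overlapping.
    """
    bounds = [0]
    for b in range(region_words, heap_words, region_words):
        adj = b
        for start, size in objects:
            if start < adj < start + size:   # boundary falls inside this object
                adj = start + size           # first word past the object
                break
        adj = min(adj, heap_words)
        if adj > bounds[-1]:                 # a huge object may swallow a boundary
            bounds.append(adj)
    if bounds[-1] < heap_words:
        bounds.append(heap_words)
    # Each region's end is the next region's bottom.
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

# 100-word heap, 32-word regions; objects at [30,40) and [60,68) overlap
# the boundaries at 32 and 64, so those boundaries shift to 40 and 68.
regions = build_regions([(30, 10), (60, 8)], heap_words=100, region_words=32)
```

Note that regions stay roughly equal-sized but may grow by up to one object's length at either end.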
Forwarding Phase
With those more flexible regions set-up, we can basically 1:1 adapt G1/Shenandoah's algorithm for the forwarding and compaction phases. Forwarding works like this:
- From the global list of regions, workers atomically claim regions serially. The first claimed region becomes that worker's current source and destination region. Later, source and destination are likely to become different regions. As the names imply, source is where we compact from, and destination is where we compact to. The destination region maintains the current compact point, which initially is the destination region's bottom. The worker also maintains a list of destination regions. Claimed regions also get appended to the tail of the destination-region list; that is, they may become compaction destinations once the current compaction destination is exhausted.
- The worker then scans the current source region's live objects. Each live object gets assigned a forwarding address, which is the current compact point. The compact point is then advanced by the object's size (possibly taking into account object expansion).
- If an object does not fit into the current destination region, then we switch to the next destination region. We may leave a wasted gap at the end of a destination region, which will later be filled with a dummy object. We append the current destination region to the end of the worker's compaction list. We pop the head of the destination-region-list and make that the new destination region.
- When the phase is finished, we append the remaining destination-region list to the end of the compaction list. The resulting list contains all regions that the worker has processed, and serves as the work list for the compaction phase.
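The steps above could be sketched, for a single worker and its already-claimed regions, roughly like this. Names (`forward_regions`, `new_size`) are ours, not HotSpot's, and atomic claiming is omitted:

```python
from collections import deque

def forward_regions(regions, live_objects, new_size):
    """Sketch of one worker's forwarding pass.

    regions: list of (bottom, end) regions claimed by this worker, heap order.
    live_objects: dict mapping region -> list of (addr, size) live objects.
    new_size: callback returning the object's size after moving; this is the
    hook where Compact Identity Hash-Code expansion would be accounted for.
    """
    dest_queue = deque(regions)      # destination-region list
    dest = dest_queue.popleft()      # first claimed region is also the first dest
    compact_point = dest[0]          # compact point starts at dest's bottom
    compaction_list = []             # regions handed to the compaction phase
    forwarding = {}                  # addr -> forwarding address
    for region in regions:           # each claimed region is a source in turn
        for addr, size in live_objects.get(region, []):
            sz = new_size(addr, size)
            if compact_point + sz > dest[1]:   # object does not fit:
                compaction_list.append(dest)   # retire dest (gap gets a dummy
                dest = dest_queue.popleft()    # object later) and take the
                compact_point = dest[0]        # next destination region
            forwarding[addr] = compact_point
            compact_point += sz
    # Remaining destination regions are appended to the compaction list.
    compaction_list.extend([dest] + list(dest_queue))
    return forwarding, compaction_list

# Two 10-word regions; every moved object grows by one word, simulating
# identity-hash expansion on move.
fwd, clist = forward_regions(
    [(0, 10), (10, 20)],
    {(0, 10): [(0, 4), (5, 3)], (10, 20): [(12, 6)]},
    new_size=lambda addr, size: size + 1)
```

In the example, the third object (7 words after expansion) no longer fits in the first region, so it is forwarded to the second region's bottom and the first region keeps a one-word gap.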
Compaction Phase
Similarly, we can adapt G1/Shenandoah's compaction phase almost 1:1.
- Each worker processes its compaction list sequentially.
- It scans the live objects in each region in the compaction list.
- For every live object, it finds the forwarding address that was computed in the forwarding phase and copies the object to that address.
- It fills the gap at the end of each destination region with a filler object.
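As a minimal sketch of these steps (again illustrative, not HotSpot code): real sliding compaction works in place, relying on destinations never overtaking sources within a worker's list; copying into a fresh array here sidesteps that ordering concern.

```python
def compact(heap, live_objects, forwarding, filler="FILL"):
    """Sketch of the compaction phase: copy each live object to the
    forwarding address computed earlier; every word not covered by a
    copied object stands in for a filler/dummy object."""
    out = [filler] * len(heap)               # gaps become filler "objects"
    for objs in live_objects.values():       # scan live objects per region
        for addr, size in objs:
            to = forwarding[addr]            # looked up, not recomputed
            out[to:to + size] = heap[addr:addr + size]
    return out

# Continuing the hypothetical two-region example (no expansion this time):
heap = list(range(20))
live = {(0, 10): [(0, 4), (5, 3)], (10, 20): [(12, 6)]}
compacted = compact(heap, live, {0: 0, 5: 4, 12: 10})
```

After compaction the first region holds the two small objects back to back, the three-word gap before the second region's bottom is filler, and the large object sits at the second region's bottom.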