C2 Register Allocator Notes

There are some very rough notes I wrote up about the structure of the C2 register allocator.

An Overview of C2's Register Allocator

Chaitin Briggs style graph coloring allocator with optimistic coloring.

Stages:

de_ssa:

Convert from an SSA style to a Named style. The comments before
the call to de_ssa says that the output and inputs of a PhiNode are
all given the same name but that's not what the implementation does.
It simply gives every node with a non-empty out_RegMask a name which
is a number from an incrementing counter. The 0 live range is given
to nodes which don't produce values. The _names array contains the
mapping from idx to name. n2lidx is used to map from Node* to the
name. It appears the lidx is just name.

The IFG data structures are allocated at this point.

gather_lrg_masks(false):
for each instruction, collect information about the live ranges.
_has_copy
_is_float
_was_spilled1, was_spilled2
AND out_RegMask into LRG
_is_bound
_max_freq
set_num_regs
set_reg_pressure
_is_oop
Then for each input do similar data collection

Then for each LRG some computation of other fields is done and some
LRG are marked as _direct_conflicts as the point because their
register mask has gone empty.

There's a bit of duplication in what's done for the def and for it's
uses that could probably be profitably shared.

There's a weird special case in here that doesn't AND in the
register mask if there's a large frequency difference between the
between the def block and the use block. This is only done before
aggresive coalescing.

liveness:
Next we compute liveness which consists of the liveout set for each
block represented as an IndexSet.

stretch_base_pointer_live_ranges:
This code adds extra uses of the base of any derived pointer to any
safepoint the derived pointer is live across. This is needed for
describing derived pointers to GC. This may cause values to be live
longer than they used to be which requires recomputing the liveness
information.

if stretch_base_pointer_live_ranges invalidated the liveness then
repeat gather_lrg_masks(false),liveness.

build_ifg_virtual:
// The IFG is built by a single reverse pass over each basic block.
// Starting with the known live-out set, we remove things that get
// defined and add things that become live (essentially executing one
// pass of a standard LIVE analysis). Just before a Node defines a value
// (and removes it from the live-ness set) that value is certainly live.
// The defined value interferes with everything currently live. The
// value is then removed from the live-ness set and it's inputs are
// added to the live-ness set.

There's some special handling of two_adr in here to make the def
live simultaneously with the use. This creates a virtual copy that
will disappear is this is actually the last use.

There's a total hack in the middle of all this that attempts to
commute some addIs so that the two_adr input in the right place.
This can eliminate some unneeded copies. There should be a more
general optimization of this kind to arrange last uses for
commutative operators so that they go dead as the two_adr input.

Aggressive Coalescing:
The IFG is squared so that if a is b's neighbor then b is a's
neighbor.

Aggressive Coalescing is coalescing of live ranges without regard to
preserving colorability. It can cause spilling that wouldn't have
happened before. This implementation is pessimistic too, meaning
that we don't hit a fixed point in one pass. Multiple passes would
be required before all coalescing opportunities would be found.

coalesce_driver calls coalesce on each block. It has a comment that
says "Coalesce from high frequency to low" which uses the _blks
array of PhaseChaitin which is a sorted list of blocks computed in
the PhaseChaitin constructor. The sorting code for this looks silly
and should probably be revisited. The frequencies are probably
scaled wrong too since we fixed the frequencies in our blocks.

PhaseAggressiveCoalesce::coalesce simply combines the every phi
which each of its inputs. The code for this does a fair amount of
searching and could probably be more efficient. It takes each block
and looks at the successor for phis, instead of simply looking at
the head of each block for phis and merging all the inputs.
Additionally for two_adr nodes it combines the two_adr input with
the node itself.

insert_copies then rewrites the liveness information to reflect that
coalescing of live ranges that occurred. Then for each instruction,
every use is updated to eliminate any copies and dead copies are
removed. Any phis which couldn't get completely coalesced have
copies inserted and the edges are updated to reflect this.
Something similar is done to support two_adr nodes. The new copy
may be a rematerialized copy of the value or a real copy. Low
frequency uses of high frequency LRGs also gets copies here if they
are uses for debug info.

ifg.init/gather_lrg_mask(true)/live.compute are repeated again.
noticed that true is passed to indicate that aggressive coalescing
is completed.

build physical interference graph
This is the real IFG computation taking into account actual register
usage. The liveout info is cloned for use in the phase so its
available for oopmap computation. This also computes ihrp_index and
fhrp_index to track regions of high register pressure. LRGs which
go empty are marked as requiring spilling. Dead nodes are also
removed during this pass.

There are some ifdefs for EXACT_PRESSURE which could eliminated
since I think this is the mode we want.

The cost metric in here is the block frequency times the number of
non-phi instructions. The means that things like temps and kills
count in the area which seems silly. It's decremented by the
frequency after each node in a block is processed. The cost is
accumlated onto LRG._area.

spill copies for lrgs which are single def and have a single use
directly following them are given a high score to ensure they get
simplified first. This appears to be intended to avoid spill split
recycle problems.

USE cost normally is 2 times the block frequency. Rematerializable
values are simply the block frequency and debug uses are 0. DEF
cost is the block frequency or 0 for rematerializable values.

This functions returns the number of spills required by LRGs that
went empty.

If any spilling is required at this point:
Split is called which inserts copies for any LRGs which are marked
as _must_spill. A more detailed description of its operation will
be written later.

compact() is called which removed unused LRG names and renames the
live ones to compact the number of LRGs names in use.

ifg.init/gather_lrg_mask(true)/live.compute is done again.

build_ifg_physical is called again and the return value is ignored.

The IFG is squared up and Compute_Effective_Degree is called. The
effective degree of each LRG is computed by looking at the registers
used by each neighbor. The degree of an LRG is a conservative
estimate so low degree LRGs are guaranteed to color but high degree
LRGs most likely won't but in some cases could depending on what
color theirs neighbors get. There's a comment that for fat_projs we
have to multiply the required number of registers together which
refers to Brigg's thesis for the reasoning.

A round of conservative coalescing is done at this point.
Conservative coaslescing is not allowed make a colorable graph
uncolorable. It processes each copy in every block seeing if the
src and dst live ranges can be safely merged. More details about
this are needed.

compress_uf_map_for_nodes is called which updates the _names array
with the result of any merging of LRGs. It seems like this should
be part of coalescing.

If no spilling was required then the ifg is squared up and the
effective degree of the LRG in the IFG is computed.

cache_lrg_info is called which looks at each LRG and sorts it into 3
lists, _lo_degree which are LRGs which are guaranteed to color and
require a register, _lo_stk_degree are ranges which can be register or
stack and are guaranteed to color and _hi_degree which are LRGs of
high degree and are unlikely to color.

Simplify and Select are called again and if we need spills we enter
the iterative part of registers allocation consisting of repeating
this until no more spills are required:
Split
compact
ifg.init/gather_lrg_masks(true)/live.compute
SquareUp/Compute_Effective_Degree
PhaseConservativeCoalesce
compress_uf_map_for_nodes
cache_lrg_info
Simplify
Select

post_allocate_copy_removal: Eliminate copies left around by the
allocator when another live version of the value can be found in a
live register or on the stack. This is done by maintaining a table
representing the current state of the stack and registers with the
value that was defined into that register. Each input of a node is
processed looking for copies and if the needed value is already in
the right register or is available in a register that is compatiable
with the in_RegMask of the use, the input is switched to the
existing value. This may cause the original input to go dead which
allows it to be deleted. Then the node is placed into the table to
represent it def'ing its value. Nodes that produce constants are
treated as copies and can be eliminated.

Note that for best result you want to process the blocks in RPO
order so that the results of predecessor blocks are available.
Obviously this doesn't work that well for loops because of that but
maybe an iterative version could be made to work. Additionally
eliminating copies may sometimes make other copies eliminatable so
processing a block more than once may be a good idea. Scheduling copies
higher in the block may also expose more opportunities since copy
elimination is very senstive to register selection.

The framesize is computed.

fixup_spills: Conversion of cisc spill instructions is performed by
walking all the instructions and replacing them with their cisc
variants if it's needed.

alloc_node_regs: Copy the information about register selection from
the LRG into a OptoRegPair table indexed by idx.