For Sumatra targets, we wanted to experiment with an infrastructure for supporting deoptimization, that is, transferring execution from the compiled code running on the GPU back to the equivalent bytecodes being run through the interpreter on the CPU. The following describes some experiments with deoptimization using the HSAIL Backend which are now available on the Graal trunk.

Some reasons for deoptimization on the GPU are:

Implementation Notes

Compile Time

We use the Graal compiler to generate the HSAIL code. Graal has a mature infrastructure for supporting deoptimization while still achieving good code quality; see http://design.cs.iastate.edu/vmil/2013/papers/p04-Duboscq.pdf. In short, the compiler keeps track of the deoptimization state at each deopt point (infopoint), and from that state we can tell which HSAIL registers need to be saved, which of those registers contain oops, and so on.
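As a rough illustration of the kind of per-infopoint metadata this implies, the sketch below shows one possible record of what must be known at each deopt point. The class and field names are hypothetical and are not the actual Graal data structures.

```java
// Illustrative only: a simplified record of what the compiler must know at
// each deopt point (infopoint).  Hypothetical names, not real Graal classes.
final class DeoptPointInfo {
    final int deoptPointId;          // identifies which infopoint was hit
    final int codeOffset;            // HSAIL code offset of the infopoint
    final long liveRegisterMask;     // which HSAIL registers must be saved here
    final long oopRegisterMask;      // subset of the saved registers that hold oops

    DeoptPointInfo(int deoptPointId, int codeOffset,
                   long liveRegisterMask, long oopRegisterMask) {
        this.deoptPointId = deoptPointId;
        this.codeOffset = codeOffset;
        this.liveRegisterMask = liveRegisterMask;
        this.oopRegisterMask = oopRegisterMask;
    }
}
```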

Execution Time

When we dispatch a kernel to the GPU, we need to finish the execution of all the workitems even if one or more of them deoptimize. The kernel can execute across a possibly very large number of workitems, each of which has its own state. The non-deoptimizing workitems can finish as they normally would, but we need to be able to save the state of the deoptimizing ones. We do not know in advance how many workitems will deoptimize and need their state saved, yet we want to avoid allocating state-saving space for the entire, possibly very large, range of workitems.

A particular HSA target has a maximum number of workitems that can be executing concurrently, so to save space we need only allocate state-saving space for that maximum number of concurrent workitems. We do this by having deopting workitems set a "deopt happened" flag, and having every workitem check that flag when it starts. Workitems that see the "deopt happened" flag set simply record in a never-ran flag that they never executed and exit immediately, so they need no additional state-saving space. Currently the never-ran flag array uses one byte per workitem. We are looking at ways to make this smaller, but HSA devices have a lot of freedom in how they schedule workitems, which makes it difficult to figure out the never-rans in any other way.
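In Java-like pseudocode, the check performed at the top of each workitem looks roughly like the sketch below. This is illustrative only; the real check is emitted directly as HSAIL, and the two buffers are shared between the kernel and the host VM.

```java
// Illustrative sketch of the generated kernel prologue check.
final class KernelPrologueSketch {
    volatile int deoptHappened;   // set to 1 by the first workitem that deopts
    final byte[] neverRanFlags;   // one byte per workitem in the dispatch

    KernelPrologueSketch(int workitemCount) {
        this.neverRanFlags = new byte[workitemCount];
    }

    /** Returns true if this workitem should exit immediately without running. */
    boolean checkDeoptHappened(int workitemId) {
        if (deoptHappened != 0) {
            // A deopt has already occurred in this dispatch: mark this workitem
            // as never-ran and exit without saving any register state.
            neverRanFlags[workitemId] = 1;
            return true;
        }
        return false;   // proceed into the compiled kernel body
    }
}
```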

Workitems that deopt atomically bump an index that tells them where to store their deopt data. The deopt data consists of:

An HSAILFrame consists of:

To avoid code bloat, we currently have a single deopt exit point per kernel, and in that deopt exit code we save the union of the registers that are live at any of the kernel's infopoints.
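Putting the last two points together, the sketch below shows in Java what the generated deopt exit roughly does: claim a save slot by atomically bumping an index, then store the union of live registers into that slot. This is illustrative only; the real code is HSAIL generated by Graal, operating on a save area allocated by the VM and sized for the maximum number of concurrent workitems.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of how a deopting workitem claims a save slot.  The save
// area holds at most maxConcurrentWorkitems frames, since only that many
// workitems can be executing (and therefore deopting) at the same time.
final class DeoptSaveAreaSketch {
    private final AtomicInteger nextDeoptSlot = new AtomicInteger(0);
    private final long[][] savedFrames;   // hypothetical per-slot register save area
    volatile int deoptHappened;

    DeoptSaveAreaSketch(int maxConcurrentWorkitems, int unionOfLiveRegisters) {
        this.savedFrames = new long[maxConcurrentWorkitems][unionOfLiveRegisters];
    }

    /** Called from the kernel's single deopt exit point. */
    void deoptExit(int workitemId, int deoptPointId, long[] liveRegisterValues) {
        deoptHappened = 1;                           // later workitems become never-rans
        int slot = nextDeoptSlot.getAndIncrement();  // atomically claim a save slot
        long[] frame = savedFrames[slot];
        // Save the union of registers live at any infopoint; the workitem id and
        // deopt point id would also be recorded so the frame can be rebuilt later.
        System.arraycopy(liveRegisterValues, 0, frame, 0, liveRegisterValues.length);
    }
}
```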

When the GPU dispatch completes, each workitem will have either finished normally, deopted, or not run at all. While still in the VM, we check whether there were any deopts. If not, we know everything completed normally and we can simply return to Java. However, if there was at least one deopt, then:
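A rough host-side sketch of this completion check is shown below. The handler methods are placeholders for the VM work of replaying deopted workitems through the interpreter and running the never-ran workitems; they are not real Sumatra/Graal entry points.

```java
// Illustrative host-side flow after a kernel dispatch completes.
final class PostDispatchSketch {
    int deoptCount;          // number of save slots claimed by deopting workitems
    byte[] neverRanFlags;    // one byte per workitem

    void afterDispatch() {
        if (deoptCount == 0) {
            return;          // everything completed normally; just return to Java
        }
        replayDeoptedWorkitems();   // placeholder: run each deopted workitem
                                    // through the interpreter from its saved frame
        runNeverRanWorkitems();     // placeholder: run each never-ran workitem
    }

    private void replayDeoptedWorkitems() { /* placeholder */ }
    private void runNeverRanWorkitems()   { /* placeholder */ }
}
```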

GC Considerations

Currently, the normal kernel dispatch runs in "thread in VM" mode and thus does not need to worry about oops being moved. However, each time a deopting workitem is run through the interpreter, or each time a never-ran workitem is run, we are back in Java mode, which can cause GCs. So for each saved HSAIL frame, we need to know which parts of the saved state contain oops and make sure those locations are updated in the face of a GC. The current strategy of copying the oops into an Object array supplied by the Java side will be replaced later with an oops_do style of strategy.
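The interim approach can be pictured roughly as follows. The class and method names are illustrative; the point is only that references held in an ordinary Object array are visible to the GC and are updated automatically if the objects move.

```java
// Simplified illustration of the current interim strategy: oops from the saved
// HSAIL frames are copied into an Object[] supplied by the Java side, so the GC
// treats them as ordinary references and keeps them up to date across moves.
final class SavedOopsSketch {
    private final Object[] savedOops;   // allocated by the Java caller; scanned by GC

    SavedOopsSketch(int maxOopSlots) {
        this.savedOops = new Object[maxOopSlots];
    }

    // Called while still in the VM, before any Java code (which might GC) runs:
    // copy each oop-holding saved value into the Object array.
    void recordOop(int slot, Object oop) {
        savedOops[slot] = oop;
    }

    // When a deopted workitem is later replayed, the possibly relocated
    // reference is read back from the array rather than from the raw frame.
    Object currentOop(int slot) {
        return savedOops[slot];
    }
}
```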