For Sumatra targets, we wanted to experiment with an infrastructure for supporting deoptimization, that is, transferring execution from the compiled code running on the GPU back to the equivalent bytecodes running through the interpreter on the CPU. The following describes some experiments with deoptimization using the HSAIL Backend, which are now available on the graal trunk.

Some reasons for deoptimization on the GPU are exceptions thrown while running the kernel (such as the ArrayIndexOutOfBoundsException in the example below), and more generally any case where the compiled kernel code cannot handle a situation itself and must fall back to the interpreter on the CPU.

A Small Example

Here is a simple example that produces an output array of the squared values of a sequence of integers. However, the logic that computes the output array index will cause an ArrayIndexOutOfBoundsException about halfway through the range. You can build and run this using the instructions on Standalone Sumatra Stream API Offload Demo, assuming you are using the latest graal and sumatra trunks.

Try running with -Dcom.amd.sumatra.offload.immediate=false (for normal JDK stream parallel operation) and -Dcom.amd.sumatra.offload.immediate=true (so the lambda will be offloaded to the GPU). Note that in both cases the stack trace shows the same trace lines for the frames within the lambda itself. (Lines further up the stack will depend on the internal mechanism used to run the lambda across the range.)
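
For example, assuming the SimpleDeopt class below and the sumatra runtime are already on the class path (the exact launcher setup comes from the demo instructions, so treat these command lines as a sketch), the two runs would look something like:

java -Dcom.amd.sumatra.offload.immediate=false simpledeopt.SimpleDeopt
java -Dcom.amd.sumatra.offload.immediate=true simpledeopt.SimpleDeopt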

Note that on any run some output array slots contain their original -1 value, indicating the workitem for that entry did not run.  The set of workitems that did not run may be different for the GPU vs. CPU cases; in fact, it may be different for any two GPU runs or any two CPU runs.  The semantics for an exception on a stream operation are that the first exception is reported and pending new workitems will not run.  Since the lambda is executing in parallel across the range, the set of workitems that might not run because of the exception is implementation-dependent.  As a further experiment, you could try removing the .parallel() call from the stream pipeline and see yet another output array configuration for a non-parallel run.

package simpledeopt;

import java.util.stream.IntStream;
import java.util.Arrays;

public class SimpleDeopt {
    public static void main(String[] args) {
        final int length = 20;
        int[] output = new int[length];
        Arrays.fill(output, -1);
        try {
            // offloadable since it is parallel
            // will trigger exception halfway thru the range
            IntStream.range(0, length).parallel().forEach(p -> {
                int outIndex = (p == length / 2 ? p + length : p);
                writeIntArray(output, outIndex, p * p);
            });
        } catch (Exception e) {
            e.printStackTrace();
        }

        // Print results - not offloadable since it is not parallel
        IntStream.range(0, length).forEach(p -> {
            System.out.println(p + ", " + output[p]);
        });
    }

    static void writeIntArray(int[] ary, int index, int val) {
        ary[index] = val;
    }
}
 

Implementation Notes

Compile Time

We use the graal compiler to generate the HSAIL code.  The graal compiler has a mature infrastructure for supporting deoptimization while still achieving good code quality; see http://design.cs.iastate.edu/vmil/2013/papers/p04-Duboscq.pdf.  The compiler keeps track of the deoptimization state at each deopt point, and from that we can tell which HSAIL registers need to be saved, which registers contain oops, and so on.
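
As a very rough illustration (these are not the actual graal classes, just a sketch of the idea), the information recorded for each deopt point can be thought of as:

import java.util.BitSet;

// Hypothetical sketch only: for each infopoint (potential deopt site) in the
// kernel, the compiler-side deoptimization state lets us derive which HSAIL
// registers are live there and which of those registers hold oops.
class DeoptPointInfo {
    int infopointId;      // identifies the deopt site within the kernel code
    BitSet liveRegisters; // HSAIL registers whose values must be saved at this point
    BitSet oopRegisters;  // subset of liveRegisters that contain object references
}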

Execution Time

When we dispatch a kernel to the GPU, we need to finish the execution of all the workitems even if one or more of the workitems deoptimize.  The kernel code can be executed across a possibly very large number of workitems, each of which can have its own state.  The non-deoptimizing workitems can finish as they normally would, but we need to be able to save the state of the deoptimizing workitems.  We do not know in advance how many workitems will need to deoptimize and save their state, yet we want to avoid allocating state-saving space for the entire, possibly very large, range of workitems.

A particular HSA target has a maximum number of workitems that can be executing concurrently, so to save space we need only allocate state-saving space for this maximum number of possibly concurrent workitems.  We do this by having deopting workitems set a "deopt happened" flag, which each workitem checks on entry.  Workitems that see the "deopt happened" flag set just record a simple flag indicating they never ran and exit immediately, so they do not need to save any additional state.  Currently the never_ran flag array is one byte per workitem.  We are looking at ways to make this smaller, but HSA devices have a lot of freedom in how they schedule workitems, which makes it difficult to determine the never-rans.
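
In Java-like pseudocode (the real checks are emitted directly in the generated HSAIL; deopting is modeled here as catching an exception purely for illustration, and the names deoptHappened, neverRanFlags, nextDeoptSlot, kernelBody and saveDeoptState are illustrative), each workitem behaves roughly as follows:

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the per-workitem control flow around the compiled lambda body.
class WorkitemSketch {
    volatile boolean deoptHappened;     // set by the first deopting workitem
    byte[] neverRanFlags;               // one byte per workitem in the range
    AtomicInteger nextDeoptSlot = new AtomicInteger();

    void runWorkitem(int workitemId) {
        if (deoptHappened) {
            // A deopt has already happened: just note that this workitem never
            // ran and exit immediately without saving any additional state.
            neverRanFlags[workitemId] = 1;
            return;
        }
        try {
            kernelBody(workitemId);     // the compiled lambda body
        } catch (RuntimeException e) {
            // This workitem deopts: publish the flag for future workitems and
            // atomically claim one of the bounded number of save slots.
            deoptHappened = true;
            int slot = nextDeoptSlot.getAndIncrement();
            saveDeoptState(slot, workitemId);
        }
    }

    void kernelBody(int workitemId) { /* compiled lambda */ }
    void saveDeoptState(int slot, int workitemId) { /* save pc and live registers */ }
}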

Workitems that deopt atomically bump an index saying where they should store their deopt data.  The deopt data consists of the id of the deopting workitem and an HSAILFrame capturing that workitem's state at the deopt point.

An HSAILFrame consists of the pc (infopoint id) of the deopt point within the kernel and the saved values of the HSAIL registers that were live at that point; combined with the compile-time deoptimization state for that pc, this tells us which of the saved values contain oops.

To avoid code bloat, we currently have one deopt exit point per kernel, and in that deopt exit code we save the union of the actual registers that are live at any of the infopoints.
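
Again as a sketch only (illustrative names, not the actual HSAIL backend structures; the real save area is a native buffer shared with the GPU), the save area for one dispatch might be pictured as:

// Hypothetical sketch of the per-dispatch deopt save area, sized for the maximum
// number of concurrently executing workitems rather than for the whole range.
class HsailFrameSketch {
    int pcOffset;            // infopoint id of the deopt point within the kernel
    long[] savedRegisters;   // union of the registers live at any infopoint
}

class DeoptSaveAreaSketch {
    boolean deoptHappened;       // the "deopt happened" flag checked by each workitem
    int numDeopts;               // atomically bumped index into workitemIds/frames
    int[] workitemIds;           // which workitems deopted
    HsailFrameSketch[] frames;   // one save slot per possibly concurrent workitem
    byte[] neverRanFlags;        // one byte per workitem in the range
}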

When the GPU dispatch completes, each workitem will have either finished normally, deopted, or not run at all.  Still in the VM, we check whether there were any deopts.  If not, we know everything completed normally and we can just return back to java.  However, if there was at least one deopt, then for each workitem that deopted we use its saved HSAIL frame to resume execution of that workitem in the interpreter from the deopt point, and each workitem marked as never-ran is simply run from the beginning back in Java.
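
In Java-like pseudocode, using the DeoptSaveAreaSketch above and with resumeInInterpreter and runWorkitemOnCpu standing in for the real VM mechanisms, the post-dispatch handling is roughly:

// Hypothetical sketch of the host-side handling once the GPU dispatch returns.
class PostDispatchSketch {
    void afterDispatch(DeoptSaveAreaSketch save) {
        if (save.numDeopts == 0) {
            return;                      // everything completed normally; just return to java
        }
        for (int i = 0; i < save.numDeopts; i++) {
            // Rebuild interpreter state from the saved HSAIL frame and resume this
            // workitem in the interpreter from its deopt point.
            resumeInInterpreter(save.workitemIds[i], save.frames[i]);
        }
        for (int id = 0; id < save.neverRanFlags.length; id++) {
            if (save.neverRanFlags[id] != 0) {
                runWorkitemOnCpu(id);    // run never-ran workitems from the beginning
            }
        }
    }

    void resumeInInterpreter(int workitemId, HsailFrameSketch frame) { /* VM internal */ }
    void runWorkitemOnCpu(int workitemId) { /* ordinary java execution of the lambda */ }
}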

GC Considerations

Currently, the normal kernel dispatch runs in "thread in VM" mode and thus does not need to worry about moving oops.  However, each time a deopting workitem is run through the interpreter, or each time a never-ran workitem is run, we are back in Java mode, which can cause GCs.  So for each saved HSAIL frame, we need to know which parts of the saved state contain oops and make sure those locations are updated in the face of GC.  The current hack strategy of copying the oops into an Object array supplied by the java side will be replaced later with an oops_do type of strategy.
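
A rough sketch of the current hack (illustrative names again; holdsOop and rawToOop stand in for the VM's knowledge of which slots are oops and its conversion between raw register bits and references) looks like:

// Hypothetical sketch: oops held in the saved HSAIL frames are mirrored into an
// Object[] supplied from the java side, so a GC that moves objects still sees
// (and can update) a root for each of them.
class OopSaveSketch {
    void collectOops(DeoptSaveAreaSketch save, Object[] oopHolder) {
        int next = 0;
        for (int i = 0; i < save.numDeopts; i++) {
            HsailFrameSketch frame = save.frames[i];
            for (int r = 0; r < frame.savedRegisters.length; r++) {
                // Whether a saved slot holds an oop is known from the compile-time
                // deopt state for the frame's pc (see the Compile Time notes above).
                if (holdsOop(frame.pcOffset, r)) {
                    oopHolder[next++] = rawToOop(frame.savedRegisters[r]);
                }
            }
        }
    }

    boolean holdsOop(int pcOffset, int register) { return false; /* from compile-time deopt state */ }
    Object rawToOop(long rawBits) { return null; /* VM internal in the real code */ }
}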