For Sumatra targets, we wanted to experiment with an infrastructure for supporting deoptimization, that is, transferring execution from the compiled code running on the GPU back to the equivalent bytecodes being run through the interpreter on the CPU. The following describes some experiments with deoptimization using the HSAIL Backend, which are now available on the graal trunk.
There are several reasons why code running on the GPU may need to deoptimize; the one exercised in the example below is an exception (an ArrayIndexOutOfBoundsException) thrown from the kernel.
Here is a simple example where we want to produce an output array of the squared values of a sequence of integers, but the logic that computes the output array index will cause an ArrayIndexOutOfBoundsException about halfway through the range. You can build and run this using the instructions on Standalone Sumatra Stream API Offload Demo, assuming you are using the latest graal and sumatra trunks.
Try running with -Dcom.amd.sumatra.offload.immediate=false (for normal JDK stream parallel operation) and -Dcom.amd.sumatra.offload.immediate=true (so the lambda will be offloaded to the GPU). Note that in both cases the stack trace shows the same trace lines for the frames within the lambda itself. (Lines further up the stack will depend on the internal mechanism used to run the lambda across the range.)
Note that on any run some output array slots contain their original -1 value, indicating the workitem for that entry did not run. The set of workitems that did not run may be different for the GPU vs. CPU cases; in fact, it may be different for any two GPU runs or any two CPU runs. The semantics for an exception on a stream operation are that the first exception is reported and pending new workitems will not run. Since the lambda is executing in parallel across the range, the set of workitems that might not run because of the exception is implementation-dependent. As a further experiment, you could try removing the .parallel() call from the forEach invocation and see yet another output array configuration for a non-parallel run.
package simpledeopt;

import java.util.stream.IntStream;
import java.util.Arrays;

public class SimpleDeopt {

    public static void main(String[] args) {
        final int length = 20;
        int[] output = new int[length];
        Arrays.fill(output, -1);

        try {
            // offloadable since it is parallel
            // will trigger exception halfway thru the range
            IntStream.range(0, length).parallel().forEach(p -> {
                int outIndex = (p == length / 2 ? p + length : p);
                writeIntArray(output, outIndex, p * p);
            });
        } catch (Exception e) {
            e.printStackTrace();
        }

        // Print results - not offloadable since it is not parallel
        IntStream.range(0, length).forEach(p -> {
            System.out.println(p + ", " + output[p]);
        });
    }

    static void writeIntArray(int[] ary, int index, int val) {
        ary[index] = val;
    }
}
We use the graal compiler to generate the HSAIL code. The graal compiler has a mature infrastructure for supporting deoptimization while still achieving good code quality; see http://design.cs.iastate.edu/vmil/2013/papers/p04-Duboscq.pdf. Basically, the compiler keeps track of the deoptimization state at each deopt point, and from that we can tell which HSAIL registers or stack slots need to be saved, which of them contain oops, and so on.
In HSAIL, registers do not live across function calls, so the register saving must be inlined into each function. To avoid code bloat, we currently have one deopt exit point per kernel. In that deopt exit code we save the union of the registers and stack slots that are live at any of the infopoints. There is another reason to save only the minimum number of registers: the total number of HSAIL registers used affects the code quality when the HSAIL code is finalized to the device IL, even if some registers are only used on a "cold" path such as the deopt exit point.
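As a rough illustration of the idea (this is not the actual graal code; the class and field names below are hypothetical), the save layout for the single deopt exit can be thought of as the union of the per-infopoint live sets:

import java.util.BitSet;
import java.util.List;

// Hypothetical sketch: size the single deopt exit's save area as the union of
// the registers and stack slots live at any infopoint in the kernel.
class DeoptSaveLayout {
    final BitSet liveSRegs = new BitSet();      // 32-bit $s registers to save
    final BitSet liveDRegs = new BitSet();      // 64-bit $d registers to save
    final BitSet liveStackSlots = new BitSet(); // stack slots to save

    // Per-infopoint liveness as recorded by the compiler (hypothetical shape).
    static class InfopointState {
        int[] sRegs = new int[0];
        int[] dRegs = new int[0];
        int[] stackSlots = new int[0];
    }

    static DeoptSaveLayout unionOf(List<InfopointState> infopoints) {
        DeoptSaveLayout layout = new DeoptSaveLayout();
        for (InfopointState ip : infopoints) {
            for (int r : ip.sRegs)      layout.liveSRegs.set(r);
            for (int r : ip.dRegs)      layout.liveDRegs.set(r);
            for (int s : ip.stackSlots) layout.liveStackSlots.set(s);
        }
        return layout;
    }
}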
When we dispatch a kernel to the GPU, we need to finish the execution of all the workitems even if one or more of the workitems deoptimize. The kernel code can be executed across a possibly very large number of workitems, each of which can have its own state. The non-deoptimizing workitems can finish as they normally would, but we need to be able to save the state of the deoptimizing workitems. For a particular kernel, graal can tell us the maximum size of the registers and stack slots that will need to be saved for any deoptimization PC in that kernel. We don't know up front how many workitems will need to deoptimize and save their state, yet we want to avoid allocating state-saving space for the entire, possibly very large, range of workitems.
A particular HSA target has a maximum number of workitems that can be executing concurrently. So to save space, we need only allocate state-saving space for this maximum number of possible concurrent workitems. To support this, deopting workitems set a "deopt happened" flag, and before beginning execution, each workitem checks this "deopt happened" flag. Workitems that see the "deopt happened" flag set to true simply set a flag indicating they never ran and exit immediately, thus not needing to save any additional state. Currently the never_ran flag array is one byte per workitem. We are looking at ways to make this smaller, but HSA devices have a lot of freedom in how they schedule workitems, which makes it difficult to figure out the never-rans.
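A rough Java model of this protocol follows (the real code is HSAIL emitted by graal; the names here are hypothetical):

import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the per-workitem prologue and the deopt exit's flag handling.
class KernelDeoptProtocol {
    final AtomicBoolean deoptHappened = new AtomicBoolean(false);
    final byte[] neverRan;                      // one byte per workitem, as described above

    KernelDeoptProtocol(int rangeSize) {
        this.neverRan = new byte[rangeSize];
    }

    // Checked at the top of each workitem before it does any real work.
    boolean shouldRun(int workitemId) {
        if (deoptHappened.get()) {
            neverRan[workitemId] = 1;           // record that this workitem never ran
            return false;                       // exit immediately; nothing to save
        }
        return true;
    }

    // Entered from the single deopt exit point of the kernel.
    void onDeopt(int workitemId) {
        deoptHappened.set(true);                // later workitems will skip themselves
        // ... the workitem then saves its registers/stack slots as described below ...
    }
}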
Workitems that deopt atomically bump an index saying where they should store their deopt data. The deopt data identifies the deopting workitem and holds its saved HSAILFrame. An HSAILFrame, in turn, holds the state live at the deoptimization PC: the saved registers and stack slots. A plausible layout is sketched below.
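This sketch is purely illustrative; the field names and the atomic slot-claiming helper are hypothetical and based only on what the surrounding text says is saved (the deopt PC plus the live register and stack-slot values):

import java.util.concurrent.atomic.AtomicInteger;

// Illustrative guess at the save-area layout; field names are hypothetical.
class DeoptSaveArea {

    // Saved HSAIL frame: the deopt PC plus the live register/stack-slot values.
    static class HsailFrame {
        int pcOffset;          // deoptimization PC within the kernel
        int numSRegs;          // count of saved 32-bit registers
        int numDRegs;          // count of saved 64-bit registers
        int numStackSlots;     // count of saved stack slots
        long[] savedData;      // the saved values themselves
    }

    // One record per deopting workitem.
    static class DeoptRecord {
        int workitemId;        // which workitem deopted
        HsailFrame frame;      // its saved state
    }

    final AtomicInteger nextSlot = new AtomicInteger(0);
    final DeoptRecord[] slots;  // sized for the maximum number of concurrent workitems

    DeoptSaveArea(int maxConcurrentWorkitems) {
        this.slots = new DeoptRecord[maxConcurrentWorkitems];
    }

    // Deopting workitems atomically bump this index to claim a slot for their data.
    int claimSlot() {
        return nextSlot.getAndIncrement();
    }
}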
When the GPU dispatch completes, each workitem will have either finished normally, deopted, or not run at all. Still in the VM, we check if there were any deopts. If not, we know everything completed normally and we can just return back to Java. However, if there was at least one deopt, there is more work to do before we can return, as sketched below.
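Here is a hedged sketch of that host-side flow, reusing the hypothetical DeoptSaveArea above and assuming the saved frames are replayed one at a time through the interpreter; the method names are invented for illustration:

// Hypothetical host-side handling after the GPU dispatch returns.
class HostSideDeoptHandling {

    void afterDispatch(DeoptSaveArea saveArea, byte[] neverRan) {
        int numDeopts = saveArea.nextSlot.get();
        if (numDeopts == 0) {
            return;                                  // everything completed normally
        }
        for (int i = 0; i < numDeopts; i++) {
            DeoptSaveArea.DeoptRecord rec = saveArea.slots[i];
            // Rebuild interpreter state from the saved HSAIL frame and resume there.
            runThroughInterpreterFromSavedFrame(rec);
        }
        // Workitems flagged in neverRan saved no state; how they are handled
        // depends on why the deopt happened and is not shown here.
    }

    void runThroughInterpreterFromSavedFrame(DeoptSaveArea.DeoptRecord rec) {
        // stand-in for the VM-side replay of one deopting workitem
    }
}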
Currently, the normal kernel dispatch runs in "thread in VM" mode and thus does not need to worry about moving oops. However, each time a deopting workitem is run through the interpreter we are back in Java mode, which can cause GCs. So for each saved HSAIL frame, we need to know which parts of the saved state contain oops and make sure those locations are updated in the face of GC. The OopMap is calculated on the Java side and passed down at execute time. Given a deoptimization PC, one can use the OopMap to tell which 64-bit registers or stack slots are oops.
This OopMap information is used by a GC thread when, as part of its oops_do workflow, it goes through the list of JavaThreads and finds one that is currently executing an HSAIL kernel. The saved state is only used by deoptimizing workitems, and we know that we only need to fix up the state for deoptimizing workitems that have not yet been run.
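As a conceptual sketch of that fixup (the real work happens inside the VM's oops_do machinery; here the OopMap is modeled as a simple bitmask over the saved 64-bit values for a given deopt PC, and all names are hypothetical):

import java.util.BitSet;

// Hypothetical model of fixing up oops in a saved HSAIL frame during GC.
class HsailFrameOopFixer {

    // Stand-in for whatever the GC uses to report an object's possibly-new location.
    interface OopUpdater {
        long updated(long oop);
    }

    // oopMaskForPc: bit i set means savedData[i] holds an oop at this frame's deopt PC.
    static void fixOops(DeoptSaveArea.HsailFrame frame, BitSet oopMaskForPc, OopUpdater gc) {
        for (int i = oopMaskForPc.nextSetBit(0); i >= 0; i = oopMaskForPc.nextSetBit(i + 1)) {
            frame.savedData[i] = gc.updated(frame.savedData[i]);   // update moved oops
        }
    }
}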