...
- When compile-time assumptions are violated, we can "trap" to the interpreter (this relies on the fact that the interpreter can handle anything). In addition, this gives us a way of handling certain hopefully rare events, such as throwing exceptions back to the CPU, which might be difficult to implement completely on the GPU. If statistics show that such events are not actually "rare", we can
- decide that in the future this particular lambda is not a good candidate for offload,
- or in some cases recompile and generate new code for the GPU.
- Compiled code running on the GPU might reach a point where it needs the CPU to do something before the GPU can make further progress. For example, if we are supporting allocation on the GPU, we could reach a point where we cannot allocate any new object until a GC happens. If the target does not have an easy way to spin and wait for the CPU to do the GC, one way to support this is to deoptimize. The interpreter will let the GC happen and then continue executing bytecodes from the point of the deoptimization, including finishing the allocation.
- It allows an implementation of compiler safepoints. A long-running kernel can, where directed by the compiler (for example at the bottom of loops), check an external flag and deoptimize if the flag is set. This external flag can be set by the VM safepoint logic. And since the deoptimized code continues in the interpreter, which can pause at safepoints, this mechanism allows a long-running kernel to be interrupted so that CPU threads do not have to wait as long.
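The safepoint-poll pattern described above can be sketched in plain Java. This is an illustrative assumption, not the actual HSAIL or Graal code: the flag name, the `kernel` method, and the `deoptimize` stub are all hypothetical stand-ins for the VM-managed safepoint flag and the real deoptimization machinery.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SafepointPollSketch {
    // Stands in for the VM's safepoint flag, set by the safepoint logic on the CPU.
    static final AtomicBoolean safepointRequested = new AtomicBoolean(false);

    static void kernel(int[] data) {
        for (int i = 0; i < data.length; i++) {
            data[i] = data[i] * data[i];
            // Compiler-inserted poll at the bottom of the loop:
            if (safepointRequested.get()) {
                deoptimize(i); // transfer control to the interpreter
                return;
            }
        }
    }

    static void deoptimize(int resumePoint) {
        // In the real system this would reconstruct interpreter state and
        // continue executing bytecodes from the point of the deoptimization.
    }
}
```

The key design point is that the poll is cheap (a single flag load per loop iteration) while the expensive deoptimization path is taken only when the VM actually requests a safepoint.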
A Small Example
Here is a simple example where we want to produce an output array of the squared values of a sequence of integers. But the logic that computes the output array index will cause an ArrayIndexOutOfBoundsException about halfway through the range. You can build and run this using the instructions on Standalone Sumatra Stream API Offload Demo, assuming you are using the latest graal and sumatra trunks.
Try running with -Dcom.amd.sumatra.offload.immediate=false (for normal JDK stream parallel operation) and -Dcom.amd.sumatra.offload.immediate=true (so the lambda will be offloaded to the GPU). Note that in both cases the stack trace shows the same trace lines for the frames within the lambda itself. (Frames further up the stack will depend on the internal mechanism used to run the lambda across the range.)
Note that on any run some output array slots contain their original -1 value, indicating the workitem for that entry did not run. The set of workitems that did not run may differ between the GPU and CPU cases; in fact it may differ between any two GPU runs, or any two CPU runs. The semantics for an exception on a stream operation are that the first exception is reported and pending new workitems will not run. Since the lambda is executing in parallel across the range, the set of workitems that might not run because of the exception is implementation-dependent. As a further experiment, you could try removing the .parallel() call from the forEach invocation and see yet another output array configuration for a non-parallel run.
package simpledeopt;

import java.util.stream.IntStream;
import java.util.Arrays;

public class SimpleDeopt {

    public static void main(String[] args) {
        final int length = 20;
        int[] output = new int[length];
        Arrays.fill(output, -1);
        try {
            // offloadable since it is parallel
            // will trigger exception halfway through the range
            IntStream.range(0, length).parallel().forEach(p -> {
                int outIndex = (p == length / 2 ? p + length : p);
                writeIntArray(output, outIndex, p * p);
            });
        } catch (Exception e) {
            e.printStackTrace();
        }

        // Print results - not offloadable since it is not parallel
        IntStream.range(0, length).forEach(p -> {
            System.out.println(p + ", " + output[p]);
        });
    }

    static void writeIntArray(int[] ary, int index, int val) {
        ary[index] = val;
    }
}
Implementation Notes
Compile Time
We use the graal compiler to generate the hsail code. The graal compiler has a mature infrastructure for supporting deoptimization while still achieving good code quality; see http://design.cs.iastate.edu/vmil/2013/papers/p04-Duboscq.pdf. The compiler keeps track of the deoptimization state at each deopt point, and from that we can tell which HSAIL registers need to be saved, which registers contain oops, and so on.
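To make the per-deopt-point bookkeeping concrete, here is a minimal sketch of the kind of information the compiler records: for each deopt point, where to resume in the interpreter and which registers must be saved, including which of those hold oops. All class and method names here are illustrative assumptions, not Graal's actual API.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class DeoptInfoSketch {
    // Hypothetical record of the state needed at one deopt point.
    static final class DeoptPointInfo {
        final int bci;         // bytecode index to resume at in the interpreter
        final BitSet liveRegs; // registers whose values must be saved
        final BitSet oopRegs;  // subset of liveRegs holding object references
        DeoptPointInfo(int bci, BitSet liveRegs, BitSet oopRegs) {
            this.bci = bci;
            this.liveRegs = liveRegs;
            this.oopRegs = oopRegs;
        }
    }

    final Map<Integer, DeoptPointInfo> table = new HashMap<>();

    // Record the deopt state for one deopt point id.
    void record(int deoptId, int bci, int[] liveRegisters, int[] oopRegisters) {
        BitSet live = new BitSet(), oops = new BitSet();
        for (int r : liveRegisters) live.set(r);
        for (int r : oopRegisters) oops.set(r);
        table.put(deoptId, new DeoptPointInfo(bci, live, oops));
    }
}
```

Distinguishing oop registers from plain live registers matters because the GC must be able to find and update those references if a deoptimization coincides with a collection.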
...