...
// Print results - not offloaded
IntStream.range(0, length).forEach(p -> {
System.out.println(output[p]);
});
}
}
Implementation Notes
Fastpath Allocation
The implementation has so far only been done for HSA targets and makes use of the fact that an HSA target has a coherent view of host memory. The HSAIL backend has always been able to directly access objects in the Java heap, but the coherent view of memory also covers the data structures that control the heap. For example, the HotSpot JVM in many cases uses Thread Local Allocation Buffers (TLABs) to allocate memory on the heap; usually each Java thread has its own TLAB. Among other things, a TLAB has the usual pointers for
...
Note that the donor threads themselves are suspended and are not allocating from their TLABs while the GPU is allocating.
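The fastpath itself is a simple bump-pointer allocation against the TLAB's top and end pointers. The sketch below is illustrative only (the real logic lives in HotSpot's C++ and the generated HSAIL, and the field and class names here are invented); it shows the "top overflows past end" failure case that triggers the slow path.

```java
// Hypothetical sketch of TLAB bump-pointer ("fastpath") allocation.
// Field names (top, end) mirror the usual TLAB pointers but the class
// itself is illustrative, not HotSpot code.
final class TlabSketch {
    long top; // next free address in the buffer
    long end; // one past the last usable address

    TlabSketch(long top, long end) { this.top = top; this.end = end; }

    // Returns the address of the newly allocated chunk, or -1 if the
    // allocation would overflow past end (the fastpath failure case).
    long allocate(long sizeInBytes) {
        long newTop = top + sizeInBytes;
        if (newTop > end) {
            return -1; // fastpath failure: caller must take the slow path
        }
        long result = top;
        top = newTop;
        return result;
    }

    public static void main(String[] args) {
        TlabSketch tlab = new TlabSketch(0, 64);
        System.out.println(tlab.allocate(48)); // 0
        System.out.println(tlab.allocate(32)); // -1, would overflow past end
    }
}
```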
...
Fastpath Allocation Failure, Deoptimization
If the fastpath allocation from the workitem's shared TLAB fails (top overflows past end as described above), then by default we deoptimize to the interpreter using the usual HSAIL deoptimization logic (todo: link here to hsail deoptimization experiments). While deoptimizing to the interpreter gets correct results, for performance we would prefer to stay on the GPU rather than to deoptimize. There is an additional graal hsail allocation option which can be used for performance experiments: HsailAllocBytesPerWorkitem specifies how many bytes each workitem expects to allocate. Before invoking the kernel, the JVM code will look at the donor thread TLAB free sizes and attempt to "close" a TLAB and allocate a new one if the existing free space is not large enough (taking into account the number of workitems and the number of donor threads). Behavior will be functionally correct regardless of this option; there just might be more deopts. We intend to explore other ways to reduce the probability of deopts.
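The sizing check described above can be sketched as follows. This is a hedged illustration of the arithmetic implied by HsailAllocBytesPerWorkitem, not the actual JVM code; the method name and the even split of workitems across donor threads are assumptions.

```java
// Illustrative sketch of the pre-kernel TLAB sizing check implied by
// HsailAllocBytesPerWorkitem: each donor thread's TLAB must cover its
// share of the expected allocation across all workitems.
final class TlabSizingSketch {
    // Returns true if a donor TLAB with the given free bytes is large
    // enough, i.e. no "close and reallocate" is needed before the kernel.
    static boolean largeEnough(long freeBytes, long bytesPerWorkitem,
                               int numWorkitems, int numDonorThreads) {
        // Workitems are spread across donor TLABs, so each TLAB must
        // absorb roughly numWorkitems / numDonorThreads allocations
        // (rounded up to stay conservative).
        long workitemsPerDonor =
                (numWorkitems + numDonorThreads - 1) / numDonorThreads;
        return freeBytes >= workitemsPerDonor * bytesPerWorkitem;
    }

    public static void main(String[] args) {
        // 1024 workitems, 4 donor threads, 64 bytes per workitem:
        // each donor TLAB needs at least 256 * 64 = 16384 free bytes.
        System.out.println(largeEnough(16384, 64, 1024, 4)); // true
        System.out.println(largeEnough(8192, 64, 1024, 4));  // false
    }
}
```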
Fastpath Allocation Failure, Eden Allocation
There is an additional graal option called HsailUseEdenAllocate which, if set to true, will first attempt to allocate from eden before deoptimizing. There is a single eden from which all threads (and, for us, all workitems) allocate; in fact, TLABs themselves are allocated from eden. Given that we are possibly competing with real Java threads for eden allocation, we use the hsail platform atomic instruction atomic_cas. While eden allocation was functionally correct, we saw a performance degradation compared to simply deoptimizing and so have turned it off by default. We may explore eden allocation further in the future, and we would also like to explore the strategy of allocating a whole new TLAB from the GPU when a TLAB overflows.
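The contended bump of eden's top pointer can be pictured as a compare-and-swap retry loop, analogous to the hsail atomic_cas use described above. The sketch below is an assumption-laden illustration in plain Java (using AtomicLong in place of atomic_cas; the addresses and the fixed edenEnd bound are invented, not HotSpot's actual layout).

```java
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch of eden allocation via a CAS retry loop. AtomicLong
// stands in for the hsail atomic_cas instruction; sizes and bounds
// are illustrative only.
final class EdenCasSketch {
    static final AtomicLong edenTop = new AtomicLong(0);
    static final long edenEnd = 1 << 20; // 1 MiB of pretend eden space

    // Returns the allocated address, or -1 if eden is exhausted. Many
    // threads (or workitems) may race here, hence the CAS retry loop.
    static long allocate(long sizeInBytes) {
        while (true) {
            long oldTop = edenTop.get();
            long newTop = oldTop + sizeInBytes;
            if (newTop > edenEnd) {
                return -1; // no room: fall back (e.g. deoptimize)
            }
            if (edenTop.compareAndSet(oldTop, newTop)) {
                return oldTop; // we won the race for [oldTop, newTop)
            }
            // CAS failed: another thread moved top; retry with fresh value.
        }
    }

    public static void main(String[] args) {
        System.out.println(allocate(128)); // 0
        System.out.println(allocate(64));  // 128
    }
}
```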