...
Implementation Notes
FastPath Allocation
- The implementation has so far only been done for HSA targets and makes use of the fact that the HSA target has a coherent view of host memory. The HSAIL backend has always been able to directly access objects in the Java heap, but the coherent view of memory also covers the data structures that control the heap. For example, the HotSpot JVM in many cases uses Thread Local Allocation Buffers (TLABs) to allocate memory on the heap. Usually each Java thread has its own TLAB. Among other things, a TLAB has the usual pointers:
HeapWord* _start; // start of TLAB
HeapWord* _top; // address after last allocation, bumped with each allocation
HeapWord* _end; // end of TLAB
...
Since multiple workitems can allocate from a single TLAB, the HSAIL instruction atomic_add is used to atomically fetch and add to the tlab.top pointer. If tlab.top overflows past tlab.end, the first overflower (detectable because its oldTop is still less than end) saves that oldTop as the "last good top". Meanwhile, other workitems that try to allocate from this TLAB will also overflow and fail. When the kernel dispatch finishes, this "last good top" is restored in the JVM code so the TLAB invariants are met.
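To make the mechanism concrete, here is a minimal C++ sketch of the per-workitem fastpath. The names (Tlab, workitemAlloc, lastGoodTop) are illustrative only, and std::atomic stands in for the HSAIL atomic_add instruction; this is not the actual implementation.

#include <atomic>
#include <cstddef>

struct Tlab {
    std::atomic<char*> top;         // bumped atomically by each allocating workitem
    char*              end;         // end of the usable TLAB space
    std::atomic<char*> lastGoodTop; // saved by the first overflower
};

// Returns the allocated address, or nullptr if this workitem overflowed.
char* workitemAlloc(Tlab* tlab, std::size_t sizeInBytes) {
    // fetch_add returns the previous top, mirroring HSAIL's atomic_add.
    char* oldTop = tlab->top.fetch_add((std::ptrdiff_t) sizeInBytes);
    char* newTop = oldTop + sizeInBytes;
    if (newTop <= tlab->end) {
        return oldTop;              // fastpath success
    }
    if (oldTop < tlab->end) {
        // First overflower: its oldTop is still below end, so it records the
        // "last good top" for the JVM to restore after the kernel dispatch.
        tlab->lastGoodTop.store(oldTop);
    }
    return nullptr;                 // caller falls back to the slow path
}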
...
Note that the donor threads themselves are suspended and are not allocating from their TLABs while the GPU is allocating.
Fastpath Allocation Failure
...
If the fastpath allocation from the workitem's shared TLAB fails (top overflows past end as described above), then by default we deoptimize to the interpreter using the usual HSAIL deoptimization logic. While deoptimizing to the interpreter gets correct results, for performance we would prefer to stay on the GPU rather than deoptimize. There is an additional Graal option, HsailAllocBytesPerWorkitem, which can be used for performance experiments. HsailAllocBytesPerWorkitem hints how many bytes each workitem expects to allocate. Before invoking the kernel, the JVM code looks at the donor threads' TLAB free sizes and, if the existing free space is not large enough (taking into account the number of workitems and the number of donor threads), attempts to "close" a TLAB and allocate a new one. Behavior will be functionally correct regardless of this option; there just might be more deopts. We intend to explore other ways to reduce the probability of deopts.
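The following is a rough C++ sketch of the kind of pre-dispatch check described above. The helper names (DonorTlab, ensureTlabCapacity, retireAndAllocateNewTlab) are hypothetical, and the sizing arithmetic is only one plausible interpretation of how the option might be applied.

#include <cstddef>

struct DonorTlab {
    char* top;
    char* end;
    std::size_t freeBytes() const { return (std::size_t)(end - top); }
};

// Assumed JVM-side helper that retires the current TLAB and requests a new
// one of at least minBytes; not part of the sketch itself.
void retireAndAllocateNewTlab(DonorTlab* tlab, std::size_t minBytes);

void ensureTlabCapacity(DonorTlab** donorTlabs, int numDonorThreads,
                        int numWorkitems, std::size_t hsailAllocBytesPerWorkitem) {
    // Each donor-thread TLAB is shared by roughly numWorkitems / numDonorThreads workitems.
    std::size_t workitemsPerTlab =
        (std::size_t)(numWorkitems + numDonorThreads - 1) / numDonorThreads;
    std::size_t bytesNeeded = workitemsPerTlab * hsailAllocBytesPerWorkitem;
    for (int i = 0; i < numDonorThreads; i++) {
        if (donorTlabs[i]->freeBytes() < bytesNeeded) {
            // "Close" the current TLAB and try to get a bigger one; if this
            // cannot be done the kernel still runs, it just deopts more often.
            retireAndAllocateNewTlab(donorTlabs[i], bytesNeeded);
        }
    }
}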
Fastpath Allocation Failure, Eden Allocation
A number of different policies are possible when the fastpath allocation fails. These are:
- allocate a new TLAB from the GPU and retry the fastpath allocation described above. This is the default policy.
- allocate the original request from eden directly using CAS.
- give up and deoptimize to the interpreter
For best performance, we would like to deoptimize back to the interpreter only when absolutely necessary.
For the default policy of allocating a new TLAB from the GPU, we assign the task of allocating the new TLAB to the "first overflower" described above. Other workitems then wait for the first overflower to indicate that a new TLAB is ready. A level of indirection, a TLABInfo pointer, is used to make the switch to the new TLAB atomic.
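A minimal sketch of this indirection, again with hypothetical names (TlabInfo, TlabInfoPtr, allocateNewTlabInfo) and std::atomic standing in for the HSAIL atomics; the real refill path is more involved than this.

#include <atomic>
#include <cstddef>

struct TlabInfo {
    std::atomic<char*> top;
    char*              end;
};

// Each shared TLAB is reached through one extra level of indirection so that
// switching every waiting workitem to the new TLAB is a single atomic store.
struct TlabInfoPtr {
    std::atomic<TlabInfo*> current;
};

// Assumed slow-path helper that obtains a fresh TLAB from the JVM.
TlabInfo* allocateNewTlabInfo(std::size_t minBytes);

char* allocWithRefill(TlabInfoPtr* shared, TlabInfo* failedTlab,
                      std::size_t size, bool firstOverflower) {
    if (firstOverflower) {
        // Only the first overflower requests a fresh TLAB, then publishes it
        // by swinging the shared TLABInfo pointer.
        TlabInfo* fresh = allocateNewTlabInfo(size);
        if (fresh == nullptr) return nullptr;            // no room: deoptimize
        shared->current.store(fresh);
    } else {
        // Other overflowing workitems wait until the shared pointer no longer
        // refers to the TLAB they just overflowed.
        while (shared->current.load() == failedTlab) { /* spin */ }
    }
    TlabInfo* tlab = shared->current.load();
    char* oldTop = tlab->top.fetch_add((std::ptrdiff_t) size);
    if (oldTop + size <= tlab->end) return oldTop;       // retry succeeded
    return nullptr;                                      // still full: deoptimize
}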
The second policy, allocating directly from eden, is enabled by the Graal option HsailUseEdenAllocate; if set to true, instead of deopting we first attempt to allocate from eden. There is a single eden from which all threads (and for us, all workitems) allocate; in fact, TLABs themselves are allocated from eden. Given that we are possibly competing with real Java threads for eden allocation, we use the HSAIL platform atomic instruction atomic_cas. While eden allocation was functionally correct, we saw a performance degradation compared to simply deoptimizing on hardware where many workitems could be hitting eden at the same time, and so have turned it off by default. We may explore eden allocation further in the future, and we would also like to explore the strategy of allocating a whole new TLAB from the GPU when a TLAB overflows.
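A minimal sketch of CAS-based eden allocation, assuming a hypothetical Eden struct and edenAlloc helper, with std::atomic compare_exchange standing in for the HSAIL atomic_cas instruction.

#include <atomic>
#include <cstddef>

struct Eden {
    std::atomic<char*> top;  // shared with every Java thread in the VM
    char*              end;
};

// Returns the allocated address, or nullptr if eden is exhausted (the caller
// then deoptimizes so that a GC can run on the CPU).
char* edenAlloc(Eden* eden, std::size_t size) {
    char* oldTop = eden->top.load();
    for (;;) {
        char* newTop = oldTop + size;
        if (newTop > eden->end) return nullptr;           // needs a GC
        // compare_exchange_weak refreshes oldTop with the current value on
        // failure, so the loop simply retries against the new top.
        if (eden->top.compare_exchange_weak(oldTop, newTop)) return oldTop;
    }
}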
The final policy of deoptimizing to the interpreter has to be supported by any allocation policy. For instance, if the GPU tries to allocate a new TLAB, there might not be enough free space on the heap and a GC will be necessary. In this case we deoptimize using the usual HSAIL deoptimization logic. In the future there may be a way to stay on the GPU and wait for the GC to happen on the CPU, but this is not supported yet.