To widen the range of lambda expressions that can be offloaded to the GPU, we wanted to be able to handle lambdas that allocate objects. The following describes some experiments with allocation using the HSAIL Backend to graal, which are available on the graal trunk.
Note: As is true of many modern compilers, graal can avoid the actual allocation when it can use escape analysis to prove that the allocated objects do not escape. Here's an example junit test where graal can successfully apply escape analysis.
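As an illustration only (this is not the actual junit test; the class and names are invented), here is the kind of lambda where the allocated object never escapes and so the allocation can be eliminated:

import java.util.stream.IntStream;

public class NonEscapingAllocSketch {
    // Hypothetical Point type; the temporary allocated in the lambda never escapes.
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        int sum() { return x + y; }
    }

    public static void main(String[] args) {
        final int length = 16;
        int[] output = new int[length];
        IntStream.range(0, length).parallel().forEach(p -> {
            Point tmp = new Point(p, p + 1); // never escapes the lambda body
            output[p] = tmp.sum();           // only primitive data escapes
        });
        System.out.println(output[0] + " ... " + output[length - 1]);
    }
}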
But we also wanted to handle lambdas where the allocated objects really do escape. Here is a simple example where we start with an array of longs and produce an array of Strings, one for each long. You can build and run this using the instructions on Standalone Sumatra Stream API Offload Demo, assuming you are using the latest graal trunk.
...
        // Print results - not offloaded
        IntStream.range(0, length).forEach(p -> {
            System.out.println(output[p]);
        });
    }
}
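For readers who just want the shape of the example, here is a minimal self-contained sketch of the same long-to-String transformation. This is not the demo source; the array size and names are made up, and only the parallel stream portion is the candidate for offload:

import java.util.stream.IntStream;

public class LongToStringSketch {
    public static void main(String[] args) {
        final int length = 16;
        long[] input = new long[length];
        String[] output = new String[length];
        for (int i = 0; i < length; i++) {
            input[i] = (long) i * i;
        }

        // Offloadable part: each workitem allocates a new String object.
        IntStream.range(0, length).parallel().forEach(p -> {
            output[p] = Long.toString(input[p]);
        });

        // Print results - not offloaded
        IntStream.range(0, length).forEach(p -> {
            System.out.println(output[p]);
        });
    }
}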
HeapWord* _start; // start of TLAB
HeapWord* _top; // address after last allocation, bumped with each allocation
HeapWord* _end; // end of TLAB
...
Since multiple workitems can allocate from a single TLAB, the HSAIL instruction atomic_add is used to atomically fetch and bump the tlab.top pointer. If tlab.top overflows past tlab.end, the first overflower (detectable because its oldTop is still less than end) saves that oldTop as the "last good top". Meanwhile, other workitems that try to allocate from this TLAB will also overflow and fail. When the kernel dispatch finishes, this "last good top" is restored in the JVM code so that the TLAB invariants are met.
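The following plain-Java sketch (invented names, not the actual JVM or HSAIL code) mirrors that scheme: an atomic fetch-and-add bumps the shared top, and only the first overflower records the last good top for the JVM to restore later:

import java.util.concurrent.atomic.AtomicLong;

// Illustrative model of many workitems bump-allocating from one shared TLAB.
class SharedTlabSketch {
    final long start;               // start of the region (kept for symmetry with _start)
    final long end;                 // end of the region
    final AtomicLong top;           // next free address, bumped on each allocation
    volatile long lastGoodTop = -1; // recorded by the first overflower

    SharedTlabSketch(long start, long end) {
        this.start = start;
        this.end = end;
        this.top = new AtomicLong(start);
    }

    // Returns the address of 'size' bytes, or -1 if this region is exhausted.
    long allocate(long size) {
        long oldTop = top.getAndAdd(size);  // atomic "fetch old top, bump top"
        if (oldTop + size <= end) {
            return oldTop;                  // fast path succeeded
        }
        // Overflow. Only the first overflower still sees an oldTop that is not
        // past end; it records the last good top so the TLAB invariants can be
        // restored by the JVM after the kernel dispatch finishes.
        if (oldTop <= end) {
            lastGoodTop = oldTop;
        }
        return -1;                          // caller must use a fallback policy
    }
}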
...
Note that the donor threads themselves are suspended and are not allocating from their TLABs while the GPU is allocating.
...
If the fastpath allocation from the workitem's shared TLAB fails (top overflows past end as described above), the kernel must fall back to a slower path. A number of different fallback policies are possible; these are described below. One of them, enabled by the graal option HsailUseEdenAllocate, attempts to allocate directly from eden, and as a last resort we deoptimize to the interpreter using the usual HSAIL deoptimization logic (todo: link here to hsail deoptimization experiments).
For best performance, we would like to deoptimize back to the interpreter only when absolutely necessary.
For the default policy of allocating a new TLAB from the GPU, we assign the task of allocating the new TLAB to the "first overflower" described above. Other workitems then wait for the first overflower to indicate that the new TLAB is ready. A level of indirection (a TLABInfo pointer) is used so that switching to the new TLAB pointer is a single atomic update.
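Here is a sketch of that policy in the same illustrative Java style (names invented; an AtomicReference stands in for the TLABInfo pointer so the switch to the new TLAB is one atomic update):

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative model: the first overflower fetches a fresh TLAB and publishes
// it through a single pointer; other workitems wait for the publication.
class TlabSwitchSketch {
    static final class Tlab {
        final AtomicLong top;  // next free address
        final long end;        // end of the region
        Tlab(long start, long end) { this.top = new AtomicLong(start); this.end = end; }
    }

    final AtomicReference<Tlab> tlabInfo = new AtomicReference<>(new Tlab(0, 1 << 20));

    long allocate(long size) {
        Tlab tlab = tlabInfo.get();
        long oldTop = tlab.top.getAndAdd(size);
        if (oldTop + size <= tlab.end) {
            return oldTop;                              // fast path
        }
        if (oldTop <= tlab.end) {
            // First overflower: obtain a new TLAB and publish it for everyone.
            tlabInfo.compareAndSet(tlab, newTlabFromVm());
        }
        // Other workitems spin here until the new TLAB has been published.
        while (tlabInfo.get() == tlab) { /* wait for the first overflower */ }
        return allocate(size);                          // retry against the new TLAB
    }

    private Tlab newTlabFromVm() {
        // Placeholder for the VM call that hands out a fresh TLAB.
        return new Tlab(0, 1 << 20);
    }
}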
The second policy, allocating from eden directly, is enabled by the graal flag HsailUseEdenAllocate. In this case there is a single eden from which all threads (and, for us, all workitems) allocate; in fact, TLABs themselves are allocated from eden. Since we are possibly competing with real Java threads for eden allocation, we use the HSAIL platform atomic instruction atomic_cas. While eden allocation was functionally correct, we saw a performance degradation compared to simply deoptimizing on hardware where many workitems could be hitting eden at the same time, so it is turned off by default. We may explore eden allocation further in the future.
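For comparison, here is a sketch of CAS-style eden allocation in the same illustrative Java (names invented; the single shared top pointer is exactly where the contention arises):

import java.util.concurrent.atomic.AtomicLong;

// Illustrative model of allocating directly from a single shared eden.
class EdenCasSketch {
    final AtomicLong edenTop = new AtomicLong(0);
    final long edenEnd = 64L << 20;   // pretend eden is 64 MB

    // Returns the address of 'size' bytes, or -1 if eden is full (GC needed).
    long allocate(long size) {
        while (true) {
            long oldTop = edenTop.get();
            long newTop = oldTop + size;
            if (newTop > edenEnd) {
                return -1;            // no room: fall back, e.g. deoptimize
            }
            // The CAS only succeeds if no other thread or workitem bumped top first.
            if (edenTop.compareAndSet(oldTop, newTop)) {
                return oldTop;
            }
            // Lost the race; retry. With many workitems this loop is where the
            // observed slowdown relative to the TLAB fast path comes from.
        }
    }
}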
There is an additional graal HSAIL allocation option which can be used for performance experiments. While deoptimizing to the interpreter gets correct results, for performance we would prefer to stay on the GPU rather than deoptimize. The graal option HsailAllocBytesPerWorkitem specifies how many bytes each workitem expects to allocate. Before invoking the kernel, the JVM code looks at the donor threads' TLAB free sizes and, if the existing free space is not large enough (taking into account the number of workitems and the number of donor threads), "closes" the TLAB and tries to allocate a new one. Behavior is functionally correct regardless of this option; there just might be more deopts. We intend to explore other ways to reduce the probability of deopts.
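A back-of-the-envelope sketch of that pre-dispatch check; the formula and names below are our assumptions for illustration, not the actual JVM code:

// Rough model of deciding whether a donor thread's TLAB is big enough
// for its share of the kernel's expected allocations.
class TlabSizingSketch {
    static boolean tlabBigEnough(long tlabFreeBytes,
                                 long bytesPerWorkitem,  // HsailAllocBytesPerWorkitem
                                 int numWorkitems,
                                 int numDonorThreads) {
        // Workitems are spread across the donor threads' TLABs.
        long workitemsPerDonor = (numWorkitems + numDonorThreads - 1) / numDonorThreads;
        long neededBytes = workitemsPerDonor * bytesPerWorkitem;
        return tlabFreeBytes >= neededBytes;
    }

    public static void main(String[] args) {
        // Example: 1 MB free, 64 bytes per workitem, 65536 workitems, 4 donor threads.
        // 16384 workitems per donor * 64 bytes = exactly 1 MB, so this prints true.
        System.out.println(tlabBigEnough(1 << 20, 64, 1 << 16, 4));
    }
}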
The final policy, deoptimizing to the interpreter, has to be supported regardless of which allocation policy is chosen. For instance, if the GPU tries to allocate a new TLAB, there might not be enough free space in the heap and a GC will be necessary. In this case we deoptimize using the usual HSAIL deoptimization logic. In the future there may be a way to stay on the GPU and wait for the GC to happen on the CPU, but this is not supported yet.