To widen the range of lambda expressions that can be offloaded to the GPU, we wanted to handle lambdas that allocate objects. The following describes some experiments with allocation using the HSAIL backend to graal, which are available on the graal trunk.
Note: As is true of many modern compilers, graal can avoid the actual allocation when, by using escape analysis, it can prove that the allocated objects do not escape. Here's an example JUnit test where graal can successfully use escape analysis.
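That test is not reproduced here, but a minimal sketch of the kind of lambda that qualifies (class and field names are illustrative, not taken from the actual test) looks like this: a temporary object is allocated per workitem but is never stored anywhere that outlives the lambda, so graal can eliminate the allocation entirely.

import java.util.stream.IntStream;

public class NoEscape {

    // Illustrative value class; a Point is allocated per iteration
    // but never escapes the lambda body.
    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        int sum() { return x + y; }
    }

    public static void main(String[] args) {
        final int length = 20;
        int[] output = new int[length];
        IntStream.range(0, length).parallel().forEach(p -> {
            Point point = new Point(p, p + 1); // does not escape
            output[p] = point.sum();           // only the int result is stored
        });
    }
}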
But we also wanted to handle lambdas where the allocated objects really do escape. Here is a simple example where we start with an array of longs and want to produce an array of Strings, one for each long. You can build and run this using the instructions on Standalone Sumatra Stream API Offload Demo, assuming you are using the latest graal trunk.
package simplealloc;

import java.util.stream.IntStream;
import java.util.Random;

public class SimpleAlloc {

    public static void main(String[] args) {
        final int length = 20;
        String[] output = new String[length];

        // initialize input, not offloaded
        long[] input = new Random().longs(length).toArray();

        // call toString on each input element - this is offloaded
        IntStream.range(0, length).parallel().forEach(p -> {
            output[p] = Long.toString(input[p]);
        });

        // Print results - not offloaded
        IntStream.range(0, length).forEach(p -> {
            System.out.println(output[p]);
        });
    }
}
Implementation Notes
Fastpath Allocation
The implementation has so far been done only for HSA targets and makes use of the fact that the HSA target has a coherent view of host memory. The HSAIL backend has always been able to directly access objects in the Java heap, but the coherent view of memory also covers the data structures that control the heap. For example, the HotSpot JVM in many cases uses Thread Local Allocation Buffers (TLABs) to allocate memory on the heap; usually each Java thread has its own TLAB. Among other things, a TLAB has the usual pointers:
HeapWord* _start; // start of TLAB
HeapWord* _top; // address after last allocation, bumped with each allocation
HeapWord* _end; // end of TLAB
The HSAIL allocation code uses TLABs, but if we had a TLAB for each workitem, the number of TLABs would be too large. Thus multiple workitems can allocate from a single TLAB. To simplify TLAB collection by the regular GC procedures, the TLABs that HSAIL kernels use are still owned by regular Java threads called "donor threads". The graal option -G:HsailDonorThreads controls how many such donor threads (and TLABs) are created and passed to the GPU.
Since multiple workitems can allocate from a single TLAB, the HSAIL instruction atomic_add is used to atomically fetch and add to the tlab.top pointer. If tlab.top overflows past tlab.end, the first overflowing workitem (detectable because its oldTop is still less than end) saves its oldTop as the "last good top". Meanwhile, other workitems that try to allocate from this TLAB will also overflow and fail. The "last good top" is then restored in the JVM code when the kernel dispatch finishes, so that the TLAB invariants are met. A sketch of this fastpath follows.
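Here is a minimal Java-flavored model of that fastpath, with AtomicLong standing in for the HSAIL atomic_add instruction; the field and method names are illustrative (the real logic is emitted as HSAIL by the snippet files named below):

import java.util.concurrent.atomic.AtomicLong;

// Illustrative model of the shared-TLAB fastpath.
class SharedTlab {
    final AtomicLong top = new AtomicLong(); // bump pointer, shared by workitems
    volatile long end;                       // end of the TLAB
    volatile long lastGoodTop;               // restored by the JVM after dispatch

    // Returns the address of the newly allocated space, or -1 on failure.
    long allocate(long sizeInBytes) {
        long oldTop = top.getAndAdd(sizeInBytes); // HSAIL atomic_add
        long newTop = oldTop + sizeInBytes;
        if (newTop <= end) {
            return oldTop;                        // fastpath success
        }
        if (oldTop < end) {
            // First overflower: its oldTop is still less than end,
            // so save it as the "last good top".
            lastGoodTop = oldTop;
        }
        return -1;                                // all overflowers fail
    }
}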
This allocation logic can be seen in the graal files HSAILNewObjectSnippets.java and HSAILHotSpotReplacementsUtil.java and is currently implemented for the graal nodes NewInstanceNode and NewArrayNode; the dynamic flavors are not supported yet. Other than the special treatment of tlab.top mentioned above, the logic in the fastpath allocation path (formatting the object, etc.) is inherited from the superclass NewObjectSnippets.
Note that the donor threads themselves are suspended and are not allocating from their TLABs while the GPU is allocating.
Fastpath Allocation Failure, Deoptimization
If the fastpath allocation from the workitem's shared TLAB fails (top overflows past end, as described above), then by default we deoptimize to the interpreter using the usual HSAIL deoptimization logic. While deoptimizing to the interpreter produces correct results, for performance we would prefer to stay on the GPU rather than deoptimize. There is an additional graal option, HsailAllocBytesPerWorkitem, which can be used for performance experiments: it hints how many bytes each workitem expects to allocate. Before invoking the kernel, the JVM code looks at the free space in the donor thread TLABs and, taking into account the number of workitems and the number of donor threads, will "close" a TLAB and try to allocate a new one if the existing free space is not large enough (a sketch follows). Behavior is functionally correct regardless of this option; there just might be more deopts. We intend to explore other ways to reduce the probability of deopts.
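A rough sketch of that pre-dispatch check, assuming the free space is checked per donor TLAB (all of the names here are hypothetical, not the actual JVM code):

// Hypothetical sketch of the pre-dispatch TLAB sizing check.
class TlabSizing {
    interface Tlab {
        long free();   // bytes remaining in this TLAB
        void retire(); // "close" the TLAB so GC can parse it
        void refill(); // try to allocate a fresh TLAB
    }

    static void ensureSpace(Tlab[] donorTlabs, int numWorkitems,
                            long allocBytesPerWorkitem) {
        // Each donor TLAB is shared by roughly this many workitems.
        int workitemsPerTlab = numWorkitems / donorTlabs.length;
        long bytesNeeded = workitemsPerTlab * allocBytesPerWorkitem;
        for (Tlab tlab : donorTlabs) {
            if (tlab.free() < bytesNeeded) {
                tlab.retire();
                tlab.refill();
            }
        }
    }
}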
Fastpath Allocation Failure, Eden Allocation
The graal option HsailUseEdenAllocate, if set to true, specifies that instead of deopting we first attempt to allocate from eden. There is a single eden from which all threads (and, for us, all workitems) allocate; in fact, TLABs themselves are allocated from eden. Since we are possibly competing with real Java threads for eden allocation, we use the HSAIL platform atomic instruction atomic_cas, as modeled in the sketch below. While eden allocation was functionally correct, we saw a performance degradation compared to simply deoptimizing, so it is turned off by default. We may explore eden allocation further in the future, and we would also like to explore the strategy of allocating a whole new TLAB from the GPU when a TLAB overflows.
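For illustration, here is a minimal model of that CAS-based eden allocation, again in Java with AtomicLong standing in for the HSAIL atomic_cas instruction (names are illustrative):

import java.util.concurrent.atomic.AtomicLong;

// Illustrative model of CAS-based allocation from the single shared eden.
class Eden {
    final AtomicLong top = new AtomicLong(); // shared bump pointer
    volatile long end;                       // end of eden

    // Returns the address of the newly allocated space, or -1 if eden is full.
    long allocate(long sizeInBytes) {
        while (true) {
            long oldTop = top.get();
            long newTop = oldTop + sizeInBytes;
            if (newTop > end) {
                return -1;                   // eden exhausted; caller deopts
            }
            // CAS may fail if another workitem or Java thread got there first.
            if (top.compareAndSet(oldTop, newTop)) {
                return oldTop;
            }
        }
    }
}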