In order to widen the types of lambda expressions that can be offloaded to the GPU, we wanted to be able to handle lambdas that use object allocation. The following describes some experiments with allocation using the HSAIL backend to graal, which are available on the graal trunk.
Note: As is true of many modern compilers, graal can avoid the actual allocation when, by using escape analysis, it can prove that the allocated objects do not escape. Here's an example junit test where graal can successfully use escape analysis.
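That test is not reproduced here, but the following minimal sketch (with illustrative names, not the actual junit test) shows the kind of pattern involved: the Point allocated inside the lambda body never escapes it, so the compiler is free to elide the allocation.

import java.util.stream.IntStream;

public class NoEscapeExample {

    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        int lengthSquared() { return x * x + y * y; }
    }

    public static void main(String[] args) {
        final int length = 20;
        int[] out = new int[length];
        IntStream.range(0, length).parallel().forEach(i -> {
            // The Point is used only inside this lambda and is never stored
            // anywhere that outlives it, so escape analysis can remove the
            // allocation entirely.
            Point p = new Point(i, i + 1);
            out[i] = p.lengthSquared();
        });
    }
}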
But we also wanted to handle lambdas where the allocated objects really do escape. Here is a simple example where we start with an array of longs and produce an array of Strings, one for each long. You can build and run this using the instructions on the Standalone Sumatra Stream API Offload Demo page, assuming you are using the latest graal trunk.
package simplealloc;

import java.util.stream.IntStream;
import java.util.Random;

public class SimpleAlloc {

    public static void main(String[] args) {
        final int length = 20;
        String[] output = new String[length];

        // initialize input, not offloaded
        long[] input = new Random().longs(length).toArray();

        // call toString on each input element - this is offloaded
        IntStream.range(0, length).parallel().forEach(p -> {
            output[p] = Long.toString(input[p]);
        });

        // Print results - not offloaded
        IntStream.range(0, length).forEach(p -> {
            System.out.println(output[p]);
        });
    }
}
Implementation Notes
Fastpath Allocation
- The implementation has so far only been done for HSA targets and makes use of the fact that the HSA target has a coherent view of host memory. The HSAIL backend has always been able to directly access objects in the Java heap, but the coherent view of memory also covers the data structures that control the heap. For example, the HotSpot JVM in many cases uses Thread Local Allocation Buffers (TLABs) to allocate memory on the heap. Usually each Java thread has its own TLAB. Among other things, a TLAB has the usual pointers:
HeapWord* _start;   // start of TLAB
HeapWord* _top;     // address after last allocation, bumped with each allocation
HeapWord* _end;     // end of TLAB
The HSAIL allocation code uses TLABs, but if we had a TLAB for each workitem, the number of TLABs would be too large, so multiple workitems can allocate from a single TLAB. To simplify TLAB collection by the regular GC procedures, the TLABs that the HSAIL kernels use are still owned by regular Java threads called "donor threads". The graal option -G:HsailDonorThreads controls how many such donor threads (and TLABs) are created and passed to the GPU.
Since multiple workitems can allocate from a single TLAB, the HSAIL instruction atomic_add is used to atomically fetch and add to the tlab.top pointer. If tlab.top overflows past tlab.end, the first overflower (detectable because its oldTop is still less than tlab.end) saves that oldTop as the "last good top". Meanwhile, other workitems that try to allocate from this TLAB will also overflow and fail. When the kernel dispatch finishes, the "last good top" is restored in the JVM code so that the TLAB invariants are met.
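To make the fastpath concrete, here is a minimal Java sketch of the bump-pointer scheme just described, using an AtomicLong to stand in for the HSAIL atomic_add on tlab.top. The names TlabModel and tryAllocate are illustrative only; the real logic lives in the snippets mentioned below.

import java.util.concurrent.atomic.AtomicLong;

class TlabModel {
    final long start;               // tlab.start
    final long end;                 // tlab.end
    final AtomicLong top;           // tlab.top, bumped with each allocation
    volatile long lastGoodTop = -1; // recorded by the first overflower

    TlabModel(long start, long end) {
        this.start = start;
        this.end = end;
        this.top = new AtomicLong(start);
    }

    // Returns the address of the newly allocated space, or -1 if this TLAB
    // has overflowed and the caller must fall back to another policy.
    long tryAllocate(long size) {
        long oldTop = top.getAndAdd(size);   // models HSAIL atomic_add on tlab.top
        if (oldTop + size <= end) {
            return oldTop;                   // fastpath success
        }
        if (oldTop < end) {
            // First overflower: its oldTop is still below tlab.end, so it
            // records the "last good top" for the JVM to restore after the
            // kernel dispatch finishes.
            lastGoodTop = oldTop;
        }
        return -1;
    }
}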
This allocation logic can be seen in the graal files HSAILNewObjectSnippets.java and HSAILHotSpotReplacementsUtil.java and is currently implemented for the graal nodes NewInstanceNode and NewArrayNode; the dynamic flavors are not supported yet. Other than the special treatment of tlab.top mentioned above, the rest of the fastpath allocation logic (formatting the object, etc.) is inherited from the superclass NewObjectSnippets.
Note that the donor threads themselves are suspended and do not allocate from their TLABs while the GPU is allocating.
Fastpath Allocation Failure
If the fastpath allocation from the workitem's shared TLAB fails (top overflows past end, as described above), a number of different policies are possible:
- allocate a new TLAB from the GPU and retry the fastpath allocation described above (this is the default policy)
- allocate the original request directly from eden using CAS
- give up and deoptimize to the interpreter
For best performance, we would like to deoptimize back to the interpreter only when absolutely necessary.
For the default policy of allocating a new TLAB from the GPU, we assign the task of allocating the new TLAB to the "first overflower" described above. Other workitems then wait for the first overflower to indicate that a new TLAB is ready. A level of indirection, a TLABInfo pointer, is used to make the switch to the new TLAB atomic.
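The following sketch, again assumed rather than taken from the graal sources, reuses the TlabModel class from the earlier sketch to illustrate the indirection: tlabInfo plays the role of the TLABInfo pointer, only the first overflower installs a replacement TLAB, and the other workitems simply retry and pick up the new TLAB through the indirection. SharedTlabAllocator and allocateNewTlab are hypothetical names, and the real runtime call may instead lead to deoptimization when no heap space is available.

import java.util.concurrent.atomic.AtomicReference;

class SharedTlabAllocator {
    // The level of indirection: workitems always reach the current TLAB
    // through this reference, so installing a new TLAB is one atomic update.
    static final AtomicReference<TlabModel> tlabInfo =
            new AtomicReference<>(new TlabModel(0, 1 << 20));

    static long allocate(long size) {
        while (true) {
            TlabModel tlab = tlabInfo.get();
            long oldTop = tlab.top.getAndAdd(size);
            if (oldTop + size <= tlab.end) {
                return oldTop;                       // fastpath success
            }
            if (oldTop < tlab.end) {
                // First overflower: record the last good top and publish a
                // replacement TLAB for everyone else to pick up.
                tlab.lastGoodTop = oldTop;
                tlabInfo.compareAndSet(tlab, allocateNewTlab());
            }
            // All other overflowers loop back, observe the new TLAB through
            // tlabInfo, and retry their allocation.
        }
    }

    // Hypothetical stand-in for the runtime call that obtains a fresh TLAB
    // (which in practice may fail and force a deoptimization).
    static TlabModel allocateNewTlab() {
        return new TlabModel(0, 1 << 20);
    }
}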
The second policy of allocating directly from eden is enabled by the graal flag HsailUseEdenAllocate. In this case there is a single eden from which all threads (and, for us, all workitems) allocate; in fact, TLABs themselves are allocated from eden. Given that we may be competing with real Java threads for eden allocation, we use the HSAIL platform atomic instruction atomic_cas. While eden allocation was functionally correct, we saw a performance degradation on hardware where many workitems could be hitting eden at the same time.
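A corresponding sketch of the eden fallback, with Java atomics standing in for the HSAIL atomic_cas instruction (EdenModel and its bounds are hypothetical): each attempt re-reads eden's top and publishes the new top with compare-and-swap, so it can lose races against other threads and workitems and must retry, which is one reason heavy contention on eden hurts performance.

import java.util.concurrent.atomic.AtomicLong;

class EdenModel {
    final AtomicLong top = new AtomicLong(0);   // shared by all threads and workitems
    final long end = 1 << 26;                   // hypothetical end of eden

    // Returns the address of the newly allocated space, or -1 if eden is
    // exhausted (at which point the kernel deoptimizes and the CPU can GC).
    long allocate(long size) {
        while (true) {
            long oldTop = top.get();
            long newTop = oldTop + size;
            if (newTop > end) {
                return -1;
            }
            if (top.compareAndSet(oldTop, newTop)) {  // models HSAIL atomic_cas
                return oldTop;
            }
            // CAS lost a race with another allocator: retry with a fresh top.
        }
    }
}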
The final policy of deoptimizing to the interpreter has to be supported by any allocation policy. For instance, if the GPU tries to allocate a new TLAB, there might not be enough free space on the heap and a GC will be necessary. In this case we deoptimize using the usual HSAIL deoptimization logic. In the future there may be a way to stay on the GPU and wait for the GC to happen on the CPU, but this is not supported yet.