
Introduction: JVM vs. SIMD

The Java Virtual Machine represents parallel execution by threads. Each thread executes some bit of code on some bit of data. Both the code and the data are (in principle) independent from all other threads. At any moment, a thread performs a single work-unit consisting of an instruction, its inputs, and its outputs.

Parallel processing units execute similar work-units, with a difference: several work-units may be executed simultaneously, sharing a single instruction and multiple parallel inputs and outputs.

A program that executes one ALU instruction across a parallel set of data inputs and outputs is called SIMD, single instruction multiple data. If the data is structured as parallel sets of registers or stack temps, and the instructions are capable of performing control flow, then the model may be called SIMT, where the last word is "thread".
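As a rough illustration (the class and variable names here are invented for this sketch), a SIMD step applies one ALU operation across a whole set of data lanes at once; in scalar Java the same shape appears as a loop whose body is one uniform instruction:

```java
import java.util.Arrays;

// Sketch: one "instruction" (integer add) applied across parallel lanes.
// All names are illustrative; this is scalar code with SIMD shape.
public class SimdSketch {
    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};      // per-lane inputs
        int[] b = {10, 20, 30, 40};
        int[] out = new int[4];      // per-lane outputs
        // Single instruction, multiple data: each lane i computes a[i] + b[i].
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
        System.out.println(Arrays.toString(out)); // [11, 22, 33, 44]
    }
}
```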

...

(Note: Highly regular computations, such as dense matrix arithmetic, will benefit independently from vectorization and tiled data layouts. SIMT modes of execution may or may not add value to these very regular problems.)

What's in a Java Thread?

At any given moment, a thread consists of:

  • a bytecode T.bc being executed (part of a basic block within some "current method")
  • local variables T.L[], expression stack T.S[], and monitors T.M[] (appropriate to the current method)
  • a stack frame of all of the above T.F = < bc,L,S,M >
  • a control stack of pending executions of either bytecode methods or native methods: T.C = {F(j)}
  • thread-local values (accessed by java.lang.ThreadLocal), T.TL[]
  • a permanently associated object of type java.lang.Thread, T.Thread
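The per-thread state listed above can be sketched as a plain data record. Every class and field name here is invented for illustration; none of this is a real JVM API:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative model of the per-thread state listed above.
class ThreadState {
    // T.F = < bc, L, S, M >: one frame of the current method.
    static class Frame {
        int bc;                                    // index of the bytecode being executed
        Object[] locals;                           // T.L[]
        Deque<Object> stack = new ArrayDeque<>();  // T.S[] (expression stack)
        Object[] monitors;                         // T.M[]
    }

    Deque<Frame> control = new ArrayDeque<>();     // T.C = {F(j)}, pending executions
    Map<Object, Object> threadLocals = new HashMap<>(); // T.TL[]
    Thread javaThread;                             // T.Thread
}
```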

...

Java threads share memory structures but keep their instruction streams private. Parallel processing hardware such as GPUs can have a complementary design, with shared instruction streams but private memory. Localized, unshared memory simplifies communication, while several execution units may share a single instruction stream to reduce cross-lane interactions and/or amortize costs of instruction issuing.

Thread

...

Swarms

Imagine a thread swarm G as a group of Java threads T(i), with full Java semantics, which executes all the threads with a high degree of alignment.

When a thread swarm starts, all of its threads are positioned at an instruction which is about to call the standard method Thread.run.

At any point, G has a pending continuation for each live thread T(i). In order to express the alignment between threads in the swarm, we express G's continuations as a multi-map rather than a set, with the keys of the map being current instructions:

...

Let's call each value set in the multi-map an aligned thread set.

A thread swarm makes progress by picking a key bc from G.conts.keySet and running that instruction in SIMD fashion across the relevant parts of each T(i) in parallel.

For example, if bc is a particular iadd, then all the T(i) waiting at that iadd add their top two stack elements and replace them with the sum. The threads then update their continuations to be ready to execute the next instruction after the iadd, and the G.conts map is updated to reflect this.
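The iadd step above can be sketched with a multi-map from current instruction to the aligned set of threads waiting there. All types and names below are invented for this sketch; each "thread" is reduced to an operand stack plus a program counter:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a swarm picks one bytecode key and steps every aligned thread.
class SwarmSketch {
    static class T {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc;
    }

    // Execute one iadd in SIMD fashion across every thread waiting at pc `key`.
    static void stepIadd(Map<Integer, List<T>> conts, int key) {
        List<T> aligned = conts.remove(key);          // the aligned thread set
        for (T t : aligned) {
            int b = t.stack.pop(), a = t.stack.pop();
            t.stack.push(a + b);                      // replace top two with the sum
            t.pc = key + 1;                           // advance past the iadd
            conts.computeIfAbsent(t.pc, k -> new ArrayList<>()).add(t);
        }
    }

    public static void main(String[] args) {
        Map<Integer, List<T>> conts = new HashMap<>(); // G.conts
        for (int i = 1; i <= 3; i++) {
            T t = new T();
            t.pc = 7;                                  // all waiting at the same iadd
            t.stack.push(10 * i);
            t.stack.push(i);
            conts.computeIfAbsent(t.pc, k -> new ArrayList<>()).add(t);
        }
        stepIadd(conts, 7);
        // Threads now hold 11, 22, 33 and all wait at pc 8.
        System.out.println(conts.get(8).size() + " threads at pc 8");
    }
}
```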

Control Flow

When a thread swarm executes an instruction, there may be a variety of outcomes:

...

This variability of control flow is a strongly entrenched feature of the JVM. In order to execute a thread swarm efficiently on GPU hardware, threads must be managed so as to converge as much as possible. For example, consider this classic diamond-shaped program:

...

If all threads start off at a() but they compute differing values of the boolean, some will execute b() and others c(). To begin to control divergence, no thread should execute d() until all threads have left the if-then-else statement (the block labeled L).
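A minimal sketch of the diamond shape described above, assuming method names matching the text (the bodies are placeholders; only the control-flow shape matters):

```java
// Sketch of the diamond-shaped control flow: a() -> (b() | c()) -> d().
class Diamond {
    static boolean a() { return Math.random() < 0.5; } // threads may disagree here
    static void b() { /* "then" arm */ }
    static void c() { /* "else" arm */ }
    static void d() { /* join point: run only after all threads leave L */ }

    static void run() {
        boolean cond = a();
        L: {                         // the block the text labels L
            if (cond) b(); else c();
        }
        d();                         // reconvergence point
    }
}
```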

(Note the consequence that throwing an exception or returning from b() or c() counts as exiting the block, even though the thread will not rejoin the others at d().)

Walking the Control Graph

We want to encourage convergence, to use parallel hardware more efficiently, and also to make less opportunity for race conditions.

...

A thread group may use more than one physical thread (or GPU array) to execute. For example, in an if-then-else statement, one processor might execute the "then" part while another deals with the "else" part. Or, an implementation may choose to use a single physical processor, multiplexing the execution of the two parts.

Other Considerations

The SIMT model seems to allow some kinds of very serialized Java code to operate efficiently on GPUs.

The thread-swarm model is useless if the implementation actually requires that we create a large number of conventional threads. Perhaps a subclass of Thread would help express the necessary information to the system.

...

It should be possible to "roll up" or gather the final results (or provisional results) of a thread swarm computation. The rolled-up value must be structured as a map from processor to value, or a collection of values.

It should also be possible to "roll down" or scatter new values into the thread swarm computation from a map or collection.
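The roll-up/roll-down shapes can be sketched as gather and scatter over a per-thread value map. The Swarm type and its methods are invented for illustration, not a proposed API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of "roll up" (gather) and "roll down" (scatter) for a swarm.
class RollSketch {
    static class Swarm {
        Map<Integer, Integer> values = new HashMap<>(); // thread index -> value
    }

    // Roll up: gather each thread's (possibly provisional) result into a map.
    static Map<Integer, Integer> rollUp(Swarm g) {
        return new HashMap<>(g.values);
    }

    // Roll down: scatter new per-thread values back into the swarm.
    static void rollDown(Swarm g, Map<Integer, Integer> newValues) {
        g.values.putAll(newValues);
    }

    public static void main(String[] args) {
        Swarm g = new Swarm();
        g.values.put(0, 11);
        g.values.put(1, 22);
        Map<Integer, Integer> gathered = rollUp(g);  // snapshot of all lanes
        rollDown(g, Map.of(1, 99));                  // overwrite thread 1's value
        System.out.println(gathered + " -> " + g.values);
    }
}
```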

...