Project Sumatra Wiki
This page and its child pages contain design notes for Project Sumatra.
OpenJDK project page: http://openjdk.java.net/projects/sumatra
Repositories: http://hg.openjdk.java.net/sumatra/sumatra-dev/{scratch,hotspot,jdk,...} (repo info)
Developer list: http://mail.openjdk.java.net/mailman/listinfo/sumatra-dev
Goals
- Enable Java applications to take advantage of heterogeneous processing units (GPUs/APUs)
- Extend JVM JITs to generate code for heterogeneous processing hardware
- Integrate the JVM data model with data types efficiently processed by such hardware
- Allow the JVM to efficiently interoperate with high-performance libraries built for such hardware
- Extend the JVM managed runtime to track pointers and storage allocation throughout such a system
Challenges
Here are some of the specific technical challenges.
- mitigate the complexities of present-day GPU backends and layered standards
- build data schemes that are a workable compromise for both the JVM and GPU hardware
- define a Java model for "value types" which can be pervasively unboxed (like tuples or structs)
- need to support flatter data structures (Complex values, vector and RGBA values, 2D arrays) from Java (see the flattened-layout sketch below)
- need to support mix of primitives and JVM-managed pointers
- range of solutions: "don't" (as with JNI array-critical); pinning read barrier; stack maps and safepoints in the GPU
- range of solutions: no pointers; pointers are opaque (e.g., indices into a Java-side array); arena pointers; pinning read barrier (see the opaque-handle sketch below)
- need a "foreign data interface" that can interoperate (without copying) with standard sparse array packages
- adapt (or extend if necessary) JNI as a foreign invocation interface that is competent to call purpose-built C code for complex GPU requests
- reduce data copying and inter-phase latency between ISAs and loop kernels
- agreement on data structures will reduce copying
- a more flexible loop kernel container will allow loop kernel fusion
- cope with dynamically varying mixes of managed parallel and serial data and code
- use JVM dynamic compilation techniques to build customized kernels and execution strategies
- optimize computation requests relative to online data
- automatically (at each appropriate level of the system) sense load and distribute cleanly between CPU and GPUs
- compile (online) JDK 8 parallel collection pipelines to data-parallel compute requests (see the stream-pipeline sketch below)
- partition simple Java bytecode call graphs (after profile-directed inlining) into CPU and GPU
- learn to efficiently flatten nested or keyed parallel constructs
- apply existing technology on nested data parallelism (to JVM execution of GPU code)
- apply existing technology on MapReduce (to JVM execution of GPU code)
- ensure that Java views of flattened and grouped parallel data sets are compatible with GPU capabilities
- efficiently implement "nonlinear streams" in JDK 8 parallel collections
- create a practical and predictable story for loop vectorization, presumably user-assisted, and with useful failure modes
- build a low-level library of vector intrinsics (e.g., AVX-style) that can be called (manually) from Java
- apply existing technology for loop vectorization
- build user-assisted loop vectorizers for Java, possibly based on type annotations (JSR 308)
- deal with exceptional conditions as they arise in loop kernels
- allow GPU loop kernels to call back to CPU for infrequent edge cases (argument reduction, exceptions, allocation overflows, deoptimization of slow paths)
- engineer a loop kernel container API which accounts for multiple CPU outcomes and aggregates them per kernel iteration (perhaps with continuation-passing style); see the kernel-container sketch below
- define a robust and clear data-parallel execution model on top of the JVM bytecode, memory, and thread specifications
- interpret (or adapt if necessary) the Java Memory Model (JSR 133) to the needs of data parallel programming
- interpret (or adapt if necessary) the thread-based Java concurrency model (define GPU kernel effects in terms of bytecode execution by weakened quasi-threads)
- Investigate the use of Java language constructs and programming idioms that can be effectively compiled for a data-parallel execution engine (such as a GPU).
- potential candidate - lambda methods and expressions (see the stream-pipeline sketch below)
- other options?
- Investigate opportunities for GPU-enabled 'intrinsic' versions of existing JDK APIs
- candidates may include sorting, (de)compression, CRC checking, searching, convolutions, etc.
- adopt and adapt insights from previous work on data-parallel Java projects
- Fork/Join framework
- Aparapi (see the Aparapi kernel sketch below)
- Rootbeer
- RIT Parallel Java
- Terracotta
- jcuda - Java bindings for CUDA
- jocl - Java bindings for OpenCL
- jogamp-jocl - JogAmp's Java bindings for OpenCL
- FIXME: need a good list of references here
FIXME: Most of these items need their own wiki pages and/or email conversations
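To make a few of the challenge items above concrete, the sketches below are informal illustrations only; none of the class or method names they introduce (ComplexArray, HandleTable, KernelBody, KernelContainer) are part of any existing Sumatra API.
Flattened-layout sketch: the "flatter data structures" item asks how values such as Complex or RGBA can be kept unboxed. One way to picture the target layout is to flatten an array of small objects into parallel primitive arrays, which is the pointer-free shape a GPU kernel prefers.

    // Hypothetical sketch: Complex values flattened into parallel primitive arrays,
    // giving a pointer-free layout that a GPU back end could consume directly.
    final class ComplexArray {
        final double[] re;   // real parts
        final double[] im;   // imaginary parts

        ComplexArray(int length) {
            re = new double[length];
            im = new double[length];
        }

        // Element-wise complex multiply: a data-parallel loop with no object
        // headers or pointer chasing per element.
        void multiplyInPlace(ComplexArray other) {
            for (int i = 0; i < re.length; i++) {
                double r = re[i] * other.re[i] - im[i] * other.im[i];
                double j = re[i] * other.im[i] + im[i] * other.re[i];
                re[i] = r;
                im[i] = j;
            }
        }
    }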
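Opaque-handle sketch: one point in the "range of solutions" for mixing primitives with managed pointers is to keep pointers opaque on the device, e.g. as indices into a Java-side table. The HandleTable name below is invented for illustration.

    // Hypothetical sketch: device code never sees a JVM reference; it carries
    // int handles that the CPU side resolves against a Java-side table.
    final class HandleTable<T> {
        private final java.util.ArrayList<T> objects = new java.util.ArrayList<>();

        // Record an object and hand back an opaque index a kernel can pass around.
        int register(T obj) {
            objects.add(obj);
            return objects.size() - 1;
        }

        // Resolve a handle back to a reference on the CPU side.
        T resolve(int handle) {
            return objects.get(handle);
        }
    }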
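Stream-pipeline sketch: the "compile JDK 8 parallel collection pipelines" and lambda items refer to code of roughly this shape. The stream API shown is standard JDK 8; the open question is whether a JIT can turn such a pipeline into a GPU compute request rather than a fork/join decomposition.

    import java.util.stream.IntStream;

    public class SquareSum {
        public static void main(String[] args) {
            int n = 1 << 20;
            // A data-parallel pipeline over an index range: map, then reduce.
            // Today this runs on the fork/join common pool; a Sumatra-style JIT
            // could instead compile the lambda body into a GPU kernel.
            long sum = IntStream.range(0, n)
                                .parallel()
                                .mapToLong(i -> (long) i * i)
                                .sum();
            System.out.println(sum);
        }
    }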
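Aparapi kernel sketch: as an example of the prior work listed above, Aparapi expresses a kernel as the run() method of a Kernel subclass and translates its bytecode to OpenCL at run time, falling back to a Java thread pool otherwise. The package name below follows the original com.amd.aparapi releases and may differ in later versions.

    import com.amd.aparapi.Kernel;
    import com.amd.aparapi.Range;

    public class VectorAdd {
        public static void main(String[] args) {
            final int n = 4096;
            final float[] a = new float[n];
            final float[] b = new float[n];
            final float[] sum = new float[n];

            // run() is translated from bytecode to an OpenCL kernel when possible.
            Kernel kernel = new Kernel() {
                @Override
                public void run() {
                    int i = getGlobalId();
                    sum[i] = a[i] + b[i];
                }
            };
            kernel.execute(Range.create(n));
            kernel.dispose();
        }
    }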
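Kernel-container sketch: the items on exceptional conditions ask for a container in which each iteration either completes on the device or escapes to a CPU slow path. No such API exists yet; the sketch below (KernelBody and KernelContainer are invented names) only illustrates per-iteration outcome aggregation in plain Java.

    // Hypothetical sketch: a loop-kernel container whose fast path may decline
    // an iteration, deferring it to a CPU slow path that is replayed afterwards.
    interface KernelBody {
        // Return true if the iteration completed; false requests CPU fallback.
        boolean run(int i);
    }

    final class KernelContainer {
        // Stand-in for a device dispatch: run the fast path over the range,
        // record which indices escaped, then replay those on the CPU slow path.
        static void execute(int range, KernelBody fastPath, KernelBody slowPath) {
            java.util.BitSet escaped = new java.util.BitSet(range);
            for (int i = 0; i < range; i++) {
                if (!fastPath.run(i)) {
                    escaped.set(i);
                }
            }
            for (int i = escaped.nextSetBit(0); i >= 0; i = escaped.nextSetBit(i + 1)) {
                slowPath.run(i);   // infrequent edge cases (e.g. overflow, exceptions)
            }
        }
    }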
Roadmap
FIXME: In what order will we address these challenges?
Known investigations
FIXME: Add your work here!
See something wrong on this page? Fix it!
- Join sumatra-dev@openjdk.java.net
- Send an e-mail to sumatra-dev@openjdk.java.net:
- Request editor rights to the "Project Sumatra" wiki
- Include your Oracle SSO Username