Project Sumatra Wiki

This page, with its child pages, contains design notes for Project Sumatra.

OpenJDK project page: http://openjdk.java.net/projects/sumatra
Repositories: http://hg.openjdk.java.net/sumatra/sumatra-dev/{scratch,hotspot,jdk,...} (repo info)
Developer list: http://mail.openjdk.java.net/mailman/listinfo/sumatra-dev

Goals

  • Enable Java applications to take advantage of heterogeneous processing units (GPUs/APUs)
  • Extend JVM JITs to generate code for heterogeneous processing hardware
  • Integrate the JVM data model with data types efficiently processed by such hardware
  • Allow the JVM to efficiently interoperate with high-performance libraries built for such hardware
  • Extend the JVM managed runtime to track pointers and storage allocation throughout such a system

Challenges

Here are some of the specific technical challenges.

  • mitigate the complexities of present-day GPU backends and layered standards
    • standards include: OpenCL, CUDA, Intel Phi, PTX, HSAIL/HSA (forthcoming), ...
    • FIXME: choose 1-3 of the standards (e.g., PTX, HSAIL/HSA) for initial backend development
  • build compromise data schemes for both the JVM and GPU hardware
    • define a Java model for "value types" which can be pervasively unboxed (like tuples or structs; see the value-type sketch below)
    • need to support flatter data structures (Complex values, vector and RGBA values, 2D arrays) from Java
    • need to support mix of primitives and JVM-managed pointers
      • range of solutions: "don't"; like JNI array-critical; pinning read barrier; stack maps and safepoints in GPU
      • range of solutions: no pointers; pointers are opaque (e.g., indices into Java-side array); arena pointers; pinning read barrier.
    • need "foreign data interface" that is competent to interoperate (without copying) to standard sparse array packages
    • adapt (or extend if necessary) JNI as a foreign invocation interface that is competent to call purpose-built C code for complex GPU requests
  • reduce data copying and inter-phase latency between ISAs and loop kernels
    • agreement of data structures will reduce copying
    • more flexible loop kernel container will allow loop kernel fusion
  • cope with dynamically varying mixes of managed parallel and serial data and code
    • use JVM dynamic compilation techniques to build customized kernels and execution strategies
    • optimize computation requests relative to online data
  • automatically (at each appropriate level of the system) sense load and distribute cleanly between CPU and GPUs
    • compile (online) JDK 8 parallel collection pipelines to data-parallel compute requests (see the pipeline sketch below)
    • partition simple Java bytecode call graphs (after profile-directed inlining) into CPU and GPU
  • learn to efficiently flatten nested or keyed parallel constructs (see the flatMap sketch below)
    • apply existing technology on nested data parallelism (to JVM execution of GPU code)
    • apply existing technology on MapReduce (to JVM execution of GPU code)
    • ensure that Java views of flattened and grouped parallel data sets are compatible with GPU capabilities
    • efficiently implement "nonlinear streams" in JDK 8 parallel collections
  • create a practical and predictable story for loop vectorization, presumably user-assisted, and with useful failure modes
    • build a low-level library of vector intrinsics (e.g., AVX-style) that can be called (manually) from Java (sketched below)
    • apply existing technology for loop vectorization
    • build user-assisted loop vectorizers for Java, possibly based on type annotations (JSR 308)
  • deal with exceptional conditions as they arise in loop kernels
    • allow GPU loop kernels to call back to CPU for infrequent edge cases (argument reduction, exceptions, allocation overflows, deoptimization of slow paths)
    • engineer a loop kernel container API which accounts for multiple CPU outcomes, and aggregates per kernel iteration (perhaps with continuation-passing style; sketched below)
  • define a robust and clear data-parallel execution model on top of the JVM bytecode, memory, and thread specifications
    • interpret (or adapt if necessary) the Java Memory Model (JSR 133) to the needs of data parallel programming
    • interpret (or adapt if necessary) the thread-based Java concurrency model (define GPU kernel effects in terms of bytecode execution by weakened quasi-threads)
  • investigate Java language constructs and programming idioms that can be compiled effectively for a data-parallel execution engine (such as a GPU)
    • potential candidate: lambda methods and expressions (see the pipeline sketch below)
    • other options?
  • investigate opportunities for GPU-enabled 'intrinsic' versions of existing JDK APIs
    • candidates include sort, (de)compression, CRC checking, search, convolutions, etc.
  • adopt and adapt insights from previous work on data-parallel Java projects
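
Sketches

The sketches below illustrate a few of the challenge items above. Except where a real JDK 8 API is used, every class, method, and name in them is an assumption made for illustration, not a settled Sumatra design.

Value types: a minimal sketch of the kind of struct-like value meant above, an immutable, identity-free aggregate of primitives. The class is legal Java today; the pervasive unboxing (laying a Complex[] out as a flat double buffer a GPU can consume) is the part the JVM would have to add.

    // A struct-like value: no identity, no mutation, only primitive fields.
    // A JVM that unboxes this pervasively could store a Complex[] as a
    // dense double buffer, which is the layout GPU hardware wants.
    final class Complex {
        final double re, im;
        Complex(double re, double im) { this.re = re; this.im = im; }
        Complex mul(Complex o) {
            return new Complex(re * o.re - im * o.im, re * o.im + im * o.re);
        }
    }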
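
Opaque pointers: for the "pointers are opaque" option, device code never sees JVM references, only integer handles that index a Java-side table, so the collector can move the underlying objects freely. The HandleTable name and API are hypothetical.

    // Kernels carry only the int handle; resolution happens on the CPU
    // side, so GPU code never holds a raw, movable JVM reference.
    final class HandleTable<T> {
        private final java.util.ArrayList<T> objects = new java.util.ArrayList<>();
        int acquire(T obj) { objects.add(obj); return objects.size() - 1; }
        T resolve(int handle) { return objects.get(handle); }
    }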
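
Parallel pipelines and lambdas: the pipeline below is real JDK 8 code and runs on fork/join CPU threads today. The lambda body is the candidate kernel; routing the whole map-plus-reduce shape to a GPU backend as a single compute request is the step Sumatra would add.

    import java.util.stream.IntStream;

    public class DotProduct {
        // An index-range map followed by a reduction: a data-parallel
        // shape a GPU backend could compile to one kernel launch.
        static double dot(double[] a, double[] b) {
            return IntStream.range(0, a.length)
                            .parallel()
                            .mapToDouble(i -> a[i] * b[i])
                            .sum();
        }
        public static void main(String[] args) {
            System.out.println(dot(new double[] {1, 2, 3},
                                   new double[] {4, 5, 6}));  // prints 32.0
        }
    }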
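
Nested parallelism: flatMap in a JDK 8 stream expresses a data-dependent amount of inner work per element, exactly the irregular shape that flattening techniques (segmented operations) target. Real JDK 8 code.

    import java.util.stream.IntStream;

    public class Nested {
        public static void main(String[] args) {
            int sum = IntStream.range(0, 4).parallel()
                               .flatMap(i -> IntStream.range(0, i))  // inner size varies with i
                               .map(j -> j * j)
                               .sum();
            System.out.println(sum);  // prints 6
        }
    }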
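
Vector intrinsics: one possible shape for the low-level intrinsics library, with every name invented for illustration. Each static method would map to one SIMD pattern (AVX-style) when the JIT recognizes it, and the plain-Java loop body is both the specification and the useful failure mode.

    // Hypothetical: a JIT that recognizes Vec.fma could emit vector
    // fused-multiply-add instructions; one that does not simply runs
    // the scalar loop, so results never depend on the hardware.
    final class Vec {
        // c[i] += a[i] * b[i] for all i.
        static void fma(double[] a, double[] b, double[] c) {
            for (int i = 0; i < c.length; i++) {
                c[i] += a[i] * b[i];
            }
        }
        private Vec() {}
    }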
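
Loop kernel container: a hypothetical API (all names assumed) in roughly continuation-passing style, where each iteration either completes on the device or is handed back to the CPU for the infrequent edge cases listed above.

    interface Kernel {
        // Body of one loop iteration; reports its outcome through k.
        void run(int i, Continuation k);
    }
    interface Continuation {
        void done(int i);                  // iteration finished on the GPU
        void fallback(int i, Throwable t); // re-run iteration i on the CPU
    }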

FIXME: Most of these items need their own wiki pages and/or email conversations

Roadmap

FIXME: In what order will we address these challenges?

Known investigations

FIXME: Add your work here!

See something wrong on this page? Fix it!
