...
To overcome this problem, we propose to generate the MemBar in the load/store factory methods where required. On PPC and IA64, and on other platforms that emit the ordering operation together with the load, this MemBar can be omitted locally and a MemBarCPUOrder emitted instead. The MemBar nodes that remain can then be matched to proper ordering instructions on these platforms.
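To illustrate the proposal, here is a minimal sketch (plain C++, not HotSpot code; the names are made up) of the decision a load factory method would take:

// Toy model: choose the barrier node to generate together with an
// acquiring load in the factory method.
enum BarrierKind { CpuOrderBarrier, AcquireBarrier };

inline BarrierKind barrier_for_load_acquire(bool load_emits_ordering) {
  // PPC/IA64: the load pattern itself emits the ordering operation,
  // so only a MemBarCPUOrder (compiler-only ordering) is generated.
  // Other platforms: a full MemBarAcquire is generated.
  return load_emits_ordering ? CpuOrderBarrier : AcquireBarrier;
}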
Trampoline stubs
On PPC, relative branches and calls can only encode a 16-bit offset, i.e., they reach targets at most 64k away. This is not sufficient for calls to reach all code, thus a long branch is needed. As loading a 64-bit constant is expensive (5 instructions), and as patching it atomically is hard, we implement the long branch by loading the callee address from the constant pool. When a call instruction is relocated, we decide which branch to use.
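The long branch then boils down to three instructions. A sketch in the MacroAssembler style used in the listings below; the register choice, the offset name, and the TOC register are illustrative, not the actual stub code:

// Trampoline stub body: load the 64-bit callee address from the
// constant pool (TOC) and branch to it via the count register.
__ ld(R12, callee_offset, R_toc); // callee_offset: constant pool slot
__ mtctr(R12);                    // move the target address to CTR
__ bctr();                        // indirect branch through CTR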
Performance measurements showed that it is better to keep the long call in a stub. This also makes it easier to find a good instruction schedule: changing the instruction sequence after the schedule is fixed, as happens during relocation, can considerably break an instruction schedule for Power6. The Power6 processor derives a lot of information about implicit bundling from the previously executed instructions.
To implement this feature, we need a new relocation type trampoline_stub_Relocation.
http://hg.openjdk.java.net/ppc-aix-port/jdk8/hotspot/rev/d02f0701be17
Constants, Constant pool and Calls
On PPC, loading a constant into a register requires five instructions. These instructions can't easily be patched atomically. Therefore we choose to load all 64-bit constants from the constant pool. Unfortunately, the inline cache (IC) and the call target of Call nodes are not represented by Const nodes in the IR.
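For reference, materializing a 64-bit constant c on PPC takes a sequence like the following (one common variant, shown in MacroAssembler style; the exact sequence HotSpot emits may differ):

// Illustrative 5-instruction sequence loading a 64-bit constant c:
__ lis( Rd, (c >> 48) & 0xffff);     // bits 63..48
__ ori( Rd, Rd, (c >> 32) & 0xffff); // bits 47..32
__ sldi(Rd, Rd, 32);                 // shift into the upper word
__ oris(Rd, Rd, (c >> 16) & 0xffff); // bits 31..16
__ ori( Rd, Rd, c & 0xffff);         // bits 15..0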
One cannot use the adl constant pool functionality with Calls. Using $constanttablebase etc. with a call has two problems: it makes the call a subclass of MachConstantNode, but calls must be subclasses of SafePointNode. Further, it adds an edge holding the constant table base to the Call node. But Call nodes have a fixed layout of in-edges, so this is not possible either.
Therefore we do not use $constanttablebase for Calls. Instead, we add the edge to the table base in an extra phase after matching. This phase walks the IR and fixes up the Call nodes. For this we reuse the TypeFunc::ReturnAdr in-edge of Calls, which is not used on PPC. As this is an existing edge, the other phases relying on the fixed in-edge layout of Calls are not broken. To recognize nodes that need the constant table base input in this phase, we added a function ins_requires_toc() to MachNodes, which returns false by default, and true if specified in the adl instruct statements.
As Calls are not derived from MachConstantNode, the phase computing the size of the constant pool skips them. To fix this, we extend MachNodes by a function ins_num_consts() returning the space the node requires in the constant pool, and adapt the phase computing the space requirements.
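The following toy model (plain C++, not the actual HotSpot class hierarchy; the slot index and sizes are illustrative) captures the essence of the two hooks and the phase using them:

#include <cstddef>
#include <vector>

// Toy model of the two MachNode extensions described above.
struct Node {
  std::vector<Node*> in;                  // fixed in-edge layout
  virtual bool ins_requires_toc() const { return false; }
  virtual int  ins_num_consts()   const { return 0; }
  virtual ~Node() {}
};

struct CallNode : Node {
  enum { ReturnAdr = 2 };                 // illustrative in-edge slot
  CallNode() { in.resize(5, NULL); }
  virtual bool ins_requires_toc() const { return true; }
  virtual int  ins_num_consts()   const { return 8; } // one 64-bit slot
};

// Extra phase after matching: hand the constant table base to every
// node that asks for it, and sum up the constant pool space needed.
int fixup_toc_inputs(std::vector<Node*>& ir, Node* toc_base) {
  int pool_size = 0;
  for (std::size_t i = 0; i < ir.size(); ++i) {
    if (ir[i]->ins_requires_toc())
      ir[i]->in[CallNode::ReturnAdr] = toc_base; // reuse the unused edge
    pool_size += ir[i]->ins_num_consts();
  }
  return pool_size;
}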
We also use this functionality in the storeCM node, for which we implemented an optimization that requires a constant.
This change extends HotSpot with the new platform-dependent phase after matching:
http://hg.openjdk.java.net/ppc-aix-port/jdk8/hotspot/rev/45f271751014
And this one adds the PPC extensions for constants:
http://hg.openjdk.java.net/ppc-aix-port/jdk8/hotspot/rev/9a8d8eff3f61
Comment:
This change is designed to make as few functional changes to shared code as possible. Adding a new, but unused, phase to sparc and x86 should do no harm there. But because of this it is not a really clean solution to the problem. We would be happy to redesign Call nodes so that new edges can be added to them.
Also, we could redesign the $constanttablebase functionality to support further constants in a similar way (polling page, narrow oop base). Further, it would be helpful if it added a MachOper describing the new input edge. If a constant were recognized by having this operand, or by a function generated into the node, the problem with the ambiguous superclasses could be resolved.
Expanding nodes after register allocation (8003854)
We designed a compiler phase that expands IR nodes after register allocation. We call this phase lateExpand.
Some nodes cannot be expanded during matching: register allocation might not be able to deal with the resulting pattern, or global code motion might break some constraints. But instruction scheduling needs to be able to place each instruction individually, so a node should correspond to a single instruction if possible. The lateExpand phase, which runs after register allocation, solves this. Whether and how nodes are expanded is specified in the ad file. Shared code calls the expand routines if they are available. We use this for some nodes on PPC, and extensively on IA64.
LateExpand is called after register allocation, just before output (i.e., scheduling). It is only called if Matcher::require_late_expand is true. It expands compound nodes, which would require several assembler instructions to implement, into two or more non-compound nodes. The old compound node is simply replaced in its location in the basic block by a new subgraph that no longer contains compound nodes. The scheduler called during output can then process these non-compound nodes.
http://hg.openjdk.java.net/ppc-aix-port/jdk8/hotspot/rev/19846affb789
Implementation details
Nodes requiring late expand are specified in the ad file with a lateExpand statement instead of ins_encode. A lateExpand statement contains a single call to an encoding, just as an ins_encode statement does. For this encoding, instead of an emit() function, a lateExpand() function is generated that doesn't emit assembly but creates a new subgraph. A walk over the IR calls this lateExpand function for each node that specifies a lateExpand. The function returns the newly generated nodes in an array passed in the call. The old node, potential MachTemps before it, and potential Projs after it are then disconnected and replaced by the new nodes. The instruction generating the result has to be the last one in the array.

In general it is assumed that Projs after the expanded node are kills. These kills are no longer required after expanding, as there are now explicitly visible def-use chains, so the Projs are removed. This does not hold for calls: they not only have kill-Projs but also Projs defining values. Therefore Projs after the expanded node are removed for all nodes except calls. If one of the old nodes is to be reused, it must be added to the returned nodes list, and it will then be connected again.
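Conceptually, the walk performing the replacement looks like this toy model (plain C++, not the HotSpot implementation; the rewiring of MachTemps and Projs described above, as well as node ownership, is omitted):

#include <cstddef>
#include <vector>

struct PhaseRegAlloc;                     // register assignments, opaque here

struct MachNode {
  virtual bool requires_late_expand() const { return false; }
  // Appends the replacement subgraph to 'nodes'; the node generating
  // the result must come last.
  virtual void lateExpand(std::vector<MachNode*>* nodes, PhaseRegAlloc* ra) {}
  virtual ~MachNode() {}
};

// Replace each compound node in the block by its expansion, in place,
// so that the scheduler afterwards only sees single-instruction nodes.
void late_expand_block(std::vector<MachNode*>& block, PhaseRegAlloc* ra) {
  for (std::size_t i = 0; i < block.size(); ) {
    if (!block[i]->requires_late_expand()) { ++i; continue; }
    std::vector<MachNode*> subgraph;
    block[i]->lateExpand(&subgraph, ra);
    block.erase(block.begin() + i);       // remove the compound node
    block.insert(block.begin() + i, subgraph.begin(), subgraph.end());
    i += subgraph.size();                 // continue behind the new nodes
  }
}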
Implementing the lateExpand function for a node is rather tedious. It requires knowledge of many node details, as the nodes and the subgraph must be hand-crafted. To simplify this, adlc generates some utility variables into the lateExpand function, holding the ins and operands as specified by the lateExpand encoding specification:
- unsigned idx_<par_name> holding the index of the node in the in-edges (ins)
- Node *n_<par_name> holding the node loaded from the ins
- MachOper *op_<par_name> holding the corresponding operand
The ordering of operands cannot be determined by looking at a rule alone. Especially if a match rule matches several different trees, several nodes with different operand orderings are generated from one instruct specification. In this case the adlc-generated variables are the only way to access the ins and operands deterministically.
Example
Below you find an example of how to use lateExpand in the sparc.ad file. Further down you see the code generated by adlc. Perhaps you can find better use cases for this feature.
--- a/src/cpu/sparc/vm/sparc.ad 2012-11-21 12:27:04.591486000 +0100
+++ b/src/cpu/sparc/vm/sparc.ad 2012-11-19 14:45:15.059452000 +0100
@@ -1933,7 +1937,7 @@
}
// Does the CPU require late expand (see block.cpp for description of late expand)?
-const bool Matcher::require_late_expand = false;
+const bool Matcher::require_late_expand = true;
// Should the Matcher clone shifts on addressing modes, expecting them to
// be subsumed into complex addressing expressions or compute them into
@@ -7497,6 +7501,7 @@
// Register Division
instruct divI_reg_reg(iRegI dst, iRegIsafe src1, iRegIsafe src2) %{
match(Set dst (DivI src1 src2));
+ predicate(!UseNewCode);
ins_cost((2+71)*DEFAULT_COST);
format %{ "SRA $src2,0,$src2\n\t"
@@ -7506,6 +7511,68 @@
ins_pipe(sdiv_reg_reg);
%}
+//------------------------------------------------------------------------------------
+
+encode %{
+
+ enc_class lateExpandIdiv_reg_reg(iRegI dst, iRegIsafe src1, iRegIsafe src2) %{
+ MachNode *m1 = new (C) divI_reg_reg_SRANode();
+ MachNode *m2 = new (C) divI_reg_reg_SRANode();
+ MachNode *m3 = new (C) divI_reg_reg_SDIVXNode();
+
+ m1->add_req(n_region, n_src1);
+ m2->add_req(n_region, n_src2);
+ m3->add_req(n_region, m1, m2);
+
+ m1->_opnds[0] = _opnds[1]->clone(C);
+ m1->_opnds[1] = _opnds[1]->clone(C);
+
+ m2->_opnds[0] = _opnds[2]->clone(C);
+ m2->_opnds[1] = _opnds[2]->clone(C);
+
+ m3->_opnds[0] = _opnds[0]->clone(C);
+ m3->_opnds[1] = _opnds[1]->clone(C);
+ m3->_opnds[2] = _opnds[2]->clone(C);
+
+ ra_->set1(m1->_idx, ra_->get_reg_first(n_src1));
+ ra_->set1(m2->_idx, ra_->get_reg_first(n_src2));
+ ra_->set1(m3->_idx, ra_->get_reg_first(this));
+
+ nodes->push(m1);
+ nodes->push(m2);
+ nodes->push(m3);
+ %}
+%}
+
+instruct divI_reg_reg_SRA(iRegIsafe dst) %{
+ effect(USE_DEF dst);
+ size(4);
+ format %{ "SRA $dst,0,$dst\n\t" %}
+ ins_encode %{ __ sra($dst$$Register, 0, $dst$$Register); %}
+ ins_pipe(ialu_reg_reg);
+%}
+
+instruct divI_reg_reg_SDIVX(iRegI dst, iRegIsafe src1, iRegIsafe src2) %{
+ effect(DEF dst, USE src1, USE src2);
+ size(4);
+ format %{ "SDIVX $src1,$src2,$dst\n\t" %}
+ ins_encode %{ __ sdivx($src1$$Register, $src2$$Register, $dst$$Register); %}
+ ins_pipe(sdiv_reg_reg);
+%}
+
+instruct divI_reg_reg_Ex(iRegI dst, iRegIsafe src1, iRegIsafe src2) %{
+ match(Set dst (DivI src1 src2));
+ predicate(UseNewCode);
+ ins_cost((2+71)*DEFAULT_COST);
+
+ format %{ "SRA $src2,0,$src2\n\t"
+ "SRA $src1,0,$src1\n\t"
+ "SDIVX $src1,$src2,$dst" %}
+ lateExpand( lateExpandIdiv_reg_reg(src1, src2, dst) );
+%}
+
+//------------------------------------------------------------------------------------
+
// Immediate Division
instruct divI_reg_imm13(iRegI dst, iRegIsafe src1, immI13 src2) %{
match(Set dst (DivI src1 src2));
Code generated by adlc:
class divI_reg_reg_ExNode : public MachNode {
// ...
virtual bool requires_late_expand() const { return true; }
virtual void lateExpand(GrowableArray <Node *> *nodes, PhaseRegAlloc *ra_);
// ...
};
void divI_reg_reg_ExNode::lateExpand(GrowableArray <Node *> *nodes, PhaseRegAlloc *ra_) {
// Start at oper_input_base() and count operands
unsigned idx0 = 1;
unsigned idx1 = 1; // src1
unsigned idx2 = idx1 + opnd_array(1)->num_edges(); // src2
// Access to ins and operands for late expand.
unsigned idx_dst = idx1; // iRegI, src1
unsigned idx_src1 = idx2; // iRegIsafe, src2
unsigned idx_src2 = idx0; // iRegIsafe, dst
Node *n_region = lookup(0);
Node *n_dst = lookup(idx_dst);
Node *n_src1 = lookup(idx_src1);
Node *n_src2 = lookup(idx_src2);
iRegIOper *op_dst = (iRegIOper *)opnd_array(1);
iRegIsafeOper *op_src1 = (iRegIsafeOper *)opnd_array(2);
iRegIsafeOper *op_src2 = (iRegIsafeOper *)opnd_array(0);
Compile *C = Compile::current();
{
#line 7518 "/net/usr.work/d045726/oJ/8/main-hotspot-outputStream-test/src/cpu/sparc/vm/sparc.ad"
MachNode *m1 = new (C) divI_reg_reg_SRANode();
MachNode *m2 = new (C) divI_reg_reg_SRANode();
MachNode *m3 = new (C) divI_reg_reg_SDIVXNode();
m1->add_req(n_region, n_src1);
m2->add_req(n_region, n_src2);
m3->add_req(n_region, m1, m2);
m1->_opnds[0] = _opnds[1]->clone(C);
m1->_opnds[1] = _opnds[1]->clone(C);
m2->_opnds[0] = _opnds[2]->clone(C);
m2->_opnds[1] = _opnds[2]->clone(C);
m3->_opnds[0] = _opnds[0]->clone(C);
m3->_opnds[1] = _opnds[1]->clone(C);
m3->_opnds[2] = _opnds[2]->clone(C);
ra_->set1(m1->_idx, ra_->get_reg_first(n_src1));
ra_->set1(m2->_idx, ra_->get_reg_first(n_src2));
ra_->set1(m3->_idx, ra_->get_reg_first(this));
nodes->push(m1);
nodes->push(m2);
nodes->push(m3);
#line 11120 "../generated/adfiles/ad_sparc.cpp"
}
}
Trap based null checks
Trampoline relocations
...