Description
Let's look at a bad case for tracing JITs. The simple function we looked at in #6 worked really well but turned out to be quite fragile. Let us look at how to break it and see if we can learn something in the process.
I write this from my perspective as a humble application programmer who only needs to understand the fundamentals well enough to write efficient programs. I will not be offering any grand insights into compilers on this blog (though perhaps commenters will).
Background
Back in #6 I wrote a simple Lua module for Snabb Switch and found that the JIT compiled it surprisingly well and that the CPU executed the code surprisingly well too.
The code was a simple loop to bump a counter one billion times. The code is high-level Lua that even includes a hashtable lookup in the inner loop (looking up the add function from the counter table):
local counter = require("core.counter")
local n = 1e9
local c = counter.open("test")
for i = 1,n do
counter.add(c, 1)
end
LuaJIT compiled this down to a trace: a linear block of machine code that contains at most one loop and otherwise no internal branches. The loop compiled down to five instructions:
->LOOP:
0bcafab0 add rbp, +0x01
0bcafab4 mov [rcx+0x8], rbp
0bcafab8 add ebx, +0x01
0bcafabb cmp ebx, eax
0bcafabd jle 0x0bcafab0 ->LOOP
and I was pleasantly surprised when perf stat told me that my CPU executes all five of those instructions in one cycle:
5,021,702,476 instructions # 4.58 insns per cycle
1,097,518,679 cycles
That paints a very pretty picture. But it is easy to shake this up.
Just a little innocent change
Here is an innocent-looking new version of the program that contains an if to count 1 when the loop index is even and 10 otherwise:
local counter = require("core.counter")
local n = 1e9
local c = counter.open("test")
for i = 1,n do
if i % 2 == 0 then
counter.add(c, 1)
else
counter.add(c, 10)
end
end
How does this version run?
36,029,310,879 instructions # 2.38 insns per cycle
15,109,330,649 cycles
Oops! Now each iteration takes about 15 cycles and executes 36 instructions. That is roughly a 14x slowdown.
What happened?
The high-level explanation is actually straightforward. The first version runs fast because the loop executes very few instructions: most of the work like table lookups has been "hoisted" to run before entering the loop. The second version runs slowly because it is frequently repeating this setup work.
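Conceptually, the hoisting is as if we had written the hashtable lookup outside the loop ourselves. Here is a hand-written sketch of that idea (illustrative only; the real hoisting happens inside the compiled trace, not in the Lua source):

```lua
local counter = require("core.counter")
local n = 1e9
local c = counter.open("test")

local add = counter.add  -- hashtable lookup done once, before the loop
for i = 1,n do
   add(c, 1)             -- the loop body reduces to the counter bump itself
end
```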
To understand what happens we can read the JIT dumps for the first version and second version and draw pictures of the flow of control. In these diagrams each box is a trace i.e. a series of machine code instructions that will execute from top to bottom. Branches are drawn as arrows and there are two kinds: the loop back into an earlier part of the trace (at most one is allowed) or an exit to a different trace. The "hot" code that consumes the CPU is highlighted.
Here is a picture of the first version that runs fast:
and below is the full machine code. I won't bother to read every instruction here. My observation is that the proportions match the diagram: quite a lot of instructions upon entry and then a small number of instructions in the loop.
---- TRACE 3 mcode 212
0bcaf9f0 mov dword [0x41693410], 0x3
0bcaf9fb movsd xmm7, [rdx+0x20]
0bcafa00 cvttsd2si eax, xmm7
0bcafa04 xorps xmm6, xmm6
0bcafa07 cvtsi2sd xmm6, eax
0bcafa0b ucomisd xmm7, xmm6
0bcafa0f jnz 0x0bca0010 ->0
0bcafa15 jpe 0x0bca0010 ->0
0bcafa1b cmp eax, 0x7ffffffe
0bcafa21 jg 0x0bca0010 ->0
0bcafa27 cvttsd2si ebx, [rdx+0x18]
0bcafa2c cmp dword [rdx+0x4], -0x0c
0bcafa30 jnz 0x0bca0010 ->0
0bcafa36 mov r8d, [rdx]
0bcafa39 cmp dword [r8+0x1c], +0x0f
0bcafa3e jnz 0x0bca0010 ->0
0bcafa44 mov esi, [r8+0x14]
0bcafa48 mov rdi, 0xfffffffb4169d9e8
0bcafa52 cmp rdi, [rsi+0x140]
0bcafa59 jnz 0x0bca0010 ->0
0bcafa5f cmp dword [rsi+0x13c], -0x09
0bcafa66 jnz 0x0bca0010 ->0
0bcafa6c cmp dword [rdx+0x14], -0x0b
0bcafa70 jnz 0x0bca0010 ->0
0bcafa76 mov ecx, [rdx+0x10]
0bcafa79 cmp dword [rsi+0x138], 0x4172d7b0
0bcafa83 jnz 0x0bca0010 ->0
0bcafa89 movzx edx, word [rcx+0x6]
0bcafa8d cmp edx, 0x4f9
0bcafa93 jnz 0x0bca0010 ->0
0bcafa99 mov rbp, [rcx+0x8]
0bcafa9d add rbp, +0x01
0bcafaa1 mov [rcx+0x8], rbp
0bcafaa5 add ebx, +0x01
0bcafaa8 cmp ebx, eax
0bcafaaa jg 0x0bca0014 ->1
->LOOP:
0bcafab0 add rbp, +0x01
0bcafab4 mov [rcx+0x8], rbp
0bcafab8 add ebx, +0x01
0bcafabb cmp ebx, eax
0bcafabd jle 0x0bcafab0 ->LOOP
0bcafabf jmp 0x0bca001c ->3
---- TRACE 3 stop -> loop
So what changes in the second version that causes the inner loop to expand from 5 instructions up to 36? Here is the picture:
Now we have two traces: the original root trace and a new side trace. This is necessary because there is a branch (the if) in our code and traces are not allowed to have internal branches. The root trace will internally handle the case when the loop index is even but it will exit to the side trace when the index is odd. The side trace then rejoins the root trace. The effect is that the loop alternates between staying inside the root trace and exiting to the side trace.
The picture also illustrates the two reasons why we execute so many instructions now. First, the side trace is bigger than the loop in the root trace (i.e. it contains more instructions). Second, when the side trace branches back to the root trace it re-enters at the top instead of taking a short-cut into the inner loop. This means that overall we execute more instructions.
Let us zoom in to a bit more detail: first to look at the inner loop in the root trace, then to look at the side trace, and finally to look at the complete root trace that is running every time the side trace branches back.
Here is the new loop in the root trace (with added comments):
->LOOP:
0bcafaa0 test ebx, 0x1 ; loop index is odd?
0bcafaa6 jnz 0x0bca0024 ->5 ; yes: exit this trace
0bcafaac add rbp, +0x01
0bcafab0 mov [rcx+0x8], rbp
0bcafab4 add ebx, +0x01
0bcafab7 cmp ebx, eax
0bcafab9 jle 0x0bcafaa0 ->LOOP
The difference from the original trace is the two new instructions at the start. These test a guard for the trace (that the loop index must be even) and branch to an exit if this condition does not hold. So when the loop index happens to be even the execution will be very similar to the original version, but when the loop index is odd we will exit to the side trace.
Here is the code for the side trace:
---- TRACE 4 mcode 171
0bcaf922 mov dword [0x416ac410], 0x4
0bcaf92d mov edx, [0x416ac4b4]
0bcaf934 cmp dword [rdx+0x4], -0x0c
0bcaf938 jnz 0x0bca0010 ->0
0bcaf93e mov ebp, [rdx]
0bcaf940 cmp dword [rbp+0x1c], +0x0f
0bcaf944 jnz 0x0bca0010 ->0
0bcaf94a mov ebp, [rbp+0x14]
0bcaf94d mov rdi, 0xfffffffb416b69e8
0bcaf957 cmp rdi, [rbp+0x140]
0bcaf95e jnz 0x0bca0010 ->0
0bcaf964 cmp dword [rbp+0x13c], -0x09
0bcaf96b jnz 0x0bca0010 ->0
0bcaf971 cmp dword [rdx+0x14], -0x0b
0bcaf975 jnz 0x0bca0010 ->0
0bcaf97b mov r15d, [rdx+0x10]
0bcaf97f cmp dword [rbp+0x138], 0x413fe7b0
0bcaf989 jnz 0x0bca0010 ->0
0bcaf98f movzx ebp, word [r15+0x6]
0bcaf994 cmp ebp, 0x4f9
0bcaf99a jnz 0x0bca0010 ->0
0bcaf9a0 mov rbp, [r15+0x8]
0bcaf9a4 add rbp, +0x0a
0bcaf9a8 mov [r15+0x8], rbp
0bcaf9ac add ebx, +0x01
0bcaf9af cmp ebx, eax
0bcaf9b1 jg 0x0bca0014 ->1
0bcaf9b7 xorps xmm7, xmm7
0bcaf9ba cvtsi2sd xmm7, ebx
0bcaf9be movsd [rdx+0x30], xmm7
0bcaf9c3 movsd [rdx+0x18], xmm7
0bcaf9c8 jmp 0x0bcaf9d4
---- TRACE 4 stop -> 3
I have not read this code in detail but here are a couple of observations:
- It contains 32 instructions. This side trace will run on every second iteration of the loop and so it seems to account for about half of the 36 instructions per iteration reported by perf.
- The trace ends by branching to address 0bcaf9d4, which turns out to be the beginning of the root trace (not the inner loop).
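A quick back-of-envelope check, using the instruction counts from the listings in this post (plain Lua, runnable with any standalone interpreter; the averaging over an even/odd pair of iterations is my own rough model, not perf's methodology):

```lua
-- Approximate instructions per iteration, averaged over an even/odd pair.
local even_iter = 7             -- the guarded inner loop (even index)
local odd_iter  = 2 + 32 + 38   -- taken guard exit + side trace + root re-entry
print((even_iter + odd_iter) / 2)
-- prints 39.5: in the same ballpark as the ~36 per iteration that perf reports
```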
Here finally is the entire root trace, this time including the initial code before the loop that executes when the side trace branches back:
---- TRACE 3 mcode 236
0bcaf9d4 mov dword [0x416ac410], 0x3
0bcaf9df movsd xmm7, [rdx+0x20]
0bcaf9e4 cvttsd2si eax, xmm7
0bcaf9e8 xorps xmm6, xmm6
0bcaf9eb cvtsi2sd xmm6, eax
0bcaf9ef ucomisd xmm7, xmm6
0bcaf9f3 jnz 0x0bca0010 ->0
0bcaf9f9 jpe 0x0bca0010 ->0
0bcaf9ff cmp eax, 0x7ffffffe
0bcafa05 jg 0x0bca0010 ->0
0bcafa0b cvttsd2si ebx, [rdx+0x18]
0bcafa10 test ebx, 0x1
0bcafa16 jnz 0x0bca0014 ->1
0bcafa1c cmp dword [rdx+0x4], -0x0c
0bcafa20 jnz 0x0bca0018 ->2
0bcafa26 mov r8d, [rdx]
0bcafa29 cmp dword [r8+0x1c], +0x0f
0bcafa2e jnz 0x0bca0018 ->2
0bcafa34 mov esi, [r8+0x14]
0bcafa38 mov rdi, 0xfffffffb416b69e8
0bcafa42 cmp rdi, [rsi+0x140]
0bcafa49 jnz 0x0bca0018 ->2
0bcafa4f cmp dword [rsi+0x13c], -0x09
0bcafa56 jnz 0x0bca0018 ->2
0bcafa5c cmp dword [rdx+0x14], -0x0b
0bcafa60 jnz 0x0bca0018 ->2
0bcafa66 mov ecx, [rdx+0x10]
0bcafa69 cmp dword [rsi+0x138], 0x413fe7b0
0bcafa73 jnz 0x0bca0018 ->2
0bcafa79 movzx edx, word [rcx+0x6]
0bcafa7d cmp edx, 0x4f9
0bcafa83 jnz 0x0bca0018 ->2
0bcafa89 mov rbp, [rcx+0x8]
0bcafa8d add rbp, +0x01
0bcafa91 mov [rcx+0x8], rbp
0bcafa95 add ebx, +0x01
0bcafa98 cmp ebx, eax
0bcafa9a jg 0x0bca001c ->3
->LOOP:
0bcafaa0 test ebx, 0x1
0bcafaa6 jnz 0x0bca0024 ->5
0bcafaac add rbp, +0x01
0bcafab0 mov [rcx+0x8], rbp
0bcafab4 add ebx, +0x01
0bcafab7 cmp ebx, eax
0bcafab9 jle 0x0bcafaa0 ->LOOP
0bcafabb jmp 0x0bca0028 ->6
The trace executes an additional 38 instructions before entering the loop. This path will be taken on every second loop iteration, when the exit to the side trace is taken and it branches back to the top. That would seem to account for the rest of the instructions reported by perf.
So...?
If I were a compiler expert then this is where I would explain why the code compiles in this way and provide interesting links to all the relevant research. But I am not. So all I can really state are my own personal observations.
- The trace structure seems to be significant. The first version was awesomely fast because it kept the CPU on the same trace for an extended period of time.
- Microbenchmarks can be very hard to interpret. How do you decide whether a particular programming construct (for example the counter.add(...) module lookup) is cheap or expensive? The answer seems very context dependent.
- Good profiling tools and a strong mental model are probably critical to achieving predictably high performance.
- Now I understand the LuaJIT numerical computing optimization tip: "Reduce unbiased branches. Heavily biased branches (>95% in one direction) are fine." If we were only executing the side trace on 1/20th of the iterations then the cost would not be a problem.
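To make that last point concrete, here is a variant of the program where the branch is heavily biased, so the side trace would only be taken on one iteration in twenty. This is my own untested sketch following the tip above, not something I have measured:

```lua
local counter = require("core.counter")
local n = 1e9
local c = counter.open("test")
for i = 1,n do
   if i % 20 == 0 then     -- biased branch: true only 5% of the time
      counter.add(c, 10)
   else
      counter.add(c, 1)
   end
end
```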
I find this all very interesting. The issues are subtle and complex, but most languages and platforms are subtle and complex to optimize when you first start taking them seriously. I am happy to keep geeking out on these ideas.
Thanks
Thanks to Alex Gall and Andy Wingo for mentioning the issue of side-traces re-entering their parent traces at the root, so that I could recognise it when I saw it too.