Faster software through register based calling
Posted on
The release notes for Go 1.17 mention
an interesting change in the Go compiler:
function arguments and return values will now be passed using registers
instead of the stack
. The
proposal document
for the feature mentions an expected 5-10% throughput improvement across a
range of applications
which is significant, especially for no effort on the
developers part aside from recompiling with a newer version of the Go compiler.
I was curious to see what this actually looks like and decided to take a deeper
look. This will get more than a little nerdy so buckle up!
Note that although the catalyst for this blog article was a change in Go, much of this article should be of interest generally even if you don't use Go.
Refresher: Registers
At this point it's helpful to remind ourselves of what CPU registers are. In a nutshell, they are small amounts of high speed temporary memory built into the processor. Each register has a name and stores one word of data each - this is 64 bits on almost all modern computers.
Some registers are general purpose while others have specific functions. In this article you'll come across the AX, BX and CX general purpose registers, as well as the SP (stack pointer) register which is special purpose.
Refresher: the Stack
It's also useful to remind ourselves what the stack is in a computer program. It's a chunk of memory that's placed at the top of a program's memory. The stack is typically used store local variables, function arguments, function return values and function return addresses. The stack grows downwards as items are added to it.
Eli Bendersky has an excellent article about how stacks work which contains this helpful diagram:
When items are added to the stack we say that they are pushed onto the stack. When items are removed from the stack we say that they are popped off the stack. There are x86 CPU instructions for pushing and popping data onto and off the stack.
The SP register mentioned earlier points to the item currently at the top of the stack.
Note that I'm taking some liberties here. The stack works like this on x86 computers (and many other CPU architectures) but not on all of them.
Calling Conventions
In compiled software, when some code wants to call a function, the arguments for that function need to somehow be passed to the function (and the return values need to be passed back somehow when the function completes). There are different agreed-upon ways to do this and each style of passing arguments and return values around is a "calling convention".
The part of the Go 1.17 release notes quoted above is really about a change in Go's calling conventions.
This is all hidden from you unless you're programming in assembler or are trying to make bits of code written in different programming languages work together. Even so, it's still interesting to see how the machinery works under the hood.
A Small Program
In order to compare the code the Go compiler generates in 1.16 vs 1.17 we need a simple test program. It doesn't have to do much, just call a function that takes a couple of arguments which then returns a value. Here's the trivial program I came up with:
package main
import "fmt"
func add(i, j int) int {
return i + j
}
func main() {
z := add(22, 33)
fmt.Println(z)
}
Disassembling
In order to see the CPU instructions being generated by the Go compiler we need
a disassembler. One tool that can do this is venerable
objdump which comes with the GNU
binutils suite and may already be installed if you're running
Linux. I'll be using objdump
in this article.
go tool objdump
command.
It's tempting to use this output for our exploration here but this intermediate assembly language isn't necessarily a direct representation of the machine code that will be generated for a given platform. For this reason I've chosen to stick with objdump.
Go 1.16's Output
Let's take a look at the output from Go 1.16 which we expect to be using stack based calling. First let's build the binary using Go 1.16 and make sure it works:
$ go1.16.10 build -o prog-116 ./main.go
$ ./prog-116
55
Great! Now lets disassemble it to see the generated instructions:
$ objdump -d prog-116 > prog-116.asm
The first thing I noticed is that there's quite of lot of code:
$ wc -l prog-116.asm
164670 prog-116.asm
That's a lot of instructions for such a small program but this is because every Go program includes the Go runtime which is a non-trivial amount of software for scheduling goroutines and providing all the conveniences we expect as Go developers. Fortunately for us, the instructions directly relating to the code in our test program are right at the bottom:
(I'm omitting the offsets and raw bytes that objdump normally provides for clarity; also some of Go's setup code)
0000000000497640 <main.main>:
...
movq $0x37,(%rsp)
call 40a3e0 <runtime.convT64>
mov 0x8(%rsp),%rax
xorps %xmm0,%xmm0
movups %xmm0,0x40(%rsp)
lea 0xaa7e(%rip),%rcx # 4a2100 <type.*+0xa100>
mov %rcx,0x40(%rsp)
mov %rax,0x48(%rsp)
mov 0xb345d(%rip),%rax # 54aaf0 <os.Stdout>
lea 0x4290e(%rip),%rcx # 4d9fa8 <go.itab.*os.File,io.Writer>
mov %rcx,(%rsp)
mov %rax,0x8(%rsp)
lea 0x40(%rsp),%rax
mov %rax,0x10(%rsp)
movq $0x1,0x18(%rsp)
movq $0x1,0x20(%rsp)
nop
call 491140 <fmt.Fprintln>
...
Weird! This doesn't look like our code at all. Where's the call to our add
function? In fact, movq $0x37,(%rsp)
(move the value 0x37 to memory
location pointed to by the stack pointer register) looks super suspicious. 22 +
33 = 55 which is 0x37 in hex. It looks like the Go compiler has optimised the
code, working out the addition at compile time, eliminating most of our code in
the process!
In order to study this further we need to tell the Go compiler to not
inline the add function which can be
done using a special comment to annotate the add function. add
now looks like this:
//go:noinline
func add(i, j int) int {
return i + j
}
Compiling the code and running objdump
again, the disassembly looks more as
we might expect. Let's start with main()
- I've broken up the disassembly
into pieces and added commentary.
The main func begins with base pointer and stack pointer initialisation:
0000000000497660 <main.main>:
mov %fs:0xfffffffffffffff8,%rcx
cmp 0x10(%rcx),%rsp
jbe 497705 <main.main+0xa5>
sub $0x58,%rsp
mov %rbp,0x50(%rsp)
lea 0x50(%rsp),%rbp
followed by,
movq $0x16,(%rsp)
movq $0x21,0x8(%rsp)
Here we see the arguments to add
being pushed onto the stack in preparation
for the function call. 0x16 (22) is moved to where the stack pointer is
pointing. 0x21 (33) is copied to 8 bytes after where the stack pointer is
pointing (so earlier in the stack). The offset of 8 is important because we're
dealing with 64-bit (8 byte) integers. An 8 byte offset means the 33 is placed
on the stack directly after the 22.
call 4976e0 <main.add>
mov 0x10(%rsp),%rax
mov %rax,0x30(%rsp)
Here's where the add function actually gets called. When the call
instruction
is executed by the CPU, the current value of the instruction pointer is pushed
to the stack and execution jumps to the add
function. Once add
returns,
execution continues here where z
(stack pointer + 0x30 as it turns out) is
assigned to the returned value (stack pointer + 0x10). The AX register is used
as temporary storage when moving the return value.
There's more code that follows in main to handle the call the fmt.Println but that's outside the scope of this article.
One thing I found interesting looking at this code is that the classic push
instructions aren't being used to add values onto the stack. Values are placed
onto the stack using mov
. It turns out that this is for performance reasons.
A mov
generally requires fewer CPU cycles than a push
.
We should also have a look at add
:
0000000000497640 <main.add>:
mov 0x10(%rsp),%rax
mov 0x8(%rsp),%rcx
add %rcx,%rax
mov %rax,0x18(%rsp)
ret
The second argument (at SP + 0x10) is copied to the AX register, and the first
argument (at SP + 0x08) is copied to the CX register. But hang on, weren't the
arguments at SP and SP+0x10? They were but when the call
instruction was
executed, the instruction pointer was pushed to the stack which means the stack
pointer had to be decremented to make room for it - this means the offsets to
the arguments have to be adjusted to account for this.
The add
instruction is easy enough to understand. Here CX and AX are added
(with the result left in AX). The result is then pushed to the return location
(SP + 0x18).
The ret
(return) instruction grabs the return address off the stack and
starts execution just after the call
in main
.
Phew! That's a lot of code for a simple program. Although it's useful to understand what's going on, be thankful that we don't have to write assembly language very often these days!
Examining Go 1.17's Output
Now let's take a look at the same program compiled with Go 1.17. The compilation and disassembly steps are similar to Go 1.16:
$ go build -o prog-117 ./main.go
$ objdump -d prog-117 > prog-117.asm
The main disassembly starts the same as under Go 1.16 but - as expected -
differs in the call to add
:
mov $0x16,%eax
mov $0x21,%ebx
xchg %ax,%ax
call 47e260 <main.add>
Instead of copying the function arguments onto the stack, they're copied into the AX and BX registers. This is the essence of register based calling.
The xchg %ax,%ax
instruction is a bit more mysterious and I only have
theories regarding what it's for. Email me if you know and I'll add the detail
here.
Update: The xchg %ax,%ax
instruction is almost certainly there to work
around a bug
in a number of Intel processors. The instruction ("exchange AX
with AX") does nothing but introduces two bytes of padding before the call
instruction that follows - this serves to work around the processor bug. There's a
Go GitHub issue which has much more detail.
Thank you to the many people who wrote about this.
As we've already seen earlier, the call
instruction moves execution to the
add
function.
Now let's take a look at add
:
000000000047e260 <main.add>:
add %rbx,%rax
ret
Well that's simple! Unlike the Go 1.16 version, there's no need to move
arguments from the stack into registers in order to add them, and there's no
need to move the result back to the stack. The function arguments are expected to
be in the AX and BX registers, and the return value is expected to come back
via AX. The ret
instruction moves execution back to where call
was executed,
using the return address that call
left on the stack.
With so much less work being done when handling function arguments and return values, it's starting to become clearer why register based calling might be faster.
Performance Comparison
So how much faster is register based calling? I created a simple Go benchmark program to check:
package main
import "testing"
//go:noinline
func add(i, j int) int {
return i + j
}
var result int
func BenchmarkIt(b *testing.B) {
x := 22
y := 33
var z int
for i := 0; i < b.N; i++ {
z = add(x, y)
}
result = z
}
Note the use of a variable outside of the benchmark function to ensure that the
compiler won't optimise
the assignment to z
away.
The benchmark can be run like this:
go test bench_test.go -bench=.
On my somewhat long in the tooth laptop, the best result I could get under Go 1.16 was:
BenchmarkIt-4 512385438 2.292 ns/op
With Go 1.17:
BenchmarkIt-4 613585278 1.915 ns/op
That's a noticeable improvement - a 16% reduction in execution time for our example. Not bad, especially as the improvement comes for free for all Go programs just by using a newer version of the compiler.
Conclusion
I hope you found it interesting to explore some lower level details of software that we don't think about much these days and that you learned something along the way.
Many thanks to Ben Hoyt and Brian Thorne for their detailed reviews and input into this article.
Update: This article ended up generating quite a bit of discussion elsewhere, in particular: