I love working with OS signals in Go because goroutines and channels make it so easy. Plus it feels cool to break out of the ho-hum of everyday process execution and get a chance to really communicate with the OS. ;)

A project I work on recently had a process crash involving the SIGBUS signal, and it taught me not to let the signal name get lost in the noise.

The Crash

A few weeks ago, the SRE team reported occasional pod restarts in a pre-prod environment. The stack trace typically looked like this, always in geoip2-golang.(*Reader).Enterprise:

unexpected fault address 0xffff31b5ee24
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0xffff31b5ee24 pc=0xb14c5c]

goroutine 851 [running]:
runtime.throw({0x17736e4?, 0x0?})
	/usr/local/go/src/runtime/panic.go:1077 +0x40 fp=0x400059cb40 sp=0x400059cb10 pc=0x43be60
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:858 +0xec fp=0x400059cba0 sp=0x400059cb40 pc=0x453dac
github.com/oschwald/maxminddb-golang.nodeReader32.readRight(...)
	/app/vendor/github.com/oschwald/maxminddb-golang/node.go:57
github.com/oschwald/maxminddb-golang.(*nodeReader32).readRight(0x400059cc08?, 0x40ffb4?)
	<autogenerated>:1 +0x5c fp=0x400059cbd0 sp=0x400059cbb0 pc=0xb14c5c
github.com/oschwald/maxminddb-golang.(*Reader).traverseTree(0x4000114b40, {0x400059cdd4, 0x4, 0x1c0?}, 0x1b8?, 0x20)
	/app/vendor/github.com/oschwald/maxminddb-golang/reader.go:288 +0xd8 fp=0x400059cc00 sp=0x400059cbd0 pc=0xb12758
github.com/oschwald/maxminddb-golang.(*Reader).lookupPointer(0x4000114b40, {0x400059cdc8?, 0x400059cc01?, 0xaf2610?})
	/app/vendor/github.com/oschwald/maxminddb-golang/reader.go:264 +0x154 fp=0x400059cc80 sp=0x400059cc00 pc=0xb12524
github.com/oschwald/maxminddb-golang.(*Reader).Lookup(0x4000114b40, {0x400059cdc8?, 0x1629960?, 0x16d4d20?}, {0x14da840, 0x40002188c0})
	/app/vendor/github.com/oschwald/maxminddb-golang/reader.go:137 +0x3c fp=0x400059ccb0 sp=0x400059cc80 pc=0xb11b6c
github.com/oschwald/geoip2-golang.(*Reader).Enterprise(0x400035bb30, {0x400059cdc8, 0x10, 0x10})
	/app/vendor/github.com/oschwald/geoip2-golang/reader.go:324 +0xec fp=0x400059cd80 sp=0x400059ccb0 pc=0xb1549c
github.com/NBCUDTC/cadence-client-api-src/internal/lib/maxmind.(*Reader).LookupByIp(0x40000cec80, {0xfb5264?, 0x400059cea8?}, {0x4000571a00, 0xb})
	/app/internal/lib/maxmind/maxmind.go:140 +0x128 fp=0x400059ce60 sp=0x400059cd80 pc=0xb16df8
github.com/NBCUDTC/cadence-client-api-src/internal/http/controllers.(*radController).geoLookup(0x17496c0?, {0x19d5b48?, 0x400029a140?}, {0x4000571a00?, 0x400023eef8?})
...

This code is looking up the geographic location of an IP address from a memory-mapped MaxMind GeoIP database file.

Expectations

My quick conclusion was that the file was getting corrupted somehow. There’s an updater process that pulls the file from MaxMind on some interval and writes it into an EFS volume. Either the original file from MaxMind was corrupt or the updater was somehow corrupting it. It turns out this conclusion was correct, but not in the sense that I initially expected.

With the file being memory-mapped, we expect that its contents will never change. When the updater process puts the new file into place, it needs to do so atomically so that no reader can ever see a partial file. If we want to keep using the same filename after the file is updated, the updater needs to write all the bytes to a different file and then move that file to the expected filename. At the shell, mv would accomplish this as long as both names are on the same filesystem; in code, the equivalent is a rename call. After the rename, the old bytes are accessible only to processes that already have the old file open, and the new bytes are available for reading the next time the file is opened.

But the updater isn’t doing that. It is using cp to put the updated file in place. The cp command copies one file to another by reading the original file and writing its contents to the destination file. If the destination file exists, cp truncates it to zero length before writing the new contents. Other processes can observe this as it happens, which is a complication worth avoiding.

That the process is getting a SIGBUS is a huge clue pointing to this bug. If the updater were atomically updating the file with corrupt contents then we’d see an error when trying to open the updated file, but SIGBUS on read points to the contents changing within the currently open file.

SIGBUS

SIGBUS is sent to the process if it tries to access memory that the CPU can’t physically access. Like the perhaps more familiar Segmentation Fault (SIGSEGV) this is a serious error for a program because there’s no way to continue with the normal execution flow. The code has tried to access some memory that can’t be accessed. The only thing to do is to interrupt the execution flow, which is what OS signals are for. Check out the GeeksForGeeks comparison of SIGBUS and SIGSEGV for more detail.

When the mmap’d file is truncated on disk, the process doesn’t automatically find this out. Through the mmap abstraction the geoip2.Reader is still happily seeking around this file looking for the data. As soon as it tries to read beyond the new end of the file, the SIGBUS is generated.

The Fix Within Our Code

When I first saw this crash I didn’t know about SIGBUS. I saw a crash reading the file and assumed “corrupted file”, with a few suspects outside my control. What could I do? A quick search of the library’s GitHub issues turned up a couple like this one, where others had run into the same crash, and they pointed to the likely cause of the SIGBUS.

The suggested solution is to read the file into a buffer and then pass it into geoip2.FromBytes(). The process won’t get a SIGBUS if it’s reading from a buffer. Instead, if we happen to read a truncated file, the library will return an error because its metadata section is at the end. There are a few sanity checks for this section, the simplest being the search for the bytes []byte("\xAB\xCD\xEFMaxMind.com") indicating its start. That makes reading to a buffer an adequate, but not ideal, solution to this problem. Clearly the better solution is to make the updater update the file atomically.

Panic Instead of Crash

While we wait for another team to get around to fixing their code, we can learn more about signals and handling them. Normally some signals like SIGBUS and SIGSEGV will cause the process to exit. That’s what’s happening here, except the Go runtime’s signal handler is catching the signal first and turning it into the “fatal error: fault” seen in the stack trace above, which can’t be recovered. Go also provides the ability to convert this into an ordinary, recoverable panic via runtime/debug.SetPanicOnFault() (godoc), which presents yet another potential solution to this problem.

This code demonstrates it by creating a SIGBUS by mmap-ing an empty file using a non-zero length. After SetPanicOnFault(true) is called, the signal is converted to a panic which is then caught. If we were to use this in our code we could catch the panic and forbid future reads until the file is reopened successfully.

package main

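import (
	"fmt"
	"os"
	"runtime/debug"
	"syscall"
)

func main() {
	doRecover()
}

func doRecover() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("caught panic:", r)
		}
	}()

	doFault()
}

func doFault() {
	// Set PanicOnFault=true for the duration of this function (the setting
	// applies only to the current goroutine), so the SIGBUS surfaces as a
	// normal panic instead of a crash. The deferred call restores the
	// previous setting.
	defer debug.SetPanicOnFault(debug.SetPanicOnFault(true))

	// os.Create truncates the file, so it is empty.
	mapFile, err := os.Create("test.txt")
	if err != nil {
		fmt.Println("create failed:", err)
		os.Exit(1)
	}
	defer mapFile.Close()

	// Map 10 bytes of a zero-length file: the mapping succeeds, but there
	// is no file data behind it.
	data, err := syscall.Mmap(
		int(mapFile.Fd()),
		0,
		10,
		syscall.PROT_WRITE,
		syscall.MAP_PRIVATE,
	)
	if err != nil {
		fmt.Println("mmap failed:", err)
		os.Exit(1)
	}

	// Touching the mapped page raises SIGBUS.
	data[0] = byte(0)
}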

Try it out in the Go playground.