Last week, I found and fixed two crashing bugs in etcd, the distributed key-value store used (among other things) to manage the state of Kubernetes clusters. I’m excited to have been able to contribute a bit to such an excellent project!
I didn't specifically set out to work on etcd. I work on Mayhem for API, and my goal was just to integrate the etcd server into our internal fuzz testing harness. We use this harness to validate changing versions of Mayhem against unchanging versions of third party services. While we report any serious findings responsibly, this part of our system is mostly interested in Mayhem's behavior, not that of the service we're fuzzing.
Here's the thing: I know almost nothing about etcd. But it's an especially easy project to start fuzzing for two reasons: it already has a thorough OpenAPI spec, and it defaults to a completely unauthenticated mode, meaning we can fuzz the entire API without setting up credentials.
I built and started etcd like this (and I wish all software were this easy to start!):
And then I started a 5 minute Mayhem run, like this:
$ mapi run etcd 300 rpc.swagger.json --url http://localhost:2379
What's great is that, even knowing almost nothing about etcd, once I started fuzzing, Mayhem uncovered crashing bugs in less than a minute.
Mayhem for API is still a young project. Crashing bugs, where the service we’re fuzzing falls over completely, are something we haven't faced often enough to handle very cleanly until now. In other words, before I could fix the etcd crashes, I needed to make Mayhem usefully report them, the way it already usefully reports lots of other kinds of issues.
When I first started fuzzing etcd, it crashed within seconds. However, Mayhem treated the observations it made when etcd died (closed connections and refused connections, mostly) as transient, and eventually gave up... without reporting any issues.
Reporting "failure to respond" style issues—which is what a crash tends to look like to a network client like Mayhem—isn't terribly hard. Reporting only the ones that actually matter is harder, since (from the Mayhem perspective) the behavior of the request that causes the crash is nearly identical to many of the subsequent failures. And we know: false positives from any automated testing tool are the worst! But I'm happy to say that, with some work, we were able to get Mayhem to report just the crash-causing requests without any false positives.
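To make the attribution problem concrete, here is a minimal sketch of one possible heuristic. It is not Mayhem's actual logic, which this post does not describe. The outcome labels, the `Observation` type, and `find_crash_suspect` are all hypothetical names chosen for illustration. The idea is to blame only the first failure in a run of failures that never recovers, and to treat any failure followed by a later success as transient.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical outcome labels; a real client would derive these from
# HTTP responses and socket errors.
OK = "ok"
CLOSED = "connection_closed"    # server took the request, then hung up
REFUSED = "connection_refused"  # nothing is listening any more


@dataclass
class Observation:
    request_id: int
    outcome: str


def find_crash_suspect(observations: List[Observation]) -> Optional[int]:
    """Return the id of the request most likely to have caused a crash.

    Heuristic (an assumption, not Mayhem's algorithm): the suspect is the
    first failed request in a run of failures that lasts until the end of
    the log. Failures followed by a later success count as transient.
    """
    suspect: Optional[int] = None
    for obs in observations:
        if obs.outcome == OK:
            # The server answered again, so earlier failures were transient.
            suspect = None
        elif suspect is None:
            # First failure of a new, so far unbroken, run of failures.
            suspect = obs.request_id
    return suspect
```

With a log where request 2 fails but request 3 succeeds, and then requests 4 through 6 all fail, this returns 4: the refused connections after it are treated as consequences rather than as separate findings.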
The only other tricky bit was a surprising interaction with our "minimization" feature: trying to determine the smallest-possible crash-reproducing payload against a server that has already crashed tended to decide that the entire payload was superfluous. Oops.
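The minimization pitfall can be shown with a toy model. This is a sketch under stated assumptions, not Mayhem's minimizer: `minimize`, `FakeServer`, and the rule that any payload containing a `bad` key kills the server are all invented for illustration. The minimizer drops one top-level field at a time and keeps the smaller payload whenever the server still looks crashed afterwards.

```python
from typing import Any, Callable, Dict

Payload = Dict[str, Any]


def minimize(
    payload: Payload,
    looks_crashed: Callable[[Payload], bool],
    restart: Callable[[], None],
) -> Payload:
    """Greedily drop top-level fields that are not needed to cause a crash."""
    current = dict(payload)
    for key in list(payload):
        candidate = {k: v for k, v in current.items() if k != key}
        restart()  # every trial must start against a live server
        if looks_crashed(candidate):
            current = candidate  # the field was not needed for the crash
    return current


class FakeServer:
    """Toy stand-in for a service: dies on any payload containing 'bad'."""

    def __init__(self) -> None:
        self.alive = True

    def restart(self) -> None:
        self.alive = True

    def looks_crashed(self, payload: Payload) -> bool:
        # Send the payload, then report whether the server is unresponsive.
        if self.alive and "bad" in payload:
            self.alive = False
        return not self.alive
```

If `restart` really brings the toy server back, minimizing `{"a": 1, "bad": 2, "b": 3}` yields `{"bad": 2}`. If `restart` is a no-op and the server is already dead, every candidate looks like a crash and the result is `{}`, which is the "entire payload was superfluous" outcome described above.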
The first etcd crash Mayhem found was cleanly deterministic. Starting from a completely empty etcd, Mayhem quickly reported:
with the payload:
What does this mean? It’s basically an input validation oversight: we’re telling etcd to ‘ACTIVATE’ an alarm object that’s missing a valid ‘alarm’ field. But the effects of this oversight are worse than most: the single “poisoned” HTTP payload completely shut down the etcd server and also corrupted its persistent state on disk, so etcd could not start again until that state was cleaned up. This is unlikely to be a security problem (etcd shouldn’t normally be exposed on an insecure network, and even internally, should always be authenticated!) but nobody likes their software to crash on malformed inputs, so I put together a fix.
This was a great first issue to learn my way around the etcd repo a little bit, since it was cleanly reproducible and dramatically observable. And I'm happy to say that my fix PR was merged, thanks to quick turnaround from the etcd team, and this crash no longer occurs in etcd 3.5 and up!
The second etcd crash Mayhem found was a bit trickier at first, although it ended up being quite similar to the first. Within a few minutes of starting against an empty etcd, Mayhem reported:
with the reported payload:
Unfortunately, this issue wasn’t reproducible without setup. With a little bit of digging, though, I was able to figure out that the reported payload was harmless unless the role (‘str’ in the above example… but the name isn’t important!) had previously been created, in which case it caused the crash. It took some time—on the order of a few minutes—for Mayhem fuzzing to successfully create a role and attempt to grant nonexistent permissions to the same role. (This, by the way, is why we strongly suggest running Mayhem against non-empty datasets.)
So what’s the bug, this time? Once again, it’s basically an input validation oversight: we’re telling etcd to grant a permission to a role named ‘str’, but we’re not specifying which permission to grant. Internally, this causes an unhandled panic. As with the previous bug, the bad request completely shut down the etcd server, and corrupted its state on disk.
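The two-step nature of this repro (create the role first, then send the incomplete grant) can be sketched as an ordered list of requests. This is a reconstruction from the description above, not the payload Mayhem reported. The endpoint paths `/v3/auth/role/add` and `/v3/auth/role/grant` and the `name` and `perm` field names are assumptions based on etcd's v3 gRPC gateway and should be checked against `rpc.swagger.json` for the version under test. `build_repro_steps` and `send_steps` are hypothetical helpers.

```python
import json
import urllib.request
from typing import Any, Dict, List, Tuple

BASE_URL = "http://localhost:2379"  # same address as the mapi run above

Step = Tuple[str, Dict[str, Any]]


def build_repro_steps(role: str) -> List[Step]:
    """Return the setup request followed by the incomplete grant request."""
    return [
        # Setup: the role has to exist before the second request matters.
        ("/v3/auth/role/add", {"name": role}),
        # Trigger: grant to the same role with no "perm" field (assumed name).
        ("/v3/auth/role/grant", {"name": role}),
    ]


def send_steps(steps: List[Step], base_url: str = BASE_URL) -> None:
    """POST each step in order. Only point this at a disposable test etcd."""
    for path, body in steps:
        request = urllib.request.Request(
            base_url + path,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=5) as response:
            print(path, response.status)
```

Keeping the steps as plain data separates "what the sequence is" from "sending it", so the sequence can be inspected or replayed without touching a live server.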
Armed with a good repro case, I was again able to get a fix PR merged. This crash, too, no longer occurs in etcd 3.5 and up!
Contributing to etcd was a pleasure. My thanks to the maintainers for working with me to get those PRs shipped, and for making awesome, useful software. Keep it up.
Now that etcd is playing nicely with our long-term internal fuzzing harness, I'm moving on to other projects. If Mayhem finds any more serious issues, we will of course report and/or fix them upstream. That said, our goal isn't to fuzz etcd for issues: it's to make Mayhem for API the best tool it can be for API owners to fuzz their own services!
An etcd expert could learn a lot more than I can by fuzzing with Mayhem, in particular by setting up a cluster with more realistic configuration options, and populating it with some real data. (Please get in touch if you want to do that: we'd love to help!)
And of course... if you have an API of your own, download Mayhem for API, give it a try, and let us know what you think :D