Crash Course in Crash Grouping

When an application gets big enough, you stop fixing crashes one at a time. A complex codebase with a lot of users throws off an enormous volume of crash reports, and staring at them individually stops being possible somewhere around the first thousand. The only way to make sense of it is to group related crashes together, so you can see how many distinct problems you actually have and which ones are worth your time.

The catch is that grouping done naively gets it wrong in both directions. It lumps unrelated crashes together, and it splits related ones apart. Done right, grouping feels a little like magic: ten thousand scattered reports resolve into the handful of real problems hiding underneath, and you suddenly know what to fix. Done wrong, it's a magic trick, the kind where the answer was up the magician's sleeve the whole time and you walk away no wiser. Understanding the difference is what separates a crash list that lies to you from one you can actually triage.

The default: group by the top of the stack

The obvious way to group crashes is by whatever function was at the top of the call stack when the program died. Two crashes that died in the same function are probably the same bug, so you group them. This is BugSplat's default, and most of the time it's exactly right: the crashing function is the bug, identical crashing functions are the same defect, done.
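To make the default concrete, here's a minimal sketch of top-of-stack grouping. It is not BugSplat's implementation; the crash records, the "stack" field, and the function names are all made up for illustration, with index 0 standing in for the frame that was running when the program died.

```python
from collections import defaultdict

def group_by_top_frame(crashes):
    """Bucket crash ids by the top frame of each stack.

    Each crash is a dict with a hypothetical "stack" list whose first
    entry is the frame that was executing when the program died.
    """
    groups = defaultdict(list)
    for crash in crashes:
        groups[crash["stack"][0]].append(crash["id"])
    return dict(groups)

# Illustrative data only: two crashes die in the same function, one elsewhere.
crashes = [
    {"id": 1, "stack": ["Renderer::Draw", "Game::Tick", "main"]},
    {"id": 2, "stack": ["Renderer::Draw", "Editor::Preview", "main"]},
    {"id": 3, "stack": ["Audio::Mix", "Game::Tick", "main"]},
]
top_groups = group_by_top_frame(crashes)
```

Crashes 1 and 2 land in one group even though they arrived by different routes, which is the behavior you want whenever the crashing function really is the bug.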

The problem is the cases where the top of the stack isn't the bug.

Why the top of the stack lies to you

Two failure modes break naive top-of-stack grouping, and they pull in opposite directions.

Unrelated crashes that look identical. Plenty of crashes die not in your code but in shared machinery: a system function like KERNELBASE!RaiseException, a third-party library, a runtime's exception machinery. When that happens, the top of the stack is the same generic frame for a whole pile of completely unrelated bugs. Group by it and you get one giant bucket labeled "RaiseException" that's really a dozen different defects wearing the same hat. The grouping says "you have one problem." You have twelve.

Related crashes that look different. This is the mirror image. Imagine three classes, call them Widget, Gizmo, and Doohickey, that each do their own thing but all eventually call into one shared Save() function, which is where the actual fault lives. Three different code paths, three different-looking call stacks, one real bug. Group by the top of the stack and they scatter into three separate groups, so the single defect behind all of them looks like three unrelated problems. You fix it in one place and can't understand why the other two "different" crashes vanished too.

In both cases the fix is the same idea: the right grouping frame often isn't the top of the stack. It's a few frames down, where your code actually was.

The mental model: group where your code is

Here's the way to think about it. A call stack is a story that runs from "what your program was trying to do" at the bottom up to "the exact instruction that failed" at the top. The top frame tells you where it died. A frame a little lower often tells you what was actually going on when it died, and that's frequently the more useful place to group.

When a crash dies in a system call or a third-party library, you want to walk down the stack, past the generic machinery, until you hit the first frame that's your code. That frame is the real identity of the crash. Grouping there pulls the dozen unrelated "RaiseException" crashes apart into the dozen real defects they actually are, and pulls the three scattered Widget/Gizmo/Doohickey crashes together into the one Save() bug they share.

That's the whole concept. Default to the top of the stack, but when the top is generic, group at the first frame that belongs to you.
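The walk-down can be sketched in a few lines. Everything named here is an assumption for the sake of the example: the "MyGame!" and "vendorlib!" module prefixes, the list of generic prefixes, and the idea that the three Save() crashes die in different library frames just above Save(). It shows the idea, not how BugSplat does it internally.

```python
from collections import defaultdict

# Hypothetical module prefixes treated as "not your code" in this sketch.
GENERIC_PREFIXES = ("KERNELBASE!", "ntdll!", "vendorlib!")

def grouping_frame(stack):
    """Return the first frame from the top that is not generic machinery.

    Falls back to the top frame when every frame looks generic.
    """
    for frame in stack:
        if not frame.startswith(GENERIC_PREFIXES):
            return frame
    return stack[0]

def group_crashes(stacks):
    """Group crash ids by the frame chosen by grouping_frame()."""
    groups = defaultdict(list)
    for crash_id, stack in stacks.items():
        groups[grouping_frame(stack)].append(crash_id)
    return dict(groups)

stacks = {
    # Unrelated bugs that share a generic top frame.
    "a": ["KERNELBASE!RaiseException", "MyGame!Inventory::Load", "MyGame!main"],
    "b": ["KERNELBASE!RaiseException", "MyGame!Network::Connect", "MyGame!main"],
    # One Save() bug reached from three callers, dying in different library frames.
    "c": ["vendorlib!write_header", "MyGame!Save", "MyGame!Widget::Commit"],
    "d": ["vendorlib!write_body", "MyGame!Save", "MyGame!Gizmo::Flush"],
    "e": ["vendorlib!write_footer", "MyGame!Save", "MyGame!Doohickey::Close"],
}
groups = group_crashes(stacks)
```

Grouping by the top frame would give four buckets here: one shared "RaiseException" bucket and three separate library frames. Grouping at the first frame that belongs to you gives three: two distinct defects that used to share a bucket, and one Save() group holding all three callers.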

How BugSplat does it: auto-group rules

The old way to do this was by hand, one group at a time. BugSplat now handles it with auto-group rules: pattern-based rules that look at each incoming crash and automatically pick a sensible frame to group on.

There are three kinds. "Ignore" rules skip frames you don't care about (BugSplat ships with defaults that skip common noise like KERNELBASE!RaiseException, so out of the box those crashes group by the next real frame instead of all collapsing together). "Ignore frames up to and including" rules skip everything from the top of the stack through the frame that matches a pattern, which is handy for runtime exception machinery that adds several junk frames. And "Group by" rules let you positively match frames that belong to your application. The rules are pattern-based and defined per platform, and the defaults are designed to give you reasonable groups immediately, then be tuned from there.
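Here's one way the three rule kinds could fit together, as a rough sketch. This is not BugSplat's rule syntax or evaluation order. The regex patterns, the "runtime!" and "vendorlib!" frames, the order in which rules apply, and the choice to cut at the deepest "up to and including" match are all assumptions made for the example.

```python
import re

def matches(frame, patterns):
    """True if the frame matches any of the given regex patterns."""
    return any(re.search(p, frame) for p in patterns)

def pick_group_frame(stack, ignore=(), ignore_through=(), group_by=()):
    """Pick a grouping frame from a stack listed top frame first."""
    # 1. Drop everything up to and including the deepest "ignore through" match.
    cut = -1
    for i, frame in enumerate(stack):
        if matches(frame, ignore_through):
            cut = i
    remaining = stack[cut + 1:]

    # 2. Drop frames that match an individual ignore rule.
    remaining = [f for f in remaining if not matches(f, ignore)]

    # 3. Prefer the first surviving frame that matches a group-by rule.
    for frame in remaining:
        if matches(frame, group_by):
            return frame

    # 4. Otherwise use the first surviving frame, or the original top frame.
    return remaining[0] if remaining else stack[0]

stack = [
    "KERNELBASE!RaiseException",
    "runtime!ThrowHelper",        # hypothetical exception machinery
    "runtime!DispatchException",  # hypothetical exception machinery
    "vendorlib!Parse",            # hypothetical third-party frame
    "MyGame!Config::Load",
    "MyGame!main",
]
frame = pick_group_frame(
    stack,
    ignore=[r"^KERNELBASE!RaiseException$"],
    ignore_through=[r"^runtime!DispatchException$"],
    group_by=[r"^MyGame!"],
)
```

With all three rules in play the crash groups at MyGame!Config::Load. With only the ignore rule it would group at the next frame down, runtime!ThrowHelper, which is why the other two kinds exist.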

The payoff is that grouping happens automatically and consistently as crashes arrive, instead of you regrouping things by hand after the fact. Set the rules once, and a new crash that dies in a third-party library gets grouped at your code the moment it lands.

Where this really bites: game engines and platform noise

The "crashes die in shared machinery" problem is at its worst on big engines and managed platforms, which is exactly where a lot of crash reporting lives, so it's worth calling out specifically.

If you ship on Unreal, you've seen this. A huge share of crashes surface under generic engine and OS frames: KERNELBASE!RaiseException at the very top, Unreal's own FDebug::EnsureFailed and assertion machinery just below it, plus whatever frames the crash reporter adds on the way out. Group by the top of the stack and a wild assortment of completely unrelated game bugs all pile into one or two enormous buckets named after engine internals. The dashboard tells you that you have two problems. You have ninety, and every one of them is hiding behind the same Unreal plumbing. The same pattern shows up with Unity's native layer, with .NET's exception handling, and with any platform that wraps your code in a thick stack of its own before a crash reaches the surface.

This is the scenario auto-group rules were built for. BugSplat's defaults already skip the usual engine and OS noise, and because rules are defined per platform, you can tune a set specifically for Unreal that ignores the engine frames you never want to group on and groups instead at the first frame of your actual game code. Once that's in place, the ninety bugs that were hiding behind RaiseException and EnsureFailed separate out into ninety real, distinct groups, each one landing on the line of your code that's actually responsible. That's the difference between a triage list that's all engine noise and one that's all your game.
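As a small illustration of why per-platform rules matter, the sketch below keeps a separate ignore list for each platform and compares the bucket counts. The platform keys, the glob patterns, and the "Engine!" and "MyGame!" module names are invented for the example and don't reflect BugSplat's actual defaults or rule format.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical per-platform ignore patterns (glob style).
IGNORE_BY_PLATFORM = {
    "unreal": ["KERNELBASE!RaiseException", "*FDebug::EnsureFailed*"],
    "default": ["KERNELBASE!RaiseException"],
}

def group_key(stack, platform):
    """Return the first frame not ignored by the platform's patterns."""
    patterns = IGNORE_BY_PLATFORM.get(platform, IGNORE_BY_PLATFORM["default"])
    for frame in stack:
        if not any(fnmatchcase(frame, p) for p in patterns):
            return frame
    return stack[0]  # everything was noise; fall back to the top frame

# Engine and OS frames that sit on top of every crash in this example.
noise = ["KERNELBASE!RaiseException", "Engine!FDebug::EnsureFailed"]
stacks = [
    noise + ["MyGame!AInventory::AddItem"],
    noise + ["MyGame!AQuestLog::Complete"],
    noise + ["MyGame!AInventory::AddItem"],
]
by_top = Counter(s[0] for s in stacks)
by_rules = Counter(group_key(s, "unreal") for s in stacks)
```

Grouping by the top frame puts all three crashes in a single RaiseException bucket, while the Unreal rule set splits them into two groups named after game code. Under the "default" set the same crashes would still pile up behind EnsureFailed, which is the case for tuning rules per platform.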

Where this pays off: finding the real root cause

Good grouping isn't just tidier. It's how you find causes you'd otherwise miss.

Take that shared Save() bug. Once the three scattered crashes are grouped together at the common frame, you can look at the group as one thing, and if you're on a native Windows integration, you can read the local variables and function arguments at each frame right in the report. Expand the relevant frame and you might find the argument that explains everything, say, a file path coming in as /does/not/exist. Now you know the crash wasn't really about the Save() function at all; it was about the caller handing it a bad path. That's a root cause you'd never spot while the crashes were scattered across three groups that looked unrelated. Three unrelated-looking crashes resolving into one defect with a cause attached is about the closest thing a crash dashboard has to pulling a rabbit out of a hat.

That's the real value of grouping: it turns a pile of individual failures into a short list of distinct defects, each one pointing at a cause you can actually go fix. And once you've found one, you can push it to your issue tracker as a single defect tied to the whole group, rather than filing the same bug three times.

Try it on your own crashes

The concept is simple; the implementation is where the detail lives, and that's exactly the kind of thing that's better followed click-by-click against the current UI than read about. The full walkthrough (setting up auto-group rules, ignoring frames, grouping deeper in the stack, and pushing a grouped defect to your tracker) lives in the docs, under "Grouping Crashes in BugSplat."

And if you want to see grouping work on real data before reading another word, you can start free, submit a few test crashes, and watch a wall of individual reports collapse into the handful of actual defects behind them.

Happy bug hunting!
