crashpad/doc/overview_design.md
Mark Mentovai 6278690abe Update copyright boilerplate, 2022 edition (Crashpad)
sed -i '' -E -e 's/Copyright (.+) The Crashpad Authors\. All rights reserved\.$/Copyright \1 The Crashpad Authors/' $(git grep -El 'Copyright (.+) The Crashpad Authors\. All rights reserved\.$')

Bug: chromium:1098010
Change-Id: I8d6138469ddbe3d281a5d83f64cf918ec2491611
Reviewed-on: https://chromium-review.googlesource.com/c/crashpad/crashpad/+/3878262
Reviewed-by: Joshua Peraza <jperaza@chromium.org>
Commit-Queue: Mark Mentovai <mark@chromium.org>
2022-09-06 23:54:07 +00:00

26 KiB
Raw Permalink Blame History

Crashpad Overview Design

[TOC]

Objective

Crashpad is a library for capturing, storing and transmitting postmortem crash reports from a client to an upstream collection server. Crashpad aims to make it possible for clients to capture process state at the time of crash with the best possible fidelity and coverage, with the minimum of fuss.

Crashpad also provides a facility for clients to capture dumps of process state on-demand for diagnostic purposes.

Crashpad additionally provides minimal facilities for clients to adorn their crashes with application-specific metadata in the form of per-process key/value pairs. More sophisticated clients are able to adorn crash reports further through extensibility points that allow the embedder to augment the crash report with application-specific metadata.

Background

Its an unfortunate truth that any large piece of software will contain bugs that will cause it to occasionally crash. Even in the absence of bugs, software incompatibilities can cause program instability.

Fixing bugs and incompatibilities in client software that ships to millions of users around the world is a daunting task. User reports and manual reproduction of crashes can work, but even given a user report, often times the problem is not readily reproducible. This is for various reasons, such as e.g. system version or third-party software incompatibility, or the problem can happen due to a race of some sort. Users are also unlikely to report problems they encounter, and user reports are often of poor quality, as unfortunately most users dont have experience with making good bug reports.

Automatic crash telemetry has been the best solution to the problem so far, as this relieves the burden of manual reporting from users, while capturing the hardware and software state at the time of crash.

TODO(siggi): examples of this?

Crash telemetry involves capturing postmortem crash dumps and transmitting them to a backend collection server. On the server they can be stackwalked and symbolized, and evaluated and aggregated in various ways. Stackwalking and symbolizing the reports on an upstream server has several benefits over performing these tasks on the client. High-fidelity stackwalking requires access to bulky unwind data, and it may be desirable to not ship this to end users out of concern for the application size. The process of symbolization requires access to debugging symbols, which can be quite large, and the symbolization process can consume considerable other resources. Transmitting un-stackwalked and un-symbolized postmortem dumps to the collection server also allows deep analysis of individual dumps, which is often necessary to resolve the bug causing the crash.

Transmitting reports to the collection server allows aggregating crashes by cause, which in turn allows assessing the importance of different crashes in terms of the occurrence rate and e.g. the potential security impact.

A postmortem crash dump must contain the program state at the time of crash with sufficient fidelity to allow diagnosing and fixing the problem. As the full program state is usually too large to transmit to an upstream server, the postmortem dump captures a heuristic subset of the full state.

The crashed program is in an indeterminate state and, in fact, has often crashed because of corrupt global state - such as heap. Its therefore important to generate crash reports with as little execution in the crashed process as possible. Different operating systems vary in the facilities they provide for this.

Overview

Crashpad is a client-side library that focuses on capturing machine and program state in a postmortem crash report, and transmitting this report to a backend server - a “collection server”. The Crashpad library is embedded by the client application. Conceptually, Crashpad breaks down into the handler and the client. The handler runs in a separate process from the client or clients. It is responsible for snapshotting the crashing client process state on a crash, saving it to a crash dump, and transmitting the crash dump to an upstream server. Clients register with the handler to allow it to capture and upload their crashes. On iOS, there is no separate process for the handler. This is a limitation of iOS.

The Crashpad handler

The Crashpad handler is instantiated in a process supplied by the embedding application. It provides means for clients to register themselves by some means of IPC, or where operating system support is available, by taking advantage of such support to cause crash notifications to be delivered to the handler. On crash, the handler snapshots the crashed client process state, writes it to a postmortem dump in a database, and may also transmit the dump to an upstream server if so configured.

The Crashpad handler is able to handle cross-bitted requests and generate crash dumps across bitness, where e.g. the handler is a 64-bit process while the client is a 32-bit process or vice versa. In the case of Windows, this is limited by the OS such that a 32-bit handler can only generate crash dumps for 32-bit clients, but a 64-bit handler can acquire nearly all of the detail for a 32-bit process.

The Crashpad client

The Crashpad client provides two main facilities.

  1. Registration with the Crashpad handler.
  2. Metadata communication to the Crashpad handler on crash.

A Crashpad embedder links the Crashpad client library into one or more executables, whether a loadable library or a program file. The client process then registers with the Crashpad handler through some mode of IPC or other operating system-specific support.

On crash, metadata is communicated to the Crashpad handler via the CrashpadInfo structure. Each client executable module linking the Crashpad client library embeds a CrashpadInfo structure, which can be updated by the client with whatever state the client wishes to record with a crash.

Overview image

Here is an overview picture of the conceptual relationships between embedder (in light blue), client modules (darker blue), and Crashpad (in green). Note that multiple client modules can contain a CrashpadInfo structure, but only one registration is necessary.

Detailed Design

Requirements

The purpose of Crashpad is to capture machine, OS and application state in sufficient detail and fidelity to allow developers to diagnose and, where possible, fix the issue causing the crash.

Each distinct crash report is assigned a globally unique ID, in order to allow users to associate them with a user report, report in bug reports and so on.

Its critical to safeguard the users privacy by ensuring that no crash report is ever uploaded without user consent. Likewise its important to ensure that Crashpad never captures or uploads reports from non-client processes.

Concepts

  • Client ID. A UUID tied to a single instance of a Crashpad database. When creating a crash report, the Crashpad handler includes the client ID stored in the database. This provides a means to determine how many individual end users are affected by a specific crash signature.

  • Crash ID. A UUID representing a single crash report. Uploaded crash reports also receive a “server ID.” The Crashpad database indexes both the locally-generated and server-generated IDs.

  • Collection Server. See crash server documentation.

  • Client Process. Any process that has registered with a Crashpad handler.

  • Handler process. A process hosting the Crashpad handler library. This may be a dedicated executable, or it may be hosted within a client executable with control passed to it based on special signaling under the clients control, such as a command-line parameter.

  • CrashpadInfo. A structure used by client modules to provide information to the handler.

  • Annotations. Each CrashpadInfo structure points to a dictionary of {string, string} annotations that the client can use to communicate application state in the case of crash.

  • Database. The Crashpad database contains persistent client settings as well as crash dumps pending upload.

TODO(siggi): moar concepts?

Overview Picture

Here is a rough overview picture of the various Crashpad constructs, their layering and intended use by clients.

Layering image

Dark blue boxes are interfaces, light blue boxes are implementation. Gray is the embedding client application. Note that wherever possible, implementation that necessarily has to be OS-specific, exposes OS-agnostic interfaces to the rest of Crashpad and the client.

Registration

The particulars of how a client registers with the handler varies across operating systems.

macOS

At registration time, the client designates a Mach port monitored by the Crashpad handler as the EXC_CRASH exception port for the client. The port may be acquired by launching a new handler process or by retrieving service already registered with the system. The registration is maintained by the kernel and is inherited by subprocesses at creation time by default, so only the topmost process of a process tree need register.

Crashpad provides a facility for a process to disassociate (unregister) with an existing crash handler, which can be necessary when an older client spawns an updated version.

iOS

iOS registers both a signal handler for SIGABRT and a Mach exception handler with a subset of available exceptions. This is a limitation of iOS.

Windows

There are two modes of registration on Windows. In both cases the handler is advised of the address of a set of structures in the client process address space. These structures include a pair of ExceptionInformation structs, one for generating a postmortem dump for a crashing process, and another one for generating a dump for a non- crashing process.

Normal registration

In the normal registration mode, the client connects to a named pipe by a pre-arranged name. A registration request is written to the pipe. During registration, the handler creates a set of events, duplicates them to the registering client, then returns the handle values in the registration response. This is a blocking process.

Initial Handler Creation

In order to avoid blocking client startup for the creation and initialization of the handler, a different mode of registration can be used for the handler creation. In this mode, the client creates a set of event handles and inherits them into the newly created handler process. The handler process is advised of the handle values and the location of the ExceptionInformation structures by way of command line arguments in this mode.

Linux/Android

On Linux, a registration is a connected socket pair between a client process and the Crashpad handler. This socket pair may be private or shared among many client processes.

Private Connections

Private connections are the default registration mode when starting the handler process in response to a crash or on behalf of another client. This mode is required to use a ptrace broker, which is in turn required to trace Android isolated processes.

Shared Connections

Shared connections are the default mode when using a long-lived handler. The same connected socket pair may be shared among any number of clients. The socket pair is created by the first process to start the handler at which point the client socket end may be shared with other clients by any convenient means (e.g. inheritance).

Capturing Exceptions

The details of how Crashpad captures the exceptions leading to crashes varies between operating systems.

macOS

On macOS, the operating system will notify the handler of client crashes via the Mach port set as the client process exception port. As exceptions are dispatched to the Mach port by the kernel, on macOS, exceptions can be handled entirely from the Crashpad handler without the need to run any code in the crash process at the time of the exception.

iOS

On iOS, the operating system will notify the handler of crashes via the Mach exception port or the signal handler. As exceptions are handled in-process, an intermediate dump file is generated rather than a minidump. See more information about the iOS in-process handler.

Windows

On Windows, the OS dispatches exceptions in the context of the crashing thread. To notify the handler of exceptions, the Crashpad client registers an UnhandledExceptionFilter (UEF) in the client process. When an exception trickles up to the UEF, it stores the exception information and the crashing threads ID in the ExceptionInformation structure registered with the handler. It then sets an event handle to signal the handler to go ahead and process the exception.

Caveats
  • If the crashing threads stack is smashed when an exception occurs, the exception cannot be dispatched. In this case the OS will summarily terminate the process, without the handler having an opportunity to generate a crash report.
  • If an exception is handled in the crashing thread, it will never propagate to the UEF, and thus a crash report wont be generated. This happens a fair bit in Windows as system libraries will often dispatch callbacks under a structured exception handler. This occurs during Window message dispatching on some system configurations, as well as during e.g. DLL entry point notifications.
  • A growing number of conditions in the system and runtime exist where detected corruption or illegal calls result in summary termination of the process, in which case no crash report will be generated.
Out-Of-Process Exception Handling

There exists a mechanism in Windows Error Reporting (WER) that allows a client process to register for handling client exceptions out of the crashing process. Unfortunately this mechanism is difficult to use, and doesnt provide coverage for many of the caveats above. Details here.

Linux/Android

On Linux, exceptions are dispatched as signals to the crashing thread. Crashpad signal handlers will send a message over the socket to the Crashpad handler notifying it of the crash and the location of exception information to be read from the crashing process. When using a shared socket connection, communication is entirely one-way. The client sends its dump request to the handler and then waits until the handler responds with a SIGCONT or a timeout occurs. When using a private socket connection, the handler may respond over the socket to communicate with a ptrace broker process. The broker is forked from the crashing process, executes ptrace requests against the crashing process, and sends the information over the socket to the handler.

The CrashpadInfo structure

The CrashpadInfo structure is used to communicate information from the client to the handler. Each executable module in a client process can contain a CrashpadInfo structure. On a crash, the handler crawls all modules in the crashing process to locate all CrashpadInfo structures present. The CrashpadInfo structures are linked into a special, named section of the executable, where the handler can readily find them.

The CrashpadInfo structure has a magic signature, and contains a size and a version field. The intent is to allow backwards compatibility from older client modules to newer handler. It may also be necessary to provide forwards compatibility from newer clients to older handler, though this hasnt occurred yet.

The CrashpadInfo structure contains such properties as the cap for how much memory to include in the crash dump, some tristate flags for controlling the handlers behavior, a pointer to an annotation dictionary and so on.

Snapshot

Snapshot is a layer of interfaces that represent the machine and OS entities that Crashpad cares about. Different concrete implementations of snapshot can then be backed different ways, such as e.g. from the in-memory representation of a crashed process, or e.g. from the contents of a minidump.

Crash Dump Creation

To create a crash dump, a subset of the machine, OS and application state is grabbed from the crashed process into an in-memory snapshot structure in the handler process. Since the full application state is typically too large for capturing to disk and transmitting to an upstream server, the snapshot contains a heuristically selected subset of the full state.

The precise details of whats captured varies between operating systems, but generally includes the following

  • The set of modules (executable, shared libraries) that are loaded into the crashing process.
  • An enumeration of the threads running in the crashing process, including the register contents and the contents of stack memory of each thread.
  • A selection of the OS-related state of the process, such as e.g. the command line, environment and so on.
  • A selection of memory potentially referenced from registers and from stack.

To capture a crash dump, the crashing process is first suspended, then a snapshot is created in the handler process. The snapshot includes the CrashpadInfo structures of the modules loaded into the process, and the contents of those is used to control the level of detail captured for the crash dump.

Once the snapshot has been constructed, it is then written to a minidump file, which is added to the database. The process is un-suspended after the minidump file has been written. In the case of a crash (as opposed to a client request to produce a dump without crashing), it is then either killed by the operating system or the Crashpad handler.

In general the snapshotting process has to be very intimate with the operating system its working with, so there will be a set of concrete implementation classes, many deriving from the snapshot interfaces, doing this for each operating system.

Minidump

The minidump implementation is responsible for writing a snapshot to a serialized on-disk file in the minidump format. The minidump implementation is OS-agnostic, as it works on an OS-agnostic Snapshot interface.

TODO(siggi): Talk about two-phase writes and contents ordering here.

Database

The Crashpad database contains persistent client settings, including a unique crash client identifier and the upload-enabled bit. Note that the crash client identifier is assigned by Crashpad, and is distinct from any identifiers the client application uses to identify users, installs, machines or such - if any. The expectation is that the client application will manage the users upload consent, and inform Crashpad of changes in consent.

The unique client identifier is set at the time of database creation. It is then recorded into every crash report collected by the handler and communicated to the upstream server.

The database stores a configurable number of recorded crash dumps to a configurable maximum aggregate size. For each crash dump it stores annotations relating to whether the crash dumps have been uploaded. For successfully uploaded crash dumps it also stores their server-assigned ID.

The database consists of a settings file, named "settings.dat" with binary contents (see crashpad::Settings::Data for the file format), as well as directory containing the crash dumps. Additionally each crash dump is adorned with properties relating to the state of the dump for upload and such. The details of how these properties are stored vary between platforms.

macOS

The macOS implementation simply stores database properties on the minidump files in filesystem extended attributes.

iOS

The iOS implementation also stores database properties of minidump files in filesystem extended attributes. Separate from the database, iOS also stores its intermediate dump files adjacent to the database. See more information about iOS intermediate dumps.

Windows

The Windows implementation stores database properties in a binary file named “metadata” at the top level of the database directory.

Report Format

Crash reports are recorded in the Windows minidump format with extensions to support Crashpad additions, such as e.g. Annotations.

Upload to collection server

Wire Format

For the time being, Crashpad uses the Breakpad wire protocol, which is essentially a MIME multipart message communicated over HTTP(S). To support this, the annotations from all the CrashpadInfo structures found in the crashing process are merged to create the Breakpad “crash keys” as form data. The postmortem minidump is then attached as an “application/octet- stream” attachment with the name “upload_file_minidump”. The entirety of the request body, including the minidump, can be gzip-compressed to reduce transmission time and increase transmission reliability. Note that by convention there is a set of “crash keys” that are used to communicate the product, version, client ID and other relevant data about the client, to the server. Crashpad normally stores these values in the minidump file itself, but retrieves them from the minidump and supplies them as form data for compatibility with the Breakpad-style server.

This is a temporary compatibility measure to allow the current Breakpad-based upstream server to handle Crashpad reports. In the fullness of time, the wire protocol is expected to change to remove this redundant transmission and processing of the Annotations.

Transport

The embedding client controls the URL of the collection server by the command line passed to the handler. The handler can upload crashes with HTTP or HTTPS, depending on clients preference. Its strongly suggested use HTTPS transport for crash uploads to protect the users privacy against man-in-the-middle snoopers.

TODO(mmentovai): Certificate pinning.

Throttling & Retry Strategy

To protect both the collection server from DDoS as well as to protect the clients from unreasonable data transfer demands, the handler implements a client-side throttling strategy. At the moment, the strategy is very simplistic, it simply limits uploads to one upload per hour, and failed uploads are aborted.

An experiment has been conducted to lift all throttling. Analysis on the aggregate data this produced shows that multiple crashes within a short timespan on the same client are nearly always due to the same cause. Therefore there is very little loss of signal due to the throttling, though the ability to reconstruct at least the full crash count is highly desirable.

The lack of retry is expected to change soon, as this creates blind spots for client crashes that exclusively occur on e.g. network down events, during suspend and resume and such.

Extensibility

Client Extensibility

Clients are able to extend the generated crash reports in two ways, by manipulating their CrashpadInfo structure. The two extensibility points are:

  1. Nominating a set of address ranges for inclusion in the crash report.
  2. Adding user-defined minidump streams for inclusion in the crash report.

In both cases the CrashpadInfo structure has to be updated before a crash occurs.

Embedder Extensibility

Additionally, embedders of the handler can provide "user stream data source" instances to the handler's main function. Any time a minidump is written, these instances get called.

Each data source may contribute a custom stream to the minidump, which can be computed from e.g. system or application state relevant to the crash.

As a case in point, it can be handy to know whether the system was under memory or other resource duress at the time of crash.

Dependencies

Aside from system headers and APIs, when used outside of Chromium, Crashpad has a dependency on “mini_chromium”, which is a subset of the Chromium base library. This is to allow non-Chromium clients to use Crashpad, without taking a direct dependency on the Chromium base, while allowing Chromium projects to use Crashpad with minimum code duplication or hassle. When using Crashpad as part of Chromium, Chromiums own copy of the base library is used instead of mini_chromium.

The downside to this is that mini_chromium must be kept up to date with interface and implementation changes in Chromium base, for the subset of functionality used by Crashpad.

Caveats

TODO(anyone): You may need to describe what you did not do or why simpler approaches don't work. Mention other things to watch out for (if any).

Security Considerations

Crashpad may be used to capture the state of sandboxed processes and it writes minidumps to disk. It may therefore straddle security boundaries, so its important that Crashpad handle all data it reads out of the crashed process with extreme care. The Crashpad handler takes care to access client address spaces through specially-designed accessors that check pointer validity and enforce accesses within prescribed bounds. The flow of information into the Crashpad handler is exclusively one-way: Crashpad never communicates anything back to its clients, aside from providing single-bit indications of completion.

Privacy Considerations

Crashpad may capture arbitrary contents from crashed process memory, including user IDs and passwords, credit card information, URLs and whatever other content users have trusted the crashing program with. The client program must acquire and honor the users consent to upload crash reports, and appropriately manage the upload state in Crashpads database.

Crashpad must also be careful not to upload crashes for arbitrary processes on the users system. To this end, Crashpad will never upload a process that hasnt registered with the handler, but note that registrations are inherited by child processes on some operating systems.