<!--
Copyright 2017 The Crashpad Authors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Crashpad Overview Design
[TOC]
## Objective
Crashpad is a library for capturing, storing and transmitting postmortem crash
reports from a client to an upstream collection server. Crashpad aims to make it
possible for clients to capture process state at the time of crash with the best
possible fidelity and coverage, with the minimum of fuss.
Crashpad also provides a facility for clients to capture dumps of process state
on-demand for diagnostic purposes.
Crashpad additionally provides minimal facilities for clients to adorn their
crashes with application-specific metadata in the form of per-process key/value
pairs. More sophisticated clients are able to adorn crash reports further
through extensibility points that allow the embedder to augment the crash report
with application-specific metadata.
## Background
It's an unfortunate truth that any large piece of software will contain bugs
that will cause it to occasionally crash. Even in the absence of bugs, software
incompatibilities can cause program instability.
Fixing bugs and incompatibilities in client software that ships to millions of
users around the world is a daunting task. User reports and manual reproduction
of crashes can work, but even given a user report, the problem is often not
readily reproducible. This can be for various reasons, such as a system
version or third-party software incompatibility, or because the problem is due
to a race of some sort. Users are also unlikely to report problems they
encounter, and user reports are often of poor quality, as unfortunately most
users don't have experience with making good bug reports.
Automatic crash telemetry has been the best solution to the problem so far, as
this relieves the burden of manual reporting from users, while capturing the
hardware and software state at the time of crash.
TODO(siggi): examples of this?
Crash telemetry involves capturing postmortem crash dumps and transmitting them
to a backend collection server. On the server they can be stackwalked and
symbolized, and evaluated and aggregated in various ways. Stackwalking and
symbolizing the reports on an upstream server has several benefits over
performing these tasks on the client. High-fidelity stackwalking requires access
to bulky unwind data, and it may be desirable to not ship this to end users out
of concern for the application size. The process of symbolization requires
access to debugging symbols, which can be quite large, and the symbolization
process can consume considerable other resources. Transmitting un-stackwalked
and un-symbolized postmortem dumps to the collection server also allows deep
analysis of individual dumps, which is often necessary to resolve the bug
causing the crash.
Transmitting reports to the collection server allows aggregating crashes by
cause, which in turn allows assessing the importance of different crashes in
terms of the occurrence rate and e.g. the potential security impact.
A postmortem crash dump must contain the program state at the time of crash
with sufficient fidelity to allow diagnosing and fixing the problem. As the full
program state is usually too large to transmit to an upstream server, the
postmortem dump captures a heuristic subset of the full state.
The crashed program is in an indeterminate state and, in fact, has often crashed
because of corrupt global state, such as the heap. It's therefore important to
generate crash reports with as little execution in the crashed process as
possible. Different operating systems vary in the facilities they provide for
this.
## Overview
Crashpad is a client-side library that focuses on capturing machine and program
state in a postmortem crash report, and transmitting this report to a backend
server - a “collection server”. The Crashpad library is embedded by the client
application. Conceptually, Crashpad breaks down into the handler and the client.
The handler runs in a separate process from the client or clients. It is
responsible for snapshotting the crashing client process state on a crash,
saving it to a crash dump, and transmitting the crash dump to an upstream
server. Clients register with the handler to allow it to capture and upload
their crashes. On iOS, there is no separate process for the handler.
[This is a limitation of iOS.](ios_overview_design.md#ios-limitations)
### The Crashpad handler
The Crashpad handler is instantiated in a process supplied by the embedding
application. It lets clients register themselves through some form of IPC
or, where operating system support is available, by taking advantage of
such support to cause crash notifications to be delivered to the handler. On
crash, the handler snapshots the crashed client process state, writes it to a
postmortem dump in a database, and may also transmit the dump to an upstream
server if so configured.
The Crashpad handler is able to handle cross-bitted requests and generate crash
dumps across bitness, where e.g. the handler is a 64-bit process while the
client is a 32-bit process or vice versa. In the case of Windows, this is
limited by the OS such that a 32-bit handler can only generate crash dumps for
32-bit clients, but a 64-bit handler can acquire nearly all of the detail for a
32-bit process.
### The Crashpad client
The Crashpad client provides two main facilities.
1. Registration with the Crashpad handler.
2. Metadata communication to the Crashpad handler on crash.
A Crashpad embedder links the Crashpad client library into one or more
executables, whether a loadable library or a program file. The client process
then registers with the Crashpad handler through some mode of IPC or other
operating system-specific support.
On crash, metadata is communicated to the Crashpad handler via the CrashpadInfo
structure. Each client executable module linking the Crashpad client library
embeds a CrashpadInfo structure, which can be updated by the client with
whatever state the client wishes to record with a crash.
![Overview image](overview.png)
Here is an overview picture of the conceptual relationships between embedder (in
light blue), client modules (darker blue), and Crashpad (in green). Note that
multiple client modules can contain a CrashpadInfo structure, but only one
registration is necessary.
## Detailed Design
### Requirements
The purpose of Crashpad is to capture machine, OS and application state in
sufficient detail and fidelity to allow developers to diagnose and, where
possible, fix the issue causing the crash.
Each distinct crash report is assigned a globally unique ID, in order to allow
users to associate it with a user report, cite it in bug reports and so on.
It's critical to safeguard the user's privacy by ensuring that no crash report
is ever uploaded without user consent. Likewise, it's important to ensure that
Crashpad never captures or uploads reports from non-client processes.
### Concepts
* **Client ID**. A UUID tied to a single instance of a Crashpad database. When
creating a crash report, the Crashpad handler includes the client ID stored
in the database. This provides a means to determine how many individual end
users are affected by a specific crash signature.
* **Crash ID**. A UUID representing a single crash report. Uploaded crash
reports also receive a “server ID.” The Crashpad database indexes both the
locally-generated and server-generated IDs.
* **Collection Server**. See [crash server documentation.](
https://goto.google.com/crash-server-overview)
* **Client Process**. Any process that has registered with a Crashpad handler.
* **Handler process**. A process hosting the Crashpad handler library. This may
be a dedicated executable, or it may be hosted within a client executable
with control passed to it based on special signaling under the client's
control, such as a command-line parameter.
* **CrashpadInfo**. A structure used by client modules to provide information to
the handler.
* **Annotations**. Each CrashpadInfo structure points to a dictionary of
{string, string} annotations that the client can use to communicate
application state in the case of crash.
* **Database**. The Crashpad database contains persistent client settings as
well as crash dumps pending upload.
TODO(siggi): moar concepts?
### Overview Picture
Here is a rough overview picture of the various Crashpad constructs, their
layering and intended use by clients.
![Layering image](layering.png)
Dark blue boxes are interfaces, light blue boxes are implementation. Gray is the
embedding client application. Note that wherever possible, implementation that
necessarily has to be OS-specific, exposes OS-agnostic interfaces to the rest of
Crashpad and the client.
### Registration
The particulars of how a client registers with the handler vary across
operating systems.
#### macOS
At registration time, the client designates a Mach port monitored by the
Crashpad handler as the EXC_CRASH exception port for the client. The port may be
acquired by launching a new handler process or by retrieving a service already
registered with the system. The registration is maintained by the kernel and is
inherited by subprocesses at creation time by default, so only the topmost
process of a process tree need register.
Crashpad provides a facility for a process to disassociate (unregister) from an
existing crash handler, which can be necessary when an older client spawns an
updated version.
#### iOS
iOS registers both a signal handler for `SIGABRT` and a Mach exception handler
with a subset of available exceptions. [This is a limitation of
iOS.](ios_overview_design.md#ios-limitations)
#### Windows
There are two modes of registration on Windows. In both cases the handler is
advised of the address of a set of structures in the client process address
space. These structures include a pair of ExceptionInformation structs, one for
generating a postmortem dump for a crashing process, and another one for
generating a dump for a non-crashing process.
##### Normal registration
In the normal registration mode, the client connects to a named pipe by a
pre-arranged name. A registration request is written to the pipe. During
registration, the handler creates a set of events, duplicates them to the
registering client, then returns the handle values in the registration response.
This is a blocking process.
##### Initial Handler Creation
In order to avoid blocking client startup for the creation and initialization of
the handler, a different mode of registration can be used for the handler
creation. In this mode, the client creates a set of event handles and inherits
them into the newly created handler process. The handler process is advised of
the handle values and the location of the ExceptionInformation structures by way
of command line arguments in this mode.
#### Linux/Android
On Linux, a registration is a connected socket pair between a client process and
the Crashpad handler. This socket pair may be private or shared among many
client processes.
##### Private Connections
Private connections are the default registration mode when starting the handler
process in response to a crash or on behalf of another client. This mode is
required to use a ptrace broker, which is in turn required to trace Android
isolated processes.
##### Shared Connections
Shared connections are the default mode when using a long-lived handler. The
same connected socket pair may be shared among any number of clients. The socket
pair is created by the first process to start the handler at which point the
client socket end may be shared with other clients by any convenient means (e.g.
inheritance).
### Capturing Exceptions
The details of how Crashpad captures the exceptions leading to crashes vary
between operating systems.
#### macOS
On macOS, the operating system will notify the handler of client crashes via the
Mach port set as the client process's exception port. As exceptions are
dispatched to the Mach port by the kernel, they can be handled entirely from
the Crashpad handler without the need to run any code in the crashing process
at the time of the exception.
#### iOS
On iOS, the operating system will notify the handler of crashes via the Mach
exception port or the signal handler. As exceptions are handled in-process, an
intermediate dump file is generated rather than a minidump. See more
information about the [iOS in-process
handler.](ios_overview_design.md#ios-in-process-handler)
#### Windows
On Windows, the OS dispatches exceptions in the context of the crashing thread.
To notify the handler of exceptions, the Crashpad client registers an
UnhandledExceptionFilter (UEF) in the client process. When an exception trickles
up to the UEF, it stores the exception information and the crashing thread's ID
in the ExceptionInformation structure registered with the handler. It then sets
an event handle to signal the handler to go ahead and process the exception.
##### Caveats
* If the crashing thread's stack is smashed when an exception occurs, the
exception cannot be dispatched. In this case the OS will summarily terminate
the process, without the handler having an opportunity to generate a crash
report.
* If an exception is handled in the crashing thread, it will never propagate
to the UEF, and thus a crash report won't be generated. This happens a fair
bit in Windows as system libraries will often dispatch callbacks under a
structured exception handler. This occurs during Window message dispatching
on some system configurations, as well as during e.g. DLL entry point
notifications.
* A growing number of conditions in the system and runtime exist where
detected corruption or illegal calls result in summary termination of the
process, in which case no crash report will be generated.
###### Out-Of-Process Exception Handling
There exists a mechanism in Windows Error Reporting (WER) that allows a client
process to register for handling client exceptions out of the crashing process.
Unfortunately this mechanism is difficult to use, and doesn't provide coverage
for many of the caveats above. [Details
here.](https://crashpad.chromium.org/bug/133)
#### Linux/Android
On Linux, exceptions are dispatched as signals to the crashing thread. Crashpad
signal handlers will send a message over the socket to the Crashpad handler
notifying it of the crash and the location of exception information to be read
from the crashing process. When using a shared socket connection, communication
is entirely one-way. The client sends its dump request to the handler and then
waits until the handler responds with a SIGCONT or a timeout occurs. When using
a private socket connection, the handler may respond over the socket to
communicate with a ptrace broker process. The broker is forked from the crashing
process, executes ptrace requests against the crashing process, and sends the
information over the socket to the handler.
### The CrashpadInfo structure
The CrashpadInfo structure is used to communicate information from the client to
the handler. Each executable module in a client process can contain a
CrashpadInfo structure. On a crash, the handler crawls all modules in the
crashing process to locate all CrashpadInfo structures present. The CrashpadInfo
structures are linked into a special, named section of the executable, where the
handler can readily find them.
The CrashpadInfo structure has a magic signature, and contains a size and a
version field. The intent is to allow backwards compatibility from older client
modules to newer handlers. It may also be necessary to provide forwards
compatibility from newer clients to older handlers, though this hasn't occurred
yet.
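
As a rough illustration of how a signature, a size and a version can support compatibility between modules and handlers of different ages, here is a minimal sketch. The `InfoV1` layout, the `kMagic` value and the `ReadInfo` function are hypothetical and are not Crashpad's actual definitions. The sketch also reads from a local buffer, whereas a real handler would read from the crashed process.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical layout for illustration only; not the real CrashpadInfo.
struct InfoV1 {
  uint32_t signature;   // Magic value identifying the structure.
  uint32_t size;        // Size in bytes of the structure the client wrote.
  uint32_t version;     // Layout version the client was built with.
  uint32_t memory_cap;  // Example payload field.
};

constexpr uint32_t kMagic = 0x49504443;  // Arbitrary value for this sketch.
constexpr size_t kHeaderSize = 3 * sizeof(uint32_t);

// Copies a client-written structure from `bytes` into `out`. Returns false if
// the data does not look like a valid structure.
bool ReadInfo(const uint8_t* bytes, size_t available, InfoV1* out) {
  assert(bytes != nullptr && out != nullptr);
  InfoV1 info = {};
  if (available < kHeaderSize) return false;
  std::memcpy(&info, bytes, kHeaderSize);
  if (info.signature != kMagic || info.version == 0) return false;
  if (info.size < kHeaderSize || info.size > available) return false;
  // Copy only the bytes that both sides know about. Fields an older client
  // did not write stay zero; extra fields from a newer client are ignored.
  const size_t known = std::min<size_t>(info.size, sizeof(InfoV1));
  std::memcpy(&info, bytes, known);
  *out = info;
  return true;
}
```

In this sketch the `size` field is what makes both directions work: the reader never copies more than the writer produced, and never more than it understands itself.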
The CrashpadInfo structure contains such properties as the cap for how much
memory to include in the crash dump, some tristate flags for controlling the
handler's behavior, a pointer to an annotation dictionary and so on.
### Snapshot
Snapshot is a layer of interfaces that represent the machine and OS entities
that Crashpad cares about. Different concrete implementations of snapshot can
then be backed in different ways, e.g. by the in-memory representation of
a crashed process or by the contents of a minidump.
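
The layering idea can be sketched as one abstract interface with two backings and a consumer that only sees the interface. All names here (`ProcessView`, `LiveProcessView`, `DumpFileView`, `Describe`) are hypothetical and heavily reduced; they are not Crashpad's snapshot classes, and the "live" implementation returns fixed data instead of querying an operating system.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily reduced interface; real snapshot interfaces expose
// far more state (threads, memory, system information and so on).
class ProcessView {
 public:
  virtual ~ProcessView() = default;
  virtual int64_t ProcessID() const = 0;
  virtual std::vector<std::string> ModuleNames() const = 0;
};

// Stand-in for an implementation backed by a live (crashed) process.
class LiveProcessView : public ProcessView {
 public:
  explicit LiveProcessView(int64_t pid) : pid_(pid) { assert(pid >= 0); }
  int64_t ProcessID() const override { return pid_; }
  std::vector<std::string> ModuleNames() const override {
    // A real implementation would ask the OS; this sketch uses fixed data.
    return std::vector<std::string>{"app", "libfoo"};
  }

 private:
  int64_t pid_;
};

// Stand-in for an implementation backed by a previously written dump.
class DumpFileView : public ProcessView {
 public:
  DumpFileView(int64_t pid, std::vector<std::string> modules)
      : pid_(pid), modules_(std::move(modules)) {}
  int64_t ProcessID() const override { return pid_; }
  std::vector<std::string> ModuleNames() const override { return modules_; }

 private:
  int64_t pid_;
  std::vector<std::string> modules_;
};

// A consumer that depends only on the interface, not on the backing.
std::string Describe(const ProcessView& process) {
  std::string text = "pid " + std::to_string(process.ProcessID()) + ":";
  for (const std::string& name : process.ModuleNames()) text += " " + name;
  return text;
}
```

Because `Describe` takes a `const ProcessView&`, the same code path serves both a freshly captured process and a dump read back from disk.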
### Crash Dump Creation
To create a crash dump, a subset of the machine, OS and application state is
grabbed from the crashed process into an in-memory snapshot structure in the
handler process. Since the full application state is typically too large for
capturing to disk and transmitting to an upstream server, the snapshot contains
a heuristically selected subset of the full state.
The precise details of what's captured vary between operating systems, but
generally include the following:
* The set of modules (executable, shared libraries) that are loaded into the
crashing process.
* An enumeration of the threads running in the crashing process, including the
register contents and the contents of stack memory of each thread.
* A selection of the OS-related state of the process, such as the command
line, environment and so on.
* A selection of memory potentially referenced from registers and from stack.
To capture a crash dump, the crashing process is first suspended, then a
snapshot is created in the handler process. The snapshot includes the
CrashpadInfo structures of the modules loaded into the process, and the contents
of those are used to control the level of detail captured for the crash dump.
Once the snapshot has been constructed, it is then written to a minidump file,
which is added to the database. The process is un-suspended after the minidump
file has been written. In the case of a crash (as opposed to a client request to
produce a dump without crashing), it is then killed either by the operating
system or by the Crashpad handler.
In general the snapshotting process has to be very intimate with the operating
system it's working with, so there will be a set of concrete implementation
classes, many deriving from the snapshot interfaces, doing this for each
operating system.
### Minidump
The minidump implementation is responsible for writing a snapshot to a
serialized on-disk file in the minidump format. The minidump implementation is
OS-agnostic, as it works on an OS-agnostic Snapshot interface.
TODO(siggi): Talk about two-phase writes and contents ordering here.
### Database
The Crashpad database contains persistent client settings, including a unique
crash client identifier and the upload-enabled bit. Note that the crash client
identifier is assigned by Crashpad, and is distinct from any identifiers the
client application uses to identify users, installs, machines or such - if any.
The expectation is that the client application will manage the user's upload
consent, and inform Crashpad of changes in consent.
The unique client identifier is set at the time of database creation. It is then
recorded into every crash report collected by the handler and communicated to
the upstream server.
The database stores a configurable number of recorded crash dumps to a
configurable maximum aggregate size. For each crash dump it stores annotations
relating to whether the crash dump has been uploaded. For successfully
uploaded crash dumps it also stores their server-assigned ID.
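
The count and aggregate-size limits can be sketched as a selection function over the stored dumps. The text above does not say which dumps are dropped when a limit is exceeded; this sketch assumes the oldest go first. `StoredReport` and `SelectReportsToPrune` are hypothetical names, not part of Crashpad.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record describing one stored crash dump.
struct StoredReport {
  std::string id;
  int64_t creation_time;  // Seconds since some epoch.
  uint64_t size_bytes;
};

// Returns the IDs of reports to delete so that at most `max_count` reports
// totalling at most `max_total_bytes` remain. Assumption: the newest reports
// are kept, and everything older than the first report that breaks a limit
// is pruned.
std::vector<std::string> SelectReportsToPrune(std::vector<StoredReport> reports,
                                              size_t max_count,
                                              uint64_t max_total_bytes) {
  std::sort(reports.begin(), reports.end(),
            [](const StoredReport& a, const StoredReport& b) {
              return a.creation_time > b.creation_time;  // Newest first.
            });
  std::vector<std::string> to_prune;
  size_t kept = 0;
  uint64_t kept_bytes = 0;
  bool limit_reached = false;
  for (const StoredReport& report : reports) {
    if (!limit_reached && kept < max_count &&
        report.size_bytes <= max_total_bytes - kept_bytes) {
      ++kept;
      kept_bytes += report.size_bytes;
    } else {
      limit_reached = true;
      to_prune.push_back(report.id);
    }
  }
  assert(kept_bytes <= max_total_bytes);
  return to_prune;
}
```

The size check is written as a subtraction from the remaining budget so that it cannot overflow for very large reported sizes.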
The database consists of a settings file named "settings.dat" with binary
contents (see crashpad::Settings::Data for the file format), as well as a
directory containing the crash dumps. Additionally each crash dump is adorned
with properties relating to the state of the dump for upload and such. The
details of how these properties are stored vary between platforms.
#### macOS
The macOS implementation simply stores database properties on the minidump files
in filesystem extended attributes.
#### iOS
The iOS implementation also stores database properties of minidump files in
filesystem extended attributes. Separate from the database, iOS also stores its
intermediate dump files adjacent to the database. See more information about
[iOS intermediate
dumps.](ios_overview_design.md#the-crashpad-intermediatedump-format)
#### Windows
The Windows implementation stores database properties in a binary file named
“metadata” at the top level of the database directory.
### Report Format
Crash reports are recorded in the Windows minidump format with
extensions to support Crashpad additions, such as Annotations.
### Upload to collection server
#### Wire Format
For the time being, Crashpad uses the Breakpad wire protocol, which is
essentially a MIME multipart message communicated over HTTP(S). To support this,
the annotations from all the CrashpadInfo structures found in the crashing
process are merged to create the Breakpad “crash keys” as form data. The
postmortem minidump is then attached as an “application/octet-stream”
attachment with the name “upload_file_minidump”. The entirety of the request
body, including the minidump, can be gzip-compressed to reduce transmission time
and increase transmission reliability. Note that by convention there is a set of
“crash keys” that are used to communicate the product, version, client ID and
other relevant data about the client, to the server. Crashpad normally stores
these values in the minidump file itself, but retrieves them from the minidump
and supplies them as form data for compatibility with the Breakpad-style server.
This is a temporary compatibility measure to allow the current Breakpad-based
upstream server to handle Crashpad reports. In the fullness of time, the wire
protocol is expected to change to remove this redundant transmission and
processing of the Annotations.
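
To make the shape of such a request body concrete, here is a minimal sketch that merges per-module annotations and lays them out as multipart form data followed by the minidump attachment. `MergeAnnotations` and `BuildMultipartBody` are hypothetical helpers, the `minidump.dmp` filename is made up, and letting the first module win on duplicate keys is an assumption of this sketch. It performs no escaping of keys or values and omits gzip compression.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Annotations = std::map<std::string, std::string>;

// Merges the annotations found in each module. On duplicate keys the first
// module seen wins; that policy is an assumption of this sketch.
Annotations MergeAnnotations(const std::vector<Annotations>& per_module) {
  Annotations merged;
  for (const Annotations& annotations : per_module) {
    // map::insert leaves keys that already exist untouched.
    merged.insert(annotations.begin(), annotations.end());
  }
  return merged;
}

// Builds a multipart/form-data body: one part per crash key, then the
// minidump as a file part named "upload_file_minidump".
std::string BuildMultipartBody(const Annotations& crash_keys,
                               const std::string& minidump,
                               const std::string& boundary) {
  assert(!boundary.empty());
  std::string body;
  for (const auto& key_value : crash_keys) {
    body += "--" + boundary + "\r\n";
    body += "Content-Disposition: form-data; name=\"" + key_value.first +
            "\"\r\n\r\n";
    body += key_value.second + "\r\n";
  }
  body += "--" + boundary + "\r\n";
  body += "Content-Disposition: form-data; name=\"upload_file_minidump\"; "
          "filename=\"minidump.dmp\"\r\n";  // Filename is made up.
  body += "Content-Type: application/octet-stream\r\n\r\n";
  body += minidump + "\r\n";
  body += "--" + boundary + "--\r\n";
  return body;
}
```

The request carrying this body would declare `Content-Type: multipart/form-data; boundary=...` with the same boundary string, which must not occur inside any part.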
#### Transport
The embedding client controls the URL of the collection server by the command
line passed to the handler. The handler can upload crashes with HTTP or HTTPS,
depending on the client's preference. It's strongly suggested to use HTTPS
transport for crash uploads to protect the user's privacy against man-in-the-middle
snoopers.
TODO(mmentovai): Certificate pinning.
#### Throttling & Retry Strategy
To protect both the collection server from DDoS as well as to protect the
clients from unreasonable data transfer demands, the handler implements a
client-side throttling strategy. At the moment, the strategy is very simplistic:
it limits uploads to one per hour, and failed uploads are aborted rather than
retried.
An experiment has been conducted to lift all throttling. Analysis on the
aggregate data this produced shows that multiple crashes within a short timespan
on the same client are nearly always due to the same cause. Therefore there is
very little loss of signal due to the throttling, though the ability to
reconstruct at least the full crash count is highly desirable.
The lack of retry is expected to [change
soon](https://crashpad.chromium.org/bug/23), as this creates blind spots for
client crashes that exclusively occur on e.g. network down events, during
suspend and resume and such.
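
The one-upload-per-interval rule described in this section can be sketched as a small gate. `UploadThrottle` is a hypothetical class, not Crashpad's implementation; it keeps the last attempt time in memory only, whereas a real handler would need to persist it across restarts.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical throttle: allows at most one upload attempt per interval.
class UploadThrottle {
 public:
  using Clock = std::chrono::system_clock;

  explicit UploadThrottle(std::chrono::seconds min_interval)
      : min_interval_(min_interval), has_last_attempt_(false) {
    assert(min_interval.count() >= 0);
  }

  // Returns true if an upload may be attempted at `now` and records the
  // attempt. The attempt is recorded whether or not the upload later
  // succeeds, which mirrors the "no retry" behavior described above.
  bool ShouldAttemptUpload(Clock::time_point now) {
    if (has_last_attempt_ && now - last_attempt_ < min_interval_) {
      return false;  // Throttled: too soon after the previous attempt.
    }
    last_attempt_ = now;
    has_last_attempt_ = true;
    return true;
  }

 private:
  std::chrono::seconds min_interval_;
  bool has_last_attempt_;
  Clock::time_point last_attempt_;
};
```

Passing the current time in as a parameter, rather than reading the clock inside the class, keeps the rule easy to exercise with fixed timestamps.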
### Extensibility
#### Client Extensibility
Clients are able to extend the generated crash reports in two ways, by
manipulating their CrashpadInfo structure.
The two extensibility points are:
1. Nominating a set of address ranges for inclusion in the crash report.
2. Adding user-defined minidump streams for inclusion in the crash report.
In both cases the CrashpadInfo structure has to be updated before a crash
occurs.
#### Embedder Extensibility
Additionally, embedders of the handler can provide "user stream data source"
instances to the handler's main function. Any time a minidump is written, these
instances get called.
Each data source may contribute a custom stream to the minidump, which can be
computed from e.g. system or application state relevant to the crash.
As a case in point, it can be handy to know whether the system was under memory
or other resource duress at the time of crash.
### Dependencies
Aside from system headers and APIs, when used outside of Chromium, Crashpad has
a dependency on “mini_chromium”, which is a subset of the Chromium base library.
This is to allow non-Chromium clients to use Crashpad, without taking a direct
dependency on the Chromium base, while allowing Chromium projects to use
Crashpad with minimum code duplication or hassle. When using Crashpad as part of
Chromium, Chromium's own copy of the base library is used instead of
mini_chromium.
The downside to this is that mini_chromium must be kept up to date with
interface and implementation changes in Chromium base, for the subset of
functionality used by Crashpad.
## Caveats
TODO(anyone): You may need to describe what you did not do or why simpler
approaches don't work. Mention other things to watch out for (if any).
## Security Considerations
Crashpad may be used to capture the state of sandboxed processes and it writes
minidumps to disk. It may therefore straddle security boundaries, so it's
important that Crashpad handle all data it reads out of the crashed process with
extreme care. The Crashpad handler takes care to access client address spaces
through specially-designed accessors that check pointer validity and enforce
accesses within prescribed bounds. The flow of information into the Crashpad
handler is exclusively one-way: Crashpad never communicates anything back to
its clients, aside from providing single-bit indications of completion.
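
The idea of an accessor that validates every read can be sketched as follows. `BoundedMemory` is a hypothetical stand-in that models a client address space as a single byte range held locally; it is not one of Crashpad's accessor classes, which read from another process.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-in for a client address space: one mapped range that
// starts at `base` and holds `bytes`.
class BoundedMemory {
 public:
  BoundedMemory(uint64_t base, std::vector<uint8_t> bytes)
      : base_(base), bytes_(std::move(bytes)) {}

  // Copies `size` bytes starting at `address` into `out`. Returns false,
  // without touching `out`, if any part of the request falls outside the
  // range.
  bool Read(uint64_t address, size_t size, void* out) const {
    assert(out != nullptr);
    if (address < base_) return false;
    const uint64_t offset = address - base_;
    if (offset > bytes_.size()) return false;
    // Compare against the remaining length rather than computing
    // offset + size, which could overflow for hostile inputs.
    if (size > bytes_.size() - offset) return false;
    if (size == 0) return true;
    std::memcpy(out, bytes_.data() + offset, size);
    return true;
  }

 private:
  uint64_t base_;
  std::vector<uint8_t> bytes_;
};
```

Addresses and sizes read out of a crashed process are untrusted input, so the checks are ordered to avoid arithmetic that could wrap around.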
## Privacy Considerations
Crashpad may capture arbitrary contents from crashed process memory, including
user IDs and passwords, credit card information, URLs and whatever other content
users have trusted the crashing program with. The client program must acquire
and honor the user's consent to upload crash reports, and appropriately manage
the upload state in Crashpad's database.
Crashpad must also be careful not to upload crashes for arbitrary processes on
the user's system. To this end, Crashpad will never upload a crash for a process that hasn't
registered with the handler, but note that registrations are inherited by child
processes on some operating systems.