From 33e4cf1d9cd0e2f16d663ad00ba8d8ffd66d6b52 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Felix=20Geisend=C3=B6rfer?=
Date: Wed, 3 Feb 2021 15:38:42 +0100
Subject: [PATCH] Initial block profile page, not released yet

---
 block.md | 141 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 141 insertions(+)
 create mode 100644 block.md

diff --git a/block.md b/block.md
new file mode 100644
index 0000000..c8dceb3
--- /dev/null
+++ b/block.md
@@ -0,0 +1,141 @@
This document was last updated for `go1.15.7` but probably still applies to older/newer versions for the most part.

# Block Profiling in Go

## Description

The block profile in Go lets you analyze how much time your program spends waiting on the blocking operations listed below:

- [select](https://github.com/golang/go/blob/go1.15.7/src/runtime/select.go#L511)
- [chan send](https://github.com/golang/go/blob/go1.15.7/src/runtime/chan.go#L279) (except sending to a nil chan, which blocks forever but is [not tracked](https://github.com/golang/go/blob/go1.15.7/src/runtime/chan.go#L163))
- [chan receive](https://github.com/golang/go/blob/go1.15.7/src/runtime/chan.go#L586) (except receiving from a nil chan, which blocks forever but is [not tracked](https://github.com/golang/go/blob/go1.15.7/src/runtime/chan.go#L466))
- [semacquire](https://github.com/golang/go/blob/go1.15.7/src/runtime/sema.go#L150) ([`Mutex.Lock`](https://golang.org/pkg/sync/#Mutex.Lock), [`RWMutex.RLock`](https://golang.org/pkg/sync/#RWMutex.RLock), [`RWMutex.Lock`](https://golang.org/pkg/sync/#RWMutex.Lock), [`WaitGroup.Wait`](https://golang.org/pkg/sync/#WaitGroup.Wait))
- [notifyListWait](https://github.com/golang/go/blob/go1.15.7/src/runtime/sema.go#L515) ([`Cond.Wait`](https://golang.org/pkg/sync/#Cond.Wait))

Time is only tracked when Go has to suspend the goroutine's execution by parking it into a [waiting](https://github.com/golang/go/blob/go1.15.7/src/runtime/runtime2.go#L51-L59) state. So, for example, a `Mutex.Lock()` operation will not show up in your profile if the lock can be acquired immediately.

The operations above are a subset of the [waiting states](https://github.com/golang/go/blob/go1.15.7/src/runtime/runtime2.go#L996-L1024) used by the Go runtime, i.e. the operations below **will not** show up in a block profile:

- [`time.Sleep`](https://golang.org/pkg/time/#Sleep) (but [`time.After`](https://golang.org/pkg/time/#After), [`time.Tick`](https://golang.org/pkg/time/#Tick) and other channel-based wrappers will show up)
- GC
- Syscalls
- Internal locks (e.g. for [stopTheWorld](https://github.com/golang/go/blob/go1.15.7/src/runtime/proc.go#L900))
- Blocking in [cgo](https://golang.org/cmd/cgo/) calls

## Usage

The block profiler is disabled by default. To record all blocking events regardless of their duration, simply call:

```go
runtime.SetBlockProfileRate(1)
```

You should read the [Accuracy](#accuracy) and [Overhead](#overhead) sections below to figure out if passing `1` as the sampling rate is a good idea for you or not.

Block durations are aggregated over the lifetime of the program (while profiling is enabled).
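To make this concrete, here is a minimal sketch of a program that enables block profiling, blocks on a channel receive, and then dumps the accumulated profile. The channel setup, the sleep duration and the `block.pprof` file name are made up for illustration; the `WriteTo` call is explained right below.

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	// Record every blocking event, regardless of duration. See the Accuracy
	// and Overhead sections before using a rate of 1 in production.
	runtime.SetBlockProfileRate(1)

	ch := make(chan struct{})
	go func() {
		time.Sleep(100 * time.Millisecond) // simulate a slow producer
		ch <- struct{}{}
	}()
	// This receive parks the main goroutine in a waiting state for ~100ms,
	// so it shows up in the block profile.
	<-ch

	// Dump the accumulated profile; the file name is arbitrary.
	f, err := os.Create("block.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.Lookup("block").WriteTo(f, 0); err != nil {
		panic(err)
	}
}
```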
To get a [pprof formatted](./pprof.md) snapshot of the current stack traces that lead to blocking events and their cumulative time duration, you can call:

```go
pprof.Lookup("block").WriteTo(myFile, 0)
```

Alternatively you can use [github.com/pkg/profile](https://pkg.go.dev/github.com/pkg/profile) for convenience, or [net/http/pprof](https://golang.org/pkg/net/http/pprof/) to expose profiling via HTTP, or use a [continuous profiler](https://www.datadoghq.com/product/code-profiling/) to collect the data automatically in production.

Last but not least, you can use the [`runtime.BlockProfile`](https://golang.org/pkg/runtime/#BlockProfile) API to programmatically get the same information.

## Overhead

**tl;dr:** A block profile rate `>= 10000` (10µs) should have a negligible impact on production apps, including those suffering from extreme contention.

### Implementation Details

Block profiling is essentially implemented like this inside the Go runtime (see the links in the [Description](#description) above for the real code):

```go
func chansend(...) {
	var t0 int64
	if blockprofilerate > 0 {
		t0 = cputicks()
	}
	// ... park goroutine in waiting state while blocked ...
	if blockprofilerate > 0 {
		cycles := cputicks() - t0
		if blocksampled(cycles) {
			saveblockevent(cycles)
		}
	}
}
```

This means that unless you enable block profiling, the overhead should be effectively zero thanks to CPU branch prediction.

When block profiling is enabled, every blocking operation pays the overhead of two `cputicks()` calls. On `amd64` this is done via [optimized assembly](https://github.com/golang/go/blob/go1.15.7/src/runtime/asm_amd64.s#L874-L887) using the [RDTSC instruction](https://en.wikipedia.org/wiki/Time_Stamp_Counter) and takes a negligible `~10ns/op` on [my machine](https://github.com/felixge/dump/tree/master/cputicks). On other platforms, various alternative clock sources are used, which may have higher overheads and lower accuracy.

Depending on the configured `blockprofilerate` (more about this in the [Accuracy](#accuracy) section), the block event may end up getting saved. This means a stack trace is collected, which takes `~1µs` on [my machine](https://github.com/felixge/dump/tree/master/go-callers-bench) (stackdepth=16). The stack is then used as a key to update an [internal hashmap](https://github.com/golang/go/blob/go1.15.7/src/runtime/mprof.go#L144) by incrementing the corresponding [`blockRecord`](https://github.com/golang/go/blob/go1.15.7/src/runtime/mprof.go#L133-L138) count and cycles.

```go
type blockRecord struct {
	count  int64
	cycles int64
}
```

The cost of updating the hash map is probably similar to that of collecting the stack trace, but I haven't measured it yet.

### Benchmarks

Anyway, what does all of this mean in terms of overhead for your application? It means that block profiling is **low overhead**. Unless your application spends literally all of its time parking/unparking goroutines due to contention, you probably won't be able to see a measurable impact even when sampling every block event.

That being said, the benchmark results below (see [Methodology](./bench/)) should give you an idea of the **theoretical worst case** overhead block profiling could have. The graph `chan(cap=0)` shows that block profile rates from `1` to `1000` decrease throughput significantly for a workload that consists entirely of sending tiny messages across unbuffered channels. Using a buffered channel, as in graph `chan(cap=128)`, greatly reduces the problem to the point that it probably won't matter for real applications that don't spend all of their time on channel communication overhead.
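For reference, a benchmark for this kind of worst-case workload might look like the following sketch. This is an illustrative stand-in rather than the actual code behind [Methodology](./bench/); the function names, capacities and rates are assumptions.

```go
package bench

import (
	"runtime"
	"testing"
)

// benchmarkChan sends b.N tiny messages across a channel with the given
// capacity while the block profiler records events at the given rate.
func benchmarkChan(b *testing.B, capacity, blockProfileRate int) {
	runtime.SetBlockProfileRate(blockProfileRate)
	defer runtime.SetBlockProfileRate(0) // disable the profiler again

	ch := make(chan int, capacity)
	go func() {
		for i := 0; i < b.N; i++ {
			ch <- i // blocks whenever the buffer is full (always, for capacity 0)
		}
		close(ch)
	}()

	b.ResetTimer()
	for range ch { // blocks whenever the buffer is empty
	}
}

// With an unbuffered channel nearly every send/receive parks a goroutine,
// which is the theoretical worst case for block profiling overhead.
func BenchmarkChanCap0Rate1(b *testing.B)    { benchmarkChan(b, 0, 1) }
func BenchmarkChanCap0Rate1000(b *testing.B) { benchmarkChan(b, 0, 1000) }
func BenchmarkChanCap128Rate1(b *testing.B)  { benchmarkChan(b, 128, 1) }
```

Since every send and receive can park a goroutine here, this setup spends essentially all of its time on the code path that block profiling instruments.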
It's also interesting to note that I was unable to see significant overheads for `mutex`-based workloads. I believe this is because mutexes employ spin locks before parking a goroutine when there is contention. If somebody has a good idea for a workload that exhibits high non-spinning mutex contention in Go, please let me know!

![block_linux_x86_64](./bench/block_linux_x86_64.png)

### Initialization Costs

The first call to `runtime.SetBlockProfileRate()` takes `100ms` because it tries to [measure](https://github.com/golang/go/blob/go1.15.7/src/runtime/runtime.go#L22-L47) the speed difference between the wall clock and the [TSC](https://en.wikipedia.org/wiki/Time_Stamp_Counter) clock. However, recent changes around async preemption have [broken](https://github.com/golang/go/issues/40653#issuecomment-766340860) this code, so the call currently takes only `~10ms`.

### Memory Usage

Block profiling uses a shared hash map that takes up 1.4 MiB even when empty. Unless you explicitly [disable heap profiling](https://twitter.com/felixge/status/1355846360562589696) in your application, this map gets allocated regardless of whether you use the block profiler or not.

In addition, each unique stack trace takes up some memory. The `BuckHashSys` field of [`runtime.MemStats`](https://golang.org/pkg/runtime/#MemStats) allows you to inspect this usage at runtime. In the future I might try to provide additional information about this along with real world data.

## Accuracy

🚧

- `rate` concept [same as memory profiler](https://codereview.appspot.com/6443115#msg38)?
- sampling (should cycles < rate events be multiplied?)
- multi socket cpus / tsc

- max stack depth 32
- spin locks
- [runtime: problems with rdtsc in VMs moving backward · Issue #8976 · golang/go](https://github.com/golang/go/issues/8976)
- https://github.com/golang/go/issues/16755#issuecomment-332279965 ?

## Relationship with Time

🚧

## Relationship with Mutex Profiling

🚧

## pprof Output

🚧

## pprof Labels

🚧

## History

Block profiling was [implemented](https://codereview.appspot.com/6443115) by [Dmitry Vyukov](https://github.com/dvyukov) and first appeared in the [go1.1](https://golang.org/doc/go1.1) release (2013-05-13).

## Disclaimers

I work at [Datadog](https://www.datadoghq.com/) on [Continuous Profiling](https://www.datadoghq.com/product/code-profiling/) for Go (you should check it out) and they generously allowed me to do all this research and publish it.

The information on this page is believed to be correct, but no warranty is provided. Feedback is welcome!
\ No newline at end of file