mirror of https://github.com/jesseduffield/lazygit.git synced 2026-05-22 10:15:43 +02:00

Bump tcell dependency to v3

Stefan Haller
2026-04-01 16:12:16 +02:00
parent 64996d12d9
commit 5d3715f96b
142 changed files with 15398 additions and 8297 deletions
+3
@@ -0,0 +1,3 @@
.DS_Store
*.out
*.test
+51
@@ -0,0 +1,51 @@
The goals and overview of this package can be found in the README.md file;
start by reading that.
The goal of this package is to determine the display (column) width of a
string, UTF-8 bytes, or runes, as would happen in a monospace font, especially
in a terminal.
When troubleshooting, write Go unit tests instead of executing debug scripts.
The tests can return whatever logs or output you need. If those tests are
only for temporary troubleshooting, clean up the tests after the debugging is
done.
(Separate executable debugging scripts are messy, tend to have conflicting
dependencies and are hard to clean up.)
If you make changes to the trie generation in internal/gen, it can be invoked
by running `go generate` from the top package directory.
## Pull Requests and branches
For PRs (pull requests), you can use the gh CLI tool. Compare the current branch with main. Reviewing a PR and reviewing a branch are about the same, but the PR may add context.
Understand the goals of the PR. Note any API changes, especially breaking changes.
Look for thoroughness of tests, as well as GoDoc comments.
Retrieve and consider the comments on the PR, which may have come from GitHub Copilot or Cursor BugBot. Think like GitHub Copilot or Cursor BugBot.
Offer to optionally post a brief summary of the review to the PR, via the gh CLI tool.
## Tagged Go releases
If I ask you whether we are ready to release, this means a tagged Go release on the main branch. Go releases are git tagged with a version number.
Review the changes since the last release, i.e. the previous git tag. Ensure that the changes are complete and correct. Identify new features, bug fixes, and performance improvements.
Identify breaking changes, especially API changes.
Ensure good test coverage. Look for performance changes, especially performance regressions, by running benchmarks against the previous release.
Ensure that the documentation in READMEs and GoDocs is complete, correct and consistent.
## Comparisons to go-runewidth
We originally attempted to make this package compatible with go-runewidth.
However, we found that there were too many differences in the handling of
certain characters and properties.
We believe, preliminarily, that our choices are more correct and complete,
by using more complete categories such as Unicode Cf (format) for zero-width
and Mn (Nonspacing_Mark) for combining marks.
+129
@@ -0,0 +1,129 @@
# Changelog
## [0.11.0]
[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.10.0...v0.11.0)
### Added
- New `ControlSequences8Bit` option to treat 8-bit ECMA-48 (C1) escape sequences as zero-width. (#22)
### Changed
- Upgraded uax29 dependency to v2.7.0 for 8-bit escape sequence support in the grapheme iterator.
- Truncation now validates that preserved trailing escape sequences are zero-width, preventing edge cases where non-zero-width sequences could leak into output.
### Note
- `ControlSequences8Bit` is deliberately ignored by `TruncateString` and `TruncateBytes`, because C1 byte values (0x80–0x9F) overlap with UTF-8 multi-byte encoding.
## [0.10.0]
[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.9.0...v0.10.0)
### Added
- New `ControlSequences` option to treat ECMA-48/ANSI escape sequences as zero-width. (#20)
- `TruncateString` and `TruncateBytes` now preserve trailing ANSI escape sequences (such as SGR resets) when `ControlSequences` is true, preventing color bleed in terminal output.
### Changed
- Removed `stringish` dependency; generic type constraints are now inline `~string | []byte`.
- Upgraded uax29 dependency to v2.6.0 for ANSI escape sequence support in the grapheme iterator.
## [0.9.0]
[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.8.0...v0.9.0)
### Changed
- Unicode 17 support: East Asian Width and emoji data updated to Unicode 17.0.0. (#18)
- Upgraded uax29 dependency to v2.5.0 (Unicode 17 grapheme segmentation).
## [0.8.0]
[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.7.0...v0.8.0)
### Changed
- Performance: ASCII fast path that applies to any run of printable
ASCII. 2x-10x faster for ASCII text vs v0.7.0. (#16)
- Upgraded uax29 dependency to v2.4.0 for Unicode 16 support. Text that includes
Indic_Conjunct_Break may segment differently (and more correctly). (#15)
## [0.7.0]
[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.6.2...v0.7.0)
### Added
- New `TruncateString` and `TruncateBytes` methods to truncate strings to a
maximum display width, with optional tail (like an ellipsis). (#13)
## [0.6.2]
[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.6.1...v0.6.2)
### Changed
- Internal: reduced property categories for simpler trie.
## [0.6.1]
[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.6.0...v0.6.1)
### Changed
- Perf improvements: replaced the ASCII lookup table with a simple
function. A bit more cache-friendly. More inlining.
- Bug fix: single regional indicators are now treated as width 2, since that
is what actual terminals do.
## [0.6.0]
[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.5.0...v0.6.0)
### Added
- New `StringGraphemes` and `BytesGraphemes` methods, for iterating over the
widths of grapheme clusters.
### Changed
- Fast ASCII lookups
## [0.5.0]
[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.4.1...v0.5.0)
### Added
- Unicode 16 support
- Improved emoji presentation handling per Unicode TR51
### Changed
- Corrected VS15 (U+FE0E) handling: now preserves base character width (no-op) per Unicode TR51
- Performance optimizations: reduced property lookups
### Fixed
- VS15 variation selector now correctly preserves base character width instead of forcing width 1
## [0.4.1]
[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.4.0...v0.4.1)
### Changed
- Updated uax29 dependency
- Improved flag handling
## [0.4.0]
[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.3.1...v0.4.0)
### Added
- Support for variation selectors (VS15, VS16) and regional indicator pairs (flags)
## [0.3.1]
[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.3.0...v0.3.1)
### Added
- Fuzz testing support
### Changed
- Updated stringish dependency
## [0.3.0]
[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.2.0...v0.3.0)
### Changed
- Dropped compatibility with go-runewidth
- Trie implementation cleanup
+21
@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2025 Matt Sherman
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
+190
@@ -0,0 +1,190 @@
# displaywidth
A high-performance Go package for measuring the monospace display width of strings, UTF-8 bytes, and runes.
[![Documentation](https://pkg.go.dev/badge/github.com/clipperhouse/displaywidth.svg)](https://pkg.go.dev/github.com/clipperhouse/displaywidth)
[![Test](https://github.com/clipperhouse/displaywidth/actions/workflows/gotest.yml/badge.svg)](https://github.com/clipperhouse/displaywidth/actions/workflows/gotest.yml)
[![Fuzz](https://github.com/clipperhouse/displaywidth/actions/workflows/gofuzz.yml/badge.svg)](https://github.com/clipperhouse/displaywidth/actions/workflows/gofuzz.yml)
## Install
```bash
go get github.com/clipperhouse/displaywidth
```
## Usage
```go
package main
import (
"fmt"
"github.com/clipperhouse/displaywidth"
)
func main() {
width := displaywidth.String("Hello, 世界!")
fmt.Println(width)
width = displaywidth.Bytes([]byte("🌍"))
fmt.Println(width)
width = displaywidth.Rune('🌍')
fmt.Println(width)
}
```
For most purposes, you should use the `String` or `Bytes` methods. They sum
the widths of grapheme clusters in the string or byte slice.
> Note: in your application, iterating over runes to measure width is likely incorrect;
the smallest unit of display is a grapheme, not a rune.
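As a rough illustration of the note above, the following standard-library-only sketch counts runes in two strings that each display as a single character. The helper name `runeCount` is hypothetical and not part of `displaywidth`.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCount is a hypothetical helper: it counts code points, not graphemes.
func runeCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	// 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
	decomposed := "e\u0301"
	// WOMAN + ZERO WIDTH JOINER + PERSONAL COMPUTER: one emoji, three code points.
	technologist := "\U0001F469\u200D\U0001F4BB"

	fmt.Println(runeCount(decomposed))   // 2
	fmt.Println(runeCount(technologist)) // 3
}
```

For the emoji in particular, adding up per-rune widths would likely count the woman and the computer as two separate wide characters, whereas a terminal that supports the sequence draws one.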
### Iterating over graphemes
If you need the individual graphemes:
```go
package main

import (
	"fmt"

	"github.com/clipperhouse/displaywidth"
)

func main() {
	g := displaywidth.StringGraphemes("Hello, 世界!")
	for g.Next() {
		width := g.Width()
		value := g.Value()
		// do something with the width or value
		fmt.Println(value, width)
	}
}
```
### Options
Create the options you need, and then use methods on the options struct.
```go
var myOptions = displaywidth.Options{
EastAsianWidth: true,
ControlSequences: true,
}
width := myOptions.String("Hello, 世界!")
```
#### ControlSequences
`ControlSequences` specifies whether to ignore ECMA-48 escape sequences
when calculating the display width. When `false` (default), ANSI escape
sequences are treated as just a series of characters. When `true`, they are
treated as a single zero-width unit.
#### ControlSequences8Bit
`ControlSequences8Bit` specifies whether to ignore 8-bit ECMA-48 escape sequences
when calculating the display width. When `false` (default), these are treated
as just a series of characters. When `true`, they are treated as a single
zero-width unit.
Note: this option is ignored by the `Truncate` methods, because C1 byte values
(0x80–0x9F) overlap with UTF-8 multi-byte encoding, and concatenating truncated
output can form unintended UTF-8 sequences.
#### EastAsianWidth
`EastAsianWidth` defines how
[East Asian Ambiguous characters](https://www.unicode.org/reports/tr11/#Ambiguous)
are treated.
When `false` (default), East Asian Ambiguous characters are treated as width 1.
When `true`, they are treated as width 2.
You may wish to configure this based on environment variables or locale.
`go-runewidth`, for example, does so
[during package initialization](https://github.com/mattn/go-runewidth/blob/master/runewidth.go#L26C1-L45C2). `displaywidth` does not do this automatically; we prefer to leave it to you.
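If you do want locale-based configuration, a minimal sketch might look like the following. It assumes that a locale beginning with `ja`, `ko`, or `zh` should enable wide treatment, which is a simplification and not necessarily what `go-runewidth` does. The helpers `isEastAsianLocale` and `localeFromEnv` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isEastAsianLocale is a hypothetical helper. It assumes that Japanese,
// Korean, and Chinese locales want ambiguous characters treated as wide.
func isEastAsianLocale(locale string) bool {
	l := strings.ToLower(locale)
	for _, prefix := range []string{"ja", "ko", "zh"} {
		if strings.HasPrefix(l, prefix) {
			return true
		}
	}
	return false
}

// localeFromEnv returns the first non-empty value among LC_ALL, LC_CTYPE,
// and LANG, following the usual POSIX precedence.
func localeFromEnv() string {
	for _, name := range []string{"LC_ALL", "LC_CTYPE", "LANG"} {
		if v := os.Getenv(name); v != "" {
			return v
		}
	}
	return ""
}

func main() {
	eastAsian := isEastAsianLocale(localeFromEnv())
	// The result could be used as the EastAsianWidth field of Options.
	fmt.Println(eastAsian)
}
```

Keeping the decision in a pure function such as `isEastAsianLocale` makes it easy to test without touching the process environment.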
## Technical standards and compatibility
This package implements the Unicode East Asian Width standard
([UAX #11](https://www.unicode.org/reports/tr11/tr11-43.html)), and handles
[variation selectors](https://en.wikipedia.org/wiki/Variation_Selectors_(Unicode_block)),
and [regional indicator pairs](https://en.wikipedia.org/wiki/Regional_indicator_symbol)
(flags). We implement [Unicode TR51](https://www.unicode.org/reports/tr51/tr51-27.html)
for emojis. We are keeping an eye on
[emerging standards](https://www.jeffquast.com/post/state-of-terminal-emulation-2025/).
For control sequences, we implement the [ECMA-48](https://ecma-international.org/publications-and-standards/standards/ecma-48/) standard for 7-bit and 8-bit control sequences.
`clipperhouse/displaywidth`, `mattn/go-runewidth`, and `rivo/uniseg` will
give the same outputs for most real-world text. Extensive details are in the
[compatibility analysis](comparison/COMPATIBILITY_ANALYSIS.md).
## Invalid UTF-8
This package does not validate UTF-8. If you pass invalid UTF-8, the results
are undefined. We fuzz against invalid UTF-8 to ensure we don't panic or
loop indefinitely.
The `ControlSequences8Bit` option means that we will segment valid 8-bit
control sequences, which are typically _not_ valid UTF-8. 8-bit control bytes
happen to also be UTF-8 continuation bytes. Use with caution.
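The overlap can be checked with the standard library alone. The sketch below is illustrative and independent of this package. The helper names are hypothetical, and the stray `0xC3` lead byte is a contrived input chosen to show how concatenation can produce a visible character.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isC1 reports whether b is in the 8-bit C1 control range.
func isC1(b byte) bool { return b >= 0x80 && b <= 0x9F }

// isContinuation reports whether b has the UTF-8 continuation form 10xxxxxx.
func isContinuation(b byte) bool { return b&0xC0 == 0x80 }

// allC1AreContinuations checks the overlap for every C1 value.
func allC1AreContinuations() bool {
	for b := 0x80; b <= 0x9F; b++ {
		if !isContinuation(byte(b)) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(allC1AreContinuations()) // true

	// An 8-bit CSI (0x9B) sequence is not valid UTF-8 on its own.
	fmt.Println(utf8.ValidString("\x9b31m")) // false

	// The same byte placed after a stray lead byte 0xC3 becomes U+00DB.
	joined := "\xC3" + "\x9b"
	fmt.Println(utf8.ValidString(joined), joined == "\u00DB") // true true
}
```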
## Prior Art
[mattn/go-runewidth](https://github.com/mattn/go-runewidth)
[rivo/uniseg](https://github.com/rivo/uniseg)
[x/text/width](https://pkg.go.dev/golang.org/x/text/width)
[x/text/internal/triegen](https://pkg.go.dev/golang.org/x/text/internal/triegen)
## Benchmarks
```bash
cd comparison
go test -bench=. -benchmem
```
```
goos: darwin
goarch: arm64
pkg: github.com/clipperhouse/displaywidth/comparison
cpu: Apple M2
BenchmarkString_Mixed/clipperhouse/displaywidth-8 5784 ns/op 291.69 MB/s 0 B/op 0 allocs/op
BenchmarkString_Mixed/mattn/go-runewidth-8 14751 ns/op 114.36 MB/s 0 B/op 0 allocs/op
BenchmarkString_Mixed/rivo/uniseg-8 19360 ns/op 87.14 MB/s 0 B/op 0 allocs/op
BenchmarkString_ASCII/clipperhouse/displaywidth-8 54.60 ns/op 2344.32 MB/s 0 B/op 0 allocs/op
BenchmarkString_ASCII/mattn/go-runewidth-8 1195 ns/op 107.08 MB/s 0 B/op 0 allocs/op
BenchmarkString_ASCII/rivo/uniseg-8 1578 ns/op 81.13 MB/s 0 B/op 0 allocs/op
BenchmarkString_EastAsian/clipperhouse/displaywidth-8 5837 ns/op 289.01 MB/s 0 B/op 0 allocs/op
BenchmarkString_EastAsian/mattn/go-runewidth-8 24418 ns/op 69.09 MB/s 0 B/op 0 allocs/op
BenchmarkString_EastAsian/rivo/uniseg-8 19339 ns/op 87.23 MB/s 0 B/op 0 allocs/op
BenchmarkString_Emoji/clipperhouse/displaywidth-8 3225 ns/op 224.51 MB/s 0 B/op 0 allocs/op
BenchmarkString_Emoji/mattn/go-runewidth-8 4851 ns/op 149.25 MB/s 0 B/op 0 allocs/op
BenchmarkString_Emoji/rivo/uniseg-8 6591 ns/op 109.85 MB/s 0 B/op 0 allocs/op
BenchmarkRune_Mixed/clipperhouse/displaywidth-8 3385 ns/op 498.34 MB/s 0 B/op 0 allocs/op
BenchmarkRune_Mixed/mattn/go-runewidth-8 5354 ns/op 315.07 MB/s 0 B/op 0 allocs/op
BenchmarkRune_EastAsian/clipperhouse/displaywidth-8 3397 ns/op 496.56 MB/s 0 B/op 0 allocs/op
BenchmarkRune_EastAsian/mattn/go-runewidth-8 15673 ns/op 107.64 MB/s 0 B/op 0 allocs/op
BenchmarkRune_ASCII/clipperhouse/displaywidth-8 255.7 ns/op 500.53 MB/s 0 B/op 0 allocs/op
BenchmarkRune_ASCII/mattn/go-runewidth-8 261.5 ns/op 489.55 MB/s 0 B/op 0 allocs/op
BenchmarkRune_Emoji/clipperhouse/displaywidth-8 1371 ns/op 528.22 MB/s 0 B/op 0 allocs/op
BenchmarkRune_Emoji/mattn/go-runewidth-8 2267 ns/op 319.43 MB/s 0 B/op 0 allocs/op
BenchmarkTruncateWithTail/clipperhouse/displaywidth-8 3229 ns/op 54.82 MB/s 192 B/op 14 allocs/op
BenchmarkTruncateWithTail/mattn/go-runewidth-8 8408 ns/op 21.05 MB/s 192 B/op 14 allocs/op
BenchmarkTruncateWithoutTail/clipperhouse/displaywidth-8 3554 ns/op 64.43 MB/s 0 B/op 0 allocs/op
BenchmarkTruncateWithoutTail/mattn/go-runewidth-8 11189 ns/op 20.47 MB/s 0 B/op 0 allocs/op
```
Here are some notes on [how to make Unicode things fast](https://clipperhouse.com/go-unicode/).
+3
@@ -0,0 +1,3 @@
package displaywidth
//go:generate go run -C internal/gen .
+73
@@ -0,0 +1,73 @@
package displaywidth
import (
"github.com/clipperhouse/uax29/v2/graphemes"
)
// Graphemes is an iterator over grapheme clusters.
//
// Iterate using the Next method, and get the width of the current grapheme
// using the Width method.
type Graphemes[T ~string | []byte] struct {
iter *graphemes.Iterator[T]
options Options
}
// Next advances the iterator to the next grapheme cluster.
func (g *Graphemes[T]) Next() bool {
return g.iter.Next()
}
// Value returns the current grapheme cluster.
func (g *Graphemes[T]) Value() T {
return g.iter.Value()
}
// Width returns the display width of the current grapheme cluster.
func (g *Graphemes[T]) Width() int {
return graphemeWidth(g.Value(), g.options)
}
// StringGraphemes returns an iterator over grapheme clusters for the given
// string.
//
// Iterate using the Next method, and get the width of the current grapheme
// using the Width method.
func StringGraphemes(s string) Graphemes[string] {
return DefaultOptions.StringGraphemes(s)
}
// StringGraphemes returns an iterator over grapheme clusters for the given
// string, with the given options.
//
// Iterate using the Next method, and get the width of the current grapheme
// using the Width method.
func (options Options) StringGraphemes(s string) Graphemes[string] {
g := graphemes.FromString(s)
g.AnsiEscapeSequences = options.ControlSequences
g.AnsiEscapeSequences8Bit = options.ControlSequences8Bit
return Graphemes[string]{iter: g, options: options}
}
// BytesGraphemes returns an iterator over grapheme clusters for the given
// []byte.
//
// Iterate using the Next method, and get the width of the current grapheme
// using the Width method.
func BytesGraphemes(s []byte) Graphemes[[]byte] {
return DefaultOptions.BytesGraphemes(s)
}
// BytesGraphemes returns an iterator over grapheme clusters for the given
// []byte, with the given options.
//
// Iterate using the Next method, and get the width of the current grapheme
// using the Width method.
func (options Options) BytesGraphemes(s []byte) Graphemes[[]byte] {
g := graphemes.FromBytes(s)
g.AnsiEscapeSequences = options.ControlSequences
g.AnsiEscapeSequences8Bit = options.ControlSequences8Bit
return Graphemes[[]byte]{iter: g, options: options}
}
+30
@@ -0,0 +1,30 @@
package displaywidth
// Options allows you to specify the treatment of ambiguous East Asian
// characters and ANSI escape sequences.
type Options struct {
// EastAsianWidth specifies whether to treat ambiguous East Asian characters
// as width 1 or 2. When false (default), ambiguous East Asian characters
// are treated as width 1. When true, they are width 2.
EastAsianWidth bool
// ControlSequences specifies whether to ignore 7-bit ECMA-48 escape sequences
// when calculating the display width. When false (default), ANSI escape
// sequences are treated as just a series of characters. When true, they are
// treated as a single zero-width unit.
ControlSequences bool
// ControlSequences8Bit specifies whether to ignore 8-bit ECMA-48 escape sequences
// when calculating the display width. When false (default), these are treated
// as just a series of characters. When true, they are treated as a single
// zero-width unit.
ControlSequences8Bit bool
}
// DefaultOptions is the default options for the display width
// calculation, which is EastAsianWidth false, ControlSequences false, and
// ControlSequences8Bit false.
var DefaultOptions = Options{
EastAsianWidth: false,
ControlSequences: false,
ControlSequences8Bit: false,
}
File diff suppressed because it is too large.
+149
@@ -0,0 +1,149 @@
package displaywidth
import (
"strings"
"github.com/clipperhouse/uax29/v2/graphemes"
)
// TruncateString truncates a string to the given maxWidth, and appends the
// given tail if the string is truncated.
//
// It ensures the visible width, including the width of the tail, is less than or
// equal to maxWidth.
//
// When [Options.ControlSequences] is true, 7-bit ANSI escape sequences that
// appear after the truncation point are preserved in the output. This ensures
// that escape sequences such as SGR resets are not lost, preventing color
// bleed in terminal output.
//
// [Options.ControlSequences8Bit] is ignored by truncation. 8-bit C1 byte values
// (0x80-0x9F) overlap with UTF-8 multi-byte encoding, so manipulating them
// during truncation can shift byte boundaries and form unintended visible
// characters. Use [Options.String] or [Options.Bytes] for 8-bit-aware width
// measurement.
func (options Options) TruncateString(s string, maxWidth int, tail string) string {
// We deliberately ignore ControlSequences8Bit for truncation, see above.
options.ControlSequences8Bit = false
maxWidthWithoutTail := maxWidth - options.String(tail)
var pos, total int
g := graphemes.FromString(s)
g.AnsiEscapeSequences = options.ControlSequences
for g.Next() {
gw := graphemeWidth(g.Value(), options)
if total+gw <= maxWidthWithoutTail {
pos = g.End()
}
total += gw
if total > maxWidth {
if options.ControlSequences {
// Build result with trailing 7-bit ANSI escape sequences preserved
var b strings.Builder
b.Grow(len(s) + len(tail)) // at most original + tail
b.WriteString(s[:pos])
b.WriteString(tail)
rem := graphemes.FromString(s[pos:])
rem.AnsiEscapeSequences = options.ControlSequences
for rem.Next() {
v := rem.Value()
// Only preserve 7-bit escapes (ESC = 0x1B) that measure
// as zero-width on their own; some sequences (e.g. SOS)
// are only valid in their original context.
if len(v) > 0 && v[0] == 0x1B && options.String(v) == 0 {
b.WriteString(v)
}
}
return b.String()
}
return s[:pos] + tail
}
}
// No truncation
return s
}
// TruncateString truncates a string to the given maxWidth, and appends the
// given tail if the string is truncated.
//
// It ensures the total width, including the width of the tail, is less than or
// equal to maxWidth.
func TruncateString(s string, maxWidth int, tail string) string {
return DefaultOptions.TruncateString(s, maxWidth, tail)
}
// TruncateBytes truncates a []byte to the given maxWidth, and appends the
// given tail if the []byte is truncated.
//
// It ensures the visible width, including the width of the tail, is less than or
// equal to maxWidth.
//
// When [Options.ControlSequences] is true, 7-bit ANSI escape sequences that
// appear after the truncation point are preserved in the output. This ensures
// that escape sequences such as SGR resets are not lost, preventing color
// bleed in terminal output.
//
// [Options.ControlSequences8Bit] is ignored by truncation. 8-bit C1 byte values
// (0x80-0x9F) overlap with UTF-8 multi-byte encoding, so manipulating them
// during truncation can shift byte boundaries and form unintended visible
// characters. Use [Options.String] or [Options.Bytes] for 8-bit-aware width
// measurement.
func (options Options) TruncateBytes(s []byte, maxWidth int, tail []byte) []byte {
// We deliberately ignore ControlSequences8Bit for truncation, see above.
options.ControlSequences8Bit = false
maxWidthWithoutTail := maxWidth - options.Bytes(tail)
var pos, total int
g := graphemes.FromBytes(s)
g.AnsiEscapeSequences = options.ControlSequences
for g.Next() {
gw := graphemeWidth(g.Value(), options)
if total+gw <= maxWidthWithoutTail {
pos = g.End()
}
total += gw
if total > maxWidth {
if options.ControlSequences {
// Build result with trailing 7-bit ANSI escape sequences preserved
result := make([]byte, 0, len(s)+len(tail)) // at most original + tail
result = append(result, s[:pos]...)
result = append(result, tail...)
rem := graphemes.FromBytes(s[pos:])
rem.AnsiEscapeSequences = options.ControlSequences
for rem.Next() {
v := rem.Value()
// Only preserve 7-bit escapes (ESC = 0x1B) that measure
// as zero-width on their own; some sequences (e.g. SOS)
// are only valid in their original context.
if len(v) > 0 && v[0] == 0x1B && options.Bytes(v) == 0 {
result = append(result, v...)
}
}
return result
}
result := make([]byte, 0, pos+len(tail))
result = append(result, s[:pos]...)
result = append(result, tail...)
return result
}
}
// No truncation
return s
}
// TruncateBytes truncates a []byte to the given maxWidth, and appends the
// given tail if the []byte is truncated.
//
// It ensures the total width, including the width of the tail, is less than or
// equal to maxWidth.
func TruncateBytes(s []byte, maxWidth int, tail []byte) []byte {
return DefaultOptions.TruncateBytes(s, maxWidth, tail)
}
+239
@@ -0,0 +1,239 @@
package displaywidth
import (
"unicode/utf8"
"github.com/clipperhouse/uax29/v2/graphemes"
)
// String calculates the display width of a string,
// by iterating over grapheme clusters in the string
// and summing their widths.
func String(s string) int {
return DefaultOptions.String(s)
}
// String calculates the display width of a string, for the given options, by
// iterating over grapheme clusters in the string and summing their widths.
func (options Options) String(s string) int {
width := 0
pos := 0
for pos < len(s) {
// Try ASCII optimization
asciiLen := printableASCIILength(s[pos:])
if asciiLen > 0 {
width += asciiLen
pos += asciiLen
continue
}
// Not ASCII, use grapheme parsing
g := graphemes.FromString(s[pos:])
g.AnsiEscapeSequences = options.ControlSequences
g.AnsiEscapeSequences8Bit = options.ControlSequences8Bit
start := pos
for g.Next() {
v := g.Value()
width += graphemeWidth(v, options)
pos += len(v)
// Quick check: if remaining might have printable ASCII, break to outer loop
if pos < len(s) && s[pos] >= 0x20 && s[pos] <= 0x7E {
break
}
}
// Defensive, should not happen: if no progress was made,
// skip a byte to prevent infinite loop. Only applies if
// the grapheme parser misbehaves.
if pos == start {
pos++
}
}
return width
}
// Bytes calculates the display width of a []byte,
// by iterating over grapheme clusters in the byte slice
// and summing their widths.
func Bytes(s []byte) int {
return DefaultOptions.Bytes(s)
}
// Bytes calculates the display width of a []byte, for the given options, by
// iterating over grapheme clusters in the slice and summing their widths.
func (options Options) Bytes(s []byte) int {
width := 0
pos := 0
for pos < len(s) {
// Try ASCII optimization
asciiLen := printableASCIILength(s[pos:])
if asciiLen > 0 {
width += asciiLen
pos += asciiLen
continue
}
// Not ASCII, use grapheme parsing
g := graphemes.FromBytes(s[pos:])
g.AnsiEscapeSequences = options.ControlSequences
g.AnsiEscapeSequences8Bit = options.ControlSequences8Bit
start := pos
for g.Next() {
v := g.Value()
width += graphemeWidth(v, options)
pos += len(v)
// Quick check: if remaining might have printable ASCII, break to outer loop
if pos < len(s) && s[pos] >= 0x20 && s[pos] <= 0x7E {
break
}
}
// Defensive, should not happen: if no progress was made,
// skip a byte to prevent infinite loop. Only applies if
// the grapheme parser misbehaves.
if pos == start {
pos++
}
}
return width
}
// Rune calculates the display width of a rune. You
// should almost certainly use [String] or [Bytes] for
// most purposes.
//
// The smallest unit of display width is a grapheme
// cluster, not a rune. Iterating over runes to measure
// width is incorrect in many cases.
func Rune(r rune) int {
return DefaultOptions.Rune(r)
}
// Rune calculates the display width of a rune, for the given options.
//
// You should almost certainly use [String] or [Bytes] for most purposes.
//
// The smallest unit of display width is a grapheme cluster, not a rune.
// Iterating over runes to measure width is incorrect in many cases.
func (options Options) Rune(r rune) int {
if r < utf8.RuneSelf {
return asciiWidth(byte(r))
}
// Surrogates (U+D800-U+DFFF) are invalid UTF-8.
if r >= 0xD800 && r <= 0xDFFF {
return 0
}
var buf [4]byte
n := utf8.EncodeRune(buf[:], r)
// Skip the grapheme iterator
return graphemeWidth(buf[:n], options)
}
const _Default property = 0
// graphemeWidth returns the display width of a grapheme cluster.
// The passed string must be a single grapheme cluster.
func graphemeWidth[T ~string | []byte](s T, options Options) int {
if len(s) == 0 {
return 0
}
// C1 controls (0x80-0x9F) are zero-width when 8-bit control sequences
// are enabled. This must be checked before the single-byte optimization
// below, which would otherwise return width 1 for these bytes.
if options.ControlSequences8Bit && s[0] >= 0x80 && s[0] <= 0x9F {
return 0
}
// Optimization: single-byte graphemes need no property lookup
if len(s) == 1 {
return asciiWidth(s[0])
}
// Multi-byte grapheme clusters led by a C0 control (0x00-0x1F)
if s[0] <= 0x1F {
return 0
}
p, sz := lookup(s)
prop := property(p)
// Variation Selector 16 (VS16) requests emoji presentation
if prop != _Wide && sz > 0 && len(s) >= sz+3 {
vs := s[sz : sz+3]
if isVS16(vs) {
prop = _Wide
}
// VS15 (U+FE0E, UTF-8 EF B8 8E) requests text presentation but does not
// affect width, in my reading of Unicode TR51. Falls through to return the
// base character's property.
}
if options.EastAsianWidth && prop == _East_Asian_Ambiguous {
prop = _Wide
}
if prop > upperBound {
prop = _Default
}
return propertyWidths[prop]
}
func asciiWidth(b byte) int {
if b <= 0x1F || b == 0x7F {
return 0
}
return 1
}
// printableASCIILength returns the length of consecutive printable ASCII bytes
// starting at the beginning of s.
func printableASCIILength[T string | []byte](s T) int {
i := 0
for ; i < len(s); i++ {
b := s[i]
// Printable ASCII is 0x20-0x7E (space through tilde)
if b < 0x20 || b > 0x7E {
break
}
}
// If the next byte is non-ASCII (>= 0x80), back off by 1. The grapheme
// parser may group the last ASCII byte with subsequent non-ASCII bytes,
// such as combining marks.
if i > 0 && i < len(s) && s[i] >= 0x80 {
i--
}
return i
}
// isVS16 checks if the slice matches VS16 (U+FE0F) UTF-8 encoding
// (EF B8 8F). It assumes len(s) >= 3.
func isVS16[T ~string | []byte](s T) bool {
return s[0] == 0xEF && s[1] == 0xB8 && s[2] == 0x8F
}
// propertyWidths is a jump table of sorts, instead of a switch
var propertyWidths = [4]int{
_Default: 1,
_Zero_Width: 0,
_Wide: 2,
_East_Asian_Ambiguous: 1,
}
const upperBound = property(len(propertyWidths) - 1)
+21
@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2020 Matt Sherman
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
+120
@@ -0,0 +1,120 @@
An implementation of grapheme cluster boundaries from [Unicode text segmentation](https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) (UAX 29), for Unicode 17.
[![Documentation](https://pkg.go.dev/badge/github.com/clipperhouse/uax29/v2/graphemes.svg)](https://pkg.go.dev/github.com/clipperhouse/uax29/v2/graphemes)
![Tests](https://github.com/clipperhouse/uax29/actions/workflows/gotest.yml/badge.svg)
![Fuzz](https://github.com/clipperhouse/uax29/actions/workflows/gofuzz.yml/badge.svg)
## Quick start
```
go get github.com/clipperhouse/uax29/v2/graphemes
```
```go
import "github.com/clipperhouse/uax29/v2/graphemes"
text := "Hello, 世界. Nice dog! 👍🐶"
g := graphemes.FromString(text)
for g.Next() { // Next() returns true until end of data
fmt.Println(g.Value()) // Do something with the current grapheme
}
```
_A grapheme is a “single visible character”, which might be as simple as a single letter, or a complex emoji that consists of several Unicode code points._
## Conformance
We use the Unicode [test suite](https://unicode.org/reports/tr41/tr41-36.html#Tests29).
![Tests](https://github.com/clipperhouse/uax29/actions/workflows/gotest.yml/badge.svg)
![Fuzz](https://github.com/clipperhouse/uax29/actions/workflows/gofuzz.yml/badge.svg)
## APIs
### If you have a `string`
```go
text := "Hello, 世界. Nice dog! 👍🐶"
g := graphemes.FromString(text)
for g.Next() { // Next() returns true until end of data
fmt.Println(g.Value()) // Do something with the current grapheme
}
```
### If you have an `io.Reader`
`FromReader` embeds a [`bufio.Scanner`](https://pkg.go.dev/bufio#Scanner), so just use those methods.
```go
r := getYourReader() // from a file or network maybe
g := graphemes.FromReader(r)
for g.Scan() { // Scan() returns true until error or EOF
	fmt.Println(g.Text()) // Do something with the current grapheme
}
if g.Err() != nil { // Check the error
	log.Fatal(g.Err())
}
```
### If you have a `[]byte`
```go
b := []byte("Hello, 世界. Nice dog! 👍🐶")
g := graphemes.FromBytes(b)
for g.Next() { // Next() returns true until end of data
	fmt.Println(g.Value()) // Do something with the current grapheme
}
```
### ANSI escape sequences
By the UAX 29 specification, ANSI escape sequences are not grapheme clusters. To treat 7-bit ANSI escape sequences as a single cluster, set `AnsiEscapeSequences` to true.
```go
text := "Hello, \x1b[31mworld\x1b[0m!"
g := graphemes.FromString(text)
g.AnsiEscapeSequences = true
for g.Next() {
	fmt.Println(g.Value())
}
```
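
For intuition about what gets grouped, here is a simplified, standalone sketch of recognizing a 7-bit CSI sequence (ESC `[`, parameter bytes, intermediate bytes, one final byte). It is an illustration under our own assumptions, not this package's implementation: the names `csiLength` and `splitCSI` are hypothetical, only CSI is handled, and the bytes between escape sequences are returned as whole runs rather than as individual graphemes.

```go
package main

import "fmt"

// csiLength returns the byte length of a complete 7-bit CSI sequence at the
// start of s, or 0 if s does not begin with one.
func csiLength(s string) int {
	if len(s) < 3 || s[0] != 0x1B || s[1] != '[' {
		return 0
	}
	i := 2
	for i < len(s) && s[i] >= 0x30 && s[i] <= 0x3F { // parameter bytes
		i++
	}
	for i < len(s) && s[i] >= 0x20 && s[i] <= 0x2F { // intermediate bytes
		i++
	}
	if i < len(s) && s[i] >= 0x40 && s[i] <= 0x7E { // final byte
		return i + 1
	}
	return 0
}

// splitCSI splits text into CSI sequences and the runs of bytes between them.
func splitCSI(text string) []string {
	var parts []string
	start := 0
	for i := 0; i < len(text); {
		if n := csiLength(text[i:]); n > 0 {
			if start < i {
				parts = append(parts, text[start:i])
			}
			parts = append(parts, text[i:i+n])
			i += n
			start = i
			continue
		}
		i++
	}
	if start < len(text) {
		parts = append(parts, text[start:])
	}
	return parts
}

func main() {
	for _, p := range splitCSI("Hello, \x1b[31mworld\x1b[0m!") {
		fmt.Printf("%q\n", p)
	}
}
```

With `AnsiEscapeSequences` set, the iterator is meant to return each such escape sequence as one value, which makes it easy to skip them when measuring display width.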
To also parse 8-bit C1 controls (non-UTF-8 bytes), set `AnsiEscapeSequences8Bit` to true.
```go
g.AnsiEscapeSequences = true // 7-bit forms (ESC ...)
g.AnsiEscapeSequences8Bit = true // 8-bit C1 forms (0x80-0x9F), not valid UTF-8
```
For ESC-initiated (7-bit) control strings, only 7-bit terminators are recognized.
For C1-initiated (8-bit) control strings, only C1 ST (`0x9C`) is recognized as ST.
We implement [ECMA-48](https://ecma-international.org/publications-and-standards/standards/ecma-48/) control codes in both 7-bit and 8-bit representations. 8-bit control codes are raw bytes, not UTF-8 encoded, so input containing them is not valid UTF-8; caveat emptor.
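
The following sketch, using only the standard library, illustrates why 8-bit C1 input is not valid UTF-8: a raw `0x9B` byte is a bare continuation byte, whereas the code point U+009B encoded as UTF-8 is the two bytes `0xC2 0x9B`. The helper names are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// rawC1CSI returns an 8-bit C1 CSI byte (0x9B) followed by "31m".
func rawC1CSI() []byte { return []byte{0x9B, '3', '1', 'm'} }

// utf8C1CSI returns the code point U+009B encoded as UTF-8, followed by "31m".
func utf8C1CSI() []byte { return []byte("\u009B31m") }

// isValidUTF8 reports whether b consists entirely of valid UTF-8.
func isValidUTF8(b []byte) bool { return utf8.Valid(b) }

func main() {
	fmt.Printf("raw:     % X valid=%v\n", rawC1CSI(), isValidUTF8(rawC1CSI()))
	fmt.Printf("encoded: % X valid=%v\n", utf8C1CSI(), isValidUTF8(utf8C1CSI()))
}
```

Only the raw single-byte form is what `AnsiEscapeSequences8Bit` is described as parsing, so enable it only when you know your input carries such bytes.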
### Benchmarks
```
goos: darwin
goarch: arm64
pkg: github.com/clipperhouse/uax29/graphemes/comparative
cpu: Apple M2
BenchmarkGraphemesMixed/clipperhouse/uax29-8 142635 ns/op 245.12 MB/s 0 B/op 0 allocs/op
BenchmarkGraphemesMixed/rivo/uniseg-8 2018284 ns/op 17.32 MB/s 0 B/op 0 allocs/op
BenchmarkGraphemesASCII/clipperhouse/uax29-8 8846 ns/op 508.73 MB/s 0 B/op 0 allocs/op
BenchmarkGraphemesASCII/rivo/uniseg-8 366760 ns/op 12.27 MB/s 0 B/op 0 allocs/op
```
### Invalid inputs
Invalid UTF-8 input is considered undefined behavior. We test to ensure that bad inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.
Your pipeline should probably include a call to [`utf8.Valid()`](https://pkg.go.dev/unicode/utf8#Valid).
+138
@@ -0,0 +1,138 @@
package graphemes

// ansiEscapeLength returns the byte length of a valid 7-bit ANSI escape
// sequence at the start of data, or 0 if none.
//
// Recognized forms (ECMA-48 / ISO 6429):
//   - CSI: ESC [ then parameter bytes (0x30-0x3F), intermediate (0x20-0x2F), final (0x40-0x7E)
//   - OSC: ESC ] then payload until BEL (0x07), 7-bit ST (ESC \), CAN (0x18), or SUB (0x1A)
//   - DCS, SOS, PM, APC: ESC P/X/^/_ then payload until 7-bit ST (ESC \), CAN, or SUB
//   - Two-byte: ESC + Fe/Fs (0x40-0x7E excluding above), or Fp (0x30-0x3F)
//   - nF: ESC, one or more intermediates (0x20-0x2F), then one final (0x30-0x7E)
func ansiEscapeLength[T ~string | ~[]byte](data T) int {
	n := len(data)
	if n < 2 || data[0] != esc {
		return 0
	}

	b1 := data[1]
	switch b1 {
	case '[': // CSI
		body := csiBodyLength(data[2:])
		if body == 0 {
			return 0
		}
		return 2 + body
	case ']': // OSC - allows BEL or 7-bit ST terminator
		body := oscLength(data[2:])
		if body < 0 {
			return 0
		}
		return 2 + body
	case 'P', 'X', '^', '_': // DCS, SOS, PM, APC
		body := stSequenceLength(data[2:])
		if body < 0 {
			return 0
		}
		return 2 + body
	}

	if b1 >= 0x40 && b1 <= 0x7E {
		// Fe/Fs two-byte; [ ] P X ^ _ handled above
		return 2
	}
	if b1 >= 0x30 && b1 <= 0x3F {
		// Fp (private) two-byte
		return 2
	}
	if b1 >= 0x20 && b1 <= 0x2F {
		// nF: intermediates then one final (0x30-0x7E)
		i := 2
		for i < n && data[i] >= 0x20 && data[i] <= 0x2F {
			i++
		}
		if i < n && data[i] >= 0x30 && data[i] <= 0x7E {
			return i + 1
		}
		return 0
	}

	return 0
}

// csiBodyLength returns the length of the CSI body (param/intermediate/final bytes).
// data is the slice after "ESC [".
// Per ECMA-48, the CSI body has the form:
//
//	parameters (0x30–0x3F)*, intermediates (0x20–0x2F)*, final (0x40–0x7E)
//
// Once an intermediate byte is seen, subsequent parameter bytes are invalid.
func csiBodyLength[T ~string | ~[]byte](data T) int {
	seenIntermediate := false
	for i := 0; i < len(data); i++ {
		b := data[i]
		if b >= 0x30 && b <= 0x3F {
			if seenIntermediate {
				return 0
			}
			continue
		}
		if b >= 0x20 && b <= 0x2F {
			seenIntermediate = true
			continue
		}
		if b >= 0x40 && b <= 0x7E {
			return i + 1
		}
		return 0
	}
	return 0
}

// oscLength returns the length of the OSC body.
// data is the slice after "ESC ]".
//
// Returns:
//   - n >= 0: consumed body length (includes BEL/ST terminator when present)
//   - -1: not terminated in the provided data
//
// OSC accepts BEL (0x07) or 7-bit ST (ESC \) as terminators by widespread convention.
// Per ECMA-48, CAN (0x18) and SUB (0x1A) cancel the control string; in that
// case they are not part of the OSC sequence length.
func oscLength[T ~string | ~[]byte](data T) int {
	for i := 0; i < len(data); i++ {
		b := data[i]
		if b == bel {
			return i + 1
		}
		if b == can || b == sub {
			return i
		}
		if b == esc && i+1 < len(data) && data[i+1] == '\\' {
			return i + 2
		}
	}
	return -1
}

// stSequenceLength returns the length of a control-string body.
// data is the slice after the two-byte initiator (ESC P, ESC X, ESC ^, or ESC _).
//
// Returns:
//   - n >= 0: consumed body length (includes ST terminator when present)
//   - -1: not terminated in the provided data
//
// Used for DCS, SOS, PM, and APC, which per ECMA-48 terminate with ST.
// ST here is the 7-bit form (ESC \).
// CAN (0x18) and SUB (0x1A) cancel the control string; in that case they are
// not part of the sequence length.
func stSequenceLength[T ~string | ~[]byte](data T) int {
	for i := 0; i < len(data); i++ {
		if data[i] == can || data[i] == sub {
			return i
		}
		if data[i] == esc && i+1 < len(data) && data[i+1] == '\\' {
			return i + 2
		}
	}
	return -1
}
+79
@@ -0,0 +1,79 @@
package graphemes

// ansiEscapeLength8Bit returns the byte length of a valid 8-bit C1 ANSI
// sequence at the start of data, or 0 if none.
//
// Recognized forms (ECMA-48 / ISO 6429):
//   - C1 CSI (0x9B) body as parameter/intermediate/final bytes
//   - C1 OSC (0x9D) body terminated by BEL, C1 ST, CAN, or SUB
//   - C1 DCS/SOS/PM/APC (0x90/0x98/0x9E/0x9F) body terminated by C1 ST, CAN, or SUB
//   - Standalone C1 controls (0x80..0x9F not listed above): single byte
func ansiEscapeLength8Bit[T ~string | ~[]byte](data T) int {
	if len(data) == 0 {
		return 0
	}

	switch data[0] {
	case 0x9B: // C1 CSI
		body := csiBodyLength(data[1:])
		if body == 0 {
			return 0
		}
		return 1 + body
	case 0x9D: // C1 OSC
		body := oscLengthC1(data[1:])
		if body < 0 {
			return 0
		}
		return 1 + body
	case 0x90, 0x98, 0x9E, 0x9F: // C1 DCS, SOS, PM, APC
		body := stSequenceLengthC1(data[1:])
		if body < 0 {
			return 0
		}
		return 1 + body
	default:
		if data[0] >= 0x80 && data[0] <= 0x9F {
			return 1
		}
	}

	return 0
}

// oscLengthC1 returns the length of a C1 OSC body.
// data is the slice after the C1 OSC initiator (0x9D).
//
// Returns:
//   - n >= 0: consumed body length (includes BEL/ST terminator when present)
//   - -1: not terminated in the provided data
//
// Terminators: BEL (0x07) or C1 ST (0x9C).
// CAN (0x18) and SUB (0x1A) cancel the control string.
func oscLengthC1[T ~string | ~[]byte](data T) int {
	for i := 0; i < len(data); i++ {
		b := data[i]
		if b == bel || b == st {
			return i + 1
		}
		if b == can || b == sub {
			return i
		}
	}
	return -1
}

// stSequenceLengthC1 parses DCS/SOS/PM/APC bodies that terminate with C1 ST
// (0x9C), or are canceled by CAN/SUB.
func stSequenceLengthC1[T ~string | ~[]byte](data T) int {
	for i := 0; i < len(data); i++ {
		b := data[i]
		if b == can || b == sub {
			return i
		}
		if b == st {
			return i + 1
		}
	}
	return -1
}
+144
@@ -0,0 +1,144 @@
package graphemes

import "unicode/utf8"

// FromString returns an iterator for the grapheme clusters in the input string.
// Iterate while Next() is true, and access the grapheme via Value().
func FromString(s string) *Iterator[string] {
	return &Iterator[string]{
		split: splitFuncString,
		data:  s,
	}
}

// FromBytes returns an iterator for the grapheme clusters in the input bytes.
// Iterate while Next() is true, and access the grapheme via Value().
func FromBytes(b []byte) *Iterator[[]byte] {
	return &Iterator[[]byte]{
		split: splitFuncBytes,
		data:  b,
	}
}

// Iterator is a generic iterator for grapheme clusters in strings or byte slices,
// with an ASCII hot path optimization.
type Iterator[T ~string | ~[]byte] struct {
	split func(T, bool) (int, T, error)
	data  T
	pos   int
	start int

	// AnsiEscapeSequences treats 7-bit ANSI escape sequences (ECMA-48) as
	// single grapheme clusters when true. The default is false.
	//
	// 8-bit controls are not enabled by this option. See [AnsiEscapeSequences8Bit].
	AnsiEscapeSequences bool

	// AnsiEscapeSequences8Bit treats 8-bit C1 ANSI escape sequences (ECMA-48) as single
	// grapheme clusters when true. The default is false.
	//
	// 8-bit control bytes are not UTF-8 encoded, i.e. not valid UTF-8. If you
	// choose this option, you are choosing to interpret non-UTF-8 data, caveat
	// emptor.
	AnsiEscapeSequences8Bit bool
}

var (
	splitFuncString = splitFunc[string]
	splitFuncBytes  = splitFunc[[]byte]
)

const (
	esc = 0x1B
	cr  = 0x0D
	bel = 0x07
	can = 0x18
	sub = 0x1A
	st  = 0x9C
)

// Next advances the iterator to the next grapheme cluster.
// Returns false when there are no more grapheme clusters.
func (iter *Iterator[T]) Next() bool {
	if iter.pos >= len(iter.data) {
		return false
	}
	iter.start = iter.pos

	b := iter.data[iter.pos]
	if iter.AnsiEscapeSequences && b == esc {
		if a := ansiEscapeLength(iter.data[iter.pos:]); a > 0 {
			iter.pos += a
			return true
		}
	}
	if iter.AnsiEscapeSequences8Bit && b >= 0x80 && b <= 0x9F {
		if a := ansiEscapeLength8Bit(iter.data[iter.pos:]); a > 0 {
			iter.pos += a
			return true
		}
	}

	// ASCII hot path: any ASCII is one grapheme when next byte is ASCII or end.
	if b < utf8.RuneSelf && b != cr {
		if iter.pos+1 >= len(iter.data) || iter.data[iter.pos+1] < utf8.RuneSelf {
			iter.pos++
			return true
		}
	}

	// Fall back to UAX29 grapheme parsing
	remaining := iter.data[iter.pos:]
	advance, _, err := iter.split(remaining, true)
	if err != nil {
		panic(err)
	}
	if advance <= 0 {
		panic("splitFunc returned a zero or negative advance")
	}
	iter.pos += advance
	if iter.pos > len(iter.data) {
		panic("splitFunc advanced beyond end of data")
	}
	return true
}

// Value returns the current grapheme cluster.
func (iter *Iterator[T]) Value() T {
	return iter.data[iter.start:iter.pos]
}

// Start returns the byte position of the current grapheme in the original data.
func (iter *Iterator[T]) Start() int {
	return iter.start
}

// End returns the byte position after the current grapheme in the original data.
func (iter *Iterator[T]) End() int {
	return iter.pos
}

// Reset resets the iterator to the beginning of the data.
func (iter *Iterator[T]) Reset() {
	iter.start = 0
	iter.pos = 0
}

// SetText sets the data for the iterator to operate on, and resets all state.
func (iter *Iterator[T]) SetText(data T) {
	iter.data = data
	iter.start = 0
	iter.pos = 0
}

// First returns the first grapheme cluster without advancing the iterator.
func (iter *Iterator[T]) First() T {
	if len(iter.data) == 0 {
		return iter.data
	}
	// Use a copy to leverage Next()'s ASCII optimization
	cp := *iter
	cp.pos = 0
	cp.start = 0
	cp.Next()
	return cp.Value()
}
+25
@@ -0,0 +1,25 @@
// Package graphemes implements Unicode grapheme cluster boundaries: https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
package graphemes

import (
	"bufio"
	"io"
)

// Scanner wraps a [bufio.Scanner] that is configured to split its input
// into grapheme clusters.
type Scanner struct {
	*bufio.Scanner
}

// FromReader returns a Scanner, to split graphemes per
// https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries.
//
// It embeds a [bufio.Scanner], so you can use its methods.
//
// Iterate through graphemes by calling Scan() until false, then check Err().
func FromReader(r io.Reader) *Scanner {
	sc := bufio.NewScanner(r)
	sc.Split(SplitFunc)
	return &Scanner{
		Scanner: sc,
	}
}
+205
@@ -0,0 +1,205 @@
package graphemes

import (
	"bufio"
)

// is reports whether lookup intersects any of the given properties.
func (lookup property) is(properties property) bool {
	return (lookup & properties) != 0
}

const _Ignore = _Extend

// incbState tracks state for GB9c rule (Indic conjunct clusters)
// Pattern: Consonant (Extend|Linker)* Linker (Extend|Linker)* × Consonant
type incbState int

const (
	incbNone      incbState = iota // initial/reset
	incbConsonant                  // seen Consonant, awaiting Linker
	incbLinker                     // seen Consonant and Linker (conjunct ready)
)

// SplitFunc is a bufio.SplitFunc implementation of Unicode grapheme cluster segmentation, for use with bufio.Scanner.
//
// See https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries.
var SplitFunc bufio.SplitFunc = splitFunc[[]byte]

func splitFunc[T ~string | ~[]byte](data T, atEOF bool) (advance int, token T, err error) {
	var empty T
	if len(data) == 0 {
		return 0, empty, nil
	}

	// These vars are stateful across loop iterations
	var pos int
	var lastExIgnore property = 0     // "last excluding ignored categories"
	var lastLastExIgnore property = 0 // "last one before that"
	var regionalIndicatorCount int

	// GB9c state: tracking Indic conjunct clusters
	var incb incbState

	// Rules are usually of the form Cat1 × Cat2; "current" refers to the first property
	// to the right of the ×, from which we look back or forward
	current, w := lookup(data[pos:])
	if w == 0 {
		if !atEOF {
			// Rune extends past current data, request more
			return 0, empty, nil
		}
		pos = len(data)
		return pos, data[:pos], nil
	}

	// https://unicode.org/reports/tr29/#GB1
	// Start of text always advances
	pos += w

	for {
		eot := pos == len(data) // "end of text"

		if eot {
			if !atEOF {
				// Token extends past current data, request more
				return 0, empty, nil
			}

			// https://unicode.org/reports/tr29/#GB2
			break
		}

		/*
			We've switched the evaluation order of GB1 and GB2. It's ok:
			because we've checked for len(data) at the top of this function,
			sot and eot are mutually exclusive, order doesn't matter.
		*/

		// Rules are usually of the form Cat1 × Cat2; "current" refers to the first property
		// to the right of the ×, from which we look back or forward

		// Remember previous properties to avoid lookups/lookbacks
		last := current
		if !last.is(_Ignore) {
			lastLastExIgnore = lastExIgnore
			lastExIgnore = last
		}

		// Update GB9c state based on what we just advanced past
		if last.is(_InCBConsonant | _InCBLinker | _InCBExtend) {
			switch {
			case last.is(_InCBConsonant):
				if incb != incbLinker {
					incb = incbConsonant
				}
			case last.is(_InCBLinker):
				if incb >= incbConsonant {
					incb = incbLinker
				}
				// case last.is(_InCBExtend): stay in current state
			}
		} else {
			incb = incbNone
		}

		current, w = lookup(data[pos:])
		if w == 0 {
			if atEOF {
				// Just return the bytes, we can't do anything with them
				pos = len(data)
				break
			}
			// Rune extends past current data, request more
			return 0, empty, nil
		}

		// Optimization: no rule can possibly apply
		if current|last == 0 { // i.e. both are zero
			break
		}

		// https://unicode.org/reports/tr29/#GB3
		if current.is(_LF) && last.is(_CR) {
			pos += w
			continue
		}

		// https://unicode.org/reports/tr29/#GB4
		// https://unicode.org/reports/tr29/#GB5
		if (current | last).is(_Control | _CR | _LF) {
			break
		}

		// https://unicode.org/reports/tr29/#GB6
		if current.is(_L|_V|_LV|_LVT) && last.is(_L) {
			pos += w
			continue
		}

		// https://unicode.org/reports/tr29/#GB7
		if current.is(_V|_T) && last.is(_LV|_V) {
			pos += w
			continue
		}

		// https://unicode.org/reports/tr29/#GB8
		if current.is(_T) && last.is(_LVT|_T) {
			pos += w
			continue
		}

		// https://unicode.org/reports/tr29/#GB9
		if current.is(_Extend | _ZWJ) {
			pos += w
			continue
		}

		// https://unicode.org/reports/tr29/#GB9a
		if current.is(_SpacingMark) {
			pos += w
			continue
		}

		// https://unicode.org/reports/tr29/#GB9b
		if last.is(_Prepend) {
			pos += w
			continue
		}

		// https://unicode.org/reports/tr29/#GB9c
		// Do not break within certain combinations with Indic_Conjunct_Break (InCB)=Linker.
		if incb == incbLinker && current.is(_InCBConsonant) {
			// After matching the pattern, reset state to start tracking a new pattern
			// The current Consonant becomes the start of the new pattern
			incb = incbConsonant
			pos += w
			continue
		}

		// https://unicode.org/reports/tr29/#GB11
		if current.is(_ExtendedPictographic) && last.is(_ZWJ) && lastLastExIgnore.is(_ExtendedPictographic) {
			pos += w
			continue
		}

		// https://unicode.org/reports/tr29/#GB12
		// https://unicode.org/reports/tr29/#GB13
		if (current & last).is(_RegionalIndicator) {
			regionalIndicatorCount++

			odd := regionalIndicatorCount%2 == 1
			if odd {
				pos += w
				continue
			}
		}

		// If we fall through all the above rules, it's a grapheme cluster break
		break
	}

	// Return token
	return pos, data[:pos], nil
}
File diff suppressed because it is too large