mirror of
https://github.com/jesseduffield/lazygit.git
synced 2026-05-22 10:15:43 +02:00
Bump tcell dependency to v3
@@ -0,0 +1,3 @@
.DS_Store
*.out
*.test
+51
@@ -0,0 +1,51 @@
The goals and overview of this package can be found in the README.md file;
start by reading that.

The goal of this package is to determine the display (column) width of a
string, UTF-8 bytes, or runes, as would happen in a monospace font, especially
in a terminal.

When troubleshooting, write Go unit tests instead of executing debug scripts.
The tests can return whatever logs or output you need. If those tests are
only for temporary troubleshooting, clean up the tests after the debugging is
done.

(Separate executable debugging scripts are messy, tend to have conflicting
dependencies, and are hard to clean up.)

If you make changes to the trie generation in internal/gen, you can rerun it
by running `go generate` from the top package directory.

## Pull Requests and branches

For PRs (pull requests), you can use the gh CLI tool. Compare the current branch with main. Reviewing a PR and reviewing a branch are about the same, but the PR may add context.

Understand the goals of the PR. Note any API changes, especially breaking changes.

Look for thoroughness of tests, as well as GoDoc comments.

Retrieve and consider the comments on the PR, which may have come from GitHub Copilot or Cursor BugBot. Think like GitHub Copilot or Cursor BugBot.

Offer to optionally post a brief summary of the review to the PR, via the gh CLI tool.

## Tagged Go releases

If I ask you whether we are ready to release, this means a tagged Go release on the main branch. Go releases are git tagged with a version number.

Review the changes since the last release, i.e. the previous git tag. Ensure that the changes are complete and correct. Identify new features, bug fixes, and performance improvements.

Identify breaking changes, especially API changes.

Ensure good test coverage. Look for performance changes, especially performance regressions, by running benchmarks against the previous release.

Ensure that the documentation in READMEs and GoDocs is complete, correct and consistent.

## Comparisons to go-runewidth

We originally attempted to make this package compatible with go-runewidth.
However, we found that there were too many differences in the handling of
certain characters and properties.

We believe, preliminarily, that our choices are more correct and complete,
because they use more complete categories such as Unicode Cf (format) for
zero-width and Mn (Nonspacing_Mark) for combining marks.
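
As a rough illustration of the two categories named above, Go's standard `unicode` package exposes both as range tables. The sketch below only classifies individual runes; it is not how this package computes widths, and the helper name `category` is made up for the example.

```go
package main

import (
	"fmt"
	"unicode"
)

// category reports which of the two general categories discussed above,
// if any, a rune belongs to. Hypothetical helper, for illustration only.
func category(r rune) string {
	switch {
	case unicode.Is(unicode.Cf, r):
		return "Cf"
	case unicode.Is(unicode.Mn, r):
		return "Mn"
	default:
		return "other"
	}
}

func main() {
	// 'a', COMBINING ACUTE ACCENT, ZERO WIDTH JOINER
	for _, r := range []rune{'a', '\u0301', '\u200D'} {
		fmt.Printf("U+%04X %s\n", r, category(r))
	}
}
```

A width calculation built on categories like these would presumably give both Cf and Mn runes zero columns of their own, which is the behavior the paragraph above describes.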
+129
@@ -0,0 +1,129 @@
# Changelog

## [0.11.0]

[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.10.0...v0.11.0)

### Added
- New `ControlSequences8Bit` option to treat 8-bit ECMA-48 (C1) escape sequences as zero-width. (#22)

### Changed
- Upgraded uax29 dependency to v2.7.0 for 8-bit escape sequence support in the grapheme iterator.
- Truncation now validates that preserved trailing escape sequences are zero-width, preventing edge cases where non-zero-width sequences could leak into output.

### Note
- `ControlSequences8Bit` is deliberately ignored by `TruncateString` and `TruncateBytes`, because C1 byte values (0x80–0x9F) overlap with UTF-8 multi-byte encoding.

## [0.10.0]

[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.9.0...v0.10.0)

### Added
- New `ControlSequences` option to treat ECMA-48/ANSI escape sequences as zero-width. (#20)
- `TruncateString` and `TruncateBytes` now preserve trailing ANSI escape sequences (such as SGR resets) when `ControlSequences` is true, preventing color bleed in terminal output.

### Changed
- Removed `stringish` dependency; generic type constraints are now inline `~string | []byte`.
- Upgraded uax29 dependency to v2.6.0 for ANSI escape sequence support in the grapheme iterator.

## [0.9.0]

[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.8.0...v0.9.0)

### Changed
- Unicode 17 support: East Asian Width and emoji data updated to Unicode 17.0.0. (#18)
- Upgraded uax29 dependency to v2.5.0 (Unicode 17 grapheme segmentation).

## [0.8.0]

[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.7.0...v0.8.0)

### Changed
- Performance: ASCII fast path that applies to any run of printable
  ASCII. 2x-10x faster for ASCII text vs v0.7.0. (#16)
- Upgraded uax29 dependency to v2.4.0 for Unicode 16 support. Text that includes
  Indic_Conjunct_Break may segment differently (and more correctly). (#15)
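
The idea behind an ASCII fast path can be sketched in a few lines: printable ASCII bytes (0x20 through 0x7E) are each one column wide, so a run of them can be measured by counting bytes, with no grapheme parsing. This is a simplified, stdlib-only sketch with a made-up function name. A real implementation also has to be careful at the end of the run, because an ASCII byte followed by a non-ASCII byte (a combining mark, for example) may belong to a larger grapheme cluster.

```go
package main

import "fmt"

// printableASCIIRun returns the number of leading bytes of s that are
// printable ASCII (0x20 through 0x7E). Each such byte is one column wide,
// so the run length is also its display width. Illustrative helper only.
func printableASCIIRun(s string) int {
	i := 0
	for i < len(s) && s[i] >= 0x20 && s[i] <= 0x7E {
		i++
	}
	return i
}

func main() {
	s := "Hello, 世界!"
	n := printableASCIIRun(s)
	fmt.Printf("fast path covers %d bytes: %q\n", n, s[:n])
	fmt.Printf("left for the slower path: %q\n", s[n:])
}
```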

## [0.7.0]

[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.6.2...v0.7.0)

### Added
- New `TruncateString` and `TruncateBytes` methods to truncate strings to a
  maximum display width, with optional tail (like an ellipsis). (#13)
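
To make the role of the tail concrete, here is a minimal sketch of width-limited truncation that assumes every byte is printable ASCII, so one byte equals one column. `truncateASCII` is a hypothetical helper, not this package's API; the real functions measure grapheme clusters instead of bytes.

```go
package main

import "fmt"

// truncateASCII shortens s to at most maxWidth columns and appends tail when
// it had to cut. It assumes s and tail contain only printable ASCII, so one
// byte is one column. If tail alone is wider than maxWidth, the result is
// just tail.
func truncateASCII(s string, maxWidth int, tail string) string {
	if len(s) <= maxWidth {
		return s // already fits, no tail needed
	}
	keep := maxWidth - len(tail) // columns left for the original text
	if keep < 0 {
		keep = 0
	}
	return s[:keep] + tail
}

func main() {
	fmt.Println(truncateASCII("Hello, world", 8, "..."))
	fmt.Println(truncateASCII("short", 8, "..."))
}
```

The point to notice is that the tail's width is subtracted from the budget before cutting, so the result including the tail still fits within `maxWidth`.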

## [0.6.2]

[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.6.1...v0.6.2)

### Changed
- Internal: reduced property categories for simpler trie.

## [0.6.1]

[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.6.0...v0.6.1)

### Changed
- Perf improvements: replaced the ASCII lookup table with a simple
  function. A bit more cache-friendly. More inlining.
- Bug fix: single regional indicators are now treated as width 2, since that
  is what actual terminals do.

## [0.6.0]

[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.5.0...v0.6.0)

### Added
- New `StringGraphemes` and `BytesGraphemes` methods, for iterating over the
  widths of grapheme clusters.

### Changed
- Fast ASCII lookups

## [0.5.0]

[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.4.1...v0.5.0)

### Added
- Unicode 16 support
- Improved emoji presentation handling per Unicode TR51

### Changed
- Corrected VS15 (U+FE0E) handling: now preserves base character width (no-op) per Unicode TR51
- Performance optimizations: reduced property lookups

### Fixed
- VS15 variation selector now correctly preserves base character width instead of forcing width 1
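
For readers unfamiliar with the selectors: VS15 is U+FE0E (text presentation) and VS16 is U+FE0F (emoji presentation). In UTF-8 they differ only in the last of three bytes. The sketch below detects a trailing selector with the standard library; checking the end of the string is a simplification for illustration, and the names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	vs15 = "\uFE0E" // text presentation selector, UTF-8 bytes EF B8 8E
	vs16 = "\uFE0F" // emoji presentation selector, UTF-8 bytes EF B8 8F
)

// presentation reports which variation selector, if any, ends the string.
func presentation(s string) string {
	switch {
	case strings.HasSuffix(s, vs16):
		return "emoji"
	case strings.HasSuffix(s, vs15):
		return "text"
	default:
		return "default"
	}
}

func main() {
	// U+263A WHITE SMILING FACE, alone and with each selector.
	for _, s := range []string{"\u263A", "\u263A\uFE0E", "\u263A\uFE0F"} {
		fmt.Printf("% x -> %s\n", s, presentation(s))
	}
}
```

Per the entries above, a width calculation would leave the base character's width alone for the "text" case and may widen it for the "emoji" case.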

## [0.4.1]

[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.4.0...v0.4.1)

### Changed
- Updated uax29 dependency
- Improved flag handling

## [0.4.0]

[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.3.1...v0.4.0)

### Added
- Support for variation selectors (VS15, VS16) and regional indicator pairs (flags)

## [0.3.1]

[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.3.0...v0.3.1)

### Added
- Fuzz testing support

### Changed
- Updated stringish dependency

## [0.3.0]

[Compare](https://github.com/clipperhouse/displaywidth/compare/v0.2.0...v0.3.0)

### Changed
- Dropped compatibility with go-runewidth
- Trie implementation cleanup
+21
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Matt Sherman

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
+190
@@ -0,0 +1,190 @@
# displaywidth

A high-performance Go package for measuring the monospace display width of strings, UTF-8 bytes, and runes.

[Documentation](https://pkg.go.dev/github.com/clipperhouse/displaywidth)
[Tests](https://github.com/clipperhouse/displaywidth/actions/workflows/gotest.yml)
[Fuzzing](https://github.com/clipperhouse/displaywidth/actions/workflows/gofuzz.yml)

## Install
```bash
go get github.com/clipperhouse/displaywidth
```

## Usage

```go
package main

import (
	"fmt"
	"github.com/clipperhouse/displaywidth"
)

func main() {
	width := displaywidth.String("Hello, 世界!")
	fmt.Println(width)

	width = displaywidth.Bytes([]byte("🌍"))
	fmt.Println(width)

	width = displaywidth.Rune('🌍')
	fmt.Println(width)
}
```

For most purposes, you should use the `String` or `Bytes` methods. They sum
the widths of grapheme clusters in the string or byte slice.

> Note: in your application, iterating over runes to measure width is likely incorrect;
> the smallest unit of display is a grapheme, not a rune.
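
To see why runes are the wrong unit, it helps to count them in strings that a terminal normally draws as a single character. The sketch below uses only the standard library and does not call `displaywidth`; `runeCount` is a thin illustrative wrapper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCount returns the number of runes (code points) in s.
func runeCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	samples := []string{
		// e followed by COMBINING ACUTE ACCENT
		"e\u0301",
		// regional indicators U and S, which pair up as a flag
		"\U0001F1FA\U0001F1F8",
		// man, woman, girl joined by ZERO WIDTH JOINER
		"\U0001F468\u200D\U0001F469\u200D\U0001F467",
	}
	for _, s := range samples {
		fmt.Printf("%q: %d runes, %d bytes\n", s, runeCount(s), len(s))
	}
}
```

Each sample is one grapheme cluster made of several runes, so summing per-rune widths would likely overcount what is actually displayed.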

### Iterating over graphemes

If you need the individual graphemes:

```go
package main

import (
	"fmt"
	"github.com/clipperhouse/displaywidth"
)

func main() {
	g := displaywidth.StringGraphemes("Hello, 世界!")
	for g.Next() {
		width := g.Width()
		value := g.Value()
		// do something with the width or value
		fmt.Printf("%q: %d\n", value, width)
	}
}
```

### Options

Create the options you need, and then use methods on the options struct.

```go
var myOptions = displaywidth.Options{
	EastAsianWidth:   true,
	ControlSequences: true,
}

width := myOptions.String("Hello, 世界!")
```

#### ControlSequences

`ControlSequences` specifies whether to ignore 7-bit ECMA-48 escape sequences
when calculating the display width. When `false` (default), ANSI escape
sequences are treated as just a series of characters. When `true`, they are
treated as a single zero-width unit.
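
To illustrate what "zero-width unit" means in practice, the sketch below skips one common kind of 7-bit escape sequence, CSI (`ESC [` followed by parameter bytes and a final byte), and counts the remaining runes. It is a deliberately narrow, stdlib-only example: it ignores the other ECMA-48 sequence types and counts runes rather than columns. The function name is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// visibleRunes counts the runes in s that are not part of a 7-bit CSI
// sequence. It handles only CSI, and it counts runes, not display columns.
func visibleRunes(s string) int {
	n := 0
	for i := 0; i < len(s); {
		if s[i] == 0x1B && i+1 < len(s) && s[i+1] == '[' {
			i += 2
			// Parameter and intermediate bytes: 0x20 through 0x3F.
			for i < len(s) && s[i] >= 0x20 && s[i] <= 0x3F {
				i++
			}
			// Final byte: 0x40 through 0x7E.
			if i < len(s) && s[i] >= 0x40 && s[i] <= 0x7E {
				i++
			}
			continue
		}
		_, size := utf8.DecodeRuneInString(s[i:])
		i += size
		n++
	}
	return n
}

func main() {
	s := "\x1b[31mred\x1b[0m" // "red" wrapped in SGR color and reset
	fmt.Printf("%d bytes, %d visible runes\n", len(s), visibleRunes(s))
}
```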

#### ControlSequences8Bit

`ControlSequences8Bit` specifies whether to ignore 8-bit ECMA-48 escape sequences
when calculating the display width. When `false` (default), these are treated
as just a series of characters. When `true`, they are treated as a single
zero-width unit.

Note: this option is ignored by the `Truncate` methods, because 8-bit control
bytes overlap with UTF-8 multi-byte encoding, and concatenating around them can
form unintended characters.

#### EastAsianWidth

`EastAsianWidth` defines how
[East Asian Ambiguous characters](https://www.unicode.org/reports/tr11/#Ambiguous)
are treated.

When `false` (default), East Asian Ambiguous characters are treated as width 1.
When `true`, they are treated as width 2.

You may wish to configure this based on environment variables or locale.
`go-runewidth`, for example, does so
[during package initialization](https://github.com/mattn/go-runewidth/blob/master/runewidth.go#L26C1-L45C2). `displaywidth` does not do this automatically; we prefer to leave it to you.
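
One way to do that configuration yourself is sketched below. The locale prefixes and the order of environment variables are assumptions made for this example, not a description of what `go-runewidth` does, and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// eastAsianLocale guesses whether a POSIX-style locale name such as
// "ja_JP.UTF-8" denotes a CJK locale. The prefix list is an assumption
// made for this example.
func eastAsianLocale(locale string) bool {
	for _, prefix := range []string{"ja", "ko", "zh"} {
		if strings.HasPrefix(locale, prefix) {
			return true
		}
	}
	return false
}

// localeFromEnv returns the first non-empty value among LC_ALL, LC_CTYPE
// and LANG, or "" if none is set.
func localeFromEnv() string {
	for _, key := range []string{"LC_ALL", "LC_CTYPE", "LANG"} {
		if v := os.Getenv(key); v != "" {
			return v
		}
	}
	return ""
}

func main() {
	wide := eastAsianLocale(localeFromEnv())
	// In a real program this would feed the option, for example:
	//   opts := displaywidth.Options{EastAsianWidth: wide}
	fmt.Println("EastAsianWidth:", wide)
}
```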

## Technical standards and compatibility

This package implements the Unicode East Asian Width standard
([UAX #11](https://www.unicode.org/reports/tr11/tr11-43.html)), and handles
[variation selectors](https://en.wikipedia.org/wiki/Variation_Selectors_(Unicode_block))
and [regional indicator pairs](https://en.wikipedia.org/wiki/Regional_indicator_symbol)
(flags). We implement [Unicode TR51](https://www.unicode.org/reports/tr51/tr51-27.html)
for emojis. We are keeping an eye on
[emerging standards](https://www.jeffquast.com/post/state-of-terminal-emulation-2025/).

For control sequences, we implement the [ECMA-48](https://ecma-international.org/publications-and-standards/standards/ecma-48/) standard for 7-bit and 8-bit control sequences.

`clipperhouse/displaywidth`, `mattn/go-runewidth`, and `rivo/uniseg` will
give the same outputs for most real-world text. Extensive details are in the
[compatibility analysis](comparison/COMPATIBILITY_ANALYSIS.md).

## Invalid UTF-8

This package does not validate UTF-8. If you pass invalid UTF-8, the results
are undefined. We fuzz against invalid UTF-8 to ensure we don't panic or
loop indefinitely.

The `ControlSequences8Bit` option means that we will segment valid 8-bit
control sequences, which are typically _not_ valid UTF-8. 8-bit control bytes
happen to also be UTF-8 continuation bytes. Use with caution.

## Prior Art

[mattn/go-runewidth](https://github.com/mattn/go-runewidth)

[rivo/uniseg](https://github.com/rivo/uniseg)

[x/text/width](https://pkg.go.dev/golang.org/x/text/width)

[x/text/internal/triegen](https://pkg.go.dev/golang.org/x/text/internal/triegen)

## Benchmarks

```bash
cd comparison
go test -bench=. -benchmem
```

```
goos: darwin
goarch: arm64
pkg: github.com/clipperhouse/displaywidth/comparison
cpu: Apple M2

BenchmarkString_Mixed/clipperhouse/displaywidth-8 5784 ns/op 291.69 MB/s 0 B/op 0 allocs/op
BenchmarkString_Mixed/mattn/go-runewidth-8 14751 ns/op 114.36 MB/s 0 B/op 0 allocs/op
BenchmarkString_Mixed/rivo/uniseg-8 19360 ns/op 87.14 MB/s 0 B/op 0 allocs/op

BenchmarkString_ASCII/clipperhouse/displaywidth-8 54.60 ns/op 2344.32 MB/s 0 B/op 0 allocs/op
BenchmarkString_ASCII/mattn/go-runewidth-8 1195 ns/op 107.08 MB/s 0 B/op 0 allocs/op
BenchmarkString_ASCII/rivo/uniseg-8 1578 ns/op 81.13 MB/s 0 B/op 0 allocs/op

BenchmarkString_EastAsian/clipperhouse/displaywidth-8 5837 ns/op 289.01 MB/s 0 B/op 0 allocs/op
BenchmarkString_EastAsian/mattn/go-runewidth-8 24418 ns/op 69.09 MB/s 0 B/op 0 allocs/op
BenchmarkString_EastAsian/rivo/uniseg-8 19339 ns/op 87.23 MB/s 0 B/op 0 allocs/op

BenchmarkString_Emoji/clipperhouse/displaywidth-8 3225 ns/op 224.51 MB/s 0 B/op 0 allocs/op
BenchmarkString_Emoji/mattn/go-runewidth-8 4851 ns/op 149.25 MB/s 0 B/op 0 allocs/op
BenchmarkString_Emoji/rivo/uniseg-8 6591 ns/op 109.85 MB/s 0 B/op 0 allocs/op

BenchmarkRune_Mixed/clipperhouse/displaywidth-8 3385 ns/op 498.34 MB/s 0 B/op 0 allocs/op
BenchmarkRune_Mixed/mattn/go-runewidth-8 5354 ns/op 315.07 MB/s 0 B/op 0 allocs/op

BenchmarkRune_EastAsian/clipperhouse/displaywidth-8 3397 ns/op 496.56 MB/s 0 B/op 0 allocs/op
BenchmarkRune_EastAsian/mattn/go-runewidth-8 15673 ns/op 107.64 MB/s 0 B/op 0 allocs/op

BenchmarkRune_ASCII/clipperhouse/displaywidth-8 255.7 ns/op 500.53 MB/s 0 B/op 0 allocs/op
BenchmarkRune_ASCII/mattn/go-runewidth-8 261.5 ns/op 489.55 MB/s 0 B/op 0 allocs/op

BenchmarkRune_Emoji/clipperhouse/displaywidth-8 1371 ns/op 528.22 MB/s 0 B/op 0 allocs/op
BenchmarkRune_Emoji/mattn/go-runewidth-8 2267 ns/op 319.43 MB/s 0 B/op 0 allocs/op

BenchmarkTruncateWithTail/clipperhouse/displaywidth-8 3229 ns/op 54.82 MB/s 192 B/op 14 allocs/op
BenchmarkTruncateWithTail/mattn/go-runewidth-8 8408 ns/op 21.05 MB/s 192 B/op 14 allocs/op

BenchmarkTruncateWithoutTail/clipperhouse/displaywidth-8 3554 ns/op 64.43 MB/s 0 B/op 0 allocs/op
BenchmarkTruncateWithoutTail/mattn/go-runewidth-8 11189 ns/op 20.47 MB/s 0 B/op 0 allocs/op
```

Here are some notes on [how to make Unicode things fast](https://clipperhouse.com/go-unicode/).
+3
@@ -0,0 +1,3 @@
package displaywidth

//go:generate go run -C internal/gen .
+73
@@ -0,0 +1,73 @@
package displaywidth

import (
	"github.com/clipperhouse/uax29/v2/graphemes"
)

// Graphemes is an iterator over grapheme clusters.
//
// Iterate using the Next method, and get the width of the current grapheme
// using the Width method.
type Graphemes[T ~string | []byte] struct {
	iter    *graphemes.Iterator[T]
	options Options
}

// Next advances the iterator to the next grapheme cluster.
func (g *Graphemes[T]) Next() bool {
	return g.iter.Next()
}

// Value returns the current grapheme cluster.
func (g *Graphemes[T]) Value() T {
	return g.iter.Value()
}

// Width returns the display width of the current grapheme cluster.
func (g *Graphemes[T]) Width() int {
	return graphemeWidth(g.Value(), g.options)
}

// StringGraphemes returns an iterator over grapheme clusters for the given
// string.
//
// Iterate using the Next method, and get the width of the current grapheme
// using the Width method.
func StringGraphemes(s string) Graphemes[string] {
	return DefaultOptions.StringGraphemes(s)
}

// StringGraphemes returns an iterator over grapheme clusters for the given
// string, with the given options.
//
// Iterate using the Next method, and get the width of the current grapheme
// using the Width method.
func (options Options) StringGraphemes(s string) Graphemes[string] {
	g := graphemes.FromString(s)
	g.AnsiEscapeSequences = options.ControlSequences
	g.AnsiEscapeSequences8Bit = options.ControlSequences8Bit

	return Graphemes[string]{iter: g, options: options}
}

// BytesGraphemes returns an iterator over grapheme clusters for the given
// []byte.
//
// Iterate using the Next method, and get the width of the current grapheme
// using the Width method.
func BytesGraphemes(s []byte) Graphemes[[]byte] {
	return DefaultOptions.BytesGraphemes(s)
}

// BytesGraphemes returns an iterator over grapheme clusters for the given
// []byte, with the given options.
//
// Iterate using the Next method, and get the width of the current grapheme
// using the Width method.
func (options Options) BytesGraphemes(s []byte) Graphemes[[]byte] {
	g := graphemes.FromBytes(s)
	g.AnsiEscapeSequences = options.ControlSequences
	g.AnsiEscapeSequences8Bit = options.ControlSequences8Bit

	return Graphemes[[]byte]{iter: g, options: options}
}
+30
@@ -0,0 +1,30 @@
package displaywidth

// Options allows you to specify the treatment of ambiguous East Asian
// characters and ANSI escape sequences.
type Options struct {
	// EastAsianWidth specifies whether to treat ambiguous East Asian characters
	// as width 1 or 2. When false (default), ambiguous East Asian characters
	// are treated as width 1. When true, they are width 2.
	EastAsianWidth bool

	// ControlSequences specifies whether to ignore 7-bit ECMA-48 escape sequences
	// when calculating the display width. When false (default), ANSI escape
	// sequences are treated as just a series of characters. When true, they are
	// treated as a single zero-width unit.
	ControlSequences bool
	// ControlSequences8Bit specifies whether to ignore 8-bit ECMA-48 escape sequences
	// when calculating the display width. When false (default), these are treated
	// as just a series of characters. When true, they are treated as a single
	// zero-width unit.
	ControlSequences8Bit bool
}

// DefaultOptions is the default options for the display width
// calculation, which is EastAsianWidth false, ControlSequences false, and
// ControlSequences8Bit false.
var DefaultOptions = Options{
	EastAsianWidth:       false,
	ControlSequences:     false,
	ControlSequences8Bit: false,
}
+1699
File diff suppressed because it is too large
+149
@@ -0,0 +1,149 @@
package displaywidth

import (
	"strings"

	"github.com/clipperhouse/uax29/v2/graphemes"
)

// TruncateString truncates a string to the given maxWidth, and appends the
// given tail if the string is truncated.
//
// It ensures the visible width, including the width of the tail, is less than or
// equal to maxWidth.
//
// When [Options.ControlSequences] is true, 7-bit ANSI escape sequences that
// appear after the truncation point are preserved in the output. This ensures
// that escape sequences such as SGR resets are not lost, preventing color
// bleed in terminal output.
//
// [Options.ControlSequences8Bit] is ignored by truncation. 8-bit C1 byte values
// (0x80-0x9F) overlap with UTF-8 multi-byte encoding, so manipulating them
// during truncation can shift byte boundaries and form unintended visible
// characters. Use [Options.String] or [Options.Bytes] for 8-bit-aware width
// measurement.
func (options Options) TruncateString(s string, maxWidth int, tail string) string {
	// We deliberately ignore ControlSequences8Bit for truncation, see above.
	options.ControlSequences8Bit = false

	maxWidthWithoutTail := maxWidth - options.String(tail)

	var pos, total int
	g := graphemes.FromString(s)
	g.AnsiEscapeSequences = options.ControlSequences

	for g.Next() {
		gw := graphemeWidth(g.Value(), options)
		if total+gw <= maxWidthWithoutTail {
			pos = g.End()
		}
		total += gw
		if total > maxWidth {
			if options.ControlSequences {
				// Build result with trailing 7-bit ANSI escape sequences preserved
				var b strings.Builder
				b.Grow(len(s) + len(tail)) // at most original + tail
				b.WriteString(s[:pos])
				b.WriteString(tail)

				rem := graphemes.FromString(s[pos:])
				rem.AnsiEscapeSequences = options.ControlSequences

				for rem.Next() {
					v := rem.Value()
					// Only preserve 7-bit escapes (ESC = 0x1B) that measure
					// as zero-width on their own; some sequences (e.g. SOS)
					// are only valid in their original context.
					if len(v) > 0 && v[0] == 0x1B && options.String(v) == 0 {
						b.WriteString(v)
					}
				}
				return b.String()
			}
			return s[:pos] + tail
		}
	}
	// No truncation
	return s
}

// TruncateString truncates a string to the given maxWidth, and appends the
// given tail if the string is truncated.
//
// It ensures the total width, including the width of the tail, is less than or
// equal to maxWidth.
func TruncateString(s string, maxWidth int, tail string) string {
	return DefaultOptions.TruncateString(s, maxWidth, tail)
}

// TruncateBytes truncates a []byte to the given maxWidth, and appends the
// given tail if the []byte is truncated.
//
// It ensures the visible width, including the width of the tail, is less than or
// equal to maxWidth.
//
// When [Options.ControlSequences] is true, 7-bit ANSI escape sequences that
// appear after the truncation point are preserved in the output. This ensures
// that escape sequences such as SGR resets are not lost, preventing color
// bleed in terminal output.
//
// [Options.ControlSequences8Bit] is ignored by truncation. 8-bit C1 byte values
// (0x80-0x9F) overlap with UTF-8 multi-byte encoding, so manipulating them
// during truncation can shift byte boundaries and form unintended visible
// characters. Use [Options.String] or [Options.Bytes] for 8-bit-aware width
// measurement.
func (options Options) TruncateBytes(s []byte, maxWidth int, tail []byte) []byte {
	// We deliberately ignore ControlSequences8Bit for truncation, see above.
	options.ControlSequences8Bit = false

	maxWidthWithoutTail := maxWidth - options.Bytes(tail)

	var pos, total int
	g := graphemes.FromBytes(s)
	g.AnsiEscapeSequences = options.ControlSequences

	for g.Next() {
		gw := graphemeWidth(g.Value(), options)
		if total+gw <= maxWidthWithoutTail {
			pos = g.End()
		}
		total += gw
		if total > maxWidth {
			if options.ControlSequences {
				// Build result with trailing 7-bit ANSI escape sequences preserved
				result := make([]byte, 0, len(s)+len(tail)) // at most original + tail
				result = append(result, s[:pos]...)
				result = append(result, tail...)

				rem := graphemes.FromBytes(s[pos:])
				rem.AnsiEscapeSequences = options.ControlSequences

				for rem.Next() {
					v := rem.Value()
					// Only preserve 7-bit escapes (ESC = 0x1B) that measure
					// as zero-width on their own; some sequences (e.g. SOS)
					// are only valid in their original context.
					if len(v) > 0 && v[0] == 0x1B && options.Bytes(v) == 0 {
						result = append(result, v...)
					}
				}
				return result
			}
			result := make([]byte, 0, pos+len(tail))
			result = append(result, s[:pos]...)
			result = append(result, tail...)
			return result
		}
	}
	// No truncation
	return s
}

// TruncateBytes truncates a []byte to the given maxWidth, and appends the
// given tail if the []byte is truncated.
//
// It ensures the total width, including the width of the tail, is less than or
// equal to maxWidth.
func TruncateBytes(s []byte, maxWidth int, tail []byte) []byte {
	return DefaultOptions.TruncateBytes(s, maxWidth, tail)
}
+239
@@ -0,0 +1,239 @@
package displaywidth

import (
	"unicode/utf8"

	"github.com/clipperhouse/uax29/v2/graphemes"
)

// String calculates the display width of a string,
// by iterating over grapheme clusters in the string
// and summing their widths.
func String(s string) int {
	return DefaultOptions.String(s)
}

// String calculates the display width of a string, for the given options, by
// iterating over grapheme clusters in the string and summing their widths.
func (options Options) String(s string) int {
	width := 0
	pos := 0

	for pos < len(s) {
		// Try ASCII optimization
		asciiLen := printableASCIILength(s[pos:])
		if asciiLen > 0 {
			width += asciiLen
			pos += asciiLen
			continue
		}

		// Not ASCII, use grapheme parsing
		g := graphemes.FromString(s[pos:])
		g.AnsiEscapeSequences = options.ControlSequences
		g.AnsiEscapeSequences8Bit = options.ControlSequences8Bit

		start := pos

		for g.Next() {
			v := g.Value()
			width += graphemeWidth(v, options)
			pos += len(v)

			// Quick check: if remaining might have printable ASCII, break to outer loop
			if pos < len(s) && s[pos] >= 0x20 && s[pos] <= 0x7E {
				break
			}
		}

		// Defensive, should not happen: if no progress was made,
		// skip a byte to prevent infinite loop. Only applies if
		// the grapheme parser misbehaves.
		if pos == start {
			pos++
		}
	}

	return width
}

// Bytes calculates the display width of a []byte,
// by iterating over grapheme clusters in the byte slice
// and summing their widths.
func Bytes(s []byte) int {
	return DefaultOptions.Bytes(s)
}

// Bytes calculates the display width of a []byte, for the given options, by
// iterating over grapheme clusters in the slice and summing their widths.
func (options Options) Bytes(s []byte) int {
	width := 0
	pos := 0

	for pos < len(s) {
		// Try ASCII optimization
		asciiLen := printableASCIILength(s[pos:])
		if asciiLen > 0 {
			width += asciiLen
			pos += asciiLen
			continue
		}

		// Not ASCII, use grapheme parsing
		g := graphemes.FromBytes(s[pos:])
		g.AnsiEscapeSequences = options.ControlSequences
		g.AnsiEscapeSequences8Bit = options.ControlSequences8Bit

		start := pos

		for g.Next() {
			v := g.Value()
			width += graphemeWidth(v, options)
			pos += len(v)

			// Quick check: if remaining might have printable ASCII, break to outer loop
			if pos < len(s) && s[pos] >= 0x20 && s[pos] <= 0x7E {
				break
			}
		}

		// Defensive, should not happen: if no progress was made,
		// skip a byte to prevent infinite loop. Only applies if
		// the grapheme parser misbehaves.
		if pos == start {
			pos++
		}
	}

	return width
}

// Rune calculates the display width of a rune. You
// should almost certainly use [String] or [Bytes] for
// most purposes.
//
// The smallest unit of display width is a grapheme
// cluster, not a rune. Iterating over runes to measure
// width is incorrect in many cases.
func Rune(r rune) int {
	return DefaultOptions.Rune(r)
}

// Rune calculates the display width of a rune, for the given options.
//
// You should almost certainly use [String] or [Bytes] for most purposes.
//
// The smallest unit of display width is a grapheme cluster, not a rune.
// Iterating over runes to measure width is incorrect in many cases.
func (options Options) Rune(r rune) int {
	if r < utf8.RuneSelf {
		return asciiWidth(byte(r))
	}

	// Surrogates (U+D800-U+DFFF) are invalid UTF-8.
	if r >= 0xD800 && r <= 0xDFFF {
		return 0
	}

	var buf [4]byte
	n := utf8.EncodeRune(buf[:], r)

	// Skip the grapheme iterator
	return graphemeWidth(buf[:n], options)
}
|
||||
|
||||
const _Default property = 0
|
||||
|
||||
// graphemeWidth returns the display width of a grapheme cluster.
|
||||
// The passed string must be a single grapheme cluster.
|
||||
func graphemeWidth[T ~string | []byte](s T, options Options) int {
|
||||
if len(s) == 0 {
|
||||
return 0
|
||||
}
|
||||
|
||||
// C1 controls (0x80-0x9F) are zero-width when 8-bit control sequences
|
||||
// are enabled. This must be checked before the single-byte optimization
|
||||
// below, which would otherwise return width 1 for these bytes.
|
||||
if options.ControlSequences8Bit && s[0] >= 0x80 && s[0] <= 0x9F {
|
||||
return 0
|
||||
}
|
||||
|
||||
// Optimization: single-byte graphemes need no property lookup
|
||||
if len(s) == 1 {
|
||||
return asciiWidth(s[0])
|
||||
}
|
||||
|
||||
// Multi-byte grapheme clusters led by a C0 control (0x00-0x1F)
|
||||
if s[0] <= 0x1F {
|
||||
return 0
|
||||
}
|
||||
|
||||
p, sz := lookup(s)
|
||||
prop := property(p)
|
||||
|
||||
// Variation Selector 16 (VS16) requests emoji presentation
|
||||
if prop != _Wide && sz > 0 && len(s) >= sz+3 {
|
||||
vs := s[sz : sz+3]
|
||||
if isVS16(vs) {
|
||||
prop = _Wide
|
||||
}
|
||||
// VS15 (U+FE0E, UTF-8 EF B8 8E) requests text presentation but does not affect width,
|
||||
// in my reading of Unicode TR51. Falls through to return the base
|
||||
// character's property.
|
||||
}
|
||||
|
||||
if options.EastAsianWidth && prop == _East_Asian_Ambiguous {
|
||||
prop = _Wide
|
||||
}
|
||||
|
||||
if prop > upperBound {
|
||||
prop = _Default
|
||||
}
|
||||
|
||||
return propertyWidths[prop]
|
||||
}
|
||||
|
||||
func asciiWidth(b byte) int {
|
||||
if b <= 0x1F || b == 0x7F {
|
||||
return 0
|
||||
}
|
||||
return 1
|
||||
}
|
||||
|
||||
// printableASCIILength returns the length of consecutive printable ASCII bytes
|
||||
// starting at the beginning of s.
|
||||
func printableASCIILength[T string | []byte](s T) int {
|
||||
i := 0
|
||||
for ; i < len(s); i++ {
|
||||
b := s[i]
|
||||
// Printable ASCII is 0x20-0x7E (space through tilde)
|
||||
if b < 0x20 || b > 0x7E {
|
||||
break
|
||||
}
|
||||
}
|
||||
|
||||
// If the next byte is non-ASCII (>= 0x80), back off by 1. The grapheme
|
||||
// parser may group the last ASCII byte with subsequent non-ASCII bytes,
|
||||
// such as combining marks.
|
||||
if i > 0 && i < len(s) && s[i] >= 0x80 {
|
||||
i--
|
||||
}
|
||||
|
||||
return i
|
||||
}
|
||||
|
||||
// isVS16 checks if the slice matches VS16 (U+FE0F) UTF-8 encoding
|
||||
// (EF B8 8F). It assumes len(s) >= 3.
|
||||
func isVS16[T ~string | []byte](s T) bool {
|
||||
return s[0] == 0xEF && s[1] == 0xB8 && s[2] == 0x8F
|
||||
}
|
||||
|
||||
// propertyWidths is a jump table of sorts, instead of a switch
|
||||
var propertyWidths = [4]int{
|
||||
_Default: 1,
|
||||
_Zero_Width: 0,
|
||||
_Wide: 2,
|
||||
_East_Asian_Ambiguous: 1,
|
||||
}
|
||||
|
||||
const upperBound = property(len(propertyWidths) - 1)
|
||||
+21
@@ -0,0 +1,21 @@
|
||||
MIT License
|
||||
|
||||
Copyright (c) 2020 Matt Sherman
|
||||
|
||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
of this software and associated documentation files (the "Software"), to deal
|
||||
in the Software without restriction, including without limitation the rights
|
||||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
copies of the Software, and to permit persons to whom the Software is
|
||||
furnished to do so, subject to the following conditions:
|
||||
|
||||
The above copyright notice and this permission notice shall be included in all
|
||||
copies or substantial portions of the Software.
|
||||
|
||||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
SOFTWARE.
|
||||
+120
@@ -0,0 +1,120 @@
|
||||
An implementation of grapheme cluster boundaries from [Unicode text segmentation](https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) (UAX 29), for Unicode 17.
|
||||
|
||||
[Go reference documentation](https://pkg.go.dev/github.com/clipperhouse/uax29/v2/graphemes)
|
||||
|
||||
## Quick start
|
||||
|
||||
```
|
||||
go get github.com/clipperhouse/uax29/v2/graphemes
|
||||
```
|
||||
|
||||
```go
|
||||
import "github.com/clipperhouse/uax29/v2/graphemes"
|
||||
|
||||
text := "Hello, 世界. Nice dog! 👍🐶"
|
||||
g := graphemes.FromString(text)
|
||||
|
||||
for g.Next() { // Next() returns true until end of data
|
||||
fmt.Println(g.Value()) // Do something with the current grapheme
|
||||
}
|
||||
```
|
||||
|
||||
_A grapheme is a “single visible character”, which might be as simple as a single letter, or a complex emoji that consists of several Unicode code points._
|
||||
|
||||
## Conformance
|
||||
|
||||
We use the Unicode [test suite](https://unicode.org/reports/tr41/tr41-36.html#Tests29).
|
||||
|
||||
|
||||
## APIs
|
||||
|
||||
### If you have a `string`
|
||||
|
||||
```go
|
||||
text := "Hello, 世界. Nice dog! 👍🐶"
|
||||
g := graphemes.FromString(text)
|
||||
|
||||
for g.Next() { // Next() returns true until end of data
|
||||
fmt.Println(g.Value()) // Do something with the current grapheme
|
||||
}
|
||||
```
|
||||
|
||||
### If you have an `io.Reader`
|
||||
|
||||
`FromReader` returns a `Scanner` that embeds a [`bufio.Scanner`](https://pkg.go.dev/bufio#Scanner), so just use those methods.
|
||||
|
||||
```go
|
||||
r := getYourReader() // from a file or network maybe
|
||||
g := graphemes.FromReader(r)
|
||||
|
||||
for g.Scan() { // Scan() returns true until error or EOF
|
||||
fmt.Println(g.Text()) // Do something with the current grapheme
|
||||
}
|
||||
|
||||
if g.Err() != nil { // Check the error
|
||||
log.Fatal(g.Err())
|
||||
}
|
||||
```
|
||||
|
||||
### If you have a `[]byte`
|
||||
|
||||
```go
|
||||
b := []byte("Hello, 世界. Nice dog! 👍🐶")
|
||||
|
||||
g := graphemes.FromBytes(b)
|
||||
|
||||
for g.Next() { // Next() returns true until end of data
|
||||
fmt.Println(g.Value()) // Do something with the current grapheme
|
||||
}
|
||||
```
|
||||
|
||||
### ANSI escape sequences
|
||||
|
||||
By the UAX 29 specification, ANSI escape sequences are not grapheme clusters. To treat each 7-bit ANSI escape sequence as a single cluster, set `AnsiEscapeSequences` to true.
|
||||
|
||||
```go
|
||||
text := "Hello, \x1b[31mworld\x1b[0m!"
|
||||
g := graphemes.FromString(text)
|
||||
g.AnsiEscapeSequences = true
|
||||
|
||||
for g.Next() {
|
||||
fmt.Println(g.Value())
|
||||
}
|
||||
```
|
||||
|
||||
To also parse 8-bit C1 controls (non-UTF-8 bytes), set `AnsiEscapeSequences8Bit` to true.
|
||||
|
||||
```go
|
||||
g.AnsiEscapeSequences = true // 7-bit forms (ESC ...)
|
||||
g.AnsiEscapeSequences8Bit = true // 8-bit C1 forms (0x80-0x9F), not valid UTF-8
|
||||
```
|
||||
|
||||
For ESC-initiated (7-bit) control strings, only 7-bit terminators are recognized.
|
||||
For C1-initiated (8-bit) control strings, only C1 ST (`0x9C`) is recognized as ST.
|
||||
|
||||
We implement [ECMA-48](https://ecma-international.org/publications-and-standards/standards/ecma-48/) control codes in both 7-bit and 8-bit representations. 8-bit control codes are not UTF-8 encoded and are not valid UTF-8, caveat emptor.
|
||||
|
||||
### Benchmarks
|
||||
|
||||
```
|
||||
goos: darwin
|
||||
goarch: arm64
|
||||
pkg: github.com/clipperhouse/uax29/graphemes/comparative
|
||||
cpu: Apple M2
|
||||
|
||||
BenchmarkGraphemesMixed/clipperhouse/uax29-8 142635 ns/op 245.12 MB/s 0 B/op 0 allocs/op
|
||||
BenchmarkGraphemesMixed/rivo/uniseg-8 2018284 ns/op 17.32 MB/s 0 B/op 0 allocs/op
|
||||
|
||||
BenchmarkGraphemesASCII/clipperhouse/uax29-8 8846 ns/op 508.73 MB/s 0 B/op 0 allocs/op
|
||||
BenchmarkGraphemesASCII/rivo/uniseg-8 366760 ns/op 12.27 MB/s 0 B/op 0 allocs/op
|
||||
```
|
||||
|
||||
### Invalid inputs
|
||||
|
||||
Invalid UTF-8 input is considered undefined behavior. We test to ensure that bad inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.
|
||||
|
||||
Your pipeline should probably include a call to [`utf8.Valid()`](https://pkg.go.dev/unicode/utf8#Valid).
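One possible shape for that validation step is sketched below, using only the standard library. `checkUTF8` and `errInvalidUTF8` are hypothetical names for illustration and are not part of this package.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error for this sketch.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// checkUTF8 is a hypothetical helper: it rejects input that is not
// valid UTF-8 before it reaches a grapheme iterator.
func checkUTF8(b []byte) error {
	if !utf8.Valid(b) {
		return errInvalidUTF8
	}
	return nil
}

func main() {
	fmt.Println(checkUTF8([]byte("Hello, 世界")))
	fmt.Println(checkUTF8([]byte{'h', 'i', 0xFF}))
}
```

If the 8-bit C1 option is enabled, a check like this would also reject those control bytes, so the two are not meant to be combined.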
|
||||
+138
@@ -0,0 +1,138 @@
|
||||
package graphemes
|
||||
|
||||
// ansiEscapeLength returns the byte length of a valid 7-bit ANSI escape
|
||||
// sequence at the start of data, or 0 if none.
|
||||
//
|
||||
// Recognized forms (ECMA-48 / ISO 6429):
|
||||
// - CSI: ESC [ then parameter bytes (0x30-0x3F), intermediate (0x20-0x2F), final (0x40-0x7E)
|
||||
// - OSC: ESC ] then payload until BEL (0x07), 7-bit ST (ESC \), CAN (0x18), or SUB (0x1A)
|
||||
// - DCS, SOS, PM, APC: ESC P/X/^/_ then payload until 7-bit ST (ESC \), CAN, or SUB
|
||||
// - Two-byte: ESC + Fe/Fs (0x40-0x7E excluding above), or Fp (0x30-0x3F), or nF (0x20-0x2F then final)
|
||||
func ansiEscapeLength[T ~string | ~[]byte](data T) int {
|
||||
n := len(data)
|
||||
if n < 2 || data[0] != esc {
|
||||
return 0
|
||||
}
|
||||
|
||||
b1 := data[1]
|
||||
switch b1 {
|
||||
case '[': // CSI
|
||||
body := csiBodyLength(data[2:])
|
||||
if body == 0 {
|
||||
return 0
|
||||
}
|
||||
return 2 + body
|
||||
case ']': // OSC - allows BEL or 7-bit ST terminator
|
||||
body := oscLength(data[2:])
|
||||
if body < 0 {
|
||||
return 0
|
||||
}
|
||||
return 2 + body
|
||||
case 'P', 'X', '^', '_': // DCS, SOS, PM, APC
|
||||
body := stSequenceLength(data[2:])
|
||||
if body < 0 {
|
||||
return 0
|
||||
}
|
||||
return 2 + body
|
||||
}
|
||||
|
||||
if b1 >= 0x40 && b1 <= 0x7E {
|
||||
// Fe/Fs two-byte; [ ] P X ^ _ handled above
|
||||
return 2
|
||||
}
|
||||
if b1 >= 0x30 && b1 <= 0x3F {
|
||||
// Fp (private) two-byte
|
||||
return 2
|
||||
}
|
||||
if b1 >= 0x20 && b1 <= 0x2F {
|
||||
// nF: intermediates then one final (0x30-0x7E)
|
||||
i := 2
|
||||
for i < n && data[i] >= 0x20 && data[i] <= 0x2F {
|
||||
i++
|
||||
}
|
||||
if i < n && data[i] >= 0x30 && data[i] <= 0x7E {
|
||||
return i + 1
|
||||
}
|
||||
return 0
|
||||
}
|
||||
|
||||
return 0
|
||||
}
|
||||
|
||||
// csiBodyLength returns the length of the CSI body (param/intermediate/final bytes).
|
||||
// data is the slice after "ESC [".
|
||||
// Per ECMA-48, the CSI body has the form:
|
||||
//
|
||||
// parameters (0x30–0x3F)*, intermediates (0x20–0x2F)*, final (0x40–0x7E)
|
||||
//
|
||||
// Once an intermediate byte is seen, subsequent parameter bytes are invalid.
|
||||
func csiBodyLength[T ~string | ~[]byte](data T) int {
|
||||
seenIntermediate := false
|
||||
for i := 0; i < len(data); i++ {
|
||||
b := data[i]
|
||||
if b >= 0x30 && b <= 0x3F {
|
||||
if seenIntermediate {
|
||||
return 0
|
||||
}
|
||||
continue
|
||||
}
|
||||
if b >= 0x20 && b <= 0x2F {
|
||||
seenIntermediate = true
|
||||
continue
|
||||
}
|
||||
if b >= 0x40 && b <= 0x7E {
|
||||
return i + 1
|
||||
}
|
||||
return 0
|
||||
}
|
||||
return 0
|
||||
}
|
||||
|
||||
// oscLength returns the length of the OSC body.
|
||||
// data is the slice after "ESC ]".
|
||||
//
|
||||
// Returns:
|
||||
// - n >= 0: consumed body length (includes BEL/ST terminator when present)
|
||||
// - -1: not terminated in the provided data
|
||||
//
|
||||
// OSC accepts BEL (0x07) or 7-bit ST (ESC \) as terminators by widespread convention.
|
||||
// Per ECMA-48, CAN (0x18) and SUB (0x1A) cancel the control string; in that
|
||||
// case they are not part of the OSC sequence length.
|
||||
func oscLength[T ~string | ~[]byte](data T) int {
|
||||
for i := 0; i < len(data); i++ {
|
||||
b := data[i]
|
||||
if b == bel {
|
||||
return i + 1
|
||||
}
|
||||
if b == can || b == sub {
|
||||
return i
|
||||
}
|
||||
if b == esc && i+1 < len(data) && data[i+1] == '\\' {
|
||||
return i + 2
|
||||
}
|
||||
}
|
||||
return -1
|
||||
}
|
||||
|
||||
// stSequenceLength returns the length of a control-string body.
|
||||
// data is the slice after "ESC x".
|
||||
//
|
||||
// Returns:
|
||||
// - n >= 0: consumed body length (includes ST terminator when present)
|
||||
// - -1: not terminated in the provided data
|
||||
//
|
||||
// Used for DCS, SOS, PM, and APC, which per ECMA-48 terminate with ST.
|
||||
// ST here is the 7-bit form (ESC \).
|
||||
// CAN (0x18) and SUB (0x1A) cancel the control string; in that case they are
|
||||
// not part of the sequence length.
|
||||
func stSequenceLength[T ~string | ~[]byte](data T) int {
|
||||
for i := 0; i < len(data); i++ {
|
||||
if data[i] == can || data[i] == sub {
|
||||
return i
|
||||
}
|
||||
if data[i] == esc && i+1 < len(data) && data[i+1] == '\\' {
|
||||
return i + 2
|
||||
}
|
||||
}
|
||||
return -1
|
||||
}
|
||||
+79
@@ -0,0 +1,79 @@
|
||||
package graphemes
|
||||
|
||||
// ansiEscapeLength8Bit returns the byte length of a valid 8-bit C1 ANSI
|
||||
// sequence at the start of data, or 0 if none.
|
||||
//
|
||||
// Recognized forms (ECMA-48 / ISO 6429):
|
||||
// - C1 CSI (0x9B) body as parameter/intermediate/final bytes
|
||||
// - C1 OSC (0x9D) body terminated by BEL, C1 ST, CAN, or SUB
|
||||
// - C1 DCS/SOS/PM/APC (0x90/0x98/0x9E/0x9F) body terminated by C1 ST, CAN, or SUB
|
||||
// - Standalone C1 controls (0x80..0x9F not listed above): single byte
|
||||
func ansiEscapeLength8Bit[T ~string | ~[]byte](data T) int {
|
||||
if len(data) == 0 {
|
||||
return 0
|
||||
}
|
||||
|
||||
switch data[0] {
|
||||
case 0x9B: // C1 CSI
|
||||
body := csiBodyLength(data[1:])
|
||||
if body == 0 {
|
||||
return 0
|
||||
}
|
||||
return 1 + body
|
||||
case 0x9D: // C1 OSC
|
||||
body := oscLengthC1(data[1:])
|
||||
if body < 0 {
|
||||
return 0
|
||||
}
|
||||
return 1 + body
|
||||
case 0x90, 0x98, 0x9E, 0x9F: // C1 DCS, SOS, PM, APC
|
||||
body := stSequenceLengthC1(data[1:])
|
||||
if body < 0 {
|
||||
return 0
|
||||
}
|
||||
return 1 + body
|
||||
default:
|
||||
if data[0] >= 0x80 && data[0] <= 0x9F {
|
||||
return 1
|
||||
}
|
||||
}
|
||||
|
||||
return 0
|
||||
}
|
||||
|
||||
// oscLengthC1 returns the length of a C1 OSC body.
|
||||
// data is the slice after the C1 OSC initiator (0x9D).
|
||||
//
|
||||
// Returns:
|
||||
// - n >= 0: consumed body length (includes BEL/ST terminator when present)
|
||||
// - -1: not terminated in the provided data
|
||||
//
|
||||
// Terminators: BEL (0x07) or C1 ST (0x9C).
|
||||
// CAN (0x18) and SUB (0x1A) cancel the control string.
|
||||
func oscLengthC1[T ~string | ~[]byte](data T) int {
|
||||
for i := 0; i < len(data); i++ {
|
||||
b := data[i]
|
||||
if b == bel || b == st {
|
||||
return i + 1
|
||||
}
|
||||
if b == can || b == sub {
|
||||
return i
|
||||
}
|
||||
}
|
||||
return -1
|
||||
}
|
||||
|
||||
// stSequenceLengthC1 parses DCS/SOS/PM/APC bodies that terminate with C1 ST
|
||||
// (0x9C), or are canceled by CAN/SUB.
|
||||
func stSequenceLengthC1[T ~string | ~[]byte](data T) int {
|
||||
for i := 0; i < len(data); i++ {
|
||||
b := data[i]
|
||||
if b == can || b == sub {
|
||||
return i
|
||||
}
|
||||
if b == st {
|
||||
return i + 1
|
||||
}
|
||||
}
|
||||
return -1
|
||||
}
|
||||
+144
@@ -0,0 +1,144 @@
|
||||
package graphemes
|
||||
|
||||
import "unicode/utf8"
|
||||
|
||||
// FromString returns an iterator for the grapheme clusters in the input string.
|
||||
// Iterate while Next() is true, and access the grapheme via Value().
|
||||
func FromString(s string) *Iterator[string] {
|
||||
return &Iterator[string]{
|
||||
split: splitFuncString,
|
||||
data: s,
|
||||
}
|
||||
}
|
||||
|
||||
// FromBytes returns an iterator for the grapheme clusters in the input bytes.
|
||||
// Iterate while Next() is true, and access the grapheme via Value().
|
||||
func FromBytes(b []byte) *Iterator[[]byte] {
|
||||
return &Iterator[[]byte]{
|
||||
split: splitFuncBytes,
|
||||
data: b,
|
||||
}
|
||||
}
|
||||
|
||||
// Iterator is a generic iterator for grapheme clusters in strings or byte slices,
|
||||
// with an ASCII hot path optimization.
|
||||
type Iterator[T ~string | ~[]byte] struct {
|
||||
split func(T, bool) (int, T, error)
|
||||
data T
|
||||
pos int
|
||||
start int
|
||||
// AnsiEscapeSequences treats 7-bit ANSI escape sequences (ECMA-48) as
|
||||
// single grapheme clusters when true. The default is false.
|
||||
//
|
||||
// 8-bit controls are not enabled by this option. See [AnsiEscapeSequences8Bit].
|
||||
AnsiEscapeSequences bool
|
||||
// AnsiEscapeSequences8Bit treats 8-bit C1 ANSI escape sequences (ECMA-48) as single
|
||||
// grapheme clusters when true. The default is false.
|
||||
//
|
||||
// 8-bit control bytes are not UTF-8 encoded, i.e. not valid UTF-8. If you
|
||||
// choose this option, you are choosing to interpret non-UTF-8 data, caveat
|
||||
// emptor.
|
||||
AnsiEscapeSequences8Bit bool
|
||||
}
|
||||
|
||||
var (
|
||||
splitFuncString = splitFunc[string]
|
||||
splitFuncBytes = splitFunc[[]byte]
|
||||
)
|
||||
|
||||
const (
|
||||
esc = 0x1B
|
||||
cr = 0x0D
|
||||
bel = 0x07
|
||||
can = 0x18
|
||||
sub = 0x1A
|
||||
st = 0x9C
|
||||
)
|
||||
|
||||
// Next advances the iterator to the next grapheme cluster.
|
||||
// Returns false when there are no more grapheme clusters.
|
||||
func (iter *Iterator[T]) Next() bool {
|
||||
if iter.pos >= len(iter.data) {
|
||||
return false
|
||||
}
|
||||
iter.start = iter.pos
|
||||
|
||||
b := iter.data[iter.pos]
|
||||
if iter.AnsiEscapeSequences && b == esc {
|
||||
if a := ansiEscapeLength(iter.data[iter.pos:]); a > 0 {
|
||||
iter.pos += a
|
||||
return true
|
||||
}
|
||||
}
|
||||
if iter.AnsiEscapeSequences8Bit && b >= 0x80 && b <= 0x9F {
|
||||
if a := ansiEscapeLength8Bit(iter.data[iter.pos:]); a > 0 {
|
||||
iter.pos += a
|
||||
return true
|
||||
}
|
||||
}
|
||||
|
||||
// ASCII hot path: any ASCII is one grapheme when next byte is ASCII or end.
|
||||
if b < utf8.RuneSelf && b != cr {
|
||||
if iter.pos+1 >= len(iter.data) || iter.data[iter.pos+1] < utf8.RuneSelf {
|
||||
iter.pos++
|
||||
return true
|
||||
}
|
||||
}
|
||||
|
||||
// Fall back to UAX29 grapheme parsing
|
||||
remaining := iter.data[iter.pos:]
|
||||
advance, _, err := iter.split(remaining, true)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
if advance <= 0 {
|
||||
panic("splitFunc returned a zero or negative advance")
|
||||
}
|
||||
iter.pos += advance
|
||||
if iter.pos > len(iter.data) {
|
||||
panic("splitFunc advanced beyond end of data")
|
||||
}
|
||||
return true
|
||||
}
|
||||
|
||||
// Value returns the current grapheme cluster.
|
||||
func (iter *Iterator[T]) Value() T {
|
||||
return iter.data[iter.start:iter.pos]
|
||||
}
|
||||
|
||||
// Start returns the byte position of the current grapheme in the original data.
|
||||
func (iter *Iterator[T]) Start() int {
|
||||
return iter.start
|
||||
}
|
||||
|
||||
// End returns the byte position after the current grapheme in the original data.
|
||||
func (iter *Iterator[T]) End() int {
|
||||
return iter.pos
|
||||
}
|
||||
|
||||
// Reset resets the iterator to the beginning of the data.
|
||||
func (iter *Iterator[T]) Reset() {
|
||||
iter.start = 0
|
||||
iter.pos = 0
|
||||
}
|
||||
|
||||
// SetText sets the data for the iterator to operate on, and resets all state.
|
||||
func (iter *Iterator[T]) SetText(data T) {
|
||||
iter.data = data
|
||||
iter.start = 0
|
||||
iter.pos = 0
|
||||
}
|
||||
|
||||
// First returns the first grapheme cluster without advancing the iterator.
|
||||
func (iter *Iterator[T]) First() T {
|
||||
if len(iter.data) == 0 {
|
||||
return iter.data
|
||||
}
|
||||
|
||||
// Use a copy to leverage Next()'s ASCII optimization
|
||||
cp := *iter
|
||||
cp.pos = 0
|
||||
cp.start = 0
|
||||
cp.Next()
|
||||
return cp.Value()
|
||||
}
|
||||
+25
@@ -0,0 +1,25 @@
|
||||
// Package graphemes implements Unicode grapheme cluster boundaries: https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
|
||||
package graphemes
|
||||
|
||||
import (
|
||||
"bufio"
|
||||
"io"
|
||||
)
|
||||
|
||||
type Scanner struct {
|
||||
*bufio.Scanner
|
||||
}
|
||||
|
||||
// FromReader returns a Scanner, to split graphemes per
|
||||
// https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries.
|
||||
//
|
||||
// It embeds a [bufio.Scanner], so you can use its methods.
|
||||
//
|
||||
// Iterate through graphemes by calling Scan() until false, then check Err().
|
||||
func FromReader(r io.Reader) *Scanner {
|
||||
sc := bufio.NewScanner(r)
|
||||
sc.Split(SplitFunc)
|
||||
return &Scanner{
|
||||
Scanner: sc,
|
||||
}
|
||||
}
|
||||
+205
@@ -0,0 +1,205 @@
|
||||
package graphemes
|
||||
|
||||
import (
|
||||
"bufio"
|
||||
)
|
||||
|
||||
// is reports whether lookup intersects any of the given properties
|
||||
func (lookup property) is(properties property) bool {
|
||||
return (lookup & properties) != 0
|
||||
}
|
||||
|
||||
const _Ignore = _Extend
|
||||
|
||||
// incbState tracks state for GB9c rule (Indic conjunct clusters)
|
||||
// Pattern: Consonant (Extend|Linker)* Linker (Extend|Linker)* × Consonant
|
||||
type incbState int
|
||||
|
||||
const (
|
||||
incbNone incbState = iota // initial/reset
|
||||
incbConsonant // seen Consonant, awaiting Linker
|
||||
incbLinker // seen Consonant and Linker (conjunct ready)
|
||||
)
|
||||
|
||||
// SplitFunc is a bufio.SplitFunc implementation of Unicode grapheme cluster segmentation, for use with bufio.Scanner.
|
||||
//
|
||||
// See https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries.
|
||||
var SplitFunc bufio.SplitFunc = splitFunc[[]byte]
|
||||
|
||||
func splitFunc[T ~string | ~[]byte](data T, atEOF bool) (advance int, token T, err error) {
|
||||
var empty T
|
||||
if len(data) == 0 {
|
||||
return 0, empty, nil
|
||||
}
|
||||
|
||||
// These vars are stateful across loop iterations
|
||||
var pos int
|
||||
var lastExIgnore property = 0 // "last excluding ignored categories"
|
||||
var lastLastExIgnore property = 0 // "last one before that"
|
||||
var regionalIndicatorCount int
|
||||
|
||||
// GB9c state: tracking Indic conjunct clusters
|
||||
var incb incbState
|
||||
|
||||
// Rules are usually of the form Cat1 × Cat2; "current" refers to the first property
|
||||
// to the right of the ×, from which we look back or forward
|
||||
|
||||
current, w := lookup(data[pos:])
|
||||
if w == 0 {
|
||||
if !atEOF {
|
||||
// Rune extends past current data, request more
|
||||
return 0, empty, nil
|
||||
}
|
||||
pos = len(data)
|
||||
return pos, data[:pos], nil
|
||||
}
|
||||
|
||||
// https://unicode.org/reports/tr29/#GB1
|
||||
// Start of text always advances
|
||||
pos += w
|
||||
|
||||
for {
|
||||
eot := pos == len(data) // "end of text"
|
||||
|
||||
if eot {
|
||||
if !atEOF {
|
||||
// Token extends past current data, request more
|
||||
return 0, empty, nil
|
||||
}
|
||||
|
||||
// https://unicode.org/reports/tr29/#GB2
|
||||
break
|
||||
}
|
||||
|
||||
/*
|
||||
We've switched the evaluation order of GB1↓ and GB2↑. It's ok:
|
||||
because we've checked for len(data) at the top of this function,
|
||||
sot and eot are mutually exclusive, order doesn't matter.
|
||||
*/
|
||||
|
||||
// Rules are usually of the form Cat1 × Cat2; "current" refers to the first property
|
||||
// to the right of the ×, from which we look back or forward
|
||||
|
||||
// Remember previous properties to avoid lookups/lookbacks
|
||||
last := current
|
||||
if !last.is(_Ignore) {
|
||||
lastLastExIgnore = lastExIgnore
|
||||
lastExIgnore = last
|
||||
}
|
||||
|
||||
// Update GB9c state based on what we just advanced past
|
||||
if last.is(_InCBConsonant | _InCBLinker | _InCBExtend) {
|
||||
switch {
|
||||
case last.is(_InCBConsonant):
|
||||
if incb != incbLinker {
|
||||
incb = incbConsonant
|
||||
}
|
||||
case last.is(_InCBLinker):
|
||||
if incb >= incbConsonant {
|
||||
incb = incbLinker
|
||||
}
|
||||
// case last.is(_InCBExtend): stay in current state
|
||||
}
|
||||
} else {
|
||||
incb = incbNone
|
||||
}
|
||||
|
||||
current, w = lookup(data[pos:])
|
||||
if w == 0 {
|
||||
if atEOF {
|
||||
// Just return the bytes, we can't do anything with them
|
||||
pos = len(data)
|
||||
break
|
||||
}
|
||||
// Rune extends past current data, request more
|
||||
return 0, empty, nil
|
||||
}
|
||||
|
||||
// Optimization: no rule can possibly apply
|
||||
if current|last == 0 { // i.e. both are zero
|
||||
break
|
||||
}
|
||||
|
||||
// https://unicode.org/reports/tr29/#GB3
|
||||
if current.is(_LF) && last.is(_CR) {
|
||||
pos += w
|
||||
continue
|
||||
}
|
||||
|
||||
// https://unicode.org/reports/tr29/#GB4
|
||||
// https://unicode.org/reports/tr29/#GB5
|
||||
if (current | last).is(_Control | _CR | _LF) {
|
||||
break
|
||||
}
|
||||
|
||||
// https://unicode.org/reports/tr29/#GB6
|
||||
if current.is(_L|_V|_LV|_LVT) && last.is(_L) {
|
||||
pos += w
|
||||
continue
|
||||
}
|
||||
|
||||
// https://unicode.org/reports/tr29/#GB7
|
||||
if current.is(_V|_T) && last.is(_LV|_V) {
|
||||
pos += w
|
||||
continue
|
||||
}
|
||||
|
||||
// https://unicode.org/reports/tr29/#GB8
|
||||
if current.is(_T) && last.is(_LVT|_T) {
|
||||
pos += w
|
||||
continue
|
||||
}
|
||||
|
||||
// https://unicode.org/reports/tr29/#GB9
|
||||
if current.is(_Extend | _ZWJ) {
|
||||
pos += w
|
||||
continue
|
||||
}
|
||||
|
||||
// https://unicode.org/reports/tr29/#GB9a
|
||||
if current.is(_SpacingMark) {
|
||||
pos += w
|
||||
continue
|
||||
}
|
||||
|
||||
// https://unicode.org/reports/tr29/#GB9b
|
||||
if last.is(_Prepend) {
|
||||
pos += w
|
||||
continue
|
||||
}
|
||||
|
||||
// https://unicode.org/reports/tr29/#GB9c
|
||||
// Do not break within certain combinations with Indic_Conjunct_Break (InCB)=Linker.
|
||||
if incb == incbLinker && current.is(_InCBConsonant) {
|
||||
// After matching the pattern, reset state to start tracking a new pattern
|
||||
// The current Consonant becomes the start of the new pattern
|
||||
incb = incbConsonant
|
||||
pos += w
|
||||
continue
|
||||
}
|
||||
|
||||
// https://unicode.org/reports/tr29/#GB11
|
||||
if current.is(_ExtendedPictographic) && last.is(_ZWJ) && lastLastExIgnore.is(_ExtendedPictographic) {
|
||||
pos += w
|
||||
continue
|
||||
}
|
||||
|
||||
// https://unicode.org/reports/tr29/#GB12
|
||||
// https://unicode.org/reports/tr29/#GB13
|
||||
if (current & last).is(_RegionalIndicator) {
|
||||
regionalIndicatorCount++
|
||||
|
||||
odd := regionalIndicatorCount%2 == 1
|
||||
if odd {
|
||||
pos += w
|
||||
continue
|
||||
}
|
||||
}
|
||||
|
||||
// If we fall through all the above rules, it's a grapheme cluster break
|
||||
break
|
||||
}
|
||||
|
||||
// Return token
|
||||
return pos, data[:pos], nil
|
||||
}
|
||||
+1717
File diff suppressed because it is too large