nerd-fonts/bin/scripts/name_parser/README.md

## Creating Consistently Grouped Patched Fonts

This is a small sub-project of `font-patcher` that uses a bit more knowledge about font naming
to come up with font names and name parts. Applications group multiple fonts
under a 'Family'. Each member of the Family has a different 'SubFamily' or 'Style'.

Consider a font named 'Times' that has two variants: normal and bold. For this font the
Family would be 'Times' and the 'Style' would be 'Regular' (i.e. normal) in one file and
'Bold' in the other file.

With this information applications are able to group all 'Times' together and additionally choose the
'Bold' font if the user pushes the 'B' button on the font style dialog in that application.

### Motivation

Quite a number of patched fonts have inconsistent or simply wrong font grouping. The naming in
general is sometimes surprising and does not follow naming conventions. This is in part due to
the font-patcher, but in part the source fonts are already strange.
The results are fonts that are installed but invisible in some applications, inconsistent naming
(Familyname differs from Fullname), and bold/italic selectors that do not work correctly in some applications.

We would also like to have the information within the names sorted in a consistent way.
Usually a font name consists of these parts (in this order):

1. Name base (e.g. `Noto`)
2. Variant (e.g. `Sans`)
3. Subvariant (e.g. `Display`)
4. Weight (e.g. `Black`)
5. Style (e.g. `Italic`)

This is important because we want to add subvariant information, namely the `Nerd Font` part.

Example:

* (old) `Iosevka Term Light Italic Nerd Font`
* (new) `Iosevka Term Nerd Font Light Italic`
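
The reordering can be illustrated with a small Python sketch. The function name `assemble_name` and its parameters are hypothetical and not part of `font-patcher`; the sketch only shows the intended order of the parts, with `Nerd Font` placed after the variant information and before weight and style.

```python
def assemble_name(base, variant="", subvariant="", weight="", style="",
                  marker="Nerd Font"):
    """Join the name parts in the order listed above.

    Hypothetical helper for illustration only: the marker is treated as
    additional subvariant information, so it goes before weight and style.
    """
    parts = [base, variant, subvariant, marker, weight, style]
    # Empty parts are skipped so that no double spaces appear.
    return " ".join(part for part in parts if part)


print(assemble_name("Iosevka", variant="Term", weight="Light", style="Italic"))
# prints: Iosevka Term Nerd Font Light Italic
```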

### The Plan

To solve these issues the font name parts have to be analyzed more thoroughly and then categorized.
These categories are then used to assemble the names in the correct order. Simple (not
typographically aware) applications shall always get groups of at most four styles:
Regular, Bold, Italic, and Bold-Italic. Other styles turn up as Families, because this is the
only way they work in these simpler applications.

Typographically aware applications, on the other hand, get all styles grouped under one Family name.
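
As a rough sketch of this grouping rule (not the actual implementation), the following Python function derives both sets of names from a family name, a weight, and an italic flag. The function name `group_names`, its parameters, and the returned tuple layout are assumptions made for this example.

```python
def group_names(family, weight="", italic=False):
    """Return (family, subfamily, typographic_family, typographic_subfamily).

    Illustrative only: weights other than Regular and Bold are moved into
    the Family name for simple applications, while typographically aware
    applications keep one Family and get the full style.
    """
    weight = weight.strip()
    is_regular = weight.lower() in ("", "regular")
    is_bold = weight.lower() == "bold"

    # Simple applications: only Regular, Bold, Italic, Bold-Italic remain
    # as styles; every other weight becomes part of the Family name.
    simple_family = family if (is_regular or is_bold) else f"{family} {weight}"
    simple_style = {
        (False, False): "Regular",
        (True, False): "Bold",
        (False, True): "Italic",
        (True, True): "Bold-Italic",
    }[(is_bold, italic)]

    # Typographically aware applications: one Family, full style name.
    typo_parts = [] if is_regular else [weight]
    if italic:
        typo_parts.append("Italic")
    typo_style = " ".join(typo_parts) or "Regular"

    return simple_family, simple_style, family, typo_style


print(group_names("Iosevka Term Nerd Font", "Light", italic=True))
```

With these assumptions a Light Italic font ends up in its own `... Light` Family with style `Italic` for simple applications, while typographically aware applications see the unchanged Family with style `Light Italic`.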

First experiments showed that the full information can usually be restored already from the file
names that our source fonts have.
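
To give an idea of how name parts might be recovered from a file name, here is a minimal Python sketch. It is not the parser used by `font-patcher`: the function `parse_font_filename`, the token lists `WEIGHTS` and `STYLES`, and the returned dictionary are assumptions for this example, and the token lists are deliberately incomplete.

```python
import os
import re

# Assumed, incomplete token lists; a real parser needs many more entries.
WEIGHTS = ("Thin", "ExtraLight", "Light", "Regular", "Medium",
           "SemiBold", "Bold", "ExtraBold", "Black", "Heavy")
STYLES = ("Italic", "Oblique")


def parse_font_filename(path):
    """Split a file name like 'NotoSans-BlackItalic.ttf' into name parts."""
    stem = os.path.splitext(os.path.basename(path))[0]
    # Everything after the last hyphen is assumed to hold weight and style.
    base, tail = stem.rsplit("-", 1) if "-" in stem else (stem, "")

    weight, styles, rest = "", [], tail
    # Try longer tokens first so that 'ExtraLight' wins over 'Light'.
    known = sorted(WEIGHTS + STYLES, key=len, reverse=True)
    while rest:
        for token in known:
            if rest.lower().startswith(token.lower()):
                if token in STYLES:
                    styles.append(token)
                else:
                    weight = token
                rest = rest[len(token):]
                break
        else:
            break  # unknown token: stop and report it as unparsed

    # Split the base name into words at capital letters and hyphens.
    words = re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", base)
    return {"base": " ".join(words), "weight": weight,
            "styles": styles, "unparsed": rest}


print(parse_font_filename("NotoSans-BlackItalic.ttf"))
```

A simple split like this fails for names with unusual capitalization or tokens outside the lists, which is why the remainder is returned as `unparsed` instead of being dropped silently.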

This new naming is completely optional (but recommended). Give the option `--parser` to `font-patcher`
and it will try to come up with reasonable grouping and naming. Leave the option out and it will
work as it always did.

### The Tests

In this directory there are two tests.

1. The first test checks the basics of the algorithm. It takes the filenames of all fonts in
   `src/unpatched-fonts/`, calculates the naming from them, and compares it to the original
   naming in the font files. Ideally they would be equal.
2. The second test does a 'production run'. It patches each font in `src/unpatched-fonts/`
   two times: once without `--parser` and once with. Then it compares the
   naming, and it also shows the original font naming (for comparison).

Both tests are based on these assumptions:

* Fullname must be roughly equal
* Fontname must be roughly equal
* Familyname must be roughly equal; the order of the words does not matter
  _(order of words is ignored with test 2 only)_
* SubFamilyname must be equal; the order of the words does not matter
  _(first word must be equal, order of other words is ignored with test 2 only)_
* Typographic names can be empty if the correct typographic name would be equal to the ordinary name
* Comparisons are case-insensitive
* Some special exemptions are made (see `lenient_cmp()` in the test scripts)
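
The comparison rules can be sketched in Python as follows. This is not the `lenient_cmp()` from the test scripts; the helpers `words_equal` and `subfamily_equal` are hypothetical and cover only case-insensitivity and word order, not the special exemptions.

```python
def words_equal(a, b, ignore_order=False):
    """Compare two names case-insensitively, word by word."""
    words_a, words_b = a.lower().split(), b.lower().split()
    if ignore_order:
        return sorted(words_a) == sorted(words_b)
    return words_a == words_b


def subfamily_equal(a, b):
    """First word must match; the order of the remaining words is ignored."""
    words_a, words_b = a.lower().split(), b.lower().split()
    return words_a[:1] == words_b[:1] and sorted(words_a[1:]) == sorted(words_b[1:])


print(words_equal("Noto Sans Nerd Font", "noto sans nerd font"))
print(words_equal("Nerd Font Noto", "Noto Nerd Font", ignore_order=True))
print(subfamily_equal("Bold Italic Condensed", "bold condensed italic"))
```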

#### Test 1

`fontforge name_parser_test1 ../../../src/unpatched-fonts/**/*.[ot]tf 2>/dev/null`

This test takes the filename of a font, parses it and generates names from it. Then the actual
font is opened and the generated names are compared with the stored names. This test is used
to test the algorithm itself. Of course no SIL table is active as we want to preserve the original
names.

The output shows all the names, always two lines: first the generated names, then the readout
names. If there are differences the generated names are tagged with `+` and the readout ones
with `-`. If there are differences the actually different name part is marked with an `X`.

The differences have reasons, and there is a file with textual explanations for them. So far
all differences are 'ok'. A new run of the script will compare all differences with the stored
ones and alert the user if a new difference is detected (or a difference vanished). In this
way changes of the algorithm can be tested with a wide base of inputs.

#### Test 2

`fontforge name_parser_test2 ../../../src/unpatched-fonts/**/*.[ot]tf 2>/dev/null`

This test compares actually patched fonts. Every font in `src/unpatched-fonts/` is patched two
times: first with the 'old/classic' `font-patcher` naming, and second with the new naming
algorithm in action (by specifying `--parser`). Again the name parts are compared with some
lenience, and the output has the same format as in test 1.

As in test 1, a file with known differences (with explanations) is read, and any new or vanished
differences are reported. In the report an additional line is given, tagged with `>`, that
contains the names of the original font, for human interpretation (often the reason
for a difference is obvious, because the classic `font-patcher` dropped information).

_Note: Fonts `NotoColorEmoji` and `Lilex-VF` are not patchable, and thus ignored_

_Note: Fonts `iosevka-heavyoblique`, `iosevka-term-heavyoblique`, `iosevka-mediumoblique` crash my machine and are ignored_

### Differences

The naming of the patched fonts will be different if `--parser` is applied. Of course, that is the goal.
The differences in particular are:

* `Nerd Font` is not added at the end, but after the extended base name and before the style
* The SubFamily contains only 4 Styles max: Regular, Bold, Italic, Bold-Italic
* The Noto fonts retain their abbreviated style names in the Family information
* `Nerd Font Mono` fonts get an `M` in Windows mode (I believe that has been left out accidentally before)

Apart from these general things, all changes are documented in detail in the `name_parser_test2` issues file.
Here is an overview of all the things that get renamed and why:

| Occurrences | Description |
|------------|-------------|
| 511        | Add weight/style to family |
|  43        | The font's name is M+ not Mplus |
|  36        | Drop unneeded Typogr.Family/Typogr.Style |
|  26        | 'Term' is missing from Family |
|  22        | Change regular-equivalent name to Regular |
|  19        | Put Oblique into own SubFamily (and mark it as italic) |
|   5        | Drop Regular from Style |
|   4        | We handle (TTF) as sub-name |
|   4        | Fullname has been missing 'Nerd Font' |
|   4        | Bold / Bold-Italic are just styles of Regular |
|   2        | Original font broken (Light in Family) |
|   2        | Classify Medium as own weight and not Bold |
|   2        | Bold and Italic are styles of a basefont |
|   1        | Weight Condensed does not belong to base name |
|   1        | Use only Regular/Bold/Italic in SubFamily |
|   1        | Handle Retina as Weight and not Style |
|   1        | Do not call Semibold Light-Bold |

From the count we see that almost all fonts are affected by incorrect Family naming.

### Further steps

One can examine all the (current) naming differences in the `name_parser_test2.known_issues`
file. The Explanation is followed by three lines of names: source-file, patched-with-parser,
and patched-classic.

The Explanation sorts most differences into common groups. This helps to weed out
explanations that might not need much attention.

### Helper scripts

There are some helper scripts that help to examine the font files. Of course there are other,
more professional tools to dump font information, but here we get all we need in a concise
way:

* `query_names` `font_name [font_name ...]`
* `query_panose` `font_name`
* `query_sftn` `[<sfnt-name>] font_name`
* `query_version` `font_name`

They can be invoked like this: `$ fontforge query_sftn foo.ttf`.

### Appendix: The `name_parser_test*.known_issues` files

All differences between 'old' and 'new' naming (if not one of the very general kind like reordering of
the words) are documented in the `known_issues` files. For each difference a reason is given.

The files consist of entries that span 3 (for test 1) or 4 (for test 2) lines.

| Line starts with | Contents |
|------------------|----------|
| #                | Reason for the difference (or `AUTOGENERATED`) |
| >                | Naming of the original/source font (only test 2) |
| +                | Naming with `--parser` (new naming) |
| -                | Naming classically generated by font-patcher |

After any test run a `known_issues.new` file is generated. It contains all the issues
from the `known_issues` file that were detected. Original issues that no longer
exist are at the bottom of the new file, clearly marked as such. If new
(previously unexplained) issues were detected they show up with the `AUTOGENERATED`
reason.
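
A minimal Python sketch for reading such a file and listing the entries that still need an explanation could look like this. It assumes that the tag is the first character of each line and that every entry starts with its `#` line; the function `parse_known_issues`, the dictionary keys, and the sample names are invented for this example.

```python
def parse_known_issues(text):
    """Group tagged lines into entries; each entry starts with a '#' line."""
    tags = {"#": "reason", ">": "source", "+": "parser", "-": "classic"}
    entries, current = [], None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        tag, rest = line[0], line[1:].strip()
        if tag not in tags:
            continue  # ignore lines without a known tag
        if tag == "#":
            current = {"reason": rest}
            entries.append(current)
        elif current is not None:
            current[tags[tag]] = rest
    return entries


# Invented sample content in the 4-line layout of test 2.
sample = """\
# AUTOGENERATED
> Example Sans Light
+ Example Sans Nerd Font Light
- Example Sans Light Nerd Font
"""

unexplained = [e for e in parse_known_issues(sample)
               if e["reason"] == "AUTOGENERATED"]
print(len(unexplained), "issue(s) still need a documented reason")
```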

After adding new fonts or replacing font files the test can be rerun. If there are issues
in the `.new` file they should be documented there, and the `.new` file should then replace the
original `known_issues` file (after removing possible 'obsolete' issues that are listed at
the bottom of the new file).

In this way one can tweak the parser code and compare very easily what a change
means for all the fonts, which will break or be repaired.

Draft: Introduce a file name parser DO NOT MERGE

[why]
A lot of the fonts have incorrect naming after patching. A completely different approach can help to come up with a consistent naming scheme.

[how]
See bin/scripts/name-parser/README.md

Signed-off-by: Fini Jastrow <ulf.fini.jastrow@desy.de>
2021-12-02 23:29:54 +02:00
