Hyphenation
A common problem in languages that use compound words is that they'll unintentionally break user interfaces that don't expect long words to appear. This is especially common on mobile phone interfaces or smart watches.
A classic example of a compound word in German is the noun Donaudampfschifffahrtsgesellschaftskapitän. In English this would be translated as: "Danube steamship transport company captain".
The problem
Let's use a familiar English phrase as an example, and see how a constrained user interface might affect how the text is displayed:
The way the phrase overflows is something quite atrocious!
A partial solution
We can partially fix this by using the CSS property word-break:break-word:
<div style="word-break:break-word;">
supercalifragilisticexpialidocious
</div>
This is a lot better, but there's one thing that still doesn't look right. The word breaks off at inconvenient moments. The break at "fragili–stic" is okay, but "doci–ous" is not good. It's not how you would sound it out. The correct hyphenation break points for the word are: su‧per‧cal‧i‧frag‧i‧lis‧tic‧ex‧pi‧a‧li‧do‧cious.
This problem is known as syllabification. According to Wikipedia, "Is there any perfect syllabification algorithm in English language?" is an unsolved problem in computer science.[1]
The issue probably wouldn't bother most English speakers because the chance of a long word appearing is quite low. But many languages, such as German, don't have this luxury.
A manual solution
Soft hyphens can be used to control how the words are broken and hyphenated.
Soft hyphens are characters that are inserted in the middle of words. They are normally invisible, except when a word is broken onto a new line, where the soft hyphen is rendered as a hyphen. They can be used in a couple of different ways:
- As a raw character (U+00AD), which appears invisible here.
- In JSON, as an escaped Unicode sequence: "\u00AD"
- In HTML, as the entity &shy;
<div style="word-break:break-word;">
super&shy;cali&shy;fragilistic&shy;expiali&shy;docious
</div>
Now the phrase breaks much more appropriately, with "fragilistic" and "docious" appearing unbroken. Note also how the soft hyphen between "super" and "cali" was not rendered, because no line break was needed there.
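When the break points are already known, the soft hyphens don't have to be typed by hand. The following is a minimal sketch in TypeScript, not a hyphenation algorithm: the function names and the list of fragments are made up for illustration, and the break points still have to be chosen by a person.

```typescript
// U+00AD SOFT HYPHEN: invisible unless the line happens to break at this point.
const SOFT_HYPHEN = "\u00AD";

// Join hand-picked word fragments with soft hyphens.
// `fragments` is a hypothetical input: the pieces between the allowed break points.
function withSoftHyphens(fragments: string[]): string {
  return fragments.join(SOFT_HYPHEN);
}

// Remove the soft hyphens again, e.g. before searching or comparing text.
function stripSoftHyphens(text: string): string {
  return text.split(SOFT_HYPHEN).join("");
}

// Same break points as in the HTML example above.
const hyphenated = withSoftHyphens([
  "super",
  "cali",
  "fragilistic",
  "expiali",
  "docious",
]);
```

The resulting string can be placed in the page as text. Because the soft hyphens are real characters in the string, it is worth stripping them before doing exact string comparisons.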
The soft hyphen is supported on a lot of platforms, like browsers, desktop environments, and text layout engines.
There are some downsides:
- Manually inserting soft hyphens is very time-consuming.
- Hyphenation will be different for each language and region, so an editor can't apply English hyphenation rules to Portuguese text without it looking odd.
Third-party software libraries exist to automate this on various platforms, like Hyphenology for JavaScript. They even let you provide a language tag to choose which rules to use. But it is a shame that a library is necessary to get this functionality.
A better solution
Since 2021 the situation has improved slightly with the addition of a new[2] CSS property hyphens:auto. Together with the lang attribute, the syllabification, or hyphenation, can be configured to match the locale.
<div lang="en" style="word-break:break-word; hyphens:auto;">
supercalifragilisticexpialidocious
</div>
While the automatic breaking of words is not quite what I'd prefer, I'm glad at least that "docious" is split as "do–cious" rather than "doci–ous".
To demonstrate how subtle the differences in hyphenation rules are between languages, below is a comparison of the same word in English (lang="en") and then in Spanish (lang="es").
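To reproduce a comparison like this one, the same markup can be generated once per language tag. This is a rough TypeScript sketch: the helper name is hypothetical, and it assumes the word contains no HTML special characters. The actual break points are still decided by the browser's hyphenation dictionary for each lang value.

```typescript
// Build one demo <div> per language tag so the break points can be compared
// side by side in a narrow container.
// Assumption: `word` contains no characters that need HTML escaping.
function hyphenationDemo(word: string, langs: string[]): string {
  return langs
    .map(
      (lang) =>
        `<div lang="${lang}" style="word-break:break-word; hyphens:auto;">${word}</div>`
    )
    .join("\n");
}

const demo = hyphenationDemo("supercalifragilisticexpialidocious", ["en", "es"]);
```

Placing the generated markup in a container that is too narrow for the word should show where each language's rules break it.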
Different hyphenation rules between browsers
One issue that I discovered while writing this is that Firefox and Chrome do not hyphenate "proper nouns" whereas Safari does. This, according to the W3C CSS Text 3 spec, is not a problem:
The UA may use language-tailored heuristics to exclude certain words from automatic hyphenation. For example, a UA might try to avoid hyphenation in proper nouns by excluding words matching certain capitalization and punctuation patterns. Such heuristics are not defined by this specification. (Note that such heuristics will need to vary by language: English and German, for example, have very different capitalization conventions.)
In other words, the user agent (the browser) has its own rules for words that are capitalized.
I think this is wrong, because it is perfectly acceptable to hyphenate long words in English. It is also completely wrong to assume that the first word in a sentence is a proper noun, which is what Firefox and Chrome do.
In the hyphenation example below, this is what happens in the three browsers when English hyphenation rules (lang="en") are used:
- In Safari, both words below are hyphenated.
- In Chrome, strangely, the capitalized word is not hyphenated on the first line, but is on the second line. The lowercase word is hyphenated.
- In Firefox, the capitalized word is not hyphenated at all. The lowercase word is hyphenated.
The good news is that if German hyphenation rules are used (lang="de"), hyphenation does work for capitalized words. So in the example below, all browsers should hyphenate the word across all three lines.
1. Where Wikipedia got this claim is not clear. As of 1st of September 2024, the Wikipedia page has no citation for this claim, and has a bunch of "citation needed"s.
2. Chrome and Edge only added support across all platforms for the value auto in 2021. Firefox and Safari have had it since 2011. However, as I cover later on, I think the way hyphenation was implemented in Firefox and Chrome is terrible, making this feature useless.