In reply to “Sprachwechsel in WordPress” by Robert Lender

Changing the language mark-up of text fragments in WordPress is indeed difficult. I always do it by hand. Language support generally seems to assume that you only ever use one language, or, if you do switch languages, that you only use one language at a time. The idea that someone might use two or more languages interchangeably seems to be unimaginable. Let alone that you would do so within a single posting, or that you would use a single word or a short sentence in a language other than the ones you normally write in.

The language setting in WordPress is a site-wide setting. Marking up the language by hand is the only reliable way.

But that is not the end of it. If you look at how, for example, Google Translate analyses content, things quickly get strange. Google only looks at the language declared for the page as a whole, and computes an algorithmic probability for the language of the text. All other language mark-up is simply ignored. And even then Google often gets it wrong. If I write a short posting in Dutch on an individual page that is declared as Dutch, it is still interpreted as English, because the menu structure on my pages is in English and is given more weight than the text itself. Only when the Dutch text is longer is the posting correctly recognised as Dutch, even though the shorter texts carry the same correct language mark-up.

Some time ago I adapted my WordPress theme so that it takes languages into account better:

  • If a posting is in German or Dutch and is viewed on its own, the right language is declared as the language for the whole page.
  • If a posting is in German or Dutch and appears in an overview together with English postings, it gets language mark-up at the post level. I do both by using a separate category for each language.
  • Short sentences in another language I mark up with a language attribute that I can easily add with a keyboard shortcut.
  • In my RSS feed I add links to a machine translation into English for German and Dutch postings, but I build them as explicit links: you cannot rely on Google Translate's auto-detect feature to get this right.

In short, I am not confident that WordPress will get this right. This discussion is not a new phenomenon, and it is an industry-wide problem in tech. Perhaps coders in Europe in particular, with our diversity of languages, should focus on building a well-functioning WP plugin.

Marking up a language switch is actually no small matter. After all, it is about, for example, screen readers reading words out correctly to visually impaired people, so that those words are understandable at all. [Which] to this day cannot be done for individual words. …here I hope for more insight from the WordPress developers in the future.

Robert Lender

I had thought I had the language stuff all sorted, also because I had tested it. As it turns out, Google Translate works slightly differently from what I concluded earlier.
It does look at the declared language in the <html> tag, but it doesn’t do so exclusively. Even when a language is declared, it still seems to consult the machine learning model as well.
The effect is that when a posting here in Dutch or German is very short, tweet-like, it will still detect that most of the page is in English (navigation structure, sidebar etc.), and treat the entire page as English. This makes even less sense than my earlier notion that it follows the declared language, and uses machine learning only if nothing’s declared, as it seems to actively distrust even the little bit of language mark-up it bothers to check in the first place.

It does mean that adding machine translation links at the end of Dutch and German postings is a good service to provide. Here I can’t trust the auto-detect feature of Google Translate (see above), so I must force the correct source language in the link I provide. This doubles the code needed (once for Dutch, once for German), but it works. The code is in the same function I previously adopted from Frank and Jan. I’ve added the translation links only to the RSS feed, not to the website. My reasoning is that most of my regular readers read my postings through RSS, and that they are the ones who might be interested in also reading my non-English postings.
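The idea of forcing the source language in the link can be sketched as follows. This is a minimal illustration in Python, not the PHP actually used in the theme. The category slugs are taken from the template names mentioned further down; the `translate.google.com/translate` URL pattern with `sl`, `tl` and `u` parameters, and the function name, are assumptions for the sake of the example.

```python
from urllib.parse import urlencode

# Hypothetical mapping from a posting's language category to a source-language code.
SOURCE_LANG_BY_CATEGORY = {"nederlands": "nl", "deutsch": "de"}


def translation_link(post_url, categories, target="en"):
    """Return a translation URL with the source language forced.

    Returns None when the posting is in none of the language categories,
    i.e. when it is assumed to be in English and needs no link.
    """
    for slug in categories:
        source = SOURCE_LANG_BY_CATEGORY.get(slug)
        if source:
            # Assumed URL pattern: sl = source language, tl = target, u = page URL.
            query = urlencode({"sl": source, "tl": target, "u": post_url})
            return "https://translate.google.com/translate?" + query
    return None
```

Using a single lookup table instead of one branch per language would avoid doubling the code for Dutch and German, at the cost of one more thing to keep in sync with the categories.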


A posting in Dutch as it appears on the site


The same posting in Dutch as it appears in the RSS feed, with added link to machine translation

For now I’m done with language adaptations. Although, having looked at some of the older conversations concerning multilingualism I’ve had over the years, I also considered how Stephanie Booth adds English excerpts at the start of a posting in French and vice versa. That might be something to emulate. However, it should not clutter up the postings or feeds too much, so likely should be another field. As I’m already using the excerpt field for other things (posting to Twitter and Mastodon mostly), that’s something to figure out in the future.

Last week I changed this site to provide better language mark-up. However, even though the mark-up itself was changed correctly, it didn’t solve the issue that made me look into it in the first place: that if you click a link to a posting in my RSS feed, your browser would not detect the right language and translate the posting for you.

As it turns out, Google Translate doesn’t make any real effort to detect the language or languages of a page. It only ever checks if there is a default language indicated in the very first <html> tag of a page (which my WordPress sets to English for the entire website), and only if no such default is set does it use a machine learning model (CLD2) to detect which language was likely used, and then it only picks the most likely one. It never checks for language mark-up. It also never considers whether multiple languages were used in a page, even though the machine learning model returns probabilities for more than one language if present in a page.
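To make the described behaviour concrete, here is a small Python sketch of the decision as I understood it at this point. It models only the description in the paragraph above, not Google's actual code, and a later post revises this picture. The function name and the shape of the probabilities are assumptions.

```python
def page_language_as_described(declared_lang, model_probabilities):
    """Pick a single language for a page, following the description above.

    declared_lang: value of the lang attribute on <html>, or None if absent.
    model_probabilities: dict of language code -> probability, standing in
    for what a language detection model might return.
    """
    # A declared default wins outright; the model is not consulted.
    if declared_lang:
        return declared_lang
    if not model_probabilities:
        return None
    # Only the single most likely language is kept. The probabilities for
    # any other languages on the page are thrown away.
    return max(model_probabilities, key=model_probabilities.get)
```

Note what is missing: there is no input at all for language mark-up inside the page, and no way to return more than one language.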

This is surprising on two levels. One, it disregards usable information even when provided (either the language mark-up, or probabilities from the ML model). Two, it makes an entire family of wrong assumptions, of which the assumption that something or someone will always be monolingual is only the first. While discussing this in a conversation with Kevin Marks, he pointed to Stephanie Booth’s presentation at Google that he helped set up 12 years ago, listing all that is wrong with the simplistic monolingual world-view of platforms and tech silos. A dozen years on it is still all true and relevant, nothing’s changed. No wonder Stephanie and I have been talking about multi-lingual blogging off and on for as long as we’ve been blogging.

Which all goes to say that my previous changes weren’t very useful. I realised that to make auto-translation of clicked links from my feed work, I needed to set the language attribute for an entire page in the <html> tag, and not try to mark up only the sections that aren’t in English. (Even if it is the wrong thing to do, because it also means I am saying that everything that isn’t content, such as menus, tags etc., is in the declared language. And that isn’t the case. When I write postings in Dutch or German, the entire framework of my site is still in English.) After some web searching, I found a reference to writing a small function to change the default language setting, and calling that when writing the header of a page, which I adapted. The disadvantage is that this gets called for every page, regardless of whether it is needed (it’s only ever needed for a single post page, or the overview pages of Dutch and German postings). The advantage is that almost all language adaptations are now in a single spot in my theme. I’ve rolled back all previous changes to the single and category templates. Only the changes to the front page template I’ve kept, so that there is still the correct language mark-up around front page postings that are not in English.


The function I added to functions.php in my child theme.


An example of changed page language setting (to German), for a posting in German. (if you follow that link and do view source, you’ll see it)
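The function itself was shown as an image. As a rough illustration of the logic described above, here is a sketch in Python rather than the PHP of the actual child theme. The category slugs and the `nl-nl` and `de-de` values come from elsewhere on this site; the default value, the page type names and the function names are assumptions.

```python
DEFAULT_LANG = "en-US"  # assumed site-wide default set by WordPress

# Language category slug -> language declaration for the whole page.
LANG_BY_CATEGORY = {"nederlands": "nl-nl", "deutsch": "de-de"}


def page_language(page_type, categories):
    """Decide the language to declare on the <html> tag.

    page_type: "single" for a single post page, "category" for a category
    overview page, anything else (e.g. "home") keeps the default.
    categories: category slugs of the post or of the overview page.
    """
    if page_type in ("single", "category"):
        for slug in categories:
            if slug in LANG_BY_CATEGORY:
                return LANG_BY_CATEGORY[slug]
    return DEFAULT_LANG


def html_open_tag(page_type, categories):
    """Render the opening <html> tag with the chosen language."""
    return '<html lang="{}">'.format(page_language(page_type, categories))
```

Like the original, this runs for every page and falls through to the default in most cases; the benefit is that the decision lives in one place.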

My site until now didn’t indicate very well in which language my postings are written. I write here mostly in English, but also sometimes use two other languages, Dutch and German.

My friend Peter pointed out to me that if he reads Frank’s blog in his feed reader and clicks on the link, his browser automatically translates it into English. As Peter suggested, this is most likely because Frank’s site declares Dutch as its language, and mine declares English. I decided to look into it and see if I could change that.

The language declaration Peter pointed to is the very first statement in the source code for this page:

Frank’s site in the same space says his site is in Dutch.

Frank also publishes in English sometimes, and then the language setting would be factually incorrect. Peter just wouldn’t notice as he wouldn’t attempt to translate English, his native language.

My company’s website in contrast declares three languages, by giving a different URL for English and German, next to the regular Dutch. However, in that case it is about the same or similar pieces of content made available in different languages. That is not the same use case as my blog, where there is different content in different languages.

I concluded I needed to figure out how to declare the right language a) for the category archive pages for Dutch and German postings (because I mark any posting not in English with a separate category corresponding to its language), and b) for individual postings not in English.

First I looked at what the W3C says about indicating content languages. It turns out Frank and I both do it right: the <html> tag is the place to declare the default language of a website. In Frank’s case Dutch, in my case English. The W3C goes on to say that any other languages should be indicated in the location where they are used. This would allow me, for example, to indicate the correct language even if I use a non-English phrase in the middle of an otherwise English text, hetgeen een mooie oplossing is voor automatische vertaalsoftware. Which looks like this in HTML:
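The snippet that originally followed is not preserved here. As a stand-in, a minimal Python sketch that produces the kind of inline mark-up the W3C guidance describes: the foreign phrase wrapped in an element carrying its own `lang` attribute. The helper name and the choice of `span` are assumptions.

```python
from html import escape


def mark_language(text, lang, tag="span"):
    """Wrap a phrase in an element that declares its language."""
    return '<{tag} lang="{lang}">{text}</{tag}>'.format(
        tag=tag,
        lang=escape(lang, quote=True),  # attribute value, so escape quotes too
        text=escape(text),              # phrase is treated as plain text
    )


# The Dutch phrase from the sentence above, marked up inside English text.
sentence = "even in the middle of an English text, " + mark_language(
    "hetgeen een mooie oplossing is voor automatische vertaalsoftware", "nl"
)
```

The surrounding English text needs no mark-up of its own, as it inherits the default language declared for the page.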

This means that what I needed to do was for the category archive pages for Dutch and German, as well as for individual postings, find the right spot in the source of a page to declare the correct language. I did this in the WordPress Theme I am using, or rather in the child theme (which allows you to specify any deviations from the original theme, while keeping the rest of the theme as it was).

For both the Dutch and German category pages I created separate templates, called category-nederlands.php and category-deutsch.php, whose names correspond to the names of the categories in my WordPress instance. At the top of those pages I added a language indicator where the main part of the page starts.

For individual blogposts it is a bit more difficult, as you need to be able to determine first if a posting is in another language than English. I adapted the single.php template, which renders individual postings. There I added a line of code to see if the posting is in Dutch or German, by checking if it is in the corresponding category.

This results in either adding lang="nl-nl" or lang="de-de" to postings in those languages, in the same location as for the category archive pages shown above.

Hopefully this now allows browsers to correctly detect the language of content on my site.
I’m not entirely done yet, because in some overviews, like the front page, individual postings that are not in English are not yet marked with the correct language. Only if you go to that posting itself will the language be correctly set. But this can be solved in a similar way, I assume. [UPDATE 2019-10-14] I’ve also edited the index.php and category.php templates to check if a posting is in the Dutch or German language category, and add a language declaration using a <div lang="nl-nl"> around the posting. For the index.php I do that only for the home page. This works, but as far as I can tell Google Translate’s ‘detect language’, for example, only checks the default language of a page. As I am not here to facilitate Google, I am currently satisfied that I at least do now provide clear meta-data about the language of postings I publish.[/UPDATE]
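The overview case from the update can be sketched like this, again in Python as an illustration rather than the PHP in the templates. The input format (a list of body and category pairs) and the function name are assumptions.

```python
# Language category slug -> value for the lang attribute.
LANG_BY_CATEGORY = {"nederlands": "nl-nl", "deutsch": "de-de"}


def render_overview(posts):
    """Render postings for an overview page.

    posts: list of (html_body, category_slugs) tuples. Postings in a
    language category get wrapped in a <div> with the matching lang
    attribute; English postings are passed through unchanged.
    """
    parts = []
    for body, categories in posts:
        lang = next(
            (LANG_BY_CATEGORY[c] for c in categories if c in LANG_BY_CATEGORY),
            None,
        )
        if lang:
            parts.append('<div lang="{}">{}</div>'.format(lang, body))
        else:
            parts.append(body)
    return "\n".join(parts)
```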
A final step I’d like to add is to automatically insert machine translation links into my RSS feed items, although I’m still not entirely sure that would be useful.

Also see Adding Better Language Support II and Adding Better Language Support III