
Hyphenated langtags in Thumbor/7.3.2 and librsvg 2.44.10 do not show any text
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

*: https://commons.wikimedia.org/w/index.php?lang=az-latn&title=File%3AIPv6_header-en.svg

What happens?:
az-latn shows no text

What should have happened instead?:
az-latn should show the az-latn text

Software version (skip for WMF-hosted wikis like Wikipedia):
Thumbor/7.3.2
librsvg 2.44.10

Other information (browser name/version, screenshots, etc.):
Thumbor URLs

The az-latn Thumbor URL shows no text:

The az Thumbor URL shows az-latn text

T261192
T335361

librsvg 2.40 only matched langtags up to the first hyphen.
librsvg 2.44 does not even match the default.

The problem may be that az-latn is not a Unix locale string.

A possible workaround that restores the old librsvg 2.40 behavior is to truncate hyphenated langtags.

Event Timeline

Restricted Application added a subscriber: Aklapper. May 20 2023, 6:05 PM

Change 923368 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] svg: attempt to build valid locales from hyphenated languages

https://gerrit.wikimedia.org/r/923368

Thanks for the report and the test cases. This change attempts to build valid locales to fix both of the issues. I am curious as to whether our approach of using these languages in Thumbor when we used LANG rather than LC_ALL would have ever worked for these language tags.

More broadly I wonder whether we should return an error when we can't get a valid locale and a language is requested rather than just serve an empty generated image. The language-tags library exists for managing RFC-compliant language strings, but doesn't seem to offer us anything particularly useful for mapping between languages

Thanks for the report and the test cases. This change attempts to build valid locales to fix both of the issues. I am curious as to whether our approach of using these languages in Thumbor when we used LANG rather than LC_ALL would have ever worked for these language tags.

Generally, there is not a one-to-one mapping between IETF langtags and locale strings. Consequently, I believe it is not a good idea to map a langtag to a Unix locale string and then hope that the Rust crate will map that locale string back to the original langtag. The screwiness is too complicated to go into right now. IIRC, converting the langtag en-us to the locale string en_US will match the langtag en-GB in an SVG file. The right thing to do is avoid locales entirely and pass the langtag through --accept-languages, but that must wait for the next version of the operating system.

Some background.

Early versions of librsvg just read $LANG and (mis)treated it as an IETF langtag. Environment variables are just strings. That would have worked well except that Gnome got the IETF langtag string matching routine wrong. The matching bug meant that Gnome only matched the first subtag and ignored the rest. That is why zh-Hans matches zh-Hant in librsvg 2.40.
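To make the matching difference concrete, here is a minimal sketch. It is not librsvg's code: `matches_first_subtag_only` only imitates the behavior described above, and `basic_filter_match` follows the basic filtering rule of RFC 4647 for contrast. Both function names are made up for this illustration.

```python
def first_subtag(tag: str) -> str:
    """Return the primary language subtag, lowercased."""
    return tag.split("-", 1)[0].lower()


def matches_first_subtag_only(requested: str, available: str) -> bool:
    """Imitate the described 2.40 behavior: compare only the first subtag."""
    return first_subtag(requested) == first_subtag(available)


def basic_filter_match(language_range: str, tag: str) -> bool:
    """RFC 4647 basic filtering: the range equals the tag, or is a
    prefix of the tag that ends right before a hyphen (case-insensitive)."""
    r = language_range.lower()
    t = tag.lower()
    return t == r or t.startswith(r + "-")


# zh-Hans wrongly matches zh-Hant when only the first subtag is compared.
print(matches_first_subtag_only("zh-Hans", "zh-Hant"))
# With basic filtering the two tags stay distinct, while the range zh covers both.
print(basic_filter_match("zh-Hans", "zh-Hant"), basic_filter_match("zh", "zh-Hant"))
```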

Gnome then decided to use a Rust crate to do the IETF matching. That had some nice features for guessing a reasonable IETF langtag from the user's environment, but it ran into trouble because users could no longer set a particular langtag. For example, Unix and the Rust crate may understand the locale string zh_CN but not zh_Hans, so it could generate the langtag zh-CN but not zh-Hans. Even more troubling: what if the user wanted zh-CN-Hans? And, of course, WMF would be toast with its non-compliant sr-ec (which IETF would interpret as Serbian as spoken in Ecuador). The mapping between Unix locale strings and IETF langtags is not invertible.

Gnome fixed that problem with addition of --accept-languages, but that change comes after librsvg 2.44.10.

More broadly I wonder whether we should return an error when we can't get a valid locale and a language is requested rather than just serve an empty generated image. The language-tags library exists for managing RFC-compliant language strings, but doesn't seem to offer us anything particularly useful for mapping between languages

Do not return an error. SVG wants IETF language tags, but it does not require them to be valid. Avoid thinking about converting between langtags and locales.

For right now, I would suggest the following approach.

If lang is not specified, then set it to en. I'm hoping that will fix the non-English default case. It would enforce the current WMF semantics that English is the default.

If lang contains a hyphen, then truncate lang to the first subtag. For example, modify zh-hans to zh. This step is not ideal, but it should match the (mis)behavior of librsvg 2.40.

Set the environment variable $LC_ALL to the modified lang. (I might also set $LANG.)

If librsvg 2.44.10 does not complain when given --accept-languages (it probably will complain), then I would consider passing it with the original value of lang. That way we are ready for the next upgrade.

(Side thought: can lang be used for script injection?)

Add unit tests to check whether lang works for non-English default and hyphenated langtags.
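The interim steps above can be sketched as follows. This is only an illustration of the suggestion, not the Thumbor plugin code; the function name `locale_env_for_lang` is hypothetical, and setting `LANG` alongside `LC_ALL` follows the parenthetical remark above.

```python
def locale_env_for_lang(lang):
    """Build environment variables for the rasterizer from a requested langtag.

    Hypothetical helper: defaults to "en" when no language is given and
    truncates hyphenated langtags to their first subtag.
    """
    if not lang:
        lang = "en"  # WMF semantics: English is the default
    truncated = lang.split("-", 1)[0]  # zh-hans -> zh
    return {"LC_ALL": truncated, "LANG": truncated}


print(locale_env_for_lang(None))
print(locale_env_for_lang("zh-hans"))
```

A unit test for the non-English default and for hyphenated langtags could then assert on the returned dictionary before any rendering is attempted.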

The right thing to do is avoid locales entirely and pass the langtag through --accept-languages, but that must wait for the next version of the operating system.

Packaging our own librsvg version or backporting from bullseye should be more possible than it was on stretch, since most (all?) of the rust buildchain dependency issues should be solved. I don't know if anyone's looked at the feasibility of doing that recently though.

+5 for ACN.

librsvg 2.44.10 is old and broken, so the right thing is to use a much more recent version. Clearly 2.44.10 is broken for hyphenated language tags and other issues. Upgrade to a modern version of librsvg and pass the langtag through accept-languages.

Gnome fixed that problem with addition of --accept-languages, but that change comes after librsvg 2.44.10.

More broadly I wonder whether we should return an error when we can't get a valid locale and a language is requested rather than just serve an empty generated image. The language-tags library exists for managing RFC-compliant language strings, but doesn't seem to offer us anything particularly useful for mapping between languages

For right now, I would suggest the following approach.

If lang is not specified, then set it to en. I'm hoping that will fix the non-English default case. It would enforce the current WMF semantics that English is the default.

If lang contains a hyphen, then truncate lang to the first subtag. For example, modify zh-hans to zh. This step is not ideal, but it should match the (mis)behavior of librsvg 2.40.

Set the environment variable $LC_ALL to the modified lang. (I might also set $LANG.)

If librsvg 2.44.10 does not complain when given --accept-languages (it probably will complain), then I would consider passing it with the original value of lang. That way we are ready for the next upgrade.

librsvg 2.44.10 does not accept this flag. Version 2.50.3 (which is packaged with bullseye) doesn't either, so we will have to package our own version. I'll start the work on that.

While the solution in the patch is imperfect, it will get the rendering of images somewhat unblocked for the time being and will address issues like the ones mentioned in the cited images while also adding some support for distinctions between language variations like zh_hk and zh_tw.

(Side thought: can lang be used for script injection?)

The change above will only allow the environment variable to be set if it is a valid locale and otherwise will default to en.
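As an illustration of guarding the value before it reaches the environment, here is a minimal sketch. It is not the merged patch: it swaps in a simpler shape check (letters, digits and hyphens only) instead of a real locale-validity check, and both `_SAFE_LANG` and `sanitize_lang` are hypothetical names.

```python
import re

# Assumption: a conservative shape for a langtag-like string, i.e. a 2-8 letter
# primary subtag followed by optional hyphen-separated alphanumeric subtags.
_SAFE_LANG = re.compile(r"[A-Za-z]{2,8}(?:-[A-Za-z0-9]{1,8})*")


def sanitize_lang(lang, default="en"):
    """Return a lowercased lang if it looks like a langtag, else the default."""
    if isinstance(lang, str) and _SAFE_LANG.fullmatch(lang):
        return lang.lower()
    return default


print(sanitize_lang("zh-Hans"))
print(sanitize_lang("en; rm -rf /"))
```

Anything containing spaces, semicolons, slashes or other shell-relevant characters falls back to the default, so such a value never ends up in `LC_ALL`.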

Add unit tests to check whether lang works for non-English default and hyphenated langtags.

I've added a test for File:IPv6_header-en.svg to the linked CR to ensure that the correct image is generated when lang is set

I'm not sure whether it helps if I provide simple SVG examples with systemLanguage=

The hyphen problem can be seen in, e.g., https://commons.wikimedia.org/wiki/File:SystemLanguage.svg

Sometimes artificial lang-tags are used for including several images into one SVG: e.g. https://commons.wikimedia.org/wiki/File:Unicode_Geschlechtersymbole.svg

Looks like it's not just hyphenated codes; "simple" has the same problem.

Simple should be mapped to "en-simple", there are more cases like this on https://meta.wikimedia.org/wiki/Special_language_codes which shows what they should be mapped to.
This is WMF specific, so I doubt bullseye is going to cover this aspect.

Change 923368 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] svg: attempt to build valid locales from hyphenated languages

https://gerrit.wikimedia.org/r/923368

Change 930641 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: attempt to render hypenated svg languages better

https://gerrit.wikimedia.org/r/930641

Change 930641 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: attempt to render hypenated svg languages better

https://gerrit.wikimedia.org/r/930641

The temporary fix for hyphenated languages has mitigated some of the issues highlighted in this ticket. However, the correct solution is to build and deploy a more modern version of rsvg-convert, which will be done in the coming weeks.

@hnowlan
There is a related problem at T337199.

Consider the file

It should display "en" (the systemLanguage="en" English translation clause), but instead it displays "other" (the default clause).

That means that rsvg-convert does not know it is supposed to render English.

The relevant wiki URL is

That URL does not have an explicit language parameter, so I presume

will be entered without a lang parameter: That is

if hasattr(self.context.request, 'lang'):

will not be true. The consequence is that LC_ALL is never added to env, so the rasterizer uses some language that does not match "en".

MediaWiki semantics want a default URL (one with no lang parameter) to default to English.

When Thumbor processes a URL that has not set self.context.request.lang, then it should either force lang to "en" before further processing or explicitly set

env = {'LC_ALL': 'en'}

so rsvg-convert knows to use the preferred language "en".
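A minimal sketch of that defaulting, assuming a request object that may or may not carry a `lang` attribute. The helper name `build_rsvg_env` is hypothetical, and `SimpleNamespace` only stands in for the real request object. Using `getattr` with a fallback also covers the case where the attribute exists but is empty, which a bare `hasattr` check would let through.

```python
from types import SimpleNamespace


def build_rsvg_env(request):
    """Return the environment for the rasterizer, defaulting to English."""
    # Missing attribute, None and "" all fall back to "en".
    lang = getattr(request, "lang", None) or "en"
    return {"LC_ALL": lang}


# Stand-ins for requests with and without a language parameter.
print(build_rsvg_env(SimpleNamespace()))
print(build_rsvg_env(SimpleNamespace(lang="de")))
```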

Change 962563 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] svg: default to "en" when a language is not specified

https://gerrit.wikimedia.org/r/962563

Change 962563 merged by jenkins-bot:

[operations/software/thumbor-plugins@master] svg: default to "en" when a language is not specified

https://gerrit.wikimedia.org/r/962563

The latest change appears to have improved the default on many of the supplied cases.

Closing this task for now, please reopen if needed or if this was done in error.

Change #1042203 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] svg: use rsvg-convert's language parameter

https://gerrit.wikimedia.org/r/1042203

I think users need to make some fixes here. The conclusion is clearly to use standards.

Suggested message:

SVGs using WMF-specific language codes need to change their language code over to BCP47. For example, Simple English needs to move from "simple" to "en-basiceng".

@Snaevar Is there an ideal listing we can link to, which provides those exact codes?
I.e. The link in your draft doesn't contain the strings en-basiceng or basiceng.

I searched and found the large canonical listing at https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry but it seems a bit confusing to me at first glance, too. E.g. it contains an entry for both

%%
Type: variant
Subtag: simple
Description: Simplified form
Added: 2015-12-29
%%

and also

%%
Type: variant
Subtag: basiceng
Description: Basic English
Added: 2015-12-29
Prefix: en
%%

I can see how the 2nd entry gets the Prefix+Subtag combined, but I wonder if that needs to be explained in the Tech News entry...?
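For what it's worth, the Prefix+Subtag combination can be sketched in a few lines. This is a simplified reader for the two records quoted above, not a full registry parser: it assumes single-line fields and at most one Prefix per record, and the names `parse_records` and `full_tag` are made up for this example.

```python
SAMPLE = """%%
Type: variant
Subtag: simple
Description: Simplified form
Added: 2015-12-29
%%
Type: variant
Subtag: basiceng
Description: Basic English
Added: 2015-12-29
Prefix: en
%%
"""


def parse_records(text):
    """Split registry text on %% and read each 'Key: value' line into a dict."""
    records = []
    for chunk in text.split("%%"):
        fields = {}
        for line in chunk.strip().splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def full_tag(record):
    """Join Prefix and Subtag when a prefix is given, else return the bare subtag."""
    prefix = record.get("Prefix")
    return f"{prefix}-{record['Subtag']}" if prefix else record["Subtag"]


for rec in parse_records(SAMPLE):
    print(rec["Subtag"], "->", full_tag(rec))
```

The `simple` record has no Prefix field, so it does not yield a complete tag on its own, whereas `basiceng` combines with its prefix to form `en-basiceng`.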

(Or, maybe this would be clearer with a different example, if en-basiceng is an edge-case within these other edge-cases?!
Or perhaps we just need to add a note about en-basiceng into the original link, and all other languages are self-explanatory at that location?)

p.s. Thank you for the draft! It is always appreciated. :)

I noticed another task that mentions BCP47 at T366623: Create a parser function to get the BCP47 code for a language -- I'm not sure if that's related at all, but possibly?

The way I see it "simple" is for all languages, so it could be simple German, simple French, simple Spanish, and so on. Using it for simple English only would not be appropriate. I do agree that using "simple" as an example is not that great. I got "en-basiceng" from https://en.wikipedia.org/wiki/Codes_for_constructed_languages, but I have also seen "en-simple".

T366623 is helpful but not the same. SVGs are vector images, and parser functions cannot be used in images. But the user could preview the function on a wiki page and then get the code they need to change to. Also, the {{#bcp47|sr-ec}} usage is not correct; it is actually used as {{#bcp47:sr-ec}} - I tested it on test.wikipedia.org. While I was on test.wikipedia.org I also found out that most of these special language codes are different in BCP47.

So how about dropping the example and writing this like so:

SVGs using WMF-specific language codes need a different language code. You can find the right one by previewing {{#bcp47:langcode}} on a wiki page and copying it to your SVG file.

Oh! Right, that makes sense for the simple variant, thanks.
Re: the new parser function, I cannot reproduce a successful test. (testwiki and beta cluster). But also, the patch isn't merged yet, so I'm also guessing that maybe you tested it locally?
I will delay including this entry in Tech News until next week, once everything is clearer (to me at least!). Thanks again for the details, and the drafts.

Hi @Snaevar (or anyone), given that the proposed parser function hasn't been merged yet, I'm wondering if you have any suggestions on how to most clearly help the editors who might need some guidance?

Ideally, I think we could write something like this, but I'm not sure if it's accurate enough, or detailed enough:

Editors who work with multilingual SVG files can add language tags to the SVG's labels. If the SVG contains labels that use a hyphenated Wikimedia specific language code that doesn't match the BCP47 standard, such as zh-classical, then those labels need to be changed to use the BCP47 code instead, such as lzh.

P.s. @hnowlan I see there's an unmerged patch connected to this task, yet the task is marked as resolved. Does that status need to be changed?

zh-classical is not an IETF langtag; classical cannot be a legal subtag. lzh or zh-lzh would be a rare choice.

For Chinese, the simple choice is to use zh-Hans (simplified Chinese script) or zh-Hant (traditional Chinese script). Websites will often use zh-CN (mainland China/simplified) or zh-TW (Taiwan/traditional). Those are the four that I would recommend. One could also specify Cantonese (yue) or Mandarin (cmn).

There are also issues with Serbian: sr-Cyrl (Cyrillic script) and sr-Latn (Latin script). There are also the faux langtags sr-ec (= Serbian as spoken in Ecuador/Cyrillic) and sr-el (= Serbian as spoken in an undefined region/Latin) that have leaked into SVG files.

Many editors have added hyphenated langtags to SVG files on Commons. I do not think they find it difficult to do that. The problem is getting the desired result. That will continue to be a problem for Chinese and Serbian.

Editors can also use SVG Translate to add languages. SVG Translate was used extensively on

That file is actually big enough that people will view the SVG in their browser, and that all works out because browsers have correctly handled hyphenated langtag preferences for years.

Due to the hyphenated langtag bug, hyphenated langtags just match the first subtag and ignore the rest. That is, zh-Hans and zh-Hant behave just like zh. That means that files with simplified and traditional Chinese scripts may display a mix of those scripts rather than a single, consistent script.

Now that that bug has been resolved, we will see more problems appear.

If you go to the zh.Wiki, the default langtag is zh. For example,

includes the 2022_Russian_Invasion_of_Ukraine.svg with the URL .../langzh-350px.... That means Thumbor will use zh and continue to use a random selection of Chinese scripts. Not the best alternative.

If I select a specific Chinese variant:

the image inclusion still uses .../langzh-350px....

Ideally, the default SVG langtag should vary depending on the script/region choice of the wiki.

Be careful what you wish for. That will create other issues. If the default langtag becomes zh-Hans (.../langzh-hans-350px...), then the Russian Invasion map will not display any Chinese. It should be fed an Accept-Languages specification of zh-hans, zh to give Hans if available but any zh otherwise.
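The fallback described here can be sketched as an ordered preference list. This is a simplified model, not how librsvg or any browser selects a clause: it walks the preferences in order and returns the first clause tag that matches under RFC 4647 basic filtering. `range_matches` and `pick_clause` are hypothetical names.

```python
def range_matches(language_range, tag):
    """RFC 4647 basic filtering, case-insensitive."""
    r = language_range.lower()
    t = tag.lower()
    return t == r or t.startswith(r + "-")


def pick_clause(preferences, clause_tags):
    """Return the first clause tag matched by the earliest preference, or None."""
    for language_range in preferences:
        for tag in clause_tags:
            if range_matches(language_range, tag):
                return tag
    return None


# Only zh-hans requested, file only has plain zh text: nothing matches.
print(pick_clause(["zh-hans"], ["zh", "en"]))
# zh-hans, zh requested: Hans if present, otherwise any zh clause.
print(pick_clause(["zh-hans", "zh"], ["zh-Hant", "zh-Hans"]))
print(pick_clause(["zh-hans", "zh"], ["zh", "en"]))
```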

I don't think we can get that functionality because the wiki script change is a post-process of the page.

Same goes for Serbian.

Some editors avoid the issue by using both Cyrillic and Latin for Serbian:

IIRC, problems with MediaWiki's language matching will also arise.

And then there is the wiki-semantics issue if small SVG files are served directly.

Snaevar removed a project: User-notice.

Hi @Snaevar (or anyone), given that the proposed parser function hasn't been merged yet, I'm wondering if you have any suggestions on how to most clearly help the editors who might need some guidance?

Ideally, I think we could write something like this, but I'm not sure if it's accurate enough, or detailed enough:

Editors who work with multilingual SVG files can add language tags to the SVG's labels. If the SVG contains labels that use a hyphenated Wikimedia specific language code that doesn't match the BCP47 standard, such as zh-classical, then those labels need to be changed to use the BCP47 code instead, such as lzh.

It is not necessarily hyphenated language codes. Yes, the task is about hyphenated language codes, but this specific part of it really is not. IKhitron started this discussion and I have kept it alive. Probably this should have been split from this task a while ago. So, split into T368128, before that confusion escalates further. User-notice tag moved.