Last modified: 2014-08-27 22:06:46 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the pages displaying bug reports and their history, links might be broken. See T60784, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 58784 - jsub and utf8


Summary:	jsub and utf8

Status:	RESOLVED WORKSFORME

Product:	Wikimedia Labs
Classification:	Unclassified
Component:	tools (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Unprioritized normal
Target Milestone:	---
Assigned To:	Marc A. Pelletier

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-12-21 08:03 UTC by sigmawp
Modified:	2014-08-27 22:06 UTC (History)
CC List:	5 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description sigmawp 2013-12-21 08:03:35 UTC

A Python 3 script containing this code was executed with jsub:

    import sys
    print(sys.stdout.encoding)

The resulting .out file contained "ANSI_X3.4-1968".
Most scripts assume that the output encoding is UTF-8. When it is not, writing any non-ASCII character fails.

Another Python 3 script containing this code was executed with jsub:

    print("Talk:Gülen movement")

The resulting .err file contained this:

    Traceback (most recent call last):
      File "...", line 5, in <module>
        print("Talk:G\xfclen movement")
    UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 6: ordinal not in range(128)
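
The failure can be reproduced without the grid by encoding the same string by hand. This is only a sketch: it assumes that a stream reporting "ANSI_X3.4-1968" behaves like Python's "ascii" codec (Python treats that name as an alias for it), and try_encode is a helper name made up for illustration.

```python
import codecs

TITLE = "Talk:Gülen movement"  # same text as in the report


def try_encode(text, encoding):
    """Return the encoded bytes, or the UnicodeEncodeError if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


# "ANSI_X3.4-1968" is an alias that Python resolves to its ASCII codec.
print(codecs.lookup("ANSI_X3.4-1968").name)

# ASCII cannot represent "ü", so this yields the error object ...
print(try_encode(TITLE, "ascii"))
# ... while UTF-8 encodes it as two bytes.
print(try_encode(TITLE, "utf-8"))
```

The error object carries the offending position in its start attribute, which matches the "position 6" in the traceback.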

jsub is written in Perl, which is perfectly capable of using utf8 as its output encoding. Unicode is important enough to all of us that I propose changing jsub to do so.

I am not an expert with Perl, but I would try to add "use utf8;\nuse open qw/:std :utf8/;" to the top of the file, right under "use warnings;".

On a slightly related note, scripts running as regular CGI also use the "ANSI_X3.4-1968" encoding. This may be out of scope for this bug though.

Comment 1 Tim Landscheidt 2013-12-21 09:01:32 UTC

I can't reproduce either claim:

| scfc@tools-login:~$ cat > test.py && chmod +x test.py && rm -f test.{out,err} && jsub ./test.py && while ! job test > /dev/null; do sleep 1; done && cat test.{out,err}
| #!/usr/bin/python3
| import sys
| print(sys.stdout.encoding)
| Your job 1933102 ("test") has been submitted
| UTF-8
| scfc@tools-login:~$ cat > test.py && chmod +x test.py && rm -f test.{out,err} && jsub ./test.py && while ! job test > /dev/null; do sleep 1; done && cat test.{out,err}
| #!/usr/bin/python3
| print("Talk:Gülen movement")
| Your job 1933103 ("test") has been submitted
| Talk:Gülen movement
| scfc@tools-login:~$

Please provide a minimal example.

(Just to clear up some confusion: jsub doesn't actually execute the script; it just submits it to the job grid aka SGE/OGS.)

Comment 2 Kunal Mehta (Legoktm) 2013-12-21 09:55:15 UTC

Partially reproduced it.

Using the first script:

local-legobot@tools-login:~/$ jsub ./test.py && while ! job test > /dev/null; do sleep 1; done && cat test.{out,err}
Your job 1933479 ("test") has been submitted
ANSI_X3.4-1968

Second script:

local-legobot@tools-login:~/$ jsub ./test.py && while ! job test > /dev/null; do sleep 1; done && cat test.{out,err}
Your job 1933488 ("test") has been submitted
Talk:Gülen movement

Comment 3 Merlijn van Deen (test) 2013-12-21 11:22:37 UTC

I think this should be a more generic request to make sure the environment on the exec hosts is the same as what someone has when testing in the interactive shell.

In any case, the problem is the following:

valhallasw@tools-login:~$ cat > test.sh
#!/bin/bash
locale
valhallasw@tools-login:~$ chmod +x test.sh
valhallasw@tools-login:~$ ./test.sh
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

valhallasw@tools-login:~$ jsub ./test.sh
valhallasw@tools-login:~$ cat test.out
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=



Setting LANG="en_US.UTF-8" (or any other UTF-8 locale) should solve this issue.
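
Why an empty LANG ends up as "POSIX" can be sketched in Python. The lookup order below (LC_ALL, then LC_CTYPE, then LANG, with empty values counting as unset) is the usual POSIX precedence for the character-type category; the function names and the two sample environments are illustrative assumptions modelled on the output above, not anything jsub or the grid provides.

```python
def effective_ctype_locale(env):
    """Return the locale name that governs character encoding.

    POSIX precedence: LC_ALL, then LC_CTYPE, then LANG. Empty values
    count as unset; with nothing set, the "POSIX" locale applies.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return "POSIX"


def looks_like_utf8(locale_name):
    """Heuristic: does the codeset part of e.g. "en_US.UTF-8@euro" name UTF-8?"""
    codeset = locale_name.partition(".")[2].partition("@")[0]
    return codeset.replace("-", "").lower() == "utf8"


# Only the explicitly set variables from the two "locale" listings above.
interactive = {"LANG": "en_US.UTF-8", "LANGUAGE": "", "LC_ALL": ""}
grid_job = {"LANG": "", "LANGUAGE": "", "LC_ALL": ""}

for label, env in (("interactive", interactive), ("grid job", grid_job)):
    name = effective_ctype_locale(env)
    print(label, name, looks_like_utf8(name))
```

A check like this at the start of a job would at least make a missing UTF-8 locale visible before the first non-ASCII character is printed.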

Comment 4 Merlijn van Deen (test) 2013-12-21 11:23:49 UTC

Oh, and to reproduce the issues: compare

LANG=C python -c "print u'\xe4'"

to 

LANG=en_US.UTF-8 python -c "print u'\xe4'"

Comment 5 Tim Landscheidt 2013-12-21 13:23:50 UTC

(In reply to comment #3)
> [...]
> Setting LANG="en_US.UTF-8" (or any other UTF-8 locale) should solve this
> issue.

It apparently does, because I have:

| export LANG=de_DE.UTF-8

in ~/.profile, and for me:

| scfc@tools-login:~$ diff -u test.out <(./test.sh)
| scfc@tools-login:~$

My test account, however, shows "LANG=en_US.UTF-8" interactively while "jsub locale" gives "LANG=", even after "export LANG".  The same happens if I set a locale other than "en_US.UTF-8" before calling jsub, e.g. with "export LANG=de_DE.UTF-8".

My assumption (and fear :-)) is that SGE sources ~/.profile before job execution, which means that there will be a *lot* of confusion on where to configure locales and how they are evaluated.

I don't want to go down that road if it can be avoided.  Is it possible to explicitly set the locale in Python?  Otherwise we could change jsub so that users can use qsub's "-v" option to set the locale in the environment:

| scfc-test@tools-login:~$ qsub -b y -N locale-en -v LANG=en_US.UTF-8 locale
| Your job 1934859 ("locale-en") has been submitted
| scfc-test@tools-login:~$ qsub -b y -N locale-de -v LANG=de_DE.UTF-8 locale
| Your job 1934865 ("locale-de") has been submitted
| scfc-test@tools-login:~$ fgrep LANG locale-*.o*
| locale-de.o1934865:LANG=de_DE.UTF-8
| locale-de.o1934865:LANGUAGE=
| locale-en.o1934859:LANG=en_US.UTF-8
| locale-en.o1934859:LANGUAGE=
| scfc-test@tools-login:~$

However, that does not seem to solve the Python error:

| scfc-test@tools-login:~$ cat test.py 
| #!/usr/bin/python
| print u"\xe4"
| scfc-test@tools-login:~$ qsub -b y -N python-locale-en -v LANG=en_US.UTF-8 ./test.py 
| Your job 1934872 ("python-locale-en") has been submitted
| scfc-test@tools-login:~$ cat python-locale-en.*
| Traceback (most recent call last):
|   File "./test.py", line 2, in <module>
|     print u"\xe4"
| UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
| scfc-test@tools-login:~$

And for the dbreps tool I indeed had to use:

|     # Wrap sys.stdout into a StreamWriter to allow writing unicode.
|     sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)

But that is Python 2.7.3 (cf. http://stackoverflow.com/questions/1473577/writing-unicode-strings-via-sys-stdout-in-python, http://pythonhosted.org/kitchen/unicode-frustrations.html, https://wiki.python.org/moin/PrintFails).

I don't know what the situation is for Python 3+.
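
For Python 3, a rough counterpart of the StreamWriter wrapper above is to wrap the underlying byte stream with a fixed encoding instead of relying on the locale. This is a sketch that has not been tried on the grid; utf8_writer is a made-up helper name, and the demonstration writes to an in-memory buffer where a real script would pass sys.stdout.buffer.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)


# Demonstration against an in-memory byte stream; in a real script the
# argument would be sys.stdout.buffer and the result assigned to sys.stdout.
sink = io.BytesIO()
out = utf8_writer(sink)
print("Talk:Gülen movement", file=out)
out.flush()

# Show the raw bytes that were written, independent of any locale setting.
print(sink.getvalue())
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") should achieve much the same without replacing the stream object.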

Comment 6 Merlijn van Deen (test) 2013-12-21 14:19:49 UTC

Ahh, there's another catch.

valhallasw@tools-login:~$ python ./test.py | tee
Traceback (most recent call last):
  File "./test.py", line 2, in <module>
    print u"\xe4"
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

valhallasw@tools-login:~$ PYTHONIOENCODING=utf-8 python ./test.py | tee
ä


but that's painful to say the least.
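
If the variable is set by whatever launches the script, the script itself can stay unchanged. A sketch in Python 3, assuming the launcher is itself a Python program: run_with_io_encoding is a hypothetical helper, and the child process is the current interpreter rather than the Python 2 used above.

```python
import os
import subprocess
import sys


def run_with_io_encoding(code, encoding):
    """Run `code` in a child interpreter with PYTHONIOENCODING set; return raw stdout."""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = encoding
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


# The child prints U+00E4; the bytes it emits depend only on the variable.
raw = run_with_io_encoding(r"print('\xe4')", "utf-8")
print(raw)
```

This keeps the encoding decision in one place, at the cost of having to remember it for every way the script can be started.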


Python 3 has no issues -- it will just use utf-8 if the LANG says so:

(test.py: print("\xe4") -- remember, str in py3 is unicode in py2)

valhallasw@tools-login:~$ python3 ./test.py | tee
ä

Comment 7 Tim Landscheidt 2013-12-21 14:54:55 UTC

(In reply to comment #5)
> [...]
> My assumption (and fear :-)) is that SGE sources ~/.profile before job
> execution, which means that there will be a *lot* of confusion on where to
> configure locales and how they are evaluated.

> I don't want to go down that road if it can be avoided.  Is it possible to
> explicitely set the locale in Python?  Otherwise we could change jsub so that
> users can use qsub's "-v" option to set the locale in the environment:
> [...]

No, we can't, as a test on my account with LANG set to de_DE.UTF-8 in ~/.profile shows:

| scfc@tools-login:~$ qsub -b y -v LANG=it_IT.UTF-8 env
| Your job 1935416 ("env") has been submitted
| scfc@tools-login:~$ fgrep LANG env.o1935416 
| LANG=de_DE.UTF-8
| scfc@tools-login:~$

In bug #48811 we encountered a similar problem: We need "-b y" for binary programs, but "-b y" adds a (login) shell to the call stack:

| scfc@tools-login:~$ { echo '#!/usr/bin/python'; echo 'import os'; echo 'print os.environ["LANG"]'; } > env-test.py && chmod +x env-test.py
| scfc@tools-login:~$ qsub -N test-without-b-y -v LANG=it_IT.UTF-8 ./env-test.py 
| Your job 1935503 ("test-without-b-y") has been submitted
| scfc@tools-login:~$ qsub -N test-with-b-y -b y -v LANG=it_IT.UTF-8 ./env-test.py 
| Your job 1935504 ("test-with-b-y") has been submitted
| scfc@tools-login:~$ grep . test-with*-b-y.*
| test-with-b-y.o1935504:de_DE.UTF-8
| test-without-b-y.o1935503:it_IT.UTF-8

There is a configuration variable login_shells in sge_conf(5), but I'll need to whip Toolsbeta into shape to evaluate options.

For the time being I suggest wrapper scripts.

Comment 8 Marc A. Pelletier 2013-12-21 15:48:25 UTC

Part of the difficulty is that there is a combinatorial explosion of starting environments depending on more factors than you can shake a stick at (given the gridengine's propensity to try to "guess" at what you're trying to do, and to (silently) add a shell anytime it thinks you need to evaluate shell arguments).

The best rule of thumb is "if you need something specific in your environment, set it explicitly".  I would recommend that one /always/ uses a shell wrapper that sets the environment; a simple generic one might be:

#! /bin/bash

export STUFF_I_NEED="foobar"
export PATH="/all:/the/places"
exec "$@"

This sets STUFF_I_NEED and PATH, then execs the program given as its arguments without needlessly keeping a shell around.  The same script can then be used to launch everything reliably.

I *could* make a globally available script that relies on sourcing, say, .bashrc:

#! /bin/bash

. ~/.bashrc
exec "$@"

Everyone could then use it, and I could even have jsub invoke it implicitly when needed.

Comment 9 Marc A. Pelletier 2014-03-25 18:04:11 UTC

Is this still a relevant issue?

Comment 10 Marc A. Pelletier 2014-08-27 22:06:46 UTC

Left without comment for >six months; reopen if the issue is still relevant.
