Edit detail for #691 Wrong WikiWords with UTF-8 international characters revision 1 of 1

Editor: frank
Time: 2004/02/28 09:31:49 GMT+0
Note: revert

changed:
-
When you type a word that begins with an uppercase letter and contains a UTF-8 international character, it is incorrectly recognized as a WikiWord. For instance:

Alguém 

This happened to me using ZWiki 0.27rc1, Plone 2.0, and Zope 2.7, with Zwiki running inside Plone.

However, as I needed a quick solution, I decided to fix it myself!

So here is my fix to Regexps.py (line 135)::

  import locale
  import string   # needed for string.uppercase / string.lowercase
  (lang, encoding) = locale.getlocale()
  #locale.setlocale(locale.LC_ALL, '')
  if locale.getlocale() != (None, None):
    # UTF-8-encode every upper/lowercase letter the locale knows about
    Ul = [x.encode('utf8') for x in unicode(string.uppercase, encoding)]
    Ll = [x.encode('utf8') for x in unicode(string.lowercase, encoding)]

    # build "A|B|..." alternations from the encoded letters
    U = reduce(lambda x,y: x+'|'+y, Ul)
    L = reduce(lambda x,y: x+'|'+y, Ll)

    wikiname1 = r'(?L)\b(%s)+(%s)+(%s)(%s)*[0-9]*' % (U,L,U,U+'|'+L)
    wikiname2 = r'(?L)\b(%s)(%s)+(%s)(%s)*[0-9]*'  % (U,U,L,U+'|'+L)

Remember to set up Zope's locale correctly as well.

I hope this helps!

traldar
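For readers following along today, the patch's idea can be sketched in modern Python 3, where strings are already Unicode (a rough sketch only; the Latin-1-only character ranges below are an assumption, not the shipped code)::

```python
import re

# Rough Python 3 sketch of the patch's idea (not the shipped code):
# character classes covering ASCII plus the Latin-1 letter ranges.
U = '[A-ZÀ-Þ]'   # uppercase: ASCII A-Z plus Latin-1 À-Þ
L = '[a-zà-ÿ]'   # lowercase: ASCII a-z plus Latin-1 à-ÿ

# WikiWord shape: Uppercase+ lowercase+ Uppercase, then more letters/digits
wikiname1 = r'\b%s+%s+%s(?:%s|%s)*[0-9]*' % (U, L, U, U, L)

print(bool(re.match(wikiname1, 'AlguémPage')))  # True: two capitalized words
print(bool(re.match(wikiname1, 'Alguém')))      # False: a single word
```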

From SimonMichael Thu Jan 29 00:16:59 -0800 2004
From: SimonMichael
Date: Thu, 29 Jan 2004 00:16:59 -0800
Subject: very interesting
Message-ID: <[email protected]>

It helps a lot, thanks. I have some questions - answer any you care to!

- I simplified this to, e.g.::

  U='|'.join([x.encode('utf8') for x in unicode(string.uppercase,encoding)])

 Is this OK, or is the reduce required for tricky i18n string handling?

- why did you comment out the setlocale - was it resetting your locale? Is that because you don't have a LANG environment variable? I put it there because the locale docs say to start your app that way, to set the locale according to LANG, but maybe Zope does this already

- do you have any sense of the performance impact of changing the regexp this way?

- would it make sense to utf8-encode the default international characters also, in the case where there is no locale? Something like::

    # maybe these should be utf8-encoded too
    #U='|'.join([x.encode('utf8') for x in unicode('A-Z\xc0-\xdf',encoding)])
    #L='|'.join([x.encode('utf8') for x in unicode('a-z\xe0-\xff',encoding)])
    #b = '(?<!(%s|[0-9]))' % (U+'|'+L)
    #wikiname1 = r'%s(%s)+(%s)+(%s)(%s|%s)*[0-9]*' % (b,U,L,U,U,L)
    #wikiname2 = r'%s(%s)(%s)+(%s)(%s|%s)*[0-9]*'  % (b,U,U,L,U,L)

- and a more general one: are we being overly restrictive in recognizing only characters defined in the locale? Should we be able to recognize any and all international characters, more or less, regardless of locale?
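On the first question above: the '|'.join form produces exactly the same byte-level alternation as the reduce, so no i18n-specific handling is lost. A small Python 3 bytes sketch (the sample letters are an assumption)::

```python
import re

# The '|'.join simplification builds the same byte-level alternation as
# the reduce; a multibyte UTF-8 letter simply contributes several bytes.
letters = 'AÉ'                      # sample letters (an assumption)
U = b'|'.join(c.encode('utf8') for c in letters)

pat = re.compile(b'(?:%s)' % (U,))  # byte-oriented pattern, as in the patch
print(U)                                           # b'A|\xc3\x89'
print(pat.match('É'.encode('utf8')) is not None)   # True
```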

From SimonMichael Thu Jan 29 11:16:49 -0800 2004
From: SimonMichael
Date: Thu, 29 Jan 2004 11:16:49 -0800
Subject: status
Message-ID: <[email protected]>

This server is now running my version of your patch (see http://zwiki.org/zwikidir/Regexps.py ) for testing. I have no locale set on this server though, so what we're seeing here is the no-locale case, which I've tried to update for UTF-8.

It's not yet right, though (and getting hard to follow!). Although it recognizes the UTF-8 Alguém as a whole word, it links it when it shouldn't.

From Samotnik Thu Jan 29 11:30:25 -0800 2004
From: Samotnik
Date: Thu, 29 Jan 2004 11:30:25 -0800
Subject: So let's try
Message-ID: <[email protected]>

Śpiąca Królewna łka, a wróble ćwierkają

AlguémPage

From SimonMichael Thu Jan 29 22:29:56 -0800 2004
From: SimonMichael
Date: Thu, 29 Jan 2004 22:29:56 -0800
Subject: checking in
Message-ID: <[email protected]>

The no-locale case seems to be working now as well, recognizing Western European characters in links and not linking things it shouldn't. The regexps just got more complicated; I haven't noticed any blatant speed change. More testing is needed. I think this needs to go into 0.27.

From SimonMichael Thu Jan 29 22:37:31 -0800 2004
From: SimonMichael
Date: Thu, 29 Jan 2004 22:37:31 -0800
Subject: property change
Message-ID: <[email protected]>

Status: open => closed 


From unknown Tue Feb 3 03:35:46 -0800 2004
From: 
Date: Tue, 03 Feb 2004 03:35:46 -0800
Subject: Problem if locale encoding is utf-8
Message-ID: <[email protected]>

I get a broken ZWiki product if the locale is set to de_DE.UTF-8 (on FreeBSD with the utf8locale port).
The traceback is::

 Import Traceback
 
 Traceback (most recent call last):
  File "/usr/local/www/z27c1/lib/python/OFS/Application.py", line 654, in import_product
    product=__import__(pname, global_dict, global_dict, silly)
  File "/usr/local/www/z27c1inst/Products/ZWiki/__init__.py", line 10, in ?
    import ZWikiPage, ZWikiWeb, Permissions, Defaults
  File "/usr/local/www/z27c1inst/Products/ZWiki/ZWikiPage.py", line 63, in ?
    from Regexps import url, bracketedexpr, doublebracketedexpr, \
  File "/usr/local/www/z27c1inst/Products/ZWiki/Regexps.py", line 71, in ?
    U='|'.join([x.encode('utf8') for x in unicode(string.uppercase,encoding)])
 UnicodeDecodeError: 'utf8' codec can't decode bytes in position 26-27: invalid data
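The failure mode is easy to reproduce: letter bytes produced under a single-byte locale are not valid UTF-8 sequences, so the utf8 codec refuses to decode them (Python 3 sketch; the failing code above is Python 2)::

```python
# Reproducing the failure: single high bytes from an ISO-8859 locale are
# not valid UTF-8, which is what unicode(string.uppercase, encoding)
# trips over when the locale reports a utf-8 encoding.
raw = b'Algu\xe9m'      # 'Alguém' as ISO-8859-1/15 bytes, not UTF-8
try:
    raw.decode('utf8')
    ok = True
except UnicodeDecodeError:
    ok = False
print(ok)  # False: 0xe9 starts an invalid UTF-8 sequence
```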


From unknown Tue Feb 3 03:56:51 -0800 2004
From: 
Date: Tue, 03 Feb 2004 03:56:51 -0800
Subject: some more info
Message-ID: <[email protected]>

The versions are: ZWiki 0.27, Zope 2.7.0rc1, Python 2.3.3.

The last "File" line in the traceback should be line 70 in the original source, because I had added a line that sets the encoding to "iso8859_15" for testing, but commented it out to reproduce the traceback.
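One possible guard (a sketch only, with a hypothetical helper name - not the fix that shipped) is to try the locale's reported encoding first and fall back to latin-1, in which every byte value is a valid character::

```python
# Hypothetical guard: try the locale's reported encoding, fall back to
# latin-1 when its letter bytes can't actually be decoded with it.
def decode_letters(raw, encoding):
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return raw.decode('latin-1')   # every byte value is valid here

print(decode_letters(b'Algu\xe9m', 'utf8'))             # Alguém (fallback)
print(decode_letters('Alguém'.encode('utf8'), 'utf8'))  # Alguém (direct)
```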

From unknown Sat Feb 28 09:31:49 -0800 2004
From: 
Date: Sat, 28 Feb 2004 09:31:49 -0800
Subject: very interesting
Message-ID: <[email protected]>
In-reply-to: <[email protected]>

Hello!

Sorry for the late answer, but here it goes:

1. I guess the list comprehension version is clearer and better than the reduce version.

2. The setlocale was resetting my locale, because I use Zope's own locale setting rather than relying on the LANG environment variable.

3. I don't know how performance is affected, but in my tests there do not seem to be any absurd performance issues...

4. It is a good idea to utf8-encode the default international characters.

5. In my case, the ZWiki I installed will only be used by folks speaking Portuguese, so I only need to encode the characters in ISO-8859-15, but for a
more i18n ZWiki it would be nice to encode all international characters to UTF-8.
I don't know how much performance it would eat up, however...
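On point 5, and the earlier locale-independence question: in Unicode-aware Python, the case predicates already recognize letters from any script with no locale configured at all (Python 3 sketch; the code base here is Python 2)::

```python
# Unicode case predicates cover all scripts regardless of locale,
# including the Polish and Portuguese examples from this thread.
print('Ś'.isupper())   # True  (Polish S with acute)
print('é'.islower())   # True  (Latin small e with acute)
print('Śpiąca'[0].isupper() and 'Śpiąca'[1:].islower())  # True
```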




Submitted by: 192.168.1.21 at: 2004-01-28T17:57:25+00:00