When you type a word that begins with an uppercase letter and contains a UTF-8 international character, it is wrongly recognized as a WikiWord. For instance:
Alguém
This happened to me with Zwiki 0.27rc1 running inside Plone 2.0 on Zope 2.7.
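To make the symptom concrete, here is a minimal sketch of how a byte-oriented pattern can misread UTF-8 text. It is written for Python 3, and the pattern is a simplified, hypothetical stand-in built from the Latin-1 ranges quoted later in this thread, not Zwiki's actual regexp.

```python
import re

# Simplified, hypothetical byte pattern (not Zwiki's real regexp).
# \xc0-\xdf and \xe0-\xff are the Latin-1 accented upper/lowercase ranges.
UPPER = rb'[A-Z\xc0-\xdf]'
LOWER = rb'[a-z\xe0-\xff]'
latin1_wikiname = re.compile(
    UPPER + rb'+' + LOWER + rb'+' + UPPER + rb'(?:' + UPPER + rb'|' + LOWER + rb')*'
)

# In UTF-8 the character 'é' is the two bytes \xc3 \xa9.
word = 'Alguém'.encode('utf8')
match = latin1_wikiname.match(word)

# The lead byte \xc3 falls inside the Latin-1 uppercase range, so the
# pattern sees Upper + lower + Upper and reports a WikiWord.
print(match.group())
```

The match stops after the first byte of the encoded character, which is consistent with a word being linked when it should not be.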
Since I needed a quick solution, I decided to fix it myself!
So here is my fix to Regexps.py (line 135):
import locale
(lang, encoding) = locale.getlocale()
#locale.setlocale(locale.LC_ALL, '')
if locale.getlocale() != (None,None):
    Ul = [x.encode('utf8') for x in unicode(string.uppercase, encoding)]
    Ll = [x.encode('utf8') for x in unicode(string.lowercase, encoding)]
    U = reduce(lambda x,y:x+'|'+y, Ul)
    L = reduce(lambda x,y:x+'|'+y, Ll)
    wikiname1 = r'(?L)\b(%s)+(%s)+(%s)(%s)*[0-9]*' % (U,L,U,U+'|'+L)
    wikiname2 = r'(?L)\b(%s)(%s)+(%s)(%s)*[0-9]*' % (U,U,L,U+'|'+L)
Remember to set up Zope's locale correctly as well.
I hope this helps!
traldar
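The idea behind the patch above can be sketched in Python 3 as follows: each letter is UTF-8 encoded and the encoded letters are joined into an alternation, so a multi-byte character is always matched as a whole. The letter sets below are hypothetical; the patch takes them from string.uppercase and string.lowercase under the active locale. The `\b` anchor and the `(?L)` flag are left out to keep the sketch small.

```python
import re

# Hypothetical letter sets for illustration only.
UPPER = 'ABCDEFGHIJKLMNOPQRSTUVWXYZÁÉÍÓÚÂÊÔÃÕÀÇ'
LOWER = UPPER.lower()

def alternation(chars):
    """Join the UTF-8 encoding of each letter with '|'."""
    return b'|'.join(c.encode('utf8') for c in chars)

U = alternation(UPPER)
L = alternation(LOWER)

# Same shape as wikiname1 in the patch: Upper+ lower+ Upper (Upper|lower)* digits*
wikiname1 = re.compile(rb'(?:%s)+(?:%s)+(?:%s)(?:%s|%s)*[0-9]*' % (U, L, U, U, L))

def is_wikiname(word):
    return wikiname1.fullmatch(word.encode('utf8')) is not None

print(is_wikiname('Alguém'), is_wikiname('AlguémPage'))
```

Because every alternative consumes a complete encoded character, the pattern cannot stop in the middle of a multi-byte sequence.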
very interesting --SimonMichael, Thu, 29 Jan 2004 00:16:59 -0800
It helps a lot, thanks. I have some questions - answer any you care to!
I simplified this to, e.g.:
U='|'.join([x.encode('utf8') for x in unicode(string.uppercase,encoding)])
Is this OK, or is the reduce required for tricky i18n string handling?
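As a quick check of the question above, this Python 3 sketch compares the two spellings on a small, hypothetical list of encoded letters. Both build the same string, so the choice is one of readability rather than i18n handling.

```python
from functools import reduce

# Hypothetical sample: three letters, each UTF-8 encoded.
parts = [c.encode('utf8') for c in 'ABÉ']

joined = b'|'.join(parts)
reduced = reduce(lambda x, y: x + b'|' + y, parts)

print(joined == reduced)
```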
Why did you comment out the setlocale call? Was it resetting your locale? Is that because you don't have a LANG environment variable? I put it there because the locale docs say to start your app that way to set the locale according to LANG, but maybe Zope does this already.
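For reference, a minimal Python 3 sketch of reading the locale without changing it. The helper name `locale_encoding` is hypothetical. It only queries the current setting, so a locale already configured elsewhere (for example by Zope) is left alone, whereas `locale.setlocale(locale.LC_ALL, '')` would replace it with whatever the environment specifies.

```python
import locale

def locale_encoding():
    """Return the encoding of the current locale, or None when none is set."""
    lang, encoding = locale.getlocale()
    return encoding

encoding = locale_encoding()
if encoding is None:
    # No locale configured: fall back to the built-in character ranges.
    print('no locale set')
else:
    print('locale encoding:', encoding)
```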
Do you have any sense of the performance impact of changing the regexp this way?
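One way to get a rough feel for the cost is to time a character-class pattern against the equivalent alternation. This is only a Python 3 sketch with ASCII letters and a made-up sample text; the numbers depend on the machine and say nothing definitive about Zwiki itself.

```python
import re
import timeit

ASCII_UPPER = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
ASCII_LOWER = ASCII_UPPER.lower()

# Character-class form of the wikiname shape.
class_pattern = re.compile(rb'[A-Z]+[a-z]+[A-Z][A-Za-z]*[0-9]*')

# Alternation form, built the way the patch builds it.
U = b'|'.join(c.encode('utf8') for c in ASCII_UPPER)
L = b'|'.join(c.encode('utf8') for c in ASCII_LOWER)
alt_pattern = re.compile(rb'(?:%s)+(?:%s)+(?:%s)(?:%s|%s)*[0-9]*' % (U, L, U, U, L))

# Made-up sample text, repeated to give the timer something to measure.
text = b'Some plain text with a WikiName and AnotherOne123 in it. ' * 50

def best_time(pattern, number=10, repeat=3):
    """Best wall-clock time for `number` findall passes over the sample."""
    return min(timeit.repeat(lambda: pattern.findall(text), number=number, repeat=repeat))

print('character class:', best_time(class_pattern))
print('alternation:    ', best_time(alt_pattern))
```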
Would it make sense to utf8-encode the default international characters also, in the case where there is no locale? Something like:
# maybe these should be utf8-encoded too
#U='|'.join([x.encode('utf8') for x in unicode('A-Z\xc0-\xdf',encoding)])
#L='|'.join([x.encode('utf8') for x in unicode('a-z\xe0-\xff',encoding)])
#b = '(?<!(U|L|0|1|2|3|4|5|6|7|8|9))' % (U+L)
#wikiname1 = r'%s(%s)+(%s)+(%s)(%s|%s)*[0-9]*' % (b,U,L,U,U,L)
#wikiname2 = r'%s(%s)(%s)+(%s)(%s|%s)*[0-9]*' % (b,U,U,L,U,L)
And a more general one: are we being overly restrictive in recognizing only characters defined in the locale? Should we be able to recognize any and all international characters, more or less, regardless of locale?
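On the locale-independent question, one possible approach (a Python 3 sketch, not what Zwiki does) is to classify each character with the Unicode database and then match the resulting "shape" string with a plain ASCII regexp. The names `shape` and `is_wikiname` are hypothetical, and the sketch assumes precomposed (NFC) text, since a separate combining accent would be classified as "other".

```python
import re
import unicodedata

# Shape alphabet: U = uppercase letter, L = lowercase letter,
# D = ASCII digit, x = anything else.
# The two alternatives mirror wikiname1 and wikiname2 from this thread.
_WIKINAME_SHAPE = re.compile(r'(?:U+L+U[UL]*|UU+L[UL]*)D*')

def shape(word):
    """Map each character of `word` to its shape letter."""
    out = []
    for ch in word:
        category = unicodedata.category(ch)
        if category == 'Lu':
            out.append('U')
        elif category == 'Ll':
            out.append('L')
        elif ch in '0123456789':
            out.append('D')
        else:
            out.append('x')
    return ''.join(out)

def is_wikiname(word):
    return _WIKINAME_SHAPE.fullmatch(shape(word)) is not None

for candidate in ('Alguém', 'AlguémPage', 'ŚpiącaKrólewna'):
    print(candidate, is_wikiname(candidate))
```

This recognizes letters from any script that distinguishes case, at the price of classifying every character instead of relying on the regexp engine alone.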
status --SimonMichael, Thu, 29 Jan 2004 11:16:49 -0800
This server is now running my version of your patch (see http://zwiki.org/zwikidir/Regexps.py ) for testing. I have no locale set on this server though, so what we're seeing here is the no-locale case, which I've tried to update for UTF-8.
It's not yet right though (and getting hard to follow!). Although it recognizes utf-8 Alguém as a whole word, it links it when it shouldn't.
So let's try --Samotnik, Thu, 29 Jan 2004 11:30:25 -0800
Śpiąca Królewna łka, a wróble ćwierkają
AlguémPage
checking in --SimonMichael, Thu, 29 Jan 2004 22:29:56 -0800
The no-locale case seems to be working now also, recognizing western European characters in links and not linking things it shouldn't. The regexps just got more complicated, but I haven't noticed any blatant speed change. More testing is needed. I think this needs to go into 0.27.
property change --SimonMichael, Thu, 29 Jan 2004 22:37:31 -0800
Status: open => closed
Problem if locale encoding is utf-8 --Tue, 03 Feb 2004 03:35:46 -0800
I get a broken Zwiki product if the locale is set to de_DE.UTF-8 (on FreeBSD with the utf8locale port). The traceback is:
Import Traceback
Traceback (most recent call last):
  File "/usr/local/www/z27c1/lib/python/OFS/Application.py", line 654, in import_product
    product=__import__(pname, global_dict, global_dict, silly)
  File "/usr/local/www/z27c1inst/Products/ZWiki/__init__.py", line 10, in ?
    import ZWikiPage, ZWikiWeb, Permissions, Defaults
  File "/usr/local/www/z27c1inst/Products/ZWiki/ZWikiPage.py", line 63, in ?
    from Regexps import url, bracketedexpr, doublebracketedexpr, \
  File "/usr/local/www/z27c1inst/Products/ZWiki/Regexps.py", line 71, in ?
    U='|'.join([x.encode('utf8') for x in unicode(string.uppercase,encoding)])
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 26-27: invalid data
some more info --Tue, 03 Feb 2004 03:56:51 -0800
The versions are: Zwiki 0.27, Zope 2.7.0rc1, Python 2.3.3.
The last "File" line in the traceback should be line 70 in the original source, because I added a line to set the encoding to "iso8859_15" for testing but commented it out to get the traceback back.
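The error above is what you get when bytes that are not valid UTF-8 are decoded as UTF-8. The Python 3 sketch below reproduces that situation and shows one possible guard. The byte string is a hypothetical stand-in for string.uppercase (the real contents on that FreeBSD system are not known from the report), and the helper `decode_letters` with its "iso8859_15" fallback is an assumption borrowed from the test mentioned above, not part of Zwiki.

```python
def decode_letters(raw, encoding, fallback='iso8859_15'):
    """Decode `raw` with `encoding`; if that fails, retry with `fallback`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return raw.decode(fallback)

# Hypothetical letter bytes: A-Z followed by two single-byte accented
# capitals, which are not valid UTF-8 on their own.
letters = b'ABCDEFGHIJKLMNOPQRSTUVWXYZ\xc0\xc1'

try:
    letters.decode('utf8')
except UnicodeDecodeError as error:
    print('plain decode fails at byte', error.start)

print(decode_letters(letters, 'utf8'))
```

A fallback like this hides the mismatch between the reported locale encoding and the actual bytes, so it is a workaround rather than a fix for the underlying locale setup.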
very interesting --Sat, 28 Feb 2004 09:31:49 -0800
Hello!
Sorry for the late answer, but here it goes:
1. I guess the list comprehension is more comprehensible, and better than the reduce version.
2. The setlocale call was resetting my locale, because I use Zope's own locale setting and do not rely on the LANG environment variable.
3. I don't know how performance would be impacted, but in my tests there do not seem to be any absurd performance issues.
4. It is a good idea to utf8-encode the default international characters.
5. In my case, the Zwiki I installed will only be used by folks speaking Portuguese, so I only need to encode the characters in ISO-8859-15, but for a more i18n Zwiki it would be nice to encode all i18n characters to utf8. I don't know how much performance it would cost, however.