Submitted by : hideo at yokohama at: 2006-10-25T09:05:09+00:00 (11 years ago)

The summary() method in Utils.py produces the Description(), which Plone sites use for search result listings.

Under certain conditions, the 0.57 implementation chops the wiki content (which is utf-8 encoded) at an arbitrary byte position. This truncated byte sequence is later concatenated with other utf-8 strings, which results in an invalid utf-8 sequence.
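The failure is easy to reproduce outside Zwiki. Below is a minimal sketch in Python 3, where `bytes` plays the role that the utf-8 `str` played in the Python 2 of the time. The sample text, the cut position and the helper name `is_valid_utf8` are made up for illustration:

```python
# Hypothetical sample: five Japanese characters, 3 bytes each in utf-8.
content = "こんにちは".encode("utf-8")   # 15 bytes

# Cutting at byte 4 lands in the middle of the second character.
chopped = content[:4]


def is_valid_utf8(data):
    """Return True if data decodes cleanly as utf-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Concatenating the chopped bytes with more utf-8 (here an ellipsis)
# yields a byte string that is no longer valid utf-8.
summary = chopped + "...".encode("utf-8")
print(is_valid_utf8(content))   # True
print(is_valid_utf8(summary))   # False
```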

One way to fix this is to store the content as Unicode instead of utf-8. However, that would require a data conversion when migrating existing sites.

The quick and dirty fix I chose is to convert the whole content to a unicode string and do the truncation in unicode. This conversion happens every time summary() is called, so it's not optimal in terms of CPU time.
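The approach can be sketched roughly as follows. This is not the actual patch to Utils.py. The function name `summarize`, the `size` parameter and the trailing ellipsis are assumptions for illustration, and the code is written for Python 3:

```python
def summarize(content, size=200, ellipsis="..."):
    """Truncate utf-8 encoded bytes to at most `size` characters.

    Hypothetical helper, not Zwiki's summary(): decode everything,
    cut on character boundaries, then encode back to utf-8.
    """
    text = content.decode("utf-8")      # whole content is decoded on each call
    if len(text) <= size:
        return content
    return (text[:size] + ellipsis).encode("utf-8")


sample = ("日本語のウィキ" * 10).encode("utf-8")   # 70 characters, 210 bytes
short = summarize(sample, size=4)
print(short.decode("utf-8"))   # 日本語の...
```

Because the cut is made on the decoded string, `size` counts characters rather than bytes, and the result always re-encodes to valid utf-8.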

http://my.opera.com/hideo_at_yokohama/blog/2006/10/25/summary-chops-utf-8-string-at-non-boun-2

will submit to repository --betabug, Wed, 14 Feb 2007 01:31:44 -0800

Thank you!

I've been bitten by the same thing, so I think I will submit your patch to the darcs repository. It might not be ideal, but I didn't come up with another quick fix, so for the moment I'll submit your code - giving credit of course. Still thinking about some optimization though...
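One possible optimization, offered only as a sketch and not something this thread settled on: a utf-8 character occupies at most 4 bytes, so decoding just the first `4 * size` bytes is enough to obtain `size` characters. The name `summarize_prefix` is hypothetical, and the code (Python 3) assumes the stored content is valid utf-8:

```python
def summarize_prefix(content, size=200):
    """Return the first `size` characters of utf-8 bytes, re-encoded.

    Hypothetical helper: only a bounded prefix is decoded, so the cost
    no longer grows with the length of the page.
    """
    prefix = content[:4 * size]
    # errors="ignore" drops a partial character left at the end of the slice.
    text = prefix.decode("utf-8", errors="ignore")
    return text[:size].encode("utf-8")


print(summarize_prefix("こんにちは".encode("utf-8"), size=2).decode("utf-8"))   # こん
```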

should be in repo now --betabug, Wed, 14 Feb 2007 02:11:26 -0800

Status: open => closed

I also did a quick test on performance... summary() is a tiny bit slower now, but I think it's still acceptable. The patch not only avoids illegal utf-8 in summaries, it also makes the summary length correct for non-ascii content. Closing this, and thanks again!