Where do these UnicodeDecodeErrors come from?

While a page template is rendered, its evaluated parts are collected in the "buflist" of a StringIO object. A call to StringIO.getvalue joins this "buflist", which can raise a UnicodeDecodeError:

Products/PageTemplates/PageTemplate.py:

  def pt_render(self, source=0, extra_context={}):
      """Render this Page Template"""

      [...]

      output = StringIO()

      TALInterpreter(self..., output)

      return output.getvalue()     # can raise UnicodeDecodeError

This happens because StringIO is, schematically:

class StringIO:

  def __init__(self, buf = ''):
      self.buf = buf
      self.buflist = []

  def write(self, s):
      self.buflist.append(s)

  def getvalue(self):
      """
      The StringIO object can accept either Unicode or 8-bit strings,
      but mixing the two may take some care. If both are used, 8-bit
      strings that cannot be interpreted as 7-bit ASCII (that use the
      8th bit) will cause a UnicodeError to be raised when getvalue()
      is called.
      """
      if self.buflist:
          self.buf += ''.join(self.buflist)
          self.buflist = []
      return self.buf

As the docstring says, mixing string objects and unicode objects can cause errors. Rendering a page template without a UnicodeDecodeError therefore boils down to successfully joining the list of its (evaluated) parts.

In the following, "string" means a Python string object (<type 'str'>), in contrast to a "unicode object" (<type 'unicode'>).

Some examples

We assume the default encoding is set to 'ascii':

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

We can mix strings with unicode objects if the strings are 7-bit ASCII:

>>> ''.join(['abc', u'def', 'xxx'])
u'abcdefxxx'

The unicode parts can contain special characters (\xe4: a with diaeresis):

>>> ''.join(['abc', u'\xe4'])
u'abc\xe4'

But not the other way round: an 8-bit string containing a non-ASCII byte cannot be joined with a unicode object:

>>> ''.join(['\xe4', u'abc'])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

However, strings with special characters can be joined as long as there is no unicode object in the list (the result is then a string):

>>> ''.join(['\xe4', 'abc'])
'\xe4abc'

It's even possible to join a latin-1 encoded text with a UTF-8 encoded text:

>>> c = u'\u00e4'  # LATIN SMALL LETTER A WITH DIAERESIS

>>> c.encode('latin-1')
'\xe4'
>>> c.encode('utf-8')
'\xc3\xa4'

>>> ''.join(['\xe4', '\xc3\xa4'])
'\xe4\xc3\xa4'

However, the resulting string doesn't make sense: it mixes two encodings, so no browser will be able to display the corresponding text correctly.
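To make this concrete, here is a small sketch of what happens when someone later tries to decode the mixed string. The variable names are only illustrative, and the b'' and u'' literals are used so the snippet behaves the same on Python 2 and Python 3.

```python
# Build the same mixed byte string as in the session above.
latin1_part = u'\u00e4'.encode('latin-1')   # one byte: 0xe4
utf8_part = u'\u00e4'.encode('utf-8')       # two bytes: 0xc3 0xa4
mixed = b''.join([latin1_part, utf8_part])

# Decoding as UTF-8 fails: 0xe4 announces a three-byte sequence,
# but the next byte (0xc3) is not a continuation byte.
try:
    mixed.decode('utf-8')
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Decoding as latin-1 never fails, but the UTF-8 part turns into
# two unrelated characters instead of a single a with diaeresis.
as_latin1 = mixed.decode('latin-1')
```

Neither decoding recovers the two intended characters, which is why the join "works" but the result is useless.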

Summary

If one part of the page template evaluates to a unicode object, then the other parts have to be either unicode objects or 7-bit ASCII strings. More precisely, the strings must be decodable (translatable into unicode objects) with the default encoding.

If one of the strings can't be decoded, a UnicodeDecodeError is raised.
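The rule can be written down as a small function. This is only a sketch that emulates the implicit coercion described above, not the code Python itself runs; the name join_like_python2 and its default_encoding parameter are made up for illustration.

```python
def join_like_python2(parts, default_encoding='ascii'):
    """Emulate ''.join(parts) for a mix of 8-bit strings and unicode objects."""
    text_type = type(u'')
    if not any(isinstance(p, text_type) for p in parts):
        # Only 8-bit strings: nothing is decoded, any bytes are accepted.
        return b''.join(parts)
    # At least one unicode object: every 8-bit string is decoded with the
    # default encoding, which raises UnicodeDecodeError for non-ASCII bytes.
    return u''.join(
        p if isinstance(p, text_type) else p.decode(default_encoding)
        for p in parts
    )
```

Called with [b'abc', u'def', b'xxx'] it returns a unicode object, with [b'\xe4', b'abc'] it returns an 8-bit string, and with [b'\xe4', u'abc'] it raises UnicodeDecodeError, matching the three cases shown in the examples.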

But how does this help me with those errors?

Unfortunately, the plain Zope tracebacks do not give many hints: they only say UnicodeDecodeError. OK, we now know that unicode objects and 8-bit strings must have been mixed somewhere, but which parts are they?

Patching Zope

You can use the following patch to decorate the Zope tracebacks. First, copy the file StringIO.py from the Python distribution to your SOFTWARE_HOME:

cp PYTHON_FOR_ZOPE_LIB/StringIO.py ZOPE_HOME/lib/python

Then apply the following patch (tested on Zope 2.7, 2.8 and 2.9):

--- /usr/local/lib/python2.3/StringIO.py      2004-06-05 18:27:07.000000000 +0200
+++ Zope-2.7.2devel/lib/python/StringIO.py    2007-05-09 22:16:37.000000000 +0200
@@ -200,10 +200,35 @@
         is called.
         """
         if self.buflist:
-            self.buf += ''.join(self.buflist)
+            try:
+                self.buf += ''.join(self.buflist)
+            except UnicodeDecodeError:
+                # decorate zope traceback with reason for unicode error
+                __traceback_info__ = self.pt_parts()
+                raise
             self.buflist = []
         return self.buf

+    def pt_parts(self):
+        sl = ['unicode and 8-bit string parts of above page template']
+        for x in self.buflist:
+            if type(x) == type(''):
+                maxcode = 0
+                for c in x:
+                    maxcode = max(ord(c), maxcode)
+            # show only unicode objects and non-ascii strings
+            if type(x) == type('') and maxcode > 127:
+                t = '****NonAsciiStr: '
+            elif type(x) == type(u''):
+                t = '*****UnicodeStr: '
+            else:
+                t = None
+            if t:
+                sl.append(t + repr(x))
+        s = '\n'.join(sl)
+        return s
+
+

 # A little test suite

(OK, that should be a monkey patch, but I did not have enough time to figure it out...)
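One possible shape for such a monkey patch is sketched below. It has not been tested against a Zope installation. It assumes Python 2's pure-Python StringIO module, whose StringIO class keeps pending writes in buflist as shown above; on an interpreter without that module the patch is simply skipped. The names describe_parts and _getvalue_with_info are hypothetical.

```python
try:
    import StringIO as stringio_module   # Python 2 only
except ImportError:
    stringio_module = None               # no such module: nothing to patch


def describe_parts(buflist):
    """Report the unicode objects and non-ASCII 8-bit strings in buflist."""
    lines = ['unicode and 8-bit string parts of above page template']
    for part in buflist:
        if isinstance(part, type(u'')):
            lines.append('*****UnicodeStr: ' + repr(part))
        elif isinstance(part, bytes) and any(b > 127 for b in bytearray(part)):
            lines.append('****NonAsciiStr: ' + repr(part))
        # plain ASCII strings are harmless and therefore not listed
    return '\n'.join(lines)


if stringio_module is not None:
    _original_getvalue = stringio_module.StringIO.getvalue

    def _getvalue_with_info(self):
        try:
            return _original_getvalue(self)
        except UnicodeDecodeError:
            # Zope's traceback formatter shows this local variable.
            __traceback_info__ = describe_parts(self.buflist)
            raise

    stringio_module.StringIO.getvalue = _getvalue_with_info
```

The module containing this code has to be imported early, for example from a product's __init__.py; that placement is an assumption, not something verified here. The advantage over the copied-and-patched file is that the Python distribution's StringIO.py stays untouched.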

Example

After upgrading ZWiki to 0.59darcs (on Zope-2.7.2), I got a lot of UnicodeDecodeErrors. To find out the reason for this, I applied the above patch and got the following traceback:

Site Error

An error was encountered while publishing this resource.

UnicodeDecodeError
Sorry, a site error occurred.

Traceback (innermost last):

  * Module ZPublisher.Publish, line 180, in publish_module_standard
  * Module ZPublisher.Publish, line 131, in publish
  * Module Zope.App.startup, line 204, in zpublisher_exception_hook
  * Module ZPublisher.Publish, line 101, in publish
  * Module ZPublisher.mapply, line 88, in mapply
  * Module ZPublisher.Publish, line 39, in call_object
  * Module Products.ZWiki.ZWikiPage, line 256, in __call__
  * Module Products.ZWiki.ZWikiPage, line 269, in render
  * Module Products.ZWiki.pagetypes.rst, line 69, in render
  * Module Products.ZWiki.Views, line 712, in addSkinTo
  * Module Shared.DC.Scripts.Bindings, line 306, in __call__
  * Module Shared.DC.Scripts.Bindings, line 343, in _bindAndExec
  * Module Products.PageTemplates.ZopePageTemplate, line 222, in _exec
  * Module Products.PageTemplates.PageTemplate, line 97, in pt_render
    <ZopePageTemplate at /wikis/wikipage used for /wikis/BachIntern/BlitzGespraechRanking>
  * Module StringIO, line 204, in getvalue
    __traceback_info__: unicode and 8-bit string parts of above page template

    ****NonAsciiStr: '<meta name="description"\n content="Hier ein Ranking der Blitzgespr\xc3\xa4che (Lightning Talks) gerankt nach Originalit\xc3\xa4t respektive technischer Abgehobenheit. Erstellt von der Rating Agentur MF. Preisverleihung ist im Rahmen der ..." />'

    *****UnicodeStr: u'<small><ul class="outline expandable">\n <li><a href="https://bach.wu-wien.ac.at/11080/wikis/BachIntern/FrontPage" name="FrontPage">FrontPage</a>\n<ul class="outline expandable">\n <li><h1 style="display:inline;"><a href="https://bach.wu-wien.ac.at/11080/wikis/BachIntern/BlitzGespraechRanking/backlinks" title="which pages link to this one ?" name="BlitzGespraechRanking">BlitzGespraechRanking</a></h1></li>\n</ul>\n </li>\n</ul>\n</small>'

    *****UnicodeStr: u'last edited <a href="https://bach.wu-wien.ac.at/11080/wikis/BachIntern/BlitzGespraechRanking/history" title="show last edit" >1 week</a> ago by <b>wlang</b>'

    *****UnicodeStr: u'<div class="document">\n<p>Hier ein Ranking der Blitzgespr\xe4che (Lightning Talks) gerankt nach Originalit\xe4t respektive technischer Abgehobenheit. Erstellt von der Rating Agentur MF. Preisverleihung ist im Rahmen der Weihnachtsfeier.</p>\n<p>== 2007 ==</p>\n<p>1. Willis Talk zum Thema BACH-Wikis, 2. Feb 2007\n\n</p>\n</div>\n'

UnicodeDecodeError

Now it's easy to see: the main document is a unicode object (as are some other parts). But the "description" meta tag is an 8-bit string containing non-ASCII bytes. Mixing the two gives the error.

Proposed solution: we have to convert the meta tag part to unicode!
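A minimal sketch of that conversion follows. It assumes the meta tag text is UTF-8 encoded, which is what the \xc3\xa4 byte pairs in the traceback suggest; the helper name to_unicode and the shortened sample strings are made up for illustration and are not ZWiki code.

```python
def to_unicode(part, encoding='utf-8'):
    """Return part as a unicode object, decoding 8-bit strings explicitly."""
    if isinstance(part, bytes):
        return part.decode(encoding)
    return part   # already a unicode object


# Shortened stand-ins for the two kinds of parts seen in the traceback.
description = b'Hier ein Ranking der Blitzgespr\xc3\xa4che'    # 8-bit string
body = u'<p>Hier ein Ranking der Blitzgespr\xe4che</p>'        # unicode object

# With every part converted first, the join no longer needs the
# implicit ASCII decoding and therefore cannot fail on it.
page = u''.join(to_unicode(p) for p in [description, body])
```

Decoding with an explicit encoding at the point where the 8-bit string is produced is usually more robust than changing the default encoding, because it keeps the assumption about the encoding next to the data it applies to.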

thanx! great stuff! --betabug, Thu, 10 May 2007 02:33:48 -0700

There is a product out there that attempts to show the source of Unicode errors, but it's not as good as this. IIRC it just leaves an indicator for the error in the page (replacing the "faulty" content) and displays it, leaving you to figure out where it came from.
