Submitted by : simon at: 2007-05-07T11:06:55-07:00 (13 years ago)

We've pretty much agreed that it would be better to store content as unicode strings internally, instead of as utf8-encoded strings as at present. This should make it easier to support alternate encodings, if that is worthwhile, and might result in fewer unicode- and encoding-related bugs. I'll be working on this this week.

My notes:

** links
 - DarcsRepos:
** clarify text input/output, implement consistent handling
*** strategy:
    - all incoming text is decoded to unicode
    - all internal text handling is in unicode
    - all outgoing text is encoded
    - only utf-8 encoding exists for now
    - all templates use unicode (what about zope < 2.10 ? dtml methods ?)
    - where is the boundary between external/internal text ?
    - which are external methods, which should expect and return encoded text ?
    - which are internal methods, which should expect and return unicode ?
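The strategy above could be sketched roughly as follows. This is a minimal illustration, not Zwiki code: the helpers `to_unicode` and `to_encoded` and the toy `edit` function are hypothetical names. It also uses Python 3 naming, where `str` is the unicode type and `bytes` is encoded text (in the Python 2 of these notes those would be `unicode` and `str`).

```python
# Hypothetical helpers, not part of Zwiki: decode at the boundary on the
# way in, work in unicode internally, encode at the boundary on the way out.
ENCODING = "utf-8"  # only utf-8 encoding exists for now


def to_unicode(text, encoding=ENCODING):
    """Decode incoming text; text that is already unicode passes through."""
    if isinstance(text, bytes):
        return text.decode(encoding)
    return text


def to_encoded(text, encoding=ENCODING):
    """Encode outgoing text; text that is already encoded passes through."""
    if isinstance(text, str):
        return text.encode(encoding)
    return text


def edit(stored, incoming):
    """Toy 'external' method: accepts encoded or unicode, stores unicode."""
    return stored + to_unicode(incoming)


page = edit("", "caf\u00e9 ".encode("utf-8"))  # encoded input, e.g. from a form
page = edit(page, "na\u00efve")                 # input that is already unicode
response_body = to_encoded(page)                # encoded once, on the way out
```

Making both helpers tolerant of text that is already in the target form keeps the boundary methods safe to call from either side, which may help while the external/internal split is still being worked out.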
*** categorize text handling methods
**** text enters zwiki via: (these should convert to unicode)
***** create(page(,title,pagename),text,log)
***** edit(page(,title),text,log,subjectSuffix)
***** comment(text,username,note,subject_heading)
***** rename(pagename)
***** reparent(parents,pagename)
***** usernameFrom(REQUEST)
***** manage_edit(data,title)
***** PUT
***** setText, cleanupText
**** text leaves zwiki via: (these should convert to encoded)
***** TODO ? text(,read,src)
***** rendering: all the view methods and templates, eg in
***** formatters: stx, rst, moin, wwml, latexwiki, mathaction, etc.
***** zodb: commit (does not need encoding), get_transaction().note (does)
***** catalog: index_object and the metadata methods it calls
***** logging: BLATHER, etc.
***** summary, renderedSummary, excerptAt
***** SearchableText, Description, getPageTitle
***** manage_FTPget, manage_main
***** and some ascii-only methods:
****** pageId, pageUrl, wikiUrl, getPath
**** everything else is internal and should handle unicode, including:
***** handleEditText, handleRename
***** index_object
***** cook, markLinksIn, renderLinksIn
***** render, renderText, format
***** replaceLinks
***** setCreator, setLastEditor, lastLog, setLastLog
***** addedText, lasttext, diff, textDiff
***** sendMailTo
***** subscribe
***** discussionPart, documentPart, comments, mailbox
***** fixEncoding
***** pages_rss, changes_rss, title_quote
***** cleanupBody
*** implement text conversions
*** categorize string properties/attributes
**** old
***** ascii or encoded
****** title
****** creator
****** creation_time
****** last_editor
****** last_edit_time
****** last_log
****** parents
****** subscriber_list
**** new
***** ascii
***** encoded
***** unicode
****** title
****** creator
****** creation_time
****** last_editor
****** last_edit_time
****** last_log
****** parents
****** subscriber_list
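
Moving the attributes listed above from encoded to unicode implies a one-time conversion of existing pages. A minimal sketch of such a conversion, under assumptions: `FakePage` and `upgrade_attributes` are hypothetical names, not Zwiki's page class or its actual upgrade code, and Python 3 names are used (`bytes` for encoded text, `str` for unicode).

```python
# Hypothetical sketch: FakePage and upgrade_attributes are illustrative
# names, not Zwiki's classes or its actual upgrade code.
TEXT_ATTRIBUTES = ("title", "creator", "creation_time",
                   "last_editor", "last_edit_time", "last_log")
LIST_ATTRIBUTES = ("parents", "subscriber_list")


def decode(value, encoding="utf-8"):
    """Return unicode for encoded input; leave unicode untouched."""
    return value.decode(encoding) if isinstance(value, bytes) else value


def upgrade_attributes(page):
    """Convert a page's encoded string attributes to unicode in place."""
    for name in TEXT_ATTRIBUTES:
        if hasattr(page, name):
            setattr(page, name, decode(getattr(page, name)))
    for name in LIST_ATTRIBUTES:
        if hasattr(page, name):
            setattr(page, name, [decode(v) for v in getattr(page, name)])


class FakePage:
    """Stand-in for a page object carrying old-style encoded attributes."""
    title = "\u00c9cole".encode("utf-8")
    creator = b"simon"
    parents = ["Fran\u00e7ais".encode("utf-8")]


page = FakePage()
upgrade_attributes(page)
upgrade_attributes(page)  # running it again changes nothing
```

Because `decode` leaves unicode values alone, the conversion can be re-run on pages that were already upgraded.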

** test, work through the bugs
*** DONE switch to unicode pageName()
*** DONE rename
*** DONE renderNesting
*** DONE renderMidsectionIn, rst page type renders as utf8
*** DONE fixEncoding should clear utf8 prerendered also
*** DONE comment, Message can't hold unicode
*** DONE makeCommentHeading, can't urlquote unicode
*** DONE preRenderMessage, convert from message to unicode
*** DONE zodb transaction note can't be unicode
*** DONE can't index unicode text with default catalog
**** DONE replace text index with encoded SearchableText
**** remove or skip text index to avoid warnings after upgrade ?
*** DONE deleting page under a non-ascii page (FrontPage)
*** DONE renaming page
*** DONE make editform work
**** make properties unicode
***** make accessors for last_editor, last_editor_ip
****** make templates use accessors
*** DONE fix gross test errors
*** DONE fix other test failures
*** TODO misc
**** can't index unicode parents with default catalog
**** non-ascii search not working
***** try zctextindex
**** convert regexps ?
**** convert moin regexps ?
*** TODO test on content
**** TODO import error
2007-08-25 02:26:05 ERROR Zope.SiteErrorLog
Traceback (innermost last):
  Module ZPublisher.Publish, line 119, in publish
  Module ZPublisher.mapply, line 88, in mapply
  Module ZPublisher.Publish, line 42, in call_object
  Module OFS.ObjectManager, line 609, in manage_importObject
  Module OFS.ObjectManager, line 631, in _importObjectFromFile
  Module OFS.ObjectManager, line 347, in _setObject
  Module zope.event, line 23, in notify
  Module zope.component.event, line 26, in dispatch
  Module zope.component._api, line 130, in subscribers
  Module zope.component.registry, line 294, in subscribers
  Module zope.interface.adapter, line 535, in subscribers
  Module zope.component.event, line 33, in objectEventNotify
  Module zope.component._api, line 130, in subscribers
  Module zope.component.registry, line 294, in subscribers
  Module zope.interface.adapter, line 535, in subscribers
  Module OFS.subscribers, line 121, in dispatchObjectMovedEvent
  Module, line 182, in dispatchToSublocations
  Module zope.component._api, line 130, in subscribers
  Module zope.component.registry, line 294, in subscribers
  Module zope.interface.adapter, line 535, in subscribers
  Module OFS.subscribers, line 118, in dispatchObjectMovedEvent
  Module OFS.subscribers, line 151, in callManageAfterAdd
  Module Products.ZWiki, line 197, in manage_afterAdd
  Module Products.ZWiki.Outline, line 98, in add
  Module Products.ZWiki.Outline, line 84, in update
  Module Products.ZWiki.Outline, line 65, in updateChildmap
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 0: ordinal not in range(128)
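
The final line of this traceback is the error Python 2 raises when an encoded byte string is implicitly combined with unicode: the bytes get decoded with the default ascii codec. A small sketch of the same failure follows. It uses an explicit `decode`, since Python 3 no longer decodes implicitly; the byte string is a made-up page name, and whether `updateChildmap` hits exactly this case is an assumption.

```python
# Made-up encoded page name whose first byte is 0xed.
encoded_name = b"\xed\x95\x9c"

try:
    # Python 2 did this implicitly when mixing encoded and unicode strings.
    encoded_name.decode("ascii")
    message = None
except UnicodeDecodeError as error:
    message = str(error)

# Decoding explicitly with the real encoding avoids the error.
name = encoded_name.decode("utf-8")
```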

**** tests
***** viewing
****** basic pages
****** pages with non-ascii content
****** pages with non-ascii names
***** recent changes
***** contents
***** upgradeAll
***** searching for
****** basic pages
****** pages with non-ascii content
****** pages with non-ascii names
**** TODO copy, test on zope1
**** TODO get zexp, test locally

unicode notes, comments welcome --simon, Tue, 08 May 2007 19:35:42 -0700

update --simon, Mon, 14 May 2007 12:13:30 -0700

I have been fixing the various issues that came up from switching to unicode storage. Most things seem to be working pretty well. Here's one problem I'm not sure how to deal with: pages containing non-ascii text are not indexed by the default catalog:

2007-05-12 02:25:36 BLATHER ZWiki failed to index X_c3_89
Traceback (most recent call last):
 File "/zope1/Products/ZWiki/", line 110, in index_object
 File "/zope-2.10/lib/python/Products/ZCatalog/", line 535, in catalog_object
 File "/zope-2.10/lib/python/Products/ZCatalog/", line 360, in catalogObject
   blah = x.index_object(index, object, threshold)
 File "/zope-2.10/lib/python/Products/PluginIndexes/TextIndex/", line 314, in index_object
   for word in list(splitter(source,encoding=encoding)):
 File "/zope-2.10/lib/python/Products/PluginIndexes/TextIndex/", line 167, in Splitter
   return self.SplitterFunc(astring, words)

UnicodeEncodeError: 'ascii' codec can't encode character u'\xc9' in position 4: ordinal not in range(128)

text() now returns unicode, and when the catalog tries to index it, TextIndex/ calls a default splitter function, which fails. I think it would split unicode if it passed an encoding argument, but it won't do this unless a special lexicon or vocabulary is configured, which I don't think is even possible with a default zope install.
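
The failure can be reproduced outside Zope, and it points at the workaround listed as done in the notes above ("replace text index with encoded SearchableText"): give the index encoded text rather than unicode. A minimal sketch, assuming Python 3 names (`str` for unicode, `bytes` for encoded text); `searchable_text` is a hypothetical stand-in, not Zwiki's `SearchableText`.

```python
def searchable_text(title, text, encoding="utf-8"):
    """Hypothetical stand-in: return title and text as one encoded string."""
    return (title + " " + text).encode(encoding)


page_text = "caf\u00c9 menu"  # unicode text containing u'\xc9'

# Code that assumes ascii fails on the unicode text...
try:
    page_text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# ...while utf-8 encoded text can be handed to a byte-oriented index.
indexable = searchable_text("Menu", page_text)
```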

Right now the alternatives to enable indexing of non-ascii text seem to be:

unicode support of textindex? --wlang, Mon, 14 May 2007 16:01:45 -0700

No solution, but looking at the source of TextIndex, it seems that it supports Unicode splitting (selectable via the useSplitter argument):

class Lexicon(Persistent, Implicit):
    def __init__(self, stop_syn=None, useSplitter=None, extra=None):
        # default to the first registered splitter unless one is requested
        self.useSplitter = Splitter.splitterNames[0]
        if useSplitter: self.useSplitter = useSplitter
        self.splitterParams = extra
        self.SplitterFunc = Splitter.getSplitter(self.useSplitter)


availableSplitters = (
    ("ZopeSplitter", "Zope Default Splitter"),
    ("ISO_8859_1_Splitter", "Werner Strobls ISO-8859-1 Splitter"),
    ("UnicodeSplitter", "Unicode-aware splitter"),
)

Another possibility could be to use ZCTextIndex. It has been in the Zope distribution since at least 2.7 (I haven't checked previous versions).
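
To illustrate what a unicode-aware splitter has to do differently from an ascii one, here is a minimal sketch. It is not the `UnicodeSplitter` from TextIndex, whose implementation is not shown in this thread; `split_words` is a hypothetical function built on the standard `re` module.

```python
import re

# For str patterns in Python 3, \w matches unicode letters and digits
# (Python 2 needed the re.UNICODE flag for the same behaviour).
WORD = re.compile(r"\w+")


def split_words(text):
    """Split unicode text into lowercased words, keeping non-ascii letters."""
    return [word.lower() for word in WORD.findall(text)]


words = split_words("\u00c9cole fran\u00e7aise, na\u00efve caf\u00e9!")
```

An ascii-only word pattern such as `[A-Za-z0-9]+` would break these words apart at every accented letter, which is one way non-ascii searches can silently fail.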

notes updated --simon, Mon, 11 Jun 2007 10:58:17 -0700

Still in progress.

next steps ? --simon, Thu, 20 Sep 2007 10:33:08 -0700

Time to restart this. Current status of the ZWiki-unicode branch: tests pass and it seems to mostly work, though there are unresolved issues with cataloging, and I want to test it safely against this wiki's content. I have some options:

It would also be useful if anyone else could test ZWiki-unicode. It is current with the main Zwiki repo.