Submitted by : simon at: 2008-04-30T09:34:19-07:00 (9 years ago)

Since the upgrade to Zwiki-unstable, a minority of pages (those with non-ascii content) are not being indexed, which affects recent changes, backlinks, search etc. For example, IE6JP日本語:

2008-04-30 09:29:28 BLATHER ZWiki ('failed to index', 'IE6JP_65e5_672c_8a9e', '\n',
'Traceback (most recent call last):
 File "/zope2/Products/ZWiki/Catalog.py", line 112, in index_object
   self.catalog().catalog_object(self,self.url(),idxs)
 File "/zope-2.10.5/lib/python/Products/ZCatalog/ZCatalog.py", line 535, in catalog_object
   update_metadata=update_metadata)
 File "/zope-2.10.5/lib/python/Products/ZCatalog/Catalog.py", line 360, in catalogObject
   blah = x.index_object(index, object, threshold)
 File "/zope-2.10.5/lib/python/Products/PluginIndexes/TextIndex/TextIndex.py", line 314, in index_object
   for word in list(splitter(source,encoding=encoding)):
 File "/zope-2.10.5/lib/python/Products/PluginIndexes/TextIndex/Lexicon.py", line 167, in Splitter
   return self.SplitterFunc(astring, words)
 UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-7: ordinal not in range(128)

')
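
The error message itself can be reproduced outside Zope. This is only a minimal sketch, assuming the splitter ends up encoding the page's unicode text with the default ascii codec. It uses Python 3 syntax, where the encode has to be spelled out; on Python 2 the same conversion happens implicitly when unicode text reaches code that expects a byte string. `page_name` is just the example page name from this report.

```python
# The example page name: five ASCII characters followed by three
# Japanese characters (IE6JP日本語).
page_name = "IE6JP\u65e5\u672c\u8a9e"

try:
    # Explicit here; implicit on Python 2 when unicode meets str-only code.
    page_name.encode("ascii")
    error = None
except UnicodeEncodeError as e:
    error = e

# The failing range covers the three non-ASCII characters, which is where
# the "position 5-7" in the log above comes from.
print(error)
```

The three characters at positions 5 to 7 are exactly the ones that also show up hex-encoded in the logged page id.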

Also, recent changes is not showing the correct last editor for recent edits (all pages).

update --simon, Wed, 30 Apr 2008 13:00:06 -0700

The last editor problem is a breakage from the unicode changes. It has been fixed on zwiki.org and committed to unstable, though it raises the larger question of whether to replace simple field names with the modern field accessors in our catalog metadata.

ZCTextIndex also breaks with unicode content:

Traceback (most recent call last):
File "/zope2/Products/ZWiki/Catalog.py", line 112, in index_object
  self.catalog().catalog_object(self,self.url(),idxs)
File "/zope-2.10.5/lib/python/Products/ZCatalog/ZCatalog.py", line 535, in catalog_object
  update_metadata=update_metadata)
File "/zope-2.10.5/lib/python/Products/ZCatalog/Catalog.py", line 360, in catalogObject
  blah = x.index_object(index, object, threshold)
File "/zope-2.10.5/lib/python/Products/PluginIndexes/common/UnIndex.py", line 235, in index_object
  res += self._index_object(documentId, obj, threshold, attr)
File "/zope-2.10.5/lib/python/Products/PluginIndexes/common/UnIndex.py", line 262, in _index_object
  self.insertForwardIndexEntry(datum, documentId)
File "/zope-2.10.5/lib/python/Products/PluginIndexes/common/UnIndex.py", line 207, in insertForwardIndexEntry
  indexRow = self._index.get(entry, _marker)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 5: ordinal not in range(128)
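
This decode error looks like the mirror image of the encode error: a UTF-8 byte string being decoded as ascii, presumably when a byte-string key is compared with unicode keys during the index lookup. A minimal sketch under that assumption, again in Python 3 syntax with the decode written out explicitly (`page_name` and `utf8_bytes` are illustrative names, not Zwiki code):

```python
# The example page name (IE6JP日本語) as UTF-8 bytes.
page_name = "IE6JP\u65e5\u672c\u8a9e"
utf8_bytes = page_name.encode("utf-8")

try:
    # On Python 2 this decode is implicit when str and unicode are mixed.
    utf8_bytes.decode("ascii")
    error = None
except UnicodeDecodeError as e:
    error = e

# Byte 5 is the first byte of the first Japanese character.
print(error)

# Decoding with the encoding the bytes were written in round-trips cleanly.
decoded = utf8_bytes.decode("utf-8")
```

The byte value and position reported here match the traceback, which suggests the index is being handed UTF-8 encoded text rather than unicode at that point.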

new traceback --simon, Thu, 01 May 2008 11:46:09 -0700

For the record, this is ZCTextIndex with a UnicodeLexicon installed:

2008-05-01 11:44:08 BLATHER ZWiki failed to index IE6JP_65e5_672c_8a9e
Traceback (most recent call last):
 File "/zope2/Products/ZWiki/Catalog.py", line 112, in index_object
   self.catalog().catalog_object(self,self.url(),idxs)
 File "/zope-2.10.5/lib/python/Products/ZCatalog/ZCatalog.py", line 535, in catalog_object
   update_metadata=update_metadata)
 File "/zope-2.10.5/lib/python/Products/ZCatalog/Catalog.py", line 360, in catalogObject
   blah = x.index_object(index, object, threshold)
 File "/zope/lib/python/Products/PluginIndexes/common/UnIndex.py", line 238, in index_object
   res += self._index_object(documentId, obj, threshold, attr)
 File "/zope/lib/python/Products/PluginIndexes/common/UnIndex.py", line 265, in _index_object
   self.insertForwardIndexEntry(datum, documentId)
 File "/zope/lib/python/Products/PluginIndexes/common/UnIndex.py", line 214, in insertForwardIndexEntry
   if indexRow is _marker:
 UnboundLocalError: local variable 'indexRow' referenced before assignment

update --simon, Thu, 01 May 2008 13:12:59 -0700

Now using betabug's patch and text indexes configured with a ZwikiLexicon. Still broken:

2008-05-01T13:11:18 BLATHER ZWiki failed to index IE6JP_65e5_672c_8a9e
Traceback (most recent call last):
 File "/zope2/Products/ZWiki/Catalog.py", line 112, in index_object
   self.catalog().catalog_object(self,self.url(),idxs)
 File "/zope-2.10.5/lib/python/Products/ZCatalog/ZCatalog.py", line 535, in catalog_object
   update_metadata=update_metadata)
 File "/zope-2.10.5/lib/python/Products/ZCatalog/Catalog.py", line 360, in catalogObject
   blah = x.index_object(index, object, threshold)
 File "/zope/lib/python/Products/PluginIndexes/common/UnIndex.py", line 234, in index_object
   res += self._index_object(documentId, obj, threshold, attr)
 File "/zope/lib/python/Products/PluginIndexes/common/UnIndex.py", line 261, in _index_object
   self.insertForwardIndexEntry(datum, documentId)
 File "/zope/lib/python/Products/PluginIndexes/common/UnIndex.py", line 207, in insertForwardIndexEntry
   indexRow = self._index.get(entry, _marker)
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 5: ordinal not in range(128)

progress! --simon, Thu, 01 May 2008 13:40:08 -0700

I found it necessary to delete the whole catalog and run setupCatalog/setupTracker again. Now, using the built-in ZwikiLexicon, it can index every page. Yay!

For reference: the ZCTextIndex add form is hard-coded to list only ZCTextIndex Lexicons. To experiment with different lexicons, you need to patch it like so:

diff -rN old-zope-2.10.5/lib/python/Products/ZCTextIndex/dtml/addZCTextIndex.dtml new-zope-2.10.5/lib/python/Products/ZCTextIndex/dtml/addZCTextIndex.dtml
64c64
<     <dtml-in expr="superValues('ZCTextIndex Lexicon')">
---
>     <dtml-in expr="superValues(['ZCTextIndex Lexicon','ZCTextIndex Unicode Lexicon','ZwikiLexicon'])">

fixed --simon, Thu, 01 May 2008 17:10:16 -0700

Status: open => closed

We are still polishing this feature in Zwiki, but I believe zwiki.org is now cataloging everything correctly.