
Using the Zettair Search Engine locally on a Site

In order to provide a search facility on a site, two distinct tasks need to be implemented: the content of the site must be indexed, and search queries must be matched against that index. The Zettair Search Engine handles both tasks very efficiently; however, for a fully functional search system a few helper tools need to be provided.

Zettair development seems to have stopped in 2009; the latest public version, 0.94.09_03 (March 2009), can be downloaded from the official site. In the course of implementing the search facility for my BLog, a few quirks needed fixing, and therefore I provide a fork with the fixes on my GitHub site - Cyclaero/Zettair.

Content Indexing

Zettair comes with a built-in HTML parser; other document formats must be converted to HTML before indexing. In addition, Zettair is not aware of multibyte text encodings such as UTF-8. This limits Zettair to sites whose content can mostly be represented in one of the single-byte encodings - on my site this is ISO-8859-1. All content on my BLog is stored in UTF-8, so before feeding it into Zettair’s indexing machine, it must be converted to ISO-8859-1. On said GitHub site, I provide a number of shell and Python scripts which handle a certain choice of conversions.
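
The conversion step can be tried standalone with iconv(1). A minimal sketch with hypothetical file names - the //TRANSLIT suffix substitutes an approximation for characters that have no ISO-8859-1 equivalent instead of aborting the conversion:

```shell
# Create a UTF-8 sample containing a character (en dash) that is not
# part of ISO-8859-1, and one (e-acute) that is.
printf 'R\303\251sum\303\251 \342\200\223 notes\n' > /tmp/sample-utf8.txt

# Convert to ISO-8859-1; the e-acute maps directly to byte 0xE9,
# while the en dash is transliterated to an approximation.
iconv -f UTF-8 -t ISO-8859-1//TRANSLIT /tmp/sample-utf8.txt > /tmp/sample-iso.txt
```

The spider script below additionally appends //IGNORE, which silently drops any character that can be neither mapped nor approximated.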

Another obstacle arises from the requirement that the index be updated whenever some content on the site has changed. I solved this by way of a cron job which checks every minute for the presence of a "change token" in Zettair's index directory and, if present, launches the spider shell script, which performs all the tasks necessary for indexing the content:

#!/bin/sh
#
# spider script for synchronizing a documents directory with a text/html-file store for indexing with zettair
# requirements:
#  clone     -- for hard linking the store to a temporary directory
#  pdftotext -- for converting PDF to bare HTML files (tool in the poppler package)
#  iconv     -- for converting UTF-8 to ISO-8859-1
#  zet       -- for indexing (tool in the zettair package)

export LANG="en_US.UTF-8"
export MM_CHARSET="UTF-8"

DOCUMENTS_DIR="$1"
if [ ! -d "$DOCUMENTS_DIR" ]; then
   echo "The documents directory does not exist"
   exit
fi

ZETTAIR_DIR="/var/db/zettair"
if [ ! -d "$ZETTAIR_DIR" ]; then
   echo "The directory which holds the Zettair search index does not exist"
   exit
fi

cd "$ZETTAIR_DIR"

if [ ! -e "token" ] && [ "$2" != "force" ] || [ -e "running" ]; then
   if [ "$2" = "force" ]; then
      echo "$0 is running already."
   fi
   exit
else
   touch "running"
   rm -f "token"
fi

SYNCHRON="synchron"
ARTICLES="articles"

if [ -d "$SYNCHRON" ] && [ -d "$ARTICLES" ]; then
   /usr/local/bin/clone -iv0 "$DOCUMENTS_DIR" "$SYNCHRON"
   /usr/bin/find "$SYNCHRON" -links 2 -and \! -type d -delete >/dev/null 2>&1
else
   rm -rf "$SYNCHRON" "$ARTICLES"
   /usr/local/bin/clone -lv0 "$DOCUMENTS_DIR" "$SYNCHRON"
fi

/usr/bin/find "$SYNCHRON" -type d | while read -r dname ; do
   /bin/mkdir -p "$(echo "$dname" | /usr/bin/sed "s|$SYNCHRON|$ARTICLES|")"
done


cd "$SYNCHRON"

/usr/bin/find . -iname "*.html" -or -iname "*.htm" | while read -r fname ; do
   if [ -e "$DOCUMENTS_DIR/$fname" ]; then
      /bin/cat "$DOCUMENTS_DIR/$fname" | /usr/bin/sed -n 's/.*\(<[Tt][Ii][Tt][Ll][Ee]>.*<\/[Tt][Ii][Tt][Ll][Ee]>\).*/<HTML>\1<BODY>/p;/<\!--e-->/,/<\!--E-->/{/<\!--e-->/d;/<\!--E-->/d;p;}' | /usr/bin/iconv -s -f UTF-8 -t ISO-8859-1//TRANSLIT//IGNORE > "../$ARTICLES/$fname.iso.html"
      echo "</BODY></HTML>" >> "../$ARTICLES/$fname.iso.html"
   else
      rm -f "../$ARTICLES/$fname.iso.html"
   fi
done

/usr/bin/find . -iname "*.pdf" | while read -r fname ; do
   if [ -e "$DOCUMENTS_DIR/$fname" ]; then
      /usr/local/bin/pdftotext -enc Latin1 -htmlmeta "$DOCUMENTS_DIR/$fname" "../$ARTICLES/$fname.iso.html"
   else
      rm -f "../$ARTICLES/$fname.iso.html"
   fi
done

/usr/bin/find . -iname "*.txt" | while read -r fname ; do
   if [ -e "$DOCUMENTS_DIR/$fname" ]; then
      echo "<HTML><TITLE>$fname</TITLE><BODY>" > "../$ARTICLES/$fname.iso.html"
      /usr/bin/iconv -s -f UTF-8 -t ISO-8859-1//TRANSLIT//IGNORE "$DOCUMENTS_DIR/$fname" >> "../$ARTICLES/$fname.iso.html"
      echo "</BODY></HTML>" >> "../$ARTICLES/$fname.iso.html"
   else
      rm -f "../$ARTICLES/$fname.iso.html"
   fi
done

/usr/bin/find . -iname "*.rtf" -or -iname "*.rtfd.zip" | while read -r fname ; do
   if [ -e "$DOCUMENTS_DIR/$fname" ]; then
      /usr/local/bin/rtftotext.py "$DOCUMENTS_DIR/$fname" > "../$ARTICLES/$fname.iso.html"
   else
      rm -f "../$ARTICLES/$fname.iso.html"
   fi
done

/usr/bin/find . -iname "*.docx" | while read -r fname ; do
   if [ -e "$DOCUMENTS_DIR/$fname" ]; then
      /usr/local/bin/docxtotext.py "$DOCUMENTS_DIR/$fname" > "../$ARTICLES/$fname.iso.html"
   else
      rm -f "../$ARTICLES/$fname.iso.html"
   fi
done

/usr/bin/find . -iname "*.pptx" | while read -r fname ; do
   if [ -e "$DOCUMENTS_DIR/$fname" ]; then
      /usr/local/bin/pptxtotext.py "$DOCUMENTS_DIR/$fname" > "../$ARTICLES/$fname.iso.html"
   else
      rm -f "../$ARTICLES/$fname.iso.html"
   fi
done

/usr/bin/find . -iname "*.xlsx" | while read -r fname ; do
   if [ -e "$DOCUMENTS_DIR/$fname" ]; then
      /usr/local/bin/xlsxtotext.py "$DOCUMENTS_DIR/$fname" > "../$ARTICLES/$fname.iso.html"
   else
      rm -f "../$ARTICLES/$fname.iso.html"
   fi
done

cd ..


rm -rf "$SYNCHRON"
/usr/local/bin/clone -lv0 "$DOCUMENTS_DIR" "$SYNCHRON"

rm -f index*
/usr/bin/find "$ARTICLES" -name "*.iso.html" | /usr/local/bin/zet -i >/dev/null 2>&1

rm -f running
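
The cron side of the trigger is a one-line crontab entry; publishing code then merely drops the token file into the index directory. A sketch with assumed install paths - here a /tmp directory stands in for /var/db/zettair:

```shell
# Hypothetical crontab entry -- run the spider every minute; the script
# exits immediately unless the change token exists:
#   * * * * * /path/to/spider.sh /var/www/documents
#
# Publishing code requests a re-index simply by creating the token
# (demonstrated here in a stand-in directory):
ZETTAIR_DIR="/tmp/zettair-demo"
mkdir -p "$ZETTAIR_DIR"
touch "$ZETTAIR_DIR/token"
```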

RTF(d), DOCX, XLSX and PPTX files are converted to text files using custom Python scripts. Here I show only the RTF one. All scripts are available on said GitHub site, see: rtftotext.py, docxtotext.py, xlsxtotext.py, pptxtotext.py.

#!/usr/local/bin/python
# coding: utf-8

import sys, os, codecs, re, zipfile

def strict_handler(exception):
    return u"", exception.end
codecs.register_error("strict", strict_handler)

def rtf2text(path):

    if path.endswith('.rtfd.zip'):
        document = zipfile.ZipFile(path)
        rtf = document.read(os.path.basename(path).replace('.rtfd.zip', '.rtfd') + '/TXT.rtf')
        document.close()
    else:
        with open(path, 'rb') as file:
            rtf = file.read()

    pattern = re.compile(r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)", re.I)
    # control words which specify a "destination".
    destinations = frozenset((
        'aftncn','aftnsep','aftnsepc','annotation','atnauthor','atndate','atnicn','atnid',
        'atnparent','atnref','atntime','atrfend','atrfstart','author','background',
        'bkmkend','bkmkstart','blipuid','buptim','category','colorschememapping',
        'colortbl','comment','company','creatim','datafield','datastore','defchp','defpap',
        'do','doccomm','docvar','dptxbxtext','ebcend','ebcstart','factoidname','falt',
        'fchars','ffdeftext','ffentrymcr','ffexitmcr','ffformat','ffhelptext','ffl',
        'ffname','ffstattext','field','file','filetbl','fldinst','fldrslt','fldtype',
        'fname','fontemb','fontfile','fonttbl','footer','footerf','footerl','footerr',
        'footnote','formfield','ftncn','ftnsep','ftnsepc','g','generator','gridtbl',
        'header','headerf','headerl','headerr','hl','hlfr','hlinkbase','hlloc','hlsrc',
        'hsv','htmltag','info','keycode','keywords','latentstyles','lchars','levelnumbers',
        'leveltext','lfolevel','linkval','list','listlevel','listname','listoverride',
        'listoverridetable','listpicture','liststylename','listtable','listtext',
        'lsdlockedexcept','macc','maccPr','mailmerge','maln','malnScr','manager','margPr',
        'mbar','mbarPr','mbaseJc','mbegChr','mborderBox','mborderBoxPr','mbox','mboxPr',
        'mchr','mcount','mctrlPr','md','mdeg','mdegHide','mden','mdiff','mdPr','me',
        'mendChr','meqArr','meqArrPr','mf','mfName','mfPr','mfunc','mfuncPr','mgroupChr',
        'mgroupChrPr','mgrow','mhideBot','mhideLeft','mhideRight','mhideTop','mhtmltag',
        'mlim','mlimloc','mlimlow','mlimlowPr','mlimupp','mlimuppPr','mm','mmaddfieldname',
        'mmath','mmathPict','mmathPr','mmaxdist','mmc','mmcJc','mmconnectstr',
        'mmconnectstrdata','mmcPr','mmcs','mmdatasource','mmheadersource','mmmailsubject',
        'mmodso','mmodsofilter','mmodsofldmpdata','mmodsomappedname','mmodsoname',
        'mmodsorecipdata','mmodsosort','mmodsosrc','mmodsotable','mmodsoudl',
        'mmodsoudldata','mmodsouniquetag','mmPr','mmquery','mmr','mnary','mnaryPr',
        'mnoBreak','mnum','mobjDist','moMath','moMathPara','moMathParaPr','mopEmu',
        'mphant','mphantPr','mplcHide','mpos','mr','mrad','mradPr','mrPr','msepChr',
        'mshow','mshp','msPre','msPrePr','msSub','msSubPr','msSubSup','msSubSupPr','msSup',
        'msSupPr','mstrikeBLTR','mstrikeH','mstrikeTLBR','mstrikeV','msub','msubHide',
        'msup','msupHide','mtransp','mtype','mvertJc','mvfmf','mvfml','mvtof','mvtol',
        'mzeroAsc','mzeroDesc','mzeroWid','nesttableprops','nextfile','nonesttables',
        'objalias','objclass','objdata','object','objname','objsect','objtime','oldcprops',
        'oldpprops','oldsprops','oldtprops','oleclsid','operator','panose','password',
        'passwordhash','pgp','pgptbl','picprop','pict','pn','pnseclvl','pntext','pntxta',
        'pntxtb','printim','private','propname','protend','protstart','protusertbl','pxe',
        'result','revtbl','revtim','rsidtbl','rxe','shp','shpgrp','shpinst',
        'shppict','shprslt','shptxt','sn','sp','staticval','stylesheet','subject','sv',
        'svb','tc','template','themedata','title','txe','ud','upr','userprops',
        'wgrffmtfilter','windowcaption','writereservation','writereservhash','xe','xform',
        'xmlattrname','xmlattrvalue','xmlclose','xmlname','xmlnstbl',
        'xmlopen','NeXTGraphic',
    ))

    # Translation of some special characters.
    specialchars = {
        'par': '\n',
        'sect': '\n\n',
        'page': '\n\n',
        'line': '\n',
        'tab': '\t',
        'emdash': u'\u2014',
        'endash': u'\u2013',
        'emspace': u'\u2003',
        'enspace': u'\u2002',
        'qmspace': u'\u2005',
        'bullet': u'\u2022',
        'lquote': u'\u2018',
        'rquote': u'\u2019',
        'ldblquote': u'\u201C',
        'rdblquote': u'\u201D',
    }

    stack = []
    ignorable = False           # Whether this group (and all inside it) are "ignorable".
    ucskip = 1                  # Number of ASCII characters to skip after a unicode character.
    curskip = 0                 # Number of ASCII characters left to skip
    out = []                    # Output buffer.
    for match in pattern.finditer(rtf):
        word, arg, hex, char, brace, tchar = match.groups()

        if brace:
            curskip = 0
            if brace == '{':    # Push state
                stack.append((ucskip, ignorable))
            elif brace == '}':  # Pop state
                ucskip, ignorable = stack.pop()

        elif char:              # \x (not a letter)
            curskip = 0
            if char == '~':
                if not ignorable:
                    out.append(u'\xA0')
            elif char in '{}\\\n':
                if not ignorable:
                    out.append(char)
            elif char == '*':
                ignorable = True

        elif word:              # \foo
            curskip = 0
            if word in destinations:
                ignorable = True
            elif ignorable:
                pass
            elif word in specialchars:
                out.append(specialchars[word])
            elif word == 'uc':
                ucskip = int(arg)
            elif word == 'u':
                c = int(arg)
                if c < 0:
                    c += 0x10000
                if c > 127:
                    out.append(unichr(c))
                else:
                    out.append(chr(c))
                curskip = ucskip

        elif hex:               # \'xx
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                c = int(hex,16)
                if c > 127:
                    out.append(unichr(c))
                else:
                    out.append(chr(c))

        elif tchar:
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                out.append(tchar)

    return ''.join(out)


try:
    print '<HTML><BODY>'
    print rtf2text(str(sys.argv[1])).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').encode('ISO-8859-1', 'ignore')
    print '</BODY></HTML>'
except:
    sys.exit(1)

sys.exit(0)

Zettair’s indexing machine is very fast, and in my use case it is not worth the hassle to implement incremental indexing. Therefore, from the point of view of Zettair, re-indexing means: delete the old index and create a new one. The format conversions, on the other hand, may take their time. Especially time-consuming is the conversion of large PDF files with the tool pdftotext(1), which is part of the poppler-utils. For this reason, we want to convert only new or changed files.

For this I utilize a special feature of my clone(1) tool - see Cyclaero/Clone on GitHub. The -l option lets clone create a cloned directory tree consisting only of hard links to the original files. Now, if a file in the original content directory has been removed, added or exchanged since the last run, the link count of the respective file in either of the directory trees is only 1, while unchanged files keep a link count of 2. The spider shell script above uses the find(1) tool to recursively identify the changes and to call the respective conversion scripts/tools.
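
The link-count trick does not depend on clone(1) itself and can be demonstrated with plain ln(1) and find(1); a sketch with hypothetical paths:

```shell
# Start from a clean slate for the demonstration.
rm -rf /tmp/orig /tmp/mirror
mkdir -p /tmp/orig /tmp/mirror

# An unchanged file is present in the tree AND in the hard-linked
# mirror, so its link count is 2.
printf 'unchanged\n' > /tmp/orig/keep.txt
ln /tmp/orig/keep.txt /tmp/mirror/keep.txt

# A newly added (or replaced) file exists only once, so its link
# count is 1 -- this marks it as new/changed.
printf 'new content\n' > /tmp/orig/new.txt

# Only new or changed files turn up with a link count of 1:
find /tmp/orig -type f -links 1
```

The spider script applies the inverse filter, `find "$SYNCHRON" -links 2 ... -delete`, to prune the files that have not changed, so that only the remaining ones get converted again.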

Handling Search Queries

Search queries for my BLog are handled by the search-delegate plugin of my ContentCGI daemon - see also Cyclaero/ContentCGI on GitHub. The Zettair search engine is compiled into the binary of said plugin. Search queries are passed directly into Zettair’s index_search() function, and a web page with the search results is then dynamically generated and returned. Here I show only an excerpt of the respective C routine - the whole code can be seen on GitHub:

...
   if ((request->POSTtable && (node = findName(request->POSTtable, "search", 6))
     || request->QueryTable && (node = findName(request->QueryTable, "tag", 3)))
     && node->value.s && node->value.s[0] != '\0')
   {
      struct stat st;
      if (stat(ZETTAIR_DB_PATH"index.v.0", &st) == no_error)
      {
         if (st.st_mtimespec.tv_sec != index.stat.st_mtimespec.tv_sec || st.st_mtimespec.tv_nsec != index.stat.st_mtimespec.tv_nsec)
         {
            index_delete(index.idx);
            index.idx = index_load(ZETTAIR_DB_PATH"index", 50*1024*1024, INDEX_LOAD_NOOPT, &index.lopt);
         }

         struct index_search_opt opt = {.u.okapi={1.2, 1e10, 0.75}, 0, 0, INDEX_SUMMARISE_TAG};
         struct index_result    *res = allocate(100*sizeof(struct index_result), default_align, false);
         iconv_t  utfToIso, isoToUtf;
         if (index.idx && res && (utfToIso = iconv_open("ISO-8859-1//TRANSLIT//IGNORE", "UTF-8")))
         {
            size_t origLen = strvlen(node->value.s), convLen = 4*origLen;
            char  *orig = node->value.s, *conv = alloca(convLen + 1), *siso = conv;
            iconv(utfToIso, &orig, &origLen, &conv, &convLen); *conv = '\0';
            iconv_close(utfToIso);

            unsigned i, k, n;
            double   total;
            int      estim;
            if (index_search(index.idx, siso, 0, 100, res, &n, &total, &estim, INDEX_SEARCH_SUMMARY_TYPE, &opt)
             && (isoToUtf = iconv_open("UTF-8//TRANSLIT//IGNORE", "ISO-8859-1")))
            {
               response->content = newDynBuffer().buf;
               dynAddString((dynhdl)&response->content, SEARCH_PREFIX, SEARCH_PREFIX_LEN);
               dynAddString((dynhdl)&response->content, ((node = findName(request->serverTable, "CONTENT_TITLE", 13)) && node->value.s && *node->value.s) ? node->value.s : "Content", 0);
               dynAddString((dynhdl)&response->content, SEARCH_BODY_FYI, SEARCH_BODY_FYI_LEN);

               for (i = 0, k = 0; i < n; i++)
               {
                  char *href, *hend;
                  if ((href = strstr(res[i].auxilliary, ZETTAIR_DB_PATH))
                   && (hend = strstr(href += ZETTAIR_DB_PLEN, ".iso.html")))
                  {
                     dynAddString((dynhdl)&response->content, "<H1><A href=\"", 13);
                     dynAddString((dynhdl)&response->content, href, hend-href);
                     dynAddString((dynhdl)&response->content, "\">", 2);

                     size_t origLen, convLen;
                     char  *orig, *conv, *utf8;
                     if (res[i].title[0] != '\0')
                     {
                        origLen = strvlen(res[i].title);
                        convLen = 4*origLen;
                        orig = res[i].title;
                        utf8 = conv = alloca(convLen + 1);
                        iconv(isoToUtf, &orig, &origLen, &conv, &convLen); *conv = '\0';
                        dynAddString((dynhdl)&response->content, utf8, conv-utf8);
                     }
                     else
                        dynAddString((dynhdl)&response->content, href, hend-href);
                     dynAddString((dynhdl)&response->content, "</A></H1>\n", 10);

                     if (res[i].summary[0] != '\0')
                     {
                        origLen = strvlen(res[i].summary);
                        convLen = 4*origLen;
                        orig = res[i].summary;
                        utf8 = conv = alloca(convLen + 1);
                        iconv(isoToUtf, &orig, &origLen, &conv, &convLen); *conv = '\0';
                        dynAddString((dynhdl)&response->content, "<P>", 3);
                        dynAddString((dynhdl)&response->content, utf8, conv-utf8);
                        dynAddString((dynhdl)&response->content, "</P>\n", 5);
                     }

                     k++;
                  }
               }

               if (k == 0)
                  dynAddString((dynhdl)&response->content, SEARCH_NORESULT, SEARCH_NORESULT_LEN);

               dynAddString((dynhdl)&response->content, SEARCH_SUFFIX, SEARCH_SUFFIX_LEN);

               iconv_close(isoToUtf);

               response->contdyn = -true;
               response->contlen = dynlen((dynptr){response->content});
               response->conttyp = "text/html; charset=utf-8";
            }

            deallocate(VPR(res), false);
         }
      }

      return (response->contlen) ? 200 : 500;
   }
...

Note how the UTF-8 to/from ISO-8859-1 conversion is done on the fly. Let’s try it:

https://obsigna.com/_search?tag=WordPress

Copyright © Dr. Rolf Jansen - 2018-10-08 18:58:26
