I’ve just released the first version of a Lojban tokenizer. It is intended for machine learning applications and therefore differs from a linguistic tokenizer: in particular, it does sub-word tokenization.

Additionally, there is a lexer, which can be used to develop alternative tokenizers.

Home page: https://github.com/olpa/lojban-mt/tree/master/tokenizer/

Quick start:

$ VERSION=1.0.0
$ pip3 install https://github.com/olpa/lojban-mt/releases/download/tokenizer-v${VERSION}/jbotokenizer-${VERSION}.tar.gz

$ echo 'coirodo' | jboparse.py
coi ro do

$ jboparse.py coi ro do
coi ro do

$ jboparse.py coi ro do --lex
(<TokenClass.CMAVO: 2>, 'coi') (<TokenClass.SKIP: 1>, ' ')
(<TokenClass.CMAVO: 2>, 'ro') (<TokenClass.SKIP: 1>, ' ')
(<TokenClass.CMAVO: 2>, 'do')

$ jboparse.py lojbangirz
logji## bangu## girzu
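In this output, the trailing `##` marks a sub-word token that is continued by the next token: the lujvo lojbangirz decomposes into the gismu logji, bangu, girzu. A minimal sketch of grouping such tokens back into word units — this helper is illustrative and not part of jbotokenizer:

```python
def group_subwords(tokens):
    """Group sub-word tokens into words.

    A trailing '##' (as in the lujvo example above) is assumed to mean
    "continued by the next token". Illustrative only, not a library API.
    """
    words, current = [], []
    for tok in tokens:
        if tok.endswith('##'):
            current.append(tok[:-2])  # strip the continuation marker
        else:
            current.append(tok)
            words.append(current)
            current = []
    if current:  # a dangling continuation at the end of input
        words.append(current)
    return words

print(group_subwords(['logji##', 'bangu##', 'girzu']))
# [['logji', 'bangu', 'girzu']]
```

Note that joining the pieces does not reproduce the surface form lojbangirz, since lujvo are built from shortened rafsi forms of the gismu.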

$ python3
>>> from jbotokenizer import text_to_tokens
>>> text_to_tokens('ma nuzba')
['ma', 'nuzba']
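Since the tokenizer targets machine learning use, a typical next step is mapping the token list to integer ids. A minimal vocabulary sketch; this class is my own illustration, not something jbotokenizer provides:

```python
class Vocab:
    """Toy token-to-id mapping for feeding tokens into an ML model.

    Illustrative only: jbotokenizer itself does not ship a vocabulary.
    """
    def __init__(self):
        self.token_to_id = {'<unk>': 0}  # id 0 reserved for unknowns

    def add(self, token):
        # Assign the next free id to a token seen for the first time.
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.token_to_id)
        return self.token_to_id[token]

    def encode(self, tokens):
        # Unknown tokens fall back to the <unk> id.
        return [self.token_to_id.get(t, 0) for t in tokens]

vocab = Vocab()
for t in ['ma', 'nuzba']:
    vocab.add(t)
print(vocab.encode(['ma', 'nuzba', 'coi']))  # [1, 2, 0]
```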
