TinySegmenter in Python


What is this?

"TinySegmenter in Python" is a Python port of TinySegmenter, an extremely compact (23KB) Japanese tokenizer originally written in JavaScript by Mr. Taku Kudo. It works on Python 2.5 or above.

The interface of "TinySegmenter in Python" is compatible with NLTK's TokenizerI, although the distribution file below does not depend on NLTK directly. To use it as a tokenizer in NLTK, modify the first few lines of the code as follows:

import nltk
import re
from nltk.tokenize.api import TokenizerI

class TinySegmenter(TokenizerI):
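
If the same file should load both with and without NLTK, one possible approach is a guarded import. The following is only a sketch under assumptions, not part of the distributed tinysegmenter.py: StandInSegmenter is a hypothetical placeholder class that splits on whitespace, used here so the example runs on its own, and the fallback to object is an illustrative choice.

```python
try:
    # Real NLTK interface; available only when NLTK is installed.
    from nltk.tokenize.api import TokenizerI
except ImportError:
    # Assumed fallback so the module still imports without NLTK.
    TokenizerI = object


class StandInSegmenter(TokenizerI):
    """Hypothetical stand-in for TinySegmenter: splits on whitespace."""

    def tokenize(self, text):
        # Return a list of token strings, as TokenizerI.tokenize expects.
        return text.split()


segmenter = StandInSegmenter()
tokens = segmenter.tokenize(u"a b c")
```

With this pattern, the class line stays the same in both cases; only the base class it resolves to differs depending on whether NLTK can be imported.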

Download and Usage

Download the source code here: tinysegmenter.py (hosted by Google Code). TinySegmenter in Python is freely distributable under the terms of the new BSD license.

There is no need to install it: just copy it anywhere, import it, and use it as in the following example:

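# -*- coding: utf-8 -*-
from tinysegmenter import TinySegmenter

segmenter = TinySegmenter()
# tokenize() returns a list of tokens; join them with " | " for display
print(' | '.join(segmenter.tokenize(u"私の名前は中野です")))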

私 | の | 名前 | は | 中野 | です

Features (from the original TinySegmenter)

Acknowledgement

I sincerely thank Mr. Kudo for his work on such wonderful software.
