TinySegmenter in Python

What is this?

“TinySegmenter in Python” is a Python port of TinySegmenter, an extremely compact (23KB) Japanese tokenizer originally written in JavaScript by Mr. Taku Kudo. It works on Python 2.5 or above.

The interface of “TinySegmenter in Python” is compatible with NLTK’s TokenizerI, although the distribution file below does not depend on NLTK. If you’d like to use it as a tokenizer in NLTK, modify the first few lines of the code as follows:

	import nltk 
	import re 
	from nltk.tokenize.api import * 

	class TinySegmenter(TokenizerI):
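
What “compatible with TokenizerI” means in practice is that the class exposes a tokenize(s) method returning a list of strings, so code written against that method does not care which tokenizer it receives. The sketch below illustrates this with a stand-in class; DuckTokenizer and segment_all are hypothetical names invented for this example, the per-character splitting is only a placeholder and not TinySegmenter's model, and the code is written with Python 3 in mind.

```python
class DuckTokenizer(object):
    """Hypothetical stand-in exposing the same entry point as a TokenizerI subclass."""

    def tokenize(self, s):
        # Placeholder logic: one token per non-space character.
        # This is NOT how TinySegmenter segments text.
        return [ch for ch in s if not ch.isspace()]

    def tokenize_sents(self, strings):
        # Same shape as the convenience method TokenizerI derives from tokenize().
        return [self.tokenize(s) for s in strings]


def segment_all(tokenizer, sentences, sep=" | "):
    """Join the tokens of each sentence with `sep`, for any object with tokenize()."""
    return [sep.join(tokenizer.tokenize(s)) for s in sentences]
```

Under this assumption, an instance of TinySegmenter could be passed to segment_all in place of the stand-in, with or without the NLTK base class.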

Download and Usage

Download the source code here: tinysegmenter.py (hosted by Google Code). TinySegmenter in Python is freely distributable under the terms of the new BSD licence.

No installation is needed. Just copy the file anywhere, import it, and use it as in the following example:

	from tinysegmenter import * 

	segmenter = TinySegmenter() 
	print ' | '.join(segmenter.tokenize(u"私の名前は中野です")) 

	私 | の | 名前 | は | 中野 | です 

Features (from the original TinySegmenter)

  • Around 95% segmentation precision for Japanese news articles.
  • Segmentation units are compatible with MeCab + ipadic.
  • Only 23KB of source code. Just copy it anywhere; nothing else is required.
  • No dependency on any dictionaries – character-based segmentation (Features: character, character N-grams, character types).
  • Feature selection by L1-norm regularization + Boosting.
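
To make the feature types listed above (characters, character N-grams, character types) more concrete, here is a rough sketch of how such features might be collected around one candidate word boundary. It is an illustration only, not TinySegmenter's actual feature set or weights: the names char_type and boundary_features, the Unicode ranges, and the N-gram window are all assumptions made for this example, and the code is written with Python 3 in mind.

```python
import string


def char_type(ch):
    """Return a coarse character class for one character (illustrative ranges)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch in string.digits:
        return "digit"
    if ch in string.ascii_letters:
        return "latin"
    return "other"


def boundary_features(text, i, n=2):
    """Collect features for the candidate boundary between text[i-1] and text[i]."""
    if not 1 <= i < len(text):
        raise ValueError("i must point between two characters of text")
    return {
        "left_char": text[i - 1],
        "right_char": text[i],
        # Up to n characters on each side of the boundary.
        "left_ngram": text[max(0, i - n):i],
        "right_ngram": text[i:i + n],
        "left_type": char_type(text[i - 1]),
        "right_type": char_type(text[i]),
    }
```

For example, boundary_features("私の名前", 1) describes the boundary between 私 and の: a kanji on the left and a hiragana on the right. A dictionary-free segmenter scores features like these to decide whether to split at that position.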

Acknowledgment

I sincerely thank Mr. Kudo for his work on this wonderful software.

 
