TinySegmenter in Python
What is this?
“TinySegmenter in Python” is a Python port of TinySegmenter, an extremely compact (23KB) Japanese tokenizer originally written in JavaScript by Mr. Taku Kudo. It works on Python 2.5 or above.
The interface of “TinySegmenter in Python” is compatible with NLTK’s TokenizerI, although the distribution file below does not directly depend on NLTK. To use it as a tokenizer in NLTK, modify the first few lines of the code as follows:
    import nltk
    import re
    from nltk.tokenize.api import *

    class TinySegmenter(TokenizerI):
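If you would rather not edit the source file, a thin wrapper class is another way to get the same effect. The following is only a minimal sketch and is not part of the distribution: `SegmenterAdapter` and `WhitespaceStub` are hypothetical names introduced here, and the stub stands in for the real segmenter so that the example runs on its own. It falls back to a plain base class when NLTK is not installed.

```python
try:
    # TokenizerI is NLTK's abstract tokenizer interface.
    from nltk.tokenize.api import TokenizerI
except ImportError:
    # Assumption: fall back to a plain base class so the sketch runs without NLTK.
    TokenizerI = object


class SegmenterAdapter(TokenizerI):
    """Hypothetical wrapper: exposes any object with tokenize() via TokenizerI."""

    def __init__(self, segmenter):
        self._segmenter = segmenter

    def tokenize(self, s):
        # Delegate to the wrapped segmenter and return a plain list of tokens.
        return list(self._segmenter.tokenize(s))


class WhitespaceStub(object):
    """Stand-in for the real segmenter, used only to keep this sketch self-contained."""

    def tokenize(self, text):
        return text.split()


adapter = SegmenterAdapter(WhitespaceStub())
tokens = adapter.tokenize("a b c")
print(tokens)
```

In real use, the stub would be replaced by an instance of the segmenter class from tinysegmenter.py.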
Download and Usage
Download the source code here: tinysegmenter.py (hosted by Google Code). TinySegmenter in Python is freely distributable under the terms of the new BSD license.
No installation is needed. Just copy the file anywhere, import it, and use it as in the following example:
    from tinysegmenter import TinySegmenter

    segmenter = TinySegmenter()
    # Parenthesized print works on both Python 2 and Python 3.
    print(' | '.join(segmenter.tokenize(u"私の名前は中野です")))

Output:

    私 | の | 名前 | は | 中野 | です
Features (from the original TinySegmenter)
- Around 95% segmentation precision on Japanese news articles.
- Segmentation units are compatible with MeCab + ipadic.
- Only 23KB of source code. Just copy it anywhere; nothing else is required.
- No dependency on any dictionaries – character-based segmentation (Features: character, character N-grams, character types).
- Feature selection by L1-norm regularization + Boosting.
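To give a rough idea of what "character, character N-grams, character types" can mean as features, here is a minimal sketch written for Python 3. It is an illustration only and not the feature set or character table used by TinySegmenter: the names `CHAR_TYPES`, `char_type`, and `boundary_features`, as well as the chosen Unicode ranges, are assumptions made for this example.

```python
import re

# Illustrative character classes (assumed ranges, not TinySegmenter's own table).
CHAR_TYPES = [
    ("hiragana", re.compile("[\u3041-\u309f]")),
    ("katakana", re.compile("[\u30a0-\u30ff]")),
    ("kanji", re.compile("[\u4e00-\u9fff]")),
    ("digit", re.compile("[0-9\uff10-\uff19]")),
    ("latin", re.compile("[a-zA-Z\uff21-\uff3a\uff41-\uff5a]")),
]


def char_type(ch):
    """Return the name of the first matching character class, or 'other'."""
    for name, pattern in CHAR_TYPES:
        if pattern.match(ch):
            return name
    return "other"


def boundary_features(text, i):
    """Features describing the candidate boundary between text[i-1] and text[i]."""
    if not 1 <= i < len(text):
        raise ValueError("i must satisfy 1 <= i < len(text)")
    left, right = text[i - 1], text[i]
    return {
        "left_char": left,
        "right_char": right,
        "bigram": left + right,  # character 2-gram spanning the boundary
        "left_type": char_type(left),
        "right_type": char_type(right),
    }


features = boundary_features("私の名前", 1)
print(features["left_type"], features["right_type"])
```

A character-based segmenter of this kind would score each candidate boundary from features like these with a learned model, so no dictionary lookup is involved.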
Acknowledgment
I am deeply grateful to Mr. Kudo for his work on this wonderful software.
About the Author
Masato Hagiwara currently works at the Rakuten Institute of Technology in New York as a Senior Scientist. He has previously worked on search technologies at Google, Microsoft Research, and Baidu. He is an expert in Natural Language Processing (NLP) and was the lead translator of the O'Reilly book "Natural Language Processing with Python." He is a native speaker of Japanese with a good command of English and Chinese (Mandarin). For more information, see About Me.