What kind of language technologies would the “World Government” require 30 years from now?
— And why don’t we just start now?
To contact me, follow me on Twitter: @mhagiwara, or shoot me an email at hagisan [at] gmail.com
You can find my current CV (resume): English version / Japanese version.
Research Interests
- “Un”natural Language Processing
- Lexical Knowledge Acquisition using Machine Learning and Graph-Theoretic Approaches
- Japanese Transliteration and Query Alteration
UnNatural Language Processing (UNLP) is a research field within NLP that deals with “real” and “noisy” language data that conventional “textbook” NLP techniques cannot handle. Targets of UNLP include, but are not limited to, Twitter, emoticons, noisy data, irregular named entities (NEs), unknown words, and informal language. The projects I’ve worked on so far are:
— Emoticon processing for mobile search engines
— The First Unnatural Language Processing Contest hosted by Baidu Japan
— The second Unnatural Language Processing Thematic Session at NLP2011
— ANPI_NLP (Safety Information Mining Project for 2011 Tohoku Region Pacific Coast Earthquake in Japan)
Lexical Knowledge Acquisition: Worked on the use of latent semantic models for acquiring lexical knowledge from large corpora. Recently focusing on the use of graph kernels for knowledge extraction from unsegmented Japanese text.
Japanese Transliteration and Query Alteration: Focusing on multilingual latent semantic transliteration models and query alteration.
Work Experience
- Oct. 2010 – Present: Senior Scientist – Rakuten Institute of Technology (in New York)
- Apr. 2009 – Sep. 2010: Research and Development Engineer – Baidu Japan, Inc. (worked in Shanghai / Beijing / Tokyo)
- Planned and acted as a lead developer in various projects, including the Unnatural Language Processing Contest, Baidu Mobile Corpus, and Timed Corpus.
- Worked on ranking and page analysis algorithms, including spam detection, for Baidu mobile search. Also worked on mobile emoticon search using various NLP semantic analysis techniques.
- Also worked on various NLP topics, including word and sentence analysis technologies, synonym mining and dictionary construction, proper noun detection, and the Japanese input method BaiduType.
- Apr. 2008 – Jul. 2008 : Research Intern – Microsoft Research, WA, USA. (Mentor: Hisami Suzuki)
- Proposed a state-of-the-art method for Japanese query alteration, which corrects misspellings and normalizes the spelling/transliteration variants, with higher accuracy than conventional systems.
- Implemented the system using Visual C#, SQL Server, and Ruby, with tens of gigabytes of query logs. This system is being integrated into Microsoft Live Search (http://www.live.com/).
- Developed a method to automatically and efficiently generate query rewriting pairs from session logs.
- Presented the project at the 3rd NLP Symposium for Young Researchers and was awarded the outstanding presentation award. International conference papers are being submitted as well.
- Nov. 2006 – Aug. 2007 : Developer – IPA, JAPAN: Exploratory Software Project. (Project Manager: Prof. David J. Farber)
- Accepted as the Exploratory Software Project “Serendi: A Location-Aware Social Networking Platform” (http://serendi.org/), a location-aware meta social networking service targeted at mobile devices with GPS.
- Developed the “compatibility” analysis module, which recommends users in real time based on natural language processing and network analysis. Used PHP, JavaScript, Ruby, MySQL, and ActiveRecord.
- Conducted an extensive user test with more than 50 users and confirmed the reliability of the system.
- Aug. 2005 – Sep. 2005 : Intern (Software Engineer), Google Inc., CA, USA. (Mentors: Dekang Lin and Jun Wu)
- Participated in the two-month internship program as one of the few interns chosen from Japan, in only the second year since the internship was started.
- Worked on Japanese query suggestion, which is currently used as the basis for the query suggestions shown at the top and bottom of the Google search results page.
- Made full use of parallel distributed computation algorithms such as MapReduce and the large network cluster infrastructure that Google offers.
- Apr. 2006 – Mar. 2007 : Research Assistant, Nagoya University
- Worked on research projects related to the 21st Century COE Program “Intelligent Media Integration for Social Information Infrastructure” at Nagoya University.
- Proposed and implemented context extension and selection methods for lexical similarity computation, to improve the construction of linguistic resources such as thesauri.
- Published several papers at top-tier international conferences as well as in journals. (see the “Publications” section)
- Sep. 2005 – Mar. 2006, Sep. 2006 – Mar. 2007 : Teaching Assistant, Nagoya University
- Taught “Linear Algebra” and “Automata and Formal Language Theory” to undergraduate students.
Education
- Apr. 2006 – Mar. 2009: Ph.D. Candidate, Department of Information Engineering, Graduate School of Information Science, Nagoya University, Japan
Doctoral Thesis: “Modeling and Selection of Context for Better Synonym Acquisition”
- Apr. 2004 – Mar. 2006: Master’s Program in Department of Information Engineering, Graduate School of Information Science, Nagoya University, Japan
* Entered using the grade-skipping system. Overall GPA: 3.8
Master’s Thesis: “Utilization of Probabilistic Latent Semantics for Automatic Thesaurus Construction”
- Apr. 2001 – Mar. 2004: Information Engineering Course, School of Engineering, Nagoya University, Japan. Computer Science GPA: 3.9
Publications (Selected)
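- Steven Bird, Ewan Klein, Edward Loper. 萩原正人 (Masato Hagiwara), 中山敬広 (Takahiro Nakayama), 水野貴明 (Takaaki Mizuno) (translation). 入門 自然言語処理 (Natural Language Processing with Python). O’Reilly Japan, 2010.
- Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Incremental Lexical Knowledge Acquisition from Unsegmented Text Using Graph Kernels (グラフカーネルを用いた非分かち書き文からの漸次的語彙知識獲得). Journal of the Japanese Society for Artificial Intelligence (人工知能学会誌), Vol. 26, No. 3, pp.- (Mar. 2011).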
- Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Supervised Synonym Acquisition Using Distributional Features and Syntactic Patterns. Journal of Natural Language Processing, Vol. 16, Num. 2, pp. 59-83, 2009.
- Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. A Comparative Study on Effective Context Selection for Distributional Similarity. Journal of Natural Language Processing, Vol. 15, Num. 5, pp. 119-150, 2008.
- Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Effective Use of Indirect Dependency for Distributional Similarity. Journal of Natural Language Processing, Vol. 15, Num. 4, pp. 19-42, 2008.
- Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Bootstrapping-based Extraction of Dictionary Terms from Unsegmented Legal Text. New Frontiers in Artificial Intelligence: JSAI 2008 Conference and Workshops, Revised Selected papers, Lecture Notes in Computer Science, Vol. 5447, pp. 213-227, 2009.
- Graham Neubig, Yuichiroh Matsubayashi, Masato Hagiwara, Koji Murakami. Safety Information Mining — What can NLP do in a disaster —, Proc. of IJCNLP 2011. [pdf]
- Masato Hagiwara and Satoshi Sekine. Latent Class Transliteration based on Source Language Origins. Proc. of ACL-HLT 2011 [pdf]
- Masato Hagiwara and Hisami Suzuki. Japanese Query Alteration Based on Lexical Semantic Similarity. Proc. of NAACL HLT 2009, pp. 191-199, 2009.
- Nobuyuki Shimizu, Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama and Hiroshi Nakagawa. Metric learning for synonym acquisition. Proc. of COLING 2008, pp. 793-800, 2008.
- Masato Hagiwara. A Supervised Learning Approach to Automatic Synonym Identification based on Distributional Features. Proc. of ACL 2008 Student Research Workshop, pp. 1-6, 2008. [pdf] [link]
- Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Bootstrapping-based Extraction of Dictionary Terms from Unsegmented Legal Text. Proc. of JURISIN 2008, pp. 63-72, 2008. [ppt]
- Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Context Feature Selection for Distributional Similarity. Proc. of IJCNLP 2008, pp. 553-560, 2008. [pdf] [link]
- Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Effective Proximity Distance for Word-Based Context. Proc. of SNLP 2007, pp. 105-110, 2007. [ppt] [link]
- Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Effectiveness of Indirect Dependency for Automatic Synonym Acquisition. Proc. of CoSMo 2007, pp. 1 – 8, 2007. [pdf] [ppt]
- Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. Selection of Effective Contextual Information for Automatic Synonym Acquisition. Proc. of COLING/ACL 2006, pp. 353 – 360, 2006. [pdf] [link]
- Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama. PLSI Utilization for Automatic Thesaurus Construction. Proc. of IJCNLP 2005, pp. 334 – 345, 2005. [link]
Software and Projects
- NLTK Japanese Corpora
– introductions and corpus readers for freely available Japanese corpora for NLTK
- iconlang
– a new ideographic writing system for better legibility and visibility
- TinySegmenter in Python
– an extremely compact Japanese tokenizer written in Python
- Python/Romkan
– a Romaji/Kana conversion library for Python
- frippa (http://www.frippa.com/)
- Developed the entire system of this community-based classified ads service, one of the most active peer-to-peer trading communities in Japan with more than 2,000 users.
- Runs on an original MVC framework based on Linux, MySQL, ActiveRecord, Ruby, etc.
- Implemented a feature that provides users with related items using natural language processing.
- Provided the item database in the joint project with the Reuse Market for furniture and appliances at Nagoya University in 2007, as a social contribution activity.
Also worked on user interfaces using Ajax and Flash, as a temporary developer at a few IT start-up companies, including RINEN.inc (http://rinen.cc/) and Anchor (http://anchor.vc/).
- Japanese summary translation of “The Complete Lojban Language”
– Started a study group and a translation project for the reference grammar of the artificial logical language “Lojban”.
Awards & Professional Activities
- Outstanding Presentation Award at the 17th Annual Meeting of the Association for Natural Language Processing (言語処理学会第17回年次大会 優秀発表賞). Presentation: “Latent Class Transliteration based on Source Language Origins” (原言語の起源に基づく潜在クラス翻字モデル)
- Leading editorial member of the special issue on “UnNatural Language Processing,” Journal of Natural Language Processing, 2011.
- Panelist at the joint workshop “Relationship between industry, universities, and students in the NLP field (自然言語処理における企業と大学と学生の関係),” at the 17th Annual Meeting of the Association for Natural Language Processing.
- Best Presentation Award at the 15th Annual Meeting of the Association for Natural Language Processing (言語処理学会第15回年次大会 最優秀発表賞). Presentation: “Semantic Category Extraction from Unsegmented Text using Graph Kernels” (グラフカーネルに基づく非分かち書き文からの意味的語彙カテゴリの抽出)
- Outstanding Presentation Award at the 3rd NLP Symposium for Young Researchers. Presentation: “A Unified Approach to Japanese Query Alteration based on Semantic Similarity”
- Outstanding Presentation Award at the 22nd IMI Seminar of the 21st Century COE Program. Presentation: “Utilization of Probabilistic Latent Semantics for Automatic Thesaurus Construction”
- Program Committee member for the Student Research Workshop (SRW) at ACL-IJCNLP 2009 and ACL 2012.
Computer Skills
- Languages : C, C++, C#, Clojure, Python, Ruby, JavaScript, (D)HTML
- Applications: Solr, MongoDB, MySQL, NLTK
- Platforms: Windows, Linux
5+ years of Web application development experience, including LAMP architecture
Natural Language Skills
- Japanese : Native
- English : Fluent – TOEIC score 960 (2007)
- Chinese (Mandarin) : Advanced – New HSK (汉语水平考试) Grade 6 (Dec. 2010)
About the Author
Masato Hagiwara currently works at Rakuten Institute of Technology in New York as a Senior Scientist. He has worked on search technologies at Google, Microsoft Research, and Baidu. He is an expert in Natural Language Processing (NLP) and a lead translator of the O'Reilly book "Natural Language Processing with Python." He is a native speaker of Japanese with a good command of English and Chinese (Mandarin).