The paper we submitted to IJCNLP 2011 has been accepted and will be presented at the conference, which takes place in a few weeks.
The paper describes the #ANPI_NLP project, a volunteer relief project focusing on text and safety information mining in the wake of the East Japan Earthquake of March 2011.
Here’s the full paper PDF (kindly uploaded by the lead co-author, Mr. Graham Neubig).
In the paper, we not only describe how the project started and evolved and what kinds of tasks we dealt with, but also focus on the lessons we learned from the experience.
Even after the submission, we have received useful feedback from colleagues and peer researchers. In retrospect, we could have done more during the relief effort, and even BEFORE any disaster happens.
Please read the paper if you are interested, and send us any feedback. (The floods in Thailand still continue as I write this article. I hope the conference will be held without any problems.)
My wife and I have decided to offer a program named “Intensive Chinese Weekend Stay” in New York. In this program, we invite a learner of Chinese to our home for free and provide an intensive learning course.
Intensive Chinese Weekend Stay Program (New York City)
Part of the reason we started this program is that we recently signed up for CouchSurfing, which we found very interesting (we are actually hosting an American girl next weekend, just two weeks after signing up!). We decided to impose one condition when hosting: guests should speak or be learning at least one of the CJK (i.e., Chinese, Japanese, Korean) languages, so that they can deepen their understanding of East Asian languages and cultures.
This “Intensive Chinese Weekend Stay” is an extension of the above concept. In this program we are going to provide a comprehensive pronunciation and grammar review, so if you are interested in the details, please go to the above page and apply!
On Labor Day weekend, my wife and I paid a visit to Penn State University, which is located in State College, in the middle of the state of Pennsylvania. It was a four-and-a-half-hour bus ride from New York City (Megabus on the way there and Gotobus on the way back), which was not very comfortable.
Our purpose was to visit a professor there who specializes in Confucianism and East Asian history. The visit was especially useful for my wife, giving her a deeper look at the field of East Asian philosophy and helping her set her future research direction. (We talked with the Hong Kong-born professor in Mandarin, which was interesting to me, too.)
Visiting researchers around the country like this reminds me of the blog post “Conferences: Costs and Benefits” on the natural language processing blog, where the author, Hal Daumé III, argues that inviting well-known researchers to one’s own university, visiting labs around the country, and having deep in-office conversations can make up for the large amount of money we usually spend on domestic and/or international conferences every year.
I feel more positive about this idea the longer I work here at Rakuten Institute of Technology, New York. The value of working in a hub-like place that many top-tier researchers keep visiting cannot be overestimated. That’s one of the reasons why places like Google and Microsoft Research, where many researchers and top engineers give “tech talks,” stay competitive all the time. That could be much more important than simply attending every conference, good and not-so-good alike. I would also like to create more of these opportunities for myself, hopefully starting this year.
Just for my own convenience, I’ve listed the best papers of major NLP conferences (ACL / COLING / NAACL / EMNLP / CoNLL) for the past seven years or so. If you find any mistakes, please let me know. Thanks~
ACL
- 2005: David Chiang A hierarchical phrase-based model for statistical machine translation
- 2006: Rion Snow, Daniel Jurafsky, and Andrew Ng. Semantic Taxonomy Induction from Heterogenous Evidence
- 2007: Y. W. Wong and R. J. Mooney Learning synchronous grammars for semantic parsing with lambda calculus
- 2008: Liang Huang. Forest Reranking: Discriminative Parsing with Non-Local Features
- 2008: Libin Shen, Jinxi Xu and Ralph Weischedel. A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model
- 2009: Andre Martins, Noah Smith and Eric Xing. Concise Integer Linear Programming Formulations for Dependency Parsing
- 2009: S.R.K. Branavan, Harr Chen, Luke Zettlemoyer and Regina Barzilay. Reinforcement Learning for Mapping Instructions to Actions
- 2009: Adam Pauls and Dan Klein. K-Best A* Parsing
- 2010: Matthew Gerber and Joyce Chai. Beyond NomBank: A Study of Implicit Arguments for Nominal Predicates
- 2011: Dipanjan Das, Slav Petrov. Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
COLING
- 2006: See ACL 2006
- 2008: Bill MacCartney and Christopher D. Manning. Modeling semantic containment and exclusion in natural language inference
- 2010: Fan Bu, Xiaoyan Zhu and Ming Li, Measuring the Non-compositionality of Multiword Expressions
NAACL
- 2006: Mehryar Mohri and Brian Roark Probabilistic Context-Free Grammar Induction Based on Structural Zeros
- 2006: Aria Haghighi and Dan Klein Prototype-Driven Learning for Sequence Models
- 2007: Antti-Veikko Rosti, Bing Xiang, Spyros Matsoukas, Richard Schwartz, Necip Fazil Ayan and Bonnie Dorr Combining Outputs from Multiple Machine Translation Systems
- 2009: Hoifung Poon, Colin Cherry and Kristina Toutanova Unsupervised Morphological Segmentation with Log-Linear Models
- 2009: David Chiang, Kevin Knight and Wei Wang 11,001 New Features for Statistical Machine Translation
- 2010: Aria Haghighi and Dan Klein Coreference Resolution in a Modular, Entity-Centered Model
EMNLP
- 2005: Ryan McDonald, Fernando Pereira, Kiril Ribarov and Jan Hajic Non-Projective Dependency Parsing using Spanning Tree Algorithms
- 2006: NO AWARD
- 2007: James Clarke and Mirella Lapata Modelling Compression with Discourse Constraints
- 2008: NO AWARD
- 2009: Hoifung Poon and Pedro Domingos Unsupervised semantic parsing
- 2010: Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola and David Sontag Dual Decomposition for Parsing with Non-Projective Head Automata
- 2011: Wei Lu and Hwee Tou Ng A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions
CoNLL
- 2006: Rie Kubota Ando Applying Alternating Structure Optimization to Word Sense Disambiguation
- 2007: James Clarke and Mirella Lapata. Modelling Compression with Discourse Constraints
- 2008: Xavier Carreras, Michael Collins and Terry Koo TAG, Dynamic Programming, and the Perceptron for Efficient, Feature-Rich Parsing
- 2009: Roi Reichart, Ari Rappoport. Sample Selection for Statistical Parsers: Cognitively Driven Algorithms and Evaluation Measures
- 2010: Alexander Clark Efficient, correct, unsupervised learning for context-sensitive languages
- 2011: Wen-tau Yih, Kristina Toutanova, John Platt, and Chris Meek Learning Discriminative Projections for Text Similarity Measures
The special issue “Unnatural Language Processing” of the Journal of Natural Language Processing, for which I’m a lead editorial member, issued its call for papers a few weeks ago.
This special issue, subtitled “Processing of Out-of-the-box Language Expressions,” is the sequel to the two “Unnatural Language Processing” events held last year. We welcome not only regular academic papers but also papers describing systems and data related to the theme.
Although we’ve prepared the CFP only in Japanese, which you can see here, this doesn’t mean that we are excluding submissions in English.
By the way, I’ve seen some arguments on Twitter about the title, claiming that the word “unnatural” is not suitable for the theme because the language phenomena this special issue focuses on are precisely examples of human “natural” language activity.
Let me explain a little: we (and ANLP, too) have absolutely no intention of declaring or defining them as “unnatural.” You can see that we consistently use the word “out-of-the-box” in the CFP. We regard the title as just an alias for this field, which targets the processing of language phenomena that have not received much attention so far because they are irregular and/or new. Further academic discussion should follow in the near future.
Anyway, the submission deadline is March 23rd, 2012. We welcome your submissions!
The Japanese morphological analyzer MeCab can also be called directly from Clojure by using its Java binding. I did, however, come across some JNI-related pitfalls in the process, so I’ll describe below how I overcame them so that no one else has to stumble over the same issues.
The first thing you have to do is install MeCab’s Java binding, which is rather straightforward. Download mecab-java-0.98pre3.tar.gz (the latest version at the time of writing) from here, then untar it and run make. (Be sure to set the INCLUDE variable to an appropriate path if you are using a non-typical environment, such as OpenJDK.)
One issue I encountered here is that the JVM died with SIGSEGV when I tried to run the sample Java program:
#
# An unexpected error has been detected by Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000003091c7b59b, pid=32167, tid=1106934080
#
I found one blog article which describes exactly the same problem. As the article suggests, adding the following two lines after line 710 of MeCab_wrap.cxx did the trick for me:
char work[128];                       // add this line
sprintf(work,"result:%0x\n",result);  // add this line
Now you are ready to use MeCab from Clojure code. Make sure that MeCab.jar is on the CLASSPATH and that libMeCab.so is loadable; both files are created by running “make.”
The thing is, even after importing MeCab, Tagger, and Node from org.chasen.mecab and running (System/loadLibrary "MeCab"), the Clojure code fails with an “UnsatisfiedLinkError,” which basically means that the necessary native library could not be loaded properly.
The reason, as I found out after a full hour of trial and error, is that the library is not loaded properly because of Clojure’s class loader. The solution, provided here, is to call the private “Runtime/loadLibrary0” method directly using wall-hack-method, so that the library is loaded in the class loader of the class the caller specifies:
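(import '(org.chasen.mecab MeCab Tagger Node))
(use '[clojure.contrib.java-utils :only (wall-hack-method)])
;; Call the private Runtime.loadLibrary0 so that the native library is
;; loaded in the class loader of the given class.
(defn load-lib [class lib]
  (wall-hack-method java.lang.Runtime "loadLibrary0" [Class String]
                    (Runtime/getRuntime) class lib))
(load-lib MeCab "MeCab")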
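clojure.contrib has since been deprecated, so wall-hack-method may not be available in newer setups. As far as I understand, it is a thin wrapper over Java reflection, and a minimal sketch of an equivalent helper could look like the following. The name call-private-method is my own (hypothetical), and the demo calls String’s public length method only to show the calling convention. Note that recent JDKs may refuse reflective access to non-public members of java.lang classes such as Runtime unless the JVM is started with a suitable --add-opens option, so treat this as an illustration rather than a drop-in replacement.

```clojure
;; Hypothetical stand-in for clojure.contrib.java-utils/wall-hack-method:
;; looks up a declared method by name and parameter types, makes it
;; accessible, and invokes it on obj with the given args.
(defn call-private-method
  [^Class klass method-name param-types obj & args]
  (let [method (.getDeclaredMethod klass
                                   (name method-name)
                                   (into-array Class param-types))]
    (.setAccessible method true)
    (.invoke method obj (into-array Object args))))

;; Harmless demo with the same argument order as the load-lib call above,
;; but on a public method of String.
(println (call-private-method String "length" [] "abc")) ; prints 3
```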
You can now call MeCab via Clojure’s typical Java interop functions/macros:
(println (MeCab/VERSION))
(let [tagger (new Tagger)
      sent "太郎は二郎にこの本を渡した。"]
  ;; Print the whole analysis at once
  (println (. tagger (parse sent)))
  ;; Walk the node list and print each surface form with its features
  (loop [node (. tagger (parseToNode sent))]
    (when node
      (println (str (. node getSurface) "\t" (. node getFeature)))
      (recur (. node getNext)))))
0.98pre3
太郎 名詞,固有名詞,人名,名,*,*,太郎,タロウ,タロー
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
二郎 名詞,固有名詞,人名,名,*,*,二郎,ジロウ,ジロー
に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
この 連体詞,*,*,*,*,*,この,コノ,コノ
本 名詞,一般,*,*,*,*,本,ホン,ホン
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
渡し 動詞,自立,*,*,五段・サ行,連用形,渡す,ワタシ,ワタシ
た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。 記号,句点,*,*,*,*,。,。,。
EOS
BOS/EOS,*,*,*,*,*,*,*,*
太郎 名詞,固有名詞,人名,名,*,*,太郎,タロウ,タロー
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
二郎 名詞,固有名詞,人名,名,*,*,二郎,ジロウ,ジロー
に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
この 連体詞,*,*,*,*,*,この,コノ,コノ
本 名詞,一般,*,*,*,*,本,ホン,ホン
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
渡し 動詞,自立,*,*,五段・サ行,連用形,渡す,ワタシ,ワタシ
た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。 記号,句点,*,*,*,*,。,。,。
BOS/EOS,*,*,*,*,*,*,*,*
We took a short trip to Toronto over the weekend to visit my wife’s old friends, one of whom is now spending a week in her hometown.
We went to Niagara Falls and downtown Toronto (the CN Tower, formerly the world’s tallest, was amazing, and Casa Loma was fun), had a BBQ at their wonderful house, and even enjoyed Cantonese-style dim sum!
I didn’t know that the Chinese culture brought by a large number of immigrants had penetrated so deep into the city; we found a lot of Chinese-style restaurants and supermarkets on the streets.
Both of our friends actually have Cantonese roots, which makes their cultural background very diverse. The conversation and languages were also diverse, ranging from Cantonese and English to Mandarin and even some Japanese, switching from one language to another even within a single sentence. It’s a pity that I’m all thumbs when it comes to Cantonese; I’m always motivated by linguistic diversity but have never had time to master it. Still, it’s pleasant to my ears just to listen to them speaking and to enjoy the tonal language’s melodies and exotic vowels.
Anyway, we sincerely thank Diana and Niki for their hospitality and the fun time. We’ll definitely be back soon!
I recently found out that tatoeba.org is a pretty nice resource for collecting parallel text in many languages. The major reason why I love it is that all the data is downloadable as dump files, with all the sentences under a Creative Commons license (although there are some mistakes in the sentences).
Specifically, you can just go to the Downloads page and find the dump files. All you basically need is the sentences file, in which all the sentences are stored in a tab-separated format with their IDs and languages. If you want parallel sentences, you also need the links file, which is just a list of tab-separated “source sentence id [tab] target sentence id” rows.
Once you download the files, you can “join” them in whatever way you like to get the parallel sentences. One of my favorite ways would be to store everything in MongoDB and issue a few queries; you’d get the result in a second.
This time, however, I used Clojure to join the files as a learning exercise. The trickiest part is computing the one-level transitive relation of the graph (that is, computing an edge A -> C from A -> B and B -> C), because the links dump file does not contain indirect translations, which the tatoeba.org search does show. The following Clojure code snippet does this:
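;; Assumed sources of the helpers: union from clojure.set, split (called as
;; (split re s)) from clojure.contrib.string, read-lines from clojure.contrib.io.
(use '[clojure.set :only (union)]
     '[clojure.contrib.string :only (split)]
     '[clojure.contrib.io :only (read-lines)])

;; For each node k, collect the neighbors of k's neighbors (one transitive step).
(defn get-transitive [g]
  (into {} (filter second (map (fn [[k v]] [k (reduce union (map #(g %) v))]) g))))

;; Read "source-id<TAB>target-id" rows from stdin into a map of id -> set of
;; linked ids, add the one-step transitive links, and print every edge.
(let [links  (reduce (fn [x y] (merge-with union x y))
                     (map #(let [v (split #"\t" %)]
                             {(first v) #{(second v)}})
                          (read-lines *in*)))
      tlinks (merge-with union links (get-transitive links))]
  (doseq [[k v] tlinks elm v]
    (println (str k "\t" elm))))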
The output is the original graph augmented with step-1 indirect transitive edges.
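To make the one-step transitive computation easier to check by hand, here is a small self-contained sketch on a toy graph. It uses only clojure.set; the function name one-step-links and the IDs "A", "B", "C" are made up for illustration, and it is a variation on the snippet above rather than the exact code I ran.

```clojure
(require '[clojure.set :refer [union]])

;; One transitive step: for every node, gather the neighbors of its neighbors.
;; Nodes that reach nothing new this way are left out of the result.
(defn one-step-links [g]
  (into {}
        (for [[k vs] g
              :let [reached (reduce union #{} (map #(get g % #{}) vs))]
              :when (seq reached)]
          [k reached])))

;; Toy links (illustrative IDs): A -> B and B -> C.
(def links {"A" #{"B"}, "B" #{"C"}})

;; Original graph augmented with the derived edge A -> C.
(def augmented (merge-with union links (one-step-links links)))

(println augmented)
```

If the links file lists both directions of each pair (I have not verified this), the same step also produces self-links such as A -> A, which you may want to drop with disj before printing.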
I won’t show the latter part, i.e., the extraction of parallel sentences from the joined file, because it’s pretty straightforward. The result after extracting parallel sentences in English, Japanese, and Chinese looks like the following:
How would you like your steak? ステーキの焼き方はどうなさいますか。 您的牛排要几分熟?
Even though he was tired, he went on with his work. 疲れていたけれど、彼は仕事を続けた。 他雖然很累,但是也繼續工作。
That was all Greek to me. 私にはちんぷんかんぷんでした。 我完全看不懂。
He can both speak and write Russian. 彼はロシア語が話せるし書くことができる。 他會說俄語,也會寫俄文。
You must consider it before you answer. 答える前によく考えねばならない。 在回答之前必须要考虑清楚。
…
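For reference, the extraction step that produces rows like the above can be sketched roughly as follows. This is a minimal in-memory version under a few assumptions: the sentence rows look like "id [tab] lang [tab] text" as described earlier, the IDs "1" and "2" are made up, the language code "jpn" is my guess at how Japanese is labeled in the dump, and the function names parse-sentences and parallel-pairs are hypothetical.

```clojure
(require '[clojure.string :as str])

;; Parse "id<TAB>lang<TAB>text" rows into a map of id -> {:lang ... :text ...}.
(defn parse-sentences [lines]
  (into {}
        (for [line lines
              :let [[id lang text] (str/split line #"\t" 3)]]
          [id {:lang lang :text text}])))

;; Keep the "source-id<TAB>target-id" links whose two ends are in the
;; requested source and target languages.
(defn parallel-pairs [sentences link-lines src-lang tgt-lang]
  (for [line link-lines
        :let [[src-id tgt-id] (str/split line #"\t")
              src (get sentences src-id)
              tgt (get sentences tgt-id)]
        :when (and src tgt
                   (= src-lang (:lang src))
                   (= tgt-lang (:lang tgt)))]
    [src-id (:text src) tgt-id (:text tgt)]))

;; Tiny sample with illustrative IDs; real data would come from the dump files.
(def sentences
  (parse-sentences ["1\teng\tThat was all Greek to me."
                    "2\tjpn\t私にはちんぷんかんぷんでした。"]))
(def link-lines ["1\t2" "2\t1"])

(doseq [row (parallel-pairs sentences link-lines "eng" "jpn")]
  (println (str/join "\t" row)))
```

Feeding it the links augmented with the transitive edges, instead of the raw links file, gives the indirect translations as well.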
You can extract Lojban-English parallel sentences as well:
599308 jbo le karce cu bredi 46819 eng The car is ready.
599316 jbo le mlatu cu nelci le nu sipna ne’a mi 44558 eng The cat likes to sleep beside me.
599317 jbo ti du le mi karce 56205 eng This is my car.
599321 jbo dei na jufra 547389 eng This is not a sentence.
599329 jbo lo nanla pu smaji 47434 eng The boy remained silent.
…
which I think is really useful for my Lojban study!
Enjoy~
Last Wednesday, we held the first meetup of East Asian Language Learning through Interpretation Methods. The purpose is to brush up your language skills (we target East Asian languages, namely Chinese, Japanese, and Korean) through interpretation methods.
Although this was our first time even setting up a meetup group, I believe it was quite a success, with 11 members participating in the first meetup. After the meetup, we organizers found several things that we should or shouldn’t have done. Here are some tips:
Read “Organizer Tips”
Meetup is equipped with nice tools for organizers. One of my favorites is “organizer tips,” where you can find DOs and DON’Ts for setting up a meetup group or meetings as an organizer. Read them through and they’ll definitely help you make your meetup group a better place.
Get Prepared
Since we thought that the first meetup was the best opportunity to let participants know our goals, thoughts, and purpose, we prepared a complete presentation. We also practiced how to demonstrate language practice through “interpretation methods,” and collected materials that we could use for the demonstration.
The preparation itself was actually fun, and it helps you as organizers put your ideas into shape.
Details Matter
Details matter when it comes to letting participants feel more comfortable. Here are some small items we have prepared:
- iPhone x 2 (or any portable music devices) — to play audio language materials
- Battery-powered portable speakers — to make the sound louder (that wasn’t loud enough actually)
- Printed meet-up group logo — for participants to easily recognize where we are
- Printed participant list — for organizers to take attendance
- Printed name tags — this was the best of all (many participants said it was nice.) Meetup has a functionality to export a PDF file containing name tags which you can just print out. This was particularly useful to talk about people’s names because we focus on language learning.
Bring Your Camera
Remember to take your camera to the meeting (we did, but forgot to take any pictures, what a pity). Pictures not only serve as a record, but are also a good way to let potential participants know what the meeting was like.
Send Out Reminder and Thank-You Emails
This goes without saying. It is also a great opportunity to hear the feedback from the participants.
—
In the end, we decided to hold the meeting regularly. And by the way, we are looking for another co-organizer. We appreciate your interest in our group!
Since Clojure runs on the JVM, you can easily pick a publicly available Java library (machine learning, multimedia processing, or whatever) and call it. Calling Java libraries is normally straightforward thanks to Clojure’s interop features, but you could still spend hours reading the library’s API documentation and tweaking your code accordingly, especially if you are not very used to the Java ecosystem (like me).
In the following, I’ll briefly show how to use the Java-based SVM library JLIBSVM from Clojure.
First off, we’ll be using the following helper macro and function: set-all! sets multiple fields of a Java object at once, and into-sparsevec converts a map into a SparseVector, which is used to represent vectors in JLIBSVM. And do not forget to “import” all the classes you need.
(import '(java.util HashSet Vector))
(import '(edu.berkeley.compbio.jlibsvm.kernel LinearKernel))
(import '(edu.berkeley.compbio.jlibsvm ImmutableSvmParameterGrid))
(import '(edu.berkeley.compbio.jlibsvm.binary C_SVC MutableBinaryClassificationProblemImpl))
(import '(edu.berkeley.compbio.jlibsvm.util SparseVector))
(defmacro set-all! [obj m]
`(do ~@(map (fn [e] `(set! (. ~obj ~(key e)) ~(val e))) m) ~obj))
(defn into-sparsevec [m]
  (let [sv (new SparseVector (count m))
        sm (sort-by first m)]
    (set-all! sv {indexes (int-array (map first sm))
                  values (float-array (map second sm))})
    sv))
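If the set-all! macro looks cryptic, it may help to try it on a class that ships with the JDK. The sketch below uses java.awt.Point, chosen only because it has public mutable fields x and y; it has nothing to do with JLIBSVM, and the macro is repeated so that the snippet runs on its own.

```clojure
(import '(java.awt Point))

;; Same macro as above, repeated so that this snippet is self-contained.
(defmacro set-all! [obj m]
  `(do ~@(map (fn [e] `(set! (. ~obj ~(key e)) ~(val e))) m) ~obj))

(def p (new Point))

;; Expands to roughly:
;;   (do (set! (. p x) 3) (set! (. p y) 4) p)
(set-all! p {x 3 y 4})

(println (. p x) (. p y)) ; prints 3 4
```

Note that the field names in the map are bare symbols, not keywords or strings, and that the obj expression is repeated once per field in the expansion, so it is safest to pass a plain symbol as obj.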
The rest of the process is easy:
1. Create an SVM (this time we solve a binary classification problem) and parameters for training.
(def svm (new C_SVC))
(def builder (ImmutableSvmParameterGrid/builder))
(set-all! builder {eps 1.0e-3
                   Cset (doto (new HashSet) (.add (float 1.0)))
                   kernelSet (doto (new HashSet) (.add (new LinearKernel)))})
(def param (. builder build))
2. Create a problem — which consists of training examples and their classes.
(def x1 (into-sparsevec {1 1.0}))
(def x2 (into-sparsevec {1 -1.0}))
(def vx (new Vector [x1 x2]))
(def vy (new Vector [1 -1]))
(def prob (new MutableBinaryClassificationProblemImpl String (count vy)))
(doseq [x (map list vx vy)]
  (. prob (addExample (first x) (second x))))
3. Train an SVM. This will return a trained model.
(def model (. svm (train prob param)))
4. Then you are ready to classify new test examples using the model.
(println (. model (predictLabel x1)))
(println (. model (predictLabel x2)))
(this will produce 1 and -1, respectively)
And that’s it. You are ready to use SVMs for any problem (the overall process is the same for other SVM types, e.g., regression). One drawback is that saving and loading of learned models is not implemented in JLIBSVM yet. To do that, you have to write Clojure code that directly writes or reads the SVM parameters (which is actually not so difficult), or write a Java patch to implement it.
About the Author
Masato Hagiwara currently works for Rakuten Institute of Technology in New York, as a Senior Scientist. Have worked on search technologies at Google, Microsoft Research, and Baidu in the past. Expert in Natural Language Processing (NLP). Also a lead translator of the O'Reilly book "Natural Language Processing in Python." A native speaker of Japanese. Good command of English and Chinese (Mandarin). For more information, see About Me.

