I recently found out that tatoeba.org is a pretty nice resource for collecting parallel text in many languages. The major reason why I love it is that the entire dataset is downloadable as dump files, and all the sentences are under a Creative Commons license (although some of the sentences contain mistakes).

Specifically, you can just go to the Downloads Page and find the dump files. The main thing you need is the sentences file, in which every sentence is stored in a tab-separated format along with its ID and language. If you want parallel sentences, you also need the links file, which is just a list of tab-separated “source sentence id [tab] target sentence id” rows.
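To make the two formats concrete, here is a minimal Python sketch of parsing them. It assumes each sentences row is “id [tab] language [tab] text” and each links row is “source id [tab] target id”, as described above. The function names and the tiny in-memory sample are my own illustration, not part of the dump.

```python
from collections import defaultdict

def parse_sentences(lines):
    """Map sentence id -> (language code, text) from 'id<TAB>lang<TAB>text' rows."""
    sentences = {}
    for line in lines:
        # Split on the first two tabs only, so the text keeps any later tabs.
        sid, lang, text = line.rstrip("\n").split("\t", 2)
        sentences[sid] = (lang, text)
    return sentences

def parse_links(lines):
    """Map source id -> set of target ids from 'source<TAB>target' rows."""
    links = defaultdict(set)
    for line in lines:
        src, dst = line.rstrip("\n").split("\t")[:2]
        links[src].add(dst)
    return dict(links)

# Tiny in-memory sample standing in for the real dump files (assumed layout).
sample_sentences = [
    "599308\tjbo\tle karce cu bredi\n",
    "46819\teng\tThe car is ready.\n",
]
sample_links = ["599308\t46819\n", "46819\t599308\n"]

sentences = parse_sentences(sample_sentences)
links = parse_links(sample_links)
```

For the real dump you would pass open file objects instead of the sample lists, since both functions only need an iterable of lines.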

Once you download the files, you can use your favorite way to “join” them and get the parallel sentences. One of my favorite ways would be to store everything in MongoDB and issue a few queries; you’d get the result in a second.
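If you don’t want to set up a database, the join can also be done in memory. The following is a rough Python sketch (not the MongoDB approach), assuming the sentences have already been loaded into a dict of id -> (language, text) and the links into a dict of id -> set of ids. The name join_pairs and the sample data are hypothetical.

```python
def join_pairs(sentences, links):
    """Yield one row per link whose two ends both exist in `sentences`.

    Each row is (src id, src lang, src text, dst id, dst lang, dst text).
    """
    for src, targets in sorted(links.items()):
        for dst in sorted(targets):
            # Skip links that point at sentences missing from the sentences data.
            if src in sentences and dst in sentences:
                yield (src, *sentences[src], dst, *sentences[dst])

# Hypothetical sample: one valid link and one dangling link (id "0" is unknown).
sentences = {
    "599308": ("jbo", "le karce cu bredi"),
    "46819": ("eng", "The car is ready."),
}
links = {"599308": {"46819", "0"}}

rows = list(join_pairs(sentences, links))
for row in rows:
    print("\t".join(row))
```

The dangling link is silently dropped here; whether you would rather log such rows is a design choice.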

This time, however, I used Clojure to join the files as a learning exercise. The trickiest part is computing the one-level transitive relation of the graph (that is, deriving an edge A -> C from A -> B and B -> C), because the links dump file does not contain indirect translations, whereas the tatoeba.org search shows them. The following Clojure code snippet does this:

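(require '[clojure.set :refer [union]]
         '[clojure.string :as str])

;; Map each node to the neighbors of its neighbors (its two-step targets).
;; Nodes without any two-step target are dropped by (filter second ...).
(defn get-transitive [g]
  (into {}
        (filter second
                (map (fn [[k v]] [k (reduce union (map g v))]) g))))

;; Read "source-id<TAB>target-id" rows from stdin into {source #{targets}},
;; add the two-step edges, and print every edge as a tab-separated row.
(let [links  (reduce (fn [x y] (merge-with union x y))
                     {}
                     (map #(let [v (str/split % #"\t")]
                             {(first v) #{(second v)}})
                          (line-seq (java.io.BufferedReader. *in*))))
      tlinks (merge-with union links (get-transitive links))]
  (doseq [[k v] tlinks
          elm   v]
    (println (str k "\t" elm))))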

The output is the original graph augmented with the indirect edges that are exactly one intermediate sentence away (A -> C whenever A -> B and B -> C exist).
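For readers who don’t read Clojure, here is roughly the same idea as a small Python sketch on a toy graph. The function names are mine and the node labels are made up; it is an illustration of the one-level transitive step, not a translation of the exact code above.

```python
def two_step_targets(links):
    """Map each node to the nodes reachable in exactly two hops."""
    result = {}
    for node, neighbors in links.items():
        reached = set()
        for n in neighbors:
            # Nodes that never appear as a source contribute nothing.
            reached |= links.get(n, set())
        if reached:
            result[node] = reached
    return result

def add_two_step_edges(links):
    """Return a copy of `links` augmented with the two-hop edges."""
    merged = {node: set(targets) for node, targets in links.items()}
    for node, reached in two_step_targets(links).items():
        merged[node] |= reached
    return merged

# Toy graph: A -> B and B -> C, so the step should add A -> C.
links = {"A": {"B"}, "B": {"C"}}
augmented = add_two_step_edges(links)
print(augmented)
```

One thing to watch: if the links go in both directions (A -> B and B -> A), a node reaches itself in two hops, so you may want to filter out self-links afterwards.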

I won’t show the latter part, i.e., the extraction of parallel sentences from the joined file, because it’s pretty straightforward. The result after extracting parallel sentences in English, Japanese, and Chinese looks like the following:

How would you like your steak? ステーキの焼き方はどうなさいますか。 您的牛排要几分熟?
Even though he was tired, he went on with his work. 疲れていたけれど、彼は仕事を続けた。 他雖然很累,但是也繼續工作。
That was all Greek to me. 私にはちんぷんかんぷんでした。 我完全看不懂。
He can both speak and write Russian. 彼はロシア語が話せるし書くことができる。 他會說俄語,也會寫俄文。
You must consider it before you answer. 答える前によく考えねばならない。 在回答之前必须要考虑清楚。
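The extraction step that I skipped could look something like the Python sketch below. It assumes the sentences are in a dict of id -> (language, text), the (already augmented) links are in a dict of id -> set of ids, and the language codes are eng, jpn, and cmn. The function name and the numeric IDs in the sample are made up for illustration.

```python
from itertools import product

def extract_parallel(sentences, links, langs):
    """Yield tuples of texts, one per language in `langs`, pivoting on langs[0]."""
    pivot, *others = langs
    for sid, (lang, text) in sentences.items():
        if lang != pivot:
            continue
        # For each other language, collect the linked sentences in that language.
        candidates = []
        for other in others:
            texts = sorted(
                sentences[t][1]
                for t in links.get(sid, ())
                if t in sentences and sentences[t][0] == other
            )
            candidates.append(texts)
        # product() yields nothing when any language has no translation.
        for combo in product(*candidates):
            yield (text, *combo)

# Hypothetical IDs; the texts are taken from the sample output above.
sentences = {
    "1": ("eng", "That was all Greek to me."),
    "2": ("jpn", "私にはちんぷんかんぷんでした。"),
    "3": ("cmn", "我完全看不懂。"),
    "4": ("eng", "He can both speak and write Russian."),
    "5": ("jpn", "彼はロシア語が話せるし書くことができる。"),
}
links = {"1": {"2", "3"}, "4": {"5"}}

triples = list(extract_parallel(sentences, links, ["eng", "jpn", "cmn"]))
for triple in triples:
    print("\t".join(triple))
```

In this sample the second English sentence has no Chinese translation, so it is left out of the result.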

You can extract Lojban-English parallel sentences as well:

599308 jbo le karce cu bredi 46819 eng The car is ready.
599316 jbo le mlatu cu nelci le nu sipna ne’a mi 44558 eng The cat likes to sleep beside me.
599317 jbo ti du le mi karce 56205 eng This is my car.
599321 jbo dei na jufra 547389 eng This is not a sentence.
599329 jbo lo nanla pu smaji 47434 eng The boy remained silent.

which I think is really useful for my Lojban study!

Enjoy~

 

5 Responses to Extracting Multilingual Parallel Sentences from tatoeba.org

  1. Allan SIMON says:

    Hi,

    glad you like the project. I’m actually one of the main developers behind Tatoeba. To ease all these graph-related requests, we’ve developed a dedicated database library that I will release as open source sooner or later (actually it’s already available, but I haven’t found time to write documentation or to make bindings, as it’s written in C). The new version of tatoeba I’m developing is built on it, and it permits doing a BFS very efficiently and with theoretically no limit. The new version will also provide an API (for the moment I’ve put an alpha version of it on my personal server, accessible from here: http://tato.sysko.fr/eng/api/sentences/show-random-in/jbo for example, to get a random sentence in Lojban with all the indirect translations up to a depth of 5).

    As the new version is written in C/C++ and has an embedded web server, once I have written some docs, or better, had time to package it, it should be easy for anyone to run their own tatoeba API server (which is actually no more and no less than a tatoeba-specific database server).

    Well, a lot still remains to be done, but as I’m reading your blog, I thought I’d better drop you a message now rather than forget.

  2. hagiwara says:

    Hi Allan,

    Thank you for dropping a comment. I feel really excited to know that a new version of tatoeba and its API will be available soon! I’ll definitely stay tuned.

    By the way, some of the UI messages on the tatoeba website are not translated into Japanese, or, even if they are, some look a little bit strange to me as a native speaker. How can I contribute as a translator to the website?

  3. Allan SIMON says:

    Sorry I didn’t see you replied to me,

    for the UI messages, we use Launchpad https://translations.launchpad.net/tatoeba

  4. tom jones says:

    Your title says “Extracting Multilingual Parallel Sentences from tatoeba.com”

    However, it’s actually a “.org”

    Feel free to delete this message after you make the correction.

  5. hagiwara says:

    Thanks tom, I modified the title accordingly.
