iconlang
– new ideographic writing system for better visibility and legibility
= “word”
= “god”
= “begin”
= “in”
= “the”
= “was”
= “and”
= “with”
= “-ing”
Example:
,
,
= “In the beginning was the Word, and the Word was with God, and the Word was God.”
What is iconlang?
“iconlang” is a new ideographic writing system that converts words of any language written in an alphabet (like English) into single geometric images (like Chinese characters). It is based on “identicon,” which represents hash values as images. The example sentence above represents every English word (and verb inflection suffix) by a single image. This improves the legibility of the writing system, saves page space, and shortens reading time. After sufficient training, you’ll be able to read sentences in English (or any other language) very rapidly.
The background to the invention of “iconlang” is the computerization of society, the growth of the Internet, and the resulting change in how characters are used, which changes the requirements for “characters” quite dramatically. For example, the Japanese word “きれい” used to be spelled mainly as “奇麗” in Kanji, which has fewer strokes. Now that the conversion cost with an IME is the same, the word is more and more often spelled as “綺麗,” which has more strokes but also gives a better impression (according to research by the National Institute for Japanese Language).
Ultimately, “characters” need not be hand-writable, as long as they can be typed and read. This is truer than ever now that ebooks are becoming more and more popular compared to paper books. The only reason characters are written in black-and-white strokes is that they have been hand-written, and the justification for that is more and more questionable; there has been little discussion of whether these are the best forms in terms of visibility and legibility. “iconlang” is a sort of virtual experiment to demonstrate this point.
Here are some examples. A traffic sign with
printed on it is probably much more recognizable than the written letters “stop.” Instead of hanging a plate with “open” or “close” on the door to show whether a shop is open or closed, the shop owner can just hang a plate with
or
, which can be easily identified from a distance. You can also use
or
to represent long words like “yesterday” or “tomorrow” and save space.
iconlang pros
- It can convert any word string consisting of Latin letters (A-Z) into a geometric pattern. This covers all languages written in the Latin alphabet (probably a large portion of the major languages on earth).
- It excels in visibility, enabling more rapid identification than “conventional letters” with black-and-white strokes. Of course, it is also more beautiful.
- It converts strings into geometric patterns automatically, which means that you don’t have to devise the whole vocabulary of a language by hand, as in conventional sign-based (or icon-based) artificial languages.
What is identicon, by the way?
The idea of “iconlang” is based on “identicon,” a system for visually representing hash values as quilt-like patterns. The identicon concept is introduced on the following page:
http://www.radiumsoftware.com/0702.html
A Python implementation of “identicon,” which depends on PIL (Python Imaging Library), can be found here: http://coderepos.org/share/browser/lang/python/misc/identicon.py
identicon details
The identicon algorithm divides a 32-bit hash value into bit fragments as shown in the following table. Each bit fragment plays a different role in rendering an image:
| bit range (width) | role |
|---|---|
| 1- 2 (2) | middleType |
| 3 (1) | middleInvert |
| 4- 7 (4) | cornerType |
| 8 (1) | cornerInvert |
| 9-10 (2) | cornerTurn |
| 11-14 (4) | sideType |
| 15 (1) | sideInvert |
| 16-17 (2) | sideTurn |
| 18-22 (5) | blue |
| 23-27 (5) | green |
| 28-32 (5) | red |
| Total: 32 bits | |
By the way, the Python implementation of identicon introduced above has a bug. The decode() function should look like the following:

    def decode(self, code):
        # decode the code (bit 1 in the table above is the least significant bit)
        middleType   = self.MIDDLE_PATCH_SET[code & 0x03]  # bits 1-2
        middleInvert = (code >> 2) & 0x01                  # bit 3
        cornerType   = (code >> 3) & 0x0F                  # bits 4-7
        cornerInvert = (code >> 7) & 0x01                  # bit 8
        cornerTurn   = (code >> 8) & 0x03                  # bits 9-10
        sideType     = (code >> 10) & 0x0F                 # bits 11-14
        sideInvert   = (code >> 14) & 0x01                 # bit 15
        sideTurn     = (code >> 15) & 0x03                 # bits 16-17
        blue         = (code >> 17) & 0x1F                 # bits 18-22
        green        = (code >> 22) & 0x1F                 # bits 23-27
        red          = (code >> 27) & 0x1F                 # bits 28-32
        ...
A single identicon consists of 9 (3 by 3) “patches.” Each patch is rendered as a simple geometric pattern such as a triangle or a rectangle. The “middle” patch is a horizontally and vertically symmetric pattern (the patches numbered 1, 5, 9, and 16 on the identicon introduction page above), whose type and color are controlled by the “middleType” and “middleInvert” bit fragments. The “corner” and “side” patches are controlled by their Type, Invert, and Turn fragments. The overall color of the patches is determined by the RGB values calculated from the blue, green, and red bit fragments. The original identicon implementation uses 5 bits each.
iconlang details
This section describes how iconlang maps a word to a geometric pattern based on the identicon described above. The algorithm consists of the following two steps:
- Calculate the hash value for a word
- Generate the quilt pattern from the hash value
Calculate the hash value for a word
When iconlang calculates the hash value of a word, it tries to assign similar hash values to similarly spelled words, while avoiding assigning the same hash value to different words. Specifically, it follows these steps, based on the character n-grams in the input word:
- Generate lists of character unigrams, bigrams, and trigrams from the input word, with special “starting” and “terminal” characters attached at the beginning and the end of the word. For example, the word “dog” is first converted to “^dog$” by attaching the starting and terminal characters, and then a total of 10 n-grams, namely “d” “o” “g” “^d” “do” “og” “g$” “^do” “dog” “og$”, are generated. These n-grams are all numbered in the order of generation.
- Calculate the “value” of each character n-gram by treating it as a base-28 number (the 28 digits being the 26 letters plus the starting and terminal characters). The mapping between characters and numbers is given below. It was randomly generated so that collisions (different words having the same hash value or the same quilt pattern) are minimized. This mapping may be improved.
- Let h be the hash value we want to generate. For each character n-gram generated from the input word, invert the bit of h whose position is “the value of the n-gram plus the number of the n-gram.” If this sum exceeds the hash width (34 in this case), the sum modulo 34 is used. The value of h after this operation is the final hash value.
CHAR_LOOKUP = {"k":24, "v":12, "w":22, "a":20, "l":9, "b":18, "m":14, "x":11, "y":13, "n":10, "c":4, "z":19, "o":17, "d":8, "$":5, "e":21, "p":6, "f":2, "q":0, "g":27, "r":1, "^":23, "h":15, "s":3, "t":26, "i":25, "j":16, "u":7}
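The three steps above can be sketched in Python as follows. This is only an illustrative reading of the description, not the released iconlang.py: the function names (char_ngrams, ngram_value, word_to_code) are hypothetical, and the sketch assumes that words are lowercased first, that n-grams are numbered from 0, that the first character of an n-gram is its most significant base-28 digit, and that h starts at 0. None of these details are stated in the text, so the resulting hash values may differ from those of the actual script.

```python
# Character-to-digit mapping copied from the text above.
CHAR_LOOKUP = {"k": 24, "v": 12, "w": 22, "a": 20, "l": 9, "b": 18, "m": 14,
               "x": 11, "y": 13, "n": 10, "c": 4, "z": 19, "o": 17, "d": 8,
               "$": 5, "e": 21, "p": 6, "f": 2, "q": 0, "g": 27, "r": 1,
               "^": 23, "h": 15, "s": 3, "t": 26, "i": 25, "j": 16, "u": 7}

HASH_WIDTH = 34  # hash width in bits, as stated in the text


def char_ngrams(word):
    """Return unigrams, bigrams, and trigrams in order of generation."""
    word = word.lower()            # assumption: input is lowercased
    padded = "^" + word + "$"      # attach starting and terminal characters
    unigrams = list(word)          # unigrams exclude the two markers
    bigrams = [padded[i:i + 2] for i in range(len(padded) - 1)]
    trigrams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    return unigrams + bigrams + trigrams


def ngram_value(ngram):
    """Interpret an n-gram as a base-28 number (assumed big-endian)."""
    value = 0
    for ch in ngram:
        value = value * 28 + CHAR_LOOKUP[ch]
    return value


def word_to_code(word):
    """Invert one bit of h per n-gram and return the final hash value."""
    h = 0                          # assumption: h starts with all bits 0
    for number, ngram in enumerate(char_ngrams(word)):  # assumed 0-based
        bit = (ngram_value(ngram) + number) % HASH_WIDTH
        h ^= 1 << bit              # invert the selected bit
    return h


if __name__ == "__main__":
    print(char_ngrams("dog"))
    print(format(word_to_code("dog"), "034b"))
```

Because each n-gram inverts exactly one bit, words that share most of their n-grams also share most of their inverted bits, which is how similar spellings end up with similar hash values.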
Generate the quilt pattern from the hash value
The algorithm for generating quilt patterns is basically the same as that of identicon, with two slight differences that improve legibility as letters. The first is the masks applied to the “corner” patches, so that the patterns are not always point-symmetric. The other is a smaller bit width for the patch color, which reduces the confusion caused by very similar colors.
iconlang interprets each bit as follows. The differences are shown in bold face.
| bit range (width) | role |
|---|---|
| 1- 2 (2) | middleType |
| 3 (1) | middleInvert |
| 4- 7 (4) | cornerType |
| 8 (1) | cornerInvert |
| 9-10 (2) | cornerTurn |
| 11-14 (4) | sideType |
| 15 (1) | sideInvert |
| 16-17 (2) | sideTurn |
| **18-20 (3)** | **blue** |
| **21-23 (3)** | **green** |
| **24-26 (3)** | **red** |
| **27-34 (8)** | **corner masks** |
| Total: 34 bits | |
The 8-bit corner masks are a new fragment introduced by iconlang. Each of the 8 bits corresponds to one of the 8 patches at the corners and the sides. A patch is not drawn (left missing) when its corresponding bit is 0.
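As a rough illustration of the 34-bit layout in the table above, the following Python sketch splits a hash value into its fragments. The function name decode_iconlang and the dictionary keys are hypothetical and are not taken from iconlang.py. The sketch assumes that bit 1 in the table is the least significant bit, in line with the shifts used in the identicon decode() function shown earlier. Which of the 8 mask bits belongs to which patch is not specified in the text, so the mask is returned as a single integer.

```python
def decode_iconlang(code):
    """Split a 34-bit iconlang hash into its bit fragments.

    Assumption: bit 1 of the table is the least significant bit.
    """
    return {
        "middleType":   code & 0x03,          # bits 1-2
        "middleInvert": (code >> 2) & 0x01,   # bit 3
        "cornerType":   (code >> 3) & 0x0F,   # bits 4-7
        "cornerInvert": (code >> 7) & 0x01,   # bit 8
        "cornerTurn":   (code >> 8) & 0x03,   # bits 9-10
        "sideType":     (code >> 10) & 0x0F,  # bits 11-14
        "sideInvert":   (code >> 14) & 0x01,  # bit 15
        "sideTurn":     (code >> 15) & 0x03,  # bits 16-17
        "blue":         (code >> 17) & 0x07,  # bits 18-20 (3 bits)
        "green":        (code >> 20) & 0x07,  # bits 21-23 (3 bits)
        "red":          (code >> 23) & 0x07,  # bits 24-26 (3 bits)
        "cornerMasks":  (code >> 26) & 0xFF,  # bits 27-34 (8 bits)
    }


if __name__ == "__main__":
    fields = decode_iconlang((1 << 34) - 1)  # all 34 bits set
    for name, value in fields.items():
        print(name, value)
```

With 3 bits per channel each color component takes one of 8 levels instead of 32, which matches the stated goal of avoiding colors that are too similar to tell apart.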
iconlang Python implementation
Here I release a Python script that generates iconlang images based on the above algorithm.
iconlang.py (Hosted by Google Code)
https://code.google.com/p/mhagiwara/source/browse/trunk/misc/iconlang/iconlang.py
The word2code() function converts words into hash values, which are in turn used by the render_identicon() function to generate images. The generated image is a PIL image object and can be saved with its save() method.
iconlang image files
Here I distribute iconlang image files for a list of 1000 important English words, chosen by considering basic word lists such as the LDV (Longman Defining Vocabulary) and word frequency.
iconlang image list (hosted by lilyx.net)
iconlang image list (hosted by Google Code)
iconlang samples
Here are some samples of geometric patterns generated by the iconlang algorithm. The iconlang patterns generally have the following characteristics:
- Distinct patterns are generated for different words.
- Similar patterns are generated for words with similar spellings.
- Complicated patterns are generated for long words with complicated spellings.
Numbers:
=zero
=one
=two
=three
=four
=five
=six
=seven
=eight
=nine
=ten
Articles:
=a
=an
=the
Pronouns:
=I
=you
=he
=she
=we
=they
=it
be verbs:
=be
=been
=am
=are
=is
=was
=were
About the Author
Masato Hagiwara currently works for Rakuten Institute of Technology in New York, as a Senior Scientist. Has worked on search technologies at Google, Microsoft Research, and Baidu in the past. Expert in Natural Language Processing (NLP). Also a lead translator of the O'Reilly book "Natural Language Processing in Python." A native speaker of Japanese. Good command of English and Chinese (Mandarin). For more information, see About Me.