How to use MeCab for word segmentation and part-of-speech tagging of Japanese text

MeCab is an open-source text segmentation library for text written in Japanese. It was originally developed by the Nara Institute of Science and Technology and is currently maintained by Taku Kudou (工藤拓) as part of his work on the Google Japanese Input project. The name derives from the developer’s favorite food, mekabu (和布蕪), a Japanese dish made from wakame leaves.

Source: Wikipedia, https://en.wikipedia.org/wiki/MeCab

The official website of MeCab is http://taku910.github.io/mecab/.

Installation steps:

1. Download the MeCab source code from here.

2. Build and install it.

tar zxfv mecab-X.X.tar.gz
cd mecab-X.X
./configure
make
make check
su
make install

3. Download the dictionary files from here.

4. Build and install the dictionary files.

tar zxfv mecab-ipadic-2.7.0-XXXX.tar.gz
cd mecab-ipadic-2.7.0-XXXX
./configure --with-charset=utf8
make
su
make install

After installation, you can run MeCab from the terminal, like this:

$ echo "益子修氏死去 三菱自動車前会長、経営立て直し奔走" | mecab
益子	名詞,固有名詞,人名,姓,*,*,益子,マシコ,マシコ
修	名詞,固有名詞,人名,名,*,*,修,オサム,オサム
氏	名詞,接尾,人名,*,*,*,氏,シ,シ
死去	名詞,サ変接続,*,*,*,*,死去,シキョ,シキョ
 	記号,空白,*,*,*,*, , , 
三菱自動車	名詞,固有名詞,組織,*,*,*,三菱自動車,ミツビシジドウシャ,ミツビシジドーシャ
前	接頭詞,名詞接続,*,*,*,*,前,ゼン,ゼン
会長	名詞,一般,*,*,*,*,会長,カイチョウ,カイチョー
、	記号,読点,*,*,*,*,、,、,、
経営	名詞,サ変接続,*,*,*,*,経営,ケイエイ,ケイエイ
立て直し	名詞,一般,*,*,*,*,立て直し,タテナオシ,タテナオシ
奔走	名詞,サ変接続,*,*,*,*,奔走,ホンソウ,ホンソー
EOS

Or,

$ echo "菅氏、自民総裁選へ意欲表明 「出馬の覚悟」"| mecab
菅	名詞,固有名詞,人名,姓,*,*,菅,カン,カン
氏	名詞,接尾,人名,*,*,*,氏,シ,シ
、	記号,読点,*,*,*,*,、,、,、
自民	名詞,固有名詞,組織,*,*,*,自民,ジミン,ジミン
総裁	名詞,一般,*,*,*,*,総裁,ソウサイ,ソーサイ
選	名詞,接尾,一般,*,*,*,選,セン,セン
へ	助詞,格助詞,一般,*,*,*,へ,ヘ,エ
意欲	名詞,一般,*,*,*,*,意欲,イヨク,イヨク
表明	名詞,サ変接続,*,*,*,*,表明,ヒョウメイ,ヒョーメイ
 	記号,空白,*,*,*,*, , , 
「	記号,括弧開,*,*,*,*,「,「,「
出馬	名詞,サ変接続,*,*,*,*,出馬,シュツバ,シュツバ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
覚悟	名詞,サ変接続,*,*,*,*,覚悟,カクゴ,カクゴ
」	記号,括弧閉,*,*,*,*,」,」,」
EOS
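
Each output line is the surface form, a tab, and then a comma-separated feature list; EOS marks the end of the sentence. With the IPA dictionary installed above, the features are: part of speech, three part-of-speech subcategories, conjugation type, conjugation form, base form, reading, and pronunciation. Here is a minimal sketch in Go that splits one such line into fields. The Token type and the parseLine function are names I made up for illustration, not part of MeCab or go-mecab, and the code assumes the default output format shown above.

```go
package main

import (
	"fmt"
	"strings"
)

// Token holds selected fields of one MeCab output line (hypothetical helper type).
type Token struct {
	Surface  string   // the word as it appears in the text
	POS      string   // part of speech, e.g. 名詞
	Base     string   // base (dictionary) form
	Reading  string   // reading in katakana
	Features []string // all comma-separated features, in order
}

// parseLine splits "surface<TAB>feature1,feature2,..." into a Token.
// It returns false for the EOS marker, empty lines, and lines without a tab.
func parseLine(line string) (Token, bool) {
	if line == "EOS" || line == "" {
		return Token{}, false
	}
	surface, rest, ok := strings.Cut(line, "\t")
	if !ok {
		return Token{}, false
	}
	f := strings.Split(rest, ",")
	// field returns the i-th feature, or "" if the line has fewer features
	// (words that are not in the dictionary may have no reading, for example).
	field := func(i int) string {
		if i < len(f) {
			return f[i]
		}
		return ""
	}
	return Token{
		Surface:  surface,
		POS:      field(0),
		Base:     field(6),
		Reading:  field(7),
		Features: f,
	}, true
}

func main() {
	tok, ok := parseLine("総裁\t名詞,一般,*,*,*,*,総裁,ソウサイ,ソーサイ")
	fmt.Println(ok, tok.Surface, tok.POS, tok.Base, tok.Reading)
	// prints: true 総裁 名詞 総裁 ソウサイ
}
```

Note that a plain split on "," would mis-split a feature that itself contains an ASCII comma, so treat this as a starting point rather than a complete parser.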

I plan to use MeCab in a server-side Golang project, so I need a MeCab binding for Golang. I found go-mecab.

First, I need to add some flags to my shell’s rc file so that cgo can find the MeCab headers and libraries. On my Mac that file is ~/.zshrc; on my Linux server it is ~/.bashrc. Add the lines below to the end of the rc file.

export CGO_LDFLAGS="`mecab-config --libs`"
export CGO_CFLAGS="-I`mecab-config --inc-dir`"

Then install go-mecab:

go get github.com/shogo82148/go-mecab

Now we can use MeCab in a Golang project, as in this test.go:

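package main

import (
	"fmt"
	"log"

	"github.com/shogo82148/go-mecab"
)

func main() {
	// Create a tagger with MeCab's default options.
	tagger, err := mecab.New(map[string]string{})
	if err != nil {
		log.Fatal(err)
	}
	defer tagger.Destroy()

	// XXX: avoid GC problem with MeCab 0.996 (see https://github.com/taku910/mecab/pull/24)
	tagger.Parse("")

	// Parse returns the whole analysis as one string, one token per line.
	result, err := tagger.Parse("菅氏、自民総裁選へ意欲表明 「出馬の覚悟」")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(result)
}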

Run it, and the result is:

$ go run test.go
菅	名詞,固有名詞,人名,姓,*,*,菅,カン,カン
氏	名詞,接尾,人名,*,*,*,氏,シ,シ
、	記号,読点,*,*,*,*,、,、,、
自民	名詞,固有名詞,組織,*,*,*,自民,ジミン,ジミン
総裁	名詞,一般,*,*,*,*,総裁,ソウサイ,ソーサイ
選	名詞,接尾,一般,*,*,*,選,セン,セン
へ	助詞,格助詞,一般,*,*,*,へ,ヘ,エ
意欲	名詞,一般,*,*,*,*,意欲,イヨク,イヨク
表明	名詞,サ変接続,*,*,*,*,表明,ヒョウメイ,ヒョーメイ
 	記号,空白,*,*,*,*, , , 
「	記号,括弧開,*,*,*,*,「,「,「
出馬	名詞,サ変接続,*,*,*,*,出馬,シュツバ,シュツバ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
覚悟	名詞,サ変接続,*,*,*,*,覚悟,カクゴ,カクゴ
」	記号,括弧閉,*,*,*,*,」,」,」
EOS
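
Since tagger.Parse returns the whole analysis as one string, getting the segmented words back means splitting that string yourself. Below is a rough sketch that uses only the standard library. The surfaces function is a hypothetical helper of mine, not part of go-mecab, and the input is a hard-coded, shortened copy of the output above so that the example runs without MeCab installed.

```go
package main

import (
	"fmt"
	"strings"
)

// surfaces returns the surface forms found in MeCab's default output.
// If pos is non-empty, only tokens whose first feature (the part of speech)
// equals pos are kept.
func surfaces(result, pos string) []string {
	var words []string
	for _, line := range strings.Split(result, "\n") {
		if line == "EOS" {
			break // end of the sentence
		}
		surface, features, ok := strings.Cut(line, "\t")
		if !ok {
			continue // skip lines that are not "surface<TAB>features"
		}
		first, _, _ := strings.Cut(features, ",")
		if pos != "" && first != pos {
			continue
		}
		words = append(words, surface)
	}
	return words
}

func main() {
	// Stand-in for the string returned by tagger.Parse.
	result := "菅\t名詞,固有名詞,人名,姓,*,*,菅,カン,カン\n" +
		"氏\t名詞,接尾,人名,*,*,*,氏,シ,シ\n" +
		"、\t記号,読点,*,*,*,*,、,、,、\n" +
		"自民\t名詞,固有名詞,組織,*,*,*,自民,ジミン,ジミン\n" +
		"へ\t助詞,格助詞,一般,*,*,*,へ,ヘ,エ\n" +
		"EOS\n"

	fmt.Println(surfaces(result, ""))   // prints: [菅 氏 、 自民 へ]
	fmt.Println(surfaces(result, "名詞")) // prints: [菅 氏 自民]
}
```

In a real program you would pass the result variable from test.go to surfaces instead of the hard-coded string.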

There are also MeCab bindings and integrations for many other languages and platforms, including Python, PHP, NodeJS, a web service on Docker, Ruby, Haskell, Swift, Lucene-Analyzer, VC++, Rust, and Web-API.
