mecab: utf8

In short: shouldn't it be enough to build a UTF-8 dictionary with mecab-dict-index and then run "mecab -d <path to the UTF-8 dictionary>"?
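As a rough sketch of that plan, the indexing step can be wrapped in a small helper. The function name build_utf8_dict and its arguments are made up for illustration; the indexer path and the -d/-o/-f/-t options are the ones used further down in this post. It assumes the source dictionary is in EUC-JP.

```shell
# Sketch only: build a UTF-8 binary dictionary from an EUC-JP source dictionary.
# build_utf8_dict is a hypothetical helper, not part of the mecab packages.
build_utf8_dict() {
    src=$1                                          # directory with the source *.csv / *.def files
    out=$2                                          # directory that receives the binary dictionary
    indexer=${3:-/usr/lib/mecab/mecab-dict-index}   # location on this Debian box

    if [ ! -x "$indexer" ]; then
        echo "indexer not found: $indexer" >&2
        return 1
    fi
    mkdir -p "$out" || return 1

    # -f: charset of the source files, -t: charset of the generated dictionary
    "$indexer" -d "$src" -o "$out" -f euc-jp -t utf-8
}
```

Usage would look like `build_utf8_dict /usr/share/mecab/dic/juman /var/lib/mecab/dic/utf8`, followed by `mecab -d /var/lib/mecab/dic/utf8`.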

hide@deb1:~$ dpkg -L mecab
/.
/usr
/usr/bin
/usr/bin/mecab
/usr/share
/usr/share/doc
/usr/share/doc/mecab
/usr/share/doc/mecab/bindings.html
/usr/share/doc/mecab/dic-detail.html
/usr/share/doc/mecab/dic.html
/usr/share/doc/mecab/feature.html
/usr/share/doc/mecab/format.html
/usr/share/doc/mecab/index.html
/usr/share/doc/mecab/learn.html
/usr/share/doc/mecab/libmecab.html
/usr/share/doc/mecab/mecab.html
/usr/share/doc/mecab/posid.html
/usr/share/doc/mecab/soft.html
/usr/share/doc/mecab/unk.html
/usr/share/doc/mecab/README
/usr/share/doc/mecab/AUTHORS
/usr/share/doc/mecab/README.Debian
/usr/share/doc/mecab/copyright
/usr/share/doc/mecab/changelog.Debian.gz
/usr/share/man
/usr/share/man/man1
/usr/share/man/man1/mecab.1.gz

 

hide@deb1:~$ sudo find /usr -name "mecab*" -print
/usr/bin/mecab
/usr/share/mecab
/usr/share/man/man1/mecab.1.gz
/usr/share/doc/mecab-utils
/usr/share/doc/mecab
/usr/share/doc/mecab/mecab.html
/usr/share/doc/mecab-jumandic
/usr/lib/mecab
/usr/lib/mecab/mecab-cost-train
/usr/lib/mecab/mecab-test-gen
/usr/lib/mecab/mecab-dict-info
/usr/lib/mecab/mecab-dict-gen
/usr/lib/mecab/mecab-system-eval
/usr/lib/mecab/mecab-dict-index

 

hide@deb1:~$ sudo find /etc -name "mecab*" -print
/etc/mecabrc
/etc/alternatives/mecab-dictionary

 

 

hide@deb1:~$ file /usr/lib/mecab/mecab-dict-index
/usr/lib/mecab/mecab-dict-index: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.4.1, dynamically linked (uses shared libs), for GNU/Linux 2.4.1, stripped

 

hide@deb1:~$ find /usr/share/doc/mecab -name "*" -exec grep -l utf {} \;
/usr/share/doc/mecab/dic.html
/usr/share/doc/mecab/README.Debian
/usr/share/doc/mecab/index.html

 

hide@deb1:~$ more /usr/share/doc/mecab/README.Debian
Mecab for Debian
----------------

Mecab is a morphological analysys system. It can segment and tokenize
Japanese text string, and can output with many additional informations
(pronunciation, semantic information, and others).  It will print the
result of such an operation to the standard output, so that it can be
either written to a file or further processed.

Two dictionary packages for Mecab are available: `mecab-jumandic' and
`mecab-ipadic'.  To select a dictionary, (a) edit /etc/mecabrc, or (b)
execute the command `update-alternatives --config mecab-dictionary'.

If you want to use UTF-8 when analyzing morphologicaly, try following
commands.

  % mkdir utf8
  % cd utf8
  % iconv -f euc-jp -t utf-8 /var/lib/mecab/dic/debian/dic.csv > dic.csv
  % iconv -f euc-jp -t utf-8 /var/lib/mecab/dic/debian/connect.csv > connect.csv
  % iconv -f euc-jp -t utf-8 /var/lib/mecab/dic/debian/dicrc | sed s,euc,utf8, > dicrc
  % mkmecabdic
  % mecab -d `pwd`

 -- TSUCHIYA Masatoshi <tsuchiya@namazu.org>, Mon Nov 29 18:31:54 2004

deb1:~# cd /var/lib/mecab/dic/
deb1:/var/lib/mecab/dic# ls
debian  juman
deb1:/var/lib/mecab/dic# ls juman/
char.bin  left-id.def  rewrite.def   sys.dic
dicrc     matrix.bin   right-id.def  unk.dic
deb1:/var/lib/mecab/dic# ls debian
char.bin  left-id.def  rewrite.def   sys.dic
dicrc     matrix.bin   right-id.def  unk.dic

 

deb1:/var/lib/mecab/dic# apt-cache search mkmecabdic
deb1:/var/lib/mecab/dic# apt-cache search mecab
libmecab-dev - Header files of Mecab
libmecab1 - Libraries of Mecab
mecab - Japanese morphological analysis system
mecab-jumandic - Juman dictionary compiled for Mecab
mecab-utils - Support programs of Mecab

 

deb1:/var/lib/mecab/dic# apt-get install libmecab-dev
Reading package lists… Done
Building dependency tree… Done
The following NEW packages will be installed:
  libmecab-dev
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 403kB of archives.
After unpacking 1815kB of additional disk space will be used.
Get:1 http://cdn.debian.or.jp etch/main libmecab-dev 0.93-1 [403kB]
Fetched 403kB in 1s (278kB/s)
Selecting previously deselected package libmecab-dev.
(Reading database … 46077 files and directories currently installed.)
Unpacking libmecab-dev (from …/libmecab-dev_0.93-1_i386.deb) …
Setting up libmecab-dev (0.93-1) …

deb1:/var/lib/mecab/dic# dpkg -L libmecab-dev
/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/libmecab-dev
/usr/share/doc/libmecab-dev/README
/usr/share/doc/libmecab-dev/AUTHORS
/usr/share/doc/libmecab-dev/copyright
/usr/share/doc/libmecab-dev/changelog.Debian.gz
/usr/share/man
/usr/share/man/man1
/usr/share/man/man1/mecab-config.1.gz
/usr/bin
/usr/bin/mecab-config
/usr/include
/usr/include/mecab.h
/usr/lib
/usr/lib/libmecab.a
/usr/lib/libmecab.la
/usr/lib/libmecab.so

 

deb1:/var/lib/mecab/dic# dpkg -L mecab-utils
/.
/usr
/usr/share
/usr/share/doc
/usr/share/doc/mecab-utils
/usr/share/doc/mecab-utils/README
/usr/share/doc/mecab-utils/AUTHORS
/usr/share/doc/mecab-utils/copyright
/usr/share/doc/mecab-utils/changelog.Debian.gz
/usr/lib
/usr/lib/mecab
/usr/lib/mecab/mecab-cost-train
/usr/lib/mecab/mecab-dict-gen
/usr/lib/mecab/mecab-dict-index
/usr/lib/mecab/mecab-dict-info
/usr/lib/mecab/mecab-system-eval
/usr/lib/mecab/mecab-test-gen

deb1:/var/lib/mecab/dic# !find
find / -name "mkmecabdic" -print
deb1:/var/lib/mecab/dic#

 

So mkmecabdic isn't in any of the packages, is it?
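The same check can be done without walking the whole filesystem. The sketch below looks for a tool on PATH first and then in extra directories such as /usr/lib/mecab, where the mecab-utils binaries live on this machine. find_tool is a hypothetical helper name.

```shell
# Sketch only: locate a tool on PATH or in a list of extra directories.
# Prints the first match and returns 0, or returns 1 if nothing is found.
find_tool() {
    name=$1
    shift
    # First try the normal command lookup.
    if found=$(command -v "$name" 2>/dev/null); then
        printf '%s\n' "$found"
        return 0
    fi
    # Then try the extra directories given as remaining arguments.
    for dir in "$@"; do
        if [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}
```

For example, `find_tool mkmecabdic /usr/lib/mecab || echo "not installed"` and `find_tool mecab-dict-index /usr/lib/mecab`.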

 

Reference: "Installing Mecab and converting the dictionary to UTF-8" - 森薫の日記

 

deb1:/var/lib/mecab/dic# find /usr/share/doc -name "*" -exec grep -l mecab-dict-index {} \;
/usr/share/doc/mecab/dic-detail.html
/usr/share/doc/mecab/learn.html
/usr/share/doc/mecab/unk.html
/usr/share/doc/mecab/dic.html
/usr/share/doc/mecab/index.html
/usr/share/doc/mecab/posid.html
/usr/share/doc/mecab-jumandic/README.Debian


 

deb1:/var/lib/mecab/dic# ls
debian  juman
deb1:/var/lib/mecab/dic# mkdir utf8
deb1:/var/lib/mecab/dic# ls
debian  juman  utf8

 

deb1:/var/lib/mecab/dic# /usr/lib/mecab/mecab-dict-index -d /usr/share/mecab/dic/juman -o /var/lib/mecab/dic/utf8 -f euc-jp -t utf-8
reading /usr/share/mecab/dic/juman/unk.def … 37
emitting double-array: 100% |###########################################|
reading /usr/share/mecab/dic/juman/Special2.csv … 8
reading /usr/share/mecab/dic/juman/Prefix.csv … 75
reading /usr/share/mecab/dic/juman/Special.csv … 8
reading /usr/share/mecab/dic/juman/ContentW.csv … 483161
reading /usr/share/mecab/dic/juman/Noun.koyuu.csv … 29805
reading /usr/share/mecab/dic/juman/Noun.hukusi.csv … 74
reading /usr/share/mecab/dic/juman/Noun.keishiki.csv … 10
reading /usr/share/mecab/dic/juman/Postp.csv … 104
reading /usr/share/mecab/dic/juman/Noun.suusi.csv … 46
reading /usr/share/mecab/dic/juman/Demonstrative.csv … 3
reading /usr/share/mecab/dic/juman/Suffix.csv … 1163
iconv conversion failed. skip this entry
iconv conversion failed. skip this entry
iconv conversion failed. skip this entry
iconv conversion failed. skip this entry
iconv conversion failed. skip this entry
iconv conversion failed. skip this entry
reading /usr/share/mecab/dic/juman/AuxV.csv … 415
reading /usr/share/mecab/dic/juman/Assert.csv … 30
reading /usr/share/mecab/dic/juman/Rengo.csv … 916
emitting double-array: 100% |###########################################|
emitting matrix      : 100% |###########################################|

done!
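The log says six entries were skipped but not which ones. One rough way to look for them, assuming the failures come from byte sequences in the source CSV files that are not valid EUC-JP, is to push each line through the iconv command and report the lines it rejects. find_bad_lines is a hypothetical helper, and the iconv command may not behave exactly like the conversion inside mecab-dict-index, so treat the result as a hint only.

```shell
# Sketch only: print "file:line" for every line that iconv cannot convert
# from EUC-JP to UTF-8. Starts one iconv process per line, so it is slow
# on large files such as ContentW.csv.
find_bad_lines() {
    file=$1
    n=0
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        if ! printf '%s\n' "$line" | iconv -f EUC-JP -t UTF-8 >/dev/null 2>&1; then
            printf '%s:%d\n' "$file" "$n"
        fi
    done < "$file"
}
```

For example: `for f in /usr/share/mecab/dic/juman/*.csv; do find_bad_lines "$f"; done`.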

 

deb1:/var/lib/mecab/dic# ls -al
total 16
drwxr-xr-x 4 root root 4096 Apr 18 04:47 .
drwxr-xr-x 3 root root 4096 Apr 16 11:17 ..
lrwxrwxrwx 1 root root   34 Apr 16 11:17 debian -> /etc/alternatives/mecab-dictionary
drwxr-xr-x 2 root root 4096 Apr 16 11:17 juman
drwxr-xr-x 2 root root 4096 Apr 18 04:49 utf8

 

deb1:/var/lib/mecab/dic# ls /etc/alternatives/mecab-dictionary
char.bin  dicrc  left-id.def  matrix.bin  rewrite.def  right-id.def  sys.dic  unk.dic
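Since debian is only a symlink managed by update-alternatives, which dictionary mecab picks up by default depends on what /etc/mecabrc points to. The contents of /etc/mecabrc are not shown in this post, so the sketch below assumes it contains the usual `dicdir = PATH` line. rc_dicdir is a hypothetical helper name.

```shell
# Sketch only: print the dicdir value from a mecabrc-style file.
# Assumes a line of the form "dicdir = /some/path".
rc_dicdir() {
    rc=${1:-/etc/mecabrc}
    sed -n 's/^[[:space:]]*dicdir[[:space:]]*=[[:space:]]*//p' "$rc" | head -n 1
}
```

`readlink -f "$(rc_dicdir)"` would then show the real directory behind the symlink chain.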

 

debian  juman  utf8

./juman:
char.bin  dicrc  left-id.def  matrix.bin  rewrite.def  right-id.def  sys.dic  unk.dic

./utf8:
char.bin  matrix.bin  sys.dic  unk.dic

 

deb1:/var/lib/mecab/dic/utf8# pwd
/var/lib/mecab/dic/utf8
deb1:/var/lib/mecab/dic/utf8# mecab -d `pwd`
tagger.cpp(133) [load_dictionary_resource(param)] tagger.cpp(133) [load_dictionary_resource(param)] tagger.cpp(133) [load_dict
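The error message is cut off, but the listings above show that utf8 only received the four binary files, while juman also has dicrc, left-id.def, rewrite.def and right-id.def. Presumably the missing dicrc is what stops mecab from loading the directory. A small sketch to list what a new dictionary directory lacks compared to a working one (missing_files is a hypothetical helper):

```shell
# Sketch only: print the names of files that exist in the reference
# directory but not in the target directory.
missing_files() {
    ref=$1
    target=$2
    for f in "$ref"/*; do
        [ -e "$f" ] || continue          # skip the unexpanded glob if ref is empty
        name=${f##*/}                    # strip the directory part
        [ -e "$target/$name" ] || printf '%s\n' "$name"
    done
}
```

For example: `missing_files /var/lib/mecab/dic/juman /var/lib/mecab/dic/utf8`.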

 

deb1:/var/lib/mecab/dic/utf8# cp ../juman/* .
deb1:/var/lib/mecab/dic/utf8# /usr/lib/mecab/mecab-dict-index -d /usr/share/mecab/dic/juman -o /var/lib/mecab/dic/utf8 -f euc-jp -t utf-8
reading /usr/share/mecab/dic/juman/unk.def … 37
emitting double-array: 100% |###########################################|
reading /usr/share/mecab/dic/juman/Special2.csv … 8
reading /usr/share/mecab/dic/juman/Prefix.csv … 75
reading /usr/share/mecab/dic/juman/Special.csv … 8
reading /usr/share/mecab/dic/juman/ContentW.csv … 483161
reading /usr/share/mecab/dic/juman/Noun.koyuu.csv … 29805
reading /usr/share/mecab/dic/juman/Noun.hukusi.csv … 74
reading /usr/share/mecab/dic/juman/Noun.keishiki.csv … 10
reading /usr/share/mecab/dic/juman/Postp.csv … 104
reading /usr/share/mecab/dic/juman/Noun.suusi.csv … 46
reading /usr/share/mecab/dic/juman/Demonstrative.csv … 3
reading /usr/share/mecab/dic/juman/Suffix.csv … 1163
iconv conversion failed. skip this entry
iconv conversion failed. skip this entry
iconv conversion failed. skip this entry
iconv conversion failed. skip this entry
iconv conversion failed. skip this entry
iconv conversion failed. skip this entry
reading /usr/share/mecab/dic/juman/AuxV.csv … 415
reading /usr/share/mecab/dic/juman/Assert.csv … 30
reading /usr/share/mecab/dic/juman/Rengo.csv … 916
emitting double-array: 100% |###########################################|
emitting matrix      : 100% |###########################################|

done!

 

hide@deb1:~$ mecab test.txt -d /var/lib/mecab/dic/utf8  -U %M -F "%pS%f[5]" -E "\n"| nkf -w
こちらよりメールアドレスをにゅうりょくしてごそうしんください。しょうかいせいなのでおりか えししょうたいじょうめーるをおくります。
----
あおくてもあるべきものをとうがらし

 

works!
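For reference, the options in that command are output formats: -F is the format for normal morphemes, -U the format for unknown words, and -E what is printed at the end of each sentence. %pS prints any whitespace in front of a morpheme, %f[5] prints feature field 5 (judging from the output, the reading in the Juman dictionary), and %M prints the surface string including leading whitespace, so unknown words pass through unchanged. The command can be wrapped as below; to_reading is a hypothetical helper name, and the dicrc check is my own addition.

```shell
# Sketch only: print the reading of each sentence in a text file using the
# given dictionary directory. Requires mecab and nkf to be installed.
to_reading() {
    dict=$1
    input=$2
    if [ ! -f "$dict/dicrc" ]; then
        echo "no dicrc in $dict" >&2
        return 1
    fi
    # nkf -w is kept from the original command; with UTF-8 input and a
    # UTF-8 dictionary it is probably redundant.
    mecab "$input" -d "$dict" -U '%M' -F '%pS%f[5]' -E '\n' | nkf -w
}
```

Usage: `to_reading /var/lib/mecab/dic/utf8 test.txt`.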
