阿呆的blog: Regular Expression Support for Chinese in Unicode

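I've recently been writing a scraper for TV listings, and the biggest problem turned out to be extracting the Chinese text. Pinning down where the Chinese begins and ends is a real hassle. The approach I see most often is something like ">(.*)", but it works poorly, especially on pages with no line breaks, where it cannot delimit anything (by default "." does not match a line break, so ".*" only stops at the end of a line). So I started wondering how to match Chinese characters directly in RegEx.
I had heard that a senior student at school knew RegEx, so naturally I asked him first. Unfortunately he had no experience with this, and as soon as I brought it up he seemed to find it a tough problem. In the end he suggested doing it the ASCII way and matching on a range of hexadecimal values. It is a very basic method, but I figured it probably would not work for Unicode or UTF-8... :(
After searching online, I found that it can be done, but it has to be done the Unicode way. In the end, unfortunately, the approach in php turned out to be similar to what he suggested, but it gets the job done, which is good enough.
Here are my notes: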
==========================================================
Regular Expression support for Chinese in Unicode:

Unicode is mainstream now, and working only in terms of ASCII is no longer practical.

The places where I currently use regex are mostly Java and php.
To pick out Chinese characters with regex in Java:
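// needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
// 字串 is the input string to scan
Matcher matcher = Pattern.compile("\\p{InCJKUnifiedIdeographs}").matcher(字串);
while (matcher.find())
{
    // each match is a single Chinese character
    String 一個中文字 = matcher.group();
}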

To explain "\\p{InCJKUnifiedIdeographs}":
Unicode groups its code points into named blocks. The block lists are in the following files:

Unicode 3.2 block list:
http://www.unicode.org/Public/3.2-Update/Blocks-3.2.0.txt

Unicode 4.1.0 block list:
http://www.unicode.org/Public/4.1.0/ucd/Blocks.txt

Unicode 5.0 block list:
http://www.unicode.org/Public/5.0.0/ucd/Blocks.txt

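These files list each Unicode block name together with its code point range.
Among them, "CJK Unified Ideographs" is the block that holds our Chinese characters (judging by the name, it should also cover Japanese, Simplified Chinese and Korean usage).
In RegEx, "\p" can be used to specify a Unicode block by name. In Java the block name is written without spaces and prefixed with "In", as in "\p{InCJKUnifiedIdeographs}".
The pattern then matches any character in that block's range. That is how Java does it.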
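
To make the block-name approach concrete, here is a minimal sketch that collects every run of consecutive Chinese characters from a string instead of one character at a time. The class name CjkBlockDemo and the method extractCjkRuns are made up for this illustration; only Pattern, Matcher and the "\p{In...}" block syntax come from Java itself. The sample text is "def:123:這是:AB=你好!", written with \u escapes so the source compiles whatever the file encoding is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class CjkBlockDemo {
    // One or more consecutive characters from the "CJK Unified Ideographs" block.
    private static final Pattern CJK_RUN =
            Pattern.compile("\\p{InCJKUnifiedIdeographs}+");

    /** Returns every run of CJK ideographs found in the input, in order. */
    static List<String> extractCjkRuns(String input) {
        List<String> runs = new ArrayList<>();
        Matcher matcher = CJK_RUN.matcher(input);
        while (matcher.find()) {
            runs.add(matcher.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        // The sample text, with the Chinese characters as Unicode escapes.
        String input = "def:123:\u9019\u662f:AB=\u4f60\u597d!";
        System.out.println(extractCjkRuns(input));
    }
}
```

Adding "+" after the block name is a choice made for this sketch: it returns whole words or phrases, which is usually what a scraper wants, while the single-character pattern shown earlier returns one character per match.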

For details, see:
http://www.javaworld.com.tw/confluence/display/J2SE/Regular+Expression's+Unicode+support
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

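As for php: its PCRE RegEx functions are built on the PCRE library (a library that implements Perl-compatible regular expressions).
PCRE's \p does not accept the block name "CJK Unified Ideographs".
Under PCRE, there are two common ways to match Unicode characters by code point:
1.
Use \u to specify the Unicode code point. Unfortunately php's PCRE does not support this.

2.
Use \x to specify the Unicode code point, written as \x{hex} together with the "u" modifier. This one works.
Reference: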
http://tw.php.net/manual/tw/reference.pcre.pattern.modifiers.php

Original text:
=================================================================
Spent a few days, trying to understand how to create a pattern for Unicode chars, using the hex codes. Finally made it, after reading several manuals, that weren't giving any practical PHP-valid examples. So here's one of them:

For example we would like to search for Japanese-standard circled numbers 1-9 (Unicode codes are 0x2460-0x2468) in order to make it through the hex-codes the following call should be used:
preg_match('/[\x{2460}-\x{2468}]/u', $str);

Here $str is a haystack string
\x{hex} - is an UTF-8 hex char-code
and /u is used for identifying the class as a class of Unicode chars.

Hope, it'll be useful.
=================================================================

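It writes the pattern as "/xxx/u": the trailing "u" turns on UTF-8 support.
Inside "/xxx/", "[\x{2460}-\x{2468}]" marks out a range of Unicode code points.
Applying the same technique to Chinese is then very easy.
As mentioned earlier, look in the Blocks file:
the code point range of CJK Unified Ideographs is 4E00..9FFF.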

And there's the answer!
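
As a quick sanity check on that range, the sketch below asks Java which block a few code points belong to, using Character.UnicodeBlock.of. The class name CjkRangeCheck and the helper isCjkUnified are made up for this illustration, and the result reflects the block table shipped with whatever Java version runs it.

```java
// Hypothetical helper class, named only for this example.
class CjkRangeCheck {
    /** True if the code point lies in the "CJK Unified Ideographs" block. */
    static boolean isCjkUnified(int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    public static void main(String[] args) {
        System.out.println(isCjkUnified(0x4E00)); // first code point of the range
        System.out.println(isCjkUnified(0x9FFF)); // last code point of the range
        System.out.println(isCjkUnified(0x4DFF)); // one below the range
        System.out.println(isCjkUnified(0xA000)); // one above the range
    }
}
```

The two endpoints should report true and the two neighbours false, which matches the 4E00..9FFF entry in Blocks.txt.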

Here is a php example:
=================================================================
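#!/usr/bin/php -q
<?php
$a = "def:123:這是:AB=你好!";
// 4E00..9FFF is the CJK Unified Ideographs range; /u treats the pattern and subject as UTF-8
$pattern = '/123:([\x{4e00}-\x{9fff}]+):[A-B]*/u';
preg_match($pattern, $a, $match);
echo $match[0];
?>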
=================================================================

This outputs:
=================================================================
123:這是:AB
=================================================================
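
For comparison, the same explicit-range idea also works in Java, where the regex engine accepts \uXXXX escapes inside a character class. This is only a sketch: the class name CjkRangeMatch and the method firstMatch are made up here, and the input is the same sample text as the php example, written with \u escapes so the source compiles whatever the file encoding is.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class CjkRangeMatch {
    // Same shape as the php pattern, with the range given as regex Unicode escapes.
    private static final Pattern PATTERN =
            Pattern.compile("123:([\\u4e00-\\u9fff]+):[A-B]*");

    /** Returns the whole match (group 0), or null if nothing matches. */
    static String firstMatch(String input) {
        Matcher matcher = PATTERN.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // The sample text, with the Chinese characters as Unicode escapes.
        String input = "def:123:\u9019\u662f:AB=\u4f60\u597d!";
        System.out.println(firstMatch(input));
    }
}
```

On a UTF-8 console this should print the same "123:這是:AB" as the php version. In Java the explicit range and the block name select the same characters, so which one to use is mostly a matter of readability.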

March 22, 2007
