2016-04-21

Chinese Word Segmentation


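As the volume of information has grown rapidly, search engines have become the first tool people reach for when looking something up. Large search engines such as Google, Baidu, Yahoo, and NetEase's recently launched Youdao are a constant topic of discussion.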

What Is Chinese Word Segmentation?

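English is written word by word, with words separated by spaces, whereas Chinese is written character by character, and the characters of a sentence only convey meaning when read together. For example, the English sentence "I am a student" is "我是一个学生" in Chinese. A computer can tell from the spaces that "student" is a word, but it cannot easily tell that the two characters "学" and "生" together form a single word. Cutting a sequence of Chinese characters into meaningful words is called Chinese word segmentation (中文分词), which some people also call word cutting (切词). Segmenting "我是一个学生" gives: 我 / 是 / 一个 / 学生.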
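To make the idea concrete, here is a minimal sketch of forward maximum matching, a common dictionary-based way to cut a character sequence into words. It is only an illustration and is not the algorithm used by IKAnalyzer or by Baidu; the class name, the toy dictionary, and the maximum word length are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ForwardMaxMatch {

    // Scans left to right and, at each position, takes the longest
    // dictionary word (up to maxWordLen characters). If nothing matches,
    // it falls back to a single character.
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> words = new ArrayList<String>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Shrink the candidate until it is a known word or one character long.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical toy dictionary; a real segmenter ships a far larger one.
        Set<String> dict = new HashSet<String>(Arrays.asList("我", "是", "一个", "学生"));
        System.out.println(segment("我是一个学生", dict, 2));
    }
}
```

With this toy dictionary the sentence is cut into the four words 我, 是, 一个, 学生. If "学生" is removed from the dictionary, the same code falls back to the single characters 学 and 生, which is exactly the failure mode a good dictionary is meant to prevent.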

Chinese Word Segmentation and Its Effect on Search Engines

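How much does Chinese word segmentation really matter to a search engine? For a search engine, the most important thing is not finding every result: across tens of billions of web pages, returning everything means little, because nobody can read it all. What matters most is putting the most relevant results first, which is known as relevance ranking. Whether segmentation is accurate often directly affects how search results are ranked by relevance.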

A Brief Look at Implementing Segmentation

Two approaches to segmentation are presented here.

Source code download link

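The first approach uses the IKAnalyzer2012FF_u1.jar package, which targets Lucene 4.0 and later; for earlier Lucene versions, use IKAnalyzer2012.jar instead.
The two packages differ in the methods they expose.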

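This approach depends on a dictionary to segment accurately, so the real work becomes building a sensible dictionary.
Without one, it handles proper nouns poorly, and in the worst case it splits the input into single characters. For example, if I search for the personal name "王文路"
without any custom dictionary, the result is "王", "文", "路", which is clearly not what I want.
If I then run a global search with those tokens, I only want the records that contain "王文路", but everything that contains "王", "文", or "路"
is returned as well, which is a much larger result set than I intended. The hard part is how to keep the dictionary, the ontology, and so on updated in real time.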
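The effect on the result set can be shown with a small self-contained sketch. It does not use IKAnalyzer; it simply compares a search driven by single-character tokens with a search driven by the whole name. The class name, the `search` helper, and the sample documents are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenSearchDemo {

    // Returns every document that contains at least one of the query tokens,
    // which is how a naive "match any token" search behaves.
    static List<String> search(List<String> docs, List<String> tokens) {
        List<String> hits = new ArrayList<String>();
        for (String doc : docs) {
            for (String token : tokens) {
                if (doc.contains(token)) {
                    hits.add(doc);
                    break; // count each document once
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        List<String> docs = Arrays.asList("王文路的博客", "王老师的课程", "文学评论", "一条小路");

        // Tokens produced when the name is split into single characters.
        System.out.println(search(docs, Arrays.asList("王", "文", "路")).size());

        // Token produced when the dictionary knows the full name.
        System.out.println(search(docs, Arrays.asList("王文路")).size());
    }
}
```

With this sample data, the single-character tokens match all four documents, while the whole name matches only the first one.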

Here is the source code:

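import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public static String[] participle(String sen) {

    // Collect tokens in a list instead of joining and re-splitting a string
    List<String> res = new ArrayList<String>();

    // 1. Create the IKSegmenter over the input sentence
    StringReader reader = new StringReader(sen);
    // When true, the segmenter uses maximum word-length segmentation
    IKSegmenter ik = new IKSegmenter(reader, true);
    Lexeme lexeme = null;

    // 2. Iterate over the segmentation results
    try {
        while ((lexeme = ik.next()) != null) {

            // Filter out useless single-character tokens such as 的 and 是
            if (lexeme.getLexemeText().length() < 2) {
                continue;
            }
            res.add(lexeme.getLexemeText());
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        reader.close();
    }

    return res.toArray(new String[0]);
}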

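The second approach calls the Baidu APIStore API. Baidu's segmentation works well: it recognizes even personal names accurately,
and it returns a relevance score for each candidate word.
For example, calling the API with "王文路" returns: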

王文:0.222166
王文路:0.937766
文路:0.0862459

Here is the source code:

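import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public static String[] request(String httpUrl, String httpArg) {

    BufferedReader reader = null;
    String[] rets = null;

    httpUrl = httpUrl + "?" + httpArg;

    try {
        URL url = new URL(httpUrl);
        HttpURLConnection connection = (HttpURLConnection) url
                .openConnection();
        connection.setRequestMethod("GET");

        // Put the apikey into the HTTP header
        connection.setRequestProperty("apikey",  "07b9e5e1a45235358b1b39f26afa2dcd");
        connection.connect();

        InputStream is = connection.getInputStream();
        reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        String strRead = null;

        // Read the result lines (word:score) as pairs. Keeping each word with
        // its own score means two words with the same score are both kept.
        List<Map.Entry<String, Double>> entries = new ArrayList<Map.Entry<String, Double>>();
        while ((strRead = reader.readLine()) != null) {

            String[] ss = strRead.split(":");

            if (ss.length < 2) continue;

            entries.add(new AbstractMap.SimpleEntry<String, Double>(ss[0], Double.parseDouble(ss[1])));
        }

        // Sort by relevance score, highest first
        Collections.sort(entries, new Comparator<Map.Entry<String, Double>>() {

            public int compare(Map.Entry<String, Double> o1, Map.Entry<String, Double> o2) {

                return o2.getValue().compareTo(o1.getValue());
            }});

        rets = new String[entries.size()];

        for (int i = 0 ; i < entries.size() ; i ++) {
            rets[i] = entries.get(i).getKey();
        }

    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Close the reader even when the request or parsing fails
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    return rets;
}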

/**
 * Segments a sentence by calling the Baidu APIStore pullword API.
 *
 * @author 王文路
 * @date 2015-7-21
 * @param words the sentence to segment
 * @return the segmentation result
 * @throws UnsupportedEncodingException
 */
public static String[] pullword(String words) throws UnsupportedEncodingException{

    // URL-encode the input
    words = URLEncoder.encode(words,"UTF-8");

    String httpUrl = "http://apis.baidu.com/apistore/pullword/words";
    String httpArg = "source="+words+"&param1=0&param2=1";

    return request(httpUrl, httpArg);
}
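The parsing and ordering of the word:score lines can be tried offline, without calling the API. The following is a minimal sketch under that assumption: the class name, the `Scored` helper type, and the `sortByScore` method are hypothetical, and the input is the sample output for "王文路" shown earlier. Each word stays paired with its own score, so two words that happen to share a score are both kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ScoreSortDemo {

    // A word together with its relevance score.
    static class Scored {
        final String word;
        final double score;

        Scored(String word, double score) {
            this.word = word;
            this.score = score;
        }
    }

    // Parses "word:score" lines and returns the words, highest score first.
    static List<String> sortByScore(List<String> lines) {
        List<Scored> scored = new ArrayList<Scored>();
        for (String line : lines) {
            int idx = line.lastIndexOf(':');
            // Skip lines that lack a word or a score.
            if (idx <= 0 || idx == line.length() - 1) {
                continue;
            }
            double score = Double.parseDouble(line.substring(idx + 1).trim());
            scored.add(new Scored(line.substring(0, idx), score));
        }
        // Collections.sort is stable, so equal scores keep their input order.
        Collections.sort(scored, new Comparator<Scored>() {
            public int compare(Scored a, Scored b) {
                return Double.compare(b.score, a.score);
            }
        });
        List<String> words = new ArrayList<String>();
        for (Scored s : scored) {
            words.add(s.word);
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("王文:0.222166", "王文路:0.937766", "文路:0.0862459");
        System.out.println(sortByScore(lines));
    }
}
```

For the sample lines, the words come back in the order 王文路, 王文, 文路, so the full name ranks first.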

Baidu provides many other APIs as well, and they can be called in the same way.
