In natural language processing, the first task you face is extracting word-vector features from text. The sklearn module provides mature methods for language feature extraction; this post first uses CountVectorizer to build the vocabulary and count matrix, then TfidfTransformer to compute the TF-IDF vectors. The reason for choosing CountVectorizer rather than writing the code by hand is that in practice the dimensionality easily exceeds 100,000, so the resulting bag-of-words vectors are extremely sparse and a dense representation would consume enormous amounts of memory; sklearn stores the result in a sparse matrix format, which greatly speeds things up and cuts memory use.
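To make the memory argument concrete, here is a rough sketch (with made-up dimensions: 1,000 documents, a 100,000-word vocabulary, 0.1% non-zero entries) comparing the footprint of a dense array against scipy's CSR format, which is what CountVectorizer returns:
import numpy as np
from scipy import sparse
# Hypothetical corpus: 1,000 documents x 100,000 vocabulary terms,
# with 0.1% of the entries non-zero (typical bag-of-words sparsity)
X = sparse.random(1_000, 100_000, density=0.001, format='csr', dtype=np.float64)
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize           # ~760 MiB if stored densely
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes  # ~1.2 MB in CSR form
print(dense_bytes // 2**20, 'MiB dense vs', sparse_bytes // 2**10, 'KiB sparse')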
Building the bag-of-words vocabulary matrix
from sklearn.feature_extraction.text import CountVectorizer
# A small test corpus: a list of strings
s_l = ['Relevant words for each class or cluster are identified by computing a relevancy score rc for every word ti based on the documents in the class',
'or cluster and then selecting the highest scoring',
'words. These scores can be computed either by-89',
'aggregating the raw tf-idf features of all documents',
'in the group (Section 2.3.1), by aggregating these',
'features weighted by some classifier’s parameters',
'(Section 2.3.2), or directly by computing a score',
'for each word depending on the number of documents it occurs in from this class relative to other',
'classes (Section 2.3.3).']
# Initialize CountVectorizer. token_pattern customizes the tokenization regex;
# if omitted, the default r'(?u)\b\w\w+\b' keeps tokens of two or more word
# characters. Here we accept any run of letters, digits, and hyphens.
count_vect = CountVectorizer(token_pattern=r'[a-zA-Z0-9\-]+')
# fit_transform expects an iterable of documents, e.g. this list
X_train_counts = count_vect.fit_transform(s_l)
X_train_counts
Output:
<9x62 sparse matrix of type '<class 'numpy.int64'>'
with 95 stored elements in Compressed Sparse Row format>
This shows that X_train_counts is a 9×62 sparse matrix stored in CSR format.
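The shape and the non-zero count can also be read off directly, confirming what the repr above reports:
X_train_counts.shape  # (9, 62): 9 documents, 62 vocabulary terms
X_train_counts.nnz    # 95 stored (non-zero) elements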
Displaying the matrix in dense form
X_train_counts.todense()
Output:
matrix([[0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0,
1, 1, 0, 1, 0, 2, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 2, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
[0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
dtype=int64)
Inspecting the vocabulary
count_vect.vocabulary_
Output:
{'relevant': 44,
'words': 61,
'for': 26,
'each': 22,
...
'occurs': 34,
'from': 27,
'this': 56,
'relative': 42,
'to': 58,
'other': 38,
'classes': 14}
The keys on the left are the words and the values on the right are their indices. You can look up a word's index with count_vect.vocabulary_.get('relative'), or go the other way and recover the word at index 6 with list(filter(lambda x: x[1] == 6, count_vect.vocabulary_.items()))[0][0]; count_vect.vocabulary_ is in essence just a dict.
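The filter-based reverse lookup above scans the whole dict on every query; if you need many index-to-word lookups, it is cleaner to invert the dict once (or, in sklearn >= 1.0, use get_feature_names_out, which returns the words ordered by column index):
# Invert the vocabulary once: index -> word
index_to_word = {idx: word for word, idx in count_vect.vocabulary_.items()}
index_to_word[6]  # same result as the filter expression above
# Equivalent in sklearn >= 1.0: an array of words ordered by column index
count_vect.get_feature_names_out()[6]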
Building the TF-IDF vector matrix
Term Frequency-Inverse Document Frequency (TF-IDF) is a weighting scheme widely used in information retrieval and text mining. It is a statistical measure of how important a word is to one document within a document collection or corpus. A word's importance increases in proportion to how often it appears in the document, but decreases in proportion to how often it appears across the corpus: the more often a word occurs in one document and the fewer documents it occurs in overall, the better it characterizes that document. The scores are usually normalized; the formulas are as follows:
$$ tf(t_i)=\frac{\text{number of occurrences of } t_i \text{ in the document}}{\text{total number of words in the document}} \\ \ \\ idf(t_i)=\log\left(\frac{\text{total number of documents in the dataset}}{\text{number of documents containing } t_i}\right)=\log\frac{|N|}{|\{k\in N:t_i\in k\}|} $$
Sometimes, to keep the \(idf\) denominator from ever being zero, one is added to the denominator.
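As a sanity check on the formulas, here is a minimal NumPy sketch (not sklearn's exact implementation; the function name tf_idf is made up) that computes tf-idf straight from a count matrix, using the plain definitions above with the +1 smoothing in the idf denominator:
import numpy as np
def tf_idf(counts):
    # Plain tf-idf from a dense (n_docs, n_terms) count matrix
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    tf = counts / counts.sum(axis=1, keepdims=True)  # term frequency per document
    df = (counts > 0).sum(axis=0)                    # documents containing each term
    idf = np.log(n_docs / (1 + df))                  # +1 keeps the denominator non-zero
    return tf * idf
tf_idf(X_train_counts.todense())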
sklearn's TfidfTransformer makes this very convenient. Note that its values differ slightly from the plain formulas above: by default it uses the smoothed variant \(idf(t_i)=\ln\frac{1+|N|}{1+df(t_i)}+1\) and then L2-normalizes each row.
from sklearn.feature_extraction.text import TfidfTransformer
tf_idf_transformer = TfidfTransformer()
tf_idf_matrix = tf_idf_transformer.fit_transform(X_train_counts)  # takes the bag-of-words count matrix
tf_idf_matrix.todense()  # the result is sparse too; densify it for display
Output:
matrix([[0. , 0. , 0. , 0.17690932, 0. ,
0. , 0. , 0.20945535, 0.20945535, 0. ,
0.13590618, 0. , 0. , 0.35381864, 0. ,
0. , 0.17690932, 0. , 0.17690932, 0. ,
0. , 0.15381755, 0.17690932, 0. , 0.20945535,
0. , 0.35381864, 0. , 0. , 0. ,
0.20945535, 0.15381755, 0. , 0. , 0. ,
0. , 0.17690932, 0.15381755, 0. , 0. ,
0. , 0.20945535, 0. , 0.20945535, 0.20945535,
0. , 0.17690932, 0. , 0. , 0. ,
0. , 0. , 0. , 0.24254304, 0. ,
0. , 0. , 0.20945535, 0. , 0. ,
0.17690932, 0.17690932],
...
[0. , 0.35681845, 0.7136369 , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0.48588431,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0.35681845,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. ]])
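Finally, the two steps can be collapsed into one: sklearn's TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer, so the pipeline above can also be written as:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(token_pattern=r'[a-zA-Z0-9\-]+')
tf_idf_matrix = tfidf_vect.fit_transform(s_l)  # same result as the two-step pipeline above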