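[R] jieba word segmentation and part-of-speech extraction (worked example: extracting the 100 most frequent adjectives from Zhihu question titles) (study notes, March 25)
These are study notes only; if anything is stated incorrectly, corrections are welcome. This assignment builds on the material in my March 24 notes. The task set by the instructor was: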
http://www.stutter.cn/data/attachment/forum/20250103/1735868043423_0.png
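The hard part of this task is classifying words by part of speech. Using this task as an example, this note shows how to use jieba segmentation to tag the part of speech of Chinese words.
0. Choosing packages
The essential package for Chinese word segmentation is jiebaR (loaded together with its dictionary package jiebaRD):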
<p><pre> <code class="prism language-java"><span class="token function">library</span><span class="token punctuation">(</span>jiebaR<span class="token punctuation">)</span>
<span class="token function">library</span><span class="token punctuation">(</span>jiebaRD<span class="token punctuation">)</span> # for word segmentation
</code></pre></p>
For plotting we use
<p><pre> <code class="prism language-java"><span class="token function">library</span><span class="token punctuation">(</span>ggplot2<span class="token punctuation">)</span> # for plotting
</code></pre></p>
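The data can be read without any extra package by using base R's read.csv(), but that is slow, so I recommend read_csv() instead, as mentioned in my previous note.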
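When reading a CSV file, read_csv() keeps string columns as character vectors rather than converting them to factors, and it is more efficient on large data.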
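A similar function is fread() from the data.table package; the similarities and differences between the two are discussed in an earlier blog post by another author.
Back to read_csv(): it requires
<p><pre> <code class="prism language-java"><span class="token function">library</span><span class="token punctuation">(</span>readr<span class="token punctuation">)</span> # for reading the data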
</code></pre></p>
In addition we need
<p><pre> <code class="prism language-java"><span class="token function">library</span><span class="token punctuation">(</span>tidyverse<span class="token punctuation">)</span> # provides enframe() (via tibble)
<span class="token function">library</span><span class="token punctuation">(</span>dplyr<span class="token punctuation">)</span> # for the filtering function <span class="token function">filter</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre></p>
How these two packages are used is covered below.
1. Reading the data
<p><pre> <code class="prism language-java"># working directory
<span class="token function">setwd</span><span class="token punctuation">(</span><span class="token string">"D://1Study//R//CH05"</span><span class="token punctuation">)</span>
<span class="token function">getwd</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
# read the data
data_titles <span class="token operator">=</span> <span class="token function">read_csv</span><span class="token punctuation">(</span><span class="token string">"train_data.csv"</span><span class="token punctuation">,</span>col_names <span class="token operator">=</span> TRUE<span class="token punctuation">)</span>
# col_names <span class="token operator">=</span> TRUE plays the same role as header <span class="token operator">=</span> TRUE in read<span class="token punctuation">.</span>csv
data_titles
# keep a separate copy
question_titles <span class="token operator">=</span> data<span class="token punctuation">.</span><span class="token function">frame</span><span class="token punctuation">(</span>data_titles<span class="token punctuation">[</span><span class="token punctuation">,</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">)</span> # keep only the title column in a new data frame so the original data stays untouched
</code></pre></p>
Of course, TRUE is already the default value of col_names, so it can be omitted.
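As a small illustration of the read-then-copy step, here is a minimal sketch that uses only base R so that it runs without extra packages. The two-row file and its question_id column are made up for the example; only the question_title column name comes from the code in this note. With readr installed, read_csv(csv_path) would be the counterpart of the read.csv() call.

```r
# Hypothetical two-column file standing in for train_data.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(c("question_id,question_title",
             "1,How to learn R",
             "2,What makes a good plot"),
           csv_path)

# header = TRUE in read.csv() corresponds to col_names = TRUE in read_csv()
toy <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Copy only the title column; drop = FALSE keeps the result a data frame
titles <- toy[, "question_title", drop = FALSE]
str(titles)
```

Selecting the column by name instead of by position is a design choice of this sketch; it keeps working if the column order in the file changes.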
2. Chinese word segmentation and part-of-speech tagging
<p><pre> <code class="prism language-java">seg <span class="token operator"><</span><span class="token operator">-</span> <span class="token function">worker</span><span class="token punctuation">(</span><span class="token string">'tag'</span><span class="token punctuation">)</span> # build a worker that does part-of-speech tagging
seg_question <span class="token operator">=</span> <span class="token function">segment</span><span class="token punctuation">(</span>question_titles$question_title<span class="token punctuation">,</span>seg<span class="token punctuation">)</span> # segment all the titles
seg_question # show every word in the titles with its POS tag; this takes about <span class="token number">15</span> seconds
<span class="token function">str</span><span class="token punctuation">(</span>seg_question<span class="token punctuation">)</span> # inspect the data type
title_table <span class="token operator"><</span><span class="token operator">-</span> <span class="token function">enframe</span><span class="token punctuation">(</span>seg_question<span class="token punctuation">)</span>
title_table
</code></pre></p>
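To extract parts of speech, (1) we first need to build a tagging worker.
The previous note showed that a stop-word file (a ".txt" file) can be supplied when the worker is built. The seg object here works on the same principle, except that it is built for part-of-speech tagging. No stop-word removal is needed here, because adjectives are not stop words.
(2) We then call segment() with the worker defined above to segment the titles:
<p><pre> <code class="prism language-java">seg_question <span class="token operator">=</span> <span class="token function">segment</span><span class="token punctuation">(</span>question_titles$question_title<span class="token punctuation">,</span>seg<span class="token punctuation">)</span> # segment all the titles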
</code></pre></p>
Using
<p><pre> <code class="prism language-java">seg_question <span class="token operator">=</span> <span class="token function">tagging</span><span class="token punctuation">(</span>question_titles$question_title<span class="token punctuation">,</span>seg<span class="token punctuation">)</span>
</code></pre></p>
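instead also works.
Either way, the result is essentially a vector with attributes: each word carries its part-of-speech tag as its name. That is not very convenient to work with,
so we convert it to a data frame for later use.
Here we use enframe():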
<p><pre> <code class="prism language-java">title_table <span class="token operator"><</span><span class="token operator">-</span> <span class="token function">enframe</span><span class="token punctuation">(</span>seg_question<span class="token punctuation">)</span>
</code></pre></p>
This is a function from the tibble package; it converts a named vector into a data frame with a name column and a value column.
http://www.stutter.cn/data/attachment/forum/20250103/1735868043423_1.png
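If we instead used last lesson's questionFreq = as.data.frame(table(seg_question)), the ordinary data frame would only give a frequency table, and the part-of-speech tags would be lost.
This approach is borrowed from Huang Tianyuan's (黄天元) blog post "R语言自然语言处理:词性标注与命名实体识别".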
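To make the conversion concrete, here is a minimal sketch on toy data. The vector seg_toy is a hypothetical stand-in for the output of segment() with a 'tag' worker (words as values, POS tags as names). The data.frame() call is a base-R equivalent written for illustration, so the sketch runs without tibble; with tibble loaded, enframe(seg_toy) gives the same two columns as a tibble.

```r
# Hypothetical stand-in for a tagged segmentation result:
# values are words, names are part-of-speech tags
seg_toy <- c(r = "我", a = "好", n = "问题", a = "简单", a = "好")

# Base-R equivalent of tibble::enframe(seg_toy):
# one row per word, tag in `name`, word in `value`
toy_table <- data.frame(name = names(seg_toy),
                        value = unname(seg_toy),
                        stringsAsFactors = FALSE)
print(toy_table)

# By contrast, table() only counts the words and discards the tags
print(as.data.frame(table(unname(seg_toy))))
```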
3. Keeping the adjectives and filtering out everything else
<p><pre> <code class="prism language-java">#write<span class="token punctuation">.</span><span class="token function">csv</span><span class="token punctuation">(</span>title_table<span class="token punctuation">,</span>file <span class="token operator">=</span> <span class="token string">"title_table.csv"</span><span class="token punctuation">,</span>row<span class="token punctuation">.</span>names <span class="token operator">=</span> TRUE<span class="token punctuation">)</span>
# save as a CSV table; this takes about <span class="token number">10</span> seconds on my machine
adj_question <span class="token operator"><</span><span class="token operator">-</span> <span class="token function">filter</span><span class="token punctuation">(</span><span class="token punctuation">.</span>data<span class="token operator">=</span>title_table<span class="token punctuation">,</span>name <span class="token operator">==</span> <span class="token string">"a"</span><span class="token punctuation">)</span>
adj_question
</code></pre></p>
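filter() belongs to the dplyr package. Here the name column holds the part-of-speech tag, and "a" is the tag for adjectives. For how to use this function, see the earlier blog post "R语言dplyr包:高效数据处理函数(、、、)".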
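The filtering step can be sketched on toy data as follows. The table below is hypothetical and only mimics the name/value layout produced by enframe(). The subsetting line is a base-R equivalent of filter(.data = toy_table, name == "a"), used here so that the sketch runs without dplyr.

```r
# Hypothetical tag/word table in the same layout as title_table
toy_table <- data.frame(name = c("r", "a", "n", "a", "a"),
                        value = c("我", "好", "问题", "简单", "好"),
                        stringsAsFactors = FALSE)

# Keep only the rows whose tag is "a" (adjective);
# equivalent to dplyr::filter(toy_table, name == "a")
adj_toy <- toy_table[toy_table$name == "a", , drop = FALSE]
print(adj_toy)
```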
4. Building the frequency table
The remaining steps are similar to those in the previous note, so they are not explained in detail here.
<p><pre> <code class="prism language-java">adj <span class="token operator">=</span> data<span class="token punctuation">.</span><span class="token function">frame</span><span class="token punctuation">(</span>adj_question<span class="token punctuation">[</span><span class="token punctuation">,</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">)</span> # copy all adjectives into a new data frame so the original data stays untouched
adj
adjFreq <span class="token operator">=</span> as<span class="token punctuation">.</span>data<span class="token punctuation">.</span><span class="token function">frame</span><span class="token punctuation">(</span><span class="token function">table</span><span class="token punctuation">(</span>adj<span class="token punctuation">)</span><span class="token punctuation">)</span> # build the frequency table
# drop adjectives that occur too rarely (a count with fewer than 2 digits, i.e. below 10); this step is optional
adjFreq <span class="token operator">=</span> adjFreq<span class="token punctuation">[</span><span class="token function">nchar</span><span class="token punctuation">(</span>as<span class="token punctuation">.</span><span class="token function">character</span><span class="token punctuation">(</span>adjFreq<span class="token punctuation">[</span><span class="token punctuation">,</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token operator">>=</span><span class="token number">2</span><span class="token punctuation">,</span><span class="token punctuation">]</span>
adjFreq <span class="token operator">=</span> adjFreq<span class="token punctuation">[</span><span class="token function">order</span><span class="token punctuation">(</span><span class="token operator">-</span>adjFreq$Freq<span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token punctuation">]</span> # sort by descending frequency
data <span class="token operator">=</span> adjFreq<span class="token punctuation">[</span><span class="token number">1</span><span class="token operator">:</span><span class="token number">100</span><span class="token punctuation">,</span><span class="token punctuation">]</span> # take the <span class="token number">100</span> most frequent adjectives in the question titles
</code></pre></p>
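The frequency-table steps can be checked on toy data. The sketch below uses made-up adjective counts and a column named word, both chosen for this example. It shows that the digit-count test used in the note is the same as keeping counts of at least 10, and it uses head() for the top-N step, which is a substitution made here because head() returns fewer rows instead of NA rows when the table has fewer than 100 entries.

```r
# Hypothetical adjectives with counts 12, 10, 3 and 1
adj_words <- c(rep("好", 12), rep("简单", 10), rep("新", 3), rep("难", 1))

# Frequency table with explicit column names `word` and `Freq`
freq <- as.data.frame(table(word = adj_words), stringsAsFactors = FALSE)

# "Count has at least 2 digits" is the same condition as "count >= 10"
by_digits <- nchar(as.character(freq$Freq)) >= 2
by_value <- freq$Freq >= 10
stopifnot(identical(by_digits, by_value))

# Filter with a logical index, sort by descending count, take the top N
freq <- freq[by_value, ]
freq <- freq[order(-freq$Freq), ]
top_n <- head(freq, 100)
print(top_n)
```

A logical index is used for the filter because a negative which() index selects no rows at all when nothing matches the condition.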
5. Plotting
Based on the frequency table obtained in the previous step
http://www.stutter.cn/data/attachment/forum/20250103/1735868043423_2.png
we draw the plot:
<p><pre> <code class="prism language-java"># reorder the bars (fix the factor levels in frequency order)
data$adj <span class="token operator">=</span> <span class="token function">factor</span><span class="token punctuation">(</span>data$adj<span class="token punctuation">,</span>levels <span class="token operator">=</span> data$adj<span class="token punctuation">)</span>
<span class="token function">ggplot</span><span class="token punctuation">(</span>data<span class="token punctuation">,</span><span class="token function">aes</span><span class="token punctuation">(</span>x<span class="token operator">=</span>adj<span class="token punctuation">,</span>y<span class="token operator">=</span>Freq<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token operator">+</span>
<span class="token function">geom_bar</span><span class="token punctuation">(</span>stat<span class="token operator">=</span><span class="token string">"identity"</span><span class="token punctuation">)</span><span class="token operator">+</span>
<span class="token function">theme</span><span class="token punctuation">(</span>axis<span class="token punctuation">.</span>text<span class="token punctuation">.</span>x <span class="token operator">=</span> <span class="token function">element_text</span><span class="token punctuation">(</span>angle <span class="token operator">=</span> <span class="token number">60</span><span class="token punctuation">,</span>hjust <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token operator">+</span>
<span class="token function">xlab</span><span class="token punctuation">(</span><span class="token string">"形容词"</span><span class="token punctuation">)</span><span class="token operator">+</span>
<span class="token function">ylab</span><span class="token punctuation">(</span><span class="token string">"频数"</span><span class="token punctuation">)</span><span class="token operator">+</span>
<span class="token function">labs</span><span class="token punctuation">(</span>title <span class="token operator">=</span> <span class="token string">'问题标题的频繁词前100个形容词'</span><span class="token punctuation">)</span>
</code></pre></p>
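The reordering line matters because ggplot2 draws a discrete x axis in the order of the factor levels, not in row order. Here is a minimal base-R sketch of what factor(x, levels = x) does, on a made-up three-row table that is already sorted by frequency:

```r
# Hypothetical top-3 table, already sorted by descending frequency
top <- data.frame(adj = c("好", "简单", "新"),
                  Freq = c(12L, 10L, 3L),
                  stringsAsFactors = FALSE)

# Without explicit levels, the levels are sorted by the locale's
# collation order, which need not match the frequency order
default_levels <- levels(factor(top$adj))

# With levels = top$adj, the levels follow the current row order,
# so the bars appear from most to least frequent
top$adj <- factor(top$adj, levels = top$adj)
print(levels(top$adj))
```

This only works when the values are unique, which holds for a frequency table with one row per word.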
Finally, here is the complete code and the resulting bar chart.
<p><pre> <code class="prism language-java">######
# Zhihu question-tag prediction training set (train_data<span class="token punctuation">.</span>csv)
# extract the <span class="token number">100</span> most frequent adjectives in the question titles
<span class="token function">library</span><span class="token punctuation">(</span>jiebaR<span class="token punctuation">)</span>
<span class="token function">library</span><span class="token punctuation">(</span>jiebaRD<span class="token punctuation">)</span> # for word segmentation
<span class="token function">library</span><span class="token punctuation">(</span>ggplot2<span class="token punctuation">)</span> # for plotting
<span class="token function">library</span><span class="token punctuation">(</span>readr<span class="token punctuation">)</span> # for reading the data
#install<span class="token punctuation">.</span><span class="token function">packages</span><span class="token punctuation">(</span><span class="token string">"tidyverse"</span><span class="token punctuation">)</span>
<span class="token function">library</span><span class="token punctuation">(</span>tidyverse<span class="token punctuation">)</span> # provides enframe() (via tibble)
<span class="token function">library</span><span class="token punctuation">(</span>dplyr<span class="token punctuation">)</span> # for the filtering function <span class="token function">filter</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
#####
#<span class="token number">1.</span> Reading the data
#####
# working directory
<span class="token function">setwd</span><span class="token punctuation">(</span><span class="token string">"D://1Study//R//CH05"</span><span class="token punctuation">)</span>
<span class="token function">getwd</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
# read the data
data_titles <span class="token operator">=</span> <span class="token function">read_csv</span><span class="token punctuation">(</span><span class="token string">"train_data.csv"</span><span class="token punctuation">)</span> # read_csv is more efficient on large data
data_titles
# keep a separate copy
question_titles <span class="token operator">=</span> data<span class="token punctuation">.</span><span class="token function">frame</span><span class="token punctuation">(</span>data_titles<span class="token punctuation">[</span><span class="token punctuation">,</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">)</span> # keep only the title column in a new data frame so the original data stays untouched
#####
#<span class="token number">2.</span> Chinese word segmentation and part-of-speech tagging
#####
seg <span class="token operator"><</span><span class="token operator">-</span> <span class="token function">worker</span><span class="token punctuation">(</span><span class="token string">'tag'</span><span class="token punctuation">)</span> # build a worker that does part-of-speech tagging
# no stop-word removal is needed here, because adjectives are not stop words
seg_question <span class="token operator">=</span> <span class="token function">segment</span><span class="token punctuation">(</span>question_titles$question_title<span class="token punctuation">,</span>seg<span class="token punctuation">)</span> # segment all the titles
seg_question # show every word in the titles with its POS tag; this takes about <span class="token number">15</span> seconds
<span class="token function">str</span><span class="token punctuation">(</span>seg_question<span class="token punctuation">)</span> # inspect the data type
# seg_question is essentially a vector with attributes, which is not very convenient to work with,
# so we convert it to a data frame for later use
title_table <span class="token operator"><</span><span class="token operator">-</span> <span class="token function">enframe</span><span class="token punctuation">(</span>seg_question<span class="token punctuation">)</span>
# store the data as a tibble with a name column and a value column
# with last lesson's questionFreq <span class="token operator">=</span> as<span class="token punctuation">.</span>data<span class="token punctuation">.</span><span class="token function">frame</span><span class="token punctuation">(</span><span class="token function">table</span><span class="token punctuation">(</span>seg_question<span class="token punctuation">)</span><span class="token punctuation">)</span>, an ordinary data frame would only give a frequency table without the POS tags
title_table
#####
#<span class="token number">3.</span> Keep the adjectives and filter out everything else
######
#write<span class="token punctuation">.</span><span class="token function">csv</span><span class="token punctuation">(</span>title_table<span class="token punctuation">,</span>file <span class="token operator">=</span> <span class="token string">"title_table.csv"</span><span class="token punctuation">,</span>row<span class="token punctuation">.</span>names <span class="token operator">=</span> TRUE<span class="token punctuation">)</span> # save as a CSV table; this takes about <span class="token number">10</span> seconds
adj_question <span class="token operator"><</span><span class="token operator">-</span> <span class="token function">filter</span><span class="token punctuation">(</span><span class="token punctuation">.</span>data<span class="token operator">=</span>title_table<span class="token punctuation">,</span>name <span class="token operator">==</span> <span class="token string">"a"</span><span class="token punctuation">)</span>
adj_question
#####
#<span class="token number">4.</span> Building the frequency table
######
adj <span class="token operator">=</span> data<span class="token punctuation">.</span><span class="token function">frame</span><span class="token punctuation">(</span>adj_question<span class="token punctuation">[</span><span class="token punctuation">,</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">)</span> # copy all adjectives into a new data frame so the original data stays untouched
adj
adjFreq <span class="token operator">=</span> as<span class="token punctuation">.</span>data<span class="token punctuation">.</span><span class="token function">frame</span><span class="token punctuation">(</span><span class="token function">table</span><span class="token punctuation">(</span>adj<span class="token punctuation">)</span><span class="token punctuation">)</span> # build the frequency table
# drop adjectives that occur too rarely (a count with fewer than 2 digits, i.e. below 10); this step is optional
adjFreq <span class="token operator">=</span> adjFreq<span class="token punctuation">[</span><span class="token function">nchar</span><span class="token punctuation">(</span>as<span class="token punctuation">.</span><span class="token function">character</span><span class="token punctuation">(</span>adjFreq<span class="token punctuation">[</span><span class="token punctuation">,</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token operator">>=</span><span class="token number">2</span><span class="token punctuation">,</span><span class="token punctuation">]</span>
adjFreq <span class="token operator">=</span> adjFreq<span class="token punctuation">[</span><span class="token function">order</span><span class="token punctuation">(</span><span class="token operator">-</span>adjFreq$Freq<span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token punctuation">]</span> # sort by descending frequency
data <span class="token operator">=</span> adjFreq<span class="token punctuation">[</span><span class="token number">1</span><span class="token operator">:</span><span class="token number">100</span><span class="token punctuation">,</span><span class="token punctuation">]</span> # take the <span class="token number">100</span> most frequent adjectives in the question titles
#####
#<span class="token number">5.</span> Plotting
######
# reorder the bars (fix the factor levels in frequency order)
data$adj <span class="token operator">=</span> <span class="token function">factor</span><span class="token punctuation">(</span>data$adj<span class="token punctuation">,</span>levels <span class="token operator">=</span> data$adj<span class="token punctuation">)</span>
<span class="token function">ggplot</span><span class="token punctuation">(</span>data<span class="token punctuation">,</span><span class="token function">aes</span><span class="token punctuation">(</span>x<span class="token operator">=</span>adj<span class="token punctuation">,</span>y<span class="token operator">=</span>Freq<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token operator">+</span>
<span class="token function">geom_bar</span><span class="token punctuation">(</span>stat<span class="token operator">=</span><span class="token string">"identity"</span><span class="token punctuation">)</span><span class="token operator">+</span>
<span class="token function">theme</span><span class="token punctuation">(</span>axis<span class="token punctuation">.</span>text<span class="token punctuation">.</span>x <span class="token operator">=</span> <span class="token function">element_text</span><span class="token punctuation">(</span>angle <span class="token operator">=</span> <span class="token number">60</span><span class="token punctuation">,</span>hjust <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token operator">+</span>
<span class="token function">xlab</span><span class="token punctuation">(</span><span class="token string">"形容词"</span><span class="token punctuation">)</span><span class="token operator">+</span>
<span class="token function">ylab</span><span class="token punctuation">(</span><span class="token string">"频数"</span><span class="token punctuation">)</span><span class="token operator">+</span>
<span class="token function">labs</span><span class="token punctuation">(</span>title <span class="token operator">=</span> <span class="token string">'问题标题的频繁词前100个形容词'</span><span class="token punctuation">)</span>
</code></pre></p>
http://www.stutter.cn/data/attachment/forum/20250103/1735868043423_3.png
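The above is my understanding of jieba segmentation and part-of-speech extraction in R, together with the worked example of extracting the 100 most frequent adjectives from Zhihu question titles. If I have misunderstood anything, corrections are welcome.
Running the code above as-is produces a figure that differs from the one shown above; it produces the figure shown in the correction from the morning of March 29 instead. For the reason, see the correction from the afternoon of March 29. Thanks to the classmate who pointed this out.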
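------------------Correction, March 26----------------------------
The original comment was wrong: this step filters out adjectives that occur too rarely, not keywords below a minimum length. The step is optional; including it simply makes the later sorting a little faster.
Thanks to the classmate who reminded me.
The content before March 26 was
It has now been changed to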
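------------------Correction, morning of March 29----------------------------
A classmate reported running the same code as mine but getting a different figure. Today I ran the original code again and got the following figure: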
http://www.stutter.cn/data/attachment/forum/20250103/1735868043423_6.png