admin posted on 2024-12-20 16:33:21

Graduation Project: Spam Email (SMS) Classification Algorithm Implementation

2 How Spam SMS/Email Classification Works

Spam emails usually carry advertisements or false information, and sometimes even viruses, pornography, or other harmful content. Large volumes of spam not only annoy users but also waste network resources.

Online public opinion is one form of social opinion. It forms quickly, has broad influence, and is easy to organize and mobilize, so its quality strongly affects social stability. Improving opinion-analysis capabilities, so that the nature of published opinions can be determined and the harm of negative ones avoided, is a serious challenge for the Internet.

If we split emails into spam (harmful) and ham (normal), and online opinions into negative (harmful) and positive, then both spam filtering and opinion analysis can be treated as binary classification of short texts.

http://www.stutter.cn/data/attachment/forum/20241220/1734683601243_0.png

2.1 A Common Classifier: the Bayesian Classifier

The Bayesian approach solves a classic probability problem: Box 1 contains 20 red balls and 20 white balls; Box 2 contains 10 white balls and 30 red balls. A box is chosen at random, and a ball drawn from it turns out to be red. What is the probability that the ball came from Box 1?

Using Bayes' rule to detect spam works the same way: from already-labeled messages we estimate the probability of each feature value (e.g. the probability that the word "tea" appears in spam versus in ham), which gives us a classification model. For a new message we then extract its feature values and combine them with the model to decide its class.

Bayes' formula:

P(B|A) = P(A|B) * P(B) / P(A)

P(B|A): the probability of B given that A has occurred. For the box problem: given that the ball is red, what is the probability it came from Box 1?

P(A|B): the probability of drawing a red ball given that Box 1 was chosen.

P(B): the probability of choosing Box 1.

P(A): the probability of drawing a red ball.
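Plugging in the numbers: P(red|Box1) = 20/40 = 0.5, P(red|Box2) = 30/40 = 0.75, and each box is picked with probability 0.5, so Bayes' rule gives 0.4. A few lines of Python confirm the arithmetic:

```python
# Two-box problem from above:
# Box 1: 20 red + 20 white -> P(red | Box1) = 0.5
# Box 2: 30 red + 10 white -> P(red | Box2) = 0.75
p_box1, p_box2 = 0.5, 0.5                 # each box equally likely to be picked
p_red_given_box1 = 20 / 40
p_red_given_box2 = 30 / 40

# total probability of drawing a red ball (law of total probability)
p_red = p_box1 * p_red_given_box1 + p_box2 * p_red_given_box2

# Bayes' rule: P(Box1 | red) = P(red | Box1) * P(Box1) / P(red)
p_box1_given_red = p_red_given_box1 * p_box1 / p_red
print(p_box1_given_red)  # 0.4
```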

Substituting spam detection into the formula:

P(B|A): given that a message contains the word "tea", what is the probability that it is spam?

P(A|B): given that a message is spam, what is the probability that it contains the word "tea"?

P(B): the overall probability of spam.

P(A): the probability that "tea" appears among all feature values.
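To make the analogy concrete, here is a minimal sketch of Bayes-style spam classification using scikit-learn's MultinomialNB; the four "emails" and their words are made up for illustration and are not from the dataset (labels follow the post's convention: 0 for spam, 1 for ham):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Four made-up emails (illustrative only); 0 = spam, 1 = ham
emails = [
    "cheap tea discount buy now",        # spam
    "limited offer cheap tea",           # spam
    "meeting notes attached",            # ham
    "lunch tomorrow with the team",      # ham
]
labels = [0, 0, 1, 1]

# Word counts are the feature values; MultinomialNB estimates P(word | class)
# and P(class) from them, then applies Bayes' rule to new messages
vec = CountVectorizer()
X = vec.fit_transform(emails)
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vec.transform(["cheap tea offer"])))  # [0] -> classified as spam
```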

http://www.stutter.cn/data/attachment/forum/20241220/1734683601243_1.png

3 The Dataset

A Chinese email dataset is used, collected by senior Dancheng himself via web crawling plus manual filtering.

The dataset contains a "data" folder together with "full" and "delay" folders.

The "data" folder contains multiple second-level folders, and those second-level folders hold the email texts; each text file is one email. The "full" folder contains an index file that records the label of each email text.

http://www.stutter.cn/data/attachment/forum/20241220/1734683601243_2.png

Dataset visualization:

http://www.stutter.cn/data/attachment/forum/20241220/1734683601243_3.png

4 Data Preprocessing

This step extracts the email bodies and their labels into separate files, strips the non-Chinese characters from each email, and segments the text into words.

An email looks roughly like the following:

http://www.stutter.cn/data/attachment/forum/20241220/1734683601243_4.png

Besides the body text, each email sample carries other information, such as the sender's and recipient's addresses. Since spam classification is treated here simply as a text-classification task, that information is ignored.

All email samples are read recursively from the directory tree, segmented with jieba, and written to a single text file, one email per line:

```python
import re
import os
import codecs
import jieba

# Strip all non-Chinese characters
def clean_str(string):
    string = re.sub(r"[^\u4e00-\u9fff]", " ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip()

def get_data_in_a_file(original_path, save_path='all_email.txt'):
    files = os.listdir(original_path)
    for file in files:
        if os.path.isdir(original_path + '/' + file):
            # recurse into subdirectories
            get_data_in_a_file(original_path + '/' + file, save_path=save_path)
        else:
            email = ''
            # errors='ignore' is needed, otherwise decoding raises an exception
            f = codecs.open(original_path + '/' + file, 'r', 'gbk', errors='ignore')
            for line in f:
                line = clean_str(line)
                email += line
            f.close()
            # Appending with mode 'a' inside the recursion turns out to be much
            # faster than collecting everything and writing once with mode 'w'
            f = open(save_path, 'a', encoding='utf8')
            email = [word for word in jieba.cut(email) if word.strip() != '']
            f.write(' '.join(email) + '\n')

print('Storing emails in a file ...')
get_data_in_a_file('data', save_path='all_email.txt')
print('Store emails finished !')
```
Then write the sample labels to a separate file, with 0 for spam and 1 for ham:

```python
def get_label_in_a_file(original_path, save_path='label.txt'):
    f = open(original_path, 'r')
    label_list = []
    for line in f:
        # each line of the index file starts with 'spam' or 'ham'
        if line[0] == 's':
            label_list.append('0')   # spam
        elif line[0] == 'h':
            label_list.append('1')   # ham
    f.close()
    f = open(save_path, 'w', encoding='utf8')
    f.write('\n'.join(label_list))
    f.close()

print('Storing labels in a file ...')
get_label_in_a_file('index', save_path='label.txt')
print('Store labels finished !')
```
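The training code in Section 6 also calls a `get_label_list` function that never appears in the post; here is a minimal version inferred from how it is used (the NumPy array return type is an assumption, chosen so that fancy indexing with a shuffled index works):

```python
import numpy as np

def get_label_list(label_file_name):
    # Read one label per line ('0' = spam, '1' = ham) into an int array,
    # so that indexing like y[index] works later
    with open(label_file_name, 'r', encoding='utf8') as f:
        return np.array([int(line.strip()) for line in f if line.strip() != ''])
```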
5 Feature Extraction

This step converts the text data into numeric features; TF-IDF is used here.

TF-IDF stands for Term Frequency - Inverse Document Frequency. The formula is as follows:

http://www.stutter.cn/data/attachment/forum/20241220/1734683601243_5.png

A word's IDF is the same across all documents, while its TF differs per document. When a word has both a high TF and a high IDF for a document, it appears often in that document but rarely in the others. Such a word is therefore important to that document and helps distinguish it.
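As a concrete check, the plain tf · idf product can be computed by hand on a toy three-document corpus (this uses the textbook tf × log(N/df) form; scikit-learn's TfidfVectorizer applies smoothing and normalization, so its numbers differ slightly):

```python
import math

docs = [
    "tea discount tea",   # 'tea' appears twice here
    "meeting tomorrow",
    "tea meeting",
]
N = len(docs)

def tf_idf(word, doc):
    words = doc.split()
    tf = words.count(word) / len(words)               # term frequency in this document
    df = sum(1 for d in docs if word in d.split())    # number of documents containing the word
    idf = math.log(N / df)                            # inverse document frequency
    return tf * idf

# 'tea' is frequent in docs[0] but also occurs in docs[2], so its idf is modest:
print(tf_idf("tea", docs[0]))  # (2/3) * log(3/2) ≈ 0.270
```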

http://www.stutter.cn/data/attachment/forum/20241220/1734683601243_6.png

```python
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenizer_jieba(line):
    # tokenize with jieba
    return [li for li in jieba.cut(line) if li.strip() != '']

def tokenizer_space(line):
    # tokenize on whitespace
    return [li for li in line.split() if li.strip() != '']

def get_data_tf_idf(email_file_name):
    # The email samples are already segmented, with words separated by spaces,
    # so tokenizer=tokenizer_space is enough here
    vectoring = TfidfVectorizer(input='content', tokenizer=tokenizer_space, analyzer='word')
    content = open(email_file_name, 'r', encoding='utf8').readlines()
    x = vectoring.fit_transform(content)
    return x, vectoring
```
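A quick sanity check of this vectorizer on a toy pre-segmented corpus (the three one-line "documents" below are made up and merely stand in for lines of all_email.txt; tokenizer_space is repeated so the snippet is self-contained):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenizer_space(line):
    # the documents are already segmented, so splitting on spaces is enough
    return [li for li in line.split() if li.strip() != '']

# three made-up "pre-segmented" documents standing in for lines of all_email.txt
content = ["买 茶叶 优惠", "明天 开会", "茶叶 促销 优惠"]

vectoring = TfidfVectorizer(input='content', tokenizer=tokenizer_space, analyzer='word')
x = vectoring.fit_transform(content)

print(x.shape)  # (3, 6): 3 documents, 6 distinct words
```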
6 Training a Classifier

Here is a simple training example (the code below fits a linear SVM; logistic regression and random forest alternatives are left commented out):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import svm, ensemble, naive_bayes
from sklearn.model_selection import train_test_split
from sklearn import metrics

if __name__ == "__main__":
    np.random.seed(1)
    email_file_name = 'all_email.txt'
    label_file_name = 'label.txt'
    x, vectoring = get_data_tf_idf(email_file_name)
    y = get_label_list(label_file_name)
    # print('x.shape : ', x.shape)
    # print('y.shape : ', y.shape)

    # shuffle all samples
    index = np.arange(len(y))
    np.random.shuffle(index)
    x = x[index]
    y = y[index]
    # split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    clf = svm.LinearSVC()
    # clf = LogisticRegression()
    # clf = ensemble.RandomForestClassifier()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print('classification_report\n', metrics.classification_report(y_test, y_pred, digits=4))
    print('Accuracy:', metrics.accuracy_score(y_test, y_pred))
```
7 Overall Test Results

The methods above were evaluated on 2,000 samples:

As the results show, with 2,000 training samples and 200 test samples the accuracy is reasonably high, although with so little data the results are not very conclusive.

8 Other Models

A deep learning model can also be built:

http://www.stutter.cn/data/attachment/forum/20241220/1734683601243_7.png

The first layer of the network is a pretrained embedding layer that maps each word to an N-dimensional vector of real numbers (here N = 100, the embedding size). Two words with similar meanings tend to have very close vectors.

The second layer is a recurrent neural network with LSTM units. Finally, the output layer consists of 2 neurons, one for "spam" and one for "ham", with a softmax activation.

```python
import numpy as np
import tqdm

def get_embedding_vectors(tokenizer, dim=100):
    embedding_index = {}
    with open(f"data/glove.6B.{dim}d.txt", encoding='utf8') as f:
        for line in tqdm.tqdm(f, "Reading GloVe"):
            values = line.split()
            word = values[0]
            vectors = np.asarray(values[1:], dtype='float32')
            embedding_index[word] = vectors
    word_index = tokenizer.word_index
    embedding_matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        embedding_vector = embedding_index.get(word)
        if embedding_vector is not None:
            # words not found will be 0s
            embedding_matrix[i] = embedding_vector
    return embedding_matrix
```
```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
import keras_metrics

# EMBEDDING_SIZE and SEQUENCE_LENGTH are project-level constants defined elsewhere

def get_model(tokenizer, lstm_units):
    """
    Constructs the model:
    Embedding vectors => LSTM => 2 output fully-connected neurons with softmax activation
    """
    # get the GloVe embedding vectors
    embedding_matrix = get_embedding_vectors(tokenizer)
    model = Sequential()
    model.add(Embedding(len(tokenizer.word_index) + 1,
                        EMBEDDING_SIZE,
                        weights=[embedding_matrix],
                        trainable=False,
                        input_length=SEQUENCE_LENGTH))
    model.add(LSTM(lstm_units, recurrent_dropout=0.2))
    model.add(Dropout(0.3))
    model.add(Dense(2, activation="softmax"))
    # compile with the rmsprop optimizer,
    # tracking precision and recall as well as accuracy
    model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
                  metrics=["accuracy", keras_metrics.precision(), keras_metrics.recall()])
    model.summary()
    return model
```
The resulting model summary:

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 100, 100)          901300
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               117248
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258
=================================================================
Total params: 1,018,806
Trainable params: 117,506
```
Non<span class="token operator">-</span>trainable params<span class="token punctuation">:</span> <span class="token number">901</span><span class="token punctuation">,</span><span class="token number">300</span>
_________________________________________________________________
X_train<span class="token punctuation">.</span>shape<span class="token punctuation">:</span> <span class="token punctuation">(</span><span class="token number">4180</span><span class="token punctuation">,</span> <span class="token number">100</span><span class="token punctuation">)</span>
X_test<span class="token punctuation">.</span>shape<span class="token punctuation">:</span> <span class="token punctuation">(</span><span class="token number">1394</span><span class="token punctuation">,</span> <span class="token number">100</span><span class="token punctuation">)</span>
y_train<span class="token punctuation">.</span>shape<span class="token punctuation">:</span> <span class="token punctuation">(</span><span class="token number">4180</span><span class="token punctuation">,</span> <span class="token number">2</span><span class="token punctuation">)</span>
y_test<span class="token punctuation">.</span>shape<span class="token punctuation">:</span> <span class="token punctuation">(</span><span class="token number">1394</span><span class="token punctuation">,</span> <span class="token number">2</span><span class="token punctuation">)</span>
Train on <span class="token number">4180</span> samples<span class="token punctuation">,</span> validate on <span class="token number">1394</span> samples
Epoch <span class="token number">1</span><span class="token operator">/</span><span class="token number">20</span>
<span class="token number">4180</span><span class="token operator">/</span><span class="token number">4180</span> <span class="token punctuation">[</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token punctuation">]</span> <span class="token operator">-</span> 9s 2ms<span class="token operator">/</span>step <span class="token operator">-</span> loss<span class="token punctuation">:</span> <span class="token number">0.1712</span> <span class="token operator">-</span> acc<span class="token punctuation">:</span> <span class="token number">0.9325</span> <span class="token operator">-</span> precision<span class="token punctuation">:</span> <span class="token number">0.9524</span> <span class="token operator">-</span> recall<span class="token punctuation">:</span> <span class="token number">0.9708</span> <span class="token operator">-</span> val_loss<span class="token punctuation">:</span> <span class="token number">0.1023</span> <span class="token operator">-</span> val_acc<span class="token punctuation">:</span> <span class="token number">0.9656</span> <span class="token operator">-</span> val_precision<span class="token punctuation">:</span> <span class="token number">0.9840</span> <span class="token operator">-</span> val_recall<span class="token punctuation">:</span> <span class="token number">0.9758</span>
Epoch <span class="token number">00001</span><span class="token punctuation">:</span> val_loss improved <span class="token keyword">from</span> inf to <span class="token number">0.10233</span><span class="token punctuation">,</span> saving model to results<span class="token operator">/</span>spam_classifier_0<span class="token punctuation">.</span><span class="token number">10</span>
Epoch <span class="token number">2</span><span class="token operator">/</span><span class="token number">20</span>
<span class="token number">4180</span><span class="token operator">/</span><span class="token number">4180</span> <span class="token punctuation">[</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token operator">==</span><span class="token punctuation">]</span> <span class="token operator">-</span> 8s 2ms<span class="token operator">/</span>step <span class="token operator">-</span> loss<span class="token punctuation">:</span> <span class="token number">0.0976</span> <span class="token operator">-</span> acc<span class="token punctuation">:</span> <span class="token number">0.9675</span> <span class="token operator">-</span> precision<span class="token punctuation">:</span> <span class="token number">0.9765</span> <span class="token operator">-</span> recall<span class="token punctuation">:</span> <span class="token number">0.9862</span> <span class="token operator">-</span> val_loss<span class="token punctuation">:</span> <span class="token number">0.0809</span> <span class="token operator">-</span> val_acc<span class="token punctuation">:</span> <span class="token number">0.9720</span> <span class="token operator">-</span> val_precision<span class="token punctuation">:</span> <span class="token number">0.9793</span> <span class="token operator">-</span> val_recall<span class="token punctuation">:</span> <span class="token number">0.9883</span>
</code></pre></p>
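The layer shapes and parameter counts in the summary above can be cross-checked by hand. Below is a minimal arithmetic sketch; the vocabulary size of 9,013 is an assumption inferred from the Embedding layer's 901,300 parameters at 100 dimensions, and is not stated in the log itself:

```python
# Cross-check of the Keras model summary's parameter counts.
# NOTE: vocab_size is inferred, not given in the log.
embed_dim, lstm_units, n_classes = 100, 128, 2

vocab_size = 901300 // embed_dim            # inferred: 9013 tokens
embedding_params = vocab_size * embed_dim   # Embedding: vocab * dim

# LSTM: 4 gates, each with (input_dim + units + 1 bias) * units weights
lstm_params = 4 * ((embed_dim + lstm_units + 1) * lstm_units)

# Dense: weights (units * classes) plus one bias per class
dense_params = lstm_units * n_classes + n_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# -> 901300 117248 258 1018806
```

Note that the non-trainable total (901,300) exactly equals the Embedding layer's parameter count, which suggests the word vectors were loaded pretrained and frozen, leaving only the LSTM and Dense layers (117,506 parameters) to train.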
http://www.stutter.cn/data/attachment/forum/20241220/1734683601243_8.png

9 Conclusion