pytorch-handbook/chapter3/3.1-logistic-regression.ipynb

{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'1.0.0'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"import numpy as np\n",
"torch.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3.1 logistic回归实战\n",
"在这一章里面我们将处理一下结构化数据并使用logistic回归对结构化数据进行简单的分类。\n",
"## 3.1.1 logistic回归介绍\n",
"logistic回归是一种广义线性回归generalized linear model与多重线性回归分析有很多相同之处。它们的模型形式基本上相同都具有 wx + b其中w和b是待求参数其区别在于他们的因变量不同多重线性回归直接将wx+b作为因变量即y =wx+b,而logistic回归则通过函数L将wx+b对应一个隐状态pp =L(wx+b),然后根据p 与1-p的大小决定因变量的值。如果L是logistic函数就是logistic回归如果L是多项式函数就是多项式回归。\n",
"\n",
"说的更通俗一点就是logistic回归会在线性回归后再加一层logistic函数的调用。\n",
"\n",
"logistic回归主要是进行二分类预测我们在激活函数时候讲到过 Sigmod函数Sigmod函数是最常见的logistic函数因为Sigmod函数的输出的是是对于0~1之间的概率值当概率大于0.5预测为1小于0.5预测为0。\n",
"\n",
"下面我们就来使用公开的数据来进行介绍"
]
},
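{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the decision rule concrete, here is a minimal sketch (added for illustration, not part of the original example): we apply the sigmoid to a few example values of wx + b and threshold the resulting probabilities at 0.5."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"z = torch.tensor([-2.0, 0.0, 3.0]) # example linear outputs wx + b\n",
"p = torch.sigmoid(z) # probabilities in (0, 1)\n",
"pred = (p > 0.5).long() # predict 1 when p > 0.5, otherwise 0\n",
"print(p, pred)"
]
},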
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.1.2 UCI German Credit 数据集\n",
"\n",
"UCI German Credit是UCI的德国信用数据集里面有原数据和数值化后的数据。\n",
"\n",
"German Credit数据是根据个人的银行贷款信息和申请客户贷款逾期发生情况来预测贷款违约倾向的数据集数据集包含24个维度的1000条数据\n",
"\n",
"在这里我们直接使用处理好的数值化的数据,作为展示。\n",
"\n",
"[地址](https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 代码实战\n",
"我们这里使用的 german.data-numeric是numpy处理好数值化数据我们直接使用numpy的load方法读取即可"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"data=np.loadtxt(\"german.data-numeric\")"
]
},
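{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (assuming the file was downloaded from the link above), the loaded array should have 1000 rows and 25 columns: 24 feature columns plus the label."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(data.shape) # expected: (1000, 25)"
]
},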
{
"cell_type": "markdown",
"metadata": {},
"source": [
"数据读取完成后我们要对数据做一下归一化的处理"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"n,l=data.shape\n",
"for j in range(l-1):\n",
" meanVal=np.mean(data[:,j])\n",
" stdVal=np.std(data[:,j])\n",
" data[:,j]=(data[:,j]-meanVal)/stdVal"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"打乱数据"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"np.random.shuffle(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"区分训练集和测试集,由于这里没有验证集,所以我们直接使用测试集的准确度作为评判好坏的标准\n",
"\n",
"区分规则900条用于训练100条作为测试\n",
"\n",
"german.data-numeric的格式为前24列为24个维度最后一个为要打的标签01所以我们将数据和标签一起区分出来"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"train_data=data[:900,:l-1]\n",
"train_lab=data[:900,l-1]-1\n",
"test_data=data[900:,:l-1]\n",
"test_lab=data[900:,l-1]-1"
]
},
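{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check of the split (a sketch added for illustration): the training arrays should hold 900 samples, the test arrays 100, and the shifted labels should contain only 0 and 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(train_data.shape, train_lab.shape) # expected: (900, 24) (900,)\n",
"print(test_data.shape, test_lab.shape) # expected: (100, 24) (100,)\n",
"print(np.unique(train_lab)) # expected: [0. 1.]"
]
},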
{
"cell_type": "markdown",
"metadata": {},
"source": [
"下面我们定义模型,模型很简单"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"class LR(nn.Module):\n",
" def __init__(self):\n",
" super(LR,self).__init__()\n",
" self.fc=nn.Linear(24,2) # 由于24个维度已经固定了所以这里写24\n",
" def forward(self,x):\n",
" out=self.fc(x)\n",
" out=torch.sigmoid(out)\n",
" return out\n",
" "
]
},
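{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick shape check (a sketch with random input, added for illustration), a forward pass on a batch of 4 samples should return a (4, 2) tensor of values in (0, 1). Note that nn.CrossEntropyLoss, which we use below, applies log-softmax internally, so the model would also train without the sigmoid; we keep the sigmoid here as in the original definition."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"demo_net = LR() # a throwaway instance, just for the shape check\n",
"demo_out = demo_net(torch.randn(4, 24))\n",
"print(demo_out.shape) # expected: torch.Size([4, 2])"
]
},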
{
"cell_type": "markdown",
"metadata": {},
"source": [
"测试集上的准确率"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def test(pred,lab):\n",
" t=pred.max(-1)[1]==lab\n",
" return torch.mean(t.float())"
]
},
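{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny usage example (added for illustration): for two samples where only the first prediction matches the label, the accuracy should be 0.5."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"demo_pred = torch.tensor([[0.2, 0.8], [0.9, 0.1]]) # argmax: class 1, class 0\n",
"demo_lab = torch.tensor([1, 1])\n",
"print(test(demo_pred, demo_lab)) # expected: tensor(0.5000)"
]
},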
{
"cell_type": "markdown",
"metadata": {},
"source": [
"下面就是对一些设置"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"net=LR() \n",
"criterion=nn.CrossEntropyLoss() # 使用CrossEntropyLoss损失\n",
"optm=torch.optim.Adam(net.parameters()) # Adam优化\n",
"epochs=1000 # 训练1000次\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"下面开始训练了"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch:100,Loss:0.6313,Accuracy0.76\n",
"Epoch:200,Loss:0.6065,Accuracy0.79\n",
"Epoch:300,Loss:0.5909,Accuracy0.80\n",
"Epoch:400,Loss:0.5801,Accuracy0.81\n",
"Epoch:500,Loss:0.5720,Accuracy0.82\n",
"Epoch:600,Loss:0.5657,Accuracy0.81\n",
"Epoch:700,Loss:0.5606,Accuracy0.81\n",
"Epoch:800,Loss:0.5563,Accuracy0.81\n",
"Epoch:900,Loss:0.5527,Accuracy0.81\n",
"Epoch:1000,Loss:0.5496,Accuracy0.80\n"
]
}
],
"source": [
"for i in range(epochs):\n",
" # 指定模型为训练模式,计算梯度\n",
" net.train()\n",
" # 输入值都需要转化成torch的Tensor\n",
" x=torch.from_numpy(train_data).float()\n",
" y=torch.from_numpy(train_lab).long()\n",
" y_hat=net(x)\n",
" loss=criterion(y_hat,y) # 计算损失\n",
" optm.zero_grad() # 前一步的损失清零\n",
" loss.backward() # 反向传播\n",
" optm.step() # 优化\n",
" if (i+1)%100 ==0 : # 这里我们每100次输出相关的信息\n",
" # 指定模型为计算模式\n",
" net.eval()\n",
" test_in=torch.from_numpy(test_data).float()\n",
" test_l=torch.from_numpy(test_lab).long()\n",
" test_out=net(test_in)\n",
" # 使用我们的测试函数计算准确率\n",
" accu=test(test_out,test_l)\n",
" print(\"Epoch:{},Loss:{:.4f},Accuracy{:.2f}\".format(i+1,loss.item(),accu))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"训练完成了我们的准确度达到了80%"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}