Scikit-learn

A machine learning library for Python

What is Scikit-Learn

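Scikit-Learn is a Python machine learning library widely used for data mining and data analysis. It provides simple, efficient tools and supports a range of machine learning algorithms, including classification, regression, clustering, and dimensionality reduction. Its design is clean and easy to use, and it integrates seamlessly with scientific computing libraries such as NumPy and SciPy. Scikit-Learn is known for its practicality, performance, and breadth of algorithm implementations, which makes it suitable for users at every level, from beginners to experts. It also provides detailed documentation and examples that help users get started quickly and solve real problems.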

Main features of Scikit-Learn

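  • Machine learning algorithms: a range of classification, regression, clustering, and dimensionality reduction algorithms that cover different machine learning tasks.
  • Data preprocessing: tools for feature scaling, missing-value handling, feature encoding, and feature selection that prepare data for model training.
  • Model selection and evaluation: cross-validation, hyperparameter tuning, and performance metrics that help choose and optimize a model.
  • Pipeline: combines data preprocessing, model training, and evaluation into a single workflow, which simplifies the code and makes it more efficient.
  • Ensemble learning: ensemble methods such as Bagging, Boosting, and random forests that improve model performance and stability.
  • Multi-output and multi-label: supports multi-output classification and regression as well as multi-label classification, so a model can predict several target values or classes at once.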
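
As a rough illustration of how the pipeline, cross-validation, and hyperparameter tuning features fit together, the sketch below chains a scaler and a classifier and then tunes one parameter. The dataset (Iris), the step names `scale` and `clf`, and the grid of `C` values are arbitrary choices made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Chain scaling and classification so the scaler is refit inside every CV fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validated accuracy of the whole pipeline.
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")

# Tune the regularization strength C; the grid values are illustrative only.
search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
```

Wrapping the scaler in the pipeline means it only ever sees the training portion of each fold, which avoids leaking test-fold statistics into preprocessing.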

How to use Scikit-Learn

  • Install scikit-learn
    • With pip
pip install -U scikit-learn
    • With conda
conda install -c conda-forge scikit-learn
  • Import the required modules: in Python, import scikit-learn together with the related libraries (such as NumPy and Pandas) used to handle the data.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
  • Load a dataset: scikit-learn ships with several built-in datasets, such as the Iris dataset and the handwritten Digits dataset.
    • Use a built-in dataset
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
    • Load a custom dataset
# Load a CSV file with Pandas
data = pd.read_csv('your_dataset.csv')
X = data.drop('target_column', axis=1)
y = data['target_column']
  • Preprocess the data: before training a model, the data usually needs preprocessing, such as splitting it into training and test sets and standardizing the features.
    • Split into training and test sets
# Hold out 20% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    • Standardize the data
# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
  • Train a model: choose a suitable model and fit it on the training data. Logistic regression is used here as an example.
model = LogisticRegression()
model.fit(X_train, y_train)
  • Evaluate the model: measure the model's performance on the test set.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print(classification_report(y_test, y_pred))
  • Predict with the model: apply the trained model to new data.
new_data = np.array([[5.1, 3.5, 1.4, 0.2]])  # example new sample
new_data = scaler.transform(new_data)  # standardize with the fitted scaler
prediction = model.predict(new_data)
print(f'Prediction: {prediction}')
  • Save and load the model
    • Save the model
import joblib
joblib.dump(model, 'model.pkl')
    • Load the model
model = joblib.load('model.pkl')
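
The steps above save only the classifier, yet new data must also pass through the same fitted scaler before prediction. One possible way to keep the two together is sketched below: both objects are stored in a single file and then restored. The dictionary keys, the file name `model_bundle.pkl`, the temporary directory, and `max_iter=1000` are assumptions made for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data, then train on the scaled features.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

# Save both objects in one file; the dict keys are an arbitrary choice.
path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")
joblib.dump({"scaler": scaler, "model": model}, path)

# Later, for example in another process: restore both and predict on raw features.
bundle = joblib.load(path)
restored_pred = bundle["model"].predict(bundle["scaler"].transform(X_test))

# Check that the restored objects reproduce the original predictions.
same = np.array_equal(restored_pred, model.predict(scaler.transform(X_test)))
print("Restored predictions match:", same)
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.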

Use cases for Scikit-Learn

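  • Data mining: extracting valuable information from large amounts of data, for example using clustering algorithms to discover natural groupings.
  • Data analysis: supporting exploratory analysis, for example using dimensionality reduction (such as PCA) to visualize high-dimensional data.
  • Classification: a wide range of classification problems, such as spam detection, image classification, and disease diagnosis.
  • Regression: predicting continuous values, such as house prices, stock prices, and sales volume.
  • Clustering: grouping data with unsupervised learning algorithms (such as K-means) to uncover patterns and structure.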
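
To make the clustering and dimensionality reduction use cases more concrete, here is a minimal sketch that groups the Iris samples with K-means and projects them onto two principal components. The choice of three clusters, the `n_init` and `random_state` values, and the use of Iris are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)  # labels are ignored: this is unsupervised
X_scaled = StandardScaler().fit_transform(X)

# Group the samples into 3 clusters; n_clusters=3 is an assumption for this data.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Project the 4 features onto 2 principal components, e.g. for a scatter plot.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

sizes = [int((labels == k).sum()) for k in range(3)]
explained = float(pca.explained_variance_ratio_.sum())
print("Cluster sizes:", sizes)
print(f"Variance explained by 2 components: {explained:.3f}")
```

The two columns of `X_2d` can be passed to any plotting library, colored by `labels`, to inspect the groupings visually.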