scikit-learn - Documentation

What is scikit-learn?

scikit-learn (sklearn) is a free, open-source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is built on top of the Python numerical and scientific libraries NumPy, SciPy and Matplotlib, providing a consistent, user-friendly interface for a wide range of machine learning tasks. Scikit-learn is widely used in both research and industry for its efficiency, scalability and comprehensive documentation. Its focus is on building and evaluating predictive models rather than on data visualization or data manipulation, although it offers some basic capabilities in those areas.

Key Features and Capabilities

Scikit-learn offers a rich set of tools and functionalities covering most aspects of the machine learning workflow: classification, regression, clustering, dimensionality reduction, model selection (cross-validation, metrics and hyperparameter search) and preprocessing (scaling, encoding and feature extraction). The sections below cover installation and each of these areas in turn.

Installation and Setup

The easiest way to install scikit-learn is using pip:

pip install scikit-learn

For conda users:

conda install -c conda-forge scikit-learn

Scikit-learn depends primarily on NumPy and SciPy; pip and conda install these automatically if they are missing. It is recommended to use a Python virtual environment to manage dependencies and avoid conflicts with other projects. For example, using venv:

python3 -m venv .venv
source .venv/bin/activate  # On Linux/macOS
.venv\Scripts\activate  # On Windows
pip install scikit-learn

After installation, verify the installation by importing the library in a Python interpreter:

import sklearn
print(sklearn.__version__)

Basic Workflow and Concepts

A typical scikit-learn workflow involves these steps:

  1. Data Loading and Preparation: Load your data using libraries like Pandas. Preprocess the data by cleaning, transforming, and scaling features. Handle missing values and encode categorical variables as needed.

  2. Model Selection: Choose an appropriate machine learning model based on your problem (classification, regression, clustering, etc.) and data characteristics.

  3. Model Training: Train the chosen model using your prepared data. This involves fitting the model to the training data using the fit() method.

  4. Model Evaluation: Evaluate the performance of the trained model using appropriate metrics (e.g., accuracy, precision, recall, F1-score for classification; R-squared, Mean Squared Error for regression). Techniques like cross-validation are crucial for robust evaluation.

  5. Model Tuning (Hyperparameter Optimization): Fine-tune the model’s hyperparameters to improve its performance. Techniques such as GridSearchCV or RandomizedSearchCV can be used.

  6. Model Deployment: Once satisfied with the model’s performance, deploy it to make predictions on new, unseen data. This might involve saving the trained model using joblib for later use.

Example (Simple Linear Regression):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np

# Sample data (y = 2x)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model using R-squared
print(r2_score(y_test, y_pred))

This is a simplified example; real-world applications will require more sophisticated data preparation and model selection.
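
Step 6 of the workflow mentions persisting a trained model with joblib. A minimal sketch continuing the regression example above (the filename is an arbitrary choice for illustration):

import joblib

# Save the trained model to disk
joblib.dump(model, "linear_regression.joblib")

# Later (possibly in another process): reload it and predict on new data
loaded_model = joblib.load("linear_regression.joblib")
print(loaded_model.predict([[11]]))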

Supervised Learning

Classification

Scikit-learn provides a wide array of classification algorithms for predicting categorical outcomes. These algorithms learn from labeled data, where each data point is associated with a specific class label. Commonly used estimators include LogisticRegression, SVC, KNeighborsClassifier, DecisionTreeClassifier, RandomForestClassifier and the naive Bayes family, all sharing the same fit/predict (and usually predict_proba) interface.
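
As a quick illustration, a minimal classifier on the built-in iris dataset (logistic regression chosen arbitrarily) might look like this:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and hold out a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # mean accuracy on the held-out data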

Regression

Regression algorithms in scikit-learn predict continuous numerical outcomes. As with classification, they learn from labeled data, but the target variable is a real number rather than a category. Commonly used estimators include LinearRegression, Ridge, Lasso, ElasticNet, SVR, RandomForestRegressor and GradientBoostingRegressor, again sharing the same fit/predict interface.

Model Selection and Evaluation

Choosing the right model and evaluating its performance are crucial steps in supervised learning. Scikit-learn provides tools for this in sklearn.model_selection (train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV) and sklearn.metrics (accuracy, precision, recall, F1-score, mean squared error, R-squared and many others); these are covered in detail in the Model Selection and Evaluation section below.

Hyperparameter Tuning

Hyperparameters are settings that control the learning process of a model. They are not learned from the data but are set beforehand. Tuning them is crucial for optimizing model performance: GridSearchCV exhaustively evaluates a grid of candidate values, while RandomizedSearchCV samples a fixed number of candidates, both scoring each setting with cross-validation (see the GridSearchCV and RandomizedSearchCV section below).

Unsupervised Learning

Clustering

Clustering algorithms in scikit-learn group data points into clusters based on similarity. No labeled data is required; the algorithm learns the structure of the data itself. Available algorithms include KMeans, MiniBatchKMeans, DBSCAN, AgglomerativeClustering, MeanShift and SpectralClustering, and measures such as the silhouette score help assess cluster quality and choose the number of clusters.
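
A minimal k-means sketch on synthetic data (three blobs generated with make_blobs purely for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels[:10])               # cluster assignment for the first ten points
print(kmeans.cluster_centers_)   # coordinates of the three cluster centers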

Dimensionality Reduction

Dimensionality reduction techniques aim to reduce the number of features in a dataset while preserving important information. This is useful for visualizing high-dimensional data, improving model performance by reducing noise and redundancy, and speeding up computation.
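
As an illustration, a PCA sketch reducing the iris data to two components (a common choice for visualization):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the four original features down to two principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component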

Anomaly Detection

Anomaly detection identifies data points that deviate significantly from the norm. These deviations can represent errors, outliers, or interesting events.
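
A minimal sketch using IsolationForest on toy data (the two injected outliers and the contamination value are arbitrary choices for illustration):

from sklearn.ensemble import IsolationForest
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))                 # mostly "normal" points
X = np.vstack([X, [[6, 6], [-7, 5]]])         # two obvious outliers appended

iso = IsolationForest(contamination=0.02, random_state=0)
labels = iso.fit_predict(X)                   # 1 = inlier, -1 = outlier
print(np.where(labels == -1)[0])              # indices flagged as anomalies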

Model Selection and Evaluation

Metrics for Classification

Evaluating the performance of classification models requires metrics that consider the model's ability to correctly classify different classes and the trade-off between different types of errors. Scikit-learn provides a comprehensive set of classification metrics, including accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss, confusion_matrix and the summary report classification_report.

All these metrics are accessible through functions in sklearn.metrics.
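
For example, on a small set of hand-written labels (toy values for illustration):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

print(accuracy_score(y_true, y_pred))    # fraction of correct predictions
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall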

Metrics for Regression

Evaluating regression models focuses on how well the predicted values match the actual values. Scikit-learn provides several regression metrics, including mean_squared_error (and its square root, the RMSE), mean_absolute_error, r2_score (the coefficient of determination) and explained_variance_score.

These metrics are also accessible through functions in sklearn.metrics.
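
A minimal example with toy values:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(mean_squared_error(y_true, y_pred))
print(mean_absolute_error(y_true, y_pred))
print(r2_score(y_true, y_pred))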

Cross-Validation Techniques

Cross-validation is crucial for obtaining reliable estimates of model performance and preventing overfitting. Scikit-learn provides various cross-validation strategies through sklearn.model_selection, including KFold, StratifiedKFold (which preserves class proportions), LeaveOneOut, ShuffleSplit, GroupKFold and TimeSeriesSplit for temporally ordered data.

cross_val_score and cross_val_predict in sklearn.model_selection simplify the implementation of these techniques.
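
A minimal cross_val_score sketch on the iris dataset (logistic regression chosen arbitrarily):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# 5-fold cross-validation; returns one accuracy score per fold
scores = cross_val_score(clf, X, y, cv=5)
print(scores)
print(scores.mean(), scores.std())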

Learning Curves

Learning curves plot the model’s performance (e.g., training and validation error) as a function of the training set size. They provide insight into model bias and variance: high error on both curves suggests high bias (underfitting), while a persistent gap between a low training error and a high validation error suggests high variance (overfitting).

The learning_curve function in sklearn.model_selection computes training and validation scores over a range of training set sizes; the curves themselves are then plotted with a library such as Matplotlib.
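
A sketch of the data-gathering step (plotting omitted; the estimator and training-size grid are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
import numpy as np

X, y = load_iris(return_X_y=True)

# Accuracy at five increasing training set sizes, averaged over 5 folds
train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5), shuffle=True, random_state=0)

print(train_sizes)                 # absolute training set sizes used
print(train_scores.mean(axis=1))   # mean training accuracy per size
print(valid_scores.mean(axis=1))   # mean validation accuracy per size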

Confusion Matrices

A confusion matrix is a visualization tool that summarizes the performance of a classification model. It’s a square matrix in which each row represents the instances of an actual class and each column represents the instances of a predicted class (the convention used by scikit-learn). The entries are the counts of each combination of actual and predicted class:

                     Predicted Positive      Predicted Negative
  Actual Positive    True Positive (TP)      False Negative (FN)
  Actual Negative    False Positive (FP)     True Negative (TN)

The confusion matrix can be used to calculate various metrics like accuracy, precision, recall, and F1-score. The confusion_matrix function in sklearn.metrics generates confusion matrices, and libraries like Matplotlib or Seaborn can create visual representations. For multi-class classification, the confusion matrix becomes larger, providing a detailed breakdown of performance for each class.
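
A minimal example (note that confusion_matrix orders classes by label value, so with labels 0 and 1 the first row corresponds to the negative class):

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# [[2 1]
#  [1 4]]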

Data Preprocessing and Feature Engineering

Data Cleaning

Data cleaning is a crucial preprocessing step to ensure data quality and improve model performance. It involves handling inconsistencies, errors, and noise in the dataset. Scikit-learn doesn’t provide extensive data cleaning functionality itself, but it integrates well with libraries like Pandas, which offer powerful tools for this purpose. Typical tasks include removing duplicate records, fixing inconsistent category labels and data types, correcting obviously erroneous values, and detecting or removing outliers.

Handling Missing Values

Missing values are a common issue in real-world datasets. Scikit-learn, together with libraries like Pandas, offers several strategies: dropping rows or columns with missing values (usually via Pandas), simple imputation with the mean, median or most frequent value using SimpleImputer, nearest-neighbour imputation with KNNImputer, and model-based imputation with the experimental IterativeImputer, all found in sklearn.impute.

The choice of imputation technique depends on the nature of the data and the amount of missingness.
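
A minimal SimpleImputer sketch (mean imputation on a toy array):

from sklearn.impute import SimpleImputer
import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace missing values with the column mean
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))   # NaNs replaced by the respective column means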

Feature Scaling

Feature scaling transforms features to a similar scale, preventing features with larger values from dominating the model and improving the performance of algorithms sensitive to feature scale (e.g., k-NN, SVM, and gradient-descent-based methods). Common techniques include standardization to zero mean and unit variance with StandardScaler, rescaling to a fixed range with MinMaxScaler, outlier-robust scaling with RobustScaler (based on the median and interquartile range), and MaxAbsScaler, which preserves sparsity.

The choice of scaling technique depends on the data distribution and the presence of outliers.
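
A small sketch contrasting two scalers on a toy array:

from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

print(StandardScaler().fit_transform(X))   # zero mean, unit variance per column
print(MinMaxScaler().fit_transform(X))     # each column rescaled to [0, 1]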

Encoding Categorical Features

Many machine learning algorithms require numerical input, so categorical features (e.g., colors, genders) need to be converted into numerical representations: one-hot encoding with OneHotEncoder (or pandas.get_dummies), integer codes with OrdinalEncoder, and LabelEncoder for encoding target labels.

The choice of encoding technique depends on the nature of the categorical feature and the requirements of the machine learning algorithm.
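
A minimal OneHotEncoder sketch on a toy column (assuming scikit-learn >= 1.2 for the sparse_output parameter; older versions use sparse=False):

from sklearn.preprocessing import OneHotEncoder
import numpy as np

X = np.array([["red"], ["green"], ["blue"], ["green"]])

# One binary column per category; dense output for readability
encoder = OneHotEncoder(sparse_output=False)
print(encoder.fit_transform(X))
print(encoder.categories_)   # categories discovered during fit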

Feature Selection

Feature selection aims to identify the most relevant features for the model, improving performance and reducing computational cost. Techniques include removing near-constant features with VarianceThreshold, univariate statistical selection with SelectKBest, recursive feature elimination with RFE, and model-based selection with SelectFromModel.

The choice of feature selection technique depends on the dataset, model, and computational resources. Feature selection should be applied carefully: removing features also removes information, which can hurt accuracy.
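
A minimal SelectKBest sketch on the iris data (k=2 chosen arbitrarily):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the two features with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)           # (150, 2)
print(selector.get_support())     # boolean mask of the kept features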

Working with Text Data

Text Vectorization

Before applying machine learning algorithms to text data, it needs to be converted into a numerical representation that algorithms can understand. This process is called text vectorization. Scikit-learn provides CountVectorizer, which builds a bag-of-words matrix of raw token counts; TfidfVectorizer, which applies TF-IDF weighting; and HashingVectorizer, a stateless, memory-efficient alternative.

These vectorizers are found in sklearn.feature_extraction.text. They transform text data into matrices where rows represent documents and columns represent words (or n-grams).
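
A minimal CountVectorizer sketch on two toy documents:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
print(X.toarray())                          # raw token counts per document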

TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting scheme that assigns higher weights to words that are frequent in a specific document but infrequent in the overall corpus. It helps to downplay the importance of common words (stop words) that don’t carry much discriminative information. The formula is typically:

TF-IDF(word, document) = TF(word, document) * IDF(word)

where TF(word, document) is the number of times the word occurs in the document (optionally normalized by document length), and IDF(word) = log(N / df(word)), with N the total number of documents and df(word) the number of documents containing the word.

TfidfVectorizer computes TF-IDF weights directly (using a smoothed IDF by default and L2-normalizing each document vector). Setting its use_idf parameter to False yields plain term-frequency values.
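
A minimal TfidfVectorizer sketch on the same toy documents used above:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))   # each row is an L2-normalized TF-IDF vector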

N-grams

N-grams are sequences of N consecutive words in a text. Using n-grams (where N > 1) captures word combinations and context, which can be crucial for understanding the meaning of text. For example, “New York” is different from “New” and “York” individually.

CountVectorizer and TfidfVectorizer support n-grams through the ngram_range parameter. Setting ngram_range=(1, 2) will include both unigrams (single words) and bigrams (two-word sequences). Higher values of N can capture longer phrases, but also increase the dimensionality of the resulting vector representation.
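
A short sketch showing the effect of ngram_range on the extracted vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["new york is big", "york is old"]

# Unigrams and bigrams: "new york" becomes a feature of its own
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())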

Topic Modeling

Topic modeling is a technique for discovering the underlying topics in a collection of documents. Latent Dirichlet Allocation (LDA) is a common probabilistic topic model, implemented in scikit-learn as LatentDirichletAllocation in sklearn.decomposition; the gensim library provides another widely used implementation and integrates well with scikit-learn’s preprocessing tools.

The basic workflow involves:

  1. Preprocessing: Clean and vectorize the text data, typically with CountVectorizer (LDA models word counts, so raw counts are preferred over TF-IDF weights).

  2. LDA Modeling: Apply LatentDirichletAllocation from sklearn.decomposition (or gensim’s LDA) to the vectorized counts, specifying the number of topics to discover.

  3. Interpretation: Examine the top words associated with each discovered topic to understand its meaning.

Topic modeling helps to understand the thematic structure of large text corpora. The choice of the number of topics is a crucial hyperparameter that often requires experimentation and domain expertise.
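
A minimal sketch on a toy corpus (two topics chosen arbitrarily; real corpora need far more documents):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat chased the mouse",
        "dogs and cats make good pets",
        "the stock market fell sharply",
        "investors sold shares in the market"]

counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the three highest-weighted words for each discovered topic
terms = counts.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = weights.argsort()[-3:][::-1]
    print(topic_idx, [terms[i] for i in top])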

Working with Images

Scikit-learn’s core functionality is not primarily designed for image processing. It excels in machine learning model building, evaluation, and model selection. However, it can be effectively used in conjunction with other libraries like scikit-image, OpenCV, and Pillow for image-related tasks. The typical workflow involves using these other libraries for image preprocessing and feature extraction, and then using scikit-learn for model training and evaluation.

Image Feature Extraction

Extracting relevant features from images is crucial for image classification and object detection. Scikit-learn doesn’t provide image feature extraction methods of its own (beyond simple patch extraction in sklearn.feature_extraction.image), but it can use features produced by other libraries: raw pixel intensities flattened into vectors, descriptors such as HOG from scikit-image or OpenCV, color histograms, or embeddings from pre-trained convolutional networks in TensorFlow or PyTorch.

After extracting features using an external library, you’ll typically have a numerical representation (feature matrix) of your images, which can then be used as input to scikit-learn’s machine learning algorithms.

Image Classification

Image classification involves assigning an image to a specific category (e.g., cat, dog, car). Scikit-learn can be used for this after feature extraction:

  1. Feature extraction: Extract features from images using one of the methods described above.

  2. Model selection: Choose an appropriate classification algorithm (e.g., Support Vector Machines, Random Forests, Logistic Regression, etc.).

  3. Model training: Train the selected model using the extracted image features and corresponding class labels.

  4. Model evaluation: Evaluate the model’s performance using appropriate metrics (e.g., accuracy, precision, recall, F1-score).

Scikit-learn provides tools for model selection, training, evaluation, and hyperparameter tuning, making it suitable for building image classification models.
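
As a small end-to-end illustration using the built-in digits dataset (the 8x8 pixel intensities serve directly as features, so no external feature extraction is needed; the SVC gamma value is a conventional choice for this dataset):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 8x8 grayscale digit images, already flattened to 64 features per sample
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = SVC(gamma=0.001)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))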

Object Detection

Object detection aims to identify and locate objects within an image. This is more complex than image classification. Similar to image classification, the workflow involves:

  1. Feature extraction: Extract features that capture both the presence and location of objects. Region-based CNNs (R-CNNs) and other deep learning architectures are commonly used for this.

  2. Model selection: Choose an appropriate model, often a deep learning model trained using a framework like TensorFlow or PyTorch. Scikit-learn is not typically used directly for the object detection model itself, due to the complexity and data requirements.

  3. Model training: Train the object detection model on labeled images containing bounding boxes around the objects of interest.

  4. Prediction: The trained model produces bounding boxes and class labels indicating the detected objects and their locations.

  5. Evaluation: Evaluate the performance using metrics appropriate for object detection, such as mean Average Precision (mAP).

While scikit-learn is not suitable for training the primary object detection model, it can play a supporting role, for example training and evaluating a classifier on features extracted from the detected bounding boxes. The core object detection pipeline is handled by dedicated deep learning libraries.

Advanced Topics

Pipeline Creation

Pipelines in scikit-learn chain multiple transformations and a final estimator into a single object. This simplifies the workflow, improves code readability, and helps avoid data leakage during model training (especially important with cross-validation). Pipelines are created using Pipeline from sklearn.pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Transformation step
    ('classifier', LogisticRegression())  # Estimator step
])

# Fit the pipeline to the data
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

This example shows a pipeline with a StandardScaler for data scaling and a LogisticRegression classifier. Multiple transformation steps can be included. Pipelines make the process of data preprocessing and model training more organized and efficient. They are especially useful with cross-validation, as the transformations are re-fitted on each training fold separately, preventing data leakage.

GridSearchCV and RandomizedSearchCV

Finding optimal hyperparameters for a model is crucial for maximizing performance. GridSearchCV and RandomizedSearchCV in sklearn.model_selection automate this process: GridSearchCV exhaustively evaluates every combination in a parameter grid using cross-validation, while RandomizedSearchCV samples a fixed number of parameter settings from specified distributions, which is usually far cheaper for large search spaces. Both refit the best model on the full training set by default.

Example using GridSearchCV:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)

This code searches for the best combination of C and gamma for an SVM classifier using 5-fold cross-validation. The best_params_ attribute gives the optimal hyperparameters, and best_score_ gives the corresponding cross-validated score.
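
A comparable RandomizedSearchCV sketch, sampling 10 candidate settings from log-uniform distributions (X_train and y_train are assumed from an earlier split, as in the example above; loguniform requires SciPy >= 1.4):

from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from scipy.stats import loguniform

param_distributions = {'C': loguniform(1e-2, 1e2), 'gamma': loguniform(1e-3, 1e1)}
random_search = RandomizedSearchCV(SVC(), param_distributions,
                                   n_iter=10, cv=5, random_state=0)
random_search.fit(X_train, y_train)
print(random_search.best_params_)
print(random_search.best_score_)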

Ensemble Methods

Ensemble methods combine multiple base estimators to improve prediction accuracy and robustness. Scikit-learn provides several ensemble techniques in sklearn.ensemble: bagging (RandomForestClassifier/RandomForestRegressor, BaggingClassifier), boosting (GradientBoostingClassifier, HistGradientBoostingClassifier, AdaBoostClassifier), and combinations of heterogeneous models via VotingClassifier and StackingClassifier.

Ensemble methods often lead to improved prediction accuracy and better generalization compared to using a single base estimator.
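
A minimal sketch combining two different base estimators with hard voting (estimator choices are arbitrary, for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Majority (hard) voting over a linear model and a random forest
ensemble = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
])

print(cross_val_score(ensemble, X, y, cv=5).mean())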

Custom Estimators

For specialized machine learning tasks, you might need to create custom estimators. This involves creating classes that inherit from BaseEstimator and TransformerMixin (for transformers) or RegressorMixin or ClassifierMixin (for estimators). You need to implement the fit and transform (for transformers) or predict (for estimators) methods. Example:

from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Learn parameters from the training data (if any)
        return self

    def transform(self, X):
        # Apply transformation to the data
        return X + 1  # example transformation: add 1 to every element

This creates a simple transformer that adds 1 to each element of the input. Remember to carefully implement the methods and ensure your custom estimator adheres to the scikit-learn API conventions. This allows seamless integration with pipelines and other scikit-learn functionalities.
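
A quick usage sketch, dropping the custom transformer into a pipeline (MyTransformer as defined above; the scaler, classifier and toy data are arbitrary choices for illustration):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ('add_one', MyTransformer()),   # custom step defined above
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])
pipe.fit(X, y)
print(pipe.predict(np.array([[1.5, 1.5]])))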

Modules Reference

This section provides a brief overview of key scikit-learn modules. For detailed documentation and API references, consult the official scikit-learn documentation.

sklearn.linear_model

This module implements linear models for regression and classification. Key classes include LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression, and SGDClassifier/SGDRegressor.

sklearn.tree

This module provides decision tree-based models, including DecisionTreeClassifier and DecisionTreeRegressor, along with utilities such as plot_tree and export_text for inspecting fitted trees.

sklearn.ensemble

This module contains ensemble methods such as RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, HistGradientBoostingClassifier, AdaBoostClassifier, BaggingClassifier, VotingClassifier, and StackingClassifier.

sklearn.svm

This module implements Support Vector Machines (SVMs): SVC and SVR for kernel-based classification and regression, plus LinearSVC and LinearSVR for large linear problems.

sklearn.naive_bayes

This module provides Naive Bayes classifiers, including GaussianNB, MultinomialNB, BernoulliNB, and ComplementNB.

sklearn.neighbors

This module implements nearest neighbor methods, including KNeighborsClassifier, KNeighborsRegressor, and NearestNeighbors.

sklearn.cluster

This module provides clustering algorithms such as KMeans, MiniBatchKMeans, DBSCAN, AgglomerativeClustering, MeanShift, and SpectralClustering.

sklearn.decomposition

This module offers matrix-decomposition-based dimensionality reduction techniques, including PCA, TruncatedSVD, NMF, FastICA, and LatentDirichletAllocation.

sklearn.manifold

This module contains manifold learning algorithms for nonlinear dimensionality reduction and visualization, such as TSNE, Isomap, LocallyLinearEmbedding, and MDS.

sklearn.preprocessing

This module offers various data preprocessing tools, including StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder, OrdinalEncoder, LabelEncoder, and PolynomialFeatures.

sklearn.feature_selection

This module provides tools for feature selection, such as VarianceThreshold, SelectKBest, RFE, RFECV, and SelectFromModel.

sklearn.model_selection

This module contains tools for model evaluation and selection, including train_test_split, cross_val_score, cross_val_predict, KFold, StratifiedKFold, GridSearchCV, RandomizedSearchCV, and learning_curve.

sklearn.metrics

This module provides functions for evaluating model performance, such as accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report, mean_squared_error, mean_absolute_error, and r2_score.

sklearn.pipeline

This module provides tools for composing estimators, namely Pipeline, FeatureUnion, and the make_pipeline/make_union helpers.

This is not an exhaustive list, and many other modules and classes are available within scikit-learn. Refer to the official documentation for a complete and detailed reference.

Appendix

Glossary of Terms

This glossary defines key terms used throughout the scikit-learn documentation and codebase.

This is a partial glossary; many other terms are used in machine learning.

Frequently Asked Questions (FAQ)

Further Reading and Resources

This list provides starting points for further learning and exploration. The field of machine learning is constantly evolving, so continuous learning is encouraged.