The essential library for building segmentation models

MartinThoma, CC0, via Wikimedia Commons (edited)

Neural network models have proven highly effective at solving segmentation problems, achieving state-of-the-art accuracy. They have led to significant improvements in applications such as medical image analysis, autonomous driving, robotics, satellite imaging, and video surveillance. Building these models is often time-consuming, but after reading this guide you should be able to create one with just a few lines of code.

Table of Contents

  1. Introduction
  2. Building blocks
  3. Build a model
  4. Train the model

Segmentation is the task of dividing an image into multiple segments or regions based on certain features or properties. A segmentation model takes an image as input and returns a segmentation mask:

(Left) An input image | (Right) Its segmentation mask. Both images by PyTorch.

Segmentation neural network models consist of two parts:

  • An encoder: takes an input image and extracts features. Examples of encoders are ResNet, EfficientNet, and ViT.
  • A decoder: takes the extracted features and generates a segmentation mask. The decoder varies by architecture. Examples of architectures are U-Net, FPN, and DeepLab.

Therefore, when creating a segmentation model for a specific application, you must choose an architecture and an encoder. It is difficult to pick the best combination without trying several, and this usually takes a long time because changing the model requires writing a lot of boilerplate code. The Segmentation Models library solves this problem. It lets you create a model in a single line by specifying the architecture and the encoder, so changing either of them only requires modifying that line.

To install the latest version of Segmentation Models from PyPI, use:

pip install segmentation-models-pytorch

The library provides a class for most segmentation architectures, and each of them can be used with any of the available encoders. In the next section, you will see that building a model only requires instantiating the chosen architecture class and passing the name of the chosen encoder as a string parameter. The following figure shows the class name of each architecture provided by the library:

Class names of all architectures provided by the library.

The following figure shows the names of the most common encoders provided by the library:

Names of the most common encoders provided by the library.

There are over 400 encoders so it’s not possible to list them all, but you can find a full list here.

Once you have chosen an architecture and an encoder from the figures above, building the model takes a single constructor call with the following parameters:


  • encoder_name is the name of the chosen encoder (e.g., resnet50, efficientnet-b7, mit_b5).
  • encoder_weights is the dataset on which the encoder was pretrained. If encoder_weights is "imagenet", the encoder is initialized with ImageNet-pretrained weights. All encoders have at least one set of pretrained weights, and a full list is available here.
  • in_channels is the number of channels of the input image (3 for RGB).
    Even if in_channels is not 3, ImageNet-pretrained weights can still be used: the first layer is initialized by reusing the weights of the pretrained first convolutional layer (the procedure is described here).
  • out_classes is the number of classes in the dataset.
  • activation is the activation function for the output layer. The possible options are None (default), sigmoid, and softmax.
    Note: when using a loss function that expects logits as input, the activation function must be None. For example, when using the CrossEntropyLoss function, activation must be None.

This section shows all the code needed to train a model. The library does not change the usual pipeline for training and validating a model. To simplify the process, it provides implementations of many loss functions, such as Jaccard Loss, Dice Loss, Dice Cross-Entropy Loss, and Focal Loss, and of metrics such as Accuracy, Precision, Recall, F1Score, and IOUScore. For a full list and their parameters, see the Losses and Metrics sections of the documentation.

The training example is a binary segmentation task on the Oxford-IIIT Pet Dataset (the code downloads it automatically). Here are two samples from the dataset:

Finally, these are all the steps to perform this type of segmentation task:

1. Build the model.

Set the last layer activation function based on the loss function you plan to use.
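
To see why the activation has to match the loss, the sketch below compares two equivalent setups in plain PyTorch: raw logits passed to BCEWithLogitsLoss, and sigmoid outputs passed to BCELoss. The tensors are made-up stand-ins for a model output and a mask, not data from this article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(2, 1, 8, 8)                 # stand-in for raw model output
target = (torch.rand(2, 1, 8, 8) > 0.5).float()  # stand-in for a binary mask

# Option 1: activation=None in the model, loss applies the sigmoid internally.
loss_from_logits = nn.BCEWithLogitsLoss()(logits, target)

# Option 2: sigmoid activation in the model, loss expects probabilities.
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), target)

print(float(loss_from_logits), float(loss_from_probs))
```

Both options give the same value. Applying a sigmoid in the model and then using a loss that expects logits would apply the activation twice and silently distort training.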

2. Define the parameters.

Remember that when you use a pretrained encoder, the input must be normalized using the mean and standard deviation of the data on which the encoder was pretrained.
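
As a minimal sketch of that normalization, the block below applies the commonly used ImageNet channel statistics with NumPy. The mean and standard deviation are the standard ImageNet values for images scaled to [0, 1]. The normalize helper and the random image are illustrative assumptions, and encoders pretrained on other data would need their own statistics.

```python
import numpy as np

# Standard ImageNet statistics for RGB images scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize(image_uint8):
    """Scale an HxWx3 uint8 image to [0, 1], then standardize each channel."""
    image = image_uint8.astype(np.float32) / 255.0
    return (image - IMAGENET_MEAN) / IMAGENET_STD


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)  # made-up image

normalized = normalize(image)
chw = normalized.transpose(2, 0, 1)  # PyTorch models expect channels first

print(chw.shape, chw.dtype)
```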

3. Define the train function.

Nothing here changes from the train function that you would have written to train a model without using the library.
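
For reference, a conventional PyTorch training function might look like the sketch below. The function name, its arguments, and the tiny convolutional stand-in model are assumptions made so the example runs on its own; a model built with the library would be passed in the same way.

```python
import torch
import torch.nn as nn


def train_one_epoch(model, loader, loss_fn, optimizer, device="cpu"):
    """Run one pass over the loader and return the mean batch loss."""
    model.train()
    total = 0.0
    for images, masks in loader:
        images, masks = images.to(device), masks.to(device)
        optimizer.zero_grad()
        logits = model(images)         # raw logits, no activation
        loss = loss_fn(logits, masks)
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / len(loader)


torch.manual_seed(0)
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in for a segmentation model
images = torch.randn(8, 3, 16, 16)                 # made-up inputs
masks = (images.mean(dim=1, keepdim=True) > 0).float()  # made-up binary masks
loader = [(images[i:i + 4], masks[i:i + 4]) for i in range(0, 8, 4)]

optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.BCEWithLogitsLoss()

losses = [train_one_epoch(model, loader, loss_fn, optimizer) for _ in range(20)]
print(losses[0], losses[-1])
```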

4. Define the validation function.

The true positives, false positives, false negatives, and true negatives of all batches are summed, and the metrics are computed only once, after the last batch. Note that logits must be converted to classes before the metrics can be computed. Call the train function to start training.
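
The accumulation of counts described above can be sketched in plain PyTorch as follows. The helper names are hypothetical, and the counts are computed by hand here rather than with the library's own metric functions, so treat this as an illustration of the idea rather than this article's exact code.

```python
import torch
import torch.nn as nn


def batch_stats(logits, masks, threshold=0.5):
    """Return (tp, fp, fn, tn) pixel counts for one batch of binary logits."""
    preds = (torch.sigmoid(logits) > threshold).long()  # logits -> classes
    masks = masks.long()
    tp = int(((preds == 1) & (masks == 1)).sum())
    fp = int(((preds == 1) & (masks == 0)).sum())
    fn = int(((preds == 0) & (masks == 1)).sum())
    tn = int(((preds == 0) & (masks == 0)).sum())
    return tp, fp, fn, tn


@torch.no_grad()
def validate(model, loader):
    """Sum the counts over all batches, then compute the metrics once."""
    model.eval()
    totals = [0, 0, 0, 0]
    for images, masks in loader:
        stats = batch_stats(model(images), masks)
        totals = [t + s for t, s in zip(totals, stats)]
    tp, fp, fn, tn = totals
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"iou": iou, "accuracy": accuracy}


# Identity "model" so the made-up inputs act directly as logits.
loader = [
    (torch.tensor([[2.0, -1.0], [3.0, -2.0]]), torch.tensor([[1, 0], [0, 0]])),
    (torch.tensor([[-1.0, 1.0], [1.0, -3.0]]), torch.tensor([[1, 1], [1, 0]])),
]
result = validate(nn.Identity(), loader)
print(result)
```

In this toy run the totals are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, giving an IoU of 0.6 and an accuracy of 0.75.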

5. Use the model.
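
Inference with a trained binary model can be sketched like this. The predict_mask helper and the stand-in convolution are assumptions so the block runs by itself; with a real model, only the model object would change.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def predict_mask(model, image, threshold=0.5):
    """Return a binary HxW mask for one CxHxW image tensor."""
    model.eval()
    logits = model(image.unsqueeze(0))  # add the batch dimension
    probs = torch.sigmoid(logits)       # logits -> probabilities
    mask = probs > threshold            # probabilities -> classes
    return mask.squeeze(0).squeeze(0).to(torch.uint8)


torch.manual_seed(0)
model = nn.Conv2d(3, 1, kernel_size=1)  # stand-in for a trained segmentation model
image = torch.randn(3, 32, 32)          # made-up, already normalized image

mask = predict_mask(model, image)
print(mask.shape, mask.dtype)
```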

These are some segmentations:

Concluding remarks

This library has everything you need to experiment with segmentation. It’s very easy to create a model and swap its components, and most loss functions and metrics are provided. Also, using this library doesn’t change the training pipeline we’re used to. See the official documentation for more information. I’ve also listed some of the most common encoders and architectures in the references.

The Oxford-IIIT Pet Dataset is available for download for commercial/research purposes under a Creative Commons Attribution-ShareAlike 4.0 International License. Copyright remains with the original owners of the images.

All images, unless otherwise stated, are by the author. Thanks for reading, I hope you found this useful.

[1] O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation (2015)

[2] Z. Zhou, Md. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, UNet++: A Nested U-Net Architecture for Medical Image Segmentation (2018)

[3] L. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking Atrous Convolution for Semantic Image Segmentation (2017)

[4] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (2018)

[5] R. Li, S. Zheng, C. Duan, C. Zhang, J. Su, P. M. Atkinson, Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images (2020)

[6] A. Chaurasia, E. Culurciello, LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation (2017)

[7] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature Pyramid Networks for Object Detection (2017)

[8] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid Scene Parsing Network (2016)

[9] H. Li, P. Xiong, J. An, L. Wang, Pyramid Attention Network for Semantic Segmentation (2018)

[10] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)

[11] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition (2015)

[12] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated Residual Transformations for Deep Neural Networks (2016)

[13] J. Hu, L. Shen, S. Albanie, G. Sun, E. Wu, Squeeze-and-Excitation Networks (2017)

[14] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely Connected Convolutional Networks (2016)

[15] M. Tan, Q. V. Le, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (2019)

[16] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, P. Luo, SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers (2021)

