Reproduction of: “W-Net: A Deep Model for Fully Unsupervised Image Segmentation”

Joseph Sherman
Apr 16, 2021

- Georgios Apostolides (5377498) & Michael Joseph Sherman (5248558)

This is a blog post about our attempt to reproduce the paper “W-Net: A Deep Model for Fully Unsupervised Image Segmentation” [1]. The code for the reproduction can be found here: https://github.com/Joesher15/W_net_reproduction

We will first explain the purpose of the paper and how we approached its reproduction, as well as some repositories which we found useful and upon which we based our reproduction attempt. We emphasize the difficulties we encountered while reproducing the paper; these include some ambiguities in the paper as well as hyperparameter definitions that are missing from the paper itself.

Motivation

The paper addresses unsupervised image segmentation using a novel architecture called “W-Net”. But first, let’s explain what unsupervised segmentation is. Semantic segmentation refers to the task of finding regions of pixels in an image that correspond to the same class of object. Unsupervised image segmentation performs this task without using pixel-level labels. You might ask, “Why is unsupervised image segmentation important?”. The reason is that pixel-level labels are usually difficult and expensive to obtain, as they require substantial human effort. With unsupervised algorithms, we can train a model to segment images without the need for labels.

Theory

Novel Architecture

The paper proposes a novel architecture, the “W-Net”, which performs exactly what we described above: unsupervised image segmentation. The W-Net is composed of two major parts, the encoder (U_Enc) and the decoder (U_Dec), which together form an auto-encoder. The encoder receives as input an image of dimensions 224×224 and is responsible for outputting the segmented image. The decoder receives as input the output of the encoder and produces a reconstruction of the original image. The shape of both the encoder and the decoder is based on the U-Net architecture, which was proposed for supervised image segmentation of biomedical images [2].

The whole network consists of 46 convolution layers organized in 18 modules, which are shown by a dashed orange line in Figure 1. Every module consists of two convolutional layers and two ReLU activations (one following each convolution).

Fig. 1: Novel architecture of W-net. [1]

The W-Net architecture is formed by stacking two U-Net architectures next to each other. Each U-Net consists of a contracting (downhill) path and an expansive (uphill) path. As in the originally proposed U-Net architecture, there are skip connections from the contracting to the expansive path for both U_Enc and U_Dec.

More details about the architecture of W-Net can be found in the original paper [1]; however, we want to highlight two important aspects of the architecture:

  1. Just before the intersection of U_Enc and U_Dec (the 9th module), a 1×1 convolution maps the 64 channels of the previous convolution to a desired number of classes K. K is the number of classes into which the input image is segmented and is a hyperparameter that needs to be defined.
  2. All modules except the outermost ones (i.e. 1, 9, 10, 18) use depthwise separable convolutions (DSC). A nice blog post about DSC and its constituents can be found in [3]. The purpose of using DSC is to examine correlations between color channels and spatial positions separately. A minimal sketch of such a module is shown below.
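To make the module structure concrete, here is a minimal PyTorch sketch of a single W-Net module with depthwise separable convolutions, together with the 1×1 convolution that produces the K-class prediction at the end of U_Enc. This is an illustrative sketch based on our reading of the paper, not the authors’ code; the channel sizes, the `separable` switch, and the softmax placement are our own assumptions.

```python
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable convolution: a depthwise 3x3 conv followed by a
    pointwise 1x1 conv that mixes information across channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class WNetModule(nn.Module):
    """One W-Net module: two convolutions, each followed by a ReLU.
    The outermost modules (1, 9, 10, 18) use plain convolutions instead."""
    def __init__(self, in_ch, out_ch, separable=True):
        super().__init__()
        conv = SeparableConv2d if separable else \
            (lambda i, o: nn.Conv2d(i, o, kernel_size=3, padding=1))
        self.block = nn.Sequential(
            conv(in_ch, out_ch), nn.ReLU(inplace=True),
            conv(out_ch, out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# End of U_Enc: a 1x1 convolution maps the 64 feature channels to K classes,
# and a softmax turns them into per-pixel class probabilities.
K = 24  # hyperparameter; not specified in the paper (we tried 24 and 64)
to_classes = nn.Sequential(nn.Conv2d(64, K, kernel_size=1), nn.Softmax(dim=1))
```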

Optimization

The optimization of the above network is done using two loss functions:

  1. Soft N-cut Loss:

The soft N-cut loss is used to optimize only the encoder part of the network. The 1×1 convolution gives a K-class prediction for every pixel, denoting the probability of the pixel belonging to each class. By taking the argmax we can determine to which class each pixel belongs. The criterion on which image segmentation is based in the paper is the normalized cut (N-cut) function, eq. (1). More information about N-cut functions in image segmentation can be found in [4].

Eq. 1
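Since the equation image is not reproduced here, below is our LaTeX rendering of the normalized-cut criterion as we read it from [1] and [4], where V is the set of all pixels, A_k is the set of pixels assigned to class k, and w(u, v) measures the similarity between pixels u and v:

```latex
\mathrm{Ncut}_K(V)
  = \sum_{k=1}^{K} \frac{\mathrm{cut}(A_k, V - A_k)}{\mathrm{assoc}(A_k, V)}
  = \sum_{k=1}^{K}
    \frac{\sum_{u \in A_k,\, v \in V - A_k} w(u, v)}
         {\sum_{u \in A_k,\, t \in V} w(u, t)}
```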

The problem with the function in eq. (1) is that its argmax cannot be differentiated, so the weights cannot be updated during backpropagation. To make the criterion differentiable, and thus allow backpropagation through the network, it is relaxed into the soft variant shown in eq. (2).

Eq. 2
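Our rendering of the soft variant from [1], where p(u = A_k) is the encoder’s predicted probability that pixel u belongs to class k:

```latex
J_{\text{soft-Ncut}}(V, K)
  = K - \sum_{k=1}^{K}
    \frac{\sum_{u \in V} \sum_{v \in V} w(u, v)\, p(u = A_k)\, p(v = A_k)}
         {\sum_{u \in V} p(u = A_k) \sum_{t \in V} w(u, t)}
```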

2. Reconstruction Loss:

Auto-encoder architectures, such as the one presented here, are trained by penalizing a reconstruction loss. The reconstruction loss for the above network is given in eq. (3). The underlying reason for using it is to ensure that the segmentation produced by the encoder part of the network (U_Enc) remains closely associated with the image passed to the network. The reconstruction loss is therefore optimized over the entire network architecture, and it is calculated between the input image and the output of the decoder, which in turn takes its input from the encoder.

Eq. 3
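Our rendering of the reconstruction loss from [1], where X is the input image and W_Enc, W_Dec are the encoder and decoder weights:

```latex
J_{\text{reconstr}}
  = \left\| X - U_{Dec}\!\left(U_{Enc}(X; W_{Enc});\, W_{Dec}\right) \right\|_2^2
```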

A detailed explanation of the loss functions and their individual variables can be found in the W-Net paper [1]; here we only give the basic intuition.

The algorithm below (Algorithm 1) presents the optimization loop performed during training, which considers both the reconstruction loss and the soft N-cut loss. We emphasize that the weights of the encoder are updated twice per iteration: once when minimizing J_soft-Ncut and once when minimizing the reconstruction loss J_reconstr.

Fig. 2: Training Loop Pseudocode as presented in W-net Paper [1]
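As a concrete illustration of this double update, here is a minimal PyTorch sketch of one training iteration using two optimizers, in the spirit of how we implemented it. The names `wnet.encode`, `soft_n_cut_loss`, and the optimizer arguments are our own placeholders, not the paper’s or any repository’s API.

```python
import torch

def train_step(wnet, x, soft_n_cut_loss, enc_optimizer, all_optimizer):
    """One iteration of Algorithm 1 (sketch): update the encoder with the
    soft N-cut loss, then the whole network with the reconstruction loss.

    Assumptions: `wnet.encode(x)` returns per-pixel class probabilities
    (U_Enc output) and `wnet(x)` returns the reconstructed image;
    `soft_n_cut_loss` implements eq. (2)."""
    # 1) Encoder-only update with the soft N-cut loss (eq. 2).
    enc_optimizer.zero_grad()
    seg_probs = wnet.encode(x)
    j_ncut = soft_n_cut_loss(x, seg_probs)
    j_ncut.backward()
    enc_optimizer.step()

    # 2) Whole-network update with the reconstruction loss (eq. 3).
    all_optimizer.zero_grad()
    x_rec = wnet(x)
    j_rec = torch.mean((x - x_rec) ** 2)  # squared-error reconstruction term
    j_rec.backward()
    all_optimizer.step()
    return j_ncut.item(), j_rec.item()
```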

Post-Processing

After obtaining the segmentation prediction from the encoder part of the network, the authors apply a series of post-processing steps: conditional random field (CRF) smoothing and hierarchical segmentation. It is important to emphasize that in our reproduction attempt we did not include these post-processing steps.

The authors defend the use of CRF by explaining that the large receptive fields caused by max pooling and convolutional layers can reduce localization accuracy and thus lead to poor object boundaries. More details about the implementation of CRF can be found in [5].

The output from the CRF is fed to the next post-processing step, hierarchical segmentation. The goal of hierarchical segmentation is to convert the over-segmented output of the CRF into a simpler segmentation by merging similar regions. More details about how the hierarchical segmentation was implemented can be found in the paper and in [6].

How to train (your dragon) / Experimental Setup

In this section, we discuss the details of training, the experimental setup, the hyperparameters that need to be set, and the datasets that were used for training and testing the algorithm.

Datasets

The dataset used for training the network is from the PASCAL Visual Object Classes Challenge 2012 (VOC2012). VOC2012 consists of 17 125 images as downloaded from [7]; however, the paper mentions 11 530 images, which raises some questions about whether the authors indeed used the training set they said they used. Additionally, it needs to be highlighted that the paper does not specify how the training data were further split into training and validation sets. We took the liberty of using an 80%/20% split for our experiments. The training data also contain segmentation labels for each image; however, since the proposed algorithm is an unsupervised image segmentation algorithm, no ground-truth data were used during training.

For the test data, we use two datasets: the Berkeley Segmentation Dataset BSDS300 and BSDS500 [8, 9]. They consist of 300 and 500 images respectively, and BSDS500 was essentially built from BSDS300 by adding 200 new images. A problem we faced while downloading the two datasets was that, in the case of BSDS300, the ground truths used to assess the validity of our predictions are given per human annotator, so there are multiple variants of the ground truth. To avoid any confusion or ambiguity about which annotator to pick, we used the ground truths of the BSDS500 dataset, as it is the extended version of BSDS300 and the 300 ground truths are already contained in the BSDS500 ground truths.

Hyperparameters

Some of the hyperparameters used while training the network are mentioned by the authors, but as said earlier we believe that some parameters which are significant for reproducing the results have been omitted. The authors state that, prior to feeding the images to the network, all images were resized to 224×224. The images are passed to the network in mini-batches of 10. The learning rate used for training the network was set to 0.003; however, the authors do not mention which optimizer they used. We proceeded by employing the Adam optimizer with the proposed learning rate. The paper also proposes a learning schedule in which the learning rate is divided by 10 every 1000 iterations. The whole training process runs for 50 000 iterations, which for our training dataset translates into about 30 epochs. Finally, a dropout rate of 0.65 was used to regularize the network.

As mentioned earlier, the authors do not specify the value chosen for the hyperparameter K, i.e. the number of segmentation classes in the network architecture. For this reason, we took the liberty of experimenting with different values such as 64 and 24. Additionally, due to limited processing power, we resized the images to 128×128 instead of 224×224. Finally, it should be mentioned that we did implement the two optimizers, one for optimizing J_soft-Ncut and one for optimizing J_reconstr. We tried the learning-rate schedule specified above, but as expected the losses showed no further improvement after the 8th or 9th epoch because the learning rate had become extremely small. For this reason, we decided not to use the learning schedule proposed by the paper and went for a constant learning rate of 0.003.
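For clarity, the settings we ended up using can be summarized as a plain configuration dictionary. The key names below are illustrative and do not necessarily match our actual config.py.

```python
# Illustrative summary of the training settings described above; the key
# names are ours and need not match the real config.py in the repository.
config = {
    "input_size": (128, 128),   # paper uses 224x224; we downscaled for speed
    "batch_size": 10,
    "k_classes": 64,            # also tried 24; not specified in the paper
    "dropout": 0.65,
    "learning_rate": 3e-3,      # kept constant; paper divides by 10 every 1000 iterations
    "iterations": 50_000,       # roughly 30 epochs on our 80% training split
    "train_val_split": 0.8,     # 80% training / 20% validation
    "optimizer": "adam",        # optimizer not specified in the paper
}
```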

Summary of Paper Ambiguities

  • The training/validation split percentages are not mentioned.
  • The number of segmentation classes (K) used during the experiments is not defined.
  • The authors do not specify which optimizer they have used.
  • The learning rate looks far too small to reduce loss values on the order of J_reconstr = 17 500 and J_soft-Ncut = 19: with the proposed learning schedule, after 10 000 iterations (i.e. 1/5 of training) the learning rate has already dropped to 3×10^-13.

Summary of Reproduction Deviations from paper

  • We used an 80% 20% split for training/validation data.
  • Input images were resized to 128×128 instead of 224×224 due to processing-power limitations.
  • The desired number of classes K was not defined in the paper, so we used self-chosen values, e.g. 24 and 64.
  • We used a constant learning rate of 0.003 instead of the proposed learning schedule.
  • We did not implement the post-processing steps applied after the network’s prediction.

Results

In this section, we present the results of our reproduction attempt. We should first clarify that we do not expect to match the values presented in the paper, as we did not implement the post-processing steps suggested by the authors. The results of the paper as well as of our experiments are presented in Table 1. It should be clarified that the results in the paper are given at the Optimal Dataset Scale (ODS) and the Optimal Image Scale (OIS), which arise from the hierarchical segmentation post-processing step. Because we did not use hierarchical segmentation, and thus have no separate ODS and OIS values, we took the average of the paper’s two values for comparison.

Table 1: Shows the metric results from the experiments on BSDS500 and BSDS300

In all experimental setups, we used input images of dimensions 128×128 instead of 224×224, a dropout rate of 0.65, and a batch size of 10. 20% of the training data were kept as a validation set. The experiments we performed used either a constant or a decaying learning rate. Additionally, we varied the number of desired predicted classes K, as this was not given in the paper.

Fig 3: Demonstrates a comparison between some of the best results of our implementation (decaying learning rate = 0.003) and the paper.

Useful Repositories for Reproduction

To facilitate our reproduction, we first searched for existing implementations upon which we could build. Although several repositories attempted to reproduce the paper, we could not find one that offered a complete and correct reproduction of the W-Net paper.

The repository we chose to work with [10] had the network architecture correct, but its training loop optimized a reconstruction loss defined as a cross-entropy instead of the one proposed by the paper. It is important to highlight that its implementation of the soft N-cut loss was inefficient, so we replaced it with code from another repository [11]. Additionally, the training loop did not follow the algorithm proposed by the paper: both the reconstruction loss and the soft N-cut loss were used to optimize the weights of the whole network, whereas the paper uses the soft N-cut loss to update only the encoder and the reconstruction loss to update the whole network. This is something we implemented ourselves, as we found no other repository that had done so. Finally, the calculation of the metrics required by the paper was not present in the main W-Net repository we were following. For this reason, we used code from a separate repository, which can be found in [12]; this repository facilitates the calculation of Segmentation Covering (SC), Probabilistic Rand Index (PRI), and Variation of Information (VI).

Implementation Pipeline

In this section, we will provide some clarifications about our implementation to facilitate running our code. All the main scripts used during the pipeline are included in the codes directory in our repository.

All the hyperparameters for training and testing can be set in the config.py file. The config.py file also specifies the directories from which the datasets are loaded, as well as flags to enable or disable visualizations.

Start by downloading the BSDS500 and BSDS300 datasets [8, 9] and extracting them into the datasets directory. It is important to mention that the ground truths of BSDS500 are given as “.mat” files. Our prediction script uses “.npy” files, so a conversion function (BSDS500gt_to_npy.py) was written to perform this conversion. You can enable the conversion via the config.py file. It converts the ground truths to “.npy” and saves them in the data/converted_segmentations directory. Once this has been done, you can disable the flag, as the files will subsequently be loaded from the converted_segmentations directory.
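A minimal sketch of what such a conversion could look like is shown below. We assume the usual BSDS500 .mat layout, in which each file holds a groundTruth cell array with one Segmentation map per human annotator; the exact field access and the example paths are assumptions, and the actual BSDS500gt_to_npy.py may differ.

```python
import os
import glob
import numpy as np
from scipy.io import loadmat

def convert_bsds500_gt(mat_dir, out_dir):
    """Convert BSDS500 .mat ground-truth files to .npy (assumed layout)."""
    os.makedirs(out_dir, exist_ok=True)
    for mat_path in glob.glob(os.path.join(mat_dir, "*.mat")):
        gt = loadmat(mat_path)["groundTruth"]  # cell array, one entry per annotator
        segs = [gt[0, i]["Segmentation"][0, 0] for i in range(gt.shape[1])]
        name = os.path.splitext(os.path.basename(mat_path))[0]
        # Save all annotators' segmentation maps for this image as one array.
        np.save(os.path.join(out_dir, name + ".npy"), np.stack(segs))

# Example call (paths are placeholders):
# convert_bsds500_gt("datasets/BSDS500/data/groundTruth/test",
#                    "data/converted_segmentations")
```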

In the case of BSDS300, the ground truths are given as “.seg” files; you can use the converter (convertSegmentation.npy) already provided in the repository we used for our reproduction [10] to convert them into “.npy” files. An alternative, as mentioned before, is to copy the groundTruths folder from BSDS500, as it already contains the 300 ground truths we need, and place it inside the BSDS300 folder.

Fig. 4: Demonstrates a summary of our implementation pipeline.

The train.py file is used for training our architecture. It handles loading the datasets from their corresponding directories and running the training loop. At the end of training, a CSV file with the losses as well as the trained model are saved in the results directory. Additionally, a directory in results called latent_images contains images showing the segmentation and reconstruction results for the validation-set batches.

As shown in Fig. 4, predict.py takes the test set and performs predictions on it using the model saved during the training stage. The predictions of the model are saved in the results/test_set_predictions directory. The ground truths are also loaded in the predict.py script to facilitate visualizing the predictions and the ground truths next to each other.
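As an illustration of what this prediction step amounts to (not the actual predict.py), taking the argmax over the encoder’s K-channel output yields the per-pixel class map that is saved as a .npy file; `wnet.encode` is again our placeholder name for the U_Enc forward pass:

```python
import numpy as np
import torch

@torch.no_grad()
def predict_segmentation(wnet, image, out_path):
    """Illustrative prediction step (placeholder names, not predict.py itself).

    Assumes `wnet.encode` returns a (1, K, H, W) tensor of per-pixel class
    probabilities for a (1, 3, H, W) input image."""
    probs = wnet.encode(image)            # soft K-class assignment
    seg = probs.argmax(dim=1).squeeze(0)  # hard per-pixel labels, shape (H, W)
    np.save(out_path, seg.cpu().numpy())  # saved for metrics_evaluation.py-style use
    return seg
```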

Finally, the metrics_evaluation.py script loads the predictions made by predict.py and calculates the metrics given in the paper. A flag in the config.py file lets you either view the prediction, ground truth, and original image next to each other and inspect the dataset one image at a time, or calculate the metrics for the whole dataset without inspecting individual images.

Final Words

To conclude, we have partially reproduced the W-Net paper by creating a new variant of existing code bases, as described in the sections above: we retained useful snippets from existing repositories and added or modified sections according to the details of the paper. Critically, we believe that our addition of the separate update of the encoder using the soft N-cut loss and the update of the entire network using the reconstruction loss is an improvement over the existing repositories. As the final reproduction deliverables, we generated the tables with the evaluation metrics over the two test datasets, without implementing the post-processing described in the paper. Additionally, to see the effect of the ambiguous hyperparameter values, we ran several iterations of the entire training and testing loop to compare with the authors’ claims. In general, the results we obtained are not as good as the authors’, which could be attributed to the various hyperparameter choices we had to make without guidance from the paper, and, if the paper’s claim holds, also to the post-processing steps we omitted, which contribute substantially to the reported results.
Since we observed that many repositories claim to have reproduced this paper but none of them covers the complete pipeline from training to testing as given in the paper, we believe our work will help towards a complete reproduction. A major future extension of our work would be the addition of the post-processing steps.

Appendix

Some Interesting results of our reproduction:

Fig. 5: Demonstrates a comparison between our predictions (decaying learning rate = 0.003 and K=64) on the BSDS500 dataset and its ground truths.
Fig. 6: Demonstrates a comparison between our predictions (decaying learning rate = 0.003 and K=64) on the BSDS500 dataset and its ground truths.

References

[1]X. Xia and B. Kulis, “W-Net: A Deep Model for Fully Unsupervised Image Segmentation”, arXiv.org, 2021. [Online]. Available: https://arxiv.org/abs/1711.08506. [Accessed: 16- Apr- 2021].

[2]O. Ronneberger, P. Fischer and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation”, arXiv.org, 2021. [Online]. Available: https://arxiv.org/abs/1505.04597. [Accessed: 16- Apr- 2021].

[3]C. Versloot, “Understanding separable convolutions — MachineCurve”, MachineCurve, 2021. [Online]. Available: https://www.machinecurve.com/index.php/2019/09/23/understanding-separable-convolutions/. [Accessed: 16- Apr- 2021].

[4]J. Shi and J. Malik, “Normalized cuts and image segmentation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000. Available: 10.1109/34.868688 [Accessed 16 April 2021].

[5]L. Chen, G. Papandreou, I. Kokkinos, K. Murphy and A. Yuille, “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018. Available: 10.1109/tpami.2017.2699184.

[6]P. Arbeláez, M. Maire, C. Fowlkes and J. Malik, “Contour Detection and Hierarchical Image Segmentation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 898–916, 2011. Available: https://ieeexplore.ieee.org/document/5557884. [Accessed 16 April 2021].

[7]M. Everingham, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012)”, Host.robots.ox.ac.uk, 2021. [Online]. Available: http://host.robots.ox.ac.uk:8080/pascal/VOC/voc2012/. [Accessed: 16- Apr- 2021].

[8]”The Berkeley Segmentation Dataset and Benchmark (BSDS300)”, Www2.eecs.berkeley.edu, 2021. [Online]. Available: https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/. [Accessed: 16- Apr- 2021].

[9]”Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500)”, Www2.eecs.berkeley.edu, 2021. [Online]. Available: https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html. [Accessed: 16- Apr- 2021].

[10]G. Bishop, “W-Net-Pytorch”, GitHub, 2021. [Online]. Available: https://github.com/gr-b/W-Net-Pytorch. [Accessed: 16- Apr- 2021].

[11]F. Odom, “wnet-unsupervised-image-segmentation”, GitHub, 2021. [Online]. Available: https://github.com/fkodom/wnet-unsupervised-image-segmentation/blob/master/src/loss.py. [Accessed: 16- Apr- 2021].

[12]K. Haofei, “BSD500-Segmentation-Evaluator”, GitHub, 2021. [Online]. Available: https://github.com/KuangHaofei/BSD500-Segmentation-Evaluator. [Accessed: 16- Apr- 2021].
