# Positional Normalization (PONO)

Post on 11-Feb-2020


Positional Normalization

Boyi Li^{1,2,∗}, Felix Wu^{1,∗}, Kilian Q. Weinberger^{1}, Serge Belongie^{1,2}

^{1} Cornell University, ^{2} Cornell Tech

{bl728, fw245, kilian, sjb344}@cornell.edu

Abstract

A popular method to reduce the training time of deep neural networks is to normalize activations at each layer. Although various normalization schemes have been proposed, they all follow a common theme: normalize across spatial dimensions and discard the extracted statistics. In this paper, we propose an alternative normalization method that noticeably departs from this convention and normalizes exclusively across channels. We argue that the channel dimension is naturally appealing as it allows us to extract the first and second moments of features extracted at a particular image position. These moments capture structural information about the input image and extracted features, which opens a new avenue along which a network can benefit from feature normalization: instead of disregarding the normalization constants, we propose to re-inject them into later layers to preserve or transfer structural information in generative networks.

1 Introduction

Figure 1: The mean µ and standard deviation σ extracted by PONO at different layers of VGG-19 capture structural information from the input images. (Panels: Input, Conv1_2, Conv2_2, Conv3_4, Conv4_4; rows show µ and σ.)

A key innovation that enabled the undeniable success of deep learning is the internal normalization of activations. Although normalizing inputs had always been one of the “tricks of the trade” for training neural networks [38], batch normalization (BN) [28] extended this practice to every layer, which turned out to have crucial benefits for deep networks. While the success of normalization methods was initially attributed to “reducing internal covariate shift” in hidden layers [28, 40], an array of recent studies [1, 2, 4, 24, 47, 58, 67, 75] has provided evidence that BN changes the loss surface and prevents divergence even with large step sizes [4], which accelerates training [28].

Multiple normalization schemes have been proposed, each with its own set of advantages: Batch normalization [28] benefits training of deep networks primarily in computer vision tasks. Group normalization [72] is often the first choice for small mini-batch settings such as object detection and instance segmentation tasks. Layer Normalization [40] is well suited to sequence models, common in natural language processing. Instance normalization [66] is widely used in image synthesis owing to its apparent ability to remove style information from the inputs. However, all aforementioned normalization schemes follow a common theme: they normalize across spatial dimensions and discard the extracted statistics. The philosophy behind their design is that the first two moments are considered expendable and should be removed.

∗: Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 2: Positional Normalization together with previous normalization methods. (Panels: Batch Normalization, Instance Normalization, Group Normalization, Layer Normalization, Positional Normalization.) Each subplot shows a feature map tensor, with B as the batch axis, C as the channel axis, and (H, W) as the spatial axes. The entries colored in green or blue (ours) are normalized by the same mean and standard deviation. Unlike previous methods, our method processes each position independently and computes both statistics across the channels.

In this paper, we introduce Positional Normalization (PONO), which normalizes the activations at each position independently across the channels. The extracted mean and standard deviation capture the coarse structural information of an input image (see Figure 1). Although removing the first two moments does benefit training, it also eliminates important information about the image, which, in the case of a generative model, would have to be painfully relearned in the decoder. Instead, we propose to bypass and inject the two moments into a later layer of the network, which we refer to as a Moment Shortcut (MS) connection.
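The two ideas in this paragraph can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' released implementation: the function names `pono` and `moment_shortcut`, the `eps` stabilizer, and the choice to re-inject σ as a scale and µ as a shift are all assumptions made for the example.

```python
import numpy as np

def pono(x, eps=1e-5):
    """Normalize x of shape (B, C, H, W) across the channel axis only.

    Returns the normalized tensor plus the per-position mean and std,
    each of shape (B, 1, H, W). `eps` is an assumed stabilizer.
    """
    mean = x.mean(axis=1, keepdims=True)
    std = np.sqrt(x.var(axis=1, keepdims=True) + eps)
    return (x - mean) / std, mean, std

def moment_shortcut(x, mean, std):
    """Re-inject previously extracted moments into later features.

    Assumed form for illustration: std acts as a scale, mean as a shift.
    """
    return x * std + mean

rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(2, 8, 4, 4))

normed, mu, sigma = pono(feats)                # statistics removed here ...
restored = moment_shortcut(normed, mu, sigma)  # ... and injected back later
```

In this toy round trip nothing happens between the two calls, so `restored` matches `feats` up to floating-point error. In a real encoder-decoder, the moments would be taken from an early layer and applied to decoder features of the same spatial size.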

PONO is complementary to previously proposed normalization methods (such as BN) and as such can and should be applied jointly. We provide evidence that PONO has the potential to substantially enhance the performance of generative models and can exhibit favorable stability throughout the training procedure in comparison with other methods. PONO is designed to deal with spatial information, primarily targeted at generative [19, 29] and sequential models [23, 32, 56, 63]. We explore the benefits of PONO with MS in several initial experiments across different model architectures and image generation tasks and provide code online at https://github.com/Boyiliee/PONO.

2 Related Work

Normalization is generally applied to improve convergence speed during training [50]. Normalization methods for neural networks can be roughly categorized into two regimes: normalization of weights [49, 53, 57, 71] and normalization of activations [28, 30, 36, 40, 46, 48, 59, 66, 72]. In this work, we focus on the latter.

Given the activations $X \in \mathbb{R}^{B \times C \times H \times W}$ (where B denotes the batch size, C the number of channels, H the height, and W the width) in a given layer of a neural net, the normalization methods differ in the dimensions over which they compute the mean and variance (see Figure 2). In general, activation normalization methods compute the mean µ and standard deviation (std) σ of the features in their own manner, normalize the features with these statistics, and optionally apply an affine transformation with parameters β (new mean) and γ (new std). This can be written as

$$X'_{b,c,h,w} = \gamma \left( \frac{X_{b,c,h,w} - \mu}{\sigma} \right) + \beta. \tag{1}$$

Batch Normalization (BN) [28] computes µ and σ across the B, H, and W dimensions. BN increases the robustness of the network with respect to high learning rates and weight initializations [4], which in turn drastically improves the convergence rate. Synchronized Batch Normalization treats features of mini-batches across multiple GPUs like a single mini-batch. Instance Normalization (IN) [66] treats each instance in a mini-batch independently and computes the statistics across only the spatial dimensions (H and W). IN is a small change to the stylization architecture that results in a significant qualitative improvement in the generated images. Layer Normalization (LN) normalizes all features of an instance within a layer jointly, i.e., it calculates the statistics over the C, H, and W dimensions. LN is beneficial in natural language processing applications [40, 68]. Notably, none of the aforementioned methods normalizes the information at different spatial positions independently. This limitation gives rise to our proposed Positional Normalization.
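To make the difference between these schemes concrete, the sketch below applies the normalization formula (1) with the reduction axes named in the text. It is a minimal NumPy illustration, not a reference implementation. The helper `normalize`, the table `STAT_AXES`, the `eps` term, and the scalar defaults for γ and β are assumptions made for the example. Group Normalization is omitted because it also needs a channel-grouping step.

```python
import numpy as np

# Axes of a (B, C, H, W) tensor over which each method computes mu and sigma.
STAT_AXES = {
    "batch_norm": (0, 2, 3),     # across B, H, W
    "instance_norm": (2, 3),     # across H, W
    "layer_norm": (1, 2, 3),     # across C, H, W
    "positional_norm": (1,),     # across C only
}

def normalize(x, method, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply gamma * (x - mu) / sigma + beta with method-specific statistics.

    `eps` is an assumed numerical stabilizer; gamma and beta default to the
    identity affine transformation.
    """
    axes = STAT_AXES[method]
    mu = x.mean(axis=axes, keepdims=True)
    sigma = np.sqrt(x.var(axis=axes, keepdims=True) + eps)
    return gamma * (x - mu) / sigma + beta, mu, sigma

x = np.random.default_rng(0).normal(size=(4, 3, 5, 5))
for name in STAT_AXES:
    _, mu, _ = normalize(x, name)
    print(name, mu.shape)
```

For this input of shape (4, 3, 5, 5), the statistics have shape (1, 3, 1, 1) for BN, (4, 3, 1, 1) for IN, (4, 1, 1, 1) for LN, and (4, 1, 5, 5) for positional normalization. Only the last keeps one value per spatial position.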

Batch Normalization introduces two learned parameters β and γ to allow the model to adjust the mean and std of the post-normalized features. Specifically, $\beta, \gamma \in \mathbb{R}^{C}$ are channel-wise parameters. Conditional instance normalization (CIN) [15] keeps a set of parameter pairs $\{(\beta_i, \gamma_i) \mid i \in \{1, \dots, N\}\}$, which enables the model to have N different behaviors conditioned on a style class label i. Adaptive instance normalization (AdaIN) [26] generalizes this to an infinite number of styles by using the µ and σ of IN borrowed from another image as the β and γ. Dynamic Layer Normalization (DLN) [35] relies on a neural network to generate the β and γ. Later works [27, 33] refine AdaIN and generate the β and γ of AdaIN dynamically using a dedicated neural network. Conditional batch normalization (CBN) [10] follows a similar spirit and uses a neural network that takes text as input to predict the residual of β and γ, which is shown to be beneficial to visual question answering models.
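The AdaIN idea described above, reusing the instance statistics of another image as β and γ, can be sketched as follows. This is a hedged NumPy illustration rather than the implementation from [26]. The names `instance_stats` and `adain`, the `eps` term, and the random stand-in tensors are all assumptions.

```python
import numpy as np

def instance_stats(x, eps=1e-5):
    """Per-instance, per-channel mean and std over (H, W); shape (B, C, 1, 1)."""
    mu = x.mean(axis=(2, 3), keepdims=True)
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
    return mu, sigma

def adain(content, style, eps=1e-5):
    """Instance-normalize `content`, then use the style statistics as the
    new std (gamma) and new mean (beta)."""
    mu_c, sigma_c = instance_stats(content, eps)
    mu_s, sigma_s = instance_stats(style, eps)
    return sigma_s * (content - mu_c) / sigma_c + mu_s

rng = np.random.default_rng(0)
content = rng.normal(0.0, 1.0, size=(1, 4, 8, 8))  # stand-in content features
style = rng.normal(5.0, 3.0, size=(1, 4, 8, 8))    # stand-in style features
out = adain(content, style)
```

After the call, each channel of `out` carries the mean and (approximately) the std of the matching `style` channel. Because β and γ have shape (B, C, 1, 1) here, they are shared across all spatial positions.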

Notably, all aforementioned methods generate β and γ as vectors, shared across spatial positions. In contrast, Spatially Adaptive Denormalization (SPADE) [52], an extension of Synchronized Batch Normalization with dynamically predicted weights, generates the spatially dependent $\beta, \gamma \in \mathbb{R}^{B \times C \times H \times W}$ using a two-layer ConvNet with raw images as inputs. Finally, we introduce shortcut connections to transfer the first and second moments from early to later layers. Similar skip connections (with add, concat operations) have been introduced in ResNets [20] and DenseNets [25] and earlier works [3, 23, 34, 54, 62], and are highly effective at improving network optimization and convergence properties [43].

3 Positional Normalization and Moment Shortcut

Figure 3: PONO statistics (µ and σ) of DenseBlock-3 of a pretrained DenseNet-161.

Prior work has shown that feature normalization has a strong beneficial effect on the convergence behavior of neural networks [4]. Although we agree with these findings, in this paper we claim that removing the first and second order information at multiple stages throughout the network may also deprive the deep net of potentially useful information, particularly in the context of generative models, where a plausible image needs to be generated.

PONO. Our normalization scheme, which we refer to as Positional Normalization (PONO), differs from prior work in that we normalize exclusively over the channels at any given fixed pixel location (see Figure 2). Consequently, the extracted statistics are position dependent and reveal structural inform