# Tech / Tricks
## Data
### Data augmentation

Horizontal flipping, random crops, color jittering.
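A minimal sketch of these three augmentations using torchvision's transforms API (the crop size, padding, and jitter strengths are illustrative choices, not prescriptions from the notes):

```python
import torchvision.transforms as T

# Typical training-time augmentation pipeline:
# random crop (with padding), horizontal flip, color jitter.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),        # random crops
    T.RandomHorizontalFlip(p=0.5),      # horizontal flipping
    T.ColorJitter(brightness=0.4,       # color jittering
                  contrast=0.4,
                  saturation=0.4),
    T.ToTensor(),
])
```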
### PCA Whitening

Decorrelates features and rescales them to unit variance; truncating to the top principal components also reduces dimensionality.
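A NumPy sketch of PCA whitening, assuming rows are samples; passing `k` keeps only the top-k components, which is where the dimensionality reduction comes from (`k` and `eps` are illustrative):

```python
import numpy as np

def pca_whiten(X, k=None, eps=1e-5):
    """Decorrelate features and scale each component to unit variance."""
    X = X - X.mean(axis=0)                   # center the data
    cov = X.T @ X / X.shape[0]               # feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if k is not None:                        # optional dimension reduction
        eigvals, eigvecs = eigvals[:k], eigvecs[:, :k]
    # Project onto the eigenvectors, then rescale to unit variance.
    return (X @ eigvecs) / np.sqrt(eigvals + eps)
```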
i.i.d. (independent and identically distributed): the standard assumption that training samples are drawn independently from the same distribution.
## Regularization

Penalize large weights in the loss function (e.g., an L2 penalty, a.k.a. weight decay).

Solves: overfitting.
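A minimal sketch of an explicit L2 penalty added to the task loss; in PyTorch the same effect is usually obtained through the optimizer's `weight_decay` argument (`lam` and the helper name are mine):

```python
import torch

def l2_regularized_loss(loss, model, lam=1e-4):
    """Add an L2 penalty on all model weights to the task loss."""
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return loss + lam * l2

# Equivalent in spirit:
# torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```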
## Normalization

Normalize layer inputs (e.g., batch normalization).

Solves: internal covariate shift.
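A NumPy sketch of the batch normalization computation at training time; `gamma`, `beta`, and `eps` follow the usual notation, and the running statistics used at inference are omitted:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch per feature, then rescale and shift.

    x: (batch_size, n_features)
    """
    mean = x.mean(axis=0)               # per-feature batch mean
    var = x.var(axis=0)                 # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta         # learnable scale and shift
```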
## Skip Connection
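A skip (residual) connection adds a layer's input directly to its output, y = x + F(x). A minimal PyTorch sketch of a residual block (the layer sizes and layout are illustrative):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the identity path lets gradients flow directly."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))  # skip connection
```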
### Residual scaling
If the number of filters exceeded 1000, the residual variants started to exhibit instabilities and the network simply "died" early in training, meaning that the last layer before the average pooling started to produce only zeros after a few tens of thousands of iterations (observed in the Inception-v4 / Inception-ResNet paper, Szegedy et al., 2016).

Scaling down the residuals before adding them to the previous layer's activation seemed to stabilize the training.
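A minimal sketch of the scaled residual addition; the Inception-v4 paper reports scaling factors between roughly 0.1 and 0.3 worked, and `scale=0.1` here is one such choice:

```python
import torch.nn as nn

class ScaledResidual(nn.Module):
    """Scale the residual branch before adding it to the identity path."""
    def __init__(self, branch, scale=0.1):
        super().__init__()
        self.branch = branch   # any residual function F(x)
        self.scale = scale     # e.g., 0.1-0.3 per Inception-v4

    def forward(self, x):
        return x + self.scale * self.branch(x)
```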
### Visualizing gradient vanishing

Skip connections give gradients a direct path back to earlier layers; plotting per-layer gradient norms makes the vanishing (or its absence) visible.
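A small helper for this inspection, assuming a PyTorch model whose gradients have already been populated by `loss.backward()` (the helper name is mine):

```python
def gradient_norms(model):
    """Collect per-parameter gradient norms after loss.backward().

    Plotting these by depth shows whether early layers still
    receive a usable gradient signal.
    """
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters()
            if p.grad is not None}
```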
## Batch size

"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" (Goyal et al., 2017) addresses the optimization difficulties of large minibatches in distributed synchronous stochastic gradient descent (SGD), chiefly via a linear learning-rate scaling rule and a gradual warmup phase.
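A minimal sketch of that recipe: scale the learning rate linearly with the batch size, and ramp up to the scaled rate over the first few epochs (the base values and warmup length here are illustrative, not quoted from the paper):

```python
def scaled_lr(epoch, step, steps_per_epoch,
              base_lr=0.1, base_batch=256, batch=8192, warmup_epochs=5):
    """Linear scaling rule with gradual warmup (Goyal et al., 2017).

    Target LR is base_lr * (batch / base_batch); during the first
    warmup_epochs it ramps linearly up from base_lr.
    """
    target_lr = base_lr * batch / base_batch   # linear scaling rule
    if epoch < warmup_epochs:                  # gradual warmup
        progress = ((epoch * steps_per_epoch + step)
                    / (warmup_epochs * steps_per_epoch))
        return base_lr + progress * (target_lr - base_lr)
    return target_lr
```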