InplaceABN Backward Error (EnjoyCodingAndGame's blog)


While recently modifying a network that contains InplaceABN modules, I ran into the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64, 256, 7, 7]], which is output 0 of InPlaceABNBackward, is at version 3; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

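When I first used InplaceABN I had not studied the paper or the code, so I spent several hours on this problem, trying fixes blindly. I knew the error came from consecutive in-place operations, but I could not pin down which block and which piece of code triggered it, and I kept adding clone() in the wrong places. The next day I read through the GitHub issues and finally understood the real cause.

1. Blocks provided by InplaceABN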

ABN is standard BN + activation (no memory savings).
InPlaceABN is BN+activation done inplace (with memory savings).
InPlaceABNSync is BN+activation done inplace (with memory savings) + computation of BN (fwd+bwd) with data from all the gpus.
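To make the first line of that list concrete, here is a minimal sketch of what "BN + activation with no memory savings" looks like in plain PyTorch. It does not use the inplace_abn package; the helper name make_abn is hypothetical, and the leaky ReLU slope of 0.01 is an assumption chosen for illustration.

```python
import torch
import torch.nn as nn


def make_abn(num_features, slope=0.01):
    """Hypothetical helper: BatchNorm followed by an activation, both out of place.

    This mirrors the idea of ABN (standard BN + activation, no memory savings).
    The slope value is an assumption for illustration only.
    """
    return nn.Sequential(
        nn.BatchNorm2d(num_features),
        nn.LeakyReLU(negative_slope=slope),
    )


abn = make_abn(256)
x = torch.randn(64, 256, 7, 7)  # same shape as the tensor in the error message
y = abn(x)                      # y is a new tensor; x is left untouched
print(y.shape)
```

The in-place variants fuse these two steps and overwrite the input buffer, which is where the memory saving comes from and also why later in-place edits of the result can break backward.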

2. Inplace shortcut

Change out += residual to out = out + residual.

Both += and add_() are in-place operations.

The problem I ran into was that, inside the ResidualBlock, InplaceABN and add_ were two consecutive in-place operations.

3. Solution
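A small sketch of the difference, assuming a recent PyTorch. It reads the tensor's _version attribute, which is an internal counter that autograd uses to detect in-place changes (the "is at version 3; expected version 1" part of the error message refers to it). Because it is internal, treat it as a debugging aid rather than a stable API.

```python
import torch

t = torch.zeros(3)
ptr_before = t.data_ptr()
v0 = t._version            # version counter before any modification

t += 1                     # in-place: same storage, version bumped
t.add_(1)                  # in-place: same storage, version bumped again
v_inplace = t._version
same_storage = t.data_ptr() == ptr_before

u = t + 1                  # out-of-place: allocates a new tensor
v_after = t._version       # t itself is unchanged, so its version stays put

print(v_inplace - v0, same_storage, v_after - v_inplace)
```

The two in-place calls each advance the counter on t, while t + 1 leaves t alone and returns a fresh tensor.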
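The failure can be reproduced without installing inplace_abn. The sketch below is a toy block, not the library's code: torch.sigmoid stands in for InPlaceABN because it also keeps its output for the backward pass, so overwriting that output with += triggers the same class of RuntimeError. The class name ToyResidualBlock and the flag inplace_shortcut are hypothetical.

```python
import torch
import torch.nn as nn


class ToyResidualBlock(nn.Module):
    """Hypothetical residual block; sigmoid is a stand-in for InPlaceABN."""

    def __init__(self, channels, inplace_shortcut):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.inplace_shortcut = inplace_shortcut

    def forward(self, x):
        residual = x
        # sigmoid saves its output for backward, like the InPlaceABN output
        out = torch.sigmoid(self.conv(x))
        if self.inplace_shortcut:
            out += residual        # overwrites the saved output
        else:
            out = out + residual   # new tensor; saved output stays intact
        return out


def backward_ok(block):
    """Return True if forward + backward run without a RuntimeError."""
    x = torch.randn(2, 4, 7, 7)
    try:
        block(x).sum().backward()
        return True
    except RuntimeError:
        return False


torch.manual_seed(0)
broken = backward_ok(ToyResidualBlock(4, inplace_shortcut=True))
fixed = backward_ok(ToyResidualBlock(4, inplace_shortcut=False))
print(broken, fixed)
```

In this toy version the in-place shortcut fails during backward and the out-of-place shortcut succeeds, which matches the fix of replacing out += residual with out = out + residual.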

reference:

https://github.com/mapillary/inplace_abn/issues/6

inplace_abn/resnet.py at main · mapillary/inplace_abn · GitHub

inplace_abn/residual.py at main · mapillary/inplace_abn · GitHub
