Existing works have shown that fine-tuned textual transformer models achieve
state-of-the-art prediction performance but are also vulnerable to adversarial
text perturbations. Traditional adversarial evaluation is often carried out
\textit{only after} fine-tuning the models, ignoring the training data. In
this paper, we demonstrate that there is also a strong correlation between
the training data and model robustness. To this end, we extract 13 features
that capture a wide range of properties of the input fine-tuning corpora and
use them to predict the adversarial robustness of the fine-tuned models.
Focusing mostly on the encoder-only transformer models BERT and RoBERTa, with
additional results for BART, ELECTRA, and GPT2, we provide diverse evidence to
support our argument. First, empirical analyses show that (a) the extracted
features can be used with a lightweight classifier such as Random Forest to
effectively predict the attack success rate, and (b) the features that most
influence model robustness show a clear correlation with it.
Second, our framework can be used as a fast and effective additional tool for
robustness evaluation since it (a) reduces runtime by a factor of 30--193
compared to the traditional technique, (b) is transferable across models,
(c) can be used under adversarial training, and (d) is robust to statistical
randomness. Our code will be publicly available.
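
As a minimal sketch of the pipeline described above, the snippet below trains
a Random Forest on corpus-level features to predict the attack success rate.
The four features, the synthetic corpora, and the helper
\texttt{extract\_corpus\_features} are illustrative placeholders, not the
paper's actual 13 features or measured attack results; only the overall recipe
(corpus features in, attack success rate out) follows the abstract.

\begin{verbatim}
# Minimal sketch: predict attack success rate from corpus-level features.
# Features and data are illustrative placeholders, not the paper's
# actual 13 features or measurements.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def extract_corpus_features(corpus):
    """Hypothetical corpus-level properties (the paper uses 13)."""
    lengths = [len(text.split()) for text in corpus]
    vocab = {word for text in corpus for word in text.split()}
    return np.array([
        len(corpus),                        # number of examples
        float(np.mean(lengths)),            # mean example length
        float(np.std(lengths)),             # length dispersion
        len(vocab) / max(sum(lengths), 1),  # type-token ratio
    ])

# Synthetic fine-tuning corpora and made-up attack success rates,
# purely so the sketch runs end to end.
rng = np.random.default_rng(0)
words = ["good", "bad", "plot", "film", "slow", "great", "boring", "fun"]
corpora = [
    [" ".join(rng.choice(words, size=rng.integers(3, 12)))
     for _ in range(50)]
    for _ in range(40)
]
attack_success_rates = rng.uniform(0.2, 0.9, size=len(corpora))

# One feature row per corpus; the target is the attack success rate
# measured on the model fine-tuned on that corpus.
X = np.stack([extract_corpus_features(c) for c in corpora])
y = attack_success_rates

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
print("R^2 on held-out corpora:", model.score(X_te, y_te))
print("Feature importances:", model.feature_importances_)
\end{verbatim}

With the real 13 features and measured attack success rates in place of the
synthetic data, the same fit would yield both the robustness predictions and,
via \texttt{feature\_importances\_}, the influence ranking referenced in
point (b).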