引言
二进制代码相似性检测:
基于二进制代码流:word2vec,bert,asm2vec等
基于Block:DeepBinDiff
基于CFG,LCG图:
参考:图相似性综述
基于AST:Asteria
SAFE
论文原文:https://arxiv.org/pdf/1811.05296.pdf
代码地址:https://github.com/gadiluna/SAFE
模型
概览
基于Word2Vec实现:
i2v
根据现有一些不同架构的指令集的汇编指令进行规范化处理,将每条asm instruction变成一个word形式,例如:mov_mem_10,将规范化的指令集构成Corpus,利用word2vec进行embedding,根据Corpus训练拿到权重矩阵即可,但这要求输入的指令在我们的语料库中需要出现(gensim提供的FastText Model兴许可以解决此问题,有时间可以换这个试试,它与word2vec各有优劣)
i2v model从原理上来看就是应用了自然语言处理的Word2vec模型。
关于word2vec模型的介绍可以参考:word2vec详解
BiRNN(BiLSTM)
LSTM的全称是Long Short-Term Memory,它是RNN(Recurrent Neural Network)的一种。LSTM由于其设计的特点,非常适合用于对时序数据的建模,如文本数据。BiLSTM是Bi-directional Long Short-Term Memory的缩写,是由前向LSTM与后向LSTM组合而成。两者在自然语言处理任务中都常被用来建模上下文信息。
为什么使用LSTM与BiLSTM?
将词的表示组合成句子的表示,可以采用相加的方法,即将所有词的表示进行加和,或者取平均等方法,但是这些方法没有考虑到词语在句子中前后顺序。如句子“我不觉得他好”。“不”字是对后面“好”的否定,即该句子的情感极性是贬义。使用LSTM模型可以更好的捕捉到较长距离的依赖关系。因为LSTM通过训练过程可以学到记忆哪些信息和遗忘哪些信息。
但是利用LSTM对句子进行建模还存在一个问题:无法编码从后到前的信息。在更细粒度的分类时,如对于强程度的褒义、弱程度的褒义、中性、弱程度的贬义、强程度的贬义的五分类任务需要注意情感词、程度词、否定词之间的交互。举一个例子,“这个餐厅脏得不行,没有隔壁好”,这里的“不行”是对“脏”的程度的一种修饰,通过BiLSTM可以更好的捕捉双向的语义依赖。
前向的LSTM与后向的LSTM结合成BiLSTM。比如,我们对“我爱中国”这句话进行编码。
使用PyTorch搭建BiLSTM样例代码:
模型搭建
#!/usr/bin/env python
# coding:utf8
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
torch.manual_seed(123456)
class BLSTM(nn.Module):
"""
Implementation of BLSTM Concatenation for sentiment classification task
"""
def __init__(self, embeddings, input_dim, hidden_dim, num_layers, output_dim, max_len=40, dropout=0.5):
super(BLSTM, self).__init__()
self.emb = nn.Embedding(num_embeddings=embeddings.size(0),
embedding_dim=embeddings.size(1),
padding_idx=0)
self.emb.weight = nn.Parameter(embeddings)
self.input_dim = input_dim
self.hidden_dim = hidden_dim
self.output_dim = output_dim
# sen encoder
self.sen_len = max_len
self.sen_rnn = nn.LSTM(input_size=input_dim,
hidden_size=hidden_dim,
num_layers=num_layers,
dropout=dropout,
batch_first=True,
bidirectional=True)
self.output = nn.Linear(2 * self.hidden_dim, output_dim)
def bi_fetch(self, rnn_outs, seq_lengths, batch_size, max_len):
rnn_outs = rnn_outs.view(batch_size, max_len, 2, -1)
# (batch_size, max_len, 1, -1)
fw_out = torch.index_select(rnn_outs, 2, Variable(torch.LongTensor([0])).cuda())
fw_out = fw_out.view(batch_size * max_len, -1)
bw_out = torch.index_select(rnn_outs, 2, Variable(torch.LongTensor([1])).cuda())
bw_out = bw_out.view(batch_size * max_len, -1)
batch_range = Variable(torch.LongTensor(range(batch_size))).cuda() * max_len
batch_zeros = Variable(torch.zeros(batch_size).long()).cuda()
fw_index = batch_range + seq_lengths.view(batch_size) - 1
fw_out = torch.index_select(fw_out, 0, fw_index) # (batch_size, hid)
bw_index = batch_range + batch_zeros
bw_out = torch.index_select(bw_out, 0, bw_index)
outs = torch.cat([fw_out, bw_out], dim=1)
return outs
def forward(self, sen_batch, sen_lengths, sen_mask_matrix):
"""
:param sen_batch: (batch, sen_length), tensor for sentence sequence
:param sen_lengths:
:param sen_mask_matrix:
:return:
"""
''' Embedding Layer | Padding | Sequence_length 40'''
sen_batch = self.emb(sen_batch)
batch_size = len(sen_batch)
''' Bi-LSTM Computation '''
sen_outs, _ = self.sen_rnn(sen_batch.view(batch_size, -1, self.input_dim))
sen_rnn = sen_outs.contiguous().view(batch_size, -1, 2 * self.hidden_dim) # (batch, sen_len, 2*hid)
''' Fetch the truly last hidden layer of both sides
'''
sentence_batch = self.bi_fetch(sen_rnn, sen_lengths, batch_size, self.sen_len) # (batch_size, 2*hid)
representation = sentence_batch
out = self.output(representation)
out_prob = F.softmax(out.view(batch_size, -1))
return out_prob
init()函数中对网络进行初始化,设定词向量维度,前向/后向LSTM中隐层向量的维度,还有要分类的类别数等。
bi_fetch()函数的作用是将 与 拼接起来并返回拼接后的向量。由于使用了batch,所以需要使用句子长度用来定位开始padding时前一个时刻的输出的隐层向量。
forward()函数里进行前向计算,得到各个类别的概率值。
模型训练
def train(model, training_data, args, optimizer, criterion):
model.train()
batch_size = args.batch_size
sentences, sentences_seqlen, sentences_mask, labels = training_data
# print batch_size, len(sentences), len(labels)
assert batch_size == len(sentences) == len(labels)
''' Prepare data and prediction'''
sentences_, sentences_seqlen_, sentences_mask_ = \
var_batch(args, batch_size, sentences, sentences_seqlen, sentences_mask)
labels_ = Variable(torch.LongTensor(labels))
if args.cuda:
labels_ = labels_.cuda()
assert len(sentences) == len(labels)
model.zero_grad()
probs = model(sentences_, sentences_seqlen_, sentences_mask_)
loss = criterion(probs.view(len(labels_), -1), labels_)
loss.backward()
optimizer.step()
代码中training_data是一个batch的数据,其中包括输入的句子sentences(句子中每个词以词下标表示),输入句子的长度sentences_seqlen,输入的句子对应的情感类别labels。 训练模型前,先清空遗留的梯度值,再根据该batch数据计算出来的梯度进行更新模型。
model.zero_grad()
probs = model(sentences_, sentences_seqlen_, sentences_mask_)
loss = criterion(probs.view(len(labels_), -1), labels_)
loss.backward()
optimizer.step()
模型测试
以下是进行模型测试的代码。
def test(model, dataset, args, data_part="test"):
"""
:param model:
:param args:
:param dataset:
:param data_part:
:return:
"""
tvt_set = dataset[data_part]
tvt_set = yutils.YDataset(tvt_set["xIndexes"],
tvt_set["yLabels"],
to_pad=True, max_len=args.sen_max_len)
test_set = tvt_set
sentences, sentences_seqlen, sentences_mask, labels = test_set.next_batch(len(test_set))
assert len(test_set) == len(sentences) == len(labels)
tic = time.time()
model.eval()
''' Prepare data and prediction'''
batch_size = len(sentences)
sentences_, sentences_seqlen_, sentences_mask_ = \
var_batch(args, batch_size, sentences, sentences_seqlen, sentences_mask)
probs = model(sentences_, sentences_seqlen_, sentences_mask_)
_, pred = torch.max(probs, dim=1)
if args.cuda:
pred = pred.view(-1).cpu().data.numpy()
else:
pred = pred.view(-1).data.numpy()
tit = time.time() - tic
print " Predicting {:d} examples using {:5.4f} seconds".format(len(test_set), tit)
labels = numpy.asarray(labels)
''' log and return prf scores '''
accuracy = test_prf(pred, labels)
return accuracy
def cal_prf(pred, right, gold, formation=True, metric_type=""):
"""
:param pred: predicted labels
:param right: predicting right labels
:param gold: gold labels
:param formation: whether format the float to 6 digits
:param metric_type:
:return: prf for each label
"""
''' Pred: [0, 2905, 0] Right: [0, 2083, 0] Gold: [370, 2083, 452] '''
num_class = len(pred)
precision = [0.0] * num_class
recall = [0.0] * num_class
f1_score = [0.0] * num_class
for i in xrange(num_class):
''' cal precision for each class: right / predict '''
precision[i] = 0 if pred[i] == 0 else 1.0 * right[i] / pred[i]
''' cal recall for each class: right / gold '''
recall[i] = 0 if gold[i] == 0 else 1.0 * right[i] / gold[i]
''' cal recall for each class: 2 pr / (p+r) '''
f1_score[i] = 0 if precision[i] == 0 or recall[i] == 0 \
else 2.0 * (precision[i] * recall[i]) / (precision[i] + recall[i])
if formation:
precision[i] = precision[i].__format__(".6f")
recall[i] = recall[i].__format__(".6f")
f1_score[i] = f1_score[i].__format__(".6f")
''' PRF for each label or PRF for all labels '''
if metric_type == "macro":
precision = sum(precision) / len(precision)
recall = sum(recall) / len(recall)
f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
elif metric_type == "micro":
precision = 1.0 * sum(right) / sum(pred) if sum(pred) > 0 else 0
recall = 1.0 * sum(right) / sum(gold) if sum(recall) > 0 else 0
f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
return precision, recall, f1_score
SAFE 使用BiRNN
Self-Attention Network
参考论文:A STRUCTURED SELF-ATTENTIVE SENTENCE EMBEDDING
可以看出SAFE所采用的Self Attention Network和一个全连接层输出函数的embedding,该过程很类似Transformer的一个Encoder
Siamese Network
A friendly introduction to Siamese Networks
Siamese network就是“连体的神经网络”,神经网络的“连体”是通过共享权值来实现的,如下图所示。
孪生神经网络事衡量两个输入的相似程度。孪生神经网络有两个输入(Input1 and Input2),将两个输入feed进入两个神经网络(Network1 and Network2),这两个神经网络分别将输入映射到新的空间,形成输入在新的空间中的表示。通过Loss的计算,评价两个输入的相似度。
SAFE论文中采用的Siamese network如下,采用余弦相似性计算两个函数的相似性。
数据集
评价指标
模型改进:
增加多点对比,根据函数调用图进行匹配,减少搜索量,根据调用函数的完全图匹配的最高相似度(Kuhn Munkres算法)的均值作为两个函数的相似性进行对比,测试模型的准确率。