Many malware types rely on Domain Generation Algorithms (DGAs) to establish a communication link with a command and control (C2) server, either to receive instructions or to exfiltrate data to malicious actors.
In this talk, we introduce a novel approach to improving ML models’ accuracy at detecting new DGA types: a separate ML model is trained specifically to learn an embedding representation from a normal English text corpus. These general representations are then used to transform domain names before they are fed to the classifier. This architecture avoids overfitting to the training data while still capturing the contextual information about the language needed to distinguish a normal character sequence from a random DGA sequence.
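The two-stage idea can be illustrated with a minimal sketch. It is not the model presented in the talk: smoothed character-bigram statistics stand in for the learned embedding, a single threshold stands in for the classifier, and the corpus, domain names, and function names are all hypothetical. The point it shows is only the separation of concerns: stage 1 sees English text and no domain labels, and stage 2 is fit on the transformed domains.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"

# Tiny stand-in corpus (assumption); a real system would use a large English corpus.
CORPUS = ("the quick brown fox jumps over the lazy dog while the network "
          "security team monitors online traffic and mail servers for "
          "information about weather news shopping banking and travel")


def train_language_model(text):
    """Stage 1: count character bigrams in English text only (no domain labels)."""
    pair_counts, first_counts = Counter(), Counter()
    for word in text.lower().split():
        chars = [c for c in word if c in ALPHABET]
        for a, b in zip(chars, chars[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts


def score_domain(domain, model):
    """Transform a domain into one feature: mean smoothed bigram log-probability."""
    pair_counts, first_counts = model
    label = domain.lower().split(".")[0]  # leftmost label only (simplification)
    chars = [c for c in label if c in ALPHABET]
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        # Too short to score; 0.0 is never below a fitted threshold, so not flagged.
        return 0.0
    total = 0.0
    for a, b in pairs:
        # Add-one smoothing so unseen bigrams get a small, nonzero probability.
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + len(ALPHABET))
        total += math.log(p)
    return total / len(pairs)


def fit_threshold(benign, dga, model):
    """Stage 2: a deliberately simple classifier, the midpoint of the class means."""
    benign_mean = sum(score_domain(d, model) for d in benign) / len(benign)
    dga_mean = sum(score_domain(d, model) for d in dga) / len(dga)
    return (benign_mean + dga_mean) / 2


def is_dga(domain, model, threshold):
    """Flag domains whose character sequence looks unlike English."""
    return score_domain(domain, model) < threshold


# Hypothetical training examples, used only to place the threshold.
model = train_language_model(CORPUS)
threshold = fit_threshold(
    benign=["weathernews.com", "onlinebanking.com"],
    dga=["xkqzjvwpty.com", "qzjxvkwypd.net"],
    model=model,
)
```

With this toy corpus, a string such as `kzvqxjypwd.biz`, which appears in neither training list, scores below the threshold, while `mailserver.net` scores above it. Because the language statistics never see DGA samples, they cannot memorize the quirks of any one DGA family, which is the intuition behind the generalization claim.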
We evaluated our model on three new DGA families to test its ability to generalize to previously unseen types of DGAs, and compared its results against a unified architecture using identical training and test datasets. We found that our model achieves significantly better results than unified ML approaches on examples from new DGA malware families.