Cut & paste code abounds in software in various forms, and with it come cut & paste bugs. Beyond these, vulnerabilities in commonly used libraries become vulnerabilities in the many concrete targets that include them. In both scenarios, attackers benefit from practical techniques for identifying known code snippets, drawn from a large library, in the binaries under examination.
This talk discusses some adventures and lessons learned while building code to recognize third-party libraries in executables. The content touches on topics such as fast approximate nearest-neighbor search over bit vectors, mistakes one makes as a machine learning beginner, and specific difficulties that often get glossed over in promising academic research.