On the Precision of Dynamic Program Fingerprints Based on Performance Counters
This program is tentative and subject to change.
Task classification is the challenge of determining whether two binary programs perform the same task. This problem is essential in scenarios such as malware identification, plagiarism detection, and redundancy elimination. Classification can be performed statically or dynamically. In the former case, the classifier analyzes the binary image of the program, whereas in the latter it observes the program's execution. Recent research has demonstrated that dynamic classification is more accurate, particularly in adversarial settings where programs may be obfuscated. This remains true even when both classifiers use the exact representation of programs, such as histograms of instruction opcodes. The superior accuracy of dynamic classification stems from its ability to disregard dead code inserted during the obfuscation process. However, state-of-the-art dynamic techniques, such as Valgrind plugins, can slow down program execution by as much as 100 times due to binary instrumentation. This paper proposes to eliminate this overhead by replacing program instrumentation with the sampling of hardware performance counters. Our findings reveal both advantages and limitations of this approach. On the positive side, classifiers based on hardware counters impose almost no runtime overhead while retaining greater accuracy than purely static classifiers, particularly in the presence of obfuscation. On the downside, counter-based classifiers are slightly less accurate than instrumentation-based approaches and offer coarser granularity, being limited to whole-program classification rather than individual functions. Despite these limitations, our results challenge the conventional belief that dynamic code classifiers are too costly to be deployed in environments such as online servers, operating systems, and virtual machines.