Evaluating the Generalizability of Large Language Models to Real-World Complexity

Abstract:

Evaluating Code Large Language Models (LLMs) under real-world complexities is essential. Otherwise, there is a risk of overestimating LLMs' programming abilities based on simplistic benchmarks, followed by disappointment when deploying them in real-world settings. Recently, researchers have explored constructing more realistic benchmarks by mining or augmenting open-source repositories. Such solutions are usually task-specific, and quality control for data drawn from real-world projects can be time-consuming and error-prone. More importantly, evaluating LLMs on fixed benchmark problems is subject to data contamination and overfitting. We propose GeneBench, an automated technique that adds real-world complexities to any programming benchmark. GeneBench leverages multi-objective optimization to maximize the complexity of programming problems while keeping code readability comparable to that of real-world programs. Transforming four widely used programming benchmarks with GeneBench and evaluating 13 LLMs (including two reasoning LLMs) on them reveals a significant performance drop across four different programming tasks (14.9%–60.5%, avg. 35.2%), demonstrating that LLMs struggle under real-world complexities. The struggle persists, albeit to a slightly lesser degree, even when LLMs are exposed to GeneBench transformations of the same or different programs through fine-tuning or few-shot learning. Finally, we show that the studied LLMs perform similarly on bug repair under GeneBench and SWE-Bench. This, together with the consistent reproduction of the performance drop for all studied LLMs across the four tasks under different versions of GeneBench, makes the technique suitable for evaluating LLMs without the costly construction of real-world benchmarks.