This chart displays the performance of fifteen models on the BlinkCode dataset after two rounds of refinement. The results show that closed-source models perform markedly better than open-source ones: closed-source models such as Claude-3.5-Sonnet and the GPT-4 variants achieve the highest scores, whereas open-source models such as IDEFICS2-8B, Fuyu-8B, and PaliGemma-3B-Mix-224 score very low or even zero. This suggests that closed-source models are currently far more capable on the tasks in the BlinkCode benchmark.
Program synthesis is a critical capability of Large Language Models (LLMs). The generated code is often used as an interface through which LLMs act as agents and interact with the environment. Meanwhile, multimodal LLMs, equipped with additional vision modules, have the potential to act as vision-enabled agents that perform interactive tasks based on perceived visual information. It is therefore also important for multimodal LLMs to generate executable code, grounded in what they have observed in the environment. While well-designed interactive coding benchmarks have been proposed for LLMs, appropriate ones for multimodal LLMs are lacking. We thus propose BlinkCode, an interactive, comprehensive, visual coding benchmark with execution feedback for multimodal LLMs. BlinkCode covers three types of tasks, evaluating capabilities including basic coding, planning, and refinement based on visual information. Our evaluation results demonstrate that, unlike LLMs, most open-source multimodal LLMs lack coding capabilities, calling for the community to develop techniques that inject coding skills into multimodal LLMs and turn them into vision-enabled agents. We include the whole dataset in the supplementary material.
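To make the execution-feedback refinement loop concrete, below is a minimal sketch of how such an evaluation round could be wired up. This is an illustrative assumption, not the benchmark's actual harness: the `model_generate` callable stands in for whatever multimodal LLM is under evaluation, and candidate programs are simply run in a subprocess with their output fed back as the refinement signal.

```python
import subprocess
import sys
import tempfile


def run_candidate(code: str, timeout: int = 10) -> tuple[bool, str]:
    """Execute a candidate Python program and return (success, feedback text)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        return proc.returncode == 0, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return False, "execution timed out"


def refine_with_feedback(model_generate, image, instruction, max_rounds: int = 2) -> str:
    """Ask the model for code, execute it, and feed errors back for refinement.

    `model_generate(image, prompt)` is a hypothetical callable wrapping the
    multimodal LLM under evaluation; it returns a code string.
    """
    prompt = instruction
    code = ""
    for _ in range(max_rounds + 1):  # initial attempt plus refinement rounds
        code = model_generate(image, prompt)
        ok, feedback = run_candidate(code)
        if ok:
            break
        # Append execution feedback so the next round can repair the code.
        prompt = f"{instruction}\n\nYour previous code failed:\n{feedback}\nPlease fix it."
    return code
```

A scoring function (e.g., comparing the rendered figure or program output against the reference) would then be applied to the final candidate; the two-round setting reported in the chart above corresponds to `max_rounds = 2` in this sketch.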
Figure: Overview of our proposed benchmark BlinkCode compared to previous benchmarks for multimodal LLMs. To evaluate the complex reasoning, planning, tool-use, and interactive refinement capabilities of multimodal LLMs, we design three types of coding tasks: basic coding problems, code-generated figure reconstruction, and visual programming, highlighting the potential of multimodal LLMs to be vision-enabled agents.