Abstract
This thesis addresses the lack of a standardized benchmark for evaluating Vision Language Models (VLMs) on the Turkish high school curriculum. We introduce a novel, manually curated benchmark dataset for the Yükseköğretim Kurumları Sınavı (YKS), comprising 1854 samples that span 309 topics and are designed to comprehensively test VLMs' ability to reason over complex, high-school-level exam questions. We first establish a baseline by benchmarking both open-source and proprietary VLMs on this new dataset. To further advance model capabilities in this domain, we curate three large-scale training datasets (D, M, and L) totaling 161.4 million tokens, augmented with solutions generated by advanced models such as Gemini 2.0 and Gemini 2.5, and through video-assisted prompting for complex problems. Using this combined dataset, we fine-tune the open-source Qwen2.5-VL-32B model, raising its accuracy on the YKS benchmark from 62.46\% to 78.59\%, a relative improvement of 25.8\%. The fine-tuned model performs on par with the state-of-the-art proprietary model Gemini 2.0, which achieves 79.61\% on the same test set. This work provides both a valuable benchmark for future research and a demonstration that fine-tuning with high-quality, domain-specific data can close the performance gap between open-source and commercial models on challenging academic tasks.