"Multimodal Benchmarking for NCAA Basketball" by Brendan Barnett
 

Date of Completion

Spring 5-1-2025

Thesis Advisor(s)

Dr. Dongjin Song

Honors Major

Computer Science and Engineering

Disciplines

Applied Statistics | Computer Sciences | Data Science

Abstract

We present the first multimodal, multitask benchmark for NCAA basketball, synthesizing structured statistical features with large language model (LLM)-generated game summaries across 19,739 games spanning four NCAA Division I seasons (2021 to 2025). We evaluate three model families (XGBoost, deep neural networks, and Transformers) under tabular-only and early-fusion settings to measure the impact of LLM-derived textual embeddings. To assess practical utility, we simulate fixed-stake and Kelly criterion-based betting strategies using historical bookmaker odds, analyzing both profitability and downside risk via Monte Carlo simulation. Our results show that XGBoost with early fusion achieves the highest return on investment and the lowest risk of loss. This work is, to our knowledge, the first to integrate LLM-generated narrative data with structured inputs for calibrated forecasting in sports, offering a reproducible benchmark for multimodal decision-making under uncertainty.
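For readers unfamiliar with the betting evaluation mentioned above, the following is a minimal sketch of Kelly criterion staking combined with a Monte Carlo estimate of downside risk. It is not the thesis's implementation: the function names (`kelly_fraction`, `monte_carlo`), the use of decimal odds, the fixed-stake size, and the choice to draw outcomes from the model's own win probability are all assumptions made for illustration.

```python
import random


def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake under the Kelly criterion.

    With net odds b = decimal_odds - 1 and loss probability q = 1 - p_win,
    the Kelly fraction is f = (b * p_win - q) / b. Negative values mean the
    bet has no positive edge, so we stake nothing.
    """
    b = decimal_odds - 1.0
    if b <= 0:
        return 0.0
    f = (b * p_win - (1.0 - p_win)) / b
    return max(0.0, f)


def monte_carlo(bets, n_runs=1000, seed=0, use_kelly=True, fixed_stake=0.01):
    """Simulate a sequence of bets many times (hypothetical helper).

    bets: list of (p_win, decimal_odds) pairs. Outcomes are sampled from
    p_win itself, which assumes the model's probabilities are calibrated.
    Returns (mean return on the starting bankroll, share of runs that end
    below the starting bankroll).
    """
    rng = random.Random(seed)
    finals = []
    for _ in range(n_runs):
        bankroll = 1.0  # starting bankroll normalized to 1
        for p_win, odds in bets:
            if use_kelly:
                stake = bankroll * kelly_fraction(p_win, odds)
            else:
                stake = min(fixed_stake, bankroll)
            if stake <= 0:
                continue
            if rng.random() < p_win:
                bankroll += stake * (odds - 1.0)  # win: collect net payout
            else:
                bankroll -= stake  # loss: forfeit the stake
        finals.append(bankroll)
    mean_return = sum(finals) / n_runs - 1.0
    risk_of_loss = sum(f < 1.0 for f in finals) / n_runs
    return mean_return, risk_of_loss


if __name__ == "__main__":
    # Illustrative inputs only: 50 bets at 60% win probability and even odds.
    example_bets = [(0.6, 2.0)] * 50
    print(monte_carlo(example_bets, use_kelly=True))
    print(monte_carlo(example_bets, use_kelly=False))
```

Comparing the two printed pairs shows the kind of trade-off such a simulation can surface: Kelly staking scales the stake with the estimated edge, while a fixed stake keeps exposure constant per bet.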
