2025-06-18 · 8 min read

How I Fine-Tuned GPT-2 for Cricket Commentary

400k commentary rows, TensorFlow tricks, and plenty of slog-sweeps. This is how I trained a model to generate match-style commentary.

NLP · Fine-tuning · Cricket · T5 · GPT-2

Why this project mattered

Sports commentary is emotional, fast, and contextual. Generic language models sound flat when they describe cricket moments.

I wanted a model that could generate short, punchy lines that feel like live commentary while still being grounded in what happened in the match.

Dataset and preprocessing

I collected a large commentary dataset and normalized inconsistent text patterns such as over formats, team names, punctuation, and special symbols.

I removed noisy rows, duplicate snippets, and low-value boilerplate so the model would learn commentary style rather than artifacts.

  • Built a cleaned dataset focused on meaningful ball-by-ball context
  • Standardized abbreviations and match metadata
  • Created train and validation splits to avoid leakage
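Here is a minimal sketch of what a cleaning pass like this can look like. It assumes each row is a dict with hypothetical `text` and `match_id` fields, and the over-notation rule is an assumed convention. Splitting by whole match is one common way to avoid leakage and is shown as an illustration, not as the exact pipeline used here.

```python
import hashlib
import re
import unicodedata


def normalize_line(text: str) -> str:
    """Normalize one commentary row (assumed rules, adjust to your data)."""
    text = unicodedata.normalize("NFKC", text)
    # Leading over marker such as "12 . 3" or "12,3" becomes "12.3".
    text = re.sub(r"^\s*(\d{1,2})\s*[.,]\s*([1-6])\b", r"\1.\2", text)
    # Collapse runs of the same "!" or "?" into a single mark.
    text = re.sub(r"([!?])\1+", r"\1", text)
    # Collapse all whitespace runs into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def dedupe(rows):
    """Keep the first occurrence of each text, compared case-insensitively."""
    seen = set()
    kept = []
    for row in rows:
        key = row["text"].casefold()
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def split_by_match(rows, val_percent=10):
    """Send every row of a match to the same split, using a stable hash."""
    train, val = [], []
    for row in rows:
        digest = hashlib.md5(str(row["match_id"]).encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        (val if bucket < val_percent else train).append(row)
    return train, val
```

Hashing the match ID keeps the split deterministic across runs, so a match never drifts from validation into training when the dataset is rebuilt. Note that global de-duplication also drops legitimately repeated short lines such as "no run", which may or may not be what you want.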

Modeling approach

I fine-tuned GPT-2 to improve generation quality and compared several prompt templates to stabilize tone.

I also tested sequence length and decoding settings because cricket commentary quality drops quickly when outputs become repetitive.

  • Prompt format had a major impact on output style
  • Sampling settings shifted the balance between creativity and factual consistency
  • Short output targets gave the best production-like results
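To make the prompt and decoding points concrete, here is a rough sketch. The template tags, field names, and numeric values are all hypothetical. The keyword names in the settings follow the `generate()` method in Hugging Face `transformers`, which is an assumption, since this post does not depend on any particular library.

```python
def build_prompt(over: str, bowler: str, batter: str, outcome: str) -> str:
    """Render one ball's context into a fixed template (hypothetical tags)."""
    return (
        f"<over> {over} <bowler> {bowler} <batter> {batter} "
        f"<outcome> {outcome} <commentary>"
    )


# Illustrative presets only; the values are not the ones tuned for this project.
DECODING_PRESETS = {
    # Lower temperature and tighter nucleus: stays closer to the match facts.
    "conservative": dict(
        do_sample=True, temperature=0.7, top_p=0.9,
        no_repeat_ngram_size=3, max_new_tokens=40,
    ),
    # Higher temperature: more colorful, more likely to drift from the facts.
    "creative": dict(
        do_sample=True, temperature=1.0, top_p=0.95,
        repetition_penalty=1.2, max_new_tokens=40,
    ),
}


def first_line(generated: str, prompt: str) -> str:
    """Strip the echoed prompt and keep only the first line of the completion."""
    completion = generated[len(prompt):] if generated.startswith(prompt) else generated
    lines = completion.strip().splitlines()
    return lines[0].strip() if lines else ""
```

With a loaded model, a preset would be passed as `model.generate(**inputs, **DECODING_PRESETS["conservative"])`. Keeping `max_new_tokens` small and trimming to the first line is one simple way to enforce short output targets.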

Training and evaluation

Training quality improved when I balanced learning rate, context window, and batch size against the available GPU budget.
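The trade-off between context window and batch size can be reasoned about with simple arithmetic. The sketch below assumes gradient accumulation is available, which this post does not state, and all numbers in the usage note are made up for illustration.

```python
def tokens_per_step(micro_batch: int, accum_steps: int, context_len: int) -> int:
    """Tokens that contribute to one optimizer update."""
    return micro_batch * accum_steps * context_len


def accum_steps_for(target_batch: int, micro_batch: int) -> int:
    """Smallest accumulation count so micro_batch * steps >= target_batch."""
    return -(-target_batch // micro_batch)  # ceiling division
```

For example, if only 4 sequences of 256 tokens fit on the GPU at once, `accum_steps_for(32, 4)` gives 8 accumulation steps for an effective batch of 32, and `tokens_per_step(4, 8, 256)` gives 8192 tokens per update. Halving the context length frees memory for a larger micro-batch at the same token budget.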

I evaluated with both offline metrics and manual review, since sports language needs human judgment for excitement and readability.

  • Tracked loss curves across checkpoints
  • Compared outputs on the same held-out match contexts
  • Selected the checkpoint with the best readability and style match
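A few offline checks can be sketched in a handful of lines. These are assumptions about which metrics are useful here, not a record of the ones actually tracked: perplexity derived from mean validation loss, a distinct-n score as a rough repetition signal, and a lowest-loss shortlist as a starting point for manual review.

```python
import math


def perplexity(mean_token_loss: float) -> float:
    """Perplexity from mean per-token cross-entropy (natural log)."""
    return math.exp(mean_token_loss)


def distinct_n(texts, n=2):
    """Share of unique n-grams across outputs; lower means more repetition."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


def lowest_loss_checkpoint(val_losses: dict) -> str:
    """Name of the checkpoint with the lowest validation loss."""
    return min(val_losses, key=val_losses.get)
```

The lowest-loss checkpoint is only a first filter. The final pick still comes from reading outputs side by side on the same held-out contexts, since loss does not measure excitement or readability.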

Key outcomes

The final model produced sharper, more idiomatic commentary lines than the baseline of prompting a general-purpose model.

The biggest lesson was that dataset quality and prompt structure mattered more than attempting complex model changes early on.