New paper finds all LLM improvements are just task contamination
Current benchmarks are overestimating the true capabilities of LLMs
LLMs feel very capable these days, despite their obvious limitations. These models have been celebrated for their ability to engage in what's known as "few-shot" and "zero-shot" learning, where they perform tasks with minimal or no task-specific training. However, recent investigations suggest that the efficacy of these techniques might be overstated due to a phenomenon known as "task contamination." A new paper studies this phenomenon and concludes that many LLM "improvements" are simply artifacts of contamination, and that on tasks where contamination is impossible, LLMs have shown no improvement over time!
That would be a pretty big finding. So let's break it down.
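To make "contamination" concrete: if a benchmark was published before a model's training data was collected, its examples (and their answers) may already sit in that training data, while a benchmark published after the training cutoff cannot have leaked in. The sketch below illustrates only that chronological idea; the dataset names, dates, and cutoff are placeholders for illustration, not values taken from the paper.

```python
# A rough sketch of a chronological contamination check: flag benchmarks whose
# public release predates a model's training-data cutoff, since their examples
# could have leaked into the training set. All names and dates are hypothetical.

from datetime import date

MODEL_TRAINING_CUTOFF = date(2021, 9, 1)  # hypothetical training-data cutoff

BENCHMARKS = {
    "older_sentiment_task": date(2018, 6, 1),   # released before the cutoff
    "newer_reasoning_task": date(2023, 3, 15),  # released after the cutoff
}

def possibly_contaminated(release: date, cutoff: date = MODEL_TRAINING_CUTOFF) -> bool:
    """A task released before the cutoff *could* appear in the training data."""
    return release < cutoff

for name, released in BENCHMARKS.items():
    status = "possibly contaminated" if possibly_contaminated(released) else "clean by construction"
    print(f"{name}: {status}")
```

Comparing model performance on the "clean by construction" group against the possibly contaminated group is the intuition behind the paper's headline claim.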
The Promise of Few-Shot Learning
Few-shot learning is one of the most exciting capabilities demonstrated by large language models. It refers to the ability to learn new tasks from just a few examples, sometimes even a single example. For instance, an LLM might be given just 5 labeled text examples for a sentiment analysis task, and be able to classify new text based on those examples alone.
This technique contrasts with traditional machine learning, which typically requires thousands or millions of labeled training examples to learn a concept. Few-shot learning enables models to acquire new skills and knowledge rapidly from very limited data.
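Here is a minimal sketch of what few-shot prompting looks like for the sentiment-analysis case above. The labeled reviews are made up, and the prompt is simply printed rather than sent to any particular model; in practice you would pass the resulting string to whichever LLM API you use.

```python
# Minimal few-shot prompting sketch for sentiment classification.
# Five labeled examples are placed in the prompt; no fine-tuning is involved.

FEW_SHOT_EXAMPLES = [
    ("The battery lasts all day and the screen is gorgeous.", "positive"),
    ("Customer support never answered my emails.", "negative"),
    ("It does exactly what the description promised.", "positive"),
    ("The app crashes every time I open the camera.", "negative"),
    ("Shipping was fast and the packaging was intact.", "positive"),
]

def build_few_shot_prompt(new_text: str) -> str:
    """Assemble a prompt from the five labeled examples plus the new input."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {new_text}")
    lines.append("Sentiment:")  # the model is expected to complete this line
    return "\n".join(lines)

if __name__ == "__main__":
    prompt = build_few_shot_prompt("The headphones broke after two days.")
    print(prompt)  # send this string to an LLM of your choice
```

The contamination worry is precisely that a model might answer such prompts well not because it generalizes from the five examples, but because the benchmark's examples were already in its training data.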
The few-shot prowess of models like GPT-4 has opened up new possibilities for quickly adapting LLMs to new domains and tasks without extensive retraining. Few-shot learning has become a major area of interest and benchmark for measuring progress in LLMs.