Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents
Suji Kim, Kangsan Kim, Sung Ju Hwang
Key claim
LearnWeak improves small agent specialization without annotations.
LearnWeak is a new framework that helps small computer-use agents specialize in specific domains without requiring extensive annotations. It identifies weaknesses in agents and generates targeted training tasks, leading to significant performance improvements. The key result shows average gains of over 11 percentage points compared to existing models.
In plain English
LearnWeak finds what a given small agent gets wrong in a target domain, generates new practice tasks aimed at those weaknesses, and trains the agent on them, all without human-written annotations. The reported average gain is over 11 percentage points.
The main novelty is an annotation-free specialization framework for small computer-use agents.
The claims are supported by empirical results across multiple domains, though some limitations in baseline comparisons exist.
Deep reliability assessment
The methodology supports a narrower claim than the headline: in OSWorld's local Docker domains, with a strong teacher plus GPT-5-mini assistance, student-aware task synthesis and error-aware DPO improve two small computer-use agent (CUA) backbones over their unspecialized versions and over several data-generation baselines. Broader claims about general domain specialization are not fully established, because the evaluation is limited to OSWorld, excludes Chrome due to instability, and depends on capable proprietary or reference models for verification, weakness analysis, and generation.
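For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the paper's error-aware variant, whose details are not given in this summary. The pairing used in the comments (teacher action as "chosen", the student's erroneous action as "rejected") and the value of `beta` are assumptions made for illustration only.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Assumption for illustration: "chosen" is the teacher's action and
    "rejected" is the student's erroneous action at the same step.
    Inputs are summed token log-probabilities under the trainable policy
    and under the frozen reference model.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # The policy has shifted probability toward the chosen response,
    # so the loss falls below the log(2) starting point.
    print(dpo_loss(-1.0, -5.0, -3.0, -3.0))
```

When the policy and reference agree exactly, the margin is zero and the loss equals log 2; training lowers the loss by raising the relative likelihood of the chosen response.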
Reproducibility
Partial: the paper gives substantial implementation detail for training, data generation, evaluation, LoRA settings, and baselines, and mentions a project page, but the provided text does not state that code, generated datasets, or trained adapters are open-sourced.
Discussion questions
1. Does the core assumption hold that the teacher-student disagreement reliably identifies the student's true weaknesses, rather than merely encoding the teacher's biases or OSWorld-specific evaluation artifacts?
2. For builders, is the cost and operational complexity of running a stronger teacher plus GPT-5-mini for verification, summarization, ranking, and query generation justified compared with using a larger general CUA directly?
3. What would happen if LEARNWEAK were tested on held-out real enterprise workflows, noisier GUI states, or domains where the teacher is also weak; would the gains disappear or become negative?
Key figure
Figure 1 contrasts LEARNWEAK's iterative weakness-aware query generation and selective error-aware DPO against expensive manual annotation and failure-blind rigid training, showing the small student improving across software domains.
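The idea of steering query generation toward weaknesses can be illustrated with a toy sketch. This is not the paper's procedure: according to the assessment above, weakness analysis there relies on a teacher and GPT-5-mini, whereas this sketch swaps in a plain failure-rate ranking. The function name `rank_weak_skills`, the skill labels, and the `top_k` parameter are all hypothetical.

```python
from collections import Counter


def rank_weak_skills(rollouts, top_k=2):
    """Return the top_k skills with the highest student failure rate.

    rollouts: iterable of (skill, succeeded) pairs, one per student attempt.
    The skill labels are hypothetical; a real system would need some way
    to assign them, which this sketch does not cover.
    """
    attempts, failures = Counter(), Counter()
    for skill, succeeded in rollouts:
        attempts[skill] += 1
        if not succeeded:
            failures[skill] += 1
    failure_rate = {s: failures[s] / attempts[s] for s in attempts}
    # Highest failure rate first; ties broken alphabetically for determinism.
    ranked = sorted(failure_rate, key=lambda s: (-failure_rate[s], s))
    return ranked[:top_k]


if __name__ == "__main__":
    demo = [("formula", False), ("formula", False),
            ("chart", True), ("chart", False),
            ("save", True)]
    # New training queries would then be requested for these skills.
    print(rank_weak_skills(demo))
```

In an iterative setup, the ranking would be recomputed after each training round so that later queries follow the student's remaining failures.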
