2026-05-28 · infra, code

Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching

Alaa Khamis, Alaa Maalouf

Key claim

HullFT improves quality-efficiency in test-time finetuning.

HullFT is a test-time finetuning (TTFT) method that targets both speed and quality through a geometric approach. It selects training sequences by approximating the prompt embedding as a sparse convex combination of retrieved candidates, and it cuts computation through Gradient Reuse, which reuses cached gradients across repeated finetuning examples. The key result is that HullFT reaches lower bits-per-byte at a reduced runtime compared with kNN retrieval and SIFT.
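For readers unfamiliar with the metric, the sketch below shows one common way to compute bits-per-byte from a model's summed negative log-likelihood. The function name `bits_per_byte` and the choice of UTF-8 byte counts are assumptions for illustration; the summary does not state the paper's exact normalization.

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per byte.

    Assumption: bytes are counted over the UTF-8 encoding of the evaluated text.
    """
    num_bytes = len(text.encode("utf-8"))
    if num_bytes == 0:
        raise ValueError("text must contain at least one byte")
    # nats -> bits is a division by ln(2); then normalize by the byte count.
    return total_nll_nats / (math.log(2) * num_bytes)

# Hypothetical example: 8 bits of total loss spread over a 4-byte string.
example_bpb = bits_per_byte(8 * math.log(2), "abcd")
```

Because the normalization is per byte rather than per token, the metric stays comparable across tokenizers.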

In plain English

Novelty

8.0/10

HullFT introduces a geometric approach to TTFT that significantly improves the quality-efficiency tradeoff.

Reliability

7.5/10

The claims are supported by experiments on subsets of The Pile showing improved performance over kNN retrieval and SIFT.

Deep reliability assessment

The methodology supports a quality-latency improvement for TTFT in the authors’ setup: kNN-preselected candidates, fixed embeddings, The Pile subsets, and comparisons to kNN retrieval and SIFT. Broader claims about latency-constrained LLM adaptation are not fully established without evidence across more retrievers, embedding models, base LLMs, domains, and production serving constraints.

Reproducibility

Yes for code: the paper states that code is available at https://github.com/alaa-khamis/HullFT. The dataset appears to be public, with experiments reported across 12 subsets of The Pile, though the provided excerpt does not include full experimental configuration details.

Discussion questions

1. Does convex reconstruction in embedding space actually correspond to examples that produce useful gradient updates for the LLM, or is it only a proxy for semantic diversity?
2. For builders, when does per-query TTFT with HullFT become operationally worthwhile compared with cheaper alternatives such as longer context retrieval, LoRA adapters, prompt caching, or reranking-only RAG?
3. What result would falsify HullFT’s core claim: failure on non-Pile domains, disappearance of gains with a stronger retriever/embedding model, or cases where lower convex reconstruction error does not correlate with lower BPB after finetuning?

Key figure

Figure 1 shows the HullFT pipeline: retrieve a kNN candidate pool, use Frank-Wolfe to approximate the prompt embedding as a sparse convex combination, integerize the weights into repeated finetuning examples, reuse cached gradients across repeats, and evaluate the adapted LLM on the prompt.
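The selection steps of the pipeline can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `frank_wolfe_simplex` and `integerize`, the exact line-search step size, the nearest-neighbor initialization, and the largest-remainder rounding are all assumptions, and the random vectors stand in for real kNN candidates and a real prompt embedding.

```python
import numpy as np

def frank_wolfe_simplex(candidates: np.ndarray, query: np.ndarray, num_steps: int = 50) -> np.ndarray:
    """Approximate `query` as a convex combination of the rows of `candidates`.

    Minimizes 0.5 * ||weights @ candidates - query||^2 over the probability simplex.
    Each Frank-Wolfe step touches one coordinate, so few steps give sparse weights.
    """
    n = candidates.shape[0]
    weights = np.zeros(n)
    # Assumed initialization: start at the single nearest candidate.
    weights[int(np.argmin(np.linalg.norm(candidates - query, axis=1)))] = 1.0
    for _ in range(num_steps):
        recon = weights @ candidates
        residual = recon - query
        grad = candidates @ residual          # gradient w.r.t. the weights
        s = int(np.argmin(grad))              # best simplex vertex (linear minimization)
        direction = candidates[s] - recon
        denom = float(direction @ direction)
        if denom < 1e-12:
            break
        # Exact line search for the quadratic objective, clipped to [0, 1].
        gamma = float(np.clip(-(residual @ direction) / denom, 0.0, 1.0))
        if gamma == 0.0:
            break
        weights *= 1.0 - gamma
        weights[s] += gamma
    return weights

def integerize(weights: np.ndarray, budget: int) -> np.ndarray:
    """Turn simplex weights into integer repeat counts that sum to `budget`
    (largest-remainder rounding; an assumed scheme)."""
    scaled = weights * budget
    counts = np.floor(scaled).astype(int)
    remainder = max(budget - int(counts.sum()), 0)
    order = np.argsort(-(scaled - counts))    # largest fractional parts first
    counts[order[:remainder]] += 1
    return counts

# Hypothetical data: 20 candidate embeddings of dimension 8.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(20, 8))
query = 0.5 * candidates[0] + 0.3 * candidates[1] + 0.2 * candidates[2]

weights = frank_wolfe_simplex(candidates, query)
counts = integerize(weights, budget=10)
nn_error = float(np.min(np.linalg.norm(candidates - query, axis=1)))
fw_error = float(np.linalg.norm(weights @ candidates - query))
```

In this sketch a candidate with count `c` would be used `c` times during finetuning, which is what makes caching gradients across repeats attractive.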

Benchmark results

12 subsets of The Pile, selection speedup vs SIFT: 12× faster selection on average

12 subsets of The Pile, finetuning speedup from Gradient Reuse vs naive finetuning without Gradient Reuse: 1.48× finetuning speedup
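To see why reusing gradients across repeated examples saves work, consider the toy comparison below. It is a sketch under strong assumptions: a linear model with squared loss stands in for the LLM, the function names are hypothetical, and the reuse variant simply applies one cached gradient `c` times (a stale-gradient approximation), which may differ from the paper's actual caching scheme.

```python
import numpy as np

def grad_squared_loss(theta: np.ndarray, x: np.ndarray, y: float) -> np.ndarray:
    """Gradient of 0.5 * (theta @ x - y)^2 with respect to theta."""
    return (theta @ x - y) * x

def finetune_naive(theta, examples, counts, lr):
    """Recompute the gradient for every repeat of every example."""
    theta = theta.copy()
    grad_evals = 0
    for (x, y), c in zip(examples, counts):
        for _ in range(c):
            theta -= lr * grad_squared_loss(theta, x, y)
            grad_evals += 1
    return theta, grad_evals

def finetune_with_reuse(theta, examples, counts, lr):
    """Compute each selected example's gradient once and reuse it for all repeats."""
    theta = theta.copy()
    grad_evals = 0
    for (x, y), c in zip(examples, counts):
        if c == 0:
            continue
        cached = grad_squared_loss(theta, x, y)  # one gradient evaluation per example
        grad_evals += 1
        theta -= lr * c * cached                 # apply the cached gradient c times
    return theta, grad_evals

# Hypothetical examples with integer repeat counts.
examples = [(np.array([1.0, 0.0]), 1.0),
            (np.array([0.0, 1.0]), -1.0),
            (np.array([1.0, 1.0]), 0.5)]
counts = [3, 0, 2]
theta0 = np.zeros(2)

theta_naive, naive_evals = finetune_naive(theta0, examples, counts, lr=0.1)
theta_reuse, reuse_evals = finetune_with_reuse(theta0, examples, counts, lr=0.1)
```

The two variants end at slightly different parameters because the cached gradient is not refreshed between repeats; the saving is in the number of gradient evaluations, here one per selected example instead of one per repeat.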

GitHub: 1 repo

alaa-khamis/HullFT (official)