0. Abstract

  • Model merging to build multi-task models
  • Previous works leverage different notions of a task parameter subspace
    • models are matched in their task parameter subspaces before being merged
    • limited to merging objectives with a closed-form solution
  • This work treats matching models in their task parameter subspaces as optimization over a loss landscape
    • matching can be cast as solving a linear system of equations
    • uses the conjugate gradient method to solve it
      • no closed-form solution required
      • enables merging via linear systems that are otherwise intractable to solve in closed form
      • allows choosing from a wide variety of initializations and estimates of the task parameter subspace
  • SOTA on multitask and intermediate-task model merging

1. Introduction

Why model merging?

  • recycle specialized models to create better base models
  • unlike multi-task learning, merging doesn’t require simultaneous access to individual task datasets
  • a single merged model is M times cheaper to run than keeping M separate task-specific models

Previous Works

  • simple parameter averaging (same architecture, initialization)
  • consider parameter importance (Fisher-weighted averaging)
  • match activation (Dataless knowledge fusion by merging weights of language models)
  • omit contribution of the pretrained model (task arithmetic)
  • resolve interference across models (TIES-MERGING)

Recent works tend to find a single model that matches the task-specific models in their task parameter subspaces

  • differ only in the choice of task parameter space

  • task parameter subspace = subspace implicitly used by a given merging method = aims to correspond to the important dimensions in parameter space for the task

  • to match models in their task parameter subspaces, a merging method upweights each model in its task parameter subspace so that the task-relevant components of each model remain after merging

  • other work (Model Merging by Uncertainty-Based Gradient Matching)

    • attributes inaccuracies in model merging to gradient mismatch across the different models
    • connects diagonal Fisher merging & task arithmetic

**Matching models in their task parameter space requires solving a ==linear system of equations==**

  • the linear system of equations defines a *merging objective* that relates to a given merging method’s choice of task parameter space
  • previous works vs this work
    • prev: used closed-form solutions
    • this work: a merging framework (MaTS) that uses the conjugate gradient method to solve a given linear system (see the sketch after this list)
  • +) supports different merging objectives and initializations (which can impact convergence speed)
  • +) enables merging objectives whose linear systems don’t have a closed-form solution
    • e.g., inspired by K-FAC, introduces a merging method where each model’s task parameter subspace is based on a block-diagonal approximation of the Fisher information matrix
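
A minimal sketch of the core idea, assuming each model m contributes a positive semi-definite matrix A_m that encodes its task parameter subspace and the merge solves (Σ_m A_m) θ = Σ_m A_m θ_m; the helper name `merge_with_cg` and the matrix-free setup are illustrative assumptions, not the paper’s exact formulation:

```python
# Illustrative sketch: merge M models by solving (sum_m A_m) theta = sum_m A_m theta_m
# with the conjugate gradient method, without ever materializing the p x p matrices.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def merge_with_cg(thetas, matvecs, theta_init=None, maxiter=100):
    """thetas: list of parameter vectors, each of shape [p].
    matvecs: list of callables with matvecs[m](v) = A_m @ v (A_m is never built explicitly).
    theta_init: optional initialization, e.g. the output of another merging method."""
    p = thetas[0].shape[0]

    # Left-hand side operator: v -> (sum_m A_m) v
    lhs = LinearOperator((p, p), matvec=lambda v: sum(mv(v) for mv in matvecs))

    # Right-hand side vector: sum_m A_m theta_m
    rhs = sum(mv(theta) for mv, theta in zip(matvecs, thetas))

    # Default initialization: simple parameter averaging
    x0 = np.mean(thetas, axis=0) if theta_init is None else theta_init

    merged, info = cg(lhs, rhs, x0=x0, maxiter=maxiter)
    return merged
```

Since CG is run for a limited number of iterations, the choice of x0 (e.g., the output of a preexisting merging method) can affect both convergence speed and the final merge.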

Expr. compare with existing merging methods

  • tasks (models finetuned via PEFT or full-model finetuning)
    • multitask merging
    • intermediate-task merging
    • merging trained vision models
  • setting
    • preexisting merging method as initialization for MaTS
  • result
    • MaTS boosts performance given a suitable choice of merging objective
    • multitask: SOTA by a large margin
    • MaTS can boost performance over its initialization and often attains state-of-the-art results
  • cost
    • higher cost than existing methods but cheaper than explicit multitask training

2. Background

Def.

  • $M$ : number of models to merge (all finetuned from the same pretrained model)

    • $\theta_m \in \mathbb{R}^p$ : p-dimensional vector of all parameters of model $m$
    • $\theta_m^{(k)} \in \mathbb{R}^{d_k}$ : $d_k$-dimensional vector of the parameters of a particular linear layer $k$ (weight matrix flattened)
  • $x$ : input, $y$ : output (the model aims to parameterize $p(y \mid x; \theta_m)$)

  • $D_m$ : dataset containing samples $(x, y)$ for model $m$

  • $L_m$ : loss function used to train model $m$

  • $W_m$ : weight matrix of a given linear layer in model $m$

  • $a$ : the layer’s input activation for one example / $A_m$ : input activations stacked row-wise

  • $z$ : the layer’s output activation for one example / $Z_m$ : output activations stacked row-wise

  • $\nabla_z L_m$ : gradient of the loss function w.r.t. the linear layer’s output activation for one particular input / $G_m$ : these gradients stacked row-wise

  • gradient of the log probability of the correct class with respect to the linear layer’s output (when the model uses a cross-entropy loss) = output activation gradient

    • $z = W_m a$ : operation computed by the linear layer (shape sketch after this list)
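
A tiny shape sketch of these per-layer quantities (the symbols follow the reconstruction above and are assumptions about the paper’s exact notation):

```python
# Shapes of the per-layer quantities for one model m: n examples pass through a
# linear layer with weight W_m of shape [d_out, d_in].
import numpy as np

n, d_in, d_out = 8, 4, 3
W_m = np.random.randn(d_out, d_in)   # linear layer weight matrix
A_m = np.random.randn(n, d_in)       # input activations, stacked row-wise
Z_m = A_m @ W_m.T                    # output activations, stacked row-wise (z = W_m a per example)
G_m = np.random.randn(n, d_out)      # per-example gradients of the loss w.r.t. z, stacked row-wise
```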

Simple Averaging: merge models by directly averaging their parameters (see the sketch after the list below)

Widely used in

  • federated learning
  • distributed optimization
  • merge models to retain OOD
  • improve pretrained model
  • create multimodal models
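
A minimal sketch of simple averaging (uniform weights; the helper name is illustrative):

```python
import numpy as np

def simple_average(thetas):
    """Uniform parameter averaging over M models finetuned from the same
    pretrained initialization. thetas: list of parameter vectors of shape [p]."""
    return np.mean(thetas, axis=0)
```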

Fisher Merging

  • model merging = finding the set of parameters with the highest joint probability under the individual models’ posteriors
  • the POSTERIOR must be APPROXIMATED: maximum-likelihood training doesn’t provide access to the posterior distribution
    • Laplace approximation: assumes model $m$’s parameters are sampled from a Gaussian distribution with mean $\theta_m$ and covariance set to the inverse of the Fisher information matrix $F_m$
    • if the parameters are assumed to be sampled from isotropic Gaussian posteriors, the solution reduces to simple averaging, but an isotropic Gaussian is a poor approximation of the true posterior
    • computing, storing, and inverting the full Fisher is intractable, so a diagonal approximation $\hat{F}_m$ is used: a diagonal matrix whose entries are the average of the squared per-example gradients. The closed-form solution under this approximation (per-model weighting coefficients omitted) is $\theta^* = \big(\sum_m \hat{F}_m\big)^{-1} \sum_m \hat{F}_m \theta_m$, computed elementwise since each $\hat{F}_m$ is diagonal (see the sketch below)
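
A minimal sketch of diagonal Fisher merging under the approximation above (per-example gradients are assumed to be available as a matrix; weighting coefficients are omitted; helper names are illustrative):

```python
import numpy as np

def diagonal_fisher(per_example_grads):
    """Diagonal Fisher estimate: average of squared per-example gradients.
    per_example_grads: array of shape [num_examples, p]."""
    return np.mean(per_example_grads ** 2, axis=0)   # shape [p]

def fisher_merge(thetas, fishers, eps=1e-8):
    """Closed-form diagonal Fisher merging:
    theta* = (sum_m F_m)^{-1} sum_m F_m theta_m, elementwise since each F_m is diagonal."""
    thetas = np.stack(thetas)    # [M, p]
    fishers = np.stack(fishers)  # [M, p], diagonals only
    return (fishers * thetas).sum(axis=0) / (fishers.sum(axis=0) + eps)
```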

RegMean

  • model merging = minimizing the distance between the output activations of each original model and the output activations of the merged model
  • merges the parameters of each linear layer separately. The closed-form solution is $W_*^\top = \big(\sum_m A_m^\top A_m\big)^{-1} \sum_m A_m^\top A_m W_m^\top$ (see the sketch below)
  • it requires computing and inverting the Gram matrices $A_m^\top A_m$, so to ensure invertibility, RegMean scales the non-diagonal terms of each Gram matrix with a constant λ (with 0 < λ < 1) whose value can be tuned based on performance on held-out data
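
A minimal per-layer sketch of this closed form (the helper name and the way λ is applied to the Gram matrices are assumptions):

```python
import numpy as np

def regmean_merge(weights, activations, lam=0.9):
    """RegMean-style merge of one linear layer.
    weights: list of W_m, each [d_out, d_in]; activations: list of A_m, each [n_m, d_in];
    lam scales the off-diagonal entries of each Gram matrix (0 < lam < 1)."""
    lhs, rhs = 0.0, 0.0
    for W_m, A_m in zip(weights, activations):
        gram = A_m.T @ A_m                                       # [d_in, d_in]
        gram = lam * gram + (1 - lam) * np.diag(np.diag(gram))   # shrink off-diagonal terms
        lhs = lhs + gram                                         # sum_m Gram_m
        rhs = rhs + gram @ W_m.T                                 # sum_m Gram_m W_m^T
    return np.linalg.solve(lhs, rhs).T                           # merged W*, [d_out, d_in]
```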

Task Parameter Subspace of Prior Merging Methods

3. Method

MaTS

Merging via Conjugate Gradient Method

Block-diagonal Fisher Merging
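
No detailed notes here; based on the intro, each model’s task parameter subspace is defined by a block-diagonal (K-FAC-style) approximation of its Fisher information matrix, and the resulting linear system is solved with conjugate gradients. A minimal sketch under that assumption, with illustrative Kronecker factors (input-activation and output-gradient second moments) and helper names; the paper’s exact objective and solver details may differ:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def kfac_factors(A_m, G_m):
    """K-FAC-style Kronecker factors for one linear layer of model m.
    A_m: stacked input activations [n, d_in]; G_m: stacked output gradients [n, d_out]."""
    n = A_m.shape[0]
    return A_m.T @ A_m / n, G_m.T @ G_m / n   # input factor [d_in, d_in], output factor [d_out, d_out]

def block_fisher_merge(weights, in_factors, out_factors, W_init, maxiter=200):
    """Solve sum_m S_m W A_m = sum_m S_m W_m A_m for W with conjugate gradients,
    where A_m / S_m are the input / output Kronecker factors of model m's layer Fisher.
    This system has no simple closed form, so it is solved matrix-free with CG."""
    d_out, d_in = weights[0].shape

    def matvec(v):
        W = v.reshape(d_out, d_in)
        return sum(S @ W @ A for S, A in zip(out_factors, in_factors)).ravel()

    lhs = LinearOperator((d_out * d_in, d_out * d_in), matvec=matvec)
    rhs = sum(S @ W_m @ A for S, A, W_m in zip(out_factors, in_factors, weights)).ravel()
    sol, info = cg(lhs, rhs, x0=W_init.ravel(), maxiter=maxiter)
    return sol.reshape(d_out, d_in)
```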

4. Experiments

5. Conclusion