0. Abstract
- Model merging for multi-task models
- Previous works leverage different notions of a task parameter subspace (models are matched in these subspaces before being merged)
    - limited to closed-form solutions
- This work connects the task parameter subspace to the loss landscape
    - matching models in their task parameter subspaces can be seen as solving a linear system of equations
    - uses the conjugate gradient method instead of a closed-form solution
        - enables merging via linear systems that are otherwise intractable to solve
        - allows choosing from a wide variety of initializations and of estimates for the task parameter subspace
- SOTA on multitask and intermediate-task model merging
1. Introduction
Why model merging?
- recycle specialized models to create better base models
- unlike multi-task learning, merging doesn’t require simultaneous access to individual task datasets
- running one merged model is M times cheaper than running all M specialized models
Previous Works
- simple parameter averaging (requires the same architecture and initialization)
- consider parameter importance (Fisher-weighted averaging)
- match output activations (RegMean; "Dataless Knowledge Fusion by Merging Weights of Language Models")
- remove the contribution of the pretrained model before merging (task arithmetic; see the sketch after this list)
- resolve interference across models (TIES-Merging)
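For instance, task arithmetic (the "remove the pretrained model's contribution" entry above) merges by adding scaled task vectors to the pretrained parameters. A minimal sketch, assuming checkpoints stored as parameter dicts; the function name and the scaling coefficient `lam` are illustrative choices, not from the paper:

```python
def task_arithmetic_merge(pretrained, finetuned_list, lam=0.3):
    """Merge by adding scaled task vectors (theta_m - theta_pre) to the
    pretrained parameters.  All arguments are dicts of name -> array."""
    merged = {}
    for name, theta_pre in pretrained.items():
        task_vectors = [ft[name] - theta_pre for ft in finetuned_list]
        merged[name] = theta_pre + lam * sum(task_vectors)
    return merged
```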
Recent works tend to find a single model that matches the task-specific models in their task parameter subspaces
- they differ only in the choice of task parameter subspace
- task parameter subspace = the subspace implicitly used by a given merging method; it aims to correspond to the dimensions of parameter space that are important for the task
- to match models in their task parameter subspaces, a merging method upweights each model in its own task parameter subspace → the task-relevant components of each model remain after merging
Other work (Model Merging by Uncertainty-Based Gradient Matching)
- attributes inaccuracies in model merging to gradient mismatch across the different models
- connects diagonal Fisher merging & task arithmetic
**Matching models in their task parameter subspaces requires solving a ==linear system of equations==**
- the linear system defines a *merging objective* that reflects the given merging method's choice of task parameter subspace (a generic form is sketched right after this list)
- previous works vs. this work
    - prev: used closed-form solutions
    - this work: a merging framework (MaTS) that uses the conjugate gradient method to solve a given linear system
        - +) supports different merging objectives and initializations (→ initialization can impact convergence speed)
        - +) enables merging objectives whose linear systems don't have a closed-form solution
- taking insight from K-FAC, they introduce a merging method where a model's task parameter subspace is based on a block-diagonal approximation of the Fisher information matrix
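To make the "matching = linear system" framing concrete, one plausible generic form consistent with the Fisher and RegMean closed forms summarized in the Background (the symbol $C_m$ is shorthand introduced here, not the paper's notation):

$$\Big(\sum_{m=1}^{M} C_m\Big)\,\theta^{*} \;=\; \sum_{m=1}^{M} C_m\,\theta_m, \qquad C_m \succeq 0$$

Here $C_m$ encodes model $m$'s task parameter subspace: $C_m = I$ roughly recovers simple averaging, $C_m = \hat{F}_m$ recovers diagonal Fisher merging, and per-layer Gram matrices roughly recover RegMean. Solving such a system without inverting $\sum_m C_m$ is what the conjugate gradient method enables.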
Experiments: compare with existing merging methods
- multitask merging (models trained via PEFT / full-model fine-tuning)
- intermediate-task merging
- merging fine-tuned vision models
- setting
    - use a preexisting merging method as the initialization for MaTS
- results
    - MaTS boosts performance given a suitable choice of merging objective
    - multitask merging: SOTA by a large margin
    - MaTS can boost performance over its initialization and often attains state-of-the-art results
- cost
    - higher than existing merging methods, but cheaper than explicit multitask training
2. Background
Def.
- $M$ : # of models to merge (all fine-tuned from the same pretrained model)
- $\theta_m \in \mathbb{R}^p$ : p-dimensional vector of all parameters of model $m$
- $\mathrm{vec}(W_m) \in \mathbb{R}^{dk}$ : dk-dimensional vector of the parameters of a particular linear layer (weight matrix flattened)
- $x$ : input, $y$ : output (the model aims to parameterize $p(y \mid x)$)
- $D_m$ : dataset containing samples $(x, y)$ for task $m$
- $L_m$ : loss function used to train model $m$
- $W_m \in \mathbb{R}^{d \times k}$ : weight matrix of a particular linear layer in model $m$
- $x_i \in \mathbb{R}^k$ : the layer's input activation for one example / $X_m$ : input activations stacked row-wise
- $z_i \in \mathbb{R}^d$ : the layer's output activation / $Z_m$ : output activations stacked row-wise
- $g_i = \nabla_{z_i} L_m$ : gradient of the loss function w.r.t. the linear layer's output activation for one particular input / $G_m$ : these gradients stacked row-wise
- gradient of the log-probability of the correct class with respect to the linear layer's output: when the model uses a cross-entropy loss, this coincides (up to sign) with the output-activation gradient
- $z_i = W_m x_i$ (stacked: $Z_m = X_m W_m^\top$) : operation computed by the linear layer
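A tiny shape check of the notation above; the concrete dimensions are illustrative, not from the paper:

```python
import numpy as np

n, k, d = 32, 768, 3072       # examples, layer input dim, layer output dim
W_m = np.random.randn(d, k)   # weight matrix of one linear layer of model m
X_m = np.random.randn(n, k)   # input activations stacked row-wise
Z_m = X_m @ W_m.T             # output activations stacked row-wise, shape (n, d)
G_m = np.random.randn(n, d)   # per-example output-activation gradients, stacked row-wise
```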
Simple Averaging
- merge models by simply averaging their parameters: $\theta^{*} = \frac{1}{M}\sum_{m=1}^{M}\theta_m$
- widely used in
- federated learning
- distributed optimization
- merge models to retain out-of-distribution (OOD) robustness
- improve pretrained model
- create multimodal models
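A minimal sketch of simple averaging over checkpoints stored as parameter dicts; the function name is illustrative, not from any library:

```python
def simple_average(param_dicts):
    """Elementwise average of M models' parameters.

    param_dicts: list of dicts mapping parameter name -> numpy array / tensor,
    all fine-tuned from the same pretrained model (identical keys and shapes).
    """
    M = len(param_dicts)
    return {name: sum(pd[name] for pd in param_dicts) / M
            for name in param_dicts[0]}
```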
Fisher Merging
- model merging = finding the set of parameters with the highest joint probability under the individual models' posteriors
- the POSTERIOR must be APPROXIMATED: maximum-likelihood training doesn't provide access to a posterior distribution
- → Laplace approximation: assume the parameters are sampled from a Gaussian distribution with mean $\theta_m$ and covariance set to the inverse of the Fisher information matrix, i.e. $p(\theta \mid \theta_m) \approx \mathcal{N}(\theta \mid \theta_m, F_m^{-1})$
  $$\theta^{*} = \arg\max_{\theta} \sum_{m=1}^{M} \log \mathcal{N}(\theta \mid \theta_m, F_m^{-1})$$
- if the parameters are instead assumed to come from isotropic Gaussian posteriors, the solution of the above equation is simple averaging, but an isotropic Gaussian is a poor approximation of the true posterior
- computing, storing, and inverting the full Fisher is intractable, so a diagonal approximation $\hat{F}_m$ is used: a diagonal matrix whose entries are the average of the squared per-example gradients
The closed-form solution under the above approximation:
$$\theta^{*} = \Big(\sum_{m=1}^{M} \hat{F}_m\Big)^{-1} \sum_{m=1}^{M} \hat{F}_m\,\theta_m$$
(with diagonal $\hat{F}_m$ this is just an elementwise weighted average of the $\theta_m$)
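A minimal sketch of diagonal Fisher merging under the approximations above; function names are illustrative, and the per-example log-likelihood gradients are assumed to be supplied by the caller:

```python
import numpy as np

def diagonal_fisher(per_example_grads):
    """per_example_grads: array of shape (n_examples, p) holding the gradient
    of the log-likelihood for each example.  The diagonal Fisher estimate is
    the elementwise mean of the squared gradients."""
    return np.mean(per_example_grads ** 2, axis=0)

def fisher_merge(thetas, fishers, eps=1e-8):
    """theta* = (sum_m F_m * theta_m) / (sum_m F_m), computed elementwise."""
    numerator = sum(F * th for F, th in zip(fishers, thetas))
    denominator = sum(fishers) + eps   # eps guards against zero Fisher entries
    return numerator / denominator
```

With the elementwise division this is exactly the closed form above; the point of MaTS is to avoid needing such a closed form when the Fisher approximation is not diagonal.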
RegMean
- model merging = minimizing the distance between the output activations of the original task-specific models and the output activations of the merged model
- merge parameters of each linear layer separately
The closed-form solution can be found as
$$W^{*\top} = \Big(\sum_{m=1}^{M} X_m^\top X_m\Big)^{-1} \sum_{m=1}^{M} X_m^\top X_m\,W_m^\top$$
- it requires computing and inverting the Gram matrices $X_m^\top X_m$; to ensure invertibility, RegMean scales the non-diagonal terms of each Gram matrix by a constant λ (with 0 < λ < 1) whose value can be tuned based on performance on held-out data
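A minimal per-layer sketch of the RegMean solution above, assuming the $Z_m = X_m W_m^\top$ convention from the Background notation; function and variable names are illustrative:

```python
import numpy as np

def regmean_merge_layer(Ws, Xs, lam=0.9):
    """Ws: list of task weight matrices (d x k); Xs: list of input-activation
    matrices (n_m x k).  lam in (0, 1) scales the off-diagonal entries of each
    Gram matrix X_m^T X_m so that their sum stays well conditioned."""
    k = Ws[0].shape[1]
    gram_sum = np.zeros((k, k))
    rhs = np.zeros((k, Ws[0].shape[0]))              # accumulates sum_m G_m W_m^T
    for W, X in zip(Ws, Xs):
        G = X.T @ X                                  # Gram matrix, k x k
        G = lam * G + (1.0 - lam) * np.diag(np.diag(G))
        gram_sum += G
        rhs += G @ W.T
    W_merged_T = np.linalg.solve(gram_sum, rhs)      # (sum_m G_m)^{-1} sum_m G_m W_m^T
    return W_merged_T.T                              # back to (d x k)
```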
Task Parameter Subspace of Prior Merging Methods
3. Method
MaTS
Merging via Conjugate Gradient Method
Block-diagonal Fisher Merging
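The notes stop before the method details, so the following is only a sketch of how merging via the conjugate gradient method could look for one linear layer. It assumes a K-FAC-style block-diagonal Fisher $A_m \otimes B_m$ per layer, with $A_m = X_m^\top X_m$ (input Gram) and $B_m = G_m^\top G_m$ (output-gradient Gram), and the linear system $\sum_m (A_m \otimes B_m)\,\mathrm{vec}(W^*) = \sum_m (A_m \otimes B_m)\,\mathrm{vec}(W_m)$; all names are illustrative and the paper's exact objective may differ:

```python
import numpy as np

def cg_merge_layer(Ws, As, Bs, W0, num_steps=100, tol=1e-8):
    """Approximately solve  sum_m (A_m ⊗ B_m) vec(W) = sum_m (A_m ⊗ B_m) vec(W_m)
    with conjugate gradients, never materializing the Kronecker products.

    Ws: task weight matrices (d x k); As: input Gram factors (k x k);
    Bs: output-gradient Gram factors (d x d); W0: initialization, e.g. the
    output of a cheaper merging method such as simple or Fisher averaging."""
    def apply_operator(W):
        # (A ⊗ B) vec(W) = vec(B W A) for symmetric A, so the operator reduces
        # to cheap matrix products instead of a (dk x dk) matrix.
        return sum(B @ W @ A for A, B in zip(As, Bs))

    b = sum(B @ Wm @ A for Wm, A, B in zip(Ws, As, Bs))   # right-hand side
    W = W0.copy()
    r = b - apply_operator(W)        # residual
    p = r.copy()                     # search direction
    rs = np.sum(r * r)
    for _ in range(num_steps):
        Op = apply_operator(p)
        alpha = rs / np.sum(p * Op)
        W = W + alpha * p
        r = r - alpha * Op
        rs_new = np.sum(r * r)
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return W
```

Initializing `W0` from an existing merging method (as described in the experiments) only changes the starting point of the CG iterations, which is why the choice of initialization can affect convergence speed without changing the objective being solved.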

