0. Abstract

  • Model merging to build multi-task models
  • Previous works leverage different notions of a task parameter subspace
    • models are matched in their task parameter subspaces before being merged
    • limited to merging objectives with a closed-form solution
  • This work treats matching models in their task parameter subspaces as optimization over a loss landscape
    • matching can be cast as solving a linear system of equations
    • uses the conjugate gradient method to solve it
      • no closed-form solution required
      • enables merging via linear systems that are otherwise intractable to solve in closed form
      • allows choosing from a wide variety of initializations and estimates of the task parameter subspace
  • SOTA on multitask and intermediate-task model merging

1. Introduction

Why model merging?

  • recycle specialized models to create better base models
  • unlike multi-task learning, merging doesn’t require simultaneous access to individual task datasets
  • a single merged model is M times cheaper to run than keeping M separate task-specific models

Previous Works

  • simple parameter averaging (same architecture, initialization)
  • consider parameter importance (Fisher-weighted averaging)
  • match activation (Dataless knowledge fusion by merging weights of language models)
  • omit contribution of the pretrained model (task arithmetic)
  • resolve interference across models (TIES-MERGING)

Recent works tend to find a single model that matches the task-specific models in their task parameter subspaces

  • differ only in the choice of task parameter space

  • task parameter subspace = subspace implicitly used by a given merging method = aims to correspond to the important dimensions in parameter space for the task

  • to match models in their task parameter subspaces, a merging method upweights each model in its task parameter subspace so that the task-relevant components of each model remain after merging

  • other work (Model Merging by Uncertainty-Based Gradient Matching)

    • attributes inaccuracies in model merging to gradient mismatch across the different models
    • connects diagonal Fisher merging & task arithmetic

**Matching models in their task parameter space requires solving a ==linear system of equations==**

  • the linear system of equations defines a *merging objective* that relates to a given merging method’s choice of task parameter space
  • previous works vs this work
    • prev: used closed-form solutions
    • this work: a merging framework (MaTS) that uses the conjugate gradient method to solve a given linear system (see the sketch after this list)
  • +) supports different merging objectives and initializations (which can impact convergence speed)
  • +) enables merging objectives whose linear systems don’t have a closed-form solution
    • e.g., inspired by K-FAC, introduces a merging method where each model’s task parameter subspace is based on a block-diagonal approximation of the Fisher information matrix
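
A minimal sketch of the core idea, assuming each model m contributes a positive semi-definite matrix A_m that encodes its task parameter subspace and the merge solves (Σ_m A_m) θ = Σ_m A_m θ_m; the helper name `merge_with_cg` and the matrix-free setup are illustrative assumptions, not the paper’s exact formulation:

```python
# Illustrative sketch: merge M models by solving (sum_m A_m) theta = sum_m A_m theta_m
# with the conjugate gradient method, without ever materializing the p x p matrices.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def merge_with_cg(thetas, matvecs, theta_init=None, maxiter=100):
    """thetas: list of parameter vectors, each of shape [p].
    matvecs: list of callables with matvecs[m](v) = A_m @ v (A_m is never built explicitly).
    theta_init: optional initialization, e.g. the output of another merging method."""
    p = thetas[0].shape[0]

    # Left-hand side operator: v -> (sum_m A_m) v
    lhs = LinearOperator((p, p), matvec=lambda v: sum(mv(v) for mv in matvecs))

    # Right-hand side vector: sum_m A_m theta_m
    rhs = sum(mv(theta) for mv, theta in zip(matvecs, thetas))

    # Default initialization: simple parameter averaging
    x0 = np.mean(thetas, axis=0) if theta_init is None else theta_init

    merged, info = cg(lhs, rhs, x0=x0, maxiter=maxiter)
    return merged
```

Since CG is run for a limited number of iterations, the choice of x0 (e.g., the output of a preexisting merging method) can affect both convergence speed and the final merge.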

Expr. compare with existing merging methods

  • tasks (models finetuned via PEFT or full-model finetuning)
    • multitask merging
    • intermediate-task merging
    • merging trained vision models
  • setting
    • preexisting merging method as initialization for MaTS
  • result
    • MaTS boosts performance given a suitable choice of merging objective
    • multitask: SOTA by a large margin
    • MaTS can boost performance over its initialization and often attains state-of-the-art results
  • cost
    • higher cost than existing methods but cheaper than explicit multitask training

2. Background

Def.

  • $M$ : number of models to merge (all finetuned from the same pretrained model)

    • $\theta_m \in \mathbb{R}^p$ : p-dimensional vector of all parameters of model $m$
    • $\theta_m^{(k)} \in \mathbb{R}^{d_k}$ : $d_k$-dimensional vector of the parameters of a particular linear layer $k$ (weight matrix flattened)
  • $x$ : input, $y$ : output (the model aims to parameterize $p(y \mid x; \theta_m)$)

  • $D_m$ : dataset containing samples $(x, y)$ for model $m$

  • $L_m$ : loss function used to train model $m$

  • $W_m$ : weight matrix of a given linear layer in model $m$

  • $a$ : the layer’s input activation for one example / $A_m$ : input activations stacked row-wise

  • $z$ : the layer’s output activation for one example / $Z_m$ : output activations stacked row-wise

  • $\nabla_z L_m$ : gradient of the loss function w.r.t. the linear layer’s output activation for one particular input / $G_m$ : these gradients stacked row-wise

  • gradient of the log probability of the correct class with respect to the linear layer’s output (when the model uses a cross-entropy loss) = output activation gradient

    • $z = W_m a$ : operation computed by the linear layer (shape sketch after this list)
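
A tiny shape sketch of these per-layer quantities (the symbols follow the reconstruction above and are assumptions about the paper’s exact notation):

```python
# Shapes of the per-layer quantities for one model m: n examples pass through a
# linear layer with weight W_m of shape [d_out, d_in].
import numpy as np

n, d_in, d_out = 8, 4, 3
W_m = np.random.randn(d_out, d_in)   # linear layer weight matrix
A_m = np.random.randn(n, d_in)       # input activations, stacked row-wise
Z_m = A_m @ W_m.T                    # output activations, stacked row-wise (z = W_m a per example)
G_m = np.random.randn(n, d_out)      # per-example gradients of the loss w.r.t. z, stacked row-wise
```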

Simple Averaging: merge models by directly averaging their parameters (see the sketch after the list below)

Widely used in

  • federated learning
  • distributed optimization
  • merge models to retain OOD
  • improve pretrained model
  • create multimodal models
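
A minimal sketch of simple averaging (uniform weights; the helper name is illustrative):

```python
import numpy as np

def simple_average(thetas):
    """Uniform parameter averaging over M models finetuned from the same
    pretrained initialization. thetas: list of parameter vectors of shape [p]."""
    return np.mean(thetas, axis=0)
```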

Fisher Merging

  • model merging = finding the set of parameters with the highest joint probability under the individual models’ posteriors
  • the POSTERIOR must be APPROXIMATED: maximum-likelihood training doesn’t provide access to the posterior distribution
    • Laplace approximation: assumes model $m$’s parameters are sampled from a Gaussian distribution with mean $\theta_m$ and covariance set to the inverse of the Fisher information matrix $F_m$
    • if the parameters are assumed to be sampled from isotropic Gaussian posteriors, the solution reduces to simple averaging, but an isotropic Gaussian is a poor approximation of the true posterior
    • computing, storing, and inverting the full Fisher is intractable, so a diagonal approximation $\hat{F}_m$ is used: a diagonal matrix whose entries are the average of the squared per-example gradients. The closed-form solution under this approximation (per-model weighting coefficients omitted) is $\theta^* = \big(\sum_m \hat{F}_m\big)^{-1} \sum_m \hat{F}_m \theta_m$, computed elementwise since each $\hat{F}_m$ is diagonal (see the sketch below)
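
A minimal sketch of diagonal Fisher merging under the approximation above (per-example gradients are assumed to be available as a matrix; weighting coefficients are omitted; helper names are illustrative):

```python
import numpy as np

def diagonal_fisher(per_example_grads):
    """Diagonal Fisher estimate: average of squared per-example gradients.
    per_example_grads: array of shape [num_examples, p]."""
    return np.mean(per_example_grads ** 2, axis=0)   # shape [p]

def fisher_merge(thetas, fishers, eps=1e-8):
    """Closed-form diagonal Fisher merging:
    theta* = (sum_m F_m)^{-1} sum_m F_m theta_m, elementwise since each F_m is diagonal."""
    thetas = np.stack(thetas)    # [M, p]
    fishers = np.stack(fishers)  # [M, p], diagonals only
    return (fishers * thetas).sum(axis=0) / (fishers.sum(axis=0) + eps)
```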

RegMean

  • model merging = minimizing the distance between the output activations of each original model and the output activations of the merged model
  • merges the parameters of each linear layer separately. The closed-form solution is $W_*^\top = \big(\sum_m A_m^\top A_m\big)^{-1} \sum_m A_m^\top A_m W_m^\top$ (see the sketch below)
  • it requires computing and inverting the Gram matrices $A_m^\top A_m$, so to ensure invertibility, RegMean scales the non-diagonal terms of each Gram matrix with a constant λ (with 0 < λ < 1) whose value can be tuned based on performance on held-out data
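
A minimal per-layer sketch of this closed form (the helper name and the way λ is applied to the Gram matrices are assumptions):

```python
import numpy as np

def regmean_merge(weights, activations, lam=0.9):
    """RegMean-style merge of one linear layer.
    weights: list of W_m, each [d_out, d_in]; activations: list of A_m, each [n_m, d_in];
    lam scales the off-diagonal entries of each Gram matrix (0 < lam < 1)."""
    lhs, rhs = 0.0, 0.0
    for W_m, A_m in zip(weights, activations):
        gram = A_m.T @ A_m                                       # [d_in, d_in]
        gram = lam * gram + (1 - lam) * np.diag(np.diag(gram))   # shrink off-diagonal terms
        lhs = lhs + gram                                         # sum_m Gram_m
        rhs = rhs + gram @ W_m.T                                 # sum_m Gram_m W_m^T
    return np.linalg.solve(lhs, rhs).T                           # merged W*, [d_out, d_in]
```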

Task Parameter Subspace of Prior Merging Methods

3. Method

MaTS

Merging via Conjugate Gradient Method

Block-diagonal Fisher Merging
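
No detailed notes here; based on the intro, each model’s task parameter subspace is defined by a block-diagonal (K-FAC-style) approximation of its Fisher information matrix, and the resulting linear system is solved with conjugate gradients. A minimal sketch under that assumption, with illustrative Kronecker factors (input-activation and output-gradient second moments) and helper names; the paper’s exact objective and solver details may differ:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def kfac_factors(A_m, G_m):
    """K-FAC-style Kronecker factors for one linear layer of model m.
    A_m: stacked input activations [n, d_in]; G_m: stacked output gradients [n, d_out]."""
    n = A_m.shape[0]
    return A_m.T @ A_m / n, G_m.T @ G_m / n   # input factor [d_in, d_in], output factor [d_out, d_out]

def block_fisher_merge(weights, in_factors, out_factors, W_init, maxiter=200):
    """Solve sum_m S_m W A_m = sum_m S_m W_m A_m for W with conjugate gradients,
    where A_m / S_m are the input / output Kronecker factors of model m's layer Fisher.
    This system has no simple closed form, so it is solved matrix-free with CG."""
    d_out, d_in = weights[0].shape

    def matvec(v):
        W = v.reshape(d_out, d_in)
        return sum(S @ W @ A for S, A in zip(out_factors, in_factors)).ravel()

    lhs = LinearOperator((d_out * d_in, d_out * d_in), matvec=matvec)
    rhs = sum(S @ W_m @ A for S, A, W_m in zip(out_factors, in_factors, weights)).ravel()
    sol, info = cg(lhs, rhs, x0=W_init.ravel(), maxiter=maxiter)
    return sol.reshape(d_out, d_in)
```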

4. Experiments

5. Conclusion