Problem Statement:
- Parameter regularization methods still suffer from significant forgetting.
- Network expansion methods add extra parameters and computational overhead, which is sometimes unnecessary.
Research Goals:
- By considering task similarity, the framework adaptively decides, for each new task, whether to reuse an existing network (with its current weights) and learn the task under parameter regularization, or to allocate a new network and learn the task from scratch.
- To achieve higher performance than parameter regularization methods at a lower computational cost than modular (network expansion) methods.
Proposed Approach:
- Parameter Allocation and Regularization (PAR) is an adaptive method with two stages: first, measure the similarity of the new task to previous tasks; then, based on that similarity, either regularize an existing network or allocate a new one.
Task Similarity via Nearest-Prototype Task Distance:
PAR measures the distance between two tasks as the KL divergence between their data distributions. Estimating these distributions directly from raw data such as images is intractable, so a model is used to map the images to feature representations and the distributions are estimated in this feature space.
In the paper, a frozen pre-trained ResNet18 extracts features for the images of the current task and of previous tasks.
For previous tasks, only the mean feature (prototype) of each class is stored. When a new task arrives, the distance between each new sample X_i and the prototypes is computed (the nearest-prototype distance), both for the new task's own classes and for the classes of each previous task T_i.
Finally, the KL divergence between the resulting distributions p and q is estimated and compared against a threshold alpha: if the divergence is below alpha, the new task is considered similar to that previous task; otherwise it is treated as unrelated.
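A rough sketch of this similarity check (an illustration under assumptions, not the authors' implementation): features come from a frozen pre-trained extractor, only per-class mean features (prototypes) of old tasks are stored, and the KL divergence is approximated via nearest-prototype distances. The exact estimator and the names below (`class_prototypes`, `prototype_kl`, `alpha`) are assumptions.

```python
import torch

def class_prototypes(feats, labels):
    """Per-class mean feature: the only statistic kept for a finished task."""
    return torch.stack([feats[labels == c].mean(0) for c in labels.unique()])

def prototype_kl(new_feats, new_protos, old_protos):
    """Nearest-prototype estimate of KL(P_new || P_old).

    rho = distance of each new sample to the nearest prototype of its own task,
    nu  = distance of each new sample to the nearest prototype of the old task.
    This follows the classic nearest-neighbor divergence estimator, with
    prototypes standing in for stored samples; the constant bias-correction
    term is omitted in this sketch.
    """
    d = new_feats.shape[1]
    rho = torch.cdist(new_feats, new_protos).min(dim=1).values
    nu = torch.cdist(new_feats, old_protos).min(dim=1).values
    return (d / new_feats.shape[0]) * torch.log((nu + 1e-8) / (rho + 1e-8)).sum()

def most_related_task(new_feats, new_labels, old_task_protos, alpha=0.5):
    """Index of the closest previous task if its estimated distance is below
    the threshold alpha, otherwise None (i.e. allocate a new module)."""
    new_protos = class_prototypes(new_feats, new_labels)
    dists = [prototype_kl(new_feats, new_protos, p).item() for p in old_task_protos]
    if not dists:
        return None
    best = min(range(len(dists)), key=lambda i: dists[i])
    return best if dists[best] < alpha else None
```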
Parameter Allocation:
They adopt a cell-based neural architecture search (NAS) to find a suitable architecture for the new task, and propose a relatedness-aware, sampling-based search strategy to improve its efficiency (see the toy illustration below).
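The search strategy itself is not detailed in these notes; the snippet below is only a toy illustration of the "relatedness-aware sampling" idea, assuming (my assumption, not the paper's exact rule) that cells of previous tasks closer to the new task are sampled as search candidates with higher probability.

```python
import torch

def sample_candidate_cell(task_distances, temperature=1.0):
    """Sample which previous task's cell to try as a candidate during the
    architecture search. Smaller task distance (higher relatedness) gives a
    higher sampling probability, which narrows the search and saves compute.
    The softmax-over-negative-distances form is an illustrative choice."""
    d = torch.tensor(task_distances, dtype=torch.float)
    probs = torch.softmax(-d / temperature, dim=0)
    idx = torch.multinomial(probs, num_samples=1).item()
    return idx, probs

# Example: three previous tasks with distances 0.3, 1.2, 2.5 to the new task;
# task 0's cell is the most likely candidate to be tried first.
idx, probs = sample_candidate_cell([0.3, 1.2, 2.5])
```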
Experimental Setup:
- Datasets: CTrL, CIFAR100, and F-CelebA
- Hyperparameters
- Training setup
Evaluation metrics:
- Average performance (AP) and Average forgetting (AF).
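For reference, a small sketch of how these metrics are typically computed from an accuracy matrix; the exact convention in the paper (e.g. whether forgetting is measured against the best accuracy so far or the accuracy right after training) may differ, so treat the definitions below as the usual ones, not necessarily the authors'.

```python
import numpy as np

def average_performance(acc):
    """acc[t, i] = accuracy on task i after training on task t (i <= t).
    AP: mean accuracy over all tasks after the last task has been learned."""
    T = acc.shape[0]
    return acc[T - 1, :T].mean()

def average_forgetting(acc):
    """AF: for each old task, the drop from its best accuracy so far to its
    final accuracy, averaged over all tasks except the last one."""
    T = acc.shape[0]
    drops = [acc[:T - 1, i].max() - acc[T - 1, i] for i in range(T - 1)]
    return float(np.mean(drops))

# Example with 3 tasks (rows: after training task t, cols: accuracy on task i).
acc = np.array([[0.90, 0.00, 0.00],
                [0.85, 0.88, 0.00],
                [0.80, 0.86, 0.91]])
average_performance(acc)  # (0.80 + 0.86 + 0.91) / 3
average_forgetting(acc)   # ((0.90 - 0.80) + (0.88 - 0.86)) / 2
```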
Results:
It is evaluated in the task-incremental learning (TIL) setup and performs considerably better than EWC, LwF, MAS, GPM, A-GEM, Learn to Grow, Progressive Neural Networks, and Efficient Continual Learning with Modular Networks and Task-Driven Priors.
Limitations:
- It needs to build a separate expert model when a task is found to be very divergent, which partly conflicts with the continual learning goal of maintaining a single model.
- The task similarity measurement could perhaps be enhanced with meta-learning, possibly even without using a model to estimate distributions for the datasets (e.g., Task2Vec, Data2Vec).
- It is unclear how the method behaves in the Class-Incremental Learning (CIL) setup; perhaps it can be adapted to CIL.
Strengths:
- Considers task similarity to avoid unnecessary extra computation and to mitigate forgetting.
- Does not require a rehearsal memory (no raw samples from previous tasks are stored).
Future work:
This method is flexible, and in the future it is possible to implement more relatedness estimation, regularization, and allocation strategies within PAR.
Extending PAR to the class-incremental learning (CIL) setup is another future direction.