Research Projects

Transformer with a Mixture of Gaussian Keys

Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, these attention heads learn redundant embeddings, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus efficiently on different parts of the input sequence. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended for use with linear attention. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications, including language modeling and tasks that involve very long sequences. On the WikiText-103 and Long Range Arena benchmarks, Transformer-MGKs with 4 heads attain performance comparable to or better than baseline transformers with 8 heads.

T. Nguyen (co-first author), T. Nguyen (co-first author), D. D. Le, K. Nguyen, A. Tran, R. G. Baraniuk, N. Ho, S. J. Osher. Transformer with a Mixture of Gaussian Keys. Submitted to ICLR, 2022.
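
For illustration, here is a minimal sketch of single-head attention in which each position carries a small mixture of keys, in the spirit of Transformer-MGK; the Gaussian-kernel scoring, the mixture weights `pi`, and the shared bandwidth `sigma` are simplifying assumptions for this sketch, not the paper's exact formulation.

```python
import torch

def mgk_attention(q, k_mix, v, pi, sigma=1.0):
    """Toy attention with a mixture of keys per position (single head).

    q:     (n, d)     queries
    k_mix: (n, M, d)  M candidate keys per position
    v:     (n, d)     values
    pi:    (n, M)     mixture weights per position (rows sum to 1)
    """
    # Squared distances between each query and every mixture-component key:
    # dist[i, j, m] = ||q_i - k_{j,m}||^2
    diff = q[:, None, None, :] - k_mix[None, :, :, :]            # (n, n, M, d)
    dist = (diff ** 2).sum(-1)                                   # (n, n, M)

    # Gaussian-mixture score: sum_m pi_{j,m} * exp(-dist / (2 sigma^2))
    scores = (pi[None, :, :] * torch.exp(-dist / (2 * sigma ** 2))).sum(-1)  # (n, n)

    attn = scores / scores.sum(-1, keepdim=True).clamp_min(1e-9)  # normalize rows
    return attn @ v

# Tiny usage example with random tensors.
n, d, M = 6, 8, 2
q, v = torch.randn(n, d), torch.randn(n, d)
k_mix = torch.randn(n, M, d)
pi = torch.softmax(torch.randn(n, M), dim=-1)
out = mgk_attention(q, k_mix, v, pi)   # (n, d)
```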

GRAND++: Graph Neural Diffusion with a Source Term

We propose GRAph Neural Diffusion with a source term (GRAND++) for graph deep learning with a limited number of labeled nodes, i.e., a low labeling rate. GRAND++ is a class of continuous-depth graph deep learning architectures whose theoretical underpinning is the diffusion process on graphs with a source term. The source term guarantees two interesting theoretical properties of GRAND++: (i) the representation of graph nodes, under the dynamics of GRAND++, does not converge to a constant vector over all nodes even as time goes to infinity, which mitigates the over-smoothing issue of graph neural networks and enables graph learning with very deep architectures; (ii) GRAND++ provides accurate classification even when the model is trained with very limited labeled data. We experimentally verify these two advantages on various graph deep learning benchmark tasks, showing a significant improvement over many existing graph neural networks.

M. Thorpe (co-first author), T. Nguyen (co-first author), H. Xia (co-first author), T. Strohmer, A. Bertozzi, S. Osher, B. Wang. GRAND++: Graph Neural Diffusion with a Source Term. Submitted to ICLR, 2022.
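
Below is a conceptual sketch of graph diffusion with a source term, using an explicit Euler discretization; the random-walk normalization, step size, and the way the source is restricted to labeled nodes are illustrative assumptions, not GRAND++'s exact dynamics.

```python
import numpy as np

def graph_diffusion_with_source(A, X0, source, n_steps=50, dt=0.1):
    """Euler-discretized graph diffusion dX/dt = (D^{-1} A - I) X + source.

    A:      (n, n) adjacency matrix
    X0:     (n, d) initial node features
    source: (n, d) source term (e.g., nonzero only on labeled nodes)
    """
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.clip(deg, 1e-9, None)          # random-walk normalized adjacency
    X = X0.copy()
    for _ in range(n_steps):
        X = X + dt * ((P @ X - X) + source)   # diffusion step plus source injection
    return X

# Tiny usage example: 4-node path graph, source concentrated on node 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X0 = np.random.randn(4, 3)
source = np.zeros((4, 3))
source[0] = X0[0]                             # pretend node 0 is the only labeled node
X_T = graph_diffusion_with_source(A, X0, source)
```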

FMMformer: Efficient and Flexible Transformer via Decomposed Near-field and Far-field Attention

We propose FMMformers, a class of efficient and flexible transformers inspired by the celebrated fast multipole method (FMM) for accelerating interacting particle simulation. FMM decomposes particle-particle interaction into near-field and far-field components and then performs direct and coarse-grained computation, respectively. Similarly, FMMformers decompose the attention into near-field and far-field attention, modeling the near-field attention by a banded matrix and the far-field attention by a low-rank matrix. Computing the attention matrix for FMMformers takes time and memory that scale linearly in the sequence length, whereas standard transformers suffer from quadratic complexity. We analyze and validate the advantage of FMMformers over the standard transformer on the Long Range Arena and language modeling benchmarks. FMMformers can even outperform the standard transformer in accuracy by a significant margin; for instance, FMMformers achieve an average classification accuracy of 60.74% over the five Long Range Arena tasks, significantly better than the standard transformer’s average accuracy of 58.70%.

T. Nguyen, V. Suliafu, S. J. Osher, L. Chen, and B. Wang. FMMformer: Efficient and Flexible Transformer via Decomposed Near-field and Far-field Attention. NeurIPS, 2021.
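
The decomposition can be illustrated with a toy single-head example that combines exact attention on a band around the diagonal (near field) with a linear-attention-style low-rank term (far field); the random-feature far-field approximation and the simple sum of the two components are assumptions made for this sketch.

```python
import torch

def fmm_attention(q, k, v, band=3, rank=4):
    """Toy near-field (banded) + far-field (low-rank) attention decomposition.

    q, k, v: (n, d). `band` is the half-width of the near-field window and
    `rank` the number of random features for the far-field part; both are
    illustrative choices.
    """
    n, d = q.shape

    # Near-field: exact softmax attention restricted to a band around the diagonal.
    scores = (q @ k.t()) / d ** 0.5                                   # (n, n)
    idx = torch.arange(n)
    band_mask = (idx[:, None] - idx[None, :]).abs() <= band
    near = torch.where(band_mask, scores, torch.full_like(scores, float("-inf")))
    near = torch.softmax(near, dim=-1) @ v                            # (n, d)

    # Far-field: low-rank approximation via nonnegative random features
    # (linear-attention style), computed in O(n) time and memory.
    W = torch.randn(d, rank) / d ** 0.5
    qf, kf = torch.relu(q @ W), torch.relu(k @ W)                     # (n, rank)
    far = qf @ (kf.t() @ v)                                           # (n, d)
    far = far / (qf @ kf.sum(0, keepdim=True).t()).clamp_min(1e-9)    # normalize

    return near + far                                                 # combine components
```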

Heavy Ball Neural Ordinary Differential Equations

We propose heavy ball neural ordinary differential equations (HBNODEs), which leverage the continuous limit of classical momentum-accelerated gradient descent to improve the training and inference of neural ODEs (NODEs). HBNODEs enjoy two properties that imply practical advantages over NODEs: (i) the adjoint state of an HBNODE also satisfies an HBNODE, which accelerates both the forward and backward ODE solvers and significantly reduces the number of function evaluations (NFEs); (ii) the spectrum of HBNODEs is well structured, enabling effective learning of long-term dependencies from complex sequential data. We verify the advantages of HBNODEs over NODEs on benchmark tasks, including image classification, learning complex dynamics, and sequential modeling, where HBNODEs require markedly fewer forward and backward NFEs while being more accurate.

H. Xia, V. Suliafu, H. Ji, T. Nguyen, A. L. Bertozzi, S. J. Osher, and B. Wang. Heavy Ball Neural Ordinary Differential Equations. NeurIPS, 2021.
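
A minimal sketch of heavy-ball ODE dynamics integrated with explicit Euler steps is shown below; the damping coefficient `gamma`, the toy linear vector field, and the explicit integrator are illustrative choices rather than the paper's setup.

```python
import torch

def heavy_ball_ode_step(f, x, m, t, dt, gamma=0.5):
    """One explicit Euler step of a heavy-ball ODE system:
        dx/dt = m,    dm/dt = -gamma * m + f(x, t)
    where f is the learned vector field and gamma is a damping term.
    """
    x_new = x + dt * m
    m_new = m + dt * (-gamma * m + f(x, t))
    return x_new, m_new

# Tiny usage example with a random linear vector field.
d = 4
A = torch.randn(d, d) * 0.1
f = lambda x, t: x @ A.t()
x, m = torch.randn(1, d), torch.zeros(1, d)     # start with zero momentum
for step in range(100):
    x, m = heavy_ball_ode_step(f, x, m, t=step * 0.05, dt=0.05)
```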

MomentumRNN: Integrating Momentum into Recurrent Neural Networks

Designing deep neural networks is an art that often involves an expensive search over candidate architectures. To overcome this for recurrent neural networks (RNNs), we establish a connection between the hidden state dynamics in an RNN and gradient descent (GD). We then integrate momentum into this framework and propose a new family of RNNs, called MomentumRNNs. We theoretically prove and numerically demonstrate that MomentumRNNs alleviate the vanishing gradient issue in training RNNs. We study the momentum long short-term memory (MomentumLSTM) and verify its advantages in convergence speed and accuracy over its LSTM counterpart across a variety of benchmarks, with little compromise in computational or memory efficiency. We also demonstrate that MomentumRNN is applicable to many types of recurrent cells, including those in state-of-the-art orthogonal RNNs. Finally, we show that other advanced momentum-based optimization methods, such as Adam and Nesterov accelerated gradient with restart, can be easily incorporated into the MomentumRNN framework to design new recurrent cells with even better performance.

T. Nguyen, R. G. Baraniuk, A. L. Bertozzi, and S. J. Osher. MomentumRNN: Integrating Momentum into Recurrent Neural Networks. NeurIPS, 2020.
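
The following toy recurrent cell adds a momentum state to a vanilla RNN update in the spirit of MomentumRNN; the exact placement of the momentum coefficient `mu`, the step size `s`, and the tanh nonlinearity are assumptions for this sketch.

```python
import torch
import torch.nn as nn

class ToyMomentumRNNCell(nn.Module):
    """Recurrent cell with a momentum state:
        v_t = mu * v_{t-1} + s * (W x_t)
        h_t = tanh(U h_{t-1} + v_t)
    """
    def __init__(self, input_size, hidden_size, mu=0.6, s=1.0):
        super().__init__()
        self.Wx = nn.Linear(input_size, hidden_size, bias=False)
        self.Uh = nn.Linear(hidden_size, hidden_size)
        self.mu, self.s = mu, s

    def forward(self, x, h, v):
        v = self.mu * v + self.s * self.Wx(x)   # momentum accumulation on the input drive
        h = torch.tanh(self.Uh(h) + v)          # standard recurrent update on top
        return h, v

# Tiny usage example over a random sequence.
cell = ToyMomentumRNNCell(input_size=5, hidden_size=8)
h, v = torch.zeros(1, 8), torch.zeros(1, 8)
for x_t in torch.randn(10, 1, 5):               # sequence of length 10
    h, v = cell(x_t, h, v)
```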

Neural Networks with Recurrent Generative Feedback

Neural networks are vulnerable to input perturbations such as additive noise and adversarial attacks. In contrast, human perception is much more robust to such perturbations. The Bayesian brain hypothesis states that human brains use an internal generative model to update the posterior beliefs of the sensory input. This mechanism can be interpreted as a form of self-consistency between the maximum a posteriori (MAP) estimation of an internal generative model and the external environment. Inspired by this hypothesis, we enforce self-consistency in neural networks by incorporating generative recurrent feedback. We instantiate this design on convolutional neural networks (CNNs). The proposed framework, termed Convolutional Neural Networks with Feedback (CNN-F), introduces generative feedback with latent variables into existing CNN architectures, where consistent predictions are made through alternating MAP inference under a Bayesian framework. In our experiments, CNN-F shows considerably improved adversarial robustness over conventional feedforward CNNs on standard benchmarks.

Y. Huang, J. Gornet, S. Dai, Z. Yu, T. Nguyen, D. Y. Tsao, A. Anandkumar. Neural Networks with Recurrent Generative Feedback. NeurIPS, 2020.
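
As a rough illustration of recurrent generative feedback, the toy module below alternates a bottom-up encoder with a top-down decoder so the input estimate and latent code become approximately self-consistent; the linear encoder/decoder and the fixed number of iterations are stand-ins for illustration, not CNN-F's alternating MAP inference procedure.

```python
import torch
import torch.nn as nn

class ToySelfConsistentNet(nn.Module):
    """Alternate a feedforward encoder and a generative decoder for a few
    iterations, feeding the reconstruction back as the next input estimate."""
    def __init__(self, dim=32, latent=10):
        super().__init__()
        self.encoder = nn.Linear(dim, latent)   # stands in for the CNN (bottom-up)
        self.decoder = nn.Linear(latent, dim)   # stands in for generative feedback (top-down)

    def forward(self, x, n_iters=3):
        x_hat = x
        for _ in range(n_iters):
            z = self.encoder(x_hat)             # bottom-up inference pass
            x_hat = self.decoder(z)             # top-down generative feedback
        return z, x_hat

net = ToySelfConsistentNet()
z, x_hat = net(torch.randn(4, 32))              # latent code and self-consistent input estimate
```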

Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent

Stochastic gradient descent (SGD) with constant momentum and its variants such as Adam are the optimization algorithms of choice for training deep neural networks (DNNs). Since DNN training is incredibly computationally expensive, there is great interest in speeding up the convergence. Nesterov accelerated gradient (NAG) improves the convergence rate of gradient descent (GD) for convex optimization using a specially designed momentum; however, it accumulates error when an inexact gradient is used (such as in SGD), slowing convergence at best and diverging at worst. In this paper, we propose Scheduled Restart SGD (SRSGD), a new NAG-style scheme for training DNNs. SRSGD replaces the constant momentum in SGD with the increasing momentum in NAG but stabilizes the iterations by resetting the momentum to zero according to a schedule. Using a variety of models and benchmarks for image classification, we demonstrate that, in training DNNs, SRSGD significantly improves convergence and generalization; for instance, in training ResNet-200 for ImageNet classification, SRSGD achieves an error rate of 20.93% versus the benchmark of 22.13%. These improvements become more significant as the network grows deeper. Furthermore, on both CIFAR and ImageNet, SRSGD reaches similar or even better error rates with significantly fewer training epochs compared to the SGD baseline.

B. Wang (co-first author), T. Nguyen (co-first author), A. L. Bertozzi, R. G. Baraniuk, and S. J. Osher. Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent. arXiv preprint arXiv:2002.10583, 2020.
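
A conceptual sketch of a scheduled-restart, NAG-style update is given below; the t/(t+3) momentum schedule is the standard NAG approximation, while the restart frequency and the quadratic toy problem are illustrative choices, not the paper's training setup.

```python
import torch

def srsgd_step(params, grads, buffers, lr, iteration, restart_every=40):
    """One NAG-style step with a scheduled restart of the momentum counter.

    `params` are the iterates at which gradients were computed, `buffers` hold
    the previous un-accelerated iterates.
    """
    t = iteration % restart_every              # restart: momentum counter resets on schedule
    mu = t / (t + 3.0)                         # NAG-style increasing momentum
    new_params, new_buffers = [], []
    for p, g, buf in zip(params, grads, buffers):
        p_next = p - lr * g                    # plain gradient step
        p_nag = p_next + mu * (p_next - buf)   # look-ahead using current momentum
        new_params.append(p_nag)
        new_buffers.append(p_next)             # store un-accelerated iterate for next step
    return new_params, new_buffers

# Tiny usage example on a quadratic loss f(w) = ||w||^2 / 2 (gradient = w).
w = [torch.tensor([5.0, -3.0])]
buf = [w[0].clone()]
for it in range(200):
    g = [w[0]]                                 # gradient of the quadratic at the current iterate
    w, buf = srsgd_step(w, g, buf, lr=0.1, iteration=it)
```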

InfoCNF: An Efficient Conditional Continuous Normalizing Flow with Adaptive Solvers

Continuous Normalizing Flows (CNFs) have emerged as promising deep generative models for a wide range of tasks thanks to their invertibility and exact likelihood estimation. However, conditioning CNFs on signals of interest for conditional image generation and downstream predictive tasks is inefficient due to the high-dimensional latent code generated by the model, which needs to be of the same size as the input data. In this paper, we propose InfoCNF, an efficient conditional CNF that partitions the latent space into a class-specific supervised code and an unsupervised code shared among all classes, making efficient use of labeled information. Since the partitioning strategy (slightly) increases the number of function evaluations (NFEs), InfoCNF also employs gating networks to learn the error tolerances of its ordinary differential equation (ODE) solvers for better speed and performance. We show empirically that InfoCNF improves test accuracy over the baseline while yielding comparable likelihood scores and reducing the NFEs on CIFAR10. Furthermore, applying the same partitioning strategy in InfoCNF to time-series data helps improve extrapolation performance.

T. Nguyen, A. Garg, R. G. Baraniuk, A. Anandkumar. InfoCNF: An Efficient Conditional Continuous Normalizing Flow with Adaptive Solvers. arXiv preprint arXiv:1912.03978, 2019.
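
The partitioning strategy can be sketched as a simple split of the flow's latent code into a class-specific supervised part and a shared unsupervised part; the split point `sup_dim`, the latent dimensionality, and the linear classifier head below are hypothetical choices for illustration only.

```python
import torch
import torch.nn as nn

def partition_latent(z, sup_dim):
    """Split a latent code into a supervised (class-specific) part and an
    unsupervised part shared among all classes."""
    z_sup, z_unsup = z[:, :sup_dim], z[:, sup_dim:]
    return z_sup, z_unsup

# Toy usage: classify from the supervised code only.
z = torch.randn(16, 64)                    # latent code produced by a (hypothetical) CNF
z_sup, z_unsup = partition_latent(z, sup_dim=10)
classifier = nn.Linear(10, 10)             # supervised head on the class-specific code
logits = classifier(z_sup)                 # z_unsup is left for unsupervised modeling
```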

Neural Rendering Model: Joint Generation and Prediction for Semi-Supervised Learning

Unsupervised and semi-supervised learning are important problems, yet challenging with complex data like natural images. Progress on these problems would accelerate if we had access to appropriate generative models under which to pose the associated inference tasks.
Given the success of Convolutional Neural Networks (CNNs) for prediction on images, we design a new class of probabilistic generative models, the Neural Rendering Models (NRMs), whose inference corresponds to any given CNN architecture. NRM uses the given CNN to design the prior distribution in the probabilistic model. We show that this leads to efficient semi-supervised learning, which uses less labeled data while maintaining good prediction performance. NRM generates images from coarse to finer scales. It introduces a small set of latent variables at each level and enforces dependencies among all the latent variables via a conjugate prior distribution. This conjugate prior yields a new regularizer for training CNNs based on the paths rendered in the generative model, the Rendering Path Normalization (RPN). We demonstrate that this regularizer improves generalization, both in theory and in practice. Furthermore, likelihood estimation in NRM yields training losses for CNNs; inspired by this, we design a new loss, termed the Max-Min cross entropy, which outperforms the traditional cross-entropy loss for object classification. The Max-Min cross entropy suggests a new deep network architecture, namely the Max-Min network, to realize this loss. Numerical experiments demonstrate that the NRM with RPN and Max-Min cross entropy matches or exceeds the state of the art on benchmarks including SVHN, CIFAR10, and CIFAR100 for semi-supervised and supervised learning tasks.


T. Nguyen (co-first author), N. Ho (co-first author), A. B. Patel, A. Anandkumar, M. I. Jordan, R. G. Baraniuk. Neural Rendering Model: Joint Generation and Prediction for Semi-Supervised Learning. DeepMath, 2019.

N. Ho, T. Nguyen (co-first author), A. B. Patel, A. Anandkumar, M. I. Jordan, R. G. Baraniuk. The Latent-Dependent Deep Rendering Model. Workshop on Theoretical Foundations and Applications of Deep Generative Models at ICML, 2018.

A. B. Patel, T. Nguyen, and R. G. Baraniuk. A Probabilistic Framework for Deep Learning. NIPS, 2016.

T. Nguyen, W. Liu, E. Perez, R. G. Baraniuk, and A. B. Patel. Semi-supervised Learning with the Deep Rendering Mixture Model. arXiv preprint arXiv:1612.01942, 2016.

T. Nguyen, W. Liu, F. Sinz, R. G. Baraniuk, A. A. Tolias, X. Pitkow, A. B. Patel. Towards a Cortically Inspired Deep Learning Model: Semi-Supervised Learning, Divisive Normalization, and Synaptic Pruning. Conference on Cognitive Computational Neuroscience (CCN), 2017.

Learning Image Classifiers from (Limited) Real and (Abundant) Synthetic Data

While deep learning’s biggest successes in computer vision rely on massive datasets of labeled images, it is often costly or infeasible to acquire and annotate such voluminous data in practice. One promising solution is to train models on synthetic data, for which we know the true labels, and then deploy these models in real-world scenarios. Unfortunately, supervised learning techniques perform poorly when training and test distributions diverge, and the subtle differences between real and synthetic data significantly degrade performance. To learn models without real-world labels, we propose a two-part solution: (i) we employ a synthetic renderer capable of generating large amounts of realistically varying synthetic images; and (ii) we propose a domain adaptation strategy to bridge the gap between synthetic and real images. By mixing synthetic and real data in each minibatch during training, we improve test accuracy for object classification tasks. Finally, we propose the Mixed-Reality Generative Adversarial Network (MrGAN), which maps between synthetic and real data via a multi-stage, iterative process. The result of the optimization is a shared space into which both real and synthetic images can be mapped. After training in this shared space, our models generalize better from synthetic to real data. We validate the advantages of using synthetic data and MrGANs on our CIFAR-based datasets for domain adaptation. Using both synthetic data and MrGANs, we achieve an improvement of 8.85% in test accuracy.

T. Nguyen, H. Chen, Z. C. Lipton, L. Dirac, S. Soatto, A. Anandkumar. Learning Image Classifiers from (Limited) Real and (Abundant) Synthetic Data. 2018.
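
The minibatch-mixing strategy mentioned above can be sketched as a simple sampler that draws each batch partly from the (limited) real set and partly from the (abundant) synthetic set; the mixing fraction `real_frac` and batch size below are illustrative, not the values used in the experiments.

```python
import torch

def mixed_minibatch(real_x, real_y, syn_x, syn_y, batch_size=64, real_frac=0.25):
    """Draw a minibatch that mixes (limited) real and (abundant) synthetic data.

    `real_frac` controls the fraction of real examples per batch.
    """
    n_real = int(batch_size * real_frac)
    n_syn = batch_size - n_real
    ri = torch.randint(0, real_x.shape[0], (n_real,))   # sample real indices with replacement
    si = torch.randint(0, syn_x.shape[0], (n_syn,))     # sample synthetic indices
    x = torch.cat([real_x[ri], syn_x[si]], dim=0)
    y = torch.cat([real_y[ri], syn_y[si]], dim=0)
    perm = torch.randperm(batch_size)                   # shuffle so the two sources interleave
    return x[perm], y[perm]
```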