I'm Luke

I used to design robots; now I'm doing a PhD in machine vision. I use this website to collect research papers I've read and loved, and to journal my ideas. All posts are listed below in chronological order.


  • Scalable MatMul-free Language Modeling

    The authors eliminate Matrix Multiplication (MatMul) operations in large language models (LLMs), reducing computational costs while maintaining performance at billion-parameter scales. Their MatMul-free models match state-of-the-art Transformers and cut memory usage significantly. Performance gaps narrow with larger models. A GPU-efficient implementation reduces training memory by 61% and inference memory by over 10×. A custom FPGA solution achieves brain-like efficiency at 13W for billion-parameter models, showcasing lightweight operations for future LLM accelerators.
  • An Image is Worth More Than 16×16 Patches: Exploring Transformers on Individual Pixels

    The authors challenge the necessity of locality as an inductive bias in computer vision architectures. They find that vanilla Transformers, treating each pixel as a token, achieve high performance across object classification, self-supervised learning, and image generation tasks. This contrasts with Vision Transformers (ViT) that use 16×16 patches. Despite being less computationally practical, their Pixel Transformer (PiT) demonstrates that eliminating locality can yield better results, suggesting that locality is not essential for vision tasks. This finding urges the community to reconsider locality when designing future neural architectures for computer vision.
  • ChatGPT is bullshit

    Large language models (LLMs) like ChatGPT generate text that appears truthful but is not concerned with actual truth, making it more accurate to label their false claims as bullshit rather than lies or hallucinations. This distinction is important because current descriptions like hallucinations mislead the public and policymakers about the nature of LLM outputs.
  • Grokfast: Accelerated Grokking by Amplifying Slow Gradients

    This paper addresses grokking, where models achieve delayed generalization long after overfitting. The authors propose accelerating this process by decomposing parameter gradients into fast-varying (overfitting) and slow-varying (generalization) components. By amplifying the slow-varying gradients, their method, Grokfast, accelerates generalization by more than 50× with minimal code changes. Experiments show effectiveness across tasks involving images, languages, and graphs, making grokking more practical for machine learning practitioners under resource constraints.
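
    A minimal sketch of the gradient-filtering idea (my paraphrase of the EMA variant, not the authors' reference code; alpha and lamb are illustrative values): keep an exponential moving average of each parameter's gradient and add a scaled copy of it back before the optimizer step.

    ```python
    import torch

    def grokfast_ema(model, ema_grads, alpha=0.98, lamb=2.0):
        """Amplify the slow-varying (EMA) component of each gradient in place."""
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            g = p.grad.detach()
            if name not in ema_grads:
                ema_grads[name] = g.clone()
            else:
                ema_grads[name].mul_(alpha).add_(g, alpha=1 - alpha)
            p.grad.add_(ema_grads[name], alpha=lamb)  # boost the slow component
        return ema_grads

    # usage: loss.backward(); ema = grokfast_ema(model, ema); optimizer.step()
    ```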
  • Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models

    The authors investigate the reasoning capabilities of large language models (LLMs) such as GPT-3.5/4, Claude, and LLaMa by presenting them with a simple common-sense problem. Despite high performance on standardized benchmarks, these models show dramatic failures in solving the problem, displaying overconfidence and providing nonsensical justifications for incorrect answers. Standard interventions, like enhanced prompting, fail to improve outcomes. This study urges the scientific community to reassess LLM capabilities and develop benchmarks that better detect reasoning deficits to guide future improvements.
  • Contextual Position Encoding: Learning to Count What’s Important

    This paper argues that position encodings should not be assigned naively based only on absolute or relative position, but instead partly on position and partly on context. This is achieved via a gating mechanism sigmoid(q·k), where q is the current token's query and k is the target token's key. Position values are incremented as you move away from the current token, but each step adds only sigmoid(q·k) rather than 1. This is shown to help substantially with counting, copying, and similar tasks, and with generalisation to contexts longer than those seen in training.
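
    A rough sketch of how such context-dependent positions could be computed (shapes and details are my simplification, not the paper's exact formulation): the fractional position of token j relative to query i is the sum of the gates between them. Because these positions are fractional, the paper interpolates between integer position embeddings when using them in attention.

    ```python
    import torch

    def contextual_positions(q, k):
        """Fractional distances p[i, j] = sum of sigmoid(q_i . k_t) for t in [j, i].
        q, k: (seq, dim). Only causal entries (j <= i) are meaningful."""
        seq = q.shape[0]
        gates = torch.sigmoid(q @ k.T)                                  # (seq, seq)
        causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
        gates = gates.masked_fill(~causal, 0.0)
        # reversed cumulative sum along each row gives sum_{t=j..i} gates[i, t]
        pos = torch.flip(torch.cumsum(torch.flip(gates, dims=[1]), dim=1), dims=[1])
        return pos
    ```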
  • Transformers Can Do Arithmetic with the Right Embeddings

    Transformers perform poorly on arithmetic due to their inability to track the exact position of each digit within a number. This issue is addressed by adding an embedding to each digit that encodes its position relative to the start of the number. With positions resolved, the study investigates whether transformers can solve arithmetic problems that are larger and more complex than their training data. Training on 20-digit numbers with a single GPU for one day achieves state-of-the-art performance, with up to 99% accuracy on 100-digit addition problems. Additionally, these improvements in numeracy also enhance performance on other multi-step reasoning tasks, including sorting and multiplication.
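
    To make the idea concrete, here is a toy helper (an illustrative sketch, not the paper's tokenizer) that gives each digit an index counted from the start of its number; those indices would select an extra learned positional embedding added to the digit tokens.

    ```python
    def digit_positions(tokens):
        """'123+4567' -> [1, 2, 3, 0, 1, 2, 3, 4]; non-digits reset the counter."""
        positions, count = [], 0
        for t in tokens:
            count = count + 1 if t.isdigit() else 0
            positions.append(count)
        return positions

    print(digit_positions(list("123+4567")))
    ```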
  • The Road Less Scheduled

    This is Aaron Defazio's much-hyped schedule-free optimization method. Traditional convergence theory advocates for the Polyak-Ruppert (PR) average over the last iterate of SGD, but empirical evidence shows better performance with the latter. Defazio proposes a new averaging approach that matches the empirical performance of learning rate schedules without sacrificing theoretical guarantees. This method tracks the Pareto frontier of loss versus training time, requires no additional hyperparameters, and utilizes an alternative form of momentum with worst-case optimal properties. The key contribution is an online-to-batch conversion theorem, validating the optimality of the method and unifying existing theories. Extensive evaluations across 28 problems confirm that Schedule-Free methods perform comparably to or better than heavily-tuned schedules. Very cool.
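
    For a sense of how simple the recipe is, here is a hedged sketch of the SGD variant as I read it (notation and default values are mine): gradients are evaluated at an interpolation y of the running average x and the base iterate z, and the average x is what you evaluate at the end.

    ```python
    import torch

    def schedule_free_sgd(grad_fn, x0, lr=0.1, beta=0.9, steps=1000):
        """Sketch: y = (1-beta)*z + beta*x is the gradient point,
        z is a plain SGD iterate, x is the running average that gets returned."""
        z, x = x0.clone(), x0.clone()
        for t in range(1, steps + 1):
            y = (1 - beta) * z + beta * x
            z = z - lr * grad_fn(y)          # gradient taken at y, not at z or x
            c = 1.0 / (t + 1)
            x = (1 - c) * x + c * z          # online average of the z iterates
        return x
    ```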
  • Thermodynamic Natural Gradient Descent

    This study shows that natural gradient descent (NGD), a second-order optimization method, can achieve computational complexity per iteration similar to first-order methods given appropriate hardware. A new hybrid digital-analog algorithm for neural network training, equivalent to NGD in a certain parameter regime, avoids costly linear system solves. Exploiting the thermodynamic properties of an analog system, the algorithm requires an analog thermodynamic computer operating in a hybrid digital-analog loop for gradient and Fisher information matrix calculations. Numerical results show this method's superiority over state-of-the-art digital first- and second-order methods in classification and language model fine-tuning tasks.
  • The Platonic Representation Hypothesis

    This paper argues that AI model representations, especially in deep networks, are converging. It surveys examples in the literature showing how neural network representations align over time and across domains. It demonstrates convergence across data modalities, noting that as vision and language models grow, they measure distances between data points similarly. The authors hypothesize that this convergence leads to a shared statistical model of reality, termed the platonic representation, and discuss possible selective pressures toward it. The paper also addresses the implications, limitations, and counterexamples to this trend.
  • xLSTM: Extended Long Short-Term Memory

    LSTMs were introduced in the 1990s and were used as part of the first large language models (LLMs). Transformers have since surpassed LSTMs at scale. This study explores scaling LSTMs to billions of parameters using modern LLM techniques while addressing LSTM limitations. It introduces exponential gating with normalization and stabilization techniques and modifies the LSTM memory structure, creating (i) sLSTM with scalar memory, scalar update, and new memory mixing, and (ii) mLSTM with fully parallelizable matrix memory and covariance update. These extensions form xLSTM blocks, residually stacked into xLSTM architectures, enabling xLSTMs to perform comparably to state-of-the-art Transformers and State Space Models in performance and scaling.
  • KAN: Kolmogorov–Arnold Networks

    Inspired by the Kolmogorov-Arnold representation theorem, Kolmogorov-Arnold Networks (KANs) are proposed as alternatives to Multi-Layer Perceptrons (MLPs). Unlike MLPs with fixed activation functions on nodes, KANs have learnable activation functions on edges. KANs replace linear weights with univariate functions parametrized as splines. This modification enhances KANs' accuracy and interpretability compared to MLPs. Smaller KANs achieve comparable or superior accuracy in data fitting and PDE solving. KANs also exhibit faster neural scaling laws and are more interpretable, allowing intuitive visualization and user interaction. Examples in mathematics and physics demonstrate KANs' utility in helping scientists (re)discover laws. KANs present promising alternatives to MLPs, potentially advancing current deep learning models.
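
    A deliberately simplified sketch of the idea (Gaussian bumps stand in for the paper's B-spline parametrization): every edge i→j carries its own learnable scalar function, and each output is the sum of its incoming edge functions.

    ```python
    import torch
    import torch.nn as nn

    class TinyKANLayer(nn.Module):
        """output_j = sum_i phi_ij(x_i), with each phi_ij a learnable 1-D function."""
        def __init__(self, in_dim, out_dim, n_basis=8):
            super().__init__()
            self.register_buffer("centers", torch.linspace(-2.0, 2.0, n_basis))
            self.coeffs = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, n_basis))

        def forward(self, x):                       # x: (batch, in_dim)
            # evaluate each fixed basis bump at every input coordinate
            basis = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)   # (batch, in, basis)
            # each edge mixes the bumps with its own coefficients, then sums into outputs
            return torch.einsum("bik,oik->bo", basis, self.coeffs)
    ```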
  • Non-negative Contrastive Learning

    Deep representations offer promising performance for downstream tasks but lack interpretability, posing a significant challenge. In this paper, we introduce Non-negative Contrastive Learning (NCL), a refinement of Non-negative Matrix Factorization (NMF) aimed at generating interpretable features. NCL enforces non-negativity constraints, akin to NMF, resulting in sparse and disentangled representations, unlike standard contrastive learning (CL). We establish theoretical guarantees on NCL's identifiability and downstream generalization. Empirically, NCL outperforms CL in feature disentanglement, selection, and downstream classification tasks. Moreover, NCL can be extended to other learning scenarios, benefiting supervised learning.
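
    My reading of the core change, sketched below under the assumption that non-negativity is imposed directly on the features (the paper's exact parametrization may differ): it is an ordinary InfoNCE-style loss with the representations constrained to be non-negative.

    ```python
    import torch
    import torch.nn.functional as F

    def non_negative_contrastive_loss(z1, z2, temperature=0.5):
        """InfoNCE over two augmented views, with a non-negativity constraint on features."""
        z1, z2 = F.relu(z1), F.relu(z2)                 # the non-negative part
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        logits = z1 @ z2.T / temperature                # (batch, batch) similarities
        labels = torch.arange(z1.shape[0], device=z1.device)
        return F.cross_entropy(logits, labels)
    ```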
  • HYPO: Hyperspherical Out-of-Distribution Generalization

    We propose HYPO, a novel framework for out-of-distribution (OOD) generalization in machine learning, which learns domain-invariant features in a hyperspherical space. Our method focuses on aligning features of the same class across different domains close to their class prototypes and separating different class prototypes maximally. We offer theoretical justifications for its improvement on OOD generalization and show through experiments on OOD benchmarks that HYPO outperforms existing baselines with superior performance.
  • Atomically Accurate De Novo Design of Single-domain Antibodies

    Despite the critical role antibodies play in medicine, current methods for designing new antibodies targeting specific epitopes are time-consuming. Here, we showcase the effectiveness of a refined RFdiffusion network in designing novel antibody variable heavy chains (VHHs) to target user-specified epitopes. Through experiments, we validate the binding capability of these designed VHHs to four disease-relevant epitopes. Furthermore, cryo-EM structure analysis reveals that a designed VHH bound to influenza hemagglutinin closely matches the design model, affirming the accuracy of our approach.
  • Simple and Scalable Strategies to Continually Pre-train Large Language Models

    Large language models (LLMs) often require re-training on new data, consuming extensive computational resources. We present an efficient method that combines learning rate re-warming, re-decaying, and data replay, effectively maintaining performance without full re-training. This approach works well across different data distributions, including minor shifts (English→English) and significant shifts (English→German), tested up to 405M and 10B parameter models. Our findings suggest that continual learning strategies can update LLMs with minimal computational cost, rivaling traditional re-training methods. Additionally, we propose alternatives to the cosine learning rate schedule to reduce forgetting, offering more flexibility in learning without a fixed token budget.
  • VideoMamba: State Space Model for Efficient Video Understanding

    This work introduces VideoMamba, an innovative approach that addresses local redundancy and global dependencies in video understanding by adapting the Mamba framework to video analysis. VideoMamba outperforms existing 3D convolutional neural networks and video transformers through its linear-complexity operator, which facilitates efficient long-term modeling for high-resolution, long-duration videos. Its effectiveness is demonstrated across four main areas: scalability in the visual domain without needing extensive dataset pretraining, thanks to a novel self-distillation technique; the ability to recognize short-term actions with fine-grained motion differences; superior performance in long-term video understanding compared to traditional feature-based models; and robust compatibility with multiple modalities, enhancing multi-modal video analysis. VideoMamba establishes a new standard for comprehensive and efficient video understanding.
  • Chronos: Learning the Language of Time Series

    Chronos is a framework that enhances pretrained probabilistic time series models by tokenizing time series data for training with transformer-based architectures, notably the T5 family. Pretrained on a mix of public and synthetic datasets created via Gaussian processes, Chronos models outshine others in 42 benchmark datasets by showing superior performance on familiar datasets and competitive or better zero-shot capabilities on new datasets. This illustrates Chronos's ability to generalize across different domains, offering a simpler approach to forecasting tasks.
  • Is Cosine-Similarity of Embeddings Really About Similarity?

    Cosine similarity, often used to gauge semantic similarity between high-dimensional objects by comparing low-dimensional feature embeddings, can yield variable results compared to unnormalized dot products. We investigate this phenomenon by analyzing embeddings from regularized linear models, revealing that cosine similarity can produce arbitrary and even meaningless similarities. This applies not only to linear models but also to deep models due to the implicit effects of various regularizations. Consequently, we advise against relying solely on cosine similarity and suggest exploring alternative approaches.
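
    A two-line illustration of why the choice matters (toy vectors of my own, not from the paper): the dot product and cosine similarity can rank the same candidates very differently once embedding norms vary.

    ```python
    import numpy as np

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    a = np.array([1.0, 0.0])          # query embedding
    b = np.array([10.0, 1.0])         # large-norm candidate
    c = np.array([0.9, 0.1])          # small-norm candidate

    print(a @ b, a @ c)               # 10.0 vs 0.9 -- dot product strongly prefers b
    print(cos(a, b), cos(a, c))       # ~0.995 vs ~0.994 -- cosine calls them near-identical
    ```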
  • Stable Diffusion 3: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    This paper focuses on improving diffusion models for high-dimensional data like images. The authors enhance noise sampling techniques for training rectified flow models, prioritizing perceptually relevant scales. Their large-scale study shows superior performance over established diffusion methods in high-resolution text-to-image synthesis. A new transformer-based architecture is introduced for better text comprehension, typography, and human preference in text-to-image generation. The findings demonstrate predictable scaling trends, with lower validation loss tied to improved synthesis. The largest models outperform state-of-the-art approaches.
  • White-Box Transformers via Sparse Rate Reduction

  • Learning and Leveraging World Models in Visual Representation Learning

    The Joint-Embedding Predictive Architecture (JEPA) is a self-supervised method previously used for predicting missing input parts. This study expands JEPA to predict a wider range of corruptions by introducing Image World Models (IWM), which learn the effects of global photometric transformations in latent space. Key to effective IWM learning are conditioning, prediction difficulty, and model capacity. The study demonstrates that IWM's predictive world model, when fine-tuned, can tackle diverse tasks and either matches or outperforms existing self-supervised techniques. Additionally, it enables control over the abstraction level of learned representations, achieving either invariant or equivariant representations, similar to contrastive methods or masked image modeling, respectively.
  • Fine-tuning with Very Large Dropout

    Machine learning practice typically assumes that training and testing data share the same distribution; this study targets the case where they do not. It explores the effectiveness of very high dropout rates, as opposed to ensemble techniques, in developing rich data representations suitable for multiple distribution scenarios. These representations surpass those achieved by traditional in-distribution performance regularization and the implicit sparsity induced by common stochastic gradient methods. While training deep networks from scratch with high dropout rates is impractical, fine-tuning pre-trained models under these conditions is feasible and yields better out-of-distribution performance than ensembles and model averaging techniques like model soups. This finding is significant due to the growing relevance of fine-tuning with large pre-trained models, offering insights into the nature of rich representations and the linear characteristics of fine-tuning large networks with small datasets.
  • CL-MAE: Curriculum-Learned Masked Autoencoders

    This paper proposes a curriculum learning approach that updates the masking strategy of Masked Auto Encoders (MAE) to progressively increase the complexity of the self-supervised reconstruction task. To achieve this, a novel learnable masking module is introduced, capable of generating masks of varying complexities, and integrated into masked autoencoders (MAE). This module is jointly trained with the MAE, adjusting its behavior from partner (optimizing the same reconstruction loss) to adversary (optimizing the opposite loss), with a smooth transition regulated by a factor multiplied with the reconstruction loss. This training procedure creates an easy-to-hard curriculum. The Curriculum-Learned Masked Autoencoder (CL-MAE) is trained on ImageNet and demonstrates superior representation learning compared to MAE. Empirical results on five downstream tasks confirm that curriculum learning can successfully self-supervise masked autoencoders.
  • The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

    Recent research introduces BitNet b1.58, a ternary (values of -1, 0, 1) 1-bit Large Language Model (LLM) that matches traditional full-precision LLMs in performance and perplexity but is significantly more efficient in terms of latency, memory, throughput, and energy. This work establishes a new scaling law and training approach for future high-performance, cost-effective LLMs, while also facilitating the design of specialized hardware optimized for 1-bit LLMs.
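
    The weight quantizer described for b1.58 is short enough to sketch directly (the eps guard is my addition): scale each weight matrix by its mean absolute value, then round and clip to {-1, 0, 1}.

    ```python
    import torch

    def absmean_ternary(w, eps=1e-5):
        """Quantize a weight tensor to {-1, 0, 1} using an absmean scale."""
        scale = w.abs().mean().clamp(min=eps)
        return (w / scale).round().clamp_(-1, 1), scale
    ```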
  • MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

    State Space Models (SSMs) are emerging as strong competitors in sequential modeling, challenging Transformers. Integrating Mixture of Experts (MoE) has enhanced Transformer-based models, including cutting-edge open models. We suggest combining SSMs with MoE to further unlock their scaling potential. Our model, MoE-Mamba, based on the SSM model Mamba, exceeds the performance of both Mamba and traditional Transformer-MoE models. Notably, MoE-Mamba achieves Mamba's performance with 2.35 times fewer training steps, maintaining Mamba's inference advantages over Transformers.
  • Deep Networks Always Grok and Here is Why

    This paper introduces the concept of delayed robustness to describe DNNs becoming robust to adversarial examples post-generalization. This emergence is explained through a measure of local complexity, analyzing the density of linear regions in the DNN input space. The authors find that during training, linear regions concentrate near training samples, with the network's nonlinearities being pushed towards the decision boundaries.
  • ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

    Safety in large language models (LLMs) is crucial, but current safety techniques, such as data filtering and supervised fine-tuning, overlook the complexity of real-world applications, like the use of ASCII art in forums, which can bypass these safety measures. We introduce an ASCII art-based jailbreak attack, ArtPrompt, and a benchmark, Vision-in-Text Challenge (VITC), to test LLMs' abilities to recognize non-semantic prompts. Our findings reveal that state-of-the-art LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle with ASCII art recognition. ArtPrompt exploits this vulnerability, demonstrating that it can effectively compromise the safety mechanisms of these models with just black-box access.
  • Neural Network Diffusion

    Diffusion models, known for their success in image and video generation, can also generate high-performing neural network parameters, as demonstrated in this study. By employing a simple combination of an autoencoder and a standard latent diffusion model, our method involves extracting latent representations of network parameters, which are then synthesized from random noise by the diffusion model. These new representations, processed through the autoencoder's decoder, serve as fresh network parameters. Tested across various architectures and datasets, this diffusion approach consistently produces models with comparable or superior performance to traditionally trained networks at minimal extra cost. Importantly, the generated models show distinct performance differences from the trained ones, suggesting further exploration into the versatile applications of diffusion models.
  • Suppressing Pink Elephants with Direct Principle Feedback

    Existing methods for controlling language models focus on training desired behaviors, but often lack the flexibility needed for diverse applications. The authors address this with the Pink Elephant Problem, demonstrating the need for language models to adapt to different contexts by avoiding certain topics (the Pink Elephant) in favor of others (the Grey Elephant). They introduce Direct Principle Feedback (DPF), an adaptation of Constitutional AI that improves control by directly applying critiques and revisions without ranking responses. The study shows that a 13B LLaMA 2 model fine-tuned with DPF on a synthetic dataset outperforms existing models and matches GPT-4 in managing the Pink Elephant Problem.
  • Neural Networks Learn Statistics of Increasing Complexity

    The distributional simplicity bias (DSB) theory suggests neural networks first learn basic patterns in data before understanding more complex correlations. We provide new evidence supporting DSB, showing that networks initially excel on data matching the training set's simple statistics but that this ability diminishes later in training. Extending DSB to discrete domains, we demonstrate an equivalence between n-gram frequencies and vector moments, also observing this bias in large language models (LLMs). Additionally, by adjusting low-level statistics of images to resemble another class, we reveal that networks in early training phases misclassify these edited images as if they belonged to the target class.
  • Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

    This paper presents Vim, a vision architecture that uses bidirectional Mamba blocks, challenging the necessity of self-attention in vision. Testing on ImageNet, COCO, and ADE20K shows Vim outperforms established vision transformers like DeiT while being significantly more computation- and memory-efficient.
  • A Phase Transition between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention

    We explore the learning process of a dot-product attention layer, which learns both positional and semantic attention matrices, enabling tokens to attend based on position or meaning. Through experiments on an algorithmic task, we demonstrate that this architecture can use either mechanism for solving the task. Theoretically, we examine a non-linear self-attention layer with special query and key matrices, offering a closed-form solution for its non-convex loss landscape in high-dimensional data, which reveals a phase transition from positional to semantic mechanisms as sample complexity increases. We also show that the dot-product attention layer surpasses a linear positional baseline through the semantic mechanism with adequate data.
  • DINOv2: Learning Robust Visual Features without Supervision

    This paper introduces an automated pipeline for building high-quality, diverse image datasets. It also proposes minor modifications to the architecture and loss of a DINO ViT. A large ViT model with 1 billion parameters was trained and then distilled into smaller models.
  • Contrastive Masked Autoencoders are Stronger Vision Learners

    Contrastive Masked Autoencoders (CMAE) is a self-supervised method that adds a contrastive loss to the Masked Autoencoder. It features a dual-branch architecture, including an asymmetric encoder-decoder for holistic feature learning and a momentum encoder for boosting feature discriminability through contrastive learning.
  • Better Call GPT, Comparing Large Language Models Against Lawyers

    This paper compares Large Language Models (LLMs) with traditional legal contract reviewers—Junior Lawyers and Legal Process Outsourcers (LPOs). It evaluates whether LLMs outperform humans in accuracy, speed, and cost-efficiency during contract review. Empirical analysis benchmarks LLMs against a ground truth set by Senior Lawyers, revealing that advanced models match or exceed human accuracy in identifying legal issues. LLMs complete reviews in seconds, vastly outpacing the hours required by humans. Cost-wise, LLMs achieve a 99.97 percent reduction compared to traditional methods. These findings indicate a transformative shift in legal practice, with LLMs enhancing the accessibility and efficiency of legal services. The research suggests that LLM dominance in legal contract review challenges the status quo, necessitating a reimagined future for legal workflows.
  • MambaByte: Token-free Selective State Space Model

    MambaByte, a token-free language model, effectively operates on byte sequences without subword tokenization bias, offering computational efficiency and outperforming state-of-the-art subword models. Its linear scaling and fast inference demonstrate its potential for token-free language modeling.
  • Understanding Self-Supervised Pretraining with Part-Aware Representation Learning

    This paper investigates how self-supervised pretraining methods learn part-aware representations. The authors describe contrastive learning as transforming part representations into whole object representations and masked image modeling as inferring masked object parts from visible ones, suggesting that these methods predispose encoders to recognize object parts. Through empirical comparison, they find that while fully-supervised models excel in object-level recognition, self-supervised models, particularly those using contrastive learning and masked image modeling, perform better in part-level recognition. Combining contrastive learning and masked image modeling further enhances performance.
  • Hallucination is Inevitable: An Innate Limitation of Large Language Models

    This paper formally defines hallucination as the failure to reproduce the output of a computable function, showing it to be inevitable for any LLM regardless of architecture or training. Empirical studies validate these theoretical findings, highlighting the need for effective mitigators and careful deployment of LLMs in real-world applications.
  • VMamba: Visual State Space Model

    This paper introduces the Visual State Space Model (VMamba), inspired by state space models, to achieve linear complexity while maintaining global receptive fields. The addition of a multi-directional Cross-Scan Module (CSM) addresses direction-sensitivity, allowing effective spatial domain traversal. Extensive tests show VMamba's effectiveness in visual perception tasks, especially at higher resolutions.
  • Transformer-Based Visual Segmentation: A Survey

    This survey provides a comprehensive review of transformer-based visual segmentation, a crucial field for various applications. The authors cover the evolution from convolutional methods to vision transformers, offering a unified framework to simplify understanding recent advancements. The survey discusses various transformer-based segmentation approaches, modifications, and applications, highlighting specific areas such as 3D point cloud, foundation model tuning, domain-aware, efficient, and medical segmentation. The authors re-evaluate these methods on established datasets, outline current challenges, and suggest future research directions.
  • SAM as an Optimal Relaxation of Bayes

    This paper casts SAM, a method for enhancing generalization in deep learning, as a relaxation of the Bayes objective. SAM replaces the expected negative-loss with an optimal convex lower bound derived using the Fenchel biconjugate. This connection enables an Adam-like extension of SAM, offering automatic uncertainty estimates and potential accuracy improvements. Bridging adversarial and Bayesian methods, the work paves the way for robustness enhancement.
  • Data is Overrated: Perceptual Metrics Can Lead Learning in the Absence of Training Data

    Perceptual metrics, designed to replicate human perceptual behavior, are utilized as loss functions in generative models to capture the inherent structure of natural signals like images and audio. Taking this to the extreme, this study trains a compressive autoencoder on uniform noise instead of natural data in the audio domain. Results demonstrate that using perceptual losses enhances the reconstruction quality of spectrograms and re-synthesized audio at test time compared to standard Euclidean loss even when trained on pure noise.
  • Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    This paper introduces the Mamba architecture, a selective state space model (SSM) with input-dependent dynamics. Mamba has better speed and scalability than Transformers across various domains, including language, audio, and genomics, especially on tasks involving extremely long sequences. The authors also present the hardware-efficient scanning mechanism used in Mamba.
  • Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking

    Power et al. (2022) discovered that neural networks initially memorize arithmetic tasks, achieving perfect training accuracy but poor test accuracy, then suddenly achieve perfect test accuracy after extended training. This study explains this grokking phenomenon with the theory that early and late phase implicit biases cause this shift. Training homogeneous neural networks with large initialization and small weight decay on classification and regression tasks results in a prolonged period where the network acts like a kernel predictor, followed by a sudden shift to min-norm/max-margin predictors, significantly improving test accuracy.
  • Exponentially Faster Language Modeling

    UltraFastBERT, a BERT variant, operates with just 0.3% of its neurons (selectively using 12 out of 4095 neurons per layer) for inference, matching the performance of similar models. It replaces conventional feedforward networks with fast feedforward networks (FFFs) to achieve this efficiency. Although fully efficient conditional neural execution isn't yet practical, the authors provide high-level CPU code that achieves a 78× speedup and a PyTorch implementation with a 40× speedup over standard batched feedforward inference.
  • Masked Image Residual Learning for Scaling Deeper Vision Transformers

    Training deeper Vision Transformers (ViTs) presents challenges, including a degradation problem in deeper layers during masked image modeling (MIM) pre-training. In this paper, Masked Image Residual Learning (MIRL), a self-supervised learning framework, is introduced to alleviate the degradation issue and enable effective scaling of ViT depth for performance improvement. MIRL redefines the pre-training objective for deep ViT layers as learning to recover the residual of the masked image. Extensive testing shows that MIRL allows deeper ViTs to be optimized more effectively, enhancing accuracy with increased depth.
  • The geometry of hidden representations of large transformer models

    This paper investigates the geometric and statistical properties of representations across layers in large transformers used for self-supervised learning on various data types. The authors observe common evolution patterns, with data manifolds expanding initially, contracting at intermediate layers, and the intrinsic dimension (ID) stabilizing or peaking slightly towards the end. They find that semantic information peaks after initial expansion, a trend consistent across models and datasets. The study proposes an unsupervised method to identify layers richest in semantic content, suggesting that those at a relative ID minimum are optimal for downstream tasks.
  • Operational Neural Networks for Parameter-Efficient Hyperspectral Single-Image Super-Resolution

    Hyperspectral Imaging, a key tool in remote sensing, captures more spectral information than standard images but with lower spatial resolution. Super-resolution aims to enhance low-resolution inputs, with modern techniques often using deep convolutional neural networks (CNNs) that rely on non-linear activation functions. Recently, self-organized operational neural networks (SONNs) have been proposed, utilizing learnable non-linear functions instead of convolutional filters, to address the depth issue of CNNs. This study enhances a popular super-resolution model with operational filters for better hyperspectral image performance, examining the impact of residual connections and normalization types. Operational neural networks, despite fewer parameters, outperform CNN equivalents on small datasets.
  • When to Learn What: Model-Adaptive Data Augmentation Curriculum

    Data augmentation (DA) improves neural network generalization by enforcing invariances and symmetries to predefined transformations applied to input data. However, fixed augmentation policies affect samples differently at various training stages, and existing approaches cannot adapt policies to individual samples and the training model. This paper proposes Model-Adaptive Data Augmentation (MADAug) which trains an augmentation policy network to determine when to learn what. Unlike previous work, MADAug selects augmentation operators for each input image with a model-adaptive policy that varies between training stages, creating an optimized data augmentation curriculum. The policy is trained using a bi-level optimization scheme to minimize validation-set loss of a model trained with policy-produced augmentations. Extensive evaluations on multiple image classification tasks and network architectures show that MADAug outperforms or matches existing DA approaches, enhances fairness by improving all classes, particularly the difficult ones, and performs better when transferred to fine-grained datasets. Additionally, the auto-optimized policy in MADAug gradually increases perturbations, forming an easy-to-hard curriculum.
  • Vision Transformers Need Registers

    This paper is effectively a follow-up to DinoV2. The study addresses the issue of artifacts in feature maps of Vision Transformer (ViT) networks, identified as high-norm tokens appearing mainly in low-informative background areas during inference. The authors introduce a straightforward strategy of adding extra tokens to the input sequence of the Vision Transformer, which eliminates these artifacts for both supervised and self-supervised models. This approach not only resolves the artifact issue but also establishes new benchmarks for self-supervised visual models on dense visual prediction tasks, facilitates object discovery with larger models, and results in smoother feature and attention maps for downstream visual processing tasks.
  • Sigmoid Loss for Language Image Pre-Training

    This paper introduces a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP), which outperforms standard contrastive learning by not requiring a global normalization of pairwise similarities. This decouples the loss from the batch size, allowing the authors to scale up the batch size while also improving performance at smaller batch sizes. Pushing the batch size to one million showed diminishing returns, with 32k being a good middle ground.
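
    A sketch of the loss (the batch-mean reduction and variable names are mine; t and b are the learnable temperature and bias from the paper): every image-text pair becomes an independent binary classification, so no batch-wide softmax is needed.

    ```python
    import torch
    import torch.nn.functional as F

    def sigmoid_loss(img_emb, txt_emb, t, b):
        """Pairwise sigmoid loss: +1 labels on the diagonal (matched pairs), -1 elsewhere."""
        img_emb = F.normalize(img_emb, dim=1)
        txt_emb = F.normalize(txt_emb, dim=1)
        logits = img_emb @ txt_emb.T * t + b
        labels = 2.0 * torch.eye(logits.shape[0], device=logits.device) - 1.0
        return -F.logsigmoid(labels * logits).mean()
    ```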
  • Fast Feedforward Networks

    This paper introduces the fast feedforward (FFF) architecture, a log-time alternative to feedforward networks. FFFs are up to 220x faster than feedforward networks, up to 6x faster than mixture-of-experts networks, and exhibit better training properties than mixtures of experts thanks to noiseless conditional execution.
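
    A hedged sketch of the inference path as I understand it (NumPy pseudocode; the heap-style indexing is my choice): a depth-d binary tree of single decision neurons selects one of 2^d small leaf networks, so only a logarithmic fraction of the layer is ever evaluated.

    ```python
    import numpy as np

    def fff_forward(x, node_w, node_b, leaf_fns, depth):
        """node_w/node_b: parameters of the 2**depth - 1 decision neurons (heap order).
        leaf_fns: list of 2**depth callables, each a small feedforward block."""
        idx = 0
        for _ in range(depth):
            decision = float(x @ node_w[idx] + node_b[idx])   # one neuron picks the branch
            idx = 2 * idx + (1 if decision > 0 else 2)        # heap-style child index
        leaf = idx - (2 ** depth - 1)                         # heap index -> leaf index
        return leaf_fns[leaf](x)
    ```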
  • DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior

    This paper introduces DiffBIR, a framework that utilizes pretrained text-to-image diffusion models for blind image restoration. The two-stage pipeline involves pretraining a restoration module on various degradations to enhance real-world applicability, followed by the use of latent diffusion models for realistic restoration. The authors introduce a novel injective modulation sub-network, LAControlNet, for fine-tuning and employ pre-trained Stable Diffusion for its generative capabilities. Additionally, a controllable module allows users to adjust quality and fidelity during the denoising process. Extensive tests demonstrate DiffBIR's superiority in blind image super-resolution and face restoration tasks across synthetic and real-world datasets.
  • Loss of Plasticity in Deep Continual Learning

    Most deep learning systems are designed to be trained once, or possibly pretrained and fine-tuned. These systems perform quite poorly in continual learning setups where training is ongoing, primarily due to two problems: catastrophic forgetting and loss of plasticity. This paper addresses the second. Various architectures and techniques were tested, with L2-regularization and shrink-and-perturb improving plasticity a little. The paper then introduces Continual Backpropagation, which reinitializes dead units and seems to maintain plasticity indefinitely.
  • Attention Is All You Need

    We introduce the Transformer, a novel network architecture solely based on attention mechanisms, eliminating the need for recurrence and convolutions. Our experiments on machine translation tasks demonstrate superior quality, improved parallelization, and reduced training time compared to existing models. Achieving 28.4 BLEU on WMT 2014 English-to-German and 41.8 BLEU on English-to-French tasks, our model outperforms previous state-of-the-art results. Notably, it trains in only 3.5 days on eight GPUs, significantly reducing training costs. Furthermore, the Transformer demonstrates strong generalization to other tasks, including English constituency parsing, with both large and limited training data.
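
    The core operation, written as in the paper: Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. The sketch below omits the multi-head projections and dropout.

    ```python
    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(q, k, v, mask=None):
        """q, k, v: (..., seq, d_k). Optional boolean mask, True where attention is allowed."""
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v
    ```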
  • Pre-training Vision Transformers with Very Limited Synthesized Images

    Formula-driven supervised learning (FDSL) uses synthetic images from mathematical formulas, like fractals, for pre-training vision transformers, demonstrating competitive performance on various downstream tasks. This study proposes that generating different instances within the same category in FDSL acts as data augmentation. Given this perspective, a one-instance fractal database (OFDB) is developed in which only a single image per category is present. Despite OFDB containing only 21k images compared to ImageNet-21k's 14M, models pre-trained on it obtain comparable or superior results to models pre-trained on ImageNet-21k when fine-tuned on ImageNet-1k.
  • Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization

    Recent advancements have highlighted the effectiveness of flat minima in enhancing generalization, particularly through Sharpness-Aware Minimization (SAM). However, existing definitions of flatness, such as zeroth-order flatness, have limitations in discerning between minima with low and high generalization errors. To address this, we propose first-order flatness, which considers maximal gradient norm within a perturbation radius. We introduce Gradient norm Aware Minimization (GAM) as a novel training approach to achieve uniformly small curvature across all directions. Experimental results demonstrate GAM's ability to enhance generalization compared to standard optimizers like SGD and AdamW across various datasets and networks. Moreover, GAM facilitates SAM in identifying flatter minima, leading to improved generalization.
  • Impact of Noise on Calibration and Generalisation of Neural Networks

    This study investigates the effects of various noise injection and data augmentation strategies on neural networks (NNs) to enhance generalization, robustness, and calibration. Activation noise is shown to significantly improve generalization across scenarios, while input augmentation noise notably enhances calibration in out-of-distribution data but is less effective for in-distribution data.
  • On The Duality Between Contrastive and Non-contrastive Self-Supervised Learning

    Recent self-supervised learning approaches for image representation can be broadly divided into contrastive and non-contrastive methods. This study focuses on their theoretical similarities rather than their differences. By developing contrastive and covariance-based non-contrastive criteria that are algebraically related and equivalent under certain conditions, the authors demonstrate the close relationship between these two families. This analysis includes improving SimCLR's performance to match that of VICReg through precise hyperparameter adjustments and challenging the assumption that non-contrastive methods require large output dimensions. Results indicate that with better network design and hyperparameter tuning, the performance gap between contrastive and non-contrastive methods can be minimized.
  • Self-Paced Absolute Learning Progress as a Regularized Approach to Curriculum Learning

    Reinforcement Learning's usability is limited by its high computation times. Curriculum Reinforcement Learning accelerates learning by ordering tasks from simple to hard. Curricula based on Absolute Learning Progress (ALP) have shown success in various environments but waste computation on redundant tasks. This issue is addressed by introducing Self-Paced Absolute Learning Progress (SPALP), a regularization method inspired by Self-Paced Learning. Evaluated in three environments, SPALP achieves performance comparable to ALP in all cases and reaches it faster in two. Further improvements in SPALP's efficiency and performance are also discussed.
  • Architecture-Agnostic Masked Image Modeling – From ViT back to CNN

    Masked image modeling (MIM), a self-supervised pre-training method, enhances vision tasks using Vision Transformers by masking and reconstructing parts of an image. The compatibility of MIM with CNNs and its operational principle are unclear. This study reveals that MIM improves generalized feature extraction through middle-order patch interactions and introduces Architecture-Agnostic Masked Image Modeling (A2MIM), compatible with both Transformers and CNNs. Extensive testing demonstrates A2MIM's ability to enhance representation learning and transferability to various tasks without specialized modifications.
  • Dropout Reduces Underfitting

    Dropout is a well-known technique for preventing overfitting in neural networks. This study reveals that early application of dropout can also prevent underfitting by reducing the directional variance of gradients across mini-batches and aligning them with the full dataset's gradient, which improves the stability of SGD training. The authors then introduce two dropout schedules: early dropout, which prevents underfitting, and late dropout, which prevents overfitting.
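
    A minimal sketch of the two schedules (the cutoff fraction is an illustrative choice, not the paper's tuned value): early dropout is active only at the start of training, late dropout only after it.

    ```python
    def dropout_p(epoch, total_epochs, p=0.1, mode="early", cutoff_frac=0.2):
        """Return the dropout rate to use at this epoch under an early/late schedule."""
        cutoff = int(cutoff_frac * total_epochs)
        if mode == "early":
            return p if epoch < cutoff else 0.0
        return 0.0 if epoch < cutoff else p   # "late" dropout
    ```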
  • Learning from Children: Improving Image-Caption Pretraining via Curriculum

    This paper introduces a curriculum learning framework for image-caption pretraining, inspired by children's language learning from cognitive science, to address the challenges of aligning multiple concepts from captions to objects in images. The method starts with simple image-caption pairs and gradually increases complexity by adding more concepts, leveraging knowledge from each phase for subsequent learning. This approach outperforms traditional image-caption training across various settings, including starting from scratch, using pretrained encoders, and in low data scenarios.
  • On the Maximum Hessian Eigenvalue and Generalization

    This study investigates the relationship between training interventions and the generalization of deep networks. While previous research suggests that flatter solutions generalize better than sharper ones, particularly measured by λmax, the largest eigenvalue of the Hessian of the loss, this paper challenges this notion. Through experiments, we demonstrate that larger learning rates reduce λmax for all batch sizes but do not consistently improve generalization. Additionally, scaling batch size and learning rate simultaneously can change λmax without affecting generalization. Sharpness-Aware Minimization (SAM) produces smaller λmax but does not consistently enhance generalization, especially with larger batch sizes. Excessive dropout probabilities can degrade generalization, despite promoting smaller λmax. Batch normalization, while not consistently reducing λmax, still improves generalization. These findings question λmax's role in explaining generalization in neural networks, highlighting the limits of its explanatory power.
  • Reduction of Class Activation Uncertainty with Background Information

    This paper proposes a novel approach using a background class to enhance generalization in neural network training, offering computational efficiency compared to multitask learning. We introduce a method for selecting background images and explore potential enhancements. Our approach, applied to various datasets, demonstrates improved generalization with reduced computational cost. Furthermore, by analyzing class activation mappings, we observe a propensity for broader context comprehension in certain classification tasks. Integration of the proposed background class with transformers yields state-of-the-art performance on multiple datasets, including STL-10, CIFAR-10, CIFAR-100, Oxford-102, Caltech-101, and CINIC-10.
  • QLoRA: Efficient Finetuning of Quantized LLMs

    This paper introduces QLoRA, a finetuning method that enables a 65B parameter model to be finetuned on a single 48GB GPU while maintaining 16-bit task performance. QLoRA utilizes 4-bit quantized language models and Low Rank Adapters (LoRA) and incorporates several memory-saving innovations without compromising performance. The authors' best-performing model family, Guanaco, surpasses all openly available models on the Vicuna benchmark, achieving 99.3% of ChatGPT's performance with only 24 hours of finetuning on one GPU. The authors applied QLoRA to finetune over 1,000 models and analyzed performance across various datasets, model types, and scales, demonstrating its ability to achieve state-of-the-art results even with smaller models. Their findings suggest that GPT-4 evaluations are a viable substitute for human assessments and question the reliability of current chatbot benchmarks. The authors make their models and 4-bit training code public.
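
    A sketch of the adapter half of the recipe (the 4-bit NF4 quantization, double quantization, and paged optimizers are omitted; this is plain LoRA on a frozen base layer): only the low-rank update B·A is trained.

    ```python
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen base linear layer (4-bit quantized in QLoRA) plus a trainable low-rank update."""
        def __init__(self, base: nn.Linear, r=8, alpha=16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False              # frozen (and quantized in QLoRA)
            self.A = nn.Parameter(0.01 * torch.randn(r, base.in_features))
            self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
    ```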
  • A comprehensive review of Binary Neural Network

    Deep learning (DL) is widely used in intelligent systems and real-life applications, but its deployment on computationally limited and energy-constrained devices requires efficient technologies like Binary Neural Networks (BNN). BNNs save significant storage, computation cost, and energy, making them ideal for small devices, despite the trade-offs in memory and performance. This article offers a comprehensive review of BNN developments, focusing on 1-bit activations and weights. It covers the evolution of BNN, from early models to advanced algorithms and design aspects, and addresses BNN optimization, deployment, computing architectures, and diverse applications. It also outlines potential research directions in BNN technology.
  • Symbolic Discovery of Optimization Algorithms

    We propose a method for algorithm discovery via program search, focusing on optimizing deep neural network training. Our approach, Lion (EvoLved Sign Momentum), is memory-efficient and achieves comparable or superior performance to widely-used optimizers such as Adam and Adafactor across various tasks. Specifically, Lion enhances accuracy on tasks like image classification and vision-language contrastive learning, while reducing training compute. Notably, Lion exhibits improved performance with larger batch sizes and requires smaller learning rates compared to Adam. Despite its effectiveness, Lion has limitations, which we analyze, and provide insights into its deployment and performance. Our implementation is publicly available and has been successfully utilized in Google's search ads CTR model.
  • ImageNet-D: A new challenging robustness dataset inspired by domain adaptation

    We introduce ImageNet-D, a novel dataset designed to evaluate the robustness of ImageNet-trained models across various domains. With six distinct domains including Real, Painting, Clipart, Sketch, Infograph, and Quick-draw, ImageNet-D challenges even state-of-the-art models, revealing interpretable errors. For instance, the leading EfficientNet-L2 model suffers a significant performance drop, its error rising from 11.6% on clean ImageNet to 29.2% on the Real domain.
  • A Closer Look at Self-Supervised Lightweight Vision Transformers

    This study looks at self-supervised pre-training methods on small scale or lightweight Vision Transformers (ViTs). Surprisingly, with appropriate pre-training, lightweight ViTs can match or exceed the performance of state-of-the-art (SOTA) networks that have more complex designs. However, the study also highlights limitations, such as a lack of improvement from large-scale pre-training data and weaker performance on tasks with limited data. Through an analysis of layer representations and attention maps, the impact of pre-training is detailed. Furthermore, a distillation strategy during pre-training is proposed, enhancing downstream performance for Masked Autoencoder (MAE)-based methods.
  • What do Self-supervised vision transformers learn?

    This study compares contrastive learning (CL) and masked image modeling (MIM) in self-supervised Vision Transformers (ViTs), focusing on their representations and downstream task performance. Key findings include: (1) CL captures longer-range global patterns and is more shape-oriented, aiding in linear image separation but leading to homogenous self-attentions. (2) CL focuses on low-frequency signals, while MIM emphasizes high-frequencies, making MIM more texture-oriented. (3) CL is significant in later layers, whereas MIM targets early layers. The study suggests CL and MIM can be harmonized to leverage both methods' strengths, enhancing performance.
  • Hungry Hungry Hippos: Towards Language Modeling with State Space Models

    State Space Models (SSM), despite scaling better with sequence length, underperform attention and suffer from poor hardware utilization. The study introduces a new SSM layer, H3, designed to improve recall and comparison across sequences, narrowing the performance gap with Transformers. To improve SSM training efficiency, the paper proposes FlashConv, a method that significantly speeds up processing.
  • VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

    This paper introduces VideoMAE V2, a scalable, self-supervised pre-training approach for video foundation models, capable of handling billions of parameters. It utilizes a dual masking strategy to efficiently pre-train by dividing video tokens between the encoder and decoder, thus reducing computational costs. The approach includes progressive training, starting with an unlabeled multi-source dataset followed by a labeled mixed dataset. The result is a billion-parameter video ViT model that sets new performance records on the Kinetics and Something-Something datasets, demonstrating its efficacy as a general-purpose video representation learner.
  • I-JEPA Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

    This paper introduces the Image-based Joint-Embedding Predictive Architecture (I-JEPA) for learning semantic image representations without hand-crafted data augmentations. This self-supervised method uses multiple crops of an image. Given a context crop, the network is trained to predict the embeddings of several target crops. When applied to Vision Transformers, I-JEPA achieved strong performance in various tasks like image classification and depth prediction, outperforming Masked Auto Encoding (MAE) in linear probing when controlling for compute.
  • Hard Patches Mining for Masked Image Modeling

    This paper proposes Hard Patches Mining (HPM), a novel framework for masked image modeling (MIM) pre-training that goes beyond simply solving given problems. HPM aims for the model to generate more challenging tasks for itself, using reconstruction loss as a metric for task difficulty. It incorporates an auxiliary loss predictor to determine which patches to mask based on predicted patch-wise losses, using a strategy to avoid overfitting. Experiments show HPM's effectiveness in creating challenging masked images and enhancing representation quality through the loss prediction objective, highlighting its ability to identify and learn from hard-to-reconstruct areas.
  • DropKey

    This paper presents DropKey, a method for dropout in Transformer's self-attention layers, addressing three main aspects ignored by previous studies. Firstly, it introduces a dropout-before-softmax scheme by applying dropout to the Key before attention matrix calculation, maintaining regularization and attention weight probability features. Secondly, it proposes a decreasing drop ratio schedule across layers to balance feature retention and prevent overfitting. Thirdly, it evaluates the necessity of structured dropout, like in CNNs, and concludes it's not essential for Vision Transformers.
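
    A sketch of the first idea, dropout moved before the softmax (per-element masking here; the layer-wise ratio schedule and structure ablations are not shown): masked score entries get -inf, so the surviving attention weights still form a proper distribution.

    ```python
    import torch
    import torch.nn.functional as F

    def attention_with_dropkey(q, k, v, drop_p=0.1, training=True):
        """Drop attention logits (keys) before softmax instead of dropping weights after it."""
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        if training and drop_p > 0:
            drop = torch.rand_like(scores) < drop_p
            scores = scores.masked_fill(drop, float("-inf"))
        return F.softmax(scores, dim=-1) @ v
    ```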
  • Segment Anything

    The Segment Anything (SA) project introduces a novel task, model, and the largest image segmentation dataset to date, featuring over 1 billion masks across 11 million licensed and privacy-respecting images. The model consists of two parallel transformer encoders (text prompt and image) and a shared mask decoder. Its zero-shot performance has been evaluated across numerous tasks, often matching or surpassing previous fully supervised results.
  • Progressive Random Convolutions for Single Domain Generalization

    To address the limitations of Random Convolutions (RandConv) augmentation in single domain generalization, this paper introduces a Progressive Random Convolution (Pro-RandConv) method that layers random convolutions with small kernel sizes to maintain semantic integrity and enhance style diversity without increasing kernel size. Additionally, the authors enhance the random convolution block with deformable offsets and affine transformations for further texture and contrast diversification. This simple method surpasses the current state-of-the-art in single domain generalization benchmarks without relying on complex generators or adversarial learning.
  • Scaling Language-Image Pre-training via Masking

    This paper introduces Fast Language-Image Pre-training (FLIP), an efficient method for training CLIP that masks out a significant portion of image patches to allow more image-text pairs to be processed in the same amount of time, enhancing sample contrast within similar memory usage. FLIP surpasses the original no-masking approach in accuracy and speed, and significantly outperforms CLIP models on various downstream tasks. The speedup from this method enables experimentation with larger model sizes, more data, or longer training periods.
  • From MNIST to ImageNet and Back: Benchmarking Continual Curriculum Learning

    Continual learning (CL) is a promising trend in machine learning, aiming to develop robust models and strategies for dynamic environments by incorporating new knowledge while retaining past knowledge. CL research is fragmented, with various protocols, tasks, datasets, and metrics, often not reflecting real-world complexity and tailored to specific strategies. This work addresses this gap by introducing two novel CL benchmarks with heterogeneous tasks from six image datasets, varying in complexity and quality. These benchmarks fairly evaluate state-of-the-art CL strategies in scenarios closer to real-world conditions. The benchmarks present tasks in both increasing and decreasing complexity, assessing models' ability to exploit task structure. The work emphasizes a rigorous, reproducible evaluation protocol for measuring generalization and memory retention in models. Experimental results show that popular CL strategies perform poorly on these benchmarks, exhibit high forgetting, and struggle with curriculum task ordering. These findings underscore the need for rigorous comparisons and the development of new CL strategies for complex scenarios.
  • Adversarially Self-supervised Pre-training Improves Accuracy and Robustness

    This paper explores using adversarial training, typically a defense against adversarial shifts, to enhance visual representation pre-training for transfer across tasks and distribution shifts, integrating it with self-supervised methods like BYOL, MAE, and RotNet. It finds that adversarial self-supervision improves fine-tuning accuracy both within and outside distributions, outperforming standard methods even without adversarial fine-tuning. Optimal performance requires method-specific perturbation radii and preserving early layer parameters during fine-tuning. While no single method excels in all scenarios, adversarial MAE performs best for in-distribution tasks, and adversarial BYOL is superior for out-of-distribution tasks.
  • Extreme Masking for Learning Instance and Distributed Visual Representations

    The paper introduces ExtreMA, a method for learning visual representations by using high levels of token masking (75%-90%) for data augmentation. It employs self-attention and cross-attention blocks to learn spatial and holistic instance representations. Its contributions include demonstrating the effectiveness of random masking for siamese learning, showing that extreme masking accelerates learning and enhances performance.
  • Denoising Masked Autoencoders Help Robust Classification

    This paper introduces Denoising Masked AutoEncoders (DMAE), a self-supervised method for developing robust image classifiers. By corrupting images with Gaussian noise and masking patches, then reconstructing them using a Transformer-based model, DMAE's encoder captures essential semantics resistant to Gaussian noise. This encoder serves as a base for Gaussian smoothed models, enabling the computation of a certified radius for robustness. The DMAE ViT-Base model achieves comparable or superior certified accuracy with fewer parameters than previous approaches, while the ViT-Large model sets a new benchmark on ImageNet. The model also shows high transferability to CIFAR-10, indicating its broad applicability.
  • (Certified!!) Adversarial Robustness for Free

    This paper demonstrates how to achieve state-of-the-art certified adversarial robustness to ℓ2-norm bounded perturbations using only off-the-shelf pretrained models. The authors instantiate the denoised smoothing approach by combining a pretrained denoising diffusion probabilistic model and a standard high-accuracy classifier. This method certifies 71% accuracy on ImageNet under adversarial perturbations constrained to an ℓ2 norm of ε = 0.5, improving upon the prior certified state-of-the-art by 14 percentage points and denoised smoothing by 30 percentage points, without requiring any fine-tuning or retraining of model parameters.
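    A hedged sketch of the denoised-smoothing pipeline (`denoiser` and `classifier` stand in for the off-the-shelf pretrained models; the actual certification also requires statistical confidence bounds, which I omit):

    ```python
    import torch

    @torch.no_grad()
    def smoothed_predict(x, denoiser, classifier, sigma=0.5, n_samples=100):
        """x: (1, C, H, W). Majority-vote prediction under Gaussian noise."""
        votes = []
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)   # randomized-smoothing noise
            denoised = denoiser(noisy)                # pretrained diffusion model as a one-shot denoiser
            votes.append(classifier(denoised).argmax(dim=-1))
        return torch.cat(votes).mode().values         # most frequent class wins
    ```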
  • Towards a Unified Theoretical Understanding of Non-contrastive Learning via Rank Differential Mechanism

    Recent non-contrastive learning methods like BYOL, SimSiam, SwAV, and DINO have shown that asymmetric architectural designs can achieve good self-supervised visual learning performance by aligning positive pairs alone, despite a lack of unified theoretical understanding on how these designs prevent feature collapse. This study introduces the Rank Differential Mechanism (RDM) as a unified theoretical framework for non-contrastive learning. RDM demonstrates that these asymmetric designs maintain a consistent rank difference in output features, enhancing effective dimensionality and preventing feature collapse. Unlike previous theories, RDM applies to designs both with and without a predictor, offering a comprehensive understanding of non-contrastive learning methods and guiding the development of new variants. Experiments confirm that these variants perform comparably, if not better than, existing methods on benchmark datasets.
  • Sharpness-Aware Training for Free

    Modern deep neural networks (DNNs) excel in performance but often suffer from over-parameterization, leading to increased generalization error without tailored training strategies. Sharpness-Aware Minimization (SAM) has proven effective in reducing generalization error by minimizing sharpness in the loss landscape. However, SAM incurs a significant computational overhead. This paper introduces Sharpness-Aware Training for Free (SAF), which mitigates sharpness at nearly zero additional computational cost. SAF achieves this by preventing sudden drops in loss within sharp local minima during weight updates. A novel trajectory loss, based on KL-divergence between current and past DNN outputs, replaces SAM's sharpness measure, guiding convergence towards flat minima for enhanced generalization. Empirical results demonstrate SAF's effectiveness on ImageNet with comparable computational efficiency to the base optimizer.
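    A loose sketch of the trajectory-loss idea as I read it (the weighting and temperature below are arbitrary placeholders): the usual cross-entropy is combined with a KL term pulling the current outputs toward outputs remembered from an earlier epoch.

    ```python
    import torch.nn.functional as F

    def saf_style_loss(logits, targets, past_logits, lam=0.3, tau=2.0):
        """past_logits: the model's outputs for the same samples, stored from an earlier epoch."""
        ce = F.cross_entropy(logits, targets)
        kl = F.kl_div(
            F.log_softmax(logits / tau, dim=-1),
            F.softmax(past_logits.detach() / tau, dim=-1),
            reduction="batchmean",
        )
        return ce + lam * kl
    ```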
  • Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves

    This study introduces a novel methodology utilizing circular harmonics to explore the design space of contour-oriented synthetic datasets for formula-driven supervised learning (FDSL). The authors identify the optimal FDSL parameters and maximize synthetic image variety, which is crucial for success. Using the newly created VisualAtom-21k for pre-training, ViT-Base achieves a top-1 accuracy of 83.7% on ImageNet-1k, nearing the 84.2% achieved with JFT-300M pre-training but with significantly fewer images. The authors demonstrate FDSL's potential for continuous improvement and its ability to avoid issues common to real images, such as privacy/copyright concerns, labeling costs/errors, and ethical biases.
  • AdaSAM: Boosting Sharpness-Aware Minimization with Adaptive Learning Rate and Momentum for Training Deep Neural Networks

    The Sharpness Aware Minimization (SAM) optimizer, known for enhancing the generalization of deep neural networks by introducing extra perturbation steps, is further developed into AdaSAM by integrating adaptive learning rates and momentum acceleration. Although AdaSAM has been empirically applied to large-scale networks, a theoretical understanding of its performance, considering the complexity of its components, was lacking. This study presents a theoretical analysis of AdaSAM's convergence in stochastic non-convex settings, demonstrating a convergence rate of O(1/√bT) that scales linearly with mini-batch size. By introducing a delayed second-order momentum term, the study successfully decouples and analyzes the intertwined effects of stochastic gradients, adaptive learning rates, and perturbations. This is the first work to offer a detailed convergence rate for SAM with adaptive mechanisms. Experimental results on various NLP tasks indicate AdaSAM's superior performance over SGD, AMSGrad, and SAM optimizers.
  • Efficient Masked Autoencoders with Self-Consistency

    This paper introduces efficient masked autoencoders with self-consistency (EMAE) to enhance pre-training efficiency and prediction consistency for masked image modeling (MIM). EMAE divides the image into non-overlapping parts, each subject to a random mask with a uniform mask ratio, to perform parallel MIM tasks and generate predictions. A self-consistency module ensures consistent predictions for overlapping masked patches. EMAE improves data utilization and achieves reliable representations, showing superior results on ImageNet with only 300 pre-training epochs under ViT-Base compared to MAE's 1600 epochs. EMAE also demonstrates top-tier transfer performance in various downstream tasks, such as object detection and semantic segmentation.
  • Learning ReLU Networks To High Uniform Accuracy Is Intractable

    Statistical learning theory provides guidelines on the number of training samples needed for achieving desired accuracy in learning problems. However, this is not always adequate, particularly in security-sensitive areas or computational sciences, where uniform accuracy across all inputs is necessary. This paper quantifies the training samples required for uniform accuracy in learning problems involving ReLU neural networks, revealing that the number of samples needed exponentially increases with the network's depth and input dimension.
  • V-Jepa: Latent Video Prediction For Visual Representation Learning

    This paper introduces V-JEPA, a self-supervised video learning method that predicts masked spatio-temporal regions in latent space, effectively applying the masked-modelling principle of large language models to video. This approach generates visual features useful across various image and video tasks without needing model adjustment, achieving significant improvements on Kinetics-400 (82.1%) and Something-Something-v2 (71.2%) benchmarks, outperforming prior video models. V-JEPA also excels in motion understanding tasks, surpassing leading image models like DINOv2 and OpenCLIP, and achieves 77.9% on ImageNet classification with video training alone, setting a new standard for video models.
  • Transformers with Learnable Activation Functions

    Activation functions significantly affect model performance by reducing data complexity, yet their selection in Transformer-based language models is often overlooked. This paper explores the impact of using rational activation functions (RAFs), which unlike fixed activation functions (FAFs), can learn optimal functions from data. Our experiments demonstrate that Transformer models with RAFs (RAFT) outperform those with FAFs (FAFT), achieving a 5.71 point higher score on the GLUE benchmark with only 100 training examples and a 2.05 point increase on SQuAD with full data. The varied shapes of learned RAFs across layers and tasks suggest a new method for analyzing and understanding large pre-trained language models.
  • Scaling Vision Transformers to 22 Billion Parameters

    We introduce a method for efficiently training a 22B-parameter Vision Transformer (ViT-22B) and conduct various experiments to assess its performance. Compared to existing models, ViT-22B shows improved scalability and benefits such as enhanced fairness-performance tradeoff, alignment with human visual perception, and increased robustness. Our findings suggest promising avenues for achieving large-scale language model-like capabilities in vision tasks.
  • Mixed-order self-paced curriculum learning for universal lesion detection

    This paper introduces a novel approach called mixed-order self-paced curriculum learning (Mo-SCL) to address the challenges faced by self-paced curriculum learning (SCL) in medical image analysis tasks, such as universal lesion detection. These challenges include inaccurate difficulty estimation and the under-utilization of hard samples. Mo-SCL combines uncertainty and loss for better difficulty estimation and incorporates both hard and easy samples in training batches. Through theoretical analysis and experiments on the DeepLesion dataset, the authors demonstrate that Mo-SCL enhances lesion detection accuracy in state-of-the-art methods without requiring additional network modifications.
  • Energy-inspired Self-supervised Pretraining For Vision Models

    This paper introduces a self-supervised vision model pretraining framework inspired by energy-based models (EBMs), leveraging symmetric mappings in deep networks without auxiliary components. The framework models energy estimation and data restoration through the network's forward and backward passes, assigning low energy to unlabeled dataset samples and using gradient-based optimization to restore data from corrupted versions. This approach integrates the encoder-decoder architecture into a single model, supporting various pretext tasks. Experiments demonstrate that this method achieves comparable or better performance with fewer training epochs than current self-supervised pretraining methods, suggesting potential for further exploration in self-supervised vision model pretraining and pretext tasks.
  • Confidence-Aware Calibration and Scoring Functions for Curriculum Learning

    State-of-the-art deep neural networks often exhibit over-confidence in predictions, indicating miscalibration. Label Smoothing has been proposed to address this by softening hard targets during training, redistributing part of the probability mass from a ‘one-hot’ label uniformly to all other labels. However, neither model nor human confidence in a label is likely uniformly distributed, as some labels are more likely to be confused than others. This paper integrates model and human confidence with label smoothing, termed Model Confidence LS and Human Confidence LS, to improve model calibration and generalization. The study demonstrates how these confidence scores enhance curriculum learning, a strategy inspired by progressing from easier to harder tasks. Higher confidence scores indicate more recognizable and easier samples, serving as a scoring function to rank samples in curriculum learning. Evaluations using four state-of-the-art architectures for image and text classification, with multi-rater label annotations, show that integrating confidence information in label smoothing and curriculum learning improves both model performance and calibration. The code is available at https://github.com/AoShuang92/Confidence-Calibration-CL.
  • When Do Flat Minima Optimizers Work?

    Flat-minima optimizers, including Stochastic Weight Averaging (SWA) and Sharpness-Aware Minimization (SAM), enhance neural network generalization but lack thorough evaluation and cross-domain benchmarking. This study addresses this by comparing their loss surfaces and benchmarking across computer vision, natural language processing, and graph representation learning. The findings offer insights for optimizing deep learning optimizers and choosing suitable ones for specific problems.
  • DeiT III: Revenge of the ViT

    This paper explores the supervised training of Vision Transformers (ViTs) using a simplified training approach adapted from ResNet-50 that includes a novel data-augmentation method with just 3 augmentations. The study demonstrates that this method significantly improves ViTs' performance in image classification, transfer learning, and semantic segmentation over previous supervised training techniques. Moreover, it shows ViTs' performance can match newer architectures, providing a new benchmark for evaluating self-supervised methods on ViTs.
  • How Does Sharpness-Aware Minimization Minimize Sharpness?

    Sharpness-Aware Minimization (SAM) enhances deep neural networks' generalization across various settings. Although SAM aims to penalize model sharpness using a computationally efficient approach, its exact operational mechanism and the sharpness notion it regularizes remain unclear, partly due to differing sharpness concepts used in its theoretical framework and empirical validation. This study identifies the precise sharpness concept SAM regulates and explains its mechanism. It reveals that the combined effect of SAM's two-step approximations, despite being individually misleading, correctly improves generalization when using full-batch gradients. Moreover, we demonstrate that the stochastic SAM version indeed regularizes a third sharpness notion, aligning more closely with practical performance. This effectiveness is attributed to the gradient's alignment with the Hessian's top eigenvector under SAM.
  • Synthetic Image Data for Deep Learning

    This paper investigates using high-quality rendering software and domain randomization to generate a large synthetic dataset from 3D CAD models of a real vehicle. This synthetic dataset is used to augment limited real training data for image classification and semantic segmentation tasks. While models trained solely on synthetic images showed low accuracy on real validation data, including even small amounts of real data significantly improved performance. Augmenting real training data with synthetic images outperformed using only real images. Furthermore, pretraining models on the synthetic dataset before transfer learning significantly reduced training costs by allowing most of the training to be completed upfront using the synthetic data.
  • A Simple, Efficient and Scalable Contrastive Masked Autoencoder for Learning Visual Representations

    This paper introduces CAN, a method that combines contrastive learning, masked autoencoders, and noise prediction for efficient and scalable self-supervised visual learning. CAN outperforms existing methods in transfer learning and robustness tasks, showing particularly strong performance when pre-training on large, uncurated datasets. It offers a significant efficiency improvement and reduces the computational load.
  • FineAction: A Fine-Grained Video Dataset for Temporal Action Localization

    This paper introduces FineAction, a large-scale, fine-grained video dataset designed to address the limitations of current temporal action localization (TAL) benchmarks, which rely on coarse action classes and lead to model overfitting and ambiguous annotations. FineAction contains 103K instances across 106 action categories in 17K videos, offering a diverse and densely annotated dataset with co-occurring actions that pose new challenges for TAL. The authors evaluate popular localization methods on FineAction, revealing the impact of fine-grained instances on performance, and propose a baseline method achieving a 13.17% mAP. FineAction aims to advance TAL research and is accessible online.
  • VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

    The authors introduce VideoMAE, a data-efficient self-supervised video pre-training (SSVP) approach that utilizes video masked autoencoders with a novel high-ratio video tube masking technique inspired by ImageMAE. Key findings include the effectiveness of high masking ratios (90-95%) due to video's temporal redundancy, strong performance on small datasets (~3k-4k videos) without extra data highlighting the importance of high-level structure learning, and data quality being more crucial than quantity with domain shift being significant. Notably, VideoMAE achieves strong performance on several benchmarks using a basic ViT backbone without extra data.
  • How to Train Vision Transformer on Small-scale Datasets

    This study demonstrates that self-supervised learning can introduce effective inductive biases directly from small datasets, enabling the fine-tuning of Vision Transformers (ViTs) without relying on large-scale pre-training datasets like ImageNet and JFT or requiring modifications to the architecture or loss functions. The authors show that this approach improves ViT performance on small datasets such as CIFAR10/100, CINIC10, SVHN, Tiny-ImageNet, Aircraft, and Cars, while maintaining ViT's attention to relevant regions and robustness, despite ViT's inherent lack of inductive biases and typical dependence on large-scale pre-training.
  • Turbo Training with Token Dropout

    This paper introduces Turbo training, an efficient method for training Transformers on video-related tasks. The authors present a simple yet versatile training paradigm applicable to multiple video tasks and demonstrate its effectiveness across action classification, video-language representation learning, and long-video activity classification. Turbo training achieves competitive performance with up to 4× speed-up and reduced memory usage, enabling long-schedule video-language training and end-to-end long-video training with limited resources, outperforming or matching previous resource-intensive methods.
  • Exploring the Role of Mean Teachers in Self-Supervised Masked Auto-encoders

    This paper examines the role of the student/teacher paradigm in masked image modeling (MIM) for self-supervised learning (SSL) with Vision Transformers, particularly in the context of the Masked Auto-Encoder (MAE). Analysis of a simple linear model reveals that the teacher model acts as a conditional momentum regularizer by selectively filtering gradient directions based on feature similarity. Building on this insight, the authors introduce the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which integrates an exponential moving average (EMA) teacher with MAE. RC-MAE demonstrates faster convergence, reduced memory requirements, greater robustness, and improved performance on tasks like ImageNet-1K classification, object detection, and instance segmentation compared to the original MAE.
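    For reference, a generic exponential-moving-average teacher update of the kind used here (not the authors' exact code): the teacher's weights trail the student's, providing a slowly moving target.

    ```python
    import torch

    @torch.no_grad()
    def ema_update(teacher, student, momentum=0.999):
        """Move each teacher parameter a small step toward the matching student parameter."""
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(momentum).add_(s, alpha=1.0 - momentum)
    ```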
  • PatchDropout: Economizing Vision Transformers Using Patch Dropout

    This paper introduces PatchDropout, a simple training technique for Vision Transformers (ViTs) that drops random image patches, cutting computational and memory demands by at least 50% on datasets like ImageNet and even more with larger images. On the high-resolution CSAW medical dataset, PatchDropout achieves a 5× reduction in resources and improves performance, enabling more efficient model scaling and parameter tuning within fixed computational or memory budgets.
  • Vision Transformers for Action Recognition: a Survey

    This article provides an extensive review of vision transformer techniques applied specifically to human action recognition. These action transformers are analyzed and categorized based on their architecture, modality, and objectives. The review explores how action transformers encode spatio-temporal data, reduce dimensions, construct frame patches and spatio-temporal cubes, and various representation techniques. It also examines optimizing spatio-temporal attention in transformers for longer sequences and different learning strategies like self-supervised and zero-shot learning with their respective loss functions. Additionally, it assesses progress in benchmark evaluation scores and discusses challenges and future directions in this research area.
  • Learning Rate Perturbation: A Generic Plugin of Learning Rate Schedule towards Flatter Local Minima

    Learning rate is crucial in neural network training, yet existing schedules lack theoretical backing, often leading to suboptimal choices made through trial and error. To address this, we propose LEAP, a plugin enhancing various learning rate schedules by introducing perturbations. This simple yet effective strategy favors flat minima, ensuring better generalization. Extensive experiments demonstrate LEAP's ability to improve performance across diverse datasets and learning rate schedules, including constant ones.
  • Deeper Insights into the Robustness of ViTs towards Common Corruptions

    This paper investigates the robustness of Vision Transformer (ViT) variants against common corruptions. The benchmarking reveals that overlapping patch embedding and convolutional feed-forward networks (FFNs) significantly enhance ViT robustness. The study also scrutinizes the effectiveness of CNN-based data augmentation strategies when applied to ViTs, finding that adversarial noise training is effective, while Fourier-domain augmentation falls short. A new conditional method for generating dynamic augmentation parameters based on input images is proposed, achieving state-of-the-art robustness against common corruptions.
  • Model Generalization: A Sharpness Aware Optimization Perspective

    This study investigates the effectiveness of Sharpness-Aware Minimization (SAM) and adaptive Sharpness-Aware Minimization (ASAM) in enhancing model generalization. Through three experiments, we assess their impact from a sharpness-aware perspective. Results demonstrate that optimization techniques based on sharpness awareness can bolster model generalization. Furthermore, ASAM exhibits potential for enhancing generalization performance on un-normalized data, though additional research is required for validation.
  • 3D Gaussian Splatting for Real-Time Radiance Field Rendering

    This paper introduces a novel approach to Radiance Field methods for novel-view synthesis that addresses the challenges of high visual quality, costly training, and real-time rendering of unbounded, complete scenes at 1080p resolution. The authors propose three key innovations: 1) Utilizing sparse points from camera calibration to represent scenes with 3D Gaussians, optimizing scene fidelity while reducing computation in empty spaces. 2) Implementing interleaved optimization and density control of the 3D Gaussians, including anisotropic covariance adjustment for accurate scene depiction. 3) Developing a fast, visibility-aware rendering algorithm enabling anisotropic splatting, which speeds up training and supports real-time (≥ 30 fps) rendering at 1080p. The method demonstrates superior visual quality and real-time rendering capabilities across several datasets.
  • Understanding Masked Image Modeling via Learning Occlusion Invariant Feature

    Masked Image Modeling (MIM) has been successful in self-supervised visual recognition, yet its working mechanism remains unclear, especially compared to siamese approaches like contrastive learning. This study introduces a new perspective that MIM implicitly learns occlusion-invariant features, similar to the invariances learned by siamese methods. The authors show that MIM can be interpreted within a unified framework alongside traditional methods, differing only in data transformations and similarity measurements.
  • Efficiently Modeling Long Sequences with Structured State Spaces

    This paper introduces the Structured State Space sequence model (S4), a more efficient parameterization of State Space Models, demonstrating strong empirical performance across various benchmarks, including achieving state-of-the-art results and significantly outperforming prior models in efficiency and speed.
  • Sim-to-Real 6D Object Pose Estimation via Iterative Self-training for Robotic Bin Picking

    This paper introduces an iterative self-training framework for sim-to-real 6D object pose estimation to enable cost-effective robotic bin-picking. The authors create a photo-realistic simulator to synthesize virtual data for training an initial pose estimation network (teacher model). This teacher predicts poses on unlabeled real data, and an adaptive selection scheme filters reliable predictions to generate pseudo-labels for training a student model on real data. Iteratively refining the teacher with the student improves pseudo-label quality. Evaluated on public benchmarks and a new dataset, their method shows 11.49% and 22.62% ADD(-S) improvements and a 19.54% increase in bin-picking success, demonstrating the effectiveness of this iterative sim-to-real approach.
  • TinyViT: Fast Pretraining Distillation for Small Vision Transformers

    This paper introduces TinyViT, a series of compact and efficient vision transformers (ViTs) designed for devices with limited resources. The authors employ a fast distillation framework to transfer knowledge from large pretrained models to smaller ones during pretraining, using sparsified logits from teacher models to minimize memory and computational costs. TinyViT models are scaled down from larger counterparts under specific computation and parameter limits. Experiments show TinyViT achieves 84.8% top-1 accuracy on ImageNet-1k with just 21M parameters, comparable to Swin-B pretrained on ImageNet-21k but with 4.2 times fewer parameters. With increased image resolution, TinyViT reaches 86.5% accuracy, outperforming Swin-L with only 11% of its parameters, and demonstrates strong transferability across various downstream tasks.
  • Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning

    This paper proposes a method to enhance the generalization of deep neural networks (DNNs) by penalizing the gradient norm of the loss function during optimization. By constraining the gradient norm, the optimizers tend to find flat minima, improving generalization. We efficiently implement this method using first-order approximation within the gradient descent framework. Experimental results demonstrate improved generalization across various models and datasets. Additionally, we show that a recent method, sharpness-aware minimization, is a special case of our approach, with our method achieving new state-of-the-art performance on tested tasks.
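    A naive sketch of the penalized objective (this is the direct double-backprop version; the paper's efficient variant replaces it with a first-order approximation):

    ```python
    import torch

    def grad_norm_penalized_loss(model, loss_fn, x, y, lam=0.1):
        """Return L(theta) + lam * ||grad_theta L(theta)||, still differentiable w.r.t. theta."""
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        return loss + lam * grad_norm
    ```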
  • How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

    This paper explores the interactions between training data amount, model regularization or data augmentation (AugReg), model size, and compute budget for Vision Transformers (ViT) in various vision tasks. The authors find that using more compute and AugReg can achieve the same performance as training with significantly more data. They demonstrate that ViTs of various sizes trained on the public ImageNet-21k dataset can match or surpass models trained on the larger, non-public JFT-300M dataset.
  • Replacing Labeled Real-image Datasets with Auto-generated Contours

    This study demonstrates that formula-driven supervised learning (FDSL) can achieve or surpass the performance of ImageNet-21k pre-training for Vision Transformers (ViTs) without using real images or relying on human or self-supervision. A ViT-Base model pre-trained with FDSL achieved 82.7% top-1 accuracy on ImageNet-1k, exceeding the 81.8% accuracy from ImageNet-21k pre-training. Synthetic images generated by FDSL circumvent issues associated with real images, offering a promising avenue for pre-training general models. Investigations suggest that simple object contours can match fractal-based dataset performance and increasing pre-training task difficulty enhances fine-tuning accuracy.
  • Eigencurve: Optimal Learning Rate Schedule For SGD on Quadratic Objectives With Skewed Hessian Spectrums

    Learning rate schedulers are essential in deep neural network training, but there's a gap between practical usage and theoretical understanding. This paper introduces Eigencurve, the first set of learning rate schedules achieving minimax optimal convergence rates for SGD on quadratic objectives with skewed eigenvalue distributions of the Hessian matrix. This condition is common in practice. Experimental results on CIFAR-10 image classification tasks demonstrate Eigencurve's superiority over step decay, particularly with fewer epochs. The theory inspires practical schedulers approximating Eigencurve, resembling cosine decay for some problems while outperforming it in others.
  • Masked Autoencoders are Robust Data Augmentors

    This paper introduces Mask-Reconstruct Augmentation (MRA), an image augmentation technique that leverages self-supervised masked autoencoders to generate distorted inputs and address the issue of overfitting in deep neural networks. Inspired by the success of masked image modeling in self-supervised learning, MRA uses nonlinear transformations for regularization, as opposed to current hand-crafted linear techniques like scale, flip, and color jitter. Extensive testing across various image classification benchmarks demonstrates MRA's ability to significantly improve performance in supervised, semi-supervised, and few-shot classification tasks.
  • How Do Vision Transformers Work

    This study explores the operational mechanics of multi-head self-attentions (MSAs) and Vision Transformers (ViTs). The authors find that MSAs enhance accuracy and generalization by smoothing loss landscapes, attributed more to data specificity than managing long-range dependencies. ViTs, however, grapple with non-convex losses, mitigated by large datasets and specific smoothing techniques. The research contrasts MSAs and convolutional layers (Convs), noting their complementary nature as low-pass and high-pass filters, respectively. Multi-stage neural networks are found to function like a series of small models, with MSAs crucial for predictions at stage ends. The study introduces AlterNet, a model where Conv blocks are substituted with MSA blocks at stage ends, achieving superior performance over CNNs across both large and small data scenarios.
  • Efficient Sharpness-Aware Minimization For Improved Training of Neural Networks

    Overparametrized Deep Neural Networks (DNNs) can lead to severe generalization errors despite their impressive performances. It's been shown that the sharpness of the loss landscape is related to generalization error, leading to the development of the Sharpness Aware Minimizer (SAM) to improve generalization. However, SAM is computationally costly, doubling the time required compared to basic optimizers like Stochastic Gradient Descent (SGD). This paper introduces the Efficient Sharpness Aware Minimizer (ESAM), enhancing SAM's efficiency without sacrificing its generalization benefits. ESAM incorporates Stochastic Weight Perturbation and Sharpness-Sensitive Data Selection strategies for more efficient training. These methods approximate sharpness by perturbing selected weights and optimize the SAM loss with a carefully chosen subset of data, respectively. Theoretical justifications for these strategies are provided, and extensive testing on CIFAR and ImageNet shows that ESAM reduces the computational overhead of SAM from 100% to 40% while maintaining or improving test accuracies.
  • FNet: Mixing Tokens with Fourier Transforms

    We demonstrate that Transformer encoder architectures can be accelerated with minimal impact on accuracy by substituting self-attention sublayers with simple linear transformations. Remarkably, using a standard Fourier Transform instead of the self-attention sublayer in a Transformer encoder achieves 92-97% of BERT models' accuracy on the GLUE benchmark, while training 80% faster on GPUs and 70% faster on TPUs for standard 512 input lengths. Our FNet model significantly outperforms in speed at longer input lengths, matching the accuracy of the most accurate models on the Long Range Arena benchmark and surpassing the fastest models in speed across all sequence lengths on GPUs and shorter lengths on TPUs. Additionally, FNet is lightweight in memory use and exceptionally efficient in smaller sizes, outperforming Transformer models under the same speed and accuracy constraints.
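    The token-mixing sublayer itself is tiny; a sketch of the Fourier mixing as described (the surrounding feed-forward and layer-norm blocks of the encoder are unchanged):

    ```python
    import torch

    def fourier_mixing(x):
        """x: (batch, seq_len, hidden). 2D FFT over sequence and hidden dims, keep the real part."""
        return torch.fft.fft2(x, dim=(-2, -1)).real
    ```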
  • Vision Transformers in 2022: An Update on Tiny ImageNet

    The paper focuses on evaluating the performance of recent advancements in image transformers, such as Vision Transformer (ViT), Data Efficient Image Transformer (DeiT), Class Attention in Image Transformer (CaiT), and Swin Transformers, on the Tiny ImageNet dataset for transfer learning tasks. While these models are typically trained on large datasets like ImageNet-21k and then fine-tuned on ImageNet-1k, their assessments often overlook the Tiny ImageNet benchmark. The study updates the performance of vision transformers on Tiny ImageNet, highlighting that Swin Transformers outperform existing benchmarks with a validation accuracy of 91.35%. This underscores the importance of evaluating these models on the Tiny ImageNet dataset for transfer learning performance.
  • Better plain ViT baselines for ImageNet-1k

    This paper challenges the common belief that the Vision Transformer (ViT) model requires sophisticated regularization techniques to excel on ImageNet-1k scale data. The authors present minor modifications to the original ViT vanilla training setting that significantly improve the performance of plain ViT models. They find that standard data augmentation is sufficient, with 90 epochs of training surpassing 76% top-1 accuracy in under seven hours on a TPUv3-8, similar to the classic ResNet50 baseline, and 300 epochs of training reaching 80% in less than one day.
  • Swin Transformer V2 Scaling Up Capacity and Resolution

    This paper explores large-scale models in computer vision, addressing training instability, resolution gaps, and data hunger. Techniques proposed include a residual-post-norm method with cosine attention for stability, log-spaced continuous position bias for resolution transfer, and SimMIM for self-supervised pre-training to reduce labeled data needs. The study successfully trains a 3 billion-parameter Swin Transformer V2 model, setting performance records on four vision tasks. Notably, it achieves higher efficiency than Google's billion-level visual models, requiring 40 times less labeled data and training time.
  • Task2Sim: Towards Effective Pre-training and Transfer from Synthetic Data

    The paper investigates the use of synthetic data generated by graphics simulators for pre-training computer vision models. The authors find that model performance on downstream tasks varies with different simulation parameters used to generate the synthetic data. They introduce Task2Sim, a model that maps the requirements of a downstream task to the optimal simulation parameters for generating synthetic pre-training data.
  • Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers

    This paper explores alternatives to masked image modeling (MIM) for self-supervised visual representation learning in vision transformers (ViTs). Five different learning objectives are proposed, each degrading the input image in a different way, analogous to masking. Design principles are proposed for token-based pre-training of vision transformers. The most effective strategy combined preserving the original image style with spatial misalignment and masking. This approach outperformed traditional MIM on downstream recognition tasks without increasing computational demands.
  • Self-supervised Transformers for Unsupervised Object Discovery Using Normalized Cut

    This paper presents a graph-based method for object discovery in images using self-supervised transformer features, specifically those trained with self-distillation loss (DINO). The authors treat visual tokens as nodes within a weighted graph, with edges representing similarity between tokens. By applying a normalized graph-cut through spectral clustering and generalized eigendecomposition, they segment foreground objects based on the second smallest eigenvector. The method outperforms the state-of-the-art LOST on VOC07, VOC12, and COCO20K datasets, and incorporating a second stage class-agnostic detector (CAD) further improves performance. The approach also extends to unsupervised saliency detection and achieves competitive results in weakly supervised object detection on CUB and ImageNet.
  • Surrogate Gap Minimization Improves Sharpness-Aware Training

    Sharpness-Aware Minimization (SAM) enhances generalization by optimizing a neighborhood-based perturbed loss, but it doesn't always favor flat minima due to both sharp and flat minima having low perturbed losses. We introduce a new measure, the surrogate gap, reflecting the dominant Hessian eigenvalue at small neighborhood radii, which is simple to compute and can be minimized during training. We propose the Surrogate Gap Guided Sharpness-Aware Minimization (GSAM), an advancement over SAM with minimal additional computational cost. GSAM employs a dual-step approach: first, minimizing the perturbed loss similar to SAM, and second, reducing the surrogate gap without affecting the perturbed loss to target regions with low loss and sharpness, thus achieving superior generalization. GSAM is theoretically robust, showing better convergence and empirical generalization improvements, notably a +3.2% gain over SAM and +5.4% over AdamW in ImageNet accuracy for ViT-B/32.
  • Three things everyone should know about Vision Transformers

    This research presents three key findings using variants of vision transformers: (1) Vision transformers' residual layers can be processed in parallel to some extent without significantly impacting accuracy. (2) Fine-tuning attention layer weights alone effectively adapts transformers for higher resolution and different classification tasks, reducing compute and memory use while allowing weight sharing. (3) Incorporating MLP-based patch pre-processing enhances Bert-like self-supervised training with patch masking. The authors validate these approaches using the ImageNet-1k dataset and confirm their findings with the ImageNet-v2 test set, evaluating transfer performance across six additional datasets.
  • The loss landscape of deep linear neural networks: a second-order analysis

    This study investigates the optimization landscape of deep linear neural networks with square loss, focusing on the role of non-strict saddle points in algorithm dynamics, which has been previously understudied. The authors conduct a comprehensive second-order analysis, identifying global minimizers, strict, and non-strict saddle points among all critical points, alongside their critical values. Their findings, based on conditions related to the ranks of partial matrix products, contribute insights into global convergence and implicit regularization observed in optimizing linear neural networks. Additionally, the authors offer an explicit parameterization of global minimizers and identify extensive sets of strict and non-strict saddle points.
  • Do Vision Transformers See Like Convolutional Neural Networks?

    Convolutional Neural Networks (CNNs) have been prominent in processing visual data, but recent studies show Vision Transformers (ViTs) can match or outperform CNNs in image classification. This work examines how ViTs solve classification tasks, discovering significant differences from CNNs, including uniformity in ViT representations across layers due to self-attention and residual connections that enhance feature propagation. ViTs also maintain spatial information effectively, influenced by classification methods. The study further explores how dataset scale affects ViT features and their transferability, linking these findings to new architectures like MLP-Mixer.
  • DeepNet: Scaling Transformers to 1,000 Layers

    This paper introduces DEEPNORM, a novel method to stabilize extremely deep Transformers through a new normalization function and a theoretically derived initialization for the residual connection. DEEPNORM offers a stable and efficient training regime, blending the benefits of Post-LN's performance with Pre-LN's stability, allowing for the scaling of Transformers up to 1,000 layers without difficulty. This represents a significant advancement over previous models, with a 200-layer, 3.2B parameter model outperforming a 48-layer, 12B parameter state-of-the-art model by 5 BLEU points on a multilingual benchmark with 7,482 translation directions, suggesting a promising direction for scaling.
  • Optimal learning rate schedules in high-dimensional non-convex optimization problems

    This paper explores the role of learning rate schedules in high-dimensional and non-convex optimization problems, focusing on Langevin optimization with a decaying learning rate. Analyzing models with Gaussian random functions on N-dimensional spheres, the study reveals that to accelerate optimization without getting trapped in saddles, a decay rate β < 1 is optimal, contrary to convex settings where β = 1 is preferred. Introducing a signal recovery component, the dynamics involve an exploration phase navigating through rough landscape parts and a convergence phase entering convex basins. It's found optimal to maintain a large learning rate during exploration to swiftly exit non-convex regions, then transition to β = 1 for rapid convergence to the solution. These findings are validated in a neural network regression task.
  • Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    This paper introduces the concept of grokking, a sudden onset of perfect generalisation long after overfitting, on small, algorithmically generated datasets. The study also finds that generalization on smaller datasets requires more optimization. The authors suggest that such datasets are ideal for investigating the puzzling phenomenon of how overparametrized neural networks generalize beyond simply memorizing their training data.
  • Masked Autoencoders Are Scalable Vision Learners

    This paper presents masked autoencoders (MAE) as efficient and scalable self-supervised learners for computer vision. MAE involves masking random patches of an input image and reconstructing the missing pixels using an asymmetric encoder-decoder architecture and a lightweight decoder. Masking a significant portion of the input image creates a challenging yet informative self-supervisory task. These innovations allow for efficient training of large models, tripling training speed and enhancing accuracy. A vanilla ViT-Huge model reaches top accuracy (87.8%) on ImageNet-1K data among similar methods. The model's transfer performance in downstream tasks surpasses that of supervised pre-training, indicating promising scaling potential.
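    A simplified sketch of the random masking (shapes and names are my own, not the reference implementation): only the visible subset of patches is encoded, and the restore indices let the lightweight decoder place mask tokens back in the right positions before reconstructing pixels.

    ```python
    import torch

    def random_masking(patches, mask_ratio=0.75):
        """patches: (batch, num_patches, dim) -> (visible patches, restore indices)."""
        b, n, d = patches.shape
        n_keep = int(n * (1 - mask_ratio))
        noise = torch.rand(b, n, device=patches.device)
        shuffle = noise.argsort(dim=1)        # random permutation per image
        restore = shuffle.argsort(dim=1)      # inverse permutation, used by the decoder
        keep = shuffle[:, :n_keep]
        visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, d))
        return visible, restore
    ```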
  • SinLU: Sinu-Sigmoidal Linear Unit

    This paper introduces a non-linear activation function called the Sinu-sigmoidal Linear Unit (SinLU), formulated as SinLU(x) = (x + a sin bx) · σ(x). It incorporates a sine wave for added functionality over traditional linear units, with two trainable parameters controlling the sinusoidal participation. The authors compare SinLU's performance to ReLU, GELU, and SiLU across various domains, models, and standard datasets. The results demonstrate SinLU's robustness and superior performance due to the incorporation of the trainable sine wave parameters, facilitating easy training and fast convergence.
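    The formula translates directly into a tiny module; a sketch with a and b as trainable scalars (initialising both to 1 is my assumption):

    ```python
    import torch
    import torch.nn as nn

    class SinLU(nn.Module):
        """SinLU(x) = (x + a*sin(b*x)) * sigmoid(x), with trainable scalars a and b."""
        def __init__(self, a=1.0, b=1.0):
            super().__init__()
            self.a = nn.Parameter(torch.tensor(float(a)))
            self.b = nn.Parameter(torch.tensor(float(b)))

        def forward(self, x):
            return (x + self.a * torch.sin(self.b * x)) * torch.sigmoid(x)
    ```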
  • Intriguing Properties of Vision Transformers

    This study investigates how ViT's flexibility in contextual attention aids in overcoming challenges like occlusions, domain shifts, and perturbations in natural images. The authors find ViTs show remarkable resilience to occlusions, perturbations, and domain shifts, maintaining high accuracy even when most of the image is obscured. Unlike CNNs, ViTs exhibit less texture bias, focusing more on shape-based features, which enhances their shape recognition to levels comparable with the human visual system. Additionally, ViTs can perform accurate semantic segmentation without pixel-level supervision and create feature ensembles from a single model for improved classification performance in both traditional and few-shot learning settings. These advantages stem from their dynamic receptive fields enabled by self-attention mechanisms.
  • On Interaction Between Augmentations and Corruptions in Natural Corruption Robustness

    Building robust models in computer vision requires invariance to image corruptions like warping, noise, or color shifts. Despite new data augmentations improving performance on ImageNet-C, the correlation between data augmentations and test-time corruptions remains unclear. This paper introduces Minimal Sample Distance and demonstrates a strong correlation between augmentation-corruption similarity and corruption performance. The authors argue that training with augmentations that are perceptually similar to the test-time corruptions is what drives the reduction in test error.
  • Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding

    Recent advancements in vision transformers (ViTs) have shown superior performance across diverse visual tasks, surpassing convolutional neural networks (CNNs). Given ViT's distinct architecture, understanding its behavior and reliability is imperative. This paper investigates ViT's robustness by comparing it with CNNs under various image corruptions relevant to real-world vision tasks. While ViT generally exhibits comparable or improved robustness over CNNs, it consistently underperforms in contrast enhancement tasks. Analysis suggests that positional embedding in ViT's patch embedding may malfunction with color scale changes. We propose PreLayerNorm, a modified patch embedding structure, to address this issue and ensure scale-invariant behavior in ViT. ViT with PreLayerNorm demonstrates enhanced robustness across various corruptions, particularly in contrast-varying environments.
  • Localizing Objects with Self-Supervised Transformers and no Labels

    This paper introduces LOST, a method for unsupervised object localization in images. LOST utilizes activation features from a self-supervised pre-trained vision transformer. Unlike other methods, LOST works on individual images without relying on external object proposals or image collection exploration. It surpasses existing object discovery methods by up to 8 CorLoc points on PASCAL VOC 2012. Training a class-agnostic detector on the objects found by LOST further improves performance by 7 points, demonstrating its effectiveness in unsupervised object discovery.
  • Do Transformer Modifications Transfer Across Implementations and Applications?

    This paper evaluates numerous Transformer architecture modifications in a unified experimental framework, focusing on common natural language processing (NLP) applications. Surprisingly, it finds that most modifications do not significantly enhance performance. The beneficial variants are mostly minor or developed in the same code base used for testing. The study suggests that performance gains may largely depend on implementation details and offers recommendations for improving the generalizability of experimental results.
  • Universal Adversarial Robustness of Texture and Shape-biased Models

    This paper analyzes the adversarial robustness of deep neural networks (DNNs) with texture and shape biases against Universal Adversarial Perturbations (UAPs). Through evaluation, it finds that shape-biased models alone do not significantly enhance adversarial robustness. However, combining texture and shape-biased models into ensembles can increase universal adversarial robustness while retaining high performance.
  • An Empirical Study of Training Self-Supervised Vision Transformers

    This study explores fundamental training components for self-supervised ViT, identifying instability as a key issue that undermines accuracy despite seemingly successful outcomes. By addressing these instabilities, the paper demonstrates improvements in training stability and accuracy. Results and ablations are benchmarked across MoCo v3 and other self-supervised frameworks.
  • The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization

    We present four novel real-world distribution shift datasets encompassing changes in image style, blurriness, location, camera settings, and more. Evaluating existing methods for enhancing out-of-distribution robustness, we discover that employing larger models and artificial data augmentations can enhance robustness against real-world distribution shifts, contradicting prior claims. Our findings demonstrate that improvements in artificial robustness benchmarks can indeed transfer to real-world distribution shifts, contrary to prior assumptions. Additionally, we introduce a novel data augmentation technique that surpasses models pretrained with significantly more labeled data, emphasizing its efficacy in addressing real-world distribution shifts. While some methods consistently mitigate texture and local image statistics shifts, they fail to address other shifts like geographic changes. Our results underscore the necessity for future research to examine multiple distribution shifts concurrently, as no single method consistently improves robustness across all evaluated scenarios.
  • Delving Deep into Label Smoothing

    Label smoothing is a regularization technique that reduces overfitting and enhances classification performance. It creates soft labels via a weighted blend of the uniform distribution and the hard label. This paper introduces an Online Label Smoothing (OLS) strategy that generates soft labels from the model's prediction statistics for the target category, creating a more accurate probability distribution. This significantly improves classification accuracy and model robustness to noisy labels.
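    For reference, a sketch of the vanilla soft-label construction that OLS builds on; the online variant replaces the uniform part with running per-class prediction statistics, which I don't reproduce here.

    ```python
    import torch
    import torch.nn.functional as F

    def smooth_labels(targets, num_classes, eps=0.1):
        """targets: (batch,) integer class ids -> (batch, num_classes) soft labels."""
        one_hot = F.one_hot(targets, num_classes).float()
        return (1.0 - eps) * one_hot + eps / num_classes
    ```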
  • Automated Learning Rate Scheduler for Large-batch

    Large-batch training is crucial for deep learning with large datasets and models, but requires specific learning rate (LR) schedules for optimal performance, especially under limited training epochs. This work introduces an automated LR scheduling algorithm for large-batch neural network training within a fixed epoch budget, consisting of adaptive warmup and predefined decay phases. The LR is dynamically adjusted based on training loss monitored through Gaussian process smoothing, facilitating low computational overhead. When integrated with adaptive stochastic optimizers like AdamP and LAMB, this scheduler eliminates the need for extensive hyperparameter tuning and achieves competitive or superior results on various image classification tasks across different batch sizes and architectures.
  • Are Convolutional Neural Networks or Transformers more like human vision?

    This study investigates the performance of Convolutional Neural Networks (CNNs) and attention-based Vision Transformers (ViTs) in computer vision tasks. While CNNs excel in accuracy, ViTs offer a different approach with weaker inductive biases. By analyzing error patterns, we find that ViTs exhibit consistency with human errors, suggesting potential for more human-like vision models and insights into human object recognition.
  • ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Network

    Recent advances in learning algorithms have highlighted the importance of the sharpness of the loss surface as a key indicator of the generalization gap, achieving state-of-the-art results. However, traditional sharpness measures, defined within a fixed region, suffer from sensitivity to parameter scaling, weakening their predictive value for the generalization gap. This paper introduces a scale-invariant measure, adaptive sharpness, along with a new generalization bound. We present a new learning method, adaptive sharpness-aware minimization (ASAM), which leverages this bound. Our experiments across various benchmark datasets demonstrate that ASAM significantly enhances model generalization.
  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    This study introduces the Vision Transformer (ViT) architecture. The paper shows that a pure transformer applied directly to sequences of image patches can achieve impressive results. When pre-trained on large datasets and then applied to various mid-sized or small image recognition benchmarks (such as ImageNet, CIFAR-100, VTAB), ViT performs comparably or even better than the latest convolutional networks, with significantly less computational cost for training.
  • Dino: Emerging Properties in Self-Supervised Vision Transformers

    This paper shows that self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs or convnets. It also proposes a modification of BYOL-style self-supervised learning with a momentum encoder, multi-crop training, and sharpening of the teacher network's outputs.
  • How to decay your learning rate

    Empirical findings suggest that typical fine-tuned learning rate schedules decay the learning rate following weight norm fluctuations. This led to the development of ABEL, an automatic scheduler that adjusts the learning rate based on weight norm changes. ABEL performs comparably to tuned schedules but demonstrates greater robustness to parameter variations. Extensive experiments across various domains reveal that when the weight norm remains stable, simplified schedules yield equivalent performance to complex ones, resembling a constant learning rate with decay towards the end of training.
  • Rethinking Batch in BatchNorm

    BatchNorm, essential in convolutional neural networks, behaves uniquely due to its batch-based operation, leading to performance issues in visual recognition tasks. This paper identifies these issues and suggests reevaluating the concept of batch in BatchNorm for better performance, aiming to guide researchers in its effective use.
  • Robust and Generalizable Visual Representation Learning via Random Convolutions

    This study introduces random convolutions for data augmentation. Random convolutions approximately preserve shape while distorting local textures. This method significantly enhances performance on domain generalization and robustness / corruption benchmarks, including a substantial improvement in generalizing to sketch domains over current state-of-the-art methods.
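    A minimal sketch of the augmentation (the kernel size and normalization are my own choices; the paper also mixes the output with the original image): a convolution whose weights are re-sampled on every call scrambles local texture while roughly preserving shape.

    ```python
    import torch
    import torch.nn.functional as F

    def rand_conv(images, kernel_size=3):
        """images: (batch, 3, H, W). Apply a freshly re-initialized random convolution."""
        weight = torch.randn(3, 3, kernel_size, kernel_size, device=images.device)
        weight = weight / weight.abs().sum(dim=(1, 2, 3), keepdim=True)  # keep output range tame
        return F.conv2d(images, weight, padding=kernel_size // 2)
    ```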
  • AdaHessian: An Adaptive Second Order Optimizer for Machine Learning

    We present AdaHessian, a novel second-order stochastic optimization algorithm that dynamically incorporates the curvature of the loss function via adaptive estimates of the Hessian. Despite the superior convergence properties of second-order methods over first-order methods like SGD and Adam, traditional second-order methods suffer from heavier per-iteration computation and poor accuracy. AdaHessian addresses these issues through innovative approaches, including a fast Hutchinson-based method for low computational overhead in approximating the curvature matrix, a root-mean-square exponential moving average to smooth out variations of the Hessian diagonal, and block diagonal averaging to reduce the variance of Hessian diagonal elements. Empirical results demonstrate that AdaHessian significantly outperforms other adaptive optimization methods, including variants of Adam, across various tasks such as computer vision, natural language processing, and recommendation systems. Specifically, AdaHessian achieves higher accuracy in image classification tasks, outperforms AdamW in transformer models, and achieves superior performance in tasks such as GLUE and recommendation systems. Importantly, AdaHessian demonstrates comparable per-iteration cost to first-order methods and robustness to hyperparameters.
  • Does Enhanced Shape Bias Improve Neural Network Robustness to Common Corruptions

    Incorporating diverse image styles into CNN training data reduces texture bias, enhances shape recognition, and improves resilience against common image corruptions. This is typically explained as decreasing model reliance on high-frequency texture information. This paper challenges this explanation through large-scale testing with natural images, edge information, and stylization, finding no direct link between shape bias and robustness. The enhanced corruption robustness is instead attributed to style variation in data augmentation, with increased shape bias being an indirect effect.
  • TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation

    The paper introduces TrivialAugment, a parameter-free and surprisingly effective automatic augmentation method that applies a single augmentation to each image. Despite its minimal complexity and cost, TrivialAugment outperforms existing methods across various image classification scenarios, as validated through extensive experiments against state-of-the-art methods and multiple ablation studies involving different augmentation spaces and methods. The work includes a user-friendly interface and fully shared codebase to encourage adoption and reproducibility. Noting a stagnation in automatic augmentation research, the authors conclude by proposing best practices for future advancements in the field.
  • CLIP: Learning Transferable Visual Models From Natural Language Supervision

    This paper introduces a novel approach to computer vision that learns from raw text descriptions of images, moving beyond training on fixed object categories. By pre-training on 400 million (image, text) pairs using caption-image matching, the method achieves state-of-the-art image representations from scratch. This allows for zero-shot transfer to various downstream tasks using natural language, eliminating the need for additional labeled data. The model's performance was evaluated across over 30 diverse datasets, showing competitive results against fully supervised baselines without task-specific training, and matching the accuracy of ResNet-50 on ImageNet zero-shot without using its training examples.
  • Training data-efficient image transformers & distillation through attention

    This paper introduces DeiT, an efficient method for training vision transformers. The authors present a unique teacher-student strategy for transformers, leveraging a distillation token for efficient learning from a convolutional network teacher. This method achieves performance on par with convolutional networks, with up to 85.2% accuracy on ImageNet, and demonstrates effective transferability to other tasks.
  • Shrink and Perturb: On Warm-Starting Neural Network Training

    In machine learning systems where data arrives incrementally, it's common to build a sequence of models that incorporate progressively more data. It has been shown that simply continuing training of a model on new data often results in poorer generalization compared to models initialized randomly, despite similar training losses. This discrepancy persists even when hyperparameter adjustments are made, often at the expense of the time saved through warm starting. This work investigates this phenomenon and introduces shrink and perturb, a simple yet effective method to mitigate the issue, with experiments demonstrating its utility in various contexts.
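    A minimal sketch of the shrink-and-perturb step applied before warm-started training resumes; the shrink factor and noise scale are illustrative assumptions:

        import torch

        @torch.no_grad()
        def shrink_and_perturb(model, shrink=0.5, noise_std=0.01):
            # Shrink every weight towards zero and add a small random perturbation,
            # then continue training on the combined old + new data as usual.
            for p in model.parameters():
                p.mul_(shrink).add_(torch.randn_like(p) * noise_std)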
  • Curriculum Learning by Dynamic Instance Hardness

    This paper improves curriculum learning by introducing Dynamic Instance Hardness (DIH), a metric that measures a sample's learning difficulty over time. DIH provides a stable indicator of learning progress by tracking the exponential moving average of a sample's hardness. DIHCL enhances learning efficiency and model accuracy without extra computational costs by leveraging data from the training process itself. Tested on 11 datasets, DIHCL surpasses traditional training methods and recent curriculum learning techniques in both efficiency and effectiveness.
  • Scheduled DropHead: A Regularization Method for Transformer Models

    This paper introduces DropHead, a structured dropout method tailored for multi-head attention in transformers, offering a novel approach by dropping entire attention heads to avoid dominance by a few and reduce overfitting. Authors also propose a dropout rate scheduler to optimize training.
  • An Investigation of how Label Smoothing Affects Generalization

    Label smoothing has been empirically shown to reduce overfitting and improve generalization. However, its mathematical underpinnings remain unclear. This paper explains label smoothing's effectiveness in controlling generalization loss, especially in scenarios with partially incorrect training labels. The authors identify a method for calculating a label smoothing value that minimizes generalization loss.
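    A minimal sketch of the standard label-smoothing construction the paper analyses; the smoothing value eps here is illustrative, not the loss-minimising value the authors derive:

        import torch

        def smooth_targets(labels, num_classes, eps=0.1):
            # Mix the one-hot targets with a uniform distribution over classes.
            one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
            return (1 - eps) * one_hot + eps / num_classes

        # Usage: cross-entropy against the smoothed distribution.
        logits = torch.randn(4, 10)
        targets = smooth_targets(torch.tensor([1, 3, 0, 7]), num_classes=10)
        loss = -(targets * torch.log_softmax(logits, dim=-1)).sum(dim=-1).mean()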
  • Visual Identification of Individual Holstein-Friesian Cattle via Deep Metric Learning

    The study presents a novel approach for the detection and identification of Holstein-Friesian cattle. This technique offers a completely hands-off solution for automated detection, localization, and identification of cattle from overhead images in open herd settings, without the need for re-training when new cattle are introduced. The system utilizes convolutional neural networks and deep metric learning, achieving high accuracy with a 93.8% success rate in identifying cattle unseen during training, using only half of the population.
  • Bootstrap Your Own Latent A New Approach to Self-Supervised Learning

    This paper introduces BYOL, a self-supervised learning method that does not need negative pairs. BYOL uses a teacher-student pair of networks, where the teacher is the EMA of the student. The teacher and student are given two differently augmented versions of the input, and the loss is the mean squared error between their normalised outputs.
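    A minimal sketch of one direction of the BYOL objective (normalised MSE against the stop-gradient teacher) and the EMA teacher update; the momentum value is an illustrative assumption and the symmetrised second direction is omitted:

        import torch
        import torch.nn.functional as F

        def byol_loss(student_pred, teacher_proj):
            # Normalised MSE, equivalent to 2 - 2 * cosine similarity.
            p = F.normalize(student_pred, dim=-1)
            z = F.normalize(teacher_proj.detach(), dim=-1)
            return (2 - 2 * (p * z).sum(dim=-1)).mean()

        @torch.no_grad()
        def ema_update(teacher, student, momentum=0.996):
            for t, s in zip(teacher.parameters(), student.parameters()):
                t.mul_(momentum).add_(s, alpha=1 - momentum)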
  • Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer

    This paper introduces tools for training monocular depth estimation models with multiple datasets despite incompatible annotations. Their approach uses a robust training objective, multi-objective learning for data integration, and pretraining encoders on auxiliary tasks. Tested across five datasets, including 3D films as a novel data source, their methods achieve superior zero-shot cross-dataset generalization, outperforming existing benchmarks through principled multi-dataset training.
  • Are Labels Necessary for Neural Architecture Search?

    The paper introduces Unsupervised Neural Architecture Search (UnNAS), questioning the necessity of human-annotated labels for discovering effective neural architectures. Two experimental setups, sample-based and search-based, were investigated. The sample-based approach evaluated 500 diverse architectures trained with both supervised and unsupervised objectives, revealing a high correlation between the architecture rankings with and without labels. The search-based experiments employed DARTS, a recognized NAS algorithm, with various unsupervised objectives. The architectures identified without labels performed competitively compared to those found with labels. The findings suggested that image statistics alone could be sufficient for identifying efficient neural architectures, potentially eliminating the need for human-annotated labels.
  • Fourier neural networks as function approximators and differential equation solvers

    This paper introduces a Fourier neural network (FNN) that aligns with Fourier decomposition by utilizing specific activation and loss functions to accurately mimic Fourier series expansion within a simple, single-layer architecture. This design allows for easy integration with more complex networks for data processing tasks. The authors demonstrate the FNN's efficacy on both smooth and piecewise continuous periodic functions and its application in modeling or solving partial differential equations with periodic boundary conditions. Key benefits of the FNN include solution validity beyond the training scope, model interpretability, and ease of use.
  • A Simple Framework for Contrastive Learning of Visual Representations

    This paper introduces SimCLR, a method for contrastive learning of visual representations. The authors find that the composition of data augmentations, a learnable nonlinear projection head between the representation and the contrastive loss, and larger batch sizes with more training steps all improve representation quality. A linear classifier using SimCLR's self-supervised representations reaches 76.5% top-1 accuracy, a 7% improvement over the prior best and equal to supervised ResNet-50. With only 1% of labels, it achieves 85.8% top-5 accuracy, outdoing AlexNet with 100 times fewer labels.
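    A minimal sketch of the NT-Xent contrastive loss over 2N embeddings from N augmented pairs; the temperature is an illustrative choice:

        import torch
        import torch.nn.functional as F

        def nt_xent(z1, z2, temperature=0.5):
            # z1, z2: [N, D] embeddings of two augmented views of the same N images.
            z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)   # [2N, D]
            sim = z @ z.t() / temperature                         # [2N, 2N]
            n = z1.shape[0]
            sim.fill_diagonal_(float('-inf'))                     # exclude self-similarity
            # The positive for sample i is its other augmented view.
            targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
            return F.cross_entropy(sim, targets)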
  • On Layer Normalization in the Transformer Architecture

    The Transformer is widely used in natural language processing. Training it typically requires a carefully designed learning rate warm-up stage, crucial for performance but slowing optimization and requiring more hyperparameter tuning. We provide theoretical insights into the necessity of the warm-up stage, demonstrating that the location of layer normalization affects gradient behavior. For Post-LN Transformers, with layer normalization between residual blocks, large gradients near the output layer at initialization make training unstable without warm-up. Conversely, for Pre-LN Transformers, with layer normalization inside residual blocks, gradients are well-behaved, suggesting the removal of the warm-up stage. Experimental results show that Pre-LN Transformers without warm-up achieve comparable performance with less training time and hyperparameter tuning.
  • On Layer Normalization in the Transformer Architecture

    This paper investigates the importance of the warm-up stage in training transformers, and the impact of layer normalization placement. Using mean field theory, the authors demonstrate that the original Post-LN Transformer's design results in large initial gradients, necessitating a warm-up stage for stability. Conversely, placing layer normalization inside residual blocks, as seen in Pre-LN Transformers, stabilizes initial gradients, allowing the elimination of the warm-up stage. Experiments reveal that Pre-LN Transformers achieve comparable performance to traditional models with less training time and fewer hyper-parameters across various applications.
  • Implicit Neural Representations with Periodic Activation Functions

    Implicit neural representations, parameterized by continuous, differentiable neural networks, offer numerous advantages over conventional representations. However, existing architectures struggle with fine detail and fail to capture spatial and temporal derivatives, crucial for many physical signals described by partial differential equations. We introduce sinusoidal representation networks (SIRENs) utilizing periodic activation functions, ideal for representing complex signals and their derivatives. Analyzing SIREN activation statistics informs a principled initialization scheme. We demonstrate SIRENs' efficacy in representing various signals and solving boundary value problems, including Eikonal, Poisson, Helmholtz, and wave equations. Additionally, we integrate SIRENs with hypernetworks to learn priors over the SIREN function space. Visit the project website for detailed demonstrations.
  • Batch Normalization Provably Avoids Rank Collapse for Randomly Initialised Deep Networks

    We investigate the challenge of training randomly initialized deep neural networks due to spectral instabilities in products of random matrices. Batch normalization emerges as an effective solution to prevent rank collapse in both linear and ReLU networks. Leveraging tools from Markov chain theory, we establish a lower rank bound for deep linear networks. Empirical findings show that this rank robustness extends to ReLU networks. Our experiments on real-world datasets underscore the significance of rank stability in training modern deep neural architectures.
  • When Does Label Smoothing Help?

    Label smoothing, which blends hard targets with a uniform distribution across labels, enhances the generalization and learning pace of multi-class neural networks. This study suggests that label smoothing not only boosts generalization but also improves model calibration. However, it reduces the effectiveness of knowledge distillation when a teacher network employs label smoothing. Label smoothing encourages tighter clustering of same-class examples in the penultimate layer, impacting the model's ability to capture class resemblances necessary for distillation but not affecting generalization or prediction calibration.
  • Normalized Attention Without Probability Cage

    Softmax-attention architectures, especially popularized by Transformers, have seen significant advancements in various tasks. However, the geometric implications of softmax-attention remain underexplored. In this study, we demonstrate limitations arising from constraining attention weights to the probability simplex and its impact on the convex hull of value vectors. We reveal sequence length-dependent biases in Transformers towards token isolation at initialization and compare them with max- and sum-pooling, which are strong but often overlooked baselines. To address these issues, we propose a novel approach of replacing softmax with normalization in self-attention, resulting in a robust and widely applicable architecture. Our findings are supported by empirical results from over 25,000 trained models, and all results and implementations are publicly available.
  • X3D: Expanding Architectures for Efficient Video Recognition

    This paper introduces X3D, a set of efficient video networks that expand a small 2D image classification structure across space, time, width, and depth dimensions. By adopting a stepwise expansion method inspired by feature selection in machine learning, X3D optimizes accuracy and complexity by expanding one dimension at a time. It achieves superior performance with significantly fewer operations and parameters than previous models, revealing that high spatiotemporal resolution networks can be both effective and lightweight. X3D delivers competitive results on video classification and detection benchmarks with unparalleled efficiency. The code is available at the provided GitHub link.
  • Designing Network Design Spaces

    This study introduces a novel approach to network design aimed at enhancing the understanding and generalizability of network design principles. The method involves creating spaces for network design that parameterize multiple network populations. A low-dimensional, efficient design space called RegNet is developed, based on modeling network widths and depths through a quantized linear function. The analysis of the RegNet space challenges existing design practices, offering simpler and faster networks effective across various computational budgets. RegNet models surpass EfficientNet models in performance and are up to five times faster on GPUs under similar conditions.
  • When Vision Transformers Outperform Resnets Without Pre-training or Strong Data Augmentations

    This paper explores the potential of Vision Transformers (ViTs) and MLP-Mixers to replace traditional neural architectures, which rely heavily on hand-wired features, by using a general-purpose approach. Despite previous models requiring massive datasets and strong data augmentations, they still faced optimization issues, such as sensitivity to initialization and learning rates. Through examining loss geometry, the study aims to enhance data efficiency and generalization of these models. Findings reveal that the models tend to converge to extremely sharp local minima. The application of a sharpness-aware optimizer significantly boosts the accuracy and robustness of ViTs and MLP-Mixers across a range of tasks, including supervised, adversarial, contrastive, and transfer learning, achieving substantial improvements in accuracy on ImageNet with simple preprocessing techniques. The improved performance is attributed to sparser active neurons in the initial layers, allowing ViTs to surpass the performance of similarly sized ResNets on ImageNet without the need for large-scale pre-training or intensive data augmentations.
  • GLU Variants Improve Transformer

    This paper explores Gated Linear Units (GLUs), which involve the component-wise product of two linear projections, with one undergoing a sigmoid function. The authors investigate GLU variants by substituting the sigmoid with other nonlinear or linear functions within the Transformer model's feed-forward sublayers. They find that certain variants outperform the conventional ReLU or GELU activations in terms of quality.
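    A minimal sketch of a gated feed-forward sublayer in the spirit of these variants, here with a SiLU/Swish gate (roughly the SwiGLU variant); the dimensions are illustrative assumptions:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class SwiGLUFeedForward(nn.Module):
            def __init__(self, d_model=512, d_ff=1365):
                super().__init__()
                self.w_gate = nn.Linear(d_model, d_ff, bias=False)
                self.w_value = nn.Linear(d_model, d_ff, bias=False)
                self.w_out = nn.Linear(d_ff, d_model, bias=False)

            def forward(self, x):
                # Component-wise product of a nonlinear gate and a linear projection.
                return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))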
  • Neural Tangent Kernel: Convergence and Generalization in Neural Networks

    This paper demonstrates that artificial neural networks (ANNs) are equivalent to Gaussian processes at initialization in the infinite-width limit and introduces the Neural Tangent Kernel (NTK), which describes ANNs' behavior during training. The NTK stabilizes to a constant in the infinite-width limit, allowing the study of ANNs in function space. The authors prove the positive-definiteness of the limiting NTK under certain conditions and show that the network function follows a linear differential equation during training for least-squares regression. Numerical studies on the NTK in wide networks confirm these theoretical findings.
  • FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence

    This paper introduces FixMatch, a streamlined semi-supervised learning (SSL) algorithm. FixMatch generates pseudo-labels using a model's predictions on weakly-augmented unlabeled images, which are only used if they are highly confident. The model then learns from these pseudo-labels using strongly-augmented versions of the images. FixMatch demonstrates superior performance on several SSL benchmarks, achieving 94.93% accuracy on CIFAR-10 with 250 labels and 88.61% accuracy with only 40 labels. An in-depth ablation study highlights the key factors behind FixMatch's effectiveness, and the code is publicly available.
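    A minimal sketch of the unlabeled-data term only: pseudo-labels from weakly-augmented images, kept above a confidence threshold, trained against strongly-augmented views. The threshold is the commonly quoted value and the supervised term is omitted:

        import torch
        import torch.nn.functional as F

        def fixmatch_unlabeled_loss(model, weak_images, strong_images, threshold=0.95):
            with torch.no_grad():
                probs = F.softmax(model(weak_images), dim=-1)
                conf, pseudo_labels = probs.max(dim=-1)
                mask = (conf >= threshold).float()        # keep only confident predictions
            logits_strong = model(strong_images)
            per_sample = F.cross_entropy(logits_strong, pseudo_labels, reduction='none')
            return (per_sample * mask).mean()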
  • On the Relationship Between Self-Attention And Convolutional Layers

    Recent studies have shown that attention mechanisms can rival or even outperform convolutional layers in vision tasks, challenging their dominance. Ramachandran et al. (2019) demonstrated that attention could entirely replace convolution, achieving top results. This work investigates whether attention layers function like convolutional layers, finding that multi-head self-attention layers, with enough heads, match the expressiveness of convolutional layers. Numerical experiments confirm that self-attention layers learn to focus on pixel-grid patterns akin to convolutional layers, supporting our findings.
  • Transformers without Tears: Improving the Normalization of Self-Attention

    The paper proposes three normalization-centric modifications to improve Transformer training: PRENORM which introduces pre-norm residual connections and smaller initializations, enabling warmup-free, validation-based training with large learning rates; SCALENORM which suggests L2 normalization with a single scale parameter for faster training and better performance; and FIXNORM which reaffirms the efficacy of normalizing word embeddings to a fixed length.
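    A minimal sketch of SCALENORM as described: L2-normalise the activations and multiply by a single learned scale; the initial scale and epsilon are assumptions:

        import torch
        import torch.nn as nn

        class ScaleNorm(nn.Module):
            def __init__(self, init_scale=1.0, eps=1e-5):
                super().__init__()
                self.g = nn.Parameter(torch.tensor(init_scale))
                self.eps = eps

            def forward(self, x):
                # A single learned scalar times the unit-norm activation vector.
                return self.g * x / x.norm(dim=-1, keepdim=True).clamp(min=self.eps)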
  • Understanding and Improving Layer Normalization

    Layer normalization (LayerNorm) enhances gradient smoothness, accelerates training, and improves generalization accuracy. While its effectiveness has been attributed to forward normalization in previous studies, our research reveals that re-centering and re-scaling backward gradients through derivatives of mean and variance play a crucial role. Moreover, we find that parameters like bias and gain in LayerNorm exacerbate overfitting and are often ineffective. Experiments demonstrate that a simplified version of LayerNorm (LayerNorm-simple) without bias and gain outperforms traditional LayerNorm on multiple datasets, achieving state-of-the-art results in En-Vi machine translation. To mitigate overfitting, we introduce Adaptive Normalization (AdaNorm), which replaces bias and gain with a new transformation function. Experimental results indicate that AdaNorm outperforms LayerNorm on the majority of datasets, suggesting its efficacy in addressing overfitting concerns.
  • Online Normalization for Training Neural Networks

    This paper introduces Online Normalization, a novel method for normalizing neural network hidden activations that offers a batch-independent alternative with comparable accuracy to Batch Normalization. Online Normalization addresses Batch Normalization's theoretical flaw by employing an unbiased method for gradient normalization of activations, integrating seamlessly with automatic differentiation. The method is applicable to recurrent, fully connected networks, and those with high activation memory requirements. The authors demonstrate its effectiveness in image classification, segmentation, and language modeling, supported by proofs and experimental data from ImageNet, CIFAR, and PTB datasets.
  • RandAugment: Practical automated data augmentation with a reduced search space

    Data augmentation improves generalization and robustness against image corruptions. However, complex augmentation pipelines typically involve tuning an enormous number of hyperparameters. RandAugment addresses this by reducing the search space to two interpretable hyperparameters: the number of transformations applied and a shared global magnitude. This interpretable parameterisation also facilitates exploring data augmentation's impact across different models and datasets.
  • SlowFast Networks for Video Recognition

    This paper introduces SlowFast networks, a new convolutional architecture for video recognition tasks. These networks have two pathways: a Slow pathway operating at low frame rates to capture spatial semantics, and a Fast pathway working at high frame rates with reduced channel capacity to efficiently capture motion details. The SlowFast models achieve top accuracy on major video benchmarks like Kinetics, Charades, and AVA, significantly improving performance for action classification and detection tasks.
  • Root Mean Square Layer Normalization

    Layer normalization (LayerNorm) enhances deep neural network stability and convergence by re-centering and re-scaling inputs and weight matrices. However, its computational overhead slows networks, particularly RNNs. We introduce RMSNorm, which replaces re-centering with root mean square (RMS) regularization. RMSNorm maintains re-scaling invariance and adapts learning rates implicitly, while being computationally simpler than LayerNorm. We also propose partial RMSNorm (pRMSNorm), estimating RMS from a subset of inputs. Empirical results across various tasks and architectures demonstrate that RMSNorm achieves comparable performance to LayerNorm while reducing running time by 7%∼64%.
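    A minimal sketch of RMSNorm with a learned per-feature gain; the epsilon is an illustrative choice:

        import torch
        import torch.nn as nn

        class RMSNorm(nn.Module):
            def __init__(self, dim, eps=1e-8):
                super().__init__()
                self.gain = nn.Parameter(torch.ones(dim))
                self.eps = eps

            def forward(self, x):
                # Re-scale by the root mean square only; no mean subtraction, no bias.
                rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
                return self.gain * x / rms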
  • Weight Agnostic Neural Networks

    This study investigates the importance of neural network architectures versus weight parameters for task performance. We introduce a method to find architectures capable of performing tasks without weight training. By assigning random weights, we show that minimal architectures can achieve notable performance on various tasks, including reinforcement learning and MNIST classification.
  • DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks

    This paper introduces DropAttention, a novel dropout method for fully-connected self-attention layers in Transformers, aiming to prevent overfitting by regularizing attention weights. DropAttention uses a mask to zero out elements in the attention matrix. Experiments across various tasks demonstrate that DropAttention not only enhances performance but also mitigates overfitting, providing a significant advancement in the regularization of Transformers.
  • Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

    This paper introduces benchmarks for evaluating the robustness of image classifiers, focusing on common corruptions and perturbations rather than adversarial ones. The benchmarks include IMAGENET-C, which assesses corruption robustness, and IMAGENET-P, which assesses perturbation robustness.
  • How Does Batch Normalization Help Optimization?

    The original Batch Normalization paper argued the method's effectiveness came from reducing internal covariate shift. In this paper the authors argue that the main benefit of Batch Norm is instead loss-landscape smoothing.
  • The Bitter Lesson

    Rich Sutton's bitter lesson: methods and techniques that leverage computation over human knowledge are, in the long term, always proven to be better. This is primarily due to Moore's law. Over the 3.5 years of a PhD, available compute will more than double, making methods that were infeasible at the start of the program entirely feasible by the end of it.
  • ADAMW: Decoupled Weight Decay Regularization

    This paper shows that L2 regularization and weight decay are equivalent for standard SGD but not for adaptive methods like ADAM. To account for this, the authors decouple weight decay from the gradient-based loss optimization step. This change allows for independent optimization of the weight decay factor and learning rate.
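    A minimal sketch of the decoupling idea, using a plain SGD step for brevity (AdamW applies the same idea to Adam's update): weight decay acts directly on the weights rather than being folded into the gradient:

        import torch

        @torch.no_grad()
        def decoupled_sgd_step(params, lr=0.1, weight_decay=1e-4):
            for p in params:
                if p.grad is None:
                    continue
                p.mul_(1 - lr * weight_decay)      # weight decay applied directly to the weights
                p.add_(p.grad, alpha=-lr)          # gradient step from the loss only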
  • Visualizing the Loss Landscape of Neural Nets

    This paper delves into the complexities of neural network training, focusing on the quest for effective minimizers of non-convex loss functions. It investigates the impact of network architecture and training parameters on the loss landscape and generalization capabilities. Introducing a filter normalization technique for visualizing loss function curvature, the study explores the influence of architecture and parameters on the shape of minimizers.
  • Group Normalization

    Batch Normalization (BN) helps train various deep learning networks but struggles with small batch sizes due to inaccurate statistics estimation, limiting its use in memory-constrained tasks. This paper introduces Group Normalization (GN) as an alternative that divides channels into groups for normalization, independent of batch sizes, offering stable accuracy across various batch sizes. GN shows lower error rates compared to BN in small batches and comparable performance in typical batch sizes.
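    A minimal sketch of the group-wise statistics, omitting the learned per-channel scale and shift; the group count and epsilon are illustrative:

        import torch

        def group_norm(x, num_groups=32, eps=1e-5):
            # x: [N, C, H, W]; statistics are computed per sample over channel groups,
            # so the result is independent of the batch size.
            n, c, h, w = x.shape
            g = x.reshape(n, num_groups, c // num_groups, h, w)
            mean = g.mean(dim=(2, 3, 4), keepdim=True)
            var = g.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
            return ((g - mean) / (var + eps).sqrt()).reshape(n, c, h, w)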
  • AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

    This paper introduces the AVA dataset, which features 1.58M labels for 80 atomic visual actions across 430 15-minute video clips, with precise spatio-temporal and person-specific annotations. Unlike previous datasets, AVA emphasizes atomic actions, detailed annotations throughout longer videos, continuity of persons across clips, and varied action representations from movies. The authors highlight the challenges in action recognition and introduce a novel localization approach that surpasses existing benchmarks but shows modest performance on AVA (15.6% mAP), indicating the need for advanced video understanding methods.
  • Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization

    The paper introduces a system that trains deep neural networks for object detection using only synthetic images generated with domain randomization. This involves randomizing non-realistic simulator parameters like lighting and object textures, allowing the network to identify essential object features. Their findings show that networks can achieve impressive performance using just synthetic data, and further improve with fine-tuning on real data. This suggests the potential of using cost-effective synthetic data for training, avoiding the need to acquire vast amounts of real-world data or create detailed synthetic environments. They validate their method on car bounding box detection using the KITTI dataset.
  • Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

    This paper introduces the Kinetics Human Action Video dataset. The authors note that the limited number of videos in existing datasets like UCF-101 and HMDB-51 hinders the ability to effectively evaluate architectures due to similar performance across these small benchmarks. Kinetics contains 400 classes with over 400 video clips per class from challenging YouTube videos. The paper analyzes the impact of this larger dataset on the performance of existing architectures as well as the benefits of pre-training models on Kinetics. The authors introduce the Two-Stream Inflated 3D ConvNet (I3D), an extension of 2D ConvNets to 3D for improved video feature extraction. When pre-trained on Kinetics, this I3D model sets new benchmarks for action classification, achieving 80.9% accuracy on HMDB-51 and 98.0% on UCF-101.
  • Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

    This study examines why deep learning models, despite having more parameters than training samples, generalize well. It highlights the importance of the loss function's landscape, showing that areas leading to better generalization (good minima) are more prevalent than those leading to poor outcomes. This predominance facilitates the convergence of optimization methods to these beneficial minima. The research provides theoretical backing by analyzing 2-layer neural networks, noting that solutions with better generalization have a smaller Hessian matrix norm. For deeper networks, extensive numerical data supports these conclusions.
  • Instance Normalization: The Missing Ingredient for Fast Stylization

    This paper introduces Instance Normalization, an online normalization technique that calculates the mean and std over the spatial dimensions of each individual instance. This is shown to work well in situations where there is very high variation within batches, such as in style transfer.
  • Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning

    This study introduces two new activation functions, SiLU and dSiLU. It challenges the need for experience replay and separate target networks in deep reinforcement learning. By using on-policy learning with eligibility traces and softmax action selection, it achieves state-of-the-art results in stochastic SZ-Tetris and small-board Tetris with TD(λ) learning and shallow dSiLU agents. It also outperforms DQN in Atari 2600 with a deep Sarsa(λ) agent using SiLU/dSiLU, suggesting an alternative to traditional DQN approaches.
  • Don't Decay the Learning Rate, Increase the Batch Size

    This study demonstrates that increasing the batch size during training achieves similar learning outcomes as the common practice of decaying the learning rate, applicable to stochastic gradient descent and its variants, including with momentum and Adam optimization. This approach not only matches test accuracies within the same number of epochs but also enhances parallelism and reduces training time due to fewer parameter updates. Efficiency can be further improved by adjusting the learning rate and batch size proportionally, and although increasing the momentum coefficient and scaling the batch size accordingly may slightly lower test accuracy, it enables the use of large batch training without needing to tune hyper-parameters. Using these methods, ResNet-50 was trained on ImageNet to a 76.1% validation accuracy in less than 30 minutes.
  • Searching for Activation Functions

    This study explores the impact of activation functions on deep network training and performance. While Rectified Linear Unit (ReLU) is widely used, alternatives have not consistently outperformed it. We propose using automatic search techniques to discover new activation functions. Through exhaustive and reinforcement learning-based searches, we identify novel functions. Empirical evaluation shows that our best discovered function, Swish (f(x) = x · sigmoid(βx)), performs better than ReLU on deeper models across challenging datasets. Replacing ReLUs with Swish units improves classification accuracy on ImageNet, for example, by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. Swish's simplicity and similarity to ReLU facilitate its adoption in neural networks.
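    A minimal sketch of the discovered activation:

        import torch

        def swish(x, beta=1.0):
            # f(x) = x * sigmoid(beta * x); beta = 1 recovers SiLU.
            return x * torch.sigmoid(beta * x)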
  • Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks

    In traditional multi-layer neural networks, the dot product between the output and weight vectors of preceding layers serves as input to the activation function, resulting in unbounded outputs and increased variance. This variance can lead to poor generalization and hinder training by exacerbating internal covariate shift. To address this, we propose cosine normalization, which replaces the dot product with cosine similarity or centered cosine similarity (Pearson Correlation Coefficient). We evaluate cosine normalization against batch, weight, and layer normalization in fully-connected and convolutional neural networks across various datasets. Our experiments demonstrate that cosine normalization outperforms other normalization techniques.
  • Large Batch Training of Convolutional Networks

    To expedite the training of large convolutional networks, computational units are added and trained with data-parallel synchronous Stochastic Gradient Descent (SGD) across units, increasing batch size with node count. However, larger batch sizes can reduce model accuracy. The existing method of large batch training—linear learning rate scaling with warm-up—is not universally effective and may cause training divergence. To address these challenges, we introduce a novel training algorithm, Layer-wise Adaptive Rate Scaling (LARS), enabling us to train Alexnet with a batch size of 8K and Resnet-50 with a batch size of 32K, without compromising accuracy.
  • Language Modeling with Gated Convolutional Networks

    This paper introduces a novel language modeling approach using stacked convolutions that enable efficient parallel processing, in contrast to the traditionally used recurrent neural networks known for handling unbounded context. The proposed method features a simplified gating mechanism that outperforms previous models and demonstrates superior performance on the WikiText-103 benchmark, showing its capability to manage long-term dependencies. It also delivers competitive results on the Google Billion Words benchmark and significantly reduces sentence scoring latency compared to recurrent models. This marks the first instance of a non-recurrent model achieving comparable success to strong recurrent models on major language tasks.
  • A Downsampled Variant of ImageNet as an Alternative to the CIFAR Datasets

    This paper proposes downsampled versions of ImageNet: ImageNet64x64, ImageNet32x32, and ImageNet16x16, which maintain the same number of classes and images but with reduced resolution. This approach significantly speeds up experiments while preserving similar optimal hyperparameters characteristics.
  • Colorization as a Proxy Task for Visual Understanding

    This study explores self-supervision through automatic colorization as an alternative to ImageNet pretraining. Self-supervised training achieved state-of-the-art results on VOC segmentation and classification tasks without relying on ImageNet labels. This paper highlights the significance of loss formulation, training specifics, and network architecture when pretraining through colorization. It also revisits and questions the ImageNet pretraining approach, including the necessity of training data volume, label quantity, and feature adaptability upon fine-tuning. The findings suggest that colorization provides a strong supervisory signal, comparable to various forms of ImageNet pretraining.
  • Deformable Convolutional Networks

    This work introduces deformable convolution and deformable RoI pooling modules to improve the geometric transformation capability of CNNs by augmenting spatial sampling with learned offsets, without extra supervision. These modules can replace standard ones in CNNs for end-to-end training. The approach, validated by extensive experiments, effectively learns dense spatial transformations for complex vision tasks like object detection and semantic segmentation.
  • SGDR: Stochastic Gradient Descent With Warm Restarts

    This paper introduces a warm restart technique for stochastic gradient descent aimed at enhancing anytime performance in deep neural network training. It showcases empirical performance improvements on CIFAR-10 and CIFAR-100 datasets, achieving state-of-the-art results with 3.14% and 16.21% error rates, respectively. Additionally, the technique's benefits are demonstrated on an EEG dataset and a downsampled ImageNet dataset.
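    A minimal sketch of a cosine schedule with warm restarts; the period, period multiplier, and learning-rate bounds are illustrative assumptions:

        import math

        def sgdr_lr(step, lr_min=1e-5, lr_max=0.1, period=1000, mult=2):
            # Cosine decay from lr_max to lr_min, restarting with a longer period each cycle.
            while step >= period:
                step -= period
                period *= mult
            return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / period))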
  • Understanding Deep Learning Requires Re-Thinking Generalization

    Extensive experiments reveal that traditional explanations for the strong generalization performance of large neural networks fall short. Despite conventional wisdom attributing success to model properties or regularization techniques, our findings show that state-of-the-art convolutional networks for image classification can easily overfit random labels. This phenomenon persists regardless of explicit regularization or substituting true images with random noise. Theoretical analysis suggests that shallow neural networks achieve perfect finite sample expressivity when parameters outnumber data points, as commonly observed. We interpret these results in contrast to traditional models.
  • ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION

    ADAM is an optimization algorithm that improves stochastic gradient descent with adaptive estimates of the gradient's first and second moments. ADAM is easy to implement, requires minimal memory, and adapts well to non-stationary objectives and noisy or sparse gradients. Adam's hyper-parameters are intuitive and usually need little adjustment. Additionally, the variant AdaMax, based on the infinity norm, is introduced.
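    A minimal sketch of the Adam update for a single parameter tensor, using the commonly quoted default hyper-parameters:

        import torch

        @torch.no_grad()
        def adam_step(p, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
            # m, v: running first and second moment estimates of the gradient; t: step count.
            g = p.grad
            m.mul_(beta1).add_(g, alpha=1 - beta1)
            v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
            m_hat = m / (1 - beta1 ** t)           # bias correction
            v_hat = v / (1 - beta2 ** t)
            p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)
            return m, v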
  • Taming the Waves: Sine as Activation Function in Deep Neural Networks

    This paper explores the challenges in training deep neural networks that use sinusoidal activation functions. The authors explain that the difficulty arises from the emergence of numerous shallow local minima in the loss landscape. The study reveals that successful learning in typical classification tasks occurs when the network effectively ignores the periodic cycles of the sinusoidal functions. However, for certain non-trivial tasks, networks with sinusoidal activations can outperform those using traditional monotonic activation functions.
  • Colorful Image Colorization

    This paper presents a novel approach to automatically colorize grayscale photos with vibrant and realistic results. The method frames colorization as a classification task and employs class rebalancing during training to enhance color diversity. A convolutional neural network (CNN) was trained on over a million color images, operating as a feed-forward pass at test time. The algorithm's effectiveness was validated through a colorization Turing test, where it deceived participants in 32% of trials. The paper also illustrates how colorization can serve as an effective pretraining task for self-supervised feature learning, achieving state-of-the-art results on several benchmarks.
  • Layer Normalization

    Training deep neural networks is computationally intensive. Normalizing neuron activations can speed up training, with batch normalization being a popular method that uses mini-batch data to normalize neuron inputs, reducing training time for feed-forward networks. However, its effectiveness varies with mini-batch size and adapting it to recurrent neural networks (RNNs) is challenging. This paper introduces layer normalization as an alternative, normalizing inputs across a single training case's entire layer and maintaining consistent computations across training and testing phases.
  • Deep neural networks are robust to weight binarization and other non-linear distortions

    Recent studies reveal that deep neural networks maintain high performance levels even when trained with binary quantized weights. This paper shows deep networks have significant resilience to various test-time distortions, including noise and non-linear projections, with robustness extending beyond binary quantization. The authors propose a stochastic projection rule that sets a new state of the art on CIFAR-10 without data augmentation.
  • Dropout as Data Augmentation

    Dropout is typically interpreted as turning the test time network into an ensemble of the thinner training networks. This paper argues that dropout can also be interpreted as a kind of data augmentation in image space. The authors present an approach to project the dropout noise within a network back into the input space, visualising the augmented versions of the training data, and show that training a deterministic network on the augmented samples yields similar results. They then propose a new dropout noise scheme and show that it improves dropout results without adding significant computational cost.
  • Deep Residual Learning for Image Recognition

    We introduce a residual learning framework to train deeper neural networks effectively. By reformulating layers to learn residual functions relative to inputs, our approach facilitates optimization and achieves higher accuracy with increased depth. Evaluations on ImageNet demonstrate the effectiveness of residual networks up to 152 layers deep, outperforming previous architectures. Our method achieves 3.57% error on the ImageNet test set, winning 1st place in the ILSVRC 2015 classification task. Additionally, our deep representations lead to a 28% improvement on the COCO object detection dataset. In the ILSVRC & COCO 2015 competitions, our approach secured 1st place in ImageNet detection, localization, COCO detection, and segmentation tasks.
  • Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    This paper introduces Batch Normalization, an online normalization technique which calculates mean and std across the batch dimension. Batch Norm improves training efficiency, allows higher learning rates, decreases hyper-parameter sensitivity, and sometimes removes the need for Dropout regularization. The paper attributes these gains to reduced internal covariate shift, though later works have questioned this explanation.
  • Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

    This paper introduces the Kaiming He initialisation strategy for ReLU-like activation functions. This is the default initialisation in PyTorch, improving model stability and decreasing training time for ReLU networks. This study also introduces the Parametric Rectified Linear Unit (PReLU), an activation function that extends traditional rectified units, offering improved model fitting with negligible additional computational cost.
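    A minimal sketch of the initialisation for a ReLU layer, shown both explicitly and via the built-in PyTorch call:

        import math
        import torch
        import torch.nn as nn

        layer = nn.Linear(512, 256)
        # Explicit form: weights ~ N(0, 2 / fan_in), which preserves activation variance through ReLU.
        fan_in = layer.weight.shape[1]
        with torch.no_grad():
            layer.weight.normal_(0.0, math.sqrt(2.0 / fan_in))
        # Equivalent built-in:
        nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')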
  • Self-Paced Curriculum Learning

    This paper introduces self-paced curriculum learning (SPCL), a unified framework combining curriculum learning (CL) and self-paced learning (SPL). SPCL leverages prior knowledge and ongoing learning progress through an optimization problem. It mimics collaborative instructor-student learning, exhibiting empirical advantages in two tasks.
  • Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

    This study aims to enhance our comprehension of deep learning dynamics by analyzing deep linear neural networks. Despite their linear input-output mapping, these networks display nonlinear gradient descent dynamics, resulting in phenomena like prolonged plateaus and swift transitions to improved solutions. Through analytical exploration, we reveal that as network depth increases indefinitely, learning speed can remain finite under specific initial conditions. Unsupervised pretraining, under certain data conditions, can discover these favorable initial conditions, unlike random Gaussian initializations. We introduce a novel class of random orthogonal initial conditions that, akin to unsupervised pretraining, enable depth-independent learning. These conditions also ensure gradient propagation in deep nonlinear networks, particularly when operating at the edge of chaos.
  • Dropout: A Simple Way to Prevent Neural Networks from Overfitting

    This paper introduces Dropout, a technique that addresses overfitting by randomly omitting units during training, leading to an exponential number of thinned networks. This effectively turns the test time network into an ensemble.
  • SVHN: Reading Digits in Natural Images With Unsupervised Feature Learning

    While character recognition in documents is largely solved, recognizing characters in complex scenes, such as photographs, is much harder. This paper tackles digit recognition from street level photos, introducing a new dataset of over 600,000 labeled digits from Street View House Numbers.
  • Deep learning via Hessian-free optimization

    We present a novel 2nd-order optimization method inspired by the Hessian-free approach, applied to deep auto-encoders. Achieving superior results without pre-training compared to Hinton & Salakhutdinov (2006), our method is practical, user-friendly, scalable to large datasets, and versatile across various model classes. Additionally, we address the challenge of pathological curvature in deep learning, highlighting how our method effectively mitigates this issue.
  • Understanding the difficulty of training deep feedforward neural networks

    This paper introduces the Xavier (Glorot) initialisation for sigmoid-like activation functions. The study investigates the impact of non-linear activation functions, finding the logistic sigmoid unsuitable for deep networks due to saturation issues in the top hidden layers. It observes that saturated units can desaturate over time, which explains training plateaus, and suggests that a non-linearity that saturates less is beneficial.
  • Simplifying Neural Nets by Discovering Flat Minima

    This paper introduces an algorithm that identifies simple and highly generalizable neural networks by searching for extensive flat minima regions in the error function, where the error rate is relatively stable. Such flat minima are associated with lower overfitting risks based on minimum description length (MDL) principles. Despite requiring second-order derivative calculations, the algorithm has a complexity level comparable to backpropagation. When tested on feedforward and recurrent networks, as well as stock market prediction tasks, this algorithm outperformed traditional backpropagation, weight decay, and the optimal brain surgeon methods.
  • Universal and Transferable Adversarial Attacks on Aligned Language Models

    This paper introduces a novel and effective method for prompting aligned large language models (LLMs) to generate objectionable content by appending specific adversarial suffixes to queries. The approach is shown to be highly transferable, even to black-box, publicly released LLMs such as ChatGPT, Bard, and Claude, as well as open-source models such as LLaMA-2-Chat, Pythia, and Falcon, with a particularly high success rate against GPT-based models. This advancement in adversarial attacks against LLMs highlights critical security concerns, urging the need for robust defenses against the generation of objectionable content. The research, along with the code, is shared for further exploration and mitigation efforts.
  • Implicit Neural Representations with Periodic Activation Functions

    This paper introduces sinusoidal representation networks (SIRENs), which utilize periodic activation functions like Sin to effectively capture complex natural signals and their derivatives, addressing the limitations of neural networks parameterized for continuous, differentiable signal representations. The authors' analysis leads to a principled initialization strategy, enabling the representation of images, wavefields, video, sound, and derivatives. SIRENs are also applied in solving boundary value problems like Eikonal equations, the Poisson equation, and the Helmholtz and wave equations. The authors extend SIRENs' use with hypernetworks to learn priors for SIREN functions.
  • Imagenet-trained Cnns Are Biased Towards Texture; Increasing Shape Bias Improves Accuracy and Robustness

    This study challenges the traditional understanding of how Convolutional Neural Networks (CNNs) recognize objects, revealing that they are more biased towards recognizing textures rather than learning complex shapes. The authors found that CNNs trained on ImageNet favor texture over shape, differing significantly from human visual processing. However, training a ResNet-50 on a stylized version of ImageNet, designed to emphasize shape, aligned the network's performance more closely with human behavior. This shape-based training matched human performance in controlled experiments and enhanced object detection and robustness against image distortions, highlighting the benefits of shape-based representations in visual recognition systems.
  • Gaussian Error Linear Units (Gelus)

    This paper introduces the Gaussian Error Linear Unit (GELU), a neural network activation function that outperforms existing functions by weighting inputs by their magnitude using the standard Gaussian cumulative distribution function, unlike ReLU which gates inputs by sign. The authors' empirical evaluation across computer vision, natural language processing, and speech tasks demonstrates that GELU offers performance improvements over ReLU and ELU activations.
  • On Separate Normalization in Self-supervised Transformers

    This paper introduces a novel normalization technique for self-supervised transformer training that normalizes the [CLS] token and regular tokens separately. The authors find that using separate normalization for [CLS] embeddings results in more effective global context encoding and a more uniform distribution in anisotropic space, leading to a 2.7% average performance boost in image, natural language, and graph learning tasks compared to previous models like masked autoencoders (MAE) that use a single normalization layer for all tokens.