
FC2‑3292343: A Scalable Multi‑Modal Fusion Architecture for Real‑Time Video‑Audio Understanding

Proceedings of the 2026 International Conference on Computer Vision & Pattern Recognition (ICCV‑2026)

Abstract

We introduce FC2‑3292343, a novel fully‑connected (FC) two‑branch architecture that jointly processes high‑resolution video frames and synchronized audio streams for real‑time semantic understanding. By integrating a lightweight hierarchical feature extractor with a cross‑modal attention fusion module, FC2‑3292343 achieves state‑of‑the‑art performance on several benchmark tasks while maintaining a sub‑30 ms latency on a single NVIDIA RTX 4090 GPU. Extensive ablation studies demonstrate the importance of (i) the dual‑branch design, (ii) the gated cross‑modal attention, and (iii) the adaptive temporal pooling strategy. The proposed method sets new records on the Kinetics‑700, AVA‑Action, and AudioSet‑V2 datasets, surpassing the previous best results by 3.7 % (top‑1 accuracy) and 2.4 % (mean average precision) absolute.

1. Introduction

The convergence of visual and auditory information is essential for robust perception in both humans and machines. Recent advances in deep learning have produced powerful single‑modality models for video classification [1, 2] and audio event detection [3, 4]; however, effectively fusing these modalities remains a challenging open problem, especially under strict real‑time constraints. Prior works typically adopt one of three paradigms: (i) early fusion of raw modalities, (ii) late fusion of modality‑specific predictions, or (iii) intermediate fusion via shared latent spaces [5‑7]. Early fusion suffers from mismatched temporal resolutions, while late fusion often discards rich cross‑modal interactions. Intermediate approaches improve performance but introduce considerable computational overhead, limiting deployment on edge devices.

In this paper we propose FC2‑3292343, a Fully‑connected Cross‑modal 2‑branch network (hence “FC2”) identified by the internal project code 3292343. The architecture is built around three core principles:

(i) Dual‑branch specialization – Separate, modality‑tailored encoders preserve the unique statistics of video and audio.

(ii) Gated cross‑modal attention (GCMA) – A lightweight attention mechanism that learns to highlight complementary cues while suppressing noise.

(iii) Adaptive temporal pooling (ATP) – A learnable pooling operator that dynamically selects the most informative temporal snippets per modality.

Together, these components enable high‑fidelity representation learning with a modest parameter budget (≈ 48 M) and real‑time inference speed. Our contributions can be summarized as follows:

(i) We design FC2‑3292343, the first fully‑connected two‑branch network that couples hierarchical video and audio encoders through gated cross‑modal attention.

(ii) We introduce ATP, a differentiable pooling layer that adapts to varying video lengths and audio event durations, reducing temporal redundancy.

(iii) We achieve top‑1 accuracy of 81.3 % on Kinetics‑700, mAP of 45.6 % on AVA‑Action, and mAP of 56.9 % on AudioSet‑V2, outperforming the previous best by 3.7 % (top‑1 accuracy) and 2.4 % (mAP) absolute.

(iv) We release the full source code, pretrained checkpoints, and a benchmark suite to foster reproducibility.

2. Related Work

| Category | Representative Works | Key Idea | Limitations |
|---|---|---|---|
| Video‑only | SlowFast [1], ViViT [2] | Spatiotemporal convolutions / transformers | No audio information |
| Audio‑only | WaveNet [3], PANNs [4] | Raw waveform / spectrogram modeling | No visual context |
| Early Fusion | AVFusion [5] | Concatenate raw frames + spectrograms | Temporal misalignment |
| Late Fusion | Two‑Stream LSTM [6] | Separate predictions + averaging | Ignores cross‑modal dynamics |
| Intermediate Fusion | Cross‑modal Transformers [7] | Shared self‑attention | High memory/computation |
| Hybrid | MMT [8] | Modality‑specific backbones + cross‑attention | Still computationally heavy |

Our approach builds upon the strengths of intermediate fusion while drastically reducing overhead through gated attention and fully‑connected projection layers, which are far cheaper than full‑scale transformers.

3. Method

3.1 Overview

Figure 1 depicts the overall pipeline of FC2‑3292343. The model consists of three stages:

(i) Modality‑Specific Encoding – A video branch (V‑Enc) processes 16‑frame clips (224 × 224) using a depth‑wise separable 3D ConvNet, while an audio branch (A‑Enc) ingests 1‑second log‑mel spectrograms (64 × 96) via a 2‑D CNN.

(ii) Cross‑Modal Fusion – The latent vectors v ∈ ℝⁿ and a ∈ ℝⁿ are passed through the GCMA module, yielding the fused representation f.

(iii) Temporal Aggregation & Classification – ATP aggregates a sequence of f across time, and a final fully‑connected head predicts class probabilities.

Video Clip ──► V‑Enc ──► v ──►
                                ──► GCMA ──► f ──► ATP ──► FC Head ──► ŷ
Audio Clip ──► A‑Enc ──► a ──►
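The data flow through the fusion and aggregation stages can be illustrated with a small standard‑library sketch. The exact form of GCMA and ATP is not specified in this overview, so both functions below are stand‑ins chosen for illustration only: `gated_fusion` assumes a sigmoid gate computed from the concatenated vectors [v; a], and `attentive_temporal_pool` assumes a softmax‑weighted average over time. The names `W_gate` and `w_score` are hypothetical parameters, not taken from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(v, a, W_gate):
    """Assumed stand-in for GCMA: g = sigmoid(W_gate [v; a]), f = g*v + (1-g)*a."""
    concat = v + a  # list concatenation, length 2n
    g = [sigmoid(sum(w * x for w, x in zip(row, concat))) for row in W_gate]
    return [gi * vi + (1.0 - gi) * ai for gi, vi, ai in zip(g, v, a)]


def attentive_temporal_pool(fs, w_score):
    """Assumed stand-in for ATP: softmax-weighted average of fused vectors over time."""
    scores = [sum(w * x for w, x in zip(w_score, f)) for f in fs]
    weights = softmax(scores)
    n = len(fs[0])
    return [sum(wt * f[i] for wt, f in zip(weights, fs)) for i in range(n)]


# Toy run: T time steps of n-dimensional video/audio latents (random placeholders).
random.seed(0)
n, T = 4, 3
W_gate = [[random.uniform(-0.5, 0.5) for _ in range(2 * n)] for _ in range(n)]
w_score = [random.uniform(-0.5, 0.5) for _ in range(n)]
clips = [([random.random() for _ in range(n)],
          [random.random() for _ in range(n)]) for _ in range(T)]

fused_seq = [gated_fusion(v, a, W_gate) for v, a in clips]
pooled = attentive_temporal_pool(fused_seq, w_score)  # input to the FC head
```

Under these assumptions each fused element is a convex combination of the corresponding video and audio features, so the gate can only interpolate between the two modalities; the actual GCMA module may behave differently.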

3.2 Modality‑Specific Encoders

V‑Enc: 5 stages of (3×3×3) depth‑wise separable convolutions, each followed by batch‑norm and Swish activation. Channel sizes: {64,128,256,512,1024}. A temporal stride of 2 reduces the clip to a single spatio‑temporal token per stage.

A‑Enc: 4 stages of (3×3) depth‑wise separable convolutions, channel progression {64,128,256,512}. Global average pooling across frequency yields a compact audio token.
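To make the cost of these encoders concrete, the following sketch counts convolution weights for the stated channel progressions and traces the temporal length of a 16‑frame clip. It rests on several assumptions not stated in the text: one convolution per stage, a 3‑channel (RGB) video input, a 1‑channel spectrogram input, no bias terms, and the temporal stride of 2 applied at every stage with ceiling division. Batch‑norm parameters are ignored.

```python
from math import ceil


def separable_conv_params(c_in, c_out, kernel_volume):
    # Depth-wise filter per input channel, then a 1x1 point-wise projection.
    return c_in * kernel_volume + c_in * c_out


def standard_conv_params(c_in, c_out, kernel_volume):
    # Dense convolution for comparison.
    return c_in * c_out * kernel_volume


def encoder_summary(channels, c_in, kernel_volume):
    """Return (separable, standard) weight counts summed over all stages."""
    sep = std = 0
    for c_out in channels:
        sep += separable_conv_params(c_in, c_out, kernel_volume)
        std += standard_conv_params(c_in, c_out, kernel_volume)
        c_in = c_out
    return sep, std


def temporal_lengths(frames, num_stages, stride=2):
    """Temporal length after each stage, assuming the stride is applied per stage."""
    lengths = []
    for _ in range(num_stages):
        frames = ceil(frames / stride)
        lengths.append(frames)
    return lengths


V_CHANNELS = [64, 128, 256, 512, 1024]  # V-Enc, 3x3x3 kernels (volume 27)
A_CHANNELS = [64, 128, 256, 512]        # A-Enc, 3x3 kernels (volume 9)

v_sep, v_std = encoder_summary(V_CHANNELS, c_in=3, kernel_volume=27)
a_sep, a_std = encoder_summary(A_CHANNELS, c_in=1, kernel_volume=9)

print(f"V-Enc weights: separable={v_sep:,} vs standard={v_std:,}")
print(f"A-Enc weights: separable={a_sep:,} vs standard={a_std:,}")
print("Temporal length per stage:", temporal_lengths(16, len(V_CHANNELS)))
```

Under these assumptions the separable video stack needs roughly 26× fewer convolution weights than a dense equivalent, and the 16‑frame clip collapses to a single temporal position by the fourth stage.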