Multimodal encoders like CLIP excel at tasks such as zero-shot image classification and cross-modal retrieval, but they require massive amounts of paired training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders to replicate a multimodal encoder from limited paired data. CSA maps unimodal features into a multimodal space and uses a new similarity score to retain only the multimodal information. CSA involves only the inference of the unimodal encoders and a cubic-complexity matrix decomposition, eliminating the need for extensive GPU-based model training. Experiments show that, given pre-trained unimodal encoders, CSA outperforms CLIP on ImageNet classification and misinformative news caption detection while requiring 50,000 times fewer multimodal data pairs to bridge the modalities. CSA also surpasses the state-of-the-art method for mapping unimodal features to multimodal features. We further demonstrate CSA on modalities beyond image and text, paving the way for modality pairs with limited paired multimodal data but abundant unpaired unimodal data, such as lidar and text.
The CSA framework consists of two main components: unimodal encoders and a similarity analysis module. The unimodal encoders independently process different modalities, such as images and text, to extract feature representations. These features are then mapped into a shared multimodal space by the similarity analysis module (canonical similarity analysis), which applies a similarity score, analogous to CLIP's, to retain only the relevant multimodal information shared across the modalities.
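To make the pipeline concrete, here is a minimal sketch of this mapping, assuming pre-extracted image and text embeddings from frozen unimodal encoders. It uses scikit-learn's CCA as a stand-in for canonical similarity analysis and plain cosine similarity as a placeholder for CSA's similarity score; the exact decomposition and score used by CSA are detailed in the paper.

```python
# Minimal sketch: map paired unimodal embeddings into a shared space and
# score image-text pairs there. scikit-learn's CCA is a stand-in for canonical
# similarity analysis; cosine similarity is a placeholder for CSA's score.
import numpy as np
from sklearn.cross_decomposition import CCA

# Placeholder embeddings standing in for outputs of frozen unimodal encoders
# (e.g., a vision backbone and a sentence encoder) on n paired samples.
n, d_img, d_txt = 500, 768, 384
img_feats = np.random.randn(n, d_img)
txt_feats = np.random.randn(n, d_txt)

# Fit the cross-modal mapping on the small paired set (no GPU training needed).
cca = CCA(n_components=64, max_iter=1000)
cca.fit(img_feats, txt_feats)

# Project both modalities into the shared multimodal space.
img_shared, txt_shared = cca.transform(img_feats, txt_feats)

def similarity(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

scores = similarity(img_shared, txt_shared)  # (n, n) image-text similarity matrix
```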
CSA is far more data-efficient than CLIP: by building on pre-trained unimodal encoders, it maps features using roughly 50,000 times fewer multimodal pairs. We showcase CSA's effectiveness on tasks such as zero-shot classification and misinformation detection, achieving performance on par with CLIP.
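As a hypothetical usage example continuing the sketch above, zero-shot classification can run entirely in the shared space: embed class-name prompts with the text encoder, map them through the same fitted transform, and assign each image to the class with the highest similarity. The random arrays below are placeholders for real encoder outputs.

```python
# Hypothetical zero-shot classification, reusing `cca` and `similarity` from
# the sketch above. Random arrays stand in for frozen-encoder outputs.
import numpy as np

num_classes, m = 10, 32
class_txt = np.random.randn(num_classes, 384)  # prompt embeddings, e.g., "a photo of a {class}"
test_img = np.random.randn(m, 768)             # test-image embeddings

# Map both modalities into the shared space (row counts may differ).
img_s, cls_s = cca.transform(test_img, class_txt)

# Each image is assigned the class whose mapped prompt is most similar.
pred = similarity(img_s, cls_s).argmax(axis=1)  # (m,) predicted class indices
```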
CSA is not limited to image and text. It can be extended to other modality pairs, such as lidar and text, as well as modalities like audio and time series. By leveraging pre-trained unimodal encoders for these modalities, CSA efficiently maps their features into a shared multimodal space, enabling robust cross-modal understanding and retrieval. Check out our paper for more details.
@inproceedings{li2025csa,
title={{CSA}: Data-efficient Mapping of Unimodal Features to Multimodal Features},
author={Po-han Li and Sandeep P. Chinchali and Ufuk Topcu},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=6Mg7pjG7Sw}
}