AI model looks for missing pieces to puzzle
Nathan Jacobs’ lab builds multimodal embedding model
Artificial intelligence models are designed to quickly solve problems and answer questions. A team of computer scientists at Washington University in St. Louis has developed a model to help identify plant and animal species in nature.
Srikumar Sastry, a doctoral student in the lab of Nathan Jacobs, professor of computer science & engineering, and collaborators developed ProM3E, a model that accepts any combination of inputs — a photograph, an audio recording, a satellite image, geographic location and more — and uses whatever it's given to help identify the species observed. This “any-to-any” model learns to infer missing information from context, combining available inputs into a shared embedding space.
Sastry will present the research at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 in June.
The ProM3E model builds on the team’s Taxabind model, published in 2025, which combined six modalities into one cohesive framework to address a diverse range of ecological tasks. Those six modalities were ground-level images of species, satellite images, geographic location, species audio, taxonomic text and environmental covariates.
Sastry said the ProM3E model was used with data from citizen science observations on iNaturalist and eBird, which hold millions of user-uploaded pictures of plants, animals and birds that include metadata, time stamps, geographic and other information. In addition, the team used data from satellite providers that offer free and openly available imagery.
“Previous works don’t consider arbitrary combinations of modalities — you might have a photo and a location, or audio and a satellite image — but ProM3E works with whatever you give it,” Sastry said. “Our model is trained in a self-supervised manner to extract representations and learn the embedding space.”
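One simple way to picture "works with whatever you give it" is to average the embeddings of whichever inputs are present. The Python sketch below is only an illustration under that assumption; the article does not describe how ProM3E combines inputs, and the function names, modality keys and toy vectors here are hypothetical.

```python
import math

# Hypothetical modality names, loosely following the six listed for Taxabind.
MODALITIES = ("image", "audio", "satellite", "location", "text", "environment")


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse(embeddings):
    """Average the unit-length embeddings of the modalities that are present.

    `embeddings` maps a modality name to a vector, or to None when that
    input is missing. Averaging is an assumed stand-in for illustration,
    not the method used by ProM3E.
    """
    present = {name: vec for name, vec in embeddings.items() if vec is not None}
    if not present:
        raise ValueError("at least one modality is required")
    unknown = set(present) - set(MODALITIES)
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    dims = {len(vec) for vec in present.values()}
    if len(dims) != 1:
        raise ValueError("all embeddings must share one dimension")
    dim = dims.pop()
    normalized = [l2_normalize(vec) for vec in present.values()]
    mean = [sum(vec[i] for vec in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean), sorted(present)


# A photo and a location are available; the audio recording is missing.
fused, used = fuse({"image": [3.0, 4.0], "audio": None, "location": [1.0, 0.0]})
print(used, fused)
```

The same call works for any subset of inputs, which is the property the quote describes.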
Sastry said the model is built on a deceptively simple idea: infer what's missing and quantify how confident that inference should be.
“We train the model to predict what a missing input might look like, and not just one answer but a distribution of plausible answers,” Sastry said. “If I give it satellite imagery, it doesn’t just guess what the audio sounds like; it tells you how confident it is in that guess.”
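A common way to have a model output "a distribution of plausible answers" is to predict a mean and a variance for each dimension of the missing embedding and score the prediction with a Gaussian negative log-likelihood. The Python sketch below assumes that form purely for illustration; the article does not say which distribution or loss ProM3E uses, and all names and numbers are hypothetical.

```python
import math
import random


def gaussian_nll(mean, log_var, target):
    """Negative log-likelihood of `target` under a diagonal Gaussian."""
    total = 0.0
    for m, lv, t in zip(mean, log_var, target):
        total += 0.5 * (lv + (t - m) ** 2 / math.exp(lv) + math.log(2.0 * math.pi))
    return total


def sample_embedding(mean, log_var, rng):
    """Draw one plausible embedding from the predicted distribution."""
    return [rng.gauss(m, math.exp(0.5 * lv)) for m, lv in zip(mean, log_var)]


def mean_variance(log_var):
    """Summarize uncertainty as the average per-dimension variance."""
    return sum(math.exp(lv) for lv in log_var) / len(log_var)


# Toy prediction of a missing audio embedding (values are made up).
predicted_mean = [0.2, -0.1, 0.7]
confident = [-4.0, -4.0, -4.0]  # small variance: the model is sure
unsure = [0.0, 0.0, 0.0]        # unit variance: the model is hedging

far_target = [1.0, 0.5, -0.3]   # the true embedding turned out to be far away

print("uncertainty (confident):", mean_variance(confident))
print("uncertainty (unsure):", mean_variance(unsure))
print("loss when confident and wrong:", gaussian_nll(predicted_mean, confident, far_target))
print("loss when unsure and wrong:", gaussian_nll(predicted_mean, unsure, far_target))
print("one plausible sample:", sample_embedding(predicted_mean, unsure, random.Random(0)))
```

Under this loss, a confident guess that turns out to be wrong is penalized far more than a hedged one, which is what pushes a model to report low confidence when the available inputs say little about the missing one.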
The model is designed to generate insights about habitat and climate conditions of different geographic locations worldwide. Sastry said the model could be adapted to address other remote sensing and ecological challenges, for example by fine-tuning it on additional datasets.
Sastry S, Khanal S, Dhakal A, Lin J, Cher D, Jarosz P, Jacobs N. ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology. IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026, June 3-7, 2026.
Support for this research was provided by the National Science Foundation (OAC-2232860) and the Taylor Geospatial Institute.