Expectation learning is a unsupervised learning process which uses multisensory bindings to enhance unisensory perception. For instance, as humans, we learn to associate a barking sound with the visual appearance of a dog, and we continuously fine-tune this association over time, as we learn, e.g., to associate high-pitched barking with small dogs. In this work, we address the problem of developing a computational model that addresses important properties of expectation learning, in particular focusing on the lack of explicit external supervision other than temporal co-occurrence. To this end, we present a novel hybrid neural model based on audio-visual autoencoders and a recurrent self-organizing network for multisensory bindings that facilitate stimulus reconstructions across different sensory modalities. We refer to this mechanism as stimulus prediction across modalities and demonstrate that the proposed model is capable of learning concept bindings by evaluating it on unisensory classification tasks for audio-visual stimuli using the 43,500 Youtube videos from the animal subset of the AudioSet corpus.