The method maps unaligned videos from multiple sources to a common representation using self-supervised objectives constructed over both time and modality (i.e., vision and sound). A YouTube video embedded in this representation is then used to construct a reward function that encourages an agent to imitate human gameplay.
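To make the reward idea concrete, here is a minimal sketch of how an imitation reward could be built from embeddings. It assumes the embeddings of the demonstration video and of the agent's observations have already been computed by some encoder, which is not shown. The names `ImitationReward`, `threshold`, and `bonus`, as well as the sequential checkpoint scheme and the use of cosine similarity, are illustrative assumptions and not details taken from the text above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b)


class ImitationReward:
    """Hypothetical reward: pay a bonus each time the agent's embedding
    comes close to the next checkpoint taken from an embedded demonstration."""

    def __init__(self, checkpoints, threshold=0.9, bonus=1.0):
        self.checkpoints = checkpoints  # Embedded demonstration frames, in order.
        self.threshold = threshold      # Assumed similarity cut-off.
        self.bonus = bonus              # Assumed reward per reached checkpoint.
        self.next_index = 0             # Checkpoint the agent should reach next.

    def __call__(self, agent_embedding):
        if self.next_index >= len(self.checkpoints):
            return 0.0  # Demonstration exhausted, no further imitation reward.
        target = self.checkpoints[self.next_index]
        if cosine_similarity(agent_embedding, target) >= self.threshold:
            self.next_index += 1
            return self.bonus
        return 0.0


# Toy usage with 2-dimensional embeddings standing in for real ones.
reward_fn = ImitationReward(checkpoints=[[1.0, 0.0], [0.0, 1.0]])
rewards = [reward_fn(e) for e in ([0.9, 0.1], [1.0, 0.0], [0.1, 2.0], [0.0, 1.0])]
```

Requiring checkpoints to be reached in order is one simple way to reward the agent for following the demonstrated trajectory rather than revisiting a single familiar state.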