AIM consists of a ResNet34 image encoder followed by an autoregressive, GRU-based waypoint prediction network. This is equivalent to adapting CILRS to predict waypoints conditioned on goal locations rather than vehicle controls conditioned on navigational commands.
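The autoregressive decoding described above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the actual implementation: `TinyGRUCell` and `predict_waypoints` are hypothetical names, the weights are random stand-ins for a trained network, and the details (hidden state initialised from the image feature, rollout starting at the ego origin, the output layer predicting offsets, four waypoints) are assumptions made for the example.

```python
import math
import random


def linear(weights, bias, x):
    """Affine map: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class TinyGRUCell:
    """Minimal GRU cell with random weights (stand-in for a trained cell)."""

    def __init__(self, input_size, hidden_size, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        n = input_size + hidden_size
        self.w_z, self.b_z = mat(hidden_size, n), [0.0] * hidden_size
        self.w_r, self.b_r = mat(hidden_size, n), [0.0] * hidden_size
        self.w_h, self.b_h = mat(hidden_size, n), [0.0] * hidden_size

    def step(self, x, h):
        # One common GRU formulation: gates are computed from [input, hidden].
        z = [sigmoid(v) for v in linear(self.w_z, self.b_z, x + h)]  # update gate
        r = [sigmoid(v) for v in linear(self.w_r, self.b_r, x + h)]  # reset gate
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(v) for v in linear(self.w_h, self.b_h, x + gated)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def predict_waypoints(image_feature, goal, cell, w_out, b_out, num_waypoints=4):
    """Roll out waypoints one at a time, feeding each prediction back in."""
    hidden = list(image_feature)  # assumption: hidden state starts as the image feature
    current = [0.0, 0.0]          # assumption: rollout starts at the ego-vehicle origin
    waypoints = []
    for _ in range(num_waypoints):
        # The input at every step is the previous waypoint plus the goal location.
        hidden = cell.step(current + list(goal), hidden)
        dx, dy = linear(w_out, b_out, hidden)  # assumption: the head predicts an offset
        current = [current[0] + dx, current[1] + dy]
        waypoints.append(tuple(current))
    return waypoints


rng = random.Random(0)
HIDDEN = 8  # illustrative size; a real encoder feature is much larger
cell = TinyGRUCell(input_size=4, hidden_size=HIDDEN, rng=rng)
w_out = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(2)]
b_out = [0.0, 0.0]
feature = [rng.uniform(-1, 1) for _ in range(HIDDEN)]  # stand-in for the ResNet34 output

waypoints = predict_waypoints(feature, goal=(0.0, 10.0), cell=cell, w_out=w_out, b_out=b_out)
print(waypoints)
```

Because each step consumes the previously predicted waypoint, the rollout is autoregressive, and the goal location enters at every step rather than only once at the start.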
[CILRS](https://arxiv.org/pdf/1904.08980.pdf) is a conditional imitation learning method in which the agent learns to predict vehicle controls from an RGB image and the measured speed while being conditioned on the navigational command. In addition, the output of the image encoder is also used to predict the vehicle speed.
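The data flow of such a command-conditioned model might look like the sketch below. It is illustrative only: the command names, the feature size, the single linear layer per head, and the function name `cilrs_forward` are assumptions, the weights are random rather than trained, and conditioning is shown as choosing one control head per command, which is one common way to realise it.

```python
import random


def linear(weights, bias, x):
    """Affine map: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def random_layer(out_dim, in_dim, rng):
    """Random weights and zero bias, standing in for a trained layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


# Assumed command set, for illustration only.
COMMANDS = ("follow_lane", "turn_left", "turn_right", "go_straight")
FEATURE_DIM = 8  # illustrative size of the image encoder output

rng = random.Random(0)
# One control head per command; each maps [image feature, speed] to three controls.
control_heads = {c: random_layer(3, FEATURE_DIM + 1, rng) for c in COMMANDS}
# The speed head sees only the image feature, not the measured speed.
speed_head = random_layer(1, FEATURE_DIM, rng)


def cilrs_forward(image_feature, measured_speed, command):
    """Return controls from the head selected by `command`, plus a speed estimate."""
    joint = list(image_feature) + [measured_speed]
    weights, bias = control_heads[command]
    steer, throttle, brake = linear(weights, bias, joint)
    (predicted_speed,) = linear(*speed_head, image_feature)
    return {
        "steer": steer,
        "throttle": throttle,
        "brake": brake,
        "predicted_speed": predicted_speed,
    }


feature = [rng.uniform(-1, 1) for _ in range(FEATURE_DIM)]  # stand-in encoder output
left = cilrs_forward(feature, measured_speed=5.0, command="turn_left")
right = cilrs_forward(feature, measured_speed=5.0, command="turn_right")
print(left)
print(right)
```

In this sketch the two calls share the same image feature and measured speed, so any difference in the controls comes from the command, while the speed estimate depends on the image feature alone.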