Hey RL Reading Group,
We’re back on Substack! This semester has been a bit rough for the reading group since it just got started and we don’t have many presenters, so forgive our inconsistency. The dates in the header are wonky just because we haven’t made this a weekly thing yet.
Anyway, thanks for showing up to our previous meeting on November 4th! Short recap below:
Recap
During our previous meeting we discussed Reinforcement Learning as One Big Sequence Modeling Problem (which has since been accepted to NeurIPS with an updated title…). The paper is interesting from a practical perspective because it essentially just throws a transformer at an RL problem and makes it generate feasible trajectories. At scale, a robust iteration of this work could see significant adoption, especially coupled with the already massive efforts to deploy transformers in other domains.
The paper itself was pretty short, so we were able to discuss it in its entirety. The key motivation of the paper was to replace standard RL pipelines (such as actor-critic methods, model-based RL, and offline RL) with a single transformer (which they call the “Trajectory Transformer”) that learns a distribution over input trajectories. The distribution could then be applied to imitation learning, goal-conditioned RL, and offline RL tasks.
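To make the sequence-modeling framing a little more concrete, here’s a minimal sketch of the core idea: discretize each state, action, and reward dimension into bins and interleave them into one long token sequence that a standard autoregressive transformer can model. The bin counts, value ranges, and array shapes below are made up for illustration and are not the paper’s exact settings.

```python
# Minimal sketch of the "trajectory as token sequence" idea, assuming
# per-dimension uniform discretization; bin counts, value ranges, and
# shapes here are illustrative, not the paper's exact settings.
import numpy as np

def discretize(x, low=-1.0, high=1.0, n_bins=100):
    """Map each continuous value to an integer bin index in [0, n_bins)."""
    x = np.clip(x, low, high)
    return ((x - low) / (high - low) * (n_bins - 1)).astype(np.int64)

def flatten_trajectory(states, actions, rewards):
    """Interleave (state, action, reward) tokens per timestep into one sequence."""
    tokens = []
    for s, a, r in zip(states, actions, rewards):
        tokens.extend(discretize(s))              # state dimensions
        tokens.extend(discretize(a))              # action dimensions
        tokens.extend(discretize(np.array([r])))  # scalar reward
    return np.array(tokens)

# Toy trajectory: 3 timesteps, 4-dim states, 2-dim actions
T, s_dim, a_dim = 3, 4, 2
states = np.random.uniform(-1, 1, (T, s_dim))
actions = np.random.uniform(-1, 1, (T, a_dim))
rewards = np.random.uniform(-1, 1, T)

seq = flatten_trajectory(states, actions, rewards)
print(seq.shape)  # (T * (s_dim + a_dim + 1),) = (21,)
```

Once trajectories look like token sequences, training reduces to next-token prediction, which is exactly the setup transformers are already good at.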
You can view the recording here (it starts around the 10-minute mark since we ran into some technical difficulties; UMich only).
Questions
Some questions came up during our meeting:
What is goal-conditioned RL?
What kinds of problems does the Trajectory Transformer work on?
Who else is working on this?
(1) Goal-conditioned RL considers a specific goal state and steers its generated trajectories toward that goal state. In practice, we include the goal state as a variable that the learned trajectory distribution conditions on. Goal-conditioned RL looks to be practical: for example, at inference time a successful goal-conditioned RL algorithm could correctly execute instructions from a human who specifies the goal they want the robot to achieve.
(2) We discussed their results, which evaluate error accumulation (i.e., how much error do trajectories sampled from the Trajectory Transformer’s learned distribution accumulate over long horizons?) and whether the algorithm is viable for control. They were able to demonstrate strong performance on environments from the D4RL offline benchmark suite, which includes locomotion and maze-solving problems.
(3) The most notable example is the Decision Transformer from Lili Chen et al. The two methods are very similar, with slight differences in input representation and evaluation, and, most notably, a different way of producing control outputs (the Decision Transformer uses a linear layer for action prediction, whereas the Janner et al. paper uses beam search; see the sketch below). I know that some teams in industry are working on large-scale versions of this as we speak, so we’ll probably see a lot more of this work and its adaptations to new domains in 2022.
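Since the beam-search vs. linear-head distinction came up a few times, here’s a minimal sketch of what beam-search decoding over a token-level sequence model looks like. The log_probs function is a made-up stand-in for a trained Trajectory Transformer, and candidates here are scored only by sequence log-probability; the paper’s control variant also folds predicted rewards into the score.

```python
# Minimal sketch of beam-search decoding over a token-level sequence model.
# `log_probs` is a made-up stand-in for a trained Trajectory Transformer;
# real planning for control would also score candidates by predicted reward.
import numpy as np

VOCAB = 10  # toy vocabulary size

def log_probs(prefix):
    """Stand-in model: returns log p(next_token | prefix) for a toy vocabulary."""
    rng = np.random.default_rng(abs(hash(tuple(prefix))) % (2**32))
    logits = rng.normal(size=VOCAB)
    return logits - np.log(np.exp(logits).sum())

def beam_search(prefix, steps=5, beam_width=3):
    """Keep the `beam_width` highest-scoring continuations at each step."""
    beams = [(list(prefix), 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            lp = log_probs(seq)
            for tok in range(VOCAB):
                candidates.append((seq + [tok], score + lp[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # best-scoring continuation

best_seq, best_score = beam_search(prefix=[1, 4, 2])
print(best_seq, round(best_score, 3))
```

A Decision Transformer-style head would instead read out the next action directly from the transformer’s final hidden state, trading planning-time search for a single forward pass.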
Coming Up
This week we did not have a meeting because no one reached out to present and I (Nikhil) do not have the bandwidth to prepare new material for every meeting.
Our next meeting will be the last of the semester, on December 3rd, 2021. Please sign up to facilitate — it’s a great way to force yourself to read papers related to your work and/or interests and learn about RL! Also feel free to reach out to me if you’re looking for topics to present on :)
Miscellany
We’re looking for any recommendations regarding the reading group that you might have — reach out to me (Nikhil) if you have any :)
Also, make sure to add our calendar if you haven’t already!
That’s all for this week.
Working on goal conditioning,
RL Reading Group Coordinators