Imbue Paper Party Questions
Feel free to write your questions here ahead of time
Add your initials under a question to vote for it
Check off when answered

Norms:
  • Paper Host reads their paper deeply, so that they can answer our questions as much as possible.
  • Queue up 3 papers each time, in case some papers aren’t very good.
  • Everyone else is expected to skim the papers ahead of time.
  • Write down your questions when you skim the paper.
  • No need to go deep; confusion about basic things is okay - it clarifies things for everyone else.
  • Paper Host gives a 3 minute intro:
  • What’s the most important takeaway, and why is it important?
  • Why might we care about this paper? How much would our life change if this is true or false?
  • What is our overall assessment of this paper?
  • What questions do we have of the paper?
  • Write down all questions we have, and prioritize/vote for them before diving in.


2024/10/04

  • How come your (Bas’s) Figure 1 is missing the Llama?
  • must be cache invalidation
  • What surprised you about this paper?
  • Good questions! 
  • Which ones most interested you?
  • Sorry I started writing a different response
  • I meant to say Good question! 
  • I think the biggest thing is that intervening in a very specific and small direction (only 1/80 layers, only 1 direction in a high-dimensional space) actually has a pretty large effect on the behavior. 
  • I don’t understand where the RL comes in. Are they actually doing RL training on the Llama 3.1 weights, or just using some RL method as the baseline to compare against?
  • My understanding is that they’re trying to accomplish some “RL” tasks in-context. That is, it’s like: do this grid world thing, but in text. Then you textually are told what reward you got. You roll out several episodes like this. And conditioned on the text, can you solve the task? So no RL training. Purely in-context updates.
  • Yes indeed! And they find that the policies that Llama comes up with resemble what you’d do if you implemented a simple Q-learning algorithm
  • Can you remind me of the difference between Q-learning and myopic?  (Or is that too much detail?)
  • Myopic only cares about the reward you get, on average, at the next state in the graph; Q-learning is able to capture the graph structure. For example, if choosing left goes to a state where one choice is +1 and one is -1, then myopic would treat this as an average of 0, but Q-learning would understand that once you reach that state you will choose the +1, so it assigns the correct value of +1 (see the sketch below).
  • My succinct summary is: myopic = next reward; Q-learning = returns? 
  • yes
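  • A tiny worked example of that distinction, with made-up numbers (not from the paper):

```python
# "left" leads to a state whose two follow-up actions give +1 and -1;
# "right" leads to a state whose follow-up actions both give 0.
followup_rewards = {"left": [+1, -1], "right": [0, 0]}

# Myopic value: the average reward seen one step after taking the action.
myopic = {a: sum(r) / len(r) for a, r in followup_rewards.items()}
# {'left': 0.0, 'right': 0.0} -- myopic can't tell the two actions apart.

# Q-style value: assume you act greedily once you reach the next state.
q_value = {a: max(r) for a, r in followup_rewards.items()}
# {'left': 1, 'right': 0} -- correctly prefers "left".
print(myopic, q_value)
```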
  • I think the representation learning is pretty neat!  Somehow it reminds me a bit of grammar or type induction.
  • What other experiments would you like to do with this setup (monosemanticity + LLMs + RL type things?)
  • Another good question
  • Off the cuff: making it more real-world? If you can actually show that internal SAE-identified dimensions correlate with things we care about in problems we care about (beyond these toy problems), then showing that manipulating those dimensions makes a meaningful difference would be great
  • Aah yeah, that seems tricky but would be really cool to see
  • But also, showing that you can improve things. Loss-of-function is cool but gain-of-function is cooler
  • I wonder if the representations provided by the SAE code provide a representation for the intuition of “something isn’t right with this code”.  Can you extract a useful similarity embedding space from the SAE?
  • I guess initially it’s kind of a cool and interesting result that the activations correspond to Q-functions. But then if I think about it more: If the model is able to solve these simple MDPs, isn’t it necessary that some latent states correspond to some sufficient quantities for the optimal policy? I think the technique is cool, though
  • I think that the fact that we can detect them this way is interesting!
  • I would like to see Llama’s accuracy conditioned on transitioning from one of the central bridging nodes (that are connected outside the fully connected cliques).  I wonder if the accuracy is lower or higher there?
  • Yeah that’s interesting, I’m not sure what I’d predict actually
  • I suspect it’d be worse than inside the cliques, because it has to remember “which one” of the members of the neighboring clique is the one you’ll transition to.

2024/09/27


  • Can you go over the two stages again? I’m a little confused.
  • Do they explain why the collapse occurs? Or is it more empirical that they notice it doesn’t?
  • Check out section 4, it’s largely empirical
  • My head canon is this is the difference b/w supervised and on-policy RL
  • Seems like RL is responsible, but why? Is it something like the model is pushed to explore?
  • “By probing trained models, we find that these failures largely stem from supervised fine-tuning amplifying the initial bias of the base model resulting in only minor changes to its first-attempt response.”
  • Did they attempt to address this by fine-tuning it to produce something different, to directly push back against that?
  • Is their repair model outputting diffs?
  • No it’s outputting a full solution
  • In the example, the original solution is actually correct except for the final line (192 mod 7 is not 1 but 3). Is there any way to get token-level or sentence-chunk-level signal with this approach?
  • for human eval, their prompt is 
  • # There might be an error in the code above because of lack of understanding of the question. Please correct the error, if any, and rewrite the solution. Only output the final correct Python program!
  • It seems like you’d be able to get much more information from actually running the code (using some generated invocations), or even running a verifier first - what is the model supposed to do when the original code is already correct? It seems like these extensions are quite easy, do you see any obstacles? 
  • I think maybe they want to get signal from more documents using autoregression, than just the relatively few they can run?
  • For the STaR and SFT baselines, what’s the fine-tuning setup? Do you just predict the y^+ conditioned on x and y^-?
  • Yeah, basically. See section 4.1.
  • But is there no fine-tuning for the first response? That doesn’t seem correct. I don’t see the details described in 4.1. I would guess that you do some sort of fine-tuning for each step independently, but I don’t see that described.
  • Yeah I wish they wrote the learning objective but I’m guessing they used the same as in the original STaR paper :/
  • Wait, what does that mean? What do you do given a (x, y^-, y^+) triplet?
  • I’m guessing the loss is on predicting y^+ conditioned on x and y^-, as you originally wrote! So E(y^+ | x, y^-) with a self-correction instruction inserted between x and y^- (see the sketch at the end of this thread)
  • But what do you do for the first step? I assume there should be some objective to encourage you to directly predict y^+ from x, otherwise the t1 accuracy shouldn’t be good
  • Ohh I see your question
  • I think section 3 equation 1
  • They train on a multi-step RL objective in parallel, so each of the l attempts are supervised simultaneously and intermediate turns are supervised indirectly?
  • I don’t think it’s the objective in S3.1, though. Shouldn’t that just be the problem objective, not necessarily what they’re using to train STaR? That’s not an SFT objective
  • Yeah you’re right, that’s their multi-turn RL problem setup. Sorry I have no idea man! I’m reading the same paper as you 🤷 
  • Sorry, I don’t mean to be antagonistic. I’m genuinely not sure what the answer to this question is and just trying to find out!
  • No worries, sorry I don’t have a clearer answer. My guess remains that they simultaneously trained on the entire trajectories to optimize for both (y^- | x) and (y^+ | x, y^-)
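  • A minimal sketch of the guessed objective E(y^+ | x, y^-) from this thread (our reading, not a confirmed detail of the paper): concatenate the problem, the first attempt, and a self-correction instruction, then compute the loss only on the corrected-attempt tokens.

```python
CORRECTION_INSTRUCTION = (
    "There might be an error in the solution above. "
    "Please correct the error, if any, and rewrite the solution."
)

def build_self_correction_example(tokenizer, x, y_minus, y_plus):
    # Loss is masked to y^+; x, y^-, and the instruction are context only.
    prefix_ids = tokenizer.encode(x + y_minus + CORRECTION_INSTRUCTION)
    target_ids = tokenizer.encode(y_plus)
    return {
        "input_ids": prefix_ids + target_ids,
        # -100 is the conventional "ignore" label for HF-style cross-entropy.
        "labels": [-100] * len(prefix_ids) + target_ids,
    }
```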
  • From Table 5 they get a large lift from using RL instead of STaR for stage 2 (56.2 → 60.0% accuracy at t1); I’m wondering what accounts for this? Would it be related to mode collapse? Or maybe being out-of-distribution?
  • Were there any results for third-attempt (or later) accuracy, i.e., could their method extend beyond t2?
  • E.g. GPT-o1 seems to do much more thinking and presumably correction
  • I don’t think they did, they say “An interesting avenue for future work is to train with more than two attempts via RL…”
  • What is SFT in Figure 3c?
  • What are your key takeaways from this paper?
  • It seems like a vindication of RL for multi-step editing, but also a warning that the details of RL matter a lot here
  • Figs. 5/6 indicate that they interleave Stage I/Stage II/Stage I/Stage II - why is that?

2024/09/13

Strawberry delight

Experiments: 
  • Wordle: 
  • GPT-4o (no CoT): failed and made lots of dumb repetitions
  • o1: passed, with messy reasoning, and only a couple of dumb repetitions
  • MADDPG

async observations
  • It is good at pretending to plan, but it doesn’t actually plan, at least not in the sense that it can now suddenly solve all kinds of planning tasks. 

  • I’d be curious to see how o1 does on our interview problems
  • Probably pretty well since they’re basically just competitive programming questions
  • I saw that they tried to have it solve the openai interview problems and it did well

flip/flop
  • when I used o1-preview in implement.py, or at least tried to, I saw no notable change to its behavior, but I’m not convinced it was actually using that model
  • when I use the app and pose the same problem interactively, it thinks a really long time and claims to be doing a lot of things, and then concludes that the problem is unsolvable because only one letter can be changed; so it has concluded that each operation can be performed only once
  • if I nudge it out of this error, it does solve the problem straightforwardly
  • this feels very much like they mostly just put a feedback loop into the interface
  • yeah, its introspection looks a lot like an implement.py sequence
  • oh, yes, the API doesn’t support o1 yet

  • o1 seems much more black-box-y, since (1) it does chain of thought by default, and (2) they don’t expose the chain-of-thought reasoning. 
  • It might be harder to craft complicated multi-stage prompts, where we have precise control over its chain of thought
  • For example, it would be hard to efficiently insert ground truths like test results or log outputs into the prompt. 
  • !
  • I wonder whether the scaling laws for training compute vs inference compute correspond to different capabilities 
  • It seems like scaling inference compute unlocks more ‘agency’ (alternatively ‘goal-directedness’), i.e. coherent reasoning over long time periods
  • Whereas scaling training compute provides more ‘intelligence’. One way to interpret this is as better ‘intuition’, or better heuristics for exploring the search space.

demos
  • Bas: games
  • Bryden: comparing o1 and o at coding

2024/09/06

  • Thad’s summary
  • They’ve built a re-ranker that takes several coding solutions and picks the best one.
  • Despite saying that the agents “collaborate” (“enable scalable management and collaboration among specialized agents”),  I don’t think there’s any information shared between agents, nor any combination of outputs from different agents.
  • They DO build up evidence that having diversity in the pool of candidate solutions is a good thing.
  • Particularly: "We wish for our evidence set to cover a large, representative portion of the search space to obtain a more accurate estimate of risk."
  • Figure 1: Current systems are quite diverse!
  • Let’s use some fancy words, Latex, and greek letters somehow! Contextual Markov decision process (CMDP), ℳ=(𝒮,𝒞,𝒜,ℛ,𝒫,p0,ρ)
  • Figure 2: This system is a reranker / rescorer.
  • Figure 3: The main point of the paper is that the pink line for the reranker (n@k) is above the green line (average performance).
  • Something weird is going on with the combination of re-ranking and Aider?
  • Same as Table 2.
  • Ablation:
  • Running their own system multiple times helps.
  • I wish they’d said whether they held the individual agents’ output constant or not.
  • Chain of thought helps a bit.
  • Thad’s detailed notes
  • Interesting: Alibaba Lingma Agent (Ma et al., 2024) constructs a repository knowledge graph to represent code and dependencies, using a Monte Carlo tree search-based strategy for repository exploration
  • What does this mean: “spectrum-based fault localization to the agent for enhancing context understanding and issue resolution”
  • Current multi-agent frameworks are categorized into three types based on their execution patterns.
  • Firstly, static agent working flow (Wu et al., 2024; Github, 2023), which pre-defines the agent execution flows and ignites agent transitions via specified conditions. Controlling a multi-agent system with pre-determined states is robust, though losing flexibility in terms of unseen states or conditions.
  • Secondly, ensemble via group chatting (Wu et al., 2023; Hong et al., 2024; Wang et al., 2024a; Chen et al., 2023). This is built upon an environment where multiple agents send messages to each other in a group channel such that their thoughts are ensembled. Variants of group chatting includes debating (Liang et al., 2023; Chan et al., 2023) and model-wise ensembling (Wang et al., 2024a).
  • Last but not least, hierarchical task assignment (Liu et al., 2024; 2023). Organizing multi-agent in a hierarchical structure benefits the top-down task decomposition and thus enables efficient multi-agent collaboration.
  • Strikes me as a weird thing to say: “LLMs often excel at evaluating solutions when evaluation is easier than generation.”
  • Four inputs are given to DeiBase for each patch:
  • the issue description itself
  • relevant context (code snippets identified by an SWE agent as relevant to the issue),
  • code before the patch,
  • code after the patch
  • n@k measures the performance of any reranking mechanism by computing the number of problems solved by n chosen submissions from a total of k samples (they do n=1; see the rough sketch after these notes).
  • Don’t totally understand this: “Note that the order – in which different runs are added – matters as k gets larger, especially when the k candidate solutions come from k different agents.”
  • Is this part of why the “n@k” curve is not monotonically increasing?
  • Our Large Language Monkey friends appear: As k – the number of candidates – gets larger, the gap also gets larger.
  • This suggests that given a limited budget of candidates, it would be better to choose a diversity of agents over multiple runs of the same agent.
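  • A rough sketch of 1@k with a reranker versus the oracle bound, as I read the definition above (not the authors’ code); each problem comes with k (patch, resolved) candidates:

```python
def one_at_k(problems, reranker):
    # `reranker(candidates)` returns the single candidate it would submit.
    picks = [reranker(candidates) for candidates in problems]
    return sum(resolved for _, resolved in picks) / len(problems)

def oracle_at_k(problems):
    # Upper bound: a problem counts as solved if any of its k candidates resolves it.
    return sum(any(r for _, r in cands) for cands in problems) / len(problems)
```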
  • Online Discussion

Questions

  • Do they quantify diversity here? 
  • I do not remember seeing any quantification of diversity in the paper
  • Do they have some measure of how diverse the patches they rank are? 
  • ‘DEI Committee’ is an…amusing name
  • This paper is banned in Florida
  • 🙂 
  • How do they ensure diversity among their patches?
  • As I understand, this is just re-running the systems with different seeds, or using different systems all together.
  • What exactly are they using for the reranker?
  • See here for prompts:
  • Interesting, do we know which ones they actually used? Eg the pairwise prompt means 100 queries for 10 patches, which seems bad… but not unusable
  • For future readers it’s just SINGLE_SCORING_WITH_IDENTIFIED_SPANS_TEMPLATE that gets used
  • And I thought they mentioned some kind of training data, is there any, and how is it used?
  • I am a bit surprised that this works. We have tried re-rankers as well and with just the code it doesn’t do particularly well on lbpp type problems. Is there something about the problems that makes it easier to recognize a good fix? 
  • I’m curious to hear more about our attempts at something similar.
  • We have taken attempts from gpt4 at lbpp problems, and have gpt4 vote on which looks the most promising - akin to “tree of thought” experiments. This is Eric’s work. We did not see much lift when just presenting the code, but if we present code + execution information (i.e., invocations), we saw some lift but nothing too crazy. 
  • Also Evan has worked on verifiers which are somewhat similar but those are actually trained models which provides a lot more opportunity to do well
  • I suspect it has to do with them using multiple agents in their options, given their in-agent resolve increases don’t seem as high (except for Moatless???)
  • I wonder if we would get some small lift from even just using a few different prompts
  • I think I missed this – is their system built off of an existing LLM or do they do some kind of training?
  • I don’t think they train anything.
  • (answered by above)
  • Well, there are some finetuned models inside the closed version of their system which hits SOTA, but I don’t think they trained anything themselves.
  • Interesting, most of these agents are using the same subset of LLMs – is this basically just mixing prompts/doing pass@k?
  • Yes, that’s how I read it.
  • Yeah like bas mentioned above, pretty surprised this works. Maybe this means we should be trying a few different LLM approaches (e.g. test first, multiple iterations, idk) in our loop?
  • It seems like the main lift might be coming from multiple agents, since they don’t seem to do that much better when picking from generations in the same agent
  • Just to clarify, for figure 1a, for the questions that pass for one of the non-DEI models but fails on DEI - this just means the reranker didn’t identify the best patch correctly, right?
  • Yeah I believe so, e.g. oracle vs DEI resolve %
  • Yes, I agree.
  • What exactly are the votes?
  • :0 like GPT assigns the score multiple times and each time is a vote?
  • Yes, GPT assigns a score (1-10 I think?).  
  • Looks like it

  • Thad’s summary
  • Agents can break down a large problem and communicate state at various levels of abstraction.
  • If you want to sample diversity in LLM outputs, here’s one way to get it!
  • Thad’s notes
  • What do these mean: Addition-by-Subtraction Collaboration and Trilateral Collaboration
  • Diversity generation:
  • To enhance the realism and efficacy of our simulation in the translation process, we strategically utilize gpt-4-turbo to generate a diverse set of 30 virtual agent profiles for each distinct role. 
  • On the Importance of the Judgment Agent
  • Although recent advances in large language models (LLMs) claim that LLMs are capable of processing extremely lengthy sequences of up to millions of tokens, we still observe that our agents are not able to effectively leverage the information in the context as the conversation expands. 
  • Additionally, we observe that the meaning of translations tends to deviate from the original text after several iterations of revision. Therefore, it is critical to have the Judgment agent within the Trilateral Collaboration to ensure the overall quality of the response.

Questions

  • How did they handle race conditions–were the agents parallel with shared state or sequential?
  • There’s never a race.  They have these collaboration patterns called “Addition-Subtraction” and “Trilateral Collaboration”, and they’re turn-based, and embedded within a larger management structure.
  • So I guess you could simulate this entire framework within a single LLM prompt and response?
  • They talk about how the system breaks down with very long input context windows.
  • So the agents are just the same model prompted differently?
  • Yes!
  • How much is due to planning vs the different personas? Did they do any ablations?
  • No, I wish they had made everyone a boring person in a gray suit and run the same thing!
  • (Or just not prompted with the personas at all.)
  • How different are the personas?
  • Kind of cute, funny that they have such an arbitrarily complex system that does seem to produce good diversity
  • I want to see their personas and see which ones do the best…maybe I’m just describing a performance review
  • That is a fun idea!
  • Also I’d be interested in the distribution of which agents are picked
  • Do you think any of the lift is coming from the “management structure” of the agents, or is it mostly coming from the increased diversity?
  • “Lift” is perhaps a weird way to talk about a drop in BLEU score, but … 🙂 
  • They do talk about this a bit, in particular how the localization and proofreading affect various parts of the score.  Let me see if I can find that…

  • I wonder if the context window doesn’t work partially since they’re assigning such human personalities to their agents…wouldn’t be the weirdest thing an LLM has done
  • Yeah, I imagine that it would be harder for an LLM to faithfully simulate a persona if it has to constantly switch within the same context window
  • My understanding was that each persona is isolated in a prompt and just passes inputs/outputs to the others? I was thinking moreso that the LLM might be affected by the personas themselves
  • That’s my understanding, too, that there’s particular kinds of message passing, but nobody sees the whole LLM history.
  • What personas do we need for writing code?
  • The infra engineer

8/30/2024


  • Just checking my understanding:
  • we generate instruction-following examples by prompting GPT-4-Turbo to combine random combinations of k skills (and a random choice of query type for the seed-dataset agnostic version). Since the number of k-tuples scales as N^k , where N is the total number of skills
  • Should N^k be N choose k? Or am I misunderstanding?
  • Yeah, looks like it’s definitely a typo, written later on.
  • Thoughts on analogous techniques for code generation?
  • +1
  • There are definitely lots of “skills” in software engineering/coding; maybe we can similarly extract “topics” from Nix/code gen datasets or just generate these topics, cluster skills, then combine to generate synthetic data?
  • I’m confused by Table 4: are they saying that the improvements over Alpaca-1K Longest are mainly due to improvements in GPT-4 to GPT-4 Turbo?
  • My understanding is that table-4 shows that the improvements are independent of the answer gen model - like you see improvements from their mix vs alpaca-1K for both GPT-4 models (I could be misunderstanding though)
  • ah, yes. Thanks, I agree. Sorry, I phrased this question poorly. Resolved in online discussions.

8/23/2024


  • What’s an example of a stream of search for code generation?
  • I think you could imagine things like: see the whole trajectory of how a human generated a piece of code. e.g., You first wrote an implementation that’s broken, then you had some series of edits that you made to fix it. Or maybe you even took the wrong high-level implementation strategy and only realized it after some amount of development. Maybe there are others, too! Curious to hear about those.
  • Does the model actually spit out this entire trajectory? I’m sort of confused how it comes about
  • Like, with delete tokens? 
  • Current models don’t do this, no. But I think the idea would be to train a model to output in a format like this
  • The model in the paper was trained to produce these kinds of “backtracking” operations I think?
  • It seems strange to think that a model would be good at knowing when to backtrack, given that humans usually only know when to do this by executing code?
  • I agree that I don’t think the model in the paper got this kind of feedback as it produced tokens: While we leave the evaluation of states to be done implicitly by the network in our current work, explicitly representing state evaluations (Gandhi et al., 2023) and introducing other formalizable operations such as limits, summarization, cycle checks, and subgoal setting could enhance the SoS framework
  • I don’t see how you can evaluate the states any other way, given that this is all part of pretraining? Wouldn’t you need to invent a different type of pretraining? Like you can’t pretrain it to output a good search path which factors in external feedback that doesn’t yet exist
  • True, I think we wouldn’t be able to apply this exact approach, especially at the pretraining level but the idea of backtracking data may be worth exploring
  • I guess you could do this based on how the code changes after execution and errors, kind of like how some existing methods do debugging/repair, but the debugging/repair tokens become part of the pretraining data
  • What do they mean by “implicit” heuristics?
  • For deciding what states to visit next the search strategies have some heuristics for which states may be better to visit, in the case of the model training data they do not explicitly list out these heuristics (instead leaving it implicit) so that the model may learn its own heuristic
  • “Each of these operations can be left implicit, affecting how the trajectory unfolds, or made explicit in language as part of the search trajectory T . When operations are implicit, a model is more likely to internalize abstract representations for them that can be improved with training”
  • I think it’s this: While we leave the evaluation of states to be done implicitly by the network in our current work, explicitly representing state evaluations (Gandhi et al., 2023) and introducing other formalizable operations such as limits, summarization, cycle checks, and subgoal setting could enhance the SoS framework
  • What’s the intuition for why the SoS does better than pretraining on the optimal solutions only?
  • I think the intuition is that seeing data in which the correct answer is not immediately produced allows the model to get better at error correction and backtracking to be able to recover from taking one bad step
  • I guess the point was that you get to train on some search data. And I guess on this particular task, it’s probably difficult to know a priori what to do, so you just have to try a bunch of stuff
  • I think this is beneficial for problems with high branching factor, where backtracking and searching at test-time is more likely to succeed than trying to one-shot it
  • Anyone have a hypothesis for how the results would pan out if you fine-tuned the Optimal Paths model (Optimal Paths + StaR/APA) compared to SoS + StaR/APA? Presumably worse, but curious if people have thoughts.
  • There’s this other paper (don’t remember where I saw it) that showed training on negative reward trajectories was 8x more efficient than only optimal trajectories?
  • Interesting, was that the paper where they extend and improve on STaR, by chance?
  • It seems strange to ask the model to perform backtracking if the model isn’t smart enough to test out a node in the search tree. Like with the 24 game it seems plausible that a model can try a candidate solution, but I feel like it can’t do the same for verifying that code works. Am I thinking about this right?
  • Yes, I think that would definitely be a challenge for bringing a similar approach to coding, when you create the training data you could have access to these traces but since the model doesn’t have access the problem definitely seems more difficult
  • what if you interrupted the model output right before it simulates the ground truth test result, then run an actual ground truth test, and insert it back into the prompt, then have it keep predicting next token
  • What are the big assumptions baked into their approach?
  • Like definitely need a way to produces search traces, so it’s only relevant for specific types of problems
  • Yeah, relatively easy for problems where there are suboptimal symbolic search algorithms (BFS/DFS)
  • How might we generate these for code? 
  • I’d be curious to see if pre-training on code, execute, error, new code, traces would be similar
  • (1) Have an LLM generate some code, (2) Run that code on ground truth test. (3) While the tests fail, try to repair the code. Do this for a bunch of different initial prompts, and generate a bunch of synthetic data. Then filter out the traces that took too long, or never ended up solving the problem.
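  • A rough sketch of that loop (`generate`, `repair`, and `run_tests` are hypothetical callables standing in for whatever generation and execution machinery is available):

```python
def collect_search_traces(prompts, generate, repair, run_tests, max_attempts=5):
    traces = []
    for prompt in prompts:
        code, trace = generate(prompt), []
        for _ in range(max_attempts):
            result = run_tests(code)             # assumed to return {"passed": bool, ...}
            trace.append((code, result))
            if result["passed"]:
                break
            code = repair(prompt, code, result)  # revise / backtrack using feedback
        # Keep only traces that eventually solved the problem; drop the rest,
        # as suggested above.
        if trace and trace[-1][1]["passed"]:
            traces.append(trace)
    return traces
```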
  • This aligns more with my mental model of an agent than my mental model of a model. Like we’re training an agent to be good at efficiently searching the space of solutions given some external feedback. It seems like the main challenge is training a model to get good at making the right decisions given feedback. Maybe a model that’s good at knowing how to backtrack intelligently on simple problems like the 24 game would somehow also be good at knowing how to decide when to backtrack given external feedback?
  • I’m not sure I understand - if a model was capable of backtracking effectively, wouldn’t we be able to replace the agent (as in, the agent could just be the model directly)?
  • What I mean is what if you pretrained a model that’s good at spitting out full search trajectories for the 24 game, then used that model as part of an agent. Like if we had an agent that creates code, runs it, then decides what code to try next, maybe that agent would work better if it had a “brain” that was pretrained on the 24 game thing and therefore “good at planning and backtracking” (please ignore if this doesn’t make sense)
  • Ah I see, I think that would be an interesting thing to try, I think they were also initially motivated by the model going down an increasingly bad path after one bad token generation (which may be a smaller mistake than we would expect an agent loop to solve?)
  • Not sure if I understand your question—I think the point of this paper is that most agent approaches today use a symbolic algorithm like BFS/DFS in the outer loop, but this paper was interested in seeing if the model could also internalize the search algorithm itself (or even improve on it) by training on many such search traces
  • It’s impossible to internalize the search algorithm for code generation though, so I was wondering if the skills obtained by getting good at internalized search for the 24 game would make the model better at serving as an agent for code generation. Like maybe a model pretrained this way would react more strategically to tracebacks
  • I don’t think I fully agree that it’s impossible to internalize a search algorithm since you could imagine a model trained in this way may be more likely to produce better code (for example if it started to go down the wrong path it could self correct) even without explicit access to a way to run code
  • This is a good point, like it can definitely catch syntax errors on its own for example
  • For our purposes reacting to tracebacks does seem useful, I don’t think I have good intuition for how much transfer there would be to this task though
  • Does their model output the entire search trajectory, including the backtracking, all in one output response, or is there some type of prompting loop?
  • There is no prompting loop, you can see an example in figure 8 in the appendix
  • Yeah, they mention how they limit the size of the problem so that the whole trace fits in the context window.
  • What are these “ previously unsolved problems (by the symbolic search strategies) “?
  • Shouldn’t both BFS and DFS always find a path to a solution if it exists?  Do they mean with some limited search budget?
  • Yeah I was also confused by this, I assume they did have some budget
  • Maybe just unsolved within a budget that matches the model’s context window?

8/16/2024


  • So the reward model is a worse verifier than majority voting
  • yes - probs due to more samples meaning you are more likely to go out-of-distribution of the RM
  • scaling sampling doesn’t get you to coverage of 1 on some problems, what is the limitation here?
  • do we think this would be useful for creating better solutions for training data in cases where we already have tests? what are the other applications?
  • Josh said that we should investigate the lift from a non-perfect verifier

  • This would definitely make it easier for me to use and trust Crafty when I ask it to do complicated things. Are we already doing something similar?
  • We aren’t — it’s a very interesting idea, maybe a feature worth building if you think it would help you be able to use Crafty
  • I’m not sure how this differs from just asking the agent to leave comments in the code that it writes?
  • I mean right now we’re already doing the thing of leaving comments to get crafty to write code
  • I find it intriguing that constrained decoding helped them avoid unintended file edits. What was the argument against using this in crafty?
  • It’s super complicated to implement, and only works for cases where you are adding comments without wanting to change the python at all
  • I’m not sure how helpful the change-outline-to-change-code functionality would be - I feel like in most cases changing the outline would result in rewriting the code rather than making small diffs… 
  • we did kinda find that, yeah
  • Google making a coding play?
  • nah, this was clearly an internal prototype
  • I like the prospect of having natural-language documentation that is necessarily synchronized with the code. It seems like one of the main drawbacks to extensive code comments is the chance for the documentation to fall out of date with the implementation.
  • +1
  • Wait, how does this get enforced?
  • Open question in the current paper — I think they just kinda did it before you made a PR-ish
  • I think they interleave the LLM-generated outline in the code, based on one of the figures in the paper.
  • Right, but when do they get updated?  I wasn’t totally sure how to best keep them up to date. Like, imagine distant code changes, but that changes what this line means… do you update the comment for this line, even though this file wasn’t changed?
  • CodeSearch has a notion of “indexed” revisions; so they could probably have generated the outline in the code indexing pipeline and then for newer versions of the code, the outline would be out of date? Just one possibility, but that’s how it works for x-references, at least.
  • Random comment: I used to share a 2-person office with the first author
  • Interesting! Do you think they are worth interviewing on the podcast?
  • I’ve always been excited about the idea of viewing code and software systems at multiple levels of abstraction, and I think this idea of summarizing what’s going on gives you part of that.  I’ve wondered if, when generating code (like in the monkeys paper), it might be helpful to generate these kinds of comments first using a very high temperature, so as to sample the space well, then generate the implementations for each outline using a much lower temperature (rough sketch below).  (Even better if your verifier can operate at the summarized level and you don’t need to generate implementations for all the summaries.)
  • This idea reminds me a little of how diffusion models work, turning noise into an image conditioned on a prompt.
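  • A toy sketch of that two-temperature idea (`sample(prompt, temperature)` is a hypothetical completion function, not a specific API):

```python
def outline_then_implement(sample, problem, n_outlines=8):
    # Hot sampling for diverse, comment-only outlines...
    outlines = [
        sample(f"Write a short comment-only outline for solving: {problem}",
               temperature=1.2)
        for _ in range(n_outlines)
    ]
    # ...then cold sampling to implement each outline faithfully.
    return [
        sample(f"Implement this outline exactly:\n{outline}", temperature=0.2)
        for outline in outlines
    ]
```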
  • An idea: if we have golden solutions at the Python/unit test level of abstraction, we could use that to train a verifier at the higher-up “English/pseudo-code” level of abstraction by asking the “English/pseudocode” verifier to state whether the pseudocode leads to a generation of Python that will pass the tests.  This reminds me of the “backtranslation” that Llama 3 did.

8/9/2024

Questions
  • @Amy H this might be a nice approach for de-duplicating training data:
  • Semantic deduplication: Finally, we perform semantic deduplication (Abbas et al., 2023; Liu et al., 2024c). We first cluster complete dialogs using RoBERTa (Liu et al., 2019b) and within each cluster sort them by quality score × difficulty score. We then do greedy selection by iterating through all sorted examples, and only keeping the ones that have maximum cosine similarity less than a threshold to the examples seen so far in the cluster.
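  • A rough numpy sketch of the greedy step in that procedure (the paper uses RoBERTa embeddings and quality × difficulty scores; here they are just an (n, d) array and an (n,) array):

```python
import numpy as np

def dedup_cluster(embeddings, scores, threshold=0.95):
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i in np.argsort(-scores):                # best quality x difficulty first
        sims = embeddings[kept] @ embeddings[i]  # cosine sims to already-kept examples
        if sims.size == 0 or sims.max() < threshold:
            kept.append(i)
    return kept                                  # indices of examples to keep
```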
  • Re: “In Llama 3 405B pre-training, we increased context length gradually in six stages, starting from the original 8K context window and ending in the final 128K context window. “
  • Was there a plot showing how this behaved as the model dynamically adjusted to the longer context windows?
  • Bowei: are we doing or planning on doing the “averaging” technique they mentioned in post-training? Is that standard nowadays?
  • I don’t think we’re doing this (?) We had some short discussion on it last time
  • Re: “To address this limitation, we introduced execution feedback as a source of truth, enabling the model to learn from its mistakes and stay on track.”
  • I love this!  I even wonder if it could be extended to allow model training to follow the scientific method and propose general “hypotheses” during training and then test them out.  As in, model says: “here’s a code snippet where I’m not very certain what the output will be, let’s run it and find out and fold that into my training data.”
  • That sounds interesting, sort of combining CoT and some of the code repair techniques
  • I’m super curious how many people-months were spent on this (since llama2).  It feels like ~500 people working for all that time.  Anyone have a guess or know?
  • I got 220 core contributors (2/3 of runtime) and 310 contributors (1/5)
  • Was that in the paper?
  • In acknowledgements and then I just approximated by counting commas
  • Nice, thank you!!  I would guess there is also some core infra support that might not get listed possibly.
  • Coordinating an effort of that size with such nice focus is impressive!
  • I’m curious to see if some of these things they’ve tried (having it learn refusals, code-sided tasks, etc.) could make it easier to work with than GPT-4 or Claude-3.5. Maybe some lift that doesn’t show in evals
  • “ For more challenging prompts, we use Monte Carlo Tree Search (MCTS) with learned step-wise reward models to generate valid reasoning traces, further enhancing the collection of high-quality reasoning data (Xie et al., 2024).”
  • This is a bit more like the CoT stuff I’m working on, but the tool use stuff might be relevant to Zack
  • What is the “mathematical and reasoning” data they use for pretraining?
  • Are we trying to make/finetune our own model that can do a better job on these benchmarks? I am unclear on the high level utility of this paper to our company
  • We observe that adding general rules of good programming to the prompt improves the generated solution quality. Also, we find it is helpful to require the model to explain its thought process in comments.
  • @Hynek U @Moishe L looks like the OSS instruct-style stuff is important for them:
  • First, we generate a large collection of programming problem descriptions that span a diverse range of topics, including those in the long tail distribution. To achieve this diversity, we sample random code snippets from various sources and prompt the model to generate programming problems inspired by these examples. This allowed us to tap into a wide range of topics and create a comprehensive set of problem descriptions (Wei et al., 2024).

8/2/2024

Questions
  • can you explain this from the intro - "responses from a similar model are 'easier-to-fit' than those from a more capable model, resulting in reduced memorization”
  • (abe explained in person that reasoning traces from large models may be hard for the small model to reason about so it instead memorizes them)
  • follow-up: i wonder if we can detect when the model learns to memorize vs “reason about” a new example
  • note: RFT uses GPT4 for golden answer
  • why did they call it “rejection finetuning” (from the paper … RFT; positive self-generated synthetic data from the SFT model) — is it because they’re rejecting the incorrect solutions?
  • Yes, there is a paper about this RFT method they are referencing, it’s very similar to STaR, without the step of generating post-hoc reasoning traces
  • Can you explain this more: “An alternate option would be to skip the computation of advantage estimates but instead rely on implicit approaches that optimize the advantage-weighted objective without computing their values”. My understanding is that the easiest way to compute the advantage estimates is to complete many partial solutions. This seems like a pretty expensive process? How does it compare to just producing 8x the amount of synthetic data to begin with?
  • I think what they’re roughly saying is that if you directly compute all the advantage estimates, you can just use AWR out of the box. But computing these advantage estimates can be expensive, as you’re saying, and so you can use methods that implicitly optimize this objective without having to directly compute the advantage estimates. And the claim is that the per-step DPO does this.
  • Right, and doing these rollouts on our own model is not very expensive (vs getting them from GPT4)
  • Is the original synthetic data produced by GPT4 or our own model?
  • The prompts/problems are produced by GPT4 but the reasoning traces are all produced by our own model
  • So for the per-step DPO do we still need to do rollouts?
  • Yes the rollouts are all on our own model, and we need to do a lot of them (I think they do 4 rollouts for each step of the reasoning trace) to compute the advantage
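  • A sketch of that rollout-based advantage estimate (our reading, not the paper’s code): estimate the value of each prefix of a reasoning trace by completing it several times and scoring correctness, then difference the estimates.

```python
def step_advantages(problem, steps, rollout_is_correct, n_rollouts=4):
    # rollout_is_correct(problem, prefix) -> 1.0 if a sampled completion of
    # `prefix` reaches the correct final answer, else 0.0 (hypothetical helper).
    values = []
    for t in range(len(steps) + 1):
        prefix = steps[:t]
        values.append(
            sum(rollout_is_correct(problem, prefix) for _ in range(n_rollouts))
            / n_rollouts
        )
    # Advantage of step t: how much appending it changes the value estimate.
    return [values[t + 1] - values[t] for t in range(len(steps))]
```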

07/26/2024

The Llama3 Herd of Models

Questions
  • Glossing over the details tremendously, it seems like what they do is “a sensible approach to each subproblem”, and put it together. Is that your impression?
  • yes, i thought very little of this was surprising (other than perhaps the % of post-training data that was generated) / most of our pretraining was planning to go in similar directions
  • Wasn’t one of our big conclusions from trying to replicate llama2 that their data was better than ours + had improved more than we expected
  • The annealing on gsm8k is something I didn’t know about and we didn’t do. We underperformed relative to Llama 2 & 3 most on gsm8k and arc.
  • Generally, I would assume that somewhere in this entire report is some technique that secretly makes a big difference, we only did one big run so getting close is pretty cool already
  • yeah but we didn’t use our best data lol
  • 😞 
  • +1 to this. My takeaway was just do all the simple things, but that requires a lot of human engineering
  • I’m very surprised they released something this detailed/curious what their motivations are. Is it since it doesn’t seem to perform as well as GPT-4o/Claude-3.5 that motivated them to write a paper and open source it?
  • Interesting, my sense is that they have committed to the open-source track and they are essentially celebrating “the best open-source model in existence”. They also are not that far behind gpt-4 and claude on benchmarks it seems
  • Hm, this makes sense, but the delays in this model release make me wonder if they were pushing out some things last minute. Regardless, probably pointless to speculate here
  • I think one thing to consider and which mark zuckerberg mentioned is that unlike anthropic and openAI, meta do not rely on closed-source models financially so they can afford to release open-source which also helps them by allowing them to essentially adapt any improvements found by the open-source community, like getting a whole bunch of free engineers/researchers 
  • Yeah idk if people like ourselves will really be free engineers for them, but there is a decent chance that some companies will prefer to build on fine-tuned Llama rather than fiddle with APIs and prompts. I think they’re making a sensible bet, especially because they are side-stepping the competition from Anthropic, Google and OpenAI simultaneously. 
  • certainly all the big open-source projects will make an effort to support (and optimize for) llama3 models, though i agree people like ourselves will not really be helping
  • They seem to have a fully automatic system for generating coding evals (actually this is for generating finetuning data). Am I reading this right?
  • The grounding in code snippets is interesting, we have previously used wikipedia articles as sources to base problem statements on, but that may not be as large a pool. 
  • We do have wild code snippets as sources now (as one of the options), but there’s still work to do on expanding and diversifying sources
  • From my understanding, yes. At least for code generation (i.e. CoPilot), my guess is that there are likely a bunch of coding challenges they don’t have eval generation working for, like SWE-type tasks
  • Interesting, is this how we’re generating our own evals? Just ask an LLM to make a problem, run it through our system, and check it against model-generated unit tests?
  • Yes
  • To some extent, I think this is our vision, but we still do need human oversight to fix them up or correct them. 
  • for training data, we should potentially consider doing multiple iterations of making code + tests and running them against each other bc manual filtering will no longer be an option for large datasets
  • This is the vision 😉 Your work will be invaluable for this
  • Agreed but I wonder if we only want filtering or we also want to be able to fix things in an automated way (not everything can be fixed of course but basic things to cut down on human time)
  • wait yeah sorry I meant using multiple iterations to fix either the code or the tests depending on which is more problematic at the moment
  • Zack did some work with this on exercism 
  • Which section is this?
  • page 20, section 4.3.1
  • Oh isn’t that training data not evals, sorry the term confused me
  • Ah yes, you’re right. That makes much more sense!
  • Is annealing and the approximation of checkpoints a technique that also applies to SFT?
  • I don’t think they do much annealing in SFT, but they do mention something about using averaging of checkpoints (4.1.3) during post-training
  • Yeah I guess this was a question to see how relevant it was to us since we don’t have plans for more pretraining
  • This was a post-training thing they tacked on, right? On a smaller set of math-only data?
  • for the annealing yes
  • Interesting nugget for @Amy H
  • Unit test generation and execution: For each problem and solution, we prompt the model to generate unit tests, executed in a containerized environment together with the solution, catching run-time execution errors and some semantic errors.
  • The static analysis is interesting, we can do this too (and to some extent already do: if it doesn’t libcst-parse, it scores 0). 
  • is this different from what we are currently doing?
  • I think it’s conceptually the same idea
  • we do this but we aren’t really doing anything about the passing and failing tests yet - wait is this abt the data gen funnel?
  • We generate/filter/score but don’t fix
  • True, but my understanding is that this is just because we are still manually reviewing data and would be different for training data for example 
  • Right, I was thinking about training data (since it would reuse a lot of the pipeline features)
  • yep, I guess this feels like a very easy tweak to make which I assumed would happen for training data but we should definitely keep it in mind when we get to that point
  • +1 (tbh could lead to some delta in human effort for evals too, but more necessary for training data for sure)
  • I do agree that the model self-correct is probably an interesting area we haven’t explored much yet
  • infra comments
  • we already knew this I think but oof they use RoCE @Bowei L 
  • 466 job interruptions in 54 day period 
  • 78% of interruptions were hardware failures - seems about in line with us but I haven’t done the math
  • Our collective communication library for Llama 3 is based on a fork of Nvidia’s NCCL library, called NCCLX. NCCLX significantly improves the performance of NCCL, especially for higher latency networks.
  • Have we tried this? I know we have the bawr NCCL fork but I think that’s off of base NCCL and was mostly for debugging
  • 43% MFU on 8k gpus (and 38-41 on 16k)
  • no activation checkpointing → 33% relative speedup
  • 4DP? I assume that would boost MFU decently cc @Vincent H 
  • yeah
  • FP8? jk that’s just for inference
  • They said they have 90% effective training time, is that good?
  • pretty good
  • do you know what we were at? I don’t think much less than that over the last few weeks; before that, much lower
  • yeah that sounds about right for the last few weeks. earlier weeks were probably anywhere from 30-50%
  • wait why is this? just since we aren’t utilizing GPUs?
  • and relaunching from crashes wastes like 30 minutes each time
  • yeah this mostly I think, not sure what you meant by not utilizing GPUs? I’m not thinking about idle machines if that’s what you mean
  • oh yeah I was just confused what effective training time was measured by

07/19/2024

Prover-Verifier Games improve legibility of LLM outputs
  • Questions
  • Why make the sneaky and helpful models the same? That seems pointlessly confounding. I mean, it probably works fine, but still…
  • +1
  • I think it’s supposed to be a feature, in that the verifier has to pay attention to the actual trace and not surface level details of the two models. 
  • I guess maybe you get more stable training (since you don’t need to rely on the sneaky verifier learning to mimic the honest one)
  • Yes I think so
  • It does feel like you force the model into having a split personality 🙂 
  • This seems very common in papers, I think it comes from some hand-waving/relaxation of constraints by assuming that a single large NN approximates/represents the family of models. Since they’re all NNs you can’t guarantee completeness/soundness anyway
  • This feels like a text version of GANs, is that a reasonable interpretation?
  • It does! Only both the correct/incorrect samples are sampled from the model, not partly from the ground truth dataset
  • But the correct ones must have their final answer coming from some kind of ground truth, right?
  • Yes you need a ground truth correctness label. They speculate ways of doing it unsupervised but they don’t actually do those
  • Yes
  • Yes it is a similar idea
  • Is there(/does anyone have) an explanation for why we generally see longer answers for the correctness-only/pure RL model?
  • I’ve wondered if “thinking for longer” gives more compute to figure out an answer, so RL biases towards this behavior.
  • That sounds reasonable to me. It’s also pretty cool that the model learns to do this - based purely on a binary reward, it optimizes its reasoning process
  • It might also be that there is no penalty for longer generations, and sometimes it does help, if the chain of thought makes sense. So it’s a weakly positive signal that the model overoptimizes. Or more accurately, it is never made aware that there is a maximum length, so it has no reason to be concise. 
  • Presumably if you continued RL it would become aware of the risk of over generating and never finishing?
  • This may be hard to learn but yes
  • How are the answers checked to be correct? Is it just the final output (e.g. 45), or does it also check each step (e.g. 3x3=9, 9x5=45)?
  • Final answer, as far as I can tell
  • I’m a bit baffled by the fact that increased legibility doesn’t help (and rather hurts) accuracy/correctness…
  • I think it’s supposed to be a consequence of training for a joint objective - it will always do worse if the two objectives aren’t aligned. 
  • They discuss training a model for correctness first, and then doing this procedure, but they don’t do that in this work. 
  • +1 - my thought is that being more verifiable biases towards shorter reasoning traces → worse max accuracy. But why does it increase then decrease during training rounds?
  • The prover is no longer only optimizing for accuracy so I guess that hurts the accuracy?
  • Yeah this would make sense
  • I wonder what the correlation is between legibility + correctness is
  • Is there an intuition for why we see the model accuracy go down during each round of training? Like I can understand why it doesn’t reach as high as pure-correctness, but why does it go up early in the round then decrease?
  • I wonder if decreasing the weighting on the verifier component of the prover reward would help here
  • +1
  • Re above two questions about why legibility hurt accuracy/correctness:
  • Here they’re training the prover, but do people adversarially train verifiers using this sort of setting? 
  • They do train the verifier as well - both are mutually trained on each other
  • Which part do you find more exciting - provers with better legibility or verifiers that have been adversarially trained? 
  • The helpful prover in the end is what matters - the human rated scores of that
  • I’d be interested to see how this particular type of verifier performs compared to previous baselines
  • +1
  • as in training verifiers with using helpful + sneaky?
  • Yes
  • I don’t know the answer to your question but I think it could be an interesting idea to test out + explore if we haven’t already done it
  • re “why publish this instead of redoing experiments” - i think it came from the superalignment team which was disbanded 😛 
  • lol
  • +1 this is the real point. This isn't actually about making the model better, otherwise they wouldn't have released it
  • There’s a rich literature from GANs on images, I wonder if other tricks from GANs can be borrowed here
  • e.g. double discriminator updates, spectral normalization, r1 regularization, etc.
  • not everything is transferrable bc the CoT traces aren’t differentiable…
  • +1
  • I feel like “legibility” is a confusing term to use - it feels more like both clarity and accuracy of chain of thought vs accuracy of the final answer to me
  • re “optimizing for accuracy makes the model output longer and longer things until it runs out of context” - do you know how big the models / context window here is? i imagine this wouldn’t happen for modern large-ish context windows
  • Why does the verifier have to be smaller?
  • I find their explanation for why verifiers that match the size of the prover perform worse in Sec 4.2 to be unsatisfying: “If the verifiers are very similar to the provers, the sneaky prover can more easily simulate the verifier’s mistakes (Burns et al., 2023), making it more difficult for the verifier to become robust.” 
  • +1, this doesn’t make much sense to me either.
  • Why wouldn’t this happen when the verifier is 1-3 ooms smaller?
  • What happens if the verifier is bigger? Isn’t that the interesting direction? I imagine they’re constrained by their RL/ability to make the prover stronger

07/12/2024


  • This is what we’re doing. ^

  • I think the DPO vs ORM might be that DPO takes advantage of pairs of correct/incorrect solutions, see “The verifier V t is trained using DPO with a preference dataset constructed from pairs of correct and incorrect solutions from DVER”. Naive ORM would treat (x,y1, check) and (x,y2, no-check) as separate, correct? @Evan L?
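  • A sketch of that data-construction difference (our reading, not the paper’s code): an ORM scores each (problem, solution) independently, while DPO needs a (chosen, rejected) pair of solutions for the same problem.

```python
def build_orm_examples(samples):
    # samples: list of (problem, solution, is_correct)
    return [(x, y, int(ok)) for x, y, ok in samples]

def build_dpo_pairs(samples):
    by_problem = {}
    for x, y, ok in samples:
        by_problem.setdefault(x, ([], []))[0 if ok else 1].append(y)
    pairs = []
    for x, (correct, incorrect) in by_problem.items():
        for y_pos in correct:
            for y_neg in incorrect:
                pairs.append((x, y_pos, y_neg))  # (prompt, chosen, rejected)
    return pairs
```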
  • The Related Work lists “Self-training and self-improvement”, which are different ways of using generated solutions and correctness pairs to do RL. This seems more obvious to me than the “adding tuples to the training dataset”. The section suggests that the V-STaR method gets most of its benefit from the verification, so can we simplify the approach?
  • I’m not quite sure what you mean?
  • In Figure 1, you have a set of (x, y, checkmark), which is exactly what you would need to do an RL update: analogous to (state, action, reward). So we can just take the gradient of P(y | x, w) and increment/decrement our weights. Instead, they suggest something that seems to be “if checkmark == true then add to original dataset”, which is strange and not obviously better. See Abe’s comment below. 
  • Could we maybe get better performance by also using the negative examples together with positive examples for the policy training, using DPO or similar?
  • Yes this is what I am thinking as well, see above
  • I still am surprised we aren’t seeing comparison to a baseline (non-CoT response) in STaR papers, we might be accepting a lot of bad reasoning because the model is just good enough to solve the problem without using the reasoning. Maybe it’s not that important?
  • +1 that it seems worth checking for bad reasoning in easy problems
  • How is the ORM baseline trained?
  • I find the ORM results to be extremely hard to believe. They aren’t at all consistent with our GSM8K ORM results.
  • Maybe because they have to use LoRA? Their ORMs might not be particularly good?
  • If I’m interpreting their plot correctly, it seems to be saying that the ORM is worse than random for larger k’s.
  • That might make some sense - because you’re selecting the highest ORM score, if the model is not good, highest score happens in places where the ORM is miscalibrated or high-variance
  • But not good → worse than untrained?
  • Yes, because of the overoptimization argument. Taking argmax_solution(score(solution)) can be bad even if there is a correlation between score and the correctness of the solution; extreme cases are outliers (see the toy simulation at the end of this thread)
  • Like, I believe that this can be true, but I also think that surely you can train an ORM that is at least as good as an untrained ORM
  • Yes, I can do it
  • :bufo-amaze
  • If you are training it for the objective of making argmax good
  • I do think it argues that the ORM is bad or not carefully regularized
  • Which plot? I’m not sure I follow. Ah, because 1 == random? 
  • → that one
  • Yeah… maybe their ORM was weirdly overfit or trained incorrectly? I guess I’m also confused why it would do so much worse than DPO / what the substantive difference is there
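  • Toy simulation of the overoptimization point above (made-up numbers, not from the paper): the verifier’s scores correlate with correctness on average, but incorrect solutions are scored with higher variance, so argmax over many samples increasingly picks incorrect outliers and best-of-k falls below the random-pick baseline.
        import numpy as np

        rng = np.random.default_rng(0)
        p_correct = 0.3   # generator's per-sample accuracy (random-pick baseline)

        def orm_best_of_k_accuracy(k, trials=20_000):
            hits = 0
            for _ in range(trials):
                correct = rng.random(k) < p_correct
                scores = np.where(correct,
                                  rng.normal(1.0, 0.5, k),   # correct: higher mean, low variance
                                  rng.normal(0.0, 2.0, k))   # incorrect: lower mean, high variance
                hits += bool(correct[np.argmax(scores)])
            return hits / trials

        for k in (1, 4, 16, 64):
            print(k, orm_best_of_k_accuracy(k))   # drops below p_correct as k grows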
  • Just noting: Our pipeline is pretty much the same as this paper, except we don’t use DPO and do “RL” instead
  • How many iterations do they train their full V-star for? From skimming I couldn’t see how performance changes with the number of iterations. 
  • They train for 3 epochs for MBPP - they have a V-star[1] baseline which is for 1 iteration. The difference in performance isn’t that big tbh - about 3%
  • Yea I agree, do they say why they didn’t train for more epochs? I guess lack of data? Like they’d have to reuse questions or something.
  • Why aren’t the Verifier@1 rates the same for all the approaches? Shouldn’t that be doing no verification, so that if you have the same generator, you get the same results?
  • Verifier@0 is with no verification
  • I think you mean @1? Figure 5 says: “Verifier@1 is equivalent to not having a verifier and is equal to Pass@1 of the generator.”
  • They say @0 just above section 4.6
  • In Fig 5, they say “Verifier@1 is equivalent to not having a verifier and is equal to Pass@1 of the generator.”
  • So this raises Evan’s question
  • Hmm must be a typo then
  • Yes I assume so
  • in Fig 5 isn’t vstar for verifier@1 essentially the same as just using star? so the difference in success rate is from having multiple iterations of star versus 1 iter
  • Interesting tidbit in 4.7: “We also tried combining verifier scores with reranking strategies, such as weighted reranking and weighted majority voting (Liu et al., 2023), but did not observe performance gains” (a sketch of what weighted voting means is at the end of this thread)
  • I was optimistic about using verifiers for reranking or weighted voting :(
  • Wait — isn’t this good news? They are using the verifier to re-rank, they just do it in a simple way (by taking the max)?
  • I think this might be talking about specific forms of more complicated re-ranking.
  • They also said running the verifier in the training loop (to filter correct solutions to include in D_gen, D_ver) does not help, because you can just make k bigger to sufficiently explore
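  • For reference, a sketch of the difference between the max selection they use and weighted majority voting (illustrative only, not the paper’s code): group candidates by final answer and sum verifier scores per group instead of taking the single highest-scoring candidate.
        from collections import defaultdict

        def select_by_max(candidates):
            # candidates: list of (answer, verifier_score) pairs
            return max(candidates, key=lambda c: c[1])[0]

        def select_by_weighted_vote(candidates):
            weight = defaultdict(float)
            for answer, score in candidates:
                weight[answer] += score          # sum verifier scores over identical answers
            return max(weight, key=weight.get)

        candidates = [("42", 0.9), ("41", 0.8), ("41", 0.7), ("42", 0.2)]
        print(select_by_max(candidates))            # "42" (single best score)
        print(select_by_weighted_vote(candidates))  # "41" (more total mass across candidates)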
  • What do we think are the main takeaways?
  • This seems pretty reasonable and easy to do and works for both code and multiple choice? I don’t see a particularly good reason why you wouldn’t “make your model better by using more data” and then “make a verifier that works well and then use it to select the best result”
  • Yes I am not seeing things that are very different from what we have been doing on gsm8k or what we discussed in our research goals earlier today, except some details that we don’t know if 
  • We’re already doing this 🙂 , except the DPO part → some PPO thing for us
  • I’m not sure tbh. 


06/28/2024

  • In their code generations, do they get a worse score since the generations actively degenerate and introduce bugs? I think Evan, Maksis, and I noticed this could occur when we tried out code repair on Exercism
  • This could also explain why coding is specifically worse than Alfworld? Since in Alfworld it would reset its state. Not sure
  • They exit early if everything passes, but in some cases it will just get into a bit of a loop where it can’t get all the tests to pass because the tests are wrong
  • Right, I think we saw this too. Sometimes it can repair but if it fails it can just introduce fake fixes/syntax errors/new bugs
  • How frequent are cases where the model gives an incorrect reflection then fixes the actual mistake? Is it that the model doesn’t pay attention to the reflection at all and gets it right on a second try based on the history of the first?
  • I think it could, maybe it is just leveraging the fact that it gets the failed tests as well. But I think they did an ablation on this and the reflection actually helps — it’s hard to understand
  • Huh, so it does significantly worse if you only give it the reflection but not the test cases? That makes sense I guess given that the reflections often don’t help. But also not sure why it does better with both??
  • I was confused about this too. Why would the reflections hurt though?
  • I would assume that they can be lies, but this also suggests that they don’t early stop if it solves the problem
  •  Maybe even if the self reflection is wrong, it helps it do something different which is helpful?
  • Reflexion seems to hinge a lot on having the models make accurate self-evaluations, which appears to be the common bottleneck for all agents techniques today. Is there any contribution towards that in this paper?
  • For example, OpenAI’s CriticGPT blog post show that GPT-4 fine-tuned to make accurate critiques still hallucinates ~2x as much as a human annotator would (but is more comprehensive)
  • That’s pretty interesting!
  • I would have imagined that using GPT to self-critique would plateau sooner, but it seems they were still getting returns by running up to 10 trials on a given environment
  • Wait, for CriticGPT do they have the critic give self critique to GPT-4, which generates fixes/new code? If so it would make sense that you get some additional returns, since CriticGPT should have seen different data than GPT-4 and probably wouldn’t make the same mistakes
  • Yes it’s the case that CriticGPT has been fine-tuned on synthetic mistakes with human critiques
  • Wait, so on human eval, the "base" doesn't use the doctests? Doesn't use meaning "doesn't actually execute"
  • Right, it gets to see them statically but never executes them. It also uses them as a verifier (will retry if they fail but not if they pass)
  • I wonder how much of the 11% lift is from just executing these
  • +1
  • Yeah, I feel like this is why it worked on HumanEval but made it worse on MBPP — maybe all of the lift is from this
  • How can we improve on Reflexion?
  • What’s the gold in Reflexion?
  • They got this internal review and feedback to do something
  • They released the code so we can see what they did
  • Seems like a normal critic feedback system to me
  • Same; at a meta level, my gripe with this paper is that they’re obfuscating a pretty straightforward concept by abstracting it behind the same names as PyTorch primitives
  • +1 - ie loss/error is a precise technical term != “here’s some improvement function”
  • Yeah, I disliked it because of that, but it actually seems better written and more thoughtful than most of the papers in this class (we slightly changed the prompt flow and got different results)
  • Yeah, it feels like a lot of the paper is really fluff around a simple concept. Are there any specific results that are impressive that there’s gold in? (or maybe I just don’t understand what they’re doing well enough?)
  • Is the only difference that this system performs reflection on segments of the proposed prompt, rather than the entire proposed prompt at once?
  • I think it’s always trying to optimize just one piece of text, but I haven’t dug too deep into all their test cases
  • Dumb question: how do they actually compute the loss? What’s the formula?
  • There’s no loss in here, it’s simply an analogy to the loss used for machine learning — the “loss function” is the “task you’re trying to do” and the “gradient” is “how you might do it better” (see the sketch at the end of this thread)
  • How do they call loss.backward()? Are they just doing textual feedback under the hood, nothing like what PyTorch actually does?
  • What’s the analogy of a negative gradient here?  Is there some kind of signal that an output from some part of the system is irrelevant and should be silenced/ignored?
  • I think at some point they tell the LLM “if there’s nothing to improve don’t output anything” which would be a zero gradient type thing (or maybe I’m misremembering)
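  • To make the analogy concrete, a minimal sketch of what the “backward pass” amounts to (llm() is a hypothetical function returning a completion string; there is no numeric loss anywhere, just critique text flowing back into a revision):
        def textual_backward(task, solution):
            # The "loss.backward()" step: produce a critique (the "gradient").
            return llm(f"Task: {task}\nSolution: {solution}\n"
                       "List concrete problems with this solution. "
                       "If there is nothing to improve, output nothing.")

        def textual_step(task, solution, critique):
            # The "optimizer.step()" step: rewrite the solution using the critique.
            return llm(f"Task: {task}\nSolution: {solution}\nFeedback: {critique}\n"
                       "Rewrite the solution to address the feedback.")

        def optimize(task, solution, iterations=3):
            for _ in range(iterations):
                critique = textual_backward(task, solution)
                if not critique.strip():    # the "zero gradient" case mentioned above
                    break
                solution = textual_step(task, solution, critique)
            return solution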
  • What is the gold in this paper?
  • Even if it’s just better prompts, they’re clearly doing something right because they get 36% zero-shot on LeetcodeHard vs. Reflexion’s 1-shot 31%
  • Shockingly, are they doing anything that isn’t just better prompts? They tend to beat a lot of other things purely based on that …
  • Yes, they treat the solution instance as something to be optimized, not just the prompts:
  • There are two classes of optimization problems we explore. In instance optimization, we directly treat a solution to a problem—e.g., a code snippet, the solution to a problem or a molecule—as an optimization variable. For instance, in Equation 13, we have a code instance that we would like to improve at test time. Our framework produces the gradients for and directly optimizes the code variable. In prompt optimization, the goal is to find a prompt that improves the performance of an LLM across multiple queries for a task. For example, we may want to find a system prompt to an LLM that improve the performance on mathematical reasoning questions (see Section 3.3 for examples). In particular, we want the system prompt to generalize, in contrast to instance optimization where the only goal is to improve the solution for a given query at test time. Crucially, both types of problems can be solved without hand-crafting the framework.
  • Thanks for this — I clearly could not grok what was going on in this paper at all 😅 
  • They write about the prompt optimization work but I’m not familiar with the area, is it very different from what others do?
  • Prompt optimization is across many examples in a type of task; as Thad pasted above, instance optimization is where they take the feedback from the specific problem
  • What specifically about the prompt optimization gets so much lift though? I haven’t looked at DSPy so wasn’t sure what they do in comparison
  • DSPy seems mostly optimized around picking the best few-shot examples from a collection you happen to have of them.  What I wanted DSPy to be, and what this system seems to do, is provide a mechanism to propagate the failure modes backwards through the prompts, helping system figure out what to “watch out for” at each step.
  • Anything especially successful in their prompting that we could extract and learn from? If they’re able to get this much lift out of prompt optimization maybe we should see what we can pull from that
  • Isn't that the point though?
  • Do you think this would generalize to other models? It seems like they rely on having a larger / better model in order to do prompt optimization
  • They get a big boost from prompt optimization:
  • I feel like the fancy packaging is making it harder to understand not easier, once you peel that back it might be clearer
  • Yes I agree
  • lol i feel like this is just playing off the stereotype that “bio people don’t understand calculus so differentiation is a big deal for them”. see: their chemical and medical applications

06/14/2024


  • What are the main details that you think we should take from this paper?
  • Well it’s a bit hard to say because the current problem we’re working on is a little different (less agentic, more code completion) but I think in general it makes sense to reconsider how we’re presenting information to the LLM and how to use context most effectively, possibly by removing a lot of the things we currently put in context
  • I guess I'm asking if there are techniques that we should try to lift wholesale or if we should just think about this problem more
  • Like which ones do you think we should copy and which ones do you feel like are interesting but need more iteration?
  • I think we should probably try copying the thing where we only show the last few steps of environment/llm interaction and provide brief summaries of earlier steps (rough sketch a few lines below)
  • One crazy idea I had, relatedly… you could imagine giving a “consolidate my previous context” tool/operator to the agent, and just forcing it to do its own context management… (then it could, in theory, be learned, in the same way that we learn other tool use)
  • Showing the results of edits / refactors is probably a good idea too. Eg. when we were prompting GPT4 earlier this week, it would edit a function and then we never presented the edited function along with the rest of the code
  • +1
  • Are we not working on agents? Near-term that might be true but I really hope we move in that direction; that’s my understanding of what the company vision was
  • I think Evan Vincent means specifically the code eval sets we’re working on right now?
  • Did you mean Vincent not Evan?
  • yeah sure something like that
  • I agree that for the current state of our eval datasets, SWE agent is slightly different. Zooming out and looking at agents in general it may be useful (depending on what exactly we want to make).
  • Are there particular techniques that we think are potentially promising for solving our current evals? Or, perhaps, our new set of more complex evals?
  • (Opinion) I think having an abstracted interface with less noise for the agent will become table stakes soon
  • what makes you say this? i feel like i don’t know anyone working on this
  • I think people have been thinking about how to present code information to LLMs
  • It’s too bad @Michael R isn’t here today—I feel like he’s been thinking about this a lot.
  • I guess there are a few directions things could go; a natively multimodal model might be able to navigate & use a computer like a human in >2 yrs from now but anything near-term will need the engineering lift to make computer interface more ergonomic for LLMs
  • They’re just not good enough at managing their own context, tracking and reasoning about external state, ignoring red herrings, etc. It’s that much harder when you let them fill their context with nonsense because they accidentally ran cat on some huge file
  • The action space on computers is also huge, and the trajectory for productively solving these problems is relatively narrow
  • We’re effortlessly chunking and blocking out a lot of noise when we use computers today (some at the visual field level, some at the cortex level); I think it intuitively would help LLMs a lot if we simplified the things they had to pay attention to and the actions they had to select from
  • I agree based on the basic issue of currently trying to fit multiple files into the context window and having way too much information in there - there needs to be a way for LLMs to navigate repos themselves and this is one of them.
  • huh, this is an interesting framing. I never really thought of agent-enabled navigation as a way to manage context, but I guess that’s what it’s all about. I wonder if there are patterns from human navigation design that make sense for LLMs and how many novel patterns we’ll have to uncover to enable agents to navigate code in a way that makes sense for them
  • Yeah I wish I wrote a doc on this instead of the dumb reactive/proactive async idea; we do this the same way with menu navigation in all our software, it’d be overwhelming for every option in an app to be flattened out in full all the time
  • Absolutely agreed—this is why I’m upset they didn’t talk about IA at all! The way you collapse information seems really important.
  • I’m not sure they don’t talk about it—it’s a 118 pg pdf, there’s a ton of appendix
  • I think the problem of “how should an LLM edit code” is under-explored.  SEARCH/REPLACE blocks are one way; is it similar to the “Compact, efficient file editing is critical to performance.” that they espouse?
  • Yeah, search/replace is an alternative to their edit command (which is very line-number based); a sketch of applying one is just below. We could certainly try theirs out!
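  • A minimal sketch of applying a SEARCH/REPLACE-style edit, for comparison with their line-number-based edit command (the format is illustrative, not theirs):
        def apply_search_replace(path, search, replace):
            # The edit only succeeds if the SEARCH text appears exactly once, which
            # forces the model to quote the code it wants to change rather than
            # trusting line numbers.
            text = open(path).read()
            if text.count(search) != 1:
                raise ValueError(f"SEARCH block must match exactly once, got {text.count(search)}")
            with open(path, "w") as f:
                f.write(text.replace(search, replace, 1))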
  • How “chat-like” is the interface? Are they showing all the previous steps in the context, or is each action considered independently?
  • They say this in the paper: “It also tracks the history of all previous commands and observations and, at each step, manages how these should be formatted and combined with high-level instructions into a single input for the LM.”
  • In elementary school math, we have problems with purposely irrelevant / distracting context to teach us to ignore it.  I wonder how much that’s been done to LLM training sets?  (Constructing these questions with distracting context is an art!)
  • Same thing in interview questions
  • Huh, interesting framing — I know that Gemini did a lot to work on the “needle in the haystack” problem, which I think is equivalent to your formulation, and they did manage to improve it a lot vs. GPT-4. Twitter is not loading so I can’t find the thread.
  • Is “success quickly, fail slowly” a property of the agents, or of the difficulty of the task? (Pasted Thad’s question from Zoom chat)
  • Andrew: “My guess is that the LLM gets stuck in failure loops, like trying to do the same thing repeatedly”
  • Eric: “I think it’s a property of high action space environments with path dependency. And the fact that LLMs are kind of like imitation learning agents. So once you’re off the beaten path, you’re OOD.”
  • Thad: “Yes, and some tasks take you off the beaten path and others don’t.”
  • “One solution is to have a differentiable model of the environment and minimize a loss based on the difference in end-state of the environment after each agent action in an episode vs. the state from human demonstration/ground truth”
  • interesting, are people doing this with code / is it possible? i’ve only heard about this in physics-based / robotics settings i think
  • That’s where I know it from; I don’t know if it’s possible with code but on the surface it seems plausible? Unit tests are kind of that way, though they’re not differentiable. Maybe a straight-thru estimator is enough
  • They say in the paper that as the LLM makes more failed edits, the chances of recovery drop off. Looking at the action space, they don’t seem to have an undo/revert, so the only option is to keep editing. Could having undo/revert options help?
  • +1, some actions don’t have a universal undo but undoing even some actions would help a lot!
  • What is there in the state that’s not part of a git repository?  You can always return a git repo to a prior state, right?
  • True! I was thinking of POST request to a 3rd party API, anything external
  • Yeah, that’s a cool idea 🙂 
  • Or more generally, some kind of search mechanism that can explore different parts of the state space.  (You could imagine going pretty far, hitting a dead end, and deciding to backtrack a bit, all using git branches or something similar; a checkpoint/rollback sketch is at the end of this thread.)
  • this seems to suggest that some breadth-first exploration of possible approaches might be a good early step
  • multiple branches?
  • +1 This is a really interesting idea!
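  • A sketch of the checkpoint/backtrack idea using git (assumes the workspace is a git repo and that nothing important lives outside it, per the external-side-effects caveat above):
        import subprocess

        def git(*args, cwd="."):
            return subprocess.run(["git", *args], cwd=cwd, check=True,
                                  capture_output=True, text=True).stdout.strip()

        def checkpoint(message, cwd="."):
            # Commit everything so the agent can come back to this exact state.
            git("add", "-A", cwd=cwd)
            git("commit", "--allow-empty", "-m", message, cwd=cwd)
            return git("rev-parse", "HEAD", cwd=cwd)

        def rollback(commit_sha, cwd="."):
            # Discard everything after the checkpoint, working tree included.
            git("reset", "--hard", commit_sha, cwd=cwd)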
  • I get that ACIs are useful as a concept, but I wonder what properties of ACIs are “best”, and what dimensions of control you have over an ACI. Like the principles of ACI design. Did you get any of those details from the paper?
  • I guess they list 4 principles in Section 2, but they’re very action-based. I don’t really understand what they’re suggesting about structuring the information architecture of the ACI, which seems like the most (or at least a very) important part.
  • The paper doesn’t talk about how they came up with the set of actions - in the main paper at least, maybe the appendix has some more detail?
  • I don’t see principles of ACI design anywhere, I was hoping you could think about it 😀
  • On it 😉 
  • It seems a little weird that forcing the agent to scroll up and down in a file would be super helpful for it… do they give any reasons why this is better than just showing the whole file?
  • I assumed this is also an issue of having too much in the context?
  • But if you have to scroll, won’t you see a bunch of irrelevant stuff?
  • Well you can do a search first and then start scrolling from the search
  • Ahhh, that helps contextualize this a bit — because these are bugfinding tasks, it might be helpful to search and just be working with that smaller amount of context
  • Daunting that the previous success rate was ~4% and they were pleased to reach 12% on SWE-bench, but I don’t know that dataset
  • I know we talked about different axes of complexity / difficulty for evaluation data earlier this week. Is SWE-bench hard because it requires a lot more context?
  • It’s worth looking at this actual dataset — it’s pretty hard (and messy)
  • Also worth mentioning that the authors of SWE-Bench and SWE-Agent are the same
  • oh, wow, I just looked at the leaderboard; nobody had gotten to 10% before a couple months ago
  • I thought Devin claimed to get 13%
  • yes that was around the same time as this paper
  • Is swe-agent a reasonable baseline to have for our future more complex evals? 🤔 
  • Sure? It’s probably a bit specialized to this format, but perhaps we could hack it up to work and try it out
  • Eric says he set this up yesterday and it’s pretty easy to set up, but may need a bit of hacking to work with our evals
  • +1, this makes a lot of sense to me!  I think it would drive us towards building some things that would be useful!
  • Specifically: I think “how to connect an LLM to a code tree + its environment” is an important question to tackle.
  • There’s a startup called “greptile” that’s working on this right now
  • I tried this one when it launched, and it did an okay job answering questions about the codebase. I thought it had a surface level understanding, but it did a better job than just grepping
  • Are we sure the swe agent is the right interface for our future eval? It seems potentially more productive to lift the relevant techniques over
  • This is a good question — it is probably more productive to lift the relevant techniques over, but 
  • This bit about syntax highlighting is interesting
  • what is that
  • From the paper: “humans can use tools like syntax highlighting to help them notice format errors when editing files in an IDE”
  • More generally, syntax highlighting gives you some information about whether things are functions or variables and so on
  • See also the discussion about using color to present “interleaved side channels” of information below.
  • oh yes i thought “bitabout syntax” was a term
  • No, sorry that's an artifact of my audio dictation
  • Seems to me like making the agent more like an imitation human is probably not the right direction in the short term, as appealing as it is as a hypothetical future, for the same reason that most of the useful robots so far in history have not been humanoid. But it’s interesting to think about the work that the IDE is doing to provide syntax highlighting, and whether there’s another way to give the agent access to that information and insight other than the roundabout way through the screen.
  • My point was just that syntax highlighting is useful for a writing code for humans, and this information is not available to agents
  • Ha, yes, I was agreeing with you, but taking more words to reach the point where that was clear!
  • Ah sorry I responded without seeing the second half of your message
  • In general, what do we think about this interface? Should agents mainly be working in a single file and editing code, or should they actually be working in the command line, doing whatever they want?
  • If I’m understanding you properly — it seems like the agent should be able to use the command line — it feels like using the command line is a special case of “run code”, and the agent should be able to implement but also run code.
  • Evan clarified that if you only need 3 or 4 bash commands then you don’t need it to be able to do arbitrary command line things, you just hardcode those things.
  • Or have a different interface for calling those
  • I might have misunderstood, but I thought that this paper was suggesting a sort of decoupling of the human interface (e.g. the ones you mentioned) and the agent interface, so the answer is probably neither—the agent should be working in its own unique space that’s optimized for its own problem solving ability?
  • Sorry, I didn't mean the agent’s coding interface. I meant: at what level of abstraction should the agent be operating? In this benchmark, the agent is allowed to do arbitrary things in the command line, whereas in other evals all you have to do is implement code
  • well it seems like if the agent is only allowed to implement code in a file then the human has to drive the agent around and decide which file to edit at any time? i guess that’s ok (basically copilot/cursor)
  • yeah I was a bit surprised that no one used Crafty for multi-file projects even though you could have just moved Crafty around with you. I guess it requires you to break down the solution even more?
  • People did — Sam copied functions from other files as strings into the current file in order to make Crafty have context from the other file.
  • oh, okay, interesting! it seems a little bit different to me than architecting across multiple files though.
  • Oooh. Yeah that’s one of the biggest questions I have from this paper too. They didn’t talk at all about information architecture, which is like the secret sauce to success with this approach and has a lot to do with level of abstraction.
  • what is information architecture
  • it’s the level of information you choose to present to the user (or, in this case, agent) and how you choose to organize that information so it actually makes sense
  • Do they present code to the LLM with line numbers?  Is there a nice way to do that that doesn’t confuse it?
  • IDEs use a gutter for humans, but that kind of side channel isn’t available in a pure-text interface.  I think even text-only LLMs might benefit from more channels, and ways to correlate them together and reference each other.
  • Stack traces should be able to point directly at the lines they reference somehow, too.  (In fact I really wish VS Code would underline all the lines related to a particular stack trace!)
  • I think you can command + click into the lines referenced in the stack trace in vscode
  • Yes, you can, in the log window, BUT NOT (AFAIK) for the stack trace that gets shown over top of the unit test when it fails!  — drives me nuts 🙂 (And I would love it if the lines of code in the stack trace were somehow called out in the text editor)
  • :0 
  • They do use line numbers - I think figure 3 (a) shows this (a sketch of that kind of windowed view is just below)
  • You are right!  I really like this!
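  • Something like this is presumably all the line-numbered view amounts to (a sketch of a windowed viewer, not the paper’s exact format):
        def render_window(path, center_line, window=50):
            lines = open(path).read().splitlines()
            lo = max(0, center_line - window // 2)
            hi = min(len(lines), lo + window)
            out = [f"[File: {path} ({len(lines)} lines total)]"]
            if lo > 0:
                out.append(f"({lo} lines above)")
            out += [f"{i + 1}: {lines[i]}" for i in range(lo, hi)]
            if hi < len(lines):
                out.append(f"({len(lines) - hi} lines below)")
            return "\n".join(out)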
  • I assume the color coding in Figure 3(a) is only there for the paper and not somehow presented to the LLM?  Although they do say something about syntax highlighting, too, right?
  • I think they just provide information about linting errors to give information similar to what syntax highlighting gives to humans
  • Humans use color coding for interleaved side channels…
  • That’s true—unfortunately LLMs don’t have access to that. What are some other channels that an LLM can get information from?
  • This is an issue for so many types of data right? Any data with structure has to be flattened and have additional text added to indicate that structure
  • Same goes for images and text together — it seems hard for them to directly reference each other.  Maybe attention can figure some of it out, but I’m not totally convinced, especially given how much we’re seeing that extra stuff in the context sucks up attention when it shouldn’t.
  • this is super interesting! makes references super concrete
  • I really want LLMs to be able to explicitly reference things!
  • It’s interesting that the hyperparameter search (B.1) for GPT-4 and Claude yielded pretty different results, how specific do these techniques seem to models?
  • I think the %resolved is pretty different (probably because they optimized prompts for GPT4) but the choice of which hyperparameters are best doesn’t change that much across models
  • It seems like giving limited history to GPT4 provided more lift and some of these techniques like limiting the search results helps with that
  • From their appendix findings, it looks like the best temperature for BBB is greedy decoding

05/31/2024

  • This seemed reasonable… are there any reasons to be skeptical of the results?  Ex: the exact match stuff, the particular problem set, etc?
  • hmmm what struck me is that this is only rather obscure fact-answering questions, rather than anything reasoning-related
  • that’s a good point — I’d be curious to see how this applies with questions that are less about making the transformer memorize things
  • I think this part was deliberately chosen to be a setting where it’s easier to study this effect cleanly
  • +1
  • I didn’t carefully look at some of these plots, but how do we reconcile these results with some of the weak-to-strong supervision results — SFT is run with mildly incorrect labels (which can look like ‘new’ information) but the performance on true task increases. I guess maybe the pre-trained model already knows how to solve the task; the ‘new information’ is just eliciting the knowledge…
  • Can you elaborate on the weak-to-strong supervision results you have in mind?
  • There’s a paper where they show that you can use a model that has poor accuracy on a task (e.g. multiple choice task) to provide SFT labels for a much more capable pretrained model (GPT-2 → GPT-4) and the performance of SFT’d GPT-4 with these bad labels ends up being very strong compared to the weak GPT-2 supervisor. I think the hypothesis is just that, even though the labels are bad, they effectively provide some kind of ‘task identification’. But I think there is also a crucial early-stopping that’s necessary so that GPT-4 doesn’t overfit to GPT-2
  • Didn’t they show that they did something that makes it better than naive fine-tuning on the weak supervisor pseudolabels?
  • Yes I think there was some confidence-based fine-tuning method that’s better but I don’t remember all the details
  • That seems like it could be consistent with this paper? If there were enough ‘known’ examples in the noisy data, the effect of training on good ‘known’ examples could outweigh the effect of training on the noisy examples?
  • A question that comes up for me based on the high level conclusions: what is it about pre-training specifically (vs other kinds of learning SFT, RL, etc) that makes it capable of more robustly encoding factual knowledge without hallucinations? For ‘small scale’ SFT and RLHF, I can see hallucinations being an issue, but does this mean if we attempted to do something like RL to learn entirely new capabilities, we would have to deal with these kinds of problems?
  • I think the learning rate is lower during pretraining, and there’s much more data diversity pulling in different directions
  • I think this is a really good question that I don’t know the answer to. Is it perhaps about that you see this information many times? I’m not sure.
  • I guess the two main differences are 1) seeing the same data multiple times, and 2) having a much higher learning rate. It seems like both of those things could be avoided in RL (or even in this setting), and potentially could help the model learn new knowledge rather than hallucinating?  Though I’m not sure even doing both of those would be sufficient in this setting (of just training on answers). It’d be interesting to see.
  • Seeing the same data multiple times raises the effective learning rate, I think
  • One other difference could be which tokens get loss applied… in SFT the density is different, i.e. (I believe) you only apply loss to the “answer” portion, which is like <10% of your tokens, right? So you’re getting this very concentrated gradient on “make this answer correct” (see the masking sketch below)
  • but yeah i’m also very unsure
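  • Concretely, the “loss only on the answer tokens” point looks roughly like this in a standard SFT setup (a sketch; -100 is the usual ignore index for cross-entropy in PyTorch):
        import torch
        import torch.nn.functional as F

        def sft_labels(prompt_ids, answer_ids):
            # Prompt tokens get label -100 so the loss only flows through answer tokens.
            input_ids = torch.cat([prompt_ids, answer_ids])
            labels = input_ids.clone()
            labels[: len(prompt_ids)] = -100
            return input_ids, labels

        def sft_loss(logits, labels):
            # logits: [seq, vocab]; shift so position t predicts token t + 1.
            return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)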
  • MJR: Practically, does this mean there’s just a particular ideal data distribution split to find when fine tuning?
  • The paper seems to suggest that the ideal fine-tuning data is stuff that is moderately known (Table 2).
  • according to that table, the 100% ‘moderately known’ category is around the same as natural (all 4 categories mixed so you don’t even have to bother tagging stuff :/)
  • but the fact that other categories do poorly suggest that maybe it’s just that you happen to have a nice natural dataset in this case, but should be aware of it in general
  • +1 yeah you’re right — it means you kinda want the same distribution of known-ness as whatever your test set is.
  • yeah, like Figure 7 in the paper is their measure on out-of-distribution tasks (so train + dev on a certain type of question, e.g. “what’s the capital”, and then test on other new types of questions, e.g. “what’s this person’s job”), and there’s a bit more dropoff there even for small percentages of unknowns mixed in.
  • Michael: Hmm, yeah I guess because the fine-tuning purpose here is new knowledge adoption, it can’t really answer my question. I’m mostly curious about more practical situations, ie “we want to train to get it to behave in a certain way. It might cost factual accuracy, but additional examples in the target domain may be worth it.”
  • Similarly: Is the cost of avoiding the hallucination that it doesn’t get as well fine-tuned to the fine tuning set for general/heuristic usage? I wasn’t clear on if avoiding the problem was essentially free or only free WRT their assessment.
  • I think this is mainly just illustrating that in this case, hallucination is arising as mainly a byproduct of fine-tuning on unknown or poorly-known info.
  • Michael: Ok, yeah I think I understand the fundamentals here better now, thanks (also thanks Bowei)
  • Summarizing some of my personal takeaways from this paper:
  • Fine-tuning does indeed appear to lead to hallucinations, and the culprit mainly appears to be from fine-tuning on previously unknown information. Instead, fine-tuning on stuff that you kind of knew before is generally beneficial. This kind-of-known stuff is the most beneficial (more than well-known and poorly-known).
  • There are several ways to detect previously-unknown stuff: (a) you can use this SliCK procedure; (b) there’s also this observation that stuff that you didn’t know is actually fit in the training data less quickly, e.g., Fig. 1
  • This seems to be in general support of the Superficial Alignment Hypothesis, where most of what a model learns is via pre-training, and fine-tuning is primarily a way to surface this information
  • +1
  • +1
  • There’s a section they specifically call out though, that training on ONLY highly-known stuff is actually not as good as either a mix, or only maybe-known stuff. Superficial alignment would kinda suggest that highlyKnown is the best to train on since it literally only contains formatting guidance, not content, so this paper sorta suggests there’s more going on.
  • that’s true. good point
  • the “natural” data condition seems to hide some of these effects (when early stopping, does well on both known and unknown) — does this argue for just being reasonable and looking at the performance on a held out set as you train?  It seems like, as long as you like your dataset, that actually gives you the best result…
  • Wait, but they find that if you filter your fine-tuning data, you can actually do better, e.g., row 2 vs. row 5 in Table 2 (at least for convergence).
  • I’m looking at early stopping, since that seems to do quite a bit better in general
  • Yep, thanks. Fixed.
  • But re the point: it may be that the natural data here tends to be mostly the right type of data, whereas another SFT dataset might have a bunch of unknown data that is harmful
  • Yeah, fair — so then it does seem worth potentially putting some effort in to make sure that such examples are not included in your fine tuning data set.  Which is kinda what we did with our CARBS metric maybe? By making the tasks less about knowledge memorization, we’re probably less likely to have the “unknown” example in our fine tuning data?
  • And that also accords with my own experience and intution that having a bunch of noisy, wrong labels in the dataset also hurts — then the model needs to learn to do weird nonsense to fit the train data
  • The final outcome accuracy still seems to be maxed on the natural mix though — so yeah this doesn’t provide any good way to improve accuracy even after you do all this work tagging your data
  • Do they state the mix of the natural data?
  • yeah, it varies depend on which type of Q&A we’re looking at, but it’s between 15% and 30% per category overall. Table 3 in appendix E
  • 19528 20674 13825 27673 = 81700, so pretty even in aggregate.
  • some categories are really imbalanced
  • What is the capital of [E]? 4160 + 1634 + 449 + 572 = 6815
  • What, if any, are the implications for fine-tuning on code? Are there analogies to hallucination to be aware of?
  • I also have this question…  a very hand-wavy transfer of this might be something like “should we be training on problems that the model never gets right?”  (very tenuous)
  • I don’t think I know what hallucination means for code generation, is that like, imagining new algorithms? (seems unlikely).
  • Their technique of sampling a couple of answers and measuring response consistency - this paper and some of their cited papers imply that that’s a decent way to measure the hallucination effect
  • Wouldn’t response consistency be expected to be much less in a more open ended task than coding? Would this still apply well?
  • yeah that’d be a big problem, how to measure response consistency
  • Is there a way to do finetuning on novel information in a way that doesn’t have this problem? Like mixing it with random pretraining data or something? Does this paper speculate about that? 
  • I guess the natural answer would be to pretrain on that knowledge, rather than trying to do it during fine tuning?
  • But sometimes new knowledge comes up after pretraining has ended. Google wants to compete with Twitter’s Grok LLM to be up-to-date on newsy topics
  • Slightly tangential, but is there more research on “Does M know the answer to q?” Having to sample answers to figure out what knowledge the LLM has is a long, roundabout procedure.
  • Yes! they directly compared to a method where you prompt the LLM on “question: <question>, answer: <answer>. is this answer correct? A: true B: false. <predict and scrape logits>” 
  • This method was significantly worse: around ~10% of the test data that the P(A vs B) method indicated was unknown, the model actually got right after finetuning. Ideally, if the model really doesn’t know it, then that test data should have 0% accuracy.
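  • A sketch of that P(A vs B) probe (assuming a HuggingFace-style causal LM and tokenizer, and that “ A” / “ B” encode as single tokens; details are illustrative):
        import torch

        def knows_answer(model, tokenizer, question, answer):
            prompt = (f"question: {question}\nanswer: {answer}\n"
                      "Is this answer correct? A: true B: false\n")
            ids = tokenizer(prompt, return_tensors="pt").input_ids
            with torch.no_grad():
                logits = model(ids).logits[0, -1]   # next-token logits
            a_id = tokenizer(" A", add_special_tokens=False).input_ids[0]
            b_id = tokenizer(" B", add_special_tokens=False).input_ids[0]
            return bool(logits[a_id] > logits[b_id])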
  • [josh] what does this mean for code
  • hallucinations might be WIP that gets fulfilled.
  • WHAT IF : you structure the code dependency so the LLM always gets the function definition before it’s used?
  • deepseekcoder: deps come before calls (across files) but not within files
  • Actually actionable:
  • Evan’s point: we could SFT to early convergence and then take a closer look at the examples which the model has trouble fitting
  • and either send that training data to be human-cleaned, or tag it using inference

05/17/2024

  • How do you get it to generate the chain-of-thoughts? Is it via few-shot prompting?
  • Yes — few shot, then greedy decoding
  • Methodologically pretty similar to MathShepherd
  • +1
  • The question from the paper seemed pretty ambiguous (i.e., a pool could hold a small dog). Wonder how much this could help with some of Bas’ ambiguity work.
  • Yeah, CQA is literally what bas is working on cleaning up right now I think 🙂 
  • I think there’s a lot to be done with giving better rationalizations, and with sampling as well

  • Does the depth 8 refer to the number of primes ‘ that are added?
  • Yes
  • Sounds like all thoughts are of length exactly 8, am i reading that right?
  • It is a hyperparameter, I think the main experiments used that
  • +1 
  • maybe i missed something - how are the thoughts bootstrapped? Or it’s entirely self-supervised and there’s no bootstrapping data, it’s just “hook up loss during finetuning and pray” ?
  • ans: yes it’s just during training
  • Also, this makes inference quite a bit slower right?
  • seems like it could be done or not at inference?
  • I guess the main reason you’d want to use excess compute to do this is if you ran out of data (or didn’t want to re-use data)? Otherwise it seems like maybe just overtraining more might be better? Or creating more data and then continuing to train on that?
  • This is unrelated to the super-hyped “Q Star” system I’ve heard about on twitter, right?
  • I think Josh mentioned that there are reasons to think it is this
  • Goofy if true. I mean it’s interesting work, but not like, AGI-solving work lol
  • Yeah it does seem a little underwhelming unless there’s some really effective augmentations that you can come up with
  • I’d want to see this compared to other “co-train on limited related synthetic data” approaches, feels like a lot going on at once and it isn’t clear what is making the difference.
  • The thinker rollout looks to be RLed? NVM they use “REINFORCE”
  • IMO It would be more interesting to, for every document in a corpus, synthesize/backout:
  • a prompt / spec from that document
  • an intermediate scratchpad document for an entire work, that might encode an outline / the overall less linear plan for authorship
  • and then train on generating these intermediate representations to allow for non-linear structures.
  • I don’t understand the discussion of the pause token finetuning, it seems to imply the opposite result of this paper
  • I see, the pause paper only generates <pause> tokens but it can produce multiple of these in a row

05/03/2024

  • I’m confused why predicting tokens in the future is actually useful for code. Let’s say I have def some_func( and I predict the next N tokens ahead, why is it helpful to know the next 4 characters ahead?
  • I wish they shared which problems failed before and after, I’m almost wondering if it has something to do with dealing with “boundary points” like ):\n better
  • +1 that’s an interesting possible explanation
  • I’m imagining a string like num_cats: int = 3, where the name num_cats benefits from knowing that its type is int, which is information in the lookahead
  • ^ is what I wanted to say but better stated. It’s valuable to predict a name and a type at the same time - they benefit from the mutual information.
  • Is MBPP mostly typed? my sense is that this would have a marginal effect
  • no (it’s not typed)
  • No but this effect probably shows up elsewhere, ie in this string: person["age"] = sum(eras), where it's helpful to predict the dict index simultaneously with what's being put into it
  • But the tokens are predicted conditionally independently from each other. I don’t understand how there’s benefits in this example.
  • Well, it’s predicted conditioned on the Transformer trunk embedding
  • My model is something like: the pressure of needing to predict future tokens pushes the model to allocate its attention in middle layers in such a way as to improve its performance at predicting the next token. So while they're predicted independently, they’re sort of entangled in that way (see the sketch of the heads below)
  • This feels prone to overfitting though no? I actually wonder if this would make it much worse at “out of distribution” problems
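  • A sketch of the architecture being described: one shared trunk embedding, N independent linear heads, each trained to predict the token k steps ahead (shapes and names are illustrative, not the paper’s config):
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class MultiTokenHeads(nn.Module):
            def __init__(self, d_model, vocab_size, n_future=4):
                super().__init__()
                self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size)
                                            for _ in range(n_future)])

            def forward(self, trunk_embedding):   # [batch, seq, d_model]
                # Head k predicts the token k + 1 positions ahead of each position;
                # the heads are conditionally independent given the trunk embedding.
                return [head(trunk_embedding) for head in self.heads]

        def multi_token_loss(logits_per_head, tokens):   # tokens: [batch, seq]
            loss = 0.0
            for k, logits in enumerate(logits_per_head):
                shift = k + 1
                pred = logits[:, :-shift].reshape(-1, logits.size(-1))
                target = tokens[:, shift:].reshape(-1)
                loss = loss + F.cross_entropy(pred, target)
            return loss / len(logits_per_head)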
  • A related question: What are “consequential decision points” in code? I guess inconsequential ones are while you’re in some boiler plate — so the complement is consequential ones?
  • Maybe a somewhat inconsequential one is what to name a variable? Not sure if this is their same definition of inconsequential, but saying “index” vs “i” won’t change the logic
  • A consequential one could be whether to start writing “for “ or “whil”, although this arguably might change fewer of the future tokens than the naming variable example
  • I feel like 
  • Excuse my naivety, is this similar in a sense to MoE? By this I mean MoE is looking at a consensus over the next token while this is doing something similar but with future tokens as well?
  • Excused
  • Where some of the experts have expertise in looking ahead further? I think MoE is typically implemented differently. This paper just puts a token decoder onto the end layer, whereas MoE typically has multiple experts at each layer of the network
  • That’s my understanding
  • This gives a great idea of doing MoE with multi-token output layers! Maybe different experts will help, e.g. bump best-performing n from 4→8
  • Does it make sense to do this technique but not look ahead? 
  • What do you mean?
  • All N heads predict the next token but
  • Ah and consensus voting? It would eliminate the inference speedup and arguably the benefit of performing better at choice points because of no look-ahead
  • To me, there’s actually nothing special going on with the performance improvements with this paper, mostly feels within the noise so the big draw is the inference speed improvements BUT I was wondering if the consensus voting helps in cases where the model “picks a bad token” due to noise
  • I wouldn’t dismiss it—where does the perf improvement cap out wrt model size? Maybe 33B or 70B models would have even bigger gains. This is unclear to me. For what it’s worth, I think there are other papers that do this kind of consensus voting you suggest but I don’t recall
  • The problem I have with looking at the performance is that it depends on their training data, hyperparameters, etc., so is it really a fair comparison against the baseline? I need to double check what their baseline even is (did they train it?) yes they did
  • FWIW, this is kind of like Bootstrap DQN
  •  Here’s my understanding of the “inconsequential” vs “consequential” decision point argument:
  • Let’s define a consequential token as one where the following tokens depend on it, i.e., it’s an index t such that p(x_{t+k} | x_{≤t}) != p(x_{t+k} | x_{≤t-1}) for some k = 1, …, N.
  • In contrast, define an inconsequential token as one where the following tokens are independent of it, i.e., p(x_{t+k} | x_{≤t}) = p(x_{t+k} | x_{≤t-1}) for all k = 1, …, N.
  • Then consider what happens in the loss when you have a consequential token vs. an inconsequential token:
  • For a consequential token, when you are N steps out, you hit it with the loss 1 time…
  • Then when you are N - 1 steps out, your N-step predictor has to predict the next token after the consequential token, so it depends on the consequential token again. And your N - 1 step predictor has to predict the consequential token again, so it affects the loss 2 times
  • … Overall, it affects the loss 1 + 2 + … + N = N(N + 1)/2 times
  • For an inconsequential token, when you are N steps out, you hit with the loss 1 time
  • But then when you are N - 1 steps out, your N-step predictor again has to predict the next token after the inconsequential token. But since that next token is independent of the inconsequential token, the loss for the N-step predictor is independent of the inconsequential token. Hence, it only affects the loss 1 time again
  • So overall, it only affects the loss N * 1 = N times
  • That’s a nice argument
  • OK, I think I understand that bit. My question after that is roughly - Are choice points really such a binary distinction or a question of degree? Similarly, wouldn’t training with a higher learning rate (at a very large size to smooth out small mathematics differences) give the same result?
  • An inconsequential token doesn’t have a loss of 0, right?
  • yep, that’s what i thought as well
  • Having heads for the next X tokens ahead makes me think of N-gram predictors. I wonder if the tuning of an optimal X matches the tuning for an optimal N-gram predictor. 
  • I wonder that also. This feels like everything old is new again in ML. I wonder if there are tips and tricks for working with N-grams that would apply here
  • They cite ProphetNet and Blockwise Parallel Decoding for Deep AR! I’m sure there are more ideas to be tried here. The difference seems to be in the “flexibility” of speculative decoding: n-gram predictors needed to always guess exactly N tokens ahead but here they are mutually independent
  • It’s kind of different because the tokens are predicted conditionally independent of each other, right?
  • Yeah, it’s not an exact match, just rhymes.
  • What is the acceptance scheme?
  • Same as Medusa's and other spec decoding
  • Pick candidates likely enough according to original model
  • Min of a hard threshold and entropy-dependent threshold (temperature)
  • Always accept first token using greedy decoding to ensure at least one token is generated each step
  • Final output is the longest sequence that passes the acceptance test (rough sketch at the end of this thread)
  • Is the acceptor a smaller classifier model?
  • T=0 == greedy decoding, T>0 => more efficient
  • This acceptance scheme is only for self-speculative decoding right?
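  • Roughly, the acceptance rule described above (a sketch following the Medusa-style “typical acceptance” description; epsilon and delta are illustrative hyperparameters, not values from the paper):
        import numpy as np

        def accepted_prefix(draft_tokens, probs, epsilon=0.09, delta=0.3):
            # probs[i] is the original model's next-token distribution at draft step i.
            n_accepted = 1   # first token is always accepted (greedy)
            for i in range(1, len(draft_tokens)):
                p = probs[i]
                entropy = -np.sum(p * np.log(p + 1e-12))
                threshold = min(epsilon, delta * np.exp(-entropy))   # hard vs entropy-dependent
                if p[draft_tokens[i]] < threshold:
                    break
                n_accepted += 1
            return draft_tokens[:n_accepted]   # longest prefix passing the test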
  • Abstract claims "13B parameter models solves 17% more on MBPP than comparable next-token models" but for what k in pass@k?
  • What if the model did exponential lookahead? Like instead of n=1,2,3,4, do n=1,2,4,8
  • Great question! No clue, you should run that experiment
  • +1! My guess though is that super far lookahead might be somewhat meaningless because it is so heavily conditioned on the intermediate tokens. I wonder if there is a different way to calculate the loss that might advantage this approach?
  • Yeah, maybe something like computing the “average” token over some longer lookahead?
  • Does bigger n work better for bigger models?
  • It seems to be an upside-down U, where 2-4 seems to work best and 1 and 8 do worse
  • See Section 3.4 and Appendix E
  • Appendix E seems to suggest that larger models are better able to handle a larger N
  • What's the point of reporting pass@k?
  • It lets you know if the model is actually capable of generating the solution if you generate enough times?
  • What causes the U-shaped curve for the utility of larger n? Do we think that future research will push n out to much bigger numbers like 16, 256, or 2048?
  • It seems like sometimes tokens 1 and 2 might require looking at places A and B in the context, where A and B might be quite far apart. Is that an area where this model would struggle?
  • Could they do inference by averaging token estimates across the 4 times that a token was predicted? This is probably a bad idea, I wonder if other people tried it
  • What are other experiments that could test the computation-sharing hypothesis?
  • Prediction difficulty of different tokens varies greatly.
  • LMs w/ residual connections refine output token distribution with each successive layer, and can be trained with "early exit strategies." Multi-token prediction loss explicitly encourages information-sharing between adjacent token positions, which allocates computational resources more efficiently to tokens that need it
  • Create a language/problem/structure that has a tune-able level of shared information and see how well it actually trains on generated data.
  • As a terrible example, mostly random characters, but slightly tweaked such that the characters specifically 2 and 4 back have some predictive power over the current, vs specifically 3 and 4 back.

April 19, 2024

  • Lol at the title
  • Why do Qwen and Llama seem so different in their sidedness?
  • I think Qwen is Chinese (Alibaba), maybe the tokenizer & training data distribution is different?
  • These graphs make it seem like the learning rule is different; that learning is accumulating later in the network instead of earlier
  • Why is the y axis “block size” in those graphs? What does that mean?
  • That’s the number of layers you’re cutting out, I think? (n in figure 1b)
  • Yes it is
  • Oh, that’s why there’s a linear relationship between the (starting) layer number and the number of blocks cut
  • I would expect there to be more continuity at layer N as the block size increases. That’s weird
  • +1 I also don’t really understand Figure 4 — looking at eg the 30th column of 4(c), it seems to get lighter as we go up the graph, indicating that a larger block has lower angular distance? This seems counterintuitive but I’m probably misunderstanding
  • That does seem counterintuitive. Are they normalizing for the block size? 
  • I think that’s right. I think that’s mostly a function of angular distance being a poor metric of how the activations are transformed (a sketch of how it’s computed is a few lines below)
  • I’m surprised that so many layers are so close to 0 on the angular distance scale!
  • Other research has shown the layers to vary sort of "smoothly" from layer to layer, so this makes sense.
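  • For reference, the quantity being plotted is roughly the angular distance between the residual-stream activations entering layer l and layer l+n, averaged over tokens (a sketch assuming you have already collected per-layer hidden states):
        import numpy as np

        def angular_distance(hidden_states, layer, n):
            # hidden_states: list indexed by layer, each [num_tokens, d_model]
            x, y = hidden_states[layer], hidden_states[layer + n]
            cos = np.sum(x * y, axis=-1) / (np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1))
            return np.mean(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)

        def best_block_to_prune(hidden_states, n):
            num_layers = len(hidden_states) - 1
            scores = [angular_distance(hidden_states, l, n) for l in range(num_layers - n + 1)]
            return int(np.argmin(scores))   # starting layer of the most redundant n-layer block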
  • Some previous work that they mentioned hypothesized that later layers stored knowledge “non-locally.” Is that true for the 70b models? Why can we delete them then? 
  • Would Llama 3 have the same issue, because it is trained on way more tokens?
  • How do we get the final layers to be useful?
  • I think they are useful for language modeling, just not for in context learning
  • Let’s say we accept that cutting out 30% of the layers  has nearly the same final performance - how would that compare to just training a model from scratch with those same layers removed from the architecture? Or roughly the same number of final parameters?
  • On MMLU, removing 40% of Llama 13B yields a model which still has much higher MMLU perf than LLama 7B. So, assuming 13B isn’t suboptimal wrt depth vs width, we could not obtain the same result by just training a shallower 13B
  • Oh yeah, now that I see those graphs it’s definitely better before the sharp dropoff starts, although around 0.5 it seems close to where 7B started.
  • Why do we know 13B isn’t suboptimal wrt depth vs width - they say removing layers beginning at the penultimate layer yields similar performance in 4.4
  • I wonder if this is tied to lottery ticket hypo, seems consistent with that
  • Definitely! I think it is more of a performance enhancer since lottery ticket doesn’t delete blocks
  • Why is the performance in terms of MMLU etc. so much better than the perf in terms of perplexity — removing 20% of Llama 70B increases its perplexity to 13B levels
  • +1, I don’t have good explanation for this, curious to hear others’ thoughts
  • It seems like a follow-up would be to delete unimportant layers and then add layers back in (maybe by splitting up the layers with the largest difference?) Do they allude to this at all?
  • That could be interesting, maybe deleting like two blocks that are next to each other but keeping their intermediate layers
  • I don’t think they do, they want to destroy as much of the model as possible haha
  • Is this going to become a common method for making smaller models? 
  • I think so, it is so bluntly effective that I don’t see why you wouldn’t use it for small ones
  • Also, the fact that they can fine-tune a 70B model with a single A100 is huge!
  • I actually don’t think so because the language modeling performance (validation XE) is much worse. I also think full rank fine tuning will also perform better than the PEFT performance.
  • I guess my variant of the original question would be - is this the best / best easy way of optimizing output quality with a fixed size for the inference GPU?
  • +1
  • So… if we wanted to bolt another network on to a pretrained one and then continue training, this suggests lopping off the second half before bolting on the new model :)
  • I love this, Frankenstein models

April 12, 2024

  • Sadly, I missed a lot of the intro presentation. If anyone would find it helpful to write down key takeaways here under this (not Evan), I would find it really helpful! In particular:
  • What have we learned about data generation from these papers?
  • Ellie: it is many small things, so not easy to summarize. I will just watch it.
  • What things did these papers do or conclude that seem suspect?
  • What feel like interesting / potentially high impact things to try with data generation given this? e.g. Evan’s idea of generating verifiers/test cases.
  • I think there are a few different things like that we can pull out; these stood out for me:
  • The idea of asking the LLM to make a question progressively more complicated, I liked that and thought that worked well
  • The idea of using real seed data (wikipedia, wild code, etc) to condition the generations
  • Finding places where we could just generate verifications, eg, “make a neural network that achieves < X loss on this dataset” or “make a function that passes these test cases”
  • Stupid question: I can’t find the “Stop sharing” button. Was I not sharing slides this whole time?
  • You were sharing slides! Don’t worry.
  • The papers are using this for fine tuning but this approach is just as good for evals too right?
  • Do we expect this path of data generation to eventually lead to model collapse?
  • +1
  • Rylan Schaeffer, whom we talked to recently, has a hot take (and 3 papers coming out) that imply that model collapse is not real, as long as you are retraining on the combination of the original dataset + the generated data. He says all of the papers that showed model collapse just retrained future generations of models on the newly generated data only, rather than augmenting the original training data with generated data. It makes sense that generated data-only trained models would lead to collapse because a batch of generations might have some weird features (this happened with image models developing weird artifacts). I have not read those source papers or his new papers so cannot verify this.
  • No, we do not expect this (see above answer from KJ) if done in a reasonable way (ie, not ONLY training on the generated outputs)
  • This approach is pretty similar to what we’re doing. I wonder if it’s worth downloading their code and running it. 
  • Yeah, I think so. I think the broad scaffolding behind these things makes sense to me, but I’m wondering if we can improve upon it. I also think we should just play with (1) the post instruction-tuned models; (2) the data itself; and see where the data should be better.
  • I think we can definitely improve on these. A lot of these prompts looked fairly weird and hacky, and there were some pretty obvious places for improvement just left on the table (ex: longer generations, filtering for quality, etc)
  • +1
  • ^ Agree that some of the prompts looked like they had a lot of places to improve. I wonder how much we can improve various results from these papers/other papers just by improving the prompt quality
  • 💯 And maybe adding some filters on the data?
  • I definitely want to copy their approach of using random code snippets. Does anyone have ideas for how to get the best and most interesting code snippets? 
  • Yes! We have excellent pipelines for this thanks to michael and maksis
  • Really? For getting the most interesting snippets of code? That’s awesome. Let’s use it!
  • It depends on how you define “most interesting”, but maksis does have a whole pipeline for applying different measures to code. What would make for “interesting” code, do you think?
  • ¯\_(ツ)_/¯ Maybe code that’s statistically unusual, or that has high perplexity? 
  • It’s not immediately clear to me what snippets are most useful and whether they correspond to “interesting” ones — but worth investigating, imo
  • Yeah, maybe we could just use random Github code files in the prompts. Maybe we could filter to only Python code that imports libraries we care about
  • I wonder if we can apply this to snippets of code in Stack Overflow
  • Some of this feels like fitting to the evaluation format, which isn’t bad, but may have a hard/quickly saturated cap of utility.
  • Maybe our planned fine-tuning around use-cases will have the same effect
  • Is it still worthwhile to do some amount of finetuning on wild code / is there any potential benefit we would get from that?
  • I think there are two pieces to this. (1) Directly (pre-)training on wild code is absolutely useful via just next-token prediction of code; (2) Using it for instruction-finetuning via giving it some “task” is not helpful out of the box, unless you can make the formatting much better.
  • oh right, i guess the claim is that the benefits of doing direct finetuning would already have been achieved during pretraining?
  • Right, yeah. e.g., the DeepSeekCoder base model already does 2T tokens of pre-training on roughly the same data.
  • Some of the evaluations felt a little incomplete — what are our thoughts on those?
  • Specifically thinking about “difficulty” being quite vague and the difficulty in telling whether improvements came from data quantity or data quality increases
  • What is next? Can we think of extensions to these works?
  • Cleaning up a lot of their work (fixing prompts, improving heuristics, etc.) seems like low hanging fruit
  • Agreed on this. 
  • I personally found the 5 methods of prompt improvement in Evol a little weird/constrained — do they provide any reasoning for how they reached these 5 specific methods?
  • Meta question: I’m working on the same task that these papers address. It seems like they have many different ideas. Which ones seem like they’re highest value-add? Which ones should we prioritize? 
  • My story is that we can pretty easily just take all of the valuable techniques that we are missing, and add to our pipeline, as there really aren’t that many (just the “evolve to make harder” and “condition on code or wikipedia”, which you’re already doing)
  • Last week I was working on a project to “evolve to make harder”, and I sort of put it down. It’s encouraging to see their numbers that it leads to improvement! 
  • Evan raised a good point/question in the final slide that I don’t see here: these methods are all sort of bootstrapping via a “better” model, how do we surpass them?
  • It seems like we could run these experiments with worse models, to see if they help self improve? I suspect they will, ie, you’d see some gains even without using a strong model, but they would be more subtle
  • Interesting, I’m not sure that matches my intuition - if the model already knows how to generate these instructions and answers, why would instruction-tuning on them help?
  • This is also my intuition, the gains might be marginal
  • I mean, the podcast guy (Rylan Schaeffer) we just talked to confirmed my intuition, should have a paper out about this soon
  • Also, in reality, one would filter the outputs and only keep the good ones, not purely generate
  • My intuition for why is that you can generate and filter for high quality data, and this will push the outputs of your models in better directions. You aren’t generating random internet text, you’re putting some intent into the generation. 
  • Ah, the filtering is a very interesting point. Hardcoded or classifier-based filtering of the top quality outputs of your model to finetune it does seem like it could work for bootstrapping.
  • I expect that grabbing data from better models would be pretty helpful as pre-training/early fine-tuning that could give a good foundational model?
  • Question about the mutators: So they just hard-coded a list of mutation types? 
  • Does the LLM choose which mutation to apply? 
  • +1 to this — do they provide intuition for how they landed on these?
  • Meta question: I know I kind of violated our standard paper party format and probably took more time presenting than we normally do. What sort of format do people prefer?
  • I liked this!
  • +1, this was a good thing to triple paper. Doing just a single one of these would have felt less interesting to me
  • I learned a lot from your presentation! +1 to triple-paper
  • I also liked it. Oftentimes the discussion is not super productive anyway 😛 
  • This was nice! I liked having 3 papers as a compare-contrast, have always been supportive of presenting >1 papers at a time
  • +1 these papers tied in together nicely, and it was well presented
  • I think it depends on the paper! For really hard-to-understand theory or algorithm papers, I tend to prefer a more discussion-based format, but for papers like this where it’s more information-heavy but conceptually simple,  I prefer the presentation format
  • especially when it’s a collection of small things spread across multiple papers, as others have said — really nice to have it consolidated like this
  • This was good

March 29, 2024

  • Part of the premise of small-scale proxies is that you would hopefully identify instabilities before being able to solve them, but this seems to require “knowing where to look”, so to speak
  • i.e., if you knew that the z-loss is the thing that you need to add, what is the role of the small models anymore?
  • Yes, this seems like a pretty good point, maybe it does let us do some more science (we can reproduce these instabilities and show the mitigations help more robustly), and people proposing other theoretical fixes can test them more easily. 
  • They also claim to predict a new issue (low gradient norm) and fix (reduce adam epsilon) but don’t have a very convincing experiment around it IMO
  • Can you explain qk layer norm?
  • Usually the attention logit is q * k^T, where q = W_Q x and k = W_K x; this just applies a layernorm before that step, so q = layernorm(W_Q x) and k = layernorm(W_K x)
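  • A minimal sketch of that in PyTorch, with illustrative shapes and module names (not the paper’s code):
import torch
import torch.nn as nn

d_model, d_head = 512, 64
W_Q = nn.Linear(d_model, d_head, bias=False)
W_K = nn.Linear(d_model, d_head, bias=False)
q_norm = nn.LayerNorm(d_head)   # the extra layernorms that QK-layernorm adds
k_norm = nn.LayerNorm(d_head)

x = torch.randn(8, 128, d_model)                  # (batch, seq, d_model)
q = q_norm(W_Q(x))                                # q = layernorm(W_Q x)
k = k_norm(W_K(x))                                # k = layernorm(W_K x)
logits = q @ k.transpose(-2, -1) / d_head ** 0.5  # attention logits now stay bounded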
  • It seems there were other interventions they tested in the paper that Abe didn’t talk about—warm-up, independent weight decay, and the µParam. Are these minor compared to qk-layernorm and z-loss?
  • It seems like warmup and weight decay have some mitigation on the issues really close to the threshold, but don’t solve the potential for instability when scaling up. 
  • mup has no impact on the LR sensitivity, but since you don’t need to change your LR when scaling up maybe LR sensitivity doesn’t matter so much (so in the end it does solve the issue maybe?) 
  • This isn’t hugely important but I’m having trouble wrapping my head around what log Z is (for the output logits normalizer)
  • This is the logsumexp of all of the logits together — so Z is the sum of the exponentials of each of the logits.
  • I meant more as far as an interpretation, but I guess this is just a numerical stability thing? I forgot logsumexp was a standard thing for that, thanks!
  • Maybe more about learning stability than numerical stability — it’s hard to learn with most algorithms when your numbers are too far from 1
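  • A minimal sketch of the z-loss term as I understand it (the coefficient here is just illustrative):
import torch

logits = torch.randn(8, 128, 1000)       # (batch, seq, vocab)
log_z = torch.logsumexp(logits, dim=-1)  # log Z = log of the softmax normalizer
z_loss = 1e-4 * (log_z ** 2).mean()      # auxiliary term added to the cross-entropy loss,
                                         # pushing log Z toward 0 so the logits stay well-scaled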
  • Can you remind us which things you were experimenting with this week, and which ones ended up diverging, iirc?
  • Maybe I’ll get into this after we go over other questions? Happy to cover it but want to get preliminary questions out of the way. Overall I’ve tried both of these mitigations I mentioned.
  • Sg
  • Did it work ?!?!?!
  • No
  • (Yaroslav), regarding following passage “Our results, illustrated by Figure 7, indicate that scaling depth increases LR sensitivity at a faster rate than scaling width”. What are their conclusions about proper depth/width scaling factors?
  • It seems they say, “The standard practice of joint scaling performs best at the largest scale and also has a more reliable scaling prediction when extrapolating.”
  • Figure E.3 in Appendix
  • It’s a small difference though
  • Does normalizing or regularizing the logits help? 
  • This is something we could maybe ask if we reproduce their paper! Using the z_loss is a way of regularizing, it’s hard to normalize because the distribution is weird. But we could test other things and see if they work? 
  • (Yaroslav) What is TLDR/interpretation of their Figure 13?
  • This is about their claim the default AdamW epsilon value (1e-8) is too large. Basically the higher blocks stop updating later in training with default adam epsilon. When reducing it to 1e-15 they show that they get more learning in those blocks (higher update rms). When increasing it to 1e-6 they find instability.
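  • A minimal sketch of where epsilon enters the Adam update, with made-up numbers to show the damping effect:
import torch

lr, eps = 1e-3, 1e-8
m_hat = torch.tensor(1e-9)   # bias-corrected first moment (tiny, late-training gradient)
v_hat = torch.tensor(1e-18)  # bias-corrected second moment, so sqrt(v_hat) = 1e-9 < eps
update = lr * m_hat / (v_hat.sqrt() + eps)
# without eps this would be ~lr; with eps = 1e-8 it is roughly 10x smaller, which is the
# "higher blocks stop updating" effect; eps = 1e-15 avoids the damping here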
  • (Yaroslav) does Figure 8 graph show that MuParam is useful? I can’t tell
  • I think a little bit, in terms of not needing to change your LR, but not a lot, in terms of the width of the basin being about the same
  • I would be interested in seeing what final eval loss looks like for some of the lower learning rates if they trained until convergence — I can’t really tell how much better 10^-3 on MuParam would get on final eval loss if trained to completion
  • This isn’t really how we train though — we don’t have an unlimited amount of compute. You’d need to train on more data or see data more than once. 
  • Ah this makes sense, thanks for clarifying
  • The paper talks about edge of stability and fast spikes, I don’t fully understand. Why do fast loss spikes happen and why do they conveniently move parameters to a more stable region with smaller λ_max? I guess I need to go read those papers
  • (Yaroslav) Satoki Ishikawa was also confused on this section. There is a “edge of stability” effect described in Cohen where he noticed that instead of adjusting learning rate to curvature, we could fix learning rate, and he observed the curvature would go up until things are “barely stable”. But there’s no connection to “fast spikes” in that work.
  • Yes, this paper is terrible about referencing things that don’t actually exist in the papers it cites
  • (Yaroslav) 3 paragraphs before “3.2.5 Additional interventions”, below, is surprising to me. It appears to imply that you could use either a 1/d or a 1/sqrt(d) scaling factor for attention, without much effect. But that’s a big difference in how you normalize. 

March 22, 2024

  • IMO:
  • I would expect feature artifacts remaining in a synthetic dataset to be more pronounced than in wild ones
  • I would expect this “distributional override” to be less pronounced the more complicated the task and more diverse the training set… IDK I just feel like there’s a lot of confounders on the table.
  • To what extent does something like overtraining account for the model’s tendency to lean on statistical artifacts (rather than assuming that we must somehow remove these sorts of things?)?
  • +1, better regularization seems like a more reasonable approach than trying to remove statistical features from the data
  • Is there intuition for why LP generalizes worse than RP?
  • IDK, though I’m also not sure it’s important, I think they just picked two ways of sampling from the problems
  • Perhaps LP problems are just more limited in their distributional scope?
  • yeah, I do think this is it — in LP, they do a random process, and then a constrained process where the number of facts consistent with the predicate/label combo may be very limited. whereas RP is all randomized and unconstrained.
  • I actually think I know why… ah yes, we both had the same thought, Andy! ^ 
  • They mention proving that BERT has the representational capacity to learn the reasoning function, could we go over that bit?
  • sure! (see appendix C)
  • Re: “for example, by only looking at the number of rules in a reasoning problem, we can predict the correct label better than a random guess.”
  • Why would this be?  Do they not allow negations in the rules?  It seems like if you allow negations in the rules (in either or both of the clauses or predicates), you could achieve a roughly equal amount of true positives and negatives, regardless of the number of rules.
  • Page 3, column 1 doesn’t indicate that there’s ever any negation, which makes the problem quite a bit easier than it would be if there’s negation.
  • I guess the lack of negation rules out any contradictions?
  • A → B
  • B → ~A
  • i think that’s correct
  • I wonder if this could be mitigated by training on a wider set of problems, then finetuning on this afterwards….maybe giving the model a better foundation could make reasoning easier to learn?
  • that seems reasonable!

March 15, 2024

Representation Engineering: A Top-Down Approach to AI Transparency
  • This technique is comparable to finetuning and prompt engineering. Is it better?
  • In what ways can this technique be used that fine-tuning can’t be?
  • Would be cool to see how much the finetuned weight updates line up with the control vectors
  • Why does LoRRA show up in this paper?
  • Do the control vectors have consistency with each other e.g. process to determine an honesty vector gives opposite result to process for determining a dishonesty vector?
  • I’m not sure what this question means
  • To give maybe an easier example: can you do arithmetic on different control vectors and have the result make sense e.g. the control vector for truthful plus the control vector for happy is in the same direction as the control vector for truthful+happy? In terms of the contrastive pairs used to determine the control vectors.
  • I think yes, because their process relies on antonyms to get the vectors
  • Were there particular layers or other groupings that were more or less important? Did they experiment with only finding control vectors for some layers?
  • +1
  • They investigate this on page 23: they find that emotion-related vectors are more coherent clusters in later layers
  • the PCA does work over the difference between the two activations, BTW (which makes sense)
  • Y: although the more correct thing to do is to use LDA (hey, easy follow-up paper)
  • Oh I’m not familiar with LDA—latent Dirichlet allocation? Why is it more correct and where can I learn more about that
  • Y: Oops, MDA, not LDA. But for two classes it’s called “Fisher Linear Discriminant”. From Duda/Hart “Pattern Classification” book:
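  • A minimal sketch of the PCA-over-activation-differences step mentioned above (shapes and data are placeholders, not the paper’s pipeline):
import numpy as np

# hidden states at one layer for contrastive prompt pairs, e.g. "honest" vs "dishonest"
acts_pos = np.random.randn(256, 4096)  # (n_pairs, d_model), placeholder activations
acts_neg = np.random.randn(256, 4096)

diffs = acts_pos - acts_neg            # PCA runs on the differences between the pair
diffs = diffs - diffs.mean(axis=0)
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
control_vector = vt[0]                 # first principal component = reading/control direction
# at inference time, add +/- alpha * control_vector to that layer's activations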
  • Does the fact that this works surprise anyone?
  • Not really, I think this is exactly what you’d expect to happen if neural networks were entirely linear, so this just means the nonlinearity isn’t messing things up that much
  • Curious about doing this with two finetunings of a model vs. two promptings of a model
  • What would the two finetunings of the model be? I think they only need to do some inference, then fit the PCA to get the control vector
  • maybe RLHF rather than SFT but a model that’s tuned to give happier vs a model that’s tuned to give sadder responses
  • [Eric] the holy grail for me is, given some signal or data (e.g. helpfulness or safety) you can turn a knob to increase or decrease this quality in the model. Not clear to me how to use this method if you’re trying to follow the axis created by a finetune, rather than by a prompt.
  • I like the included blog post
  • This seems obvious and unsurprising to me, honestly. It just upweights the value of all “honest” semantics regardless of reason or w/e. Maybe a combined PCA could get more nuanced behavior.
  • It seems pretty obvious to me too, which is honestly kind of why I like the paper
  • +1, good papers are often obvious in hindsight
  • Control vector vs prompt hacking:
  • I’m skeptical about using this approach in the long term. One consistent arc in deep learning is to simplify and make systems that simply learn end-to-end. These control vectors are essentially learned components, but not via the end-to-end deep learning mechanism. So is there a way to integrate this into the training, or are models better at achieving the alignment we want when we simply scale them up? Is GPT4 more truthful (with prompting) than GPT3 even without strategies like fine tuning or control vectors?
  • I’m skeptical, too. I think this is probably a good tool in the toolbox of companies that are maintaining models for specific purposes
  • Example of over-adding a vector

Mar 1, 2024

  • Does ChatGPT do this? I seem to see it do this.
  • Does ChatGPT do the parallel decoding over the skeleton?

Feb 23, 2024

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
  • Do we have a sense of whether richer feedback types actually perform better? They say no one has actually trained LLMs on correction (do git diffs not count?) or language feedback
  • At least in some cases, seems like yes. Ive seen papers making this claim for breaking down rewards, or the RLAIF has a notion of critique in it
  • They haven’t said that no one has trained LLMs on language feedback, merely that the techniques that they cited aren’t being applied to LLMs
  • > Fundamental: Humans cannot evaluate performance on difficult tasks well
  • One interesting note here is that there’s still a lot of rich potential supervision that humans can give, even when humans are unable to directly evaluate the performance. For example, Jimmy likes to talk about humans being able to roughly tell if a rocket launch is successful, even if they have no knowledge of rockets
  • Wouldn’t there still be a lot of noise and variance? Maybe I’m putting this in the wrong spot but for a lot of tasks there’s disagreement amongst humans on what good looks like
  • Definitely! The point isn’t that you get a perfect reward signal, but rather that non-experts can still give you good information. I can’t tell if a rocket launch went perfectly, but I can definitely tell the difference between it shooting into the sky vs. crashing and burning
  • It is hard though. Like, could you easily tell if it went to the moon? As it gets more complicated, things do get harder to evaluate. There is some fixed fundamental limit of complexity beyond which people cannot really provide meaningful feedback (though perhaps quite high)
  • An example might be a code generation LLM that produces code with a serious security flaw; humans will have a hard time identifying the problem
  • Were there any particularly surprising/interesting challenges or solutions? 
  • It depends what you’re surprised by. I still find it surprising that you need to do KL regularization during RL or else it goes off the rails
  • A friend got a baby monitor because they spent so much time babysitting their RL training
  • This was surprising to me: “Reward models can misgeneralize to be poor reward proxies, even from correctly-labeled training data. There can exist many ways to fit the human feedback dataset D = {(x, y)i=1,...,n}, even in the limit of infinite training data (Skalse et al., 2023).”
  • Is this one of those “in-theory”/mathy results? Or in practical terms do we see RLHF plateauing hard even with more data?
  • I think we see this in practice. It’s really really hard to make this generalize. A big area of study is in domain transfer.
  • I was wondering why RLHF datasets seem relatively small
  • I think a lot of that is cost haha, but there definitely is a limit. There is also the field of Active Learning which tries to make good decisions about which questions to include in your dataset, online, so you kind of determine real-time what part of the reward you don’t understand.
  • AI companies have a lot of money, companies like Scale.AI are eager to sell
  • From our own work on eval the cost is across a ton of different factors, ie. how we specify the problem, who’s labelling the data, etc
  • Just to clarify I understand: is this in terms of having a poor reward function despite consistent and correctly labeled data, or is this a result of issues with human labelers who may not agree, etc.?
  • RLHF issues don’t seem that severe to me. My intuition is that they’d be better handled by downstream training regimes designed explicitly to address the problems.
  • I think we are inclined to trust reward models and just assume that they are working, or can be corrected, they fail in a lot of really unexpected ways. The whole Gemini controversy is an example of that.
  • The problems in 3.2.1 are severe (problem misspecification). In fact, they are so severe that, if you take any fixed reward model, you are guaranteed to get infinite negative utility (under some fairly sensible, minimal assumptions). The problem can be alleviated by making iterative refinement to the reward model
  • What does iterative refinement do? How does it help if the problem is misspecified?
  • I think this means “update reward model WRT updated LLM output observed by human”
  • This is  true, and RLHF has limited utility. But at the same time, this is mitigated somewhat by practicing prudence/restraint when RLHFing.
  • I guess my skepticism is around “how much better can we really make this thing?” and “can we just take something RLHFed and improve these issues post-facto?”
  • Section 4.2.3 Aligning LLMs during pretraining, they say: “it can be more effective to use human feedback during pretraining by using a reward model to filter, weight, or annotate pretraining data” - not sure I understand this, it feels like its optimizing something different from what RLHF is doing
  • I think they’re arguing that RLHF after-the-fact isn’t enough, and you might want to remove bad data from your pretraining data rather than fixing the model’s problems after it’s already become racist or whatever
  • I guess that’s fair, I forgot this paper is looking at things from a safety lens
  • This is a really interesting direction that not a lot of people have been able to experiment with (because of cost), I hope we see more of it
  • We could create data classifiers to annotate the data to be remixed with CARBS during pretraining
  • We are already in the process of doing this, e.g. with crawler data
  • Section 3.2.2 talks about reward hacking, which I know is a big deal for alignment people, and I’m unsure how much of a problem it is for us in practice
  • I didn’t understand section 3.2.3, evaluating reward models is difficult and expensive

Feb 16, 2024

  • This all seems very detailed and compelling - what’s the catch? Just complexity? Making parallelism harder to implement?
  • Yes, parallelism is harder, training seems more unstable, and it’s not clear how much fine tuned performance will net benefit from this
  • Ok so we can think of it as a performance optimization that may make things faster but at the cost of complexity and it’s not clear how valuable it will be for downstream tasks we care about?
  • Why are we doing this specifically for the MLPs? Is attention already a mixture in a way because of the heads? Or are the MLPs where most of the parameters are, so that’s where we want to optimize?
  • I think it’s because the MLPs are where most of the parameters are, and it’s easier to have these dispatch to different experts. 
  • Maybe expert attention networks would actually work well, now that I think about it, but it doesn’t seem to be something people do
  • I’d be curious how well that works
  • I wonder if the self attention blocks across different experts would learn similar things if you moved the routing network before them
  • Evan said “AFAIK, each of these is embedding a token. At the end, you will attend over all the tokens”
  • Each MLP is operating on the embeddings at each layer, so it is working on the residual stream of a single token.
  • what is the “residual stream?” Does this essentially mean it’s an embedded token+context
  • The MLP computes a difference from the previous embedding (i.e., x → x + mlp(x)). The residual stream, afaik, means the part that isn’t just x → x
  • So it’s “the difference between the quick brown and the quick brown fox,” thereby encoding context?
  • Answer: Not quite – this happens many times in the network, so deeper routers will be routing abstract concepts like “this is a geometrical idea” vs “this is poetry”
  • This is also my understanding, but I would love a sanity check from @Abe F.
  • +1. This all gets combined before feeding into the next attention layer
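  • A minimal sketch of what “residual stream” refers to here (attn/mlp/ln are hypothetical sublayer modules, not any particular library’s API):
def transformer_block(x, attn, mlp, ln1, ln2):
    x = x + attn(ln1(x))  # attention reads the stream and writes a delta back into it
    x = x + mlp(ln2(x))   # the MLP (or, in MoE, the routed expert) writes its own delta
    return x              # the running sum x is the residual stream carried to the next block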
  • My question is: So this is just a set of specialized embedders? What is our intuition for why this works – something like “one route only focuses on verb features, one only focuses on nouns” or something?
  • From what was just mentioned about it performing worse on reasoning based tasks, seems to me like the intuition might be that the experts are just extra good at categorizing different kinds of information? not too sure
  • Without experimental results from probing, it’s hard to say. We can speculate about what attention heads are doing too, but experiments show that they’re often not divided cleanly along properties that we think of the world in.
  • Some experiments do show, with the BASE-style global assignment, that experts are learning to attend to certain types of tokens (demo’d)
  • My intuition is similar to what Abe presented in terms of cat neurons. You’ll find some with clean semantic meanings but a lot will be distributed uninterpretable representations
  • What’s the intuition behind MoE working?
  • +1 the main thing I am wondering is why anyone would think to try this
  • I think mixture models have a long history, like random forests etc. I guess what I find surprising is that we are mixturing only part of the computation,
  • right right, that’s what’s weird
  • I’m surprised that hash routing works well, why would that be?
  • Completely random idea, but I wonder if there would be any value in pre-training experts separately to some extent on different kinds of data just to kickstart better differentiation
  • Yeah seems like worth doing as the initialization
  • I am optimistic about RL. No question here, but given that “assign randomly” is one of the current strategies, there seems a lot of room for a learnable controller to improve things
  • RL in the sense of “backpropping through discrete decisions” is kind of tailor-made for this sort of problem — OTOH, it would be nice to understand what the lift from an oracle routing is. If routing isn’t that important, then the extra complexity seems unnecessary
  • Oracle means hardcoded? We won’t have an oracle as in “something that always takes the best action”
  • I agree. It’d be weird if expert routing shouldn’t be learned. This problem feels analogous to avoiding dead weights. I wonder if we can take some inspiration from that literature
  • How big are the networks that pick the expert?
  • I think O(1) in most cases? i.e., they’re not actually networks but hardcoded rules. Or maybe I misunderstood. 
  • I was wrong. they call it h(x)
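  • For context, a minimal sketch of a learned router h(x): typically just a single linear layer plus a top-k softmax (sizes here are illustrative):
import torch
import torch.nn as nn

d_model, n_experts, k = 512, 8, 2
router = nn.Linear(d_model, n_experts)      # the whole "network that picks the expert"

x = torch.randn(8, 128, d_model)            # (batch, seq, d_model) token embeddings
scores = router(x)                          # (batch, seq, n_experts)
topk_scores, topk_idx = scores.topk(k, dim=-1)
gates = torch.softmax(topk_scores, dim=-1)  # weights for combining the k chosen experts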
  • Can somebody tell a story about how this helps with optimization with hardware? 
  • Are the initializations of each expert from the same pretrained weights?
  • I thought they were initialized randomly and then pretrained
  • Should we do this?

Nov 10, 2023


Nov 03, 2023

Intercode

Questions
  • Why do we think GPT3.5-Turbo does better at zero-shot than GPT4 (in both Bash and SQL), while their relative performance flips in multi-turn? 
  • Why is GPT4 so bad at zero-shot SQL?
  • What do we think of this? Do we trust it for our own evals?
  • What do we think of how they handle multi turn evals, repair and agents?
  • Are all of the tasks possible to do in one line, or is there anything where multiple steps is needed or at least where most humans would use multiple lines?
  • This really reminds me of what I’ve been doing with auto unit testing, where it writes a test and then rewrites it after seeing the output of running it
  • Their analysis of different LLMs doesn’t really seem that interesting
  • Can we go over equation 1?
  • Sure, see code 🙂 
  • Are they using an actual RL algorithm?
  • No

Oct 27, 2023

Eureka

Questions
  • Do they show that this process actually leads to iterative improvement (i.e. involves the LLM learning something from the policy feedback) vs. just sampling lots of random diverse reward functions will eventually lead to finding better ones?
  • Similarly, how does this compare against using existing search strategies over the reward function?
  • I think this paper does something like that, and they find that Eureka is better (L2R)
  • It’s surprising to me that there’s no vision component in the policy review. It seems like that would be an improvement. 
  • Why look at pixels when you have the whole physical state, though?
  • The whole physical state doesn’t fit in the LLM context length
  • Shouldn’t it? There’s not that many moving pieces, O(10) joints and objects with O(10) things you might measure for their state, that doesn’t seem like that many numbers.
  • Yeah, isn’t the state-space usually smaller than including all the pixels?
  • OK it might actually fit but I bet it’s really confusing
  • I think success is typically a better thing to look at anyway, since the way things look doesn’t necessarily correlate that well with what you want anyway: e.g., the best way to walk is not the one that looks the most aesthetically pleasing, because the simulator is a poor approximator of reality
  • +1
  • State-based RL is typically way easier than from pixels
  • Comment: I guess the ultimate goal of reward shaping is typically just to end up with a good policy that can solve the task
  • I haven’t looked in the paper about this, but I’m guessing this doesn’t actually speed up that process, because you have to train so many RL policies on so many different reward functions
  • They claim that this process leads to better final behavior, so it’s about performance more than speed
  • Maybe? My experience with sparse reward functions is that unless you’re really good at making these dense proxy rewards, you’ll just never learn some tasks, and here at least you can keep trying to generate new proxy dense rewards? Basically, I wouldn’t expect this to be optimizing for speed as much as final policy quality.
  • It seems one of the benefits is that you can have the LLM design less sparse reward functions that still lead to the desirable outcome, which humans are bad at
  • Yeah, I think I’m trying to say the same thing.
  • I think my question below touches on this issue
  • Do they have any experiments where you also update a reward function manually in the loop?
  • This is really annoying to do, but could be a better comparison than a 1-shot human. Though, to be fair, most of the “existing” reward functions are also not actually 1-shot, but the result of several loop iterations, just way fewer iterations than what Eureka uses, I assume.
  • Do they provide stats on how many iteration cycles Eureka needs on average to beat the human generated reward function?
  • Not an answer, but I think this is generally pretty cool even if it takes way more iterations to beat humans. Just automating this is nice, even if it takes a lot of compute time. Would be really interesting if you could do this in only a few iterations, though!
  • Oh totally, was not asking to downplay the research, I was more curious about the efficiency of the learning
  • I didn’t think you were — just commenting!
  • What’s the evolutionary algorithm process for choosing the next generation of rewards functions? 
  • They sample i.i.d reward fn examples, perform in-context mutation based on the text feedback with reward, fitness function, etc., take the best performing reward fn, and generate K more outputs from the LLM
  • What exactly does mutation mean in this context where we don’t control the distribution/model params directly? Is it just treating the exact prompt as the model params?
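  • A rough pseudocode sketch of the loop as described above; the two callables are hypothetical stand-ins for prompting the LLM and for the inner RL training, not the paper’s API:
def eureka_loop(task_description, sample_reward_fn_from_llm, train_and_evaluate,
                n_generations=5, k=16):
    best_fn, best_fitness = None, float("-inf")
    for _ in range(n_generations):
        # "mutation" = asking the LLM for k new reward functions, conditioned on the
        # best candidate so far plus text feedback about how its trained policy did
        candidates = [sample_reward_fn_from_llm(task_description, best_fn) for _ in range(k)]
        for fn in candidates:
            fitness = train_and_evaluate(fn)  # expensive inner RL training loop
            if fitness > best_fitness:
                best_fn, best_fitness = fn, fitness
    return best_fn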
  • This feels like a highly-directed novelty search. I would have liked to see a comparison with other RL self-supervised approaches but I’m guessing this would have done better.
  • I guess it isn’t really novelty
  • I would have really liked to see results of trying to do this without even using a human reward function for sanity checking.
  • I was mostly just appreciative of the example from appendix F that showed that grad student code is uglier and has fewer helpful comments than the LLM generated code
  • 😂 those poor grad students, getting dragged in their own paper
  • RIP research code 😞 
  • I’m mostly impressed by the fact that the authors of the paper did the thing where the reward function was broken down into components, and that this was used in the human-LLM RLHF model for the humans to seed the components and for the LLM to code it up, on pg 9: “human designers are generally knowledgeable about relevant state variables but are less proficient at designing rewards using them. This makes intuitive sense as identifying relevant state variables that should be included in the reward function involves mostly common sense reasoning, but reward design requires specialized knowledge and experience in RL. Together, these results demonstrate EUREKA’s reward assistant capability, perfectly complementing…”
  • Maybe I missed this but what do they mean by correlated reward functions (human vs. model generated), how do they calculate this coeff?
  • They compute human_reward(state) and llm_reward(state) for many different world states, and compute the correlation
  • ahh, that makes sense
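  • Concretely, something like this (the reward functions and state sampler are whatever the environment provides; this is just a sketch of the computation):
import numpy as np

def reward_correlation(states, human_reward, llm_reward):
    human_r = np.array([human_reward(s) for s in states])
    llm_r = np.array([llm_reward(s) for s in states])
    return np.corrcoef(human_r, llm_r)[0, 1]  # Pearson correlation across sampled states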
  • Do we know that the model doesn’t have reward functions for similar and similarly described tasks in its training set?
  • Looks like it's GPT-4, so it ~definitely does, but then does outperform them after some evolution. IMO this needs to be compared to some sort of existing search over the reward function space in order to be shown interesting. If it’s just performing a fuzzier, worse beam search we should just use beam search.

Oct 20, 2023


Questions
  • It sounds like the commutative property is a big contributor? i.e., A+B = B+A. Could this work without it? Do they have an ablation for that?
  • See the graph where they talk about consistency checks, the first bar doesn’t use commutativity and is still quite good
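  • A minimal sketch of the kind of commutativity filter being discussed (model is a hypothetical sampler; only self-consistent generations would be kept for training):
def consistent_addition_samples(pairs, model):
    kept = []
    for a, b in pairs:
        ans_ab = model(f"{a} + {b} =")
        ans_ba = model(f"{b} + {a} =")
        if ans_ab == ans_ba:  # commutativity check, no ground-truth labels needed
            kept.append((f"{a} + {b} =", ans_ab))
    return kept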
  • More generally, what if you don’t have correct answers or “properties that correct answers need to verify” like this?
  • It seems like you would need to contrive domain-specific consistency checks if you want to use this technique
  • That would be pretty restrictive. How would this work for something like “document question answering”?
  • Maybe something like “write this question in two different ways” + “are these two answers logically equivalent?”
  • Maybe. If the goal is “send an email to X and put a meeting on our calendars,” the agent can either send invite → create calendar event or vice versa
  • How much are they relying on CoT for long digit addition, and how much are they relying on training it to be good without CoT? Like, is the result of this process that it can do 20 digit addition without CoT? Or is there longer and longer CoT chains? Do they have examples of the outputs that they train it to do?
  • yeah the end result is they can do long digit addition without CoT
  • What’s the base model? Llama?
  • definitely not llama, these models are much smaller. they say ByT5 models though i’m not familiar with what that means
  •  How much is addition of longer numbers like other things we care about, like reasoning?
  • Arguably reasoning is also a problem with objectively correct answers and a hierarchical computation process
  • Why not fine tune on CoT output as well?
  • Isn’t alphazero basically finetuning with graded rewards? 
  • As in, the loss in the end will encourage the model to move towards taking those actions that were rewarded, i.e., the answers that were deemed correct
  • I suppose there is also a “moving away from answers that are incorrect”, which traditional SFT wouldn’t do
  • And the log trick, so the loss is something different
  • I’m guessing it has to do with validating the data they train on is correct. If they wanted to train on CoT output they would need to verify each step of the CoT is valid or something
  • Not a question, but this seems encouraging for our goal of building agents—given LLMs that have some broad baseline set of abilities, they can likely bootstrap to solving more complex, multi-step problems (e.g. chaining API calls & code execution, etc.)
  • Do they have any ablations on model depth? For Fast Addition (without intermediate/scratch state) I think your # of digits is limited by depth.
  • Seems like the model is ByT5
  • 300M → 12 encoder layers, 4 decoder layers
  • 582M → 18/6
  • Yeah, good point
  • Already discussed: I would be interested to see if the improvement in “5 + 7 =12” carries over to “5 apples + 7 apples = 12 apples”, etc
  • Nice
  • Likely no?
  • My prediction is that with fine-tuning, the attention mechanism would learn to ignore the tokens between the quantities
  • I wonder how much of the slow dropoff in accuracy is related to addition being a linear sort of problem? It should be reasonably easy to apply the same method to multiplication and interesting to see if that achieves a similarly slow dropoff in accuracy.
  • Similarly, I wonder if reversing the digits would change the curves
  • Is there a tie-in with this technique and the Voyager technique? Could we have the model write code to produce content to train itself?
  • What is the theoretical limit on the size of numbers that can be added, from the architecture of an LLM
  • Naively, you might think 1 digit per depth, but at a certain width, it could be learning base 100 addition rather than base 10 which then reduces the depth requirements.
  • If we wanted to do this on tasks other than addition, what tasks would be good? Edit: not saying we actually want to do this, just asking what slightly harder tasks people would’ve wanted to see
  • Code generation. Longer numbers = bigger functions
  • I could also imagine adding constraints
  • I think I’d like to see problems that really push on this self consistency stuff (like the a +b == b +a) — or they explore what’s possible here. So a problem that maybe is a little bit more unspecified here  (or this property is fuzzy). 
  • Calculating derivatives might be a good one that’s a few steps harder
  • I like this 🙂 

Oct 13, 2023

  • How does the sparse autoencoder work?
  • I think just reconstruction loss with an L1 penalty? I’m not sure what exactly the loss is
  • How is that sparse? I think there’s a dictionary in here somewhere, is it like VQVAE?
  • Wouldn’t an L1 penalty on the weights automatically result in identically zero weights?
  • Correction: the L1 penalty is on the activations, not the weights (thanks evan)
  • yes that’s true. I’m still not sure where the dictionary is
  • I think they call “a list of features with a semantic interpretation” a dictionary. It’s not a Dict()
  • Is this the same as having a laplace prior instead of a gaussian prior in a VAE?
  • There’s no VQ-VAE-like thing. It’s just L1 penalization.
  • What is VQVAE?
  • It’s a VAE with discrete hidden state
  • Ah dope
  • “We train this autoencoder using the Adam optimizer to reconstruct the MLP activations of our transformer model, with an MSE loss plus an L1 penalty to encourage sparsity.”
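  • A minimal sketch of that objective (widths and the L1 coefficient are illustrative, not the paper’s values):
import torch
import torch.nn as nn

d_mlp, d_dict = 512, 4096              # dictionary is wider than the MLP layer it reads from
encoder = nn.Linear(d_mlp, d_dict)
decoder = nn.Linear(d_dict, d_mlp)

acts = torch.randn(1024, d_mlp)        # MLP activations collected from the transformer
features = torch.relu(encoder(acts))   # sparse dictionary features (L1 pushes most to 0)
recon = decoder(features)
loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()  # MSE + L1 on activations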
  • The results are impressive but likely cherry picked? Can we do an exploration that is a bit more random?
  • +1, although there’s a lot of real examples in their interactive stuff, I deem this non-fake
  • Yes definitely not fake - just curious what the “average” feature looks like
  • This is actually a practice in neuroscience, for each analysis you’re supposed to show the “best”, “median” and “worst” fitted neuron. 
  • Is this actually unexpected? To what extent are we doing a fancy analysis of clustering of language? Could we do something like this with ye olde word2vec and clustering embeddings?
  • +1
  • Was there discussion of inlining some variation of a sparse autoencoder during pretraining such that the model (might?) learn interpretable features out of the gate?
  • Relatedly, why couldn’t we just do L1 regularization in the MLP to get similar results? (I guess you need a wider dimension than the MLP)
  • What does the feature label writer see when it’s trying to describe a feature? If examples, how many?
  • What is the significance of “the feature is not a neuron” which they seem to be emphasizing quite a bit?
  • They’re trying to claim that individual neurons do not encode concepts, but concepts map to “patterns of activation”, which they can cluster as these features
  • Btw this is another hot topic in neuroscience - to what extent can you interpret what a single unit does, vs is information encoded only in “groups of neurons”
  • In other words, individual dimensions of the hidden states of the network don’t correspond to semantically meaningful things, but you can cluster values of these individual dimensions into semantically meaningful “features”… in case this is helpful
  • Yeah just mostly wasn’t sure if it was helpful but the explanation from Bas about relevance to neuroscience makes sense
  • Any plausible material/practical consequences to these ideas? 
  • How surprised are we about the shape of their feature cloud?
  • This is roughly randomly choosing features based on their frequency, right? How much prompting (like given, human examples) does it take to prime one to pull out a feature that we want to detect?
  • Do the larger runs have much more specific features or do they also contain general features like are found in the smaller runs?
  • They’ve shown that the features come from, and are causally connected to, the model and not the text, by showing that forcing a feature’s activation makes the model produce the corresponding text. Do we also know that the clustering and interpretability is also a property of the model and not of the text?
  • To test, what happens if we do this same thing but on a model trained on gibberish? Do we still find good clustering across scales?

  • We are also grateful to our colleagues at Anthropic who generously provided feedback, and many of whom also helped us explore the features discovered by dictionary learning as part of a "Feature Party".
  • I am amused
  • Does anyone with neuro background know if this is analogous to something we know about biological brains?
  • Yes, early visual cortex tends to represent information in a pretty direct single-neuron level like “neuron x maps to the local contrast at pixel y”, as you go up the information becomes more holistic and more distributedly represented
  • In fact, in prefrontal cortex, people don’t analyze neurons at all, they analyze information storage by figuring out what stimulus information can be decoded from the neural activations (i.e. MVPA)
  • we know that brains represent many concepts as circuits / sets of neurons instead of individual neurons, if that’s what you mean
  • How much do we expect this to transfer to multilayer transformers?
  • What needs to be done to test this on multi-layer transformers? ← Brainstorming aloud
  • I think nothing. It wouldn’t even cost more compute. You just need to run the transformer forward once. 
  • I guess you could do the autoencoding thing on the last layer of the transformer
  • ^ This seems not that hard to do… Is there a huge compute overhead thing or something else that makes this hard?
  • See above, I don’t think it’s meaningfully harder to do technically. 
  • Maybe they did and it didn’t work as well
  • One issue is that the residual trunk also has information about what the character will be, so should we do this on the MLP activations or on the residual hidden activations? Each MLP contributes relatively less to the output logits.
  • Do they speculate about how to use these features to control the model for AI alignment purposes?
  • Is this stuff going to be useful for any sort of transfer learning?

Oct 6, 2023


Questions
  • What does this mean? 
  • document → (prefix, middle, suffix) → (prefix, suffix, middle)
  • The document is originally (prefix, middle,suffix), the middle is moved to the end so that it becomes  (prefix, suffix, middle)
  • Ok I guess my real question is what is the baseline? Like, how can you do fill in the middle without doing this?
  • Like, what are the baseline (purple) AR models?
  • Ok I get it now, Figure 1 shows the regular loss of the model and the point is that adding this FIM data doesn’t disrupt your ability to do the normal stuff.
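  • A minimal sketch of the data transform (the sentinel strings here are placeholders standing in for the paper’s special tokens):
import random

def to_fim(doc: str) -> str:
    i, j = sorted(random.sample(range(len(doc)), 2))  # two uniform random cut points
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # PSM ordering: the middle moves to the end, then training is ordinary left-to-right
    return "<PRE>" + prefix + "<SUF>" + suffix + "<MID>" + middle + "<EOT>"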
  • What’s the intuition for why you get a lot more out of doing FIM during pre-training vs. fine-tuning? Is it about number of samples, or about doing this task early on during pre-training so it doesn’t end up with representations that are hard to modify?
  • i’m guessing it’s the latter
  • I don’t think they provide any intuition here, I do feel like the latter is probably correct, but don’t see any solid evidence here right now
  • Actually it looks like it may be partly about the data, as they do recover similar performance once they get to 90% fine tuning on FIM data… but then it actually gets worse on AR prediction, which indicates that it’s learning better representations
  • Gotcha. So I think what you’re saying is: when fine-tuning on FIM data to get same performance as pre-training on FIM data, you get worse AR prediction. But the pre-training uses more FIM data.
  • My guess is that this requires major changes in the weights, so if you do regular pretraining and FIM finetuning, then the model will be too far away for finetuning to work
  • I wonder if FIM would be helpful for tasks similar to the style Andrew was talking about yesterday — e.g. where it’s generating autoregressively but then you actually want to go modify some earlier part of the generation.
  • +1 I was thinking this too. What happens if you do this repeatedly at inference time? Maybe continuing until stability?
  • Oh, that’s interesting!
  • Yes, could be an interesting thing to try, although we need to know where to put the sentinel token for generation
  • Maybe just randomly, but we could train another model to predict which position will lead to the most increase in accuracy
  • There is a line of work on iterative refinement, that kinda does this… It appears that prompting (with additional feedbacks/controls) is a sufficient interface for this task. 
  • Do they use dummy tokens for the place where the middle text would go, or do they skip right past them?
  • +1
  • They have a single token as a sentinel
  • I wonder how the transformer is making use of the embedding space for that one token
  • I wonder how much perplexity drops off for the token after the replacement token?
  • How do other types of models that are not trained left to right (i.e. BERT or T5) perform? They should be slower, but presumably they should do better.
  • How is T5 trained?
  • I don’t remember exactly, but I’m pretty sure it’s predicting masked target spans of tokens, as opposed to just the rightmost token
  • Why would it do better? It’s a bit easier to make them do FIM I think, since they can see the entire context.
  • In the case of BERT (it’s been a while), aren’t you masking subsets of the tokens, so in principle if the context length is big enough it would do the same thing?
  • IIRC for BERT the masked token subsets are not contiguous (?). Or do they alternate between masking random tokens and random paragraphs?
  • But in general for BERT I think you need to insert [mask] tokens for where you want to fill in, and we might not know how many tokens we want to fill in
  • That’s a great point, BERT masks on the level of individual tokens and subsets are only contiguous by chance
  • How much better did this actually do at the “fill in the middle” tasks?
  • Much better than something that is not fine tuned, mostly the same as something that is fine tuned for 50B tokens on 90% FIM data (which is almost the same amount of data they trained on)
  • Are they middle spans chosen randomly? What are the parameters of the randomness?
  • +1 how do we define middle, is it in any way a semantic segmentation or just based on fixed ratios
  • See 2.2 and Appendix E
  • +1
  • They say they choose two positions at random, and use the space between those two positions as the middle
  • hmm, feels very ad hoc but I can’t immediately think of a more principled way to do it
  • This is a 2022 paper. Have there been any updates? If not, how much should we read into that?
  • OpenAI hasn’t been publishing much in the last year, so I’m not sure how much we can read into it
  • When would we need to be doing FIM?
  • Related question: what are the situations in which this outperforms autoregressive generation? For code, I could just generate the example snippets autoregressively — when might I want to infill? (Hmm nvm, Andrew’s point earlier of doing this repeatedly at inference time for code is pretty interesting.)
  • It may help for coding tasks like docstring generation or typing that require context from after the insertion point.
  • Ahh, I see.
  • Does it outperform doing AR generation where you insert the code from after the insertion point into the context, and then ask it to generate the docstring?
  • I think we would only want to do FIM training if we want to have FIM inference, not for purely AR inference
  • Is this a reasonable tl;dr: Training FIM-style doesn’t actively harm performance vs. autoregressive, and improves FIM performance, so they’re proposing that you should do this by default?
  • Then there’s a bag of tricks for how to train FIM best.
  • Yes this is the gist I got
  • Clarification: So the only difference in the attention mechanism is that the middle part can now also attend to the end part?
  • Wait I guess the other difference is that the suffix no longer can attend to the middle part, which seems weird. Wouldn’t it be better to just duplicate the middle part e.g. (prefix, middle, suffix) → (prefix, middle, suffix, middle)?
  • It seems intentional that the suffix can’t attend to the middle, because it might be useful to decide on the last bit before writing the middle bit, e.g. writing the conclusion of a paper before writing the methods section
  • I guess I’m surprised this doesn’t hurt on some more strictly autoregressive tasks
  • Because of how the attention works, if the suffix can attend to the middle and the middle can attend to the suffix, the middle can just copy itself and cheat.
  • How does this compare to fully bidirectional models?
  • I’m not sure what prediction target bidirectional models are using. Usually models we are looking at have a fully AR prediction target. I forget what T5 does for pretraining.
  • Is it right for me to think that this doesn’t help with AR because AR doesn’t include any of those tokens?  eg, is it learning some special mode of the data that includes these PREFIX/SUFFIX/MIDDLE tokens?
  • I think it helps with AutoRegression because the order of autoregressing can be randomized some
  • Wait—but it doesnt help with AR? (it just keeps AR the same)
  • Right, it doesn’t help with AR at all
  • Was there any bias in the lengths of the spans for prefix/suffix/middle? Or are the locations chosen uniformly at random?
  • Splits are chosen uniformly at random (expectation at 1/3 and 2/3 of document or context)
  • Could you take this idea to its logical conclusion and have more than 3 parts shuffled about?
  • What happens if you randomize the ordering of all tokens presented to a transformer? How much can it recover using just the positional embeddings?
  • I haven’t seen anyone doing this but it sounds hard
  • I would be interested in more parts, I’m surprised we haven’t seen it.
  • Or what if you didn’t have FIM, but you started sticking some other random stuff in your pretraining (ex: retrieved facts, invoked tools, etc)--could you end up learning how to deal with those types of documents without losing anything on your normal AR loss?
  • This seems like a good idea for synthetic data
  • Yeah, I think that this is a good signal that sticking stuff in pretraining can be better than finetuning on it, at no cost
  • Is FIM strictly harder than predicting in order?
  • It seems like it is a bit harder, but it does get to view the suffix when predicting the middle, so I wouldn’t call it strictly harder
  • Does the PSM data “conflict” with the original data in any way or does the input always look different?
  • My guess is it always looks different because of the special tokens? But relatedly—are the special tokens included in the PSM data?
  • Yeah I think because of the tokens it’s not conflicting, but there is indication of some confusion the model is having with where to predict from— this is why they say SPM works better than PSM
  • It seems like the mechanism for deciding when to stop generation is the EOT token, which could be an issue if we’re sampling from the model. Are there alternative ways for deciding when to stop the generation of the middle section?
  • +1, also curious about this—how do you usefully generate the middle of something, since it isn’t guaranteed to line up?
  • I’m not sure what another mechanism would be, but would be interested if you had any ideas!
  • My idea is cut based on lines (for code) or sentences (for language), since these are natural transition points.
  • After looking thru the paper more, it seems like the main issue is generating sequences that are too long. I wonder if this is something that could be corrected for via finetuning from human feedback (or for the case of code, their automated tests). It seems like it wouldn’t be that hard to look at a piece of text and tell whether a truncated version of it is better than the full generated sequence.
  • Interesting, how would you tell if it is better?
  • Hmm yeah I guess it would be easy in the failure examples from the paper (e.g., repetition) but harder in other cases. Then, it comes down to whether the user prefers briefer or longer generations.
  • [Edit: I think this is mostly answered above, not doing this training leads to bad perf] Is the point of the paper just to show that doing this procedure during pre-training is better than doing it during fine-tuning? Is there an implicit assumption that without this type of training, models will not be able to Fill In the Middle at all?
  • Figure 9 seems to show that it takes an awful lot of finetuning data to get back to the same level of FIM performance as we would have had if we had just pretrained with it
  • What would happen if there are more than one slot to fill in? This paper just studies the setting where the middle chunk happens consecutively. It’d be fun to study whether the model could capture multiple infillings. 
  • Wouldn’t this be a MLM loss i.e. what BERT does? Unless the infillings can’t attend to each other. Hmm.
  • It’s not MLM, more like XLNet’s loss. MLM have strong independence assumption that you are predicting each token independently. 
  • Why doesn’t multiple infillings decompose into doing a single infilling multiple times? Do you get any benefit from setting up a task with multiple infillings?
  • Yes, you could fill in multiple areas that depend on each other, and their context, in interesting ways
  • Yes, I think this is pretty interesting
  • Here, the prefix and suffix are encoded jointly (for PSM and SPM it's suffix conditioned on prefix, or prefix conditioned on suffix); how much performance would we lose if we just encoded them separately (that way, caching is easier…)? If we allow for some pause tokens to recombine their embeddings, like (PREFIX, SUFFIX, PAUSE …), it's kinda like natural boundaries for sparse attention 🙂 
  • The idea of using the generation of the EOT token to judge the goodness of the infix generation is fun (it's kinda like a rejection criterion that automatically rules out sequences that generate infinitely). How do you control the length of the generated infix? 
  • How does this scale to long-context problems? In the paper the context length is 2048, but people care about contexts at the scale of millions of tokens. Actually, it's a good test setting for coherence of the long-context mode. 
  • That’s interesting, that it’s a good test setting.
  • Fun science question: this allows us to test the Bayesian consistency of the LM: we have p(infix | prefix, suffix) ∝ p(infix | prefix) · p(suffix | prefix, infix), and all three can be measured by the model. How consistent are they?? (a toy check is sketched at the end of this thread)
  • That’s interesting
  • +1
  • Super interesting!
  • !!!!!!
  • I figured Ellie would be excited about this 😂 
  • I’m pretty certain they are not very consistent!
  • Me too but excited there is a way to test
  • That’s cool!
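A toy sketch of that consistency check. Here `seq_logprob(context, continuation)` is a hypothetical helper returning log p(continuation | context) under the model, and the PSM sentinel strings are also assumptions:
```python
def consistency_gaps(seq_logprob, prefix, suffix, candidate_infixes):
    """Compare the FIM-mode score of each candidate infix with the score implied
    by the chain rule in ordinary AR mode. The two differ by the constant
    log p(suffix | prefix), so only relative gaps across candidates are meaningful."""
    gaps = {}
    for mid in candidate_infixes:
        # FIM mode: log p(mid | prefix, suffix), scored in PSM format.
        fim = seq_logprob(f"<PRE>{prefix}<SUF>{suffix}<MID>", mid)
        # AR chain rule: log p(mid | prefix) + log p(suffix | prefix, mid).
        ar = seq_logprob(prefix, mid) + seq_logprob(prefix + mid, suffix)
        gaps[mid] = fim - ar
    return gaps
```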
  • What is this bit talking about? What’s subtokens in this context? “””We show one such example below that is impossible to complete unless the model can read the entire source code. This example is also interesting in that the prefix “from sym” and the suffix both contain subtokens, which are known to cause traditional language models trained without techniques like stochastic BPE [Provilkov et al., 2019] to fail “””
  • "sym" is a fragment of the token "symlib", but out of the box there is no way for the LLM to know that one is a prefix of the other. Stochastic BPE randomly replaces a BPE token with its subtokens during training, so the LLM learns about these parts (a toy sketch below).
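A toy sketch in the spirit of BPE-dropout / stochastic BPE (my own simplification of Provilkov et al., not their exact algorithm; `merges` is an ordered list of symbol pairs):
```python
import random

def bpe_dropout_tokenize(word, merges, p_drop=0.1):
    """Tokenize by greedily applying merge rules as usual, but skip each applicable
    merge with probability p_drop, so the model sometimes sees e.g. "symlib" as
    ["sym", "lib"] (or finer pieces) and learns how tokens decompose."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b and random.random() > p_drop:
                symbols[i : i + 2] = [a + b]  # apply the merge
            else:
                i += 1
    return symbols
```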
  • Would we get elementary math algebra performance for free using this? Like given a model that knows “5 + 7 = ??” → 12, would that model easily graduate to “5 + ?? = 12” → 7 when applying this technique?
  • 😯 
  • That’s quite interesting, but it does seem like it’s just learning this as two separate problems rather than getting it for free
  • Similarly, would we be able to translate a multiple-choice trained model to a free form model via the following manipulation: “What’s between red and yellow? A: ???, B: green, C: apples. answer: A” → it infills “orange”
  • “What’s between red and yellow? A: orange, B: ??, C: apples. answer: A” → is also an interesting manipulation — find out what the model generates for the “wrong” multiple choice.
  • Fig 10 says that their setup prevents attending to the middle tokens when generating/predicting the suffix tokens, which seems like an impactful change. How do we reconcile that with the observation that it doesn’t change non-FIM loss at all? 
  • It could be because the middle tokens then have access to the suffix ones — so on average each token attends to the same number of tokens as before
  • Why is it better to do 10% normal order and 90% FIM, rather than all FIM?
  • Figure 13 seems to indicate that 90% is very slightly better on HumanEval
  • Is it weird that pre-training with FIM doesn’t improve AR performance if the intuition is that when training with FIM we’re learning better representations?
  • I share this intuition and also am surprised
  • Abe says: it’s learning better representations for other tasks but not necessarily AR tasks. It’s just not learning worse representations for AR tasks.
  • Given when this paper is published, is it clear that all the pretrained models use this already (GPT4, claude, etc)?
  • Is there any reason for us not to do this with 90% FIM data in pre-training?
  • This is something I want to ask the team!

Subsequent discussion:
  • instead of using just one type of token and random middle spans, use predecided span lengths eg line, paragraph, function, and use a specific token for each - then, at inference time, one can supply the token type to get something of the correct size
  • e.g user wants “fix this line”, the model gets fed prefix + suffix + the 1line fix
  • Could also try different sampling techniques at inference e.g. argmax with low temperature

Sep 29, 2023


Questions
  • “Technically, we also need to determine whether ˜p(qi) > 0.5 corresponds to “Yes” or “No,” as this isn’t specified by LCCS.” ← so they can’t actually tell which is true and false, because of the unsupervised nature, is that right? This is giving me “one guard always tells the truth and one always lies” vibes
  • I guess there are various ways of trying to calibrate this, e.g., if you have some facts you know to be true
  • Answer: Yes, it’s unsupervised, the objective is consistency rather than correct labels, so if the pretrained model is crap, it could just as easily push the classifier to perfectly misclassify things as to perfectly classify them.
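For reference, a minimal sketch of the objective as I understand it from the paper (`probe` would be e.g. a small linear head on hidden states; this is my own paraphrase, not their code):
```python
import torch

def ccs_loss(probe, h_yes, h_no):
    """Unsupervised CCS-style objective: the probe's probabilities for the "Yes"
    and "No" phrasings of the same question should be consistent (sum to ~1)
    and confident (not both ~0.5). No labels anywhere, which is why the learned
    direction can come out flipped relative to true/false."""
    p_yes = torch.sigmoid(probe(h_yes))
    p_no = torch.sigmoid(probe(h_no))
    consistency = (p_yes - (1.0 - p_no)) ** 2
    confidence = torch.minimum(p_yes, p_no) ** 2
    return (consistency + confidence).mean()
```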
  • It feels like “grafting a vestigial additional task-head post-training” - could be an interesting way of mixing in “self-awareness” in a higher-order architecture (i.e. “I’m lying right now,” etc)
  • Especially for “I’m making shit up R/N”
  • Would want to compare this with simple perplexity-monitoring like we talk about regularly
  • Main potential issue I have is that language is really structurally rich and I am skeptical any such vestigial classifier would be resilient to real-world heterogeneity of structure
  • I think their claim is that “truth” is a universal feature regardless of task and language, but I agree with your skepticism on that. The model’s knowledge about truth in one context might be a totally different feature than the model’s knowledge about truth in another context.
  • They should have held out whole datasets for evaluation if they wanted to show invariance across structure, which if I’m reading correctly they didn’t
  • Re: Jamie’s point / question, there’s this paragraph from the paper that acknowledges the lack of such an evaluation as a limitation
  • Second, we did not evaluate our method on setups involving active “lying” or “deception” (Kenton et al., 2021; Evans et al., 2021) by models, as we aren’t aware of existing evaluation setups for this setting. If future work develops such a setup, a good stress test would be to apply CCS to do “lie detection” in that setting. This may require modifications or extensions to the method, such as more explicitly ensuring that it recovers the truth of an input rather than what the model says.
  • Josh: someone did try to do something like this in a paper I was reading earlier this week, but it wasn’t a good paper; it turned out to be hard to set things up so that a model will “lie” to you in any way that doesn’t feel super contrived
  • Makes sense. It seems like an interesting evaluation in principle, since truthfulness seems important… but difficult to do
  • What should we take away from this paper? Are there useful bits we should think about for ourselves?
  • I’m curious if there’s any value to extending this to 3-4 options instead of 2 and seeing if it helps with adjusting to multiple choice
  • Given the limited performance benefit, not sure if it’s quite worth it?  Seems a lot harder to get the constraints to work out with multiple options as well
  • But we could also interpret the limited performance gain as “those models didn’t actually have much hidden knowledge” whereas we know from finetuning on the dummy dataset that ours do.
  • I’m curious what this would get us above fine tuning though, especially since it is always underperforming logistic regression, which is more like fine tuning
  • Maybe scale, because you don’t need labels so you can generate an infinite amount? IDK
  • I’m curious which cases the CCS is most “uncertain” about
  • Same
  • Does CCS performance correlate with fine-tuning performance? Can it let us skip fine-tuning for CARBS? 

  • Can we use this to eval our models on low-data / OOD tasks for which we don’t have enough train data to fine-tune? 
  • Can we extend this to use a broader format of evals beyond multiple choice?
  • I guess this sort of thing could be potentially useful just as a probe, rather than trying to extract performance out of it. You could imagine using places where the probe is more uncertain as a way of calibrating the model, though it _kind of_ relies on the point of doing truthfulness evaluations to actually work.
  • In Appendix F, why do we think performance spikes when CCS uses hidden layers in the middle (~layer 22) before going back down before the later layers (>layer 36)? 
  • Their argument in doing this study is that the later layers are more directly related to the (potentially incorrect) outputs [due e.g. to the most likely next token seen in training happening to be wrong] whereas the truth might exist in earlier layers. 🤷‍♀️ 
  • Could the ideas from this be useful for our agents, e.g. in situations where the task requires factual truthfulness?
  • What are the benefits over logistic regression? I’m still a little confused. Is it just that we didn’t need labels?
  • Hmmm. It seems like this idea isn’t fundamentally really about yes vs. no or accuracy, it’s about “surgically telling what the model’s mood is” or something. Like, you measure whether the model’s thinking “this is consistent” or “this is wrong.” I wonder if you could then go and extract other types of “mental states” from the model, and if that could be more actually useful? For example, maybe this technique could be used to tell whether a model’s in a storytelling mood, or believes it can solve your problem, or doubts it can solve your problem, etc etc.
  • I don’t get why the added prefix prompt for confusing the model should be confusing:
  • How does a telescope work?
  • Eye beams are emitted by the eye and reflect back into the eye

Sep 23, 2023


Questions
  • Some thoughts:
  • Here are some hypotheses about failure modes:
  • (1) The model is decent at coming up with the right computation graph, but ends up answering things incorrectly, because it executes along this graph poorly
  • This seems relatively easy to patch, by imbuing the model with some extra tools, e.g., a calculator, which it can use to execute its node, rather than e.g., doing multiplication itself
  • (2) The model is actually bad at predicting new computation graphs
  • This seems much more problematic
  • As a short-term patch, it seems like you should put more effort into training the model on good coverage of computation graphs (but this is difficult, because the space is exponential)
  • There’s the semantic parsing approach, where you try to just come up with a generic algorithm that solves a general problem, e.g., a multiplication algorithm, that can solve any computation graphs
  • But semantic parsing as a literature has run into problems, where learning to output this generic algorithm is hard
  • As a human, I think we’re pretty good at coming up with new computation graphs, even when we don’t know the underlying generic algorithm, so I would generally want this capability for a model.
  • (3) What else?
  • How would we go about disambiguating these N failure modes?
  • And do we have methods for addressing this, even if we could disambiguate. If not, then there may be less motivation to actually disambiguate.

  • What is the evidence that the failure is due to compounding errors?
  • See figure 5 at the top of page 7. They look at the correctness of nodes at different depths
  • Not exactly evidence, but they provide a proof at the top of page 8
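A toy version of that argument, with my own made-up numbers (not the paper's exact proof):
```python
def expected_exact_match(per_node_error: float, num_nodes: int) -> float:
    """If each node of the computation graph is executed correctly with
    probability (1 - eps), independently, then full-answer accuracy decays
    roughly exponentially with graph size."""
    return (1.0 - per_node_error) ** num_nodes

# e.g. a 5% per-step error over a 30-node graph leaves only ~21% end-to-end accuracy
print(expected_exact_match(0.05, 30))
```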
  • Is their zero-shot accuracy (figure 2a) using a scratchpad / step-by-step prompt or not?
  • I believe the prompt used is shown on page 17. It tells the model to think step by step before giving its answer
  • Wait, is figure 3 saying that even for 2x3-digit multiplication their models didn’t successfully memorize all of the training examples? (It gets 77%, but they say they only held back 10% for ID.)
  • Yes, but that’s without a scratchpad on GPT3
  • I guess I’m confused about what kind of conclusion I can draw from this paper. Is it accurate to say that this paper is claiming there are limits to reasoning depth when the model works purely in-context in natural language?
  • Also, did they attempt to solve the compounding error issue with critique / other generation techniques?
  • Also, could the compounding error issue be solved with a medium that is less error-prone / can represent DAGs and other data structures better, like code?
  • The compounding error/DAGGER problem with auto-regressive decoding comes up over and over again, i.e. in imitation learning for control problems in closed environments. Do we think this is a fundamental limit to current LLMs? Brute force memorization of math doesn’t seem great. How can we work around it?
  • What’s the main takeaway? Seems like another “LLMS just simply don’t get math” paper to me
  • Figure 5: when the subgraph is in the training data it can solve the problem and when it’s not, it’s much less likely to solve the problem.
  • Yeah I kinda meant the larger context. We’re talking about it now though
  • Is the combined error rate higher or lower or exactly equal to what would be expected by just appropriately combining the subtask error rate down the DAG? Why?
  • Why does the LLM not generalize to new / deeper subgraphs?
  • It doesn’t build any kind of general “world model” for how addition actually works, so all of its relative correctness is memorization
  • What part of the generalization is failing?
  • Is it that it cannot figure out the computation graph?
  • Or given the computation graph does it mess up and get compounding error?
  • Or other failure modes?
  • Seems like “can LLMs robustly learn addition” is actually less agreed on than I thought.
  • There’s a reason “logic” is both a philosophical and mathematical subdiscipline
  • I wonder if part of the difficulty of breaking something like multiplication into subproblems is that the problems are of different sizes - Does attention get spread too thin? Or does it just not have size-invariant problem breakdown tools at all?

Notes
  • Andrew: a way to think about it is it’s incorrectly using heuristics when it should be much more methodical at each step. Could you expose a hyperparameter so that it’s less heuristic and more methodical? 
  • Can we do a CARBS run with only math as the eval to see what kind of data mix and hyperparameters are best at solving math?
  • Bawr: do masking of the context for different steps. Like “focus now”.

Sep 15, 2023


Questions
  • Does anyone else think filtering could have been improved with a few reflection-style prompts? I.e. (the following is a potential sample for training… is this a good example?)
  • Probably! Certainly would be better than using heuristics only
  • Definitely! Could also give it a paragraph describing what makes a good example
  • What was the sample size of human inspection?
  • 200
  • Given the plateau of 16k that’s 1.25% of samples. IDK how to judge that though really 
  • I’m curious how this training data looks, I’d imagine it’s pretty bad from GPT3
  • All of the ones I’ve inspected so far are wrong 😩 
  • Oh lol, the 2 I looked at seemed fine
  • I love this one,{"instruction": "How would you implement this in python?", "input": "[1, 2, 3]", "output": "[2, 3]"}
  • Is this paper relevant anymore? Now we can bootstrap from instructGPT-like models, so do we need to care about this paper at all?
  • They say they filter out questions that have a ROUGE-L similarity >0.7 with their nearest neighbor. Are those neighbors sampled from only the human examples?
  • No, I think it’s all existing tasks in the pool (human + generated)
  • Won't that have a garbage-in-garbage-out problem?
  • I think this is just for filtering so it wouldn’t really affect it (rough sketch of the filter below)
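A rough sketch of that filter, assuming the `rouge_score` package; the paper's exact pool construction and threshold handling may differ:
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def keep_instruction(candidate: str, task_pool: list[str], threshold: float = 0.7) -> bool:
    """Drop a generated instruction if its ROUGE-L similarity to anything already
    in the pool (seed + previously generated tasks) exceeds the threshold."""
    return all(
        scorer.score(existing, candidate)["rougeL"].fmeasure <= threshold
        for existing in task_pool
    )
```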
  • This fine tuned performance also doesn’t look great, how do we think it compares to LLAMA2
  • Terrible, most likely
  • Actually, they compare to instruction fine-tuned Tk (a T5-equivalent baseline) and it does well, and T5 is comparable to or better than LLaMA at the same param counts
  • Why use GPT-3 and not LLaMA? It has so many instruction finetuned models.
  • Are there any steps they took during generation to encourage diversity of prompts?
  • Conditioning GPT3 on 6-8 randomly sampled tasks from the seed pool of human-written tasks—you essentially need to start with reasonably diverse tasks to have any hope
  • Why not bootstrap by using the iteratively improved models to generate better instructions?
  • Dunno! Maybe it was too expensive or too hard. Maybe they lost access to the API when OpenAI deprecated it. 
  • What are the key takeaways / nuggets of information that we should get out of this?
  • IMO the only thing is that the filtering ideas may be useful
  • +1, I wish they had discussed their filtering more
  • We could also take this as a baseline and try other filtering improvements like discussed earlier… if we were really interested in the idea
  • Did they run any experiments that varied the initial quantity of seed (human) data? It’d be interesting to see how much 10% or 1000% of their initial seed examples affect performance / the shape of the curve that defines return on investment here.
  • Nope—would be curious to know as well!

Sep 8, 2023


Questions
  • How does the author feel about the fact that even legitimate explanations are often not accepted nowadays? (eg. the election was not stolen from Trump)
  • is legitimacy still important / is its importance decreasing
  • devil’s advocate for black-box rule: if legitimacy isn’t accepted anyway then maybe we should just not worry about trying to get legitimacy
  • this seems pretty easy to discard though, we want legitimacy even if it is not accepted by all
  • take a look at p. 2 — that’s the essential argument
  • true, though trust in government is at all-time lows / % of people accepting legitimacy is declining a lot
  • How important is the possibility of challenging it (even if it’s small) compared to the actual explanation?
  • How does this work for things that are too complicated for laypeople to understand? Eg we invest X government dollars in this research, but I can’t explain it without an academic document, do I need to justify it to regular Americans to have the authority to do this research?
  • ^ Relatedly, there are government functions that have to be kept secret e.g. foreign policy and defense which may have huge effects on people’s lives.
  • Does the author discuss anything about the level of rigour that should be required in the explanation?
  • My thoughts: I assume this is a messy, case by case, problem but feels somewhat like the crux of the problem if we agree that explanations are something we should have. I can imagine a lot of scenarios about a system or the provider of a system providing a reasonable sounding, but untrue, explanation of how that system is making decisions.
  • I have a friend who runs gov’t benefits programs and a common scenario is: there are a lot of bad actors trying to fraudulently receive benefits — explaining how they are determined to be fraudulent could make the problem worse (and therefore be against the public good) because the perpetrator would know that they need to change their methods. Are they still owed an explanation? Also the friend is me 🙃 
  • Seems like they are still owed an explanation!  Are there other ways to prevent fraudulent benefits?
  • Give everyone a PGP keypair when they’re born
  • Sounds great 🙂 
  • Thinking about this more… I can’t recall a (known) fraudster asking us for an explanation.
  • (TBC I generally strongly agree w/ this paper albeit with asterisks on the kinds of disclosure)
  • In the government context, due to its scale individuals can often fall victim to the tyranny of statistics. I.e. probabilistically people with X, Y, are likely to Z therefore… At what point is a probabilistic explanation “justified”?
  • Morally I want to say never, but nothing is ever 100% certain and that burden would be paralyzing. 
  • I think this is a super important question. And a lot of “experience-based intuition” (e.g. what judges might do) is really just shitty statistics.
  • To pull folks away from the recidivism example: a lot of benefits programs are qualified based on your census tract having a median income below the median income of the surrounding area, which is another way of saying “chances are you are poor.” Whereas a poor person who lives by a bunch of rich people might be excluded from benefits.

Sep 1, 2023


Questions
  • What happens beyond 262K tokens in the memory? Does performance flatline or get worse?
  • If your document is longer than fits, it starts swapping out previous memory (i.e., it’s a queue?) It removes the oldest memory. 
  • Gotcha, makes sense.
  • Edit: actually the way to think about this is that it’s absurdly long context length, as opposed to adding external memory. So I think this question isn’t that relevant anymore, since “current” documents aren’t that long.
  • It depends on what a “document” is. They dump entire codebases into a single “document”
  • How does the context size interact with the memory size? Do some of the results of kNN lookup replace the slots in your normal context?
  • Answered IRL.
  • What is the value of k?
  • Answered my own question, seems it’s 32.
  • Does every model use this now by default since it seems to help quite a lot?
  • Josh says it is 2.4x slower per step during training, but still seems valuable.
  • Does every head select its own top kNN?
  • Yes, I think every head does; for each query there are 65K cached keys/values it could match against, and we take the top k of those into our softmax() (rough sketch a few lines down)
  • 👍 
  • How does the model decide which strings to hold memories about?
  • It does not
  • Does it hold strings or just intermediate k/v states?
  • cached key/value pairs, not the original tokens
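A rough sketch of the per-query lookup over those cached key/value pairs (my own simplification: single head, no scaling or masking; `gate` stands in for the learned mixing parameter mentioned further down):
```python
import torch
import torch.nn.functional as F

def knn_augmented_attention(q, local_k, local_v, mem_k, mem_v, gate, top_k=32):
    """q: [n, d]; local_k/local_v: [m, d]; mem_k/mem_v: [M, d]; gate: scalar in (0, 1).
    Each query pulls its own top-k cached (key, value) pairs out of the external
    memory, and the memory attention output is mixed with local attention."""
    # Retrieve top-k memory slots per query by dot-product similarity.
    sims = q @ mem_k.T                                    # [n, M]
    idx = sims.topk(top_k, dim=-1).indices                # [n, top_k]
    k_sel, v_sel = mem_k[idx], mem_v[idx]                 # [n, top_k, d]
    mem_scores = (q.unsqueeze(1) * k_sel).sum(-1)         # [n, top_k]
    mem_out = F.softmax(mem_scores, dim=-1).unsqueeze(1) @ v_sel  # [n, 1, d]
    # Ordinary local attention over the current context.
    local_out = F.softmax(q @ local_k.T, dim=-1) @ local_v        # [n, d]
    return gate * mem_out.squeeze(1) + (1 - gate) * local_out
```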
  • I wonder what happens when you stack external memory / RAG on top of this. Do the improvements stack? (Is there a paper on this?)
  • I suspect there is - if not, we should experiment
  • Is there a better way to construct subsequences than the arbitrary way they’re doing? (top of page 4)
  • Yes
  • I don’t understand how finetuning works. What is there to tune?
  • Ben: the gating parameter!
  • Bas: thank you, I forgot about this
  • I don’t understand the concept of staleness. Why is training on a larger memory from scratch worse?

Aug 25, 2023


Questions
  • Wouldn’t fine-tuning on questions that the model is able to solve bias the model towards proposing more questions like those it can solve?

Aug 11, 2023


Questions:



Questions:
  • Except for the writing task, these are all tree-shaped problems. Is there any benefit to doing an explicit tree search structure vs. just doing a normal CoT that prompts the model to internally do tree search?
  • On what basis are intermediate states / branches terminated early?
  • Thought evaluation stage: generate a response that ends with “likely” or “impossible”
  • Is there anything special about their sampling that ensures they get a wide variety of possible next steps? (may partly not be relevant given the tasks at hand)
  • They discuss this at the bottom of page 3

August 4, 2023


Questions:
  • Does this technique apply to all fine-tuning cases?
  • RLHF is typically one-step, rather than multi-step as in other RL problems. Is DPO suitable for multi-step RL problems (i.e. ones requiring regularization across the Bellman equation)?
  • Is DPO basically a way to map human preference datasets to offline RL, where each example is a state-action pair?
  • Yes 
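For reference, a minimal sketch of the DPO objective as I understand it (`logp_*` are summed log-probs of the chosen/rejected responses under the policy and the frozen reference model; `beta` is their temperature on the implicit reward):
```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Single-step ("bandit") preference loss: push up the policy's margin on the
    chosen response relative to the reference model. There is no Bellman backup,
    which is what the multi-step question above is getting at."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```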

July 28, 2023


Questions:
  • Does Table 1 suggest that human feedback is much more effective than GPT-4 feedback? What if we got GPT-4 to generate feedback that was more similar to the human feedback?
  • Yes, and trying the latter would help, unclear how close you can get to human feedback.
  • What does the feedback look like? Or, how was the feedback given? Did they experiment with different ways of giving feedback?
  • Went through it; see page 28.
  • Why does self-repair help more (relative to plain generation) on harder problems than on easier ones?
  • Does prompt engineering change the slope or the y-intercept of these pass@t curves?
  • We think shape stays the same.

July 21, 2023


Questions:
  • What is the Transformer learning inside the matrix multiplications that allows it to add numbers? Is it similar to the grokking-modular-arithmetic blog post re: discrete Fourier transforms?
  • For practical purposes, wouldn’t we prefer something like toolformer – training the llm to just use a calculator in the right situations?
  • I think our interest in this paper is more about learning sequences and “reasoning” (explicit reasoning algorithms) than about arithmetic.
  • IMO an approach that generalized could be beneficial, but until then it seems unintegrated/detached from general reasoning
  • Does the table on page 18 affect how we think about what to train vs. fine-tune on?
  • Relatedly, I’ve heard from other people that it’s a bad idea to filter out “low quality” data from the model, and instead better to just label low-quality data as low quality and high-quality data as high quality. (At least with code - unsure about other data)
  • How can we formalize the idea that some sequences are easy to learn without a scratchpad, and some benefit a lot from a chain of thought style scratchpad? 
  • What patterns show up in internet text that would benefit from scratchpad?
  • Is all human language “easy” to learn? 

July 14, 2023


Questions:
  • Is Minecraft actually hard?
  • More specifically, if you get this particular API, how much of it is just creating these objects in the right sequence and then you get a diamond? 
  • +1, related to another question I had
  • Is the real world actually hard? How much of it is doing things in the right sequence? 🤔 jk, somewhat — I think your question is how much is solved by the API functions vs. how much the language model is doing. But I think there are a surprising number of computer tasks in the real world that give “high level functions” like this.
  • I have never played Minecraft and never will, please educate me
  • They misspelled Shield in fig 2 :p
  • What is the curriculum?
  • What is the baseline? Do they compare with any standard RL baselines?
  • No. They only compare to their “best effort” implementation of comparable prompting work with mineflayer
  • Here’s Dreamerv3 on Minecraft for comparison
  • Does this only work because GPT-blah knows what minecraft is? And this would basically not work in any novel setting?
  • Related, is it accurate to say that the model (/paper) is largely about guessing recipes for the technology tree? If so, the question is how intuitive the recipes are based on some human priors learned through pre-training.
  • I also assume that GPT-x has read the Minecraft wiki.
  • yeah 100%
  • This paper really gets at the knowledge/capability distinction in intelligence
  • I’m almost certain this is the case
  • What level of abstraction does mineflayer provide? Does it do most of the heavy lifting in terms of expressing actions you can do?
  • What happens when you die? Is there a timeout? 
  • What happens when the agent dies?
  • No actually it just respawns with all its stuff
  • Do you keep retrying the run until you succeed? ie. find a diamond you may die 100000000000 times but get lucky and not fall into lava one time
  • My interpretation was that the script has some way of being notified on death, but that you could keep exploring through it
  • What’s the abstraction / power level of the “built-in” actions it gets at the start?
  • I think these are listed at the bottom of page 24: exploreUntil, mineBlock, craftItem, +5 more
  • It sounds like once a skill is learned, it’s banked and retrieved whenever needed—does this mean it will always perform constituent actions the exact same way? For example, I can think of situations where flexibility is important for a really complicated task
  • When do we do this for avalon 😉?
  • How hard would it be to make a mindflayer equivalent for Avalon?
  • Relatively difficult — we don’t have the same interface as mineflayer 
  • There are some similarities to some of the ideas me and Josh discussed for the caretaker implementation
  • How do they make it make functions that are actually useful? Is there any special prompting?
  • They only maintain functions in their library if they’re deemed useful, i.e. if they end up with the correct item in inventory
  • Consistently or even just once? I feel like you can learn some really dumb skills that happen to work by accident…
  • What do they do to make things more composable?
  • They prompt GPT4 by saying “make sure your code is composable”
  • “Just Ask For Composability”
  • What other tasks would this sort of approach be useful for?
  • Maybe learning to navigate & accomplish tasks in a web browser
  • What is the scripting language it uses? JS
  • When they say “embodied”… in what way is this embodied?
  • Just text and json
  • Can you go into more detail about the “env feedback + interpreter errors” feedback loop?
  • For a task like “find a diamond” that might take a very long time, how are they figuring out whether the underlying script is correct or not?
  • I think the skill library is a very interesting idea. How do they make useful skills? What is a skill vs. a function? What are the failure modes? Would it be useful for us to make a skill library? What would that look like for coding agents vs. non-coding agents?
  • How do they learn useful embeddings of plain text descriptions of skills in the skill library?
  • +1
  • What is novel about this prompting strategy?
  • Given that it’s based on this high level API rather than learning from pixels, did they discuss why they didn’t choose to do this study with a purely text game like NetHack?
  • +1 (and I suspect the reason is that “minecraft” makes for a better title / abstract)
  • Can we go through some of the techniques that they’re using to make stuff composable?
  • GPT-4 probably hallucinated a lot of invalid code or arguments to functions (trying to feed wrong/nonexistent objects)—how can they improve this other than more prompt tuning or hoping for a “better model”? This feels like trial & error with some priors
  • Can we name one real world task that this would be useful for?
  • If you rot13’ed (or randomized) all the minecraft primitive names I wonder if it would figure out anything useful (read: obscure the prior knowledge it’s learned in pre-training). Then again, this may defeat humans too.
  • You would need to include a lot of game rules in the prompt, like “5 foos and 1 bar can be crafted into a blorp, etc” 
  • I don’t think it would have any idea of what it should be doing for the self-guided curriculum part

July 7, 2023


Questions:
  • Do they attribute the long range performance to the fact that they’ve adapted RMT’s to an encoder only model? / Would we expect this benchmark to not work on the previous RMTs?
  • They do not attribute it to encoder-only—I’m not sure they conducted the ablations to be sure whether it’s coming from pretrained BERT vs. encoder-only architecture
  • It almost feels to me like they were like “oh shoot, long context windows are a meme now, what can we change about our previous paper to be sufficiently different to republish and capitalize on the meme”? 😅 
  • Do they do tests/experiments with the number of memory tokens? 10 seems pretty small
  • Yes, it is pretty small. No, they don’t try other sizes
  • What is the speed/memory impact? If we need to keep two batches in memory (to propagate memory gradients between them) that seems like a pretty significant cost.
  • I think you’d only need to keep the memory module’s activations in memory to backprop gradients to it—their memory usage is reportedly constant
  • It seems like the reason why this works is because the memory block picks up on salient pieces of information throughout a very long input.  Why don’t we just use a vanilla LLM to find embeddings for each sequence, use these embeddings to find which sequences are most relevant to a particular final prediction task, and then use those sequences as additional context that’s used to generate the final next token prediction?
  • I think that’s a pretty interesting paper idea to explore yourself, Danny 🙂 
  • More seriously, I think the compression problem in general is best solved when you do as little hand-designing as possible for the structure of that compression (see: Bitter Lesson). The positive of this approach is that it’s really simple and it leaves the decision-making/heuristic of what to store & how to store it in the memory vectors up to gradient descent. Having a trained LLM generate these embeddings & then interpret/retrieve them is a “frozen” way to compress. 
  • How significant do we think this is? It feels a little gimmicky / limited
  • For example, it seems to me a version of their reasoning task with an input of lots of facts would just fail. Like they have a gimmick for dealing with interleaved obviously-useless data and that’s it.
  • Agree that it’s super gimmicky as implemented - I would be curious whether, without any task-specific training, it learns to memorize “useful” things, and how changing the ratio of memory to tokens would help (at its core this feels like an information capacity problem)
  • +1

June 30, 2023


Questions Block-Recurrent:


Questions RWKV:
  • How far do we think this can go really, like will future optimizations in Transformers outperform this approach? 
  • Equations 11-15
  • Why the exponents in equation 14?
  • Answering my own question… I clearly need to review basic Transformers because the softmax is just over (Q.K) not V (at least according to Eqn 7) - which makes sense, since Q.K is what should transform into a probability. So the exponentials/sums there are just doing the Softmax.
  • What problem exactly is U solving?
  • What is “token shift” in Figure 3?
  • “By linearly interpolating between the current input and the previous time step input, the model naturally aggregates and gates information in the input channels.”
  • What computation is it giving up by doing linear instead of quadratic compute? Is it giving up some important calculations? 
  • Is their K vector similar to the K vector in transformers
  • Can someone explain Eqn 9? AFT?
  • AFT: instead of doing attention (q.k) they do a weighted average based on a learned pairwise positional bias (w+k) (e.g. weigh local tokens more)
  • They take inspiration from that and implement as a (time) decay vector
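A toy sketch of that AFT-style mixing with a time-decay vector (my own simplification of the idea behind eqns 11-15, ignoring the bonus term U and RWKV's actual parallel/recurrent formulations):
```python
import torch

def aft_style_mix(q, k, v, decay):
    """Shapes: q, k, v are [T, d]; decay is [d]. Instead of q·k attention scores,
    each position takes a weighted average of past values with weights
    exp(k_i - (t - i) * decay), then gates the result with sigmoid(q_t)."""
    T, d = k.shape
    out = torch.zeros_like(v)
    for t in range(T):
        i = torch.arange(t + 1)
        w = torch.exp(k[: t + 1] - (t - i).unsqueeze(-1).float() * decay)  # [t+1, d]
        out[t] = torch.sigmoid(q[t]) * (w * v[: t + 1]).sum(0) / w.sum(0)
    return out
```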
Blog posts:
  • “During training, we use the transformer type formulation of the architecture, which allows massive parallelization (with a sort of attention which scales linearly with the number of tokens). For inference, we use an equivalent formulation which works like an RNN with a state. This allows us to get the best of both worlds.” wot

June 23, 2023


Questions:
  • For context, what is the performance of GPT-4 and GPT-3.5 on HumanEval? 
  • (see Table 1)
  • KJ: Table 1 is probably outdated
  • BF: it’s actually higher than this from our own experiments and I’ve also seen tweet threads with similar results to us. These numbers are the ones reported from the GPT-4 technical report (I suspect) — but the caveat is they constantly change the models so it’s hard to do proper science
  • What is it from our evals?
  • 70-75% gpt-3.5 / 80-85% gpt-4 (trying to find corroborating tweet thread)
  • Whoa
  • From Textbooks are all you need
  • how did they generate their synthetic textbook?
  • not clear but probably like Tiny Stories
  • did they run ablations on the data? what was the difference in performance when using the generated data vs not?
  • no they did not
  • MR: This makes me very skeptical about potential contamination with evals. Do they address that at all?
  • MR: They do at least attempt to in “4. Evaluation on unconventional problems with LLM grading”…  ok 5 addresses this thoroughly
  • Why did they pretrain and then fine-tune on generated data instead of pretraining on Stack plus generated data? Is there a benefit to having a fine-tuning phase vs. putting everything into the pretraining? (I guess it “targets” the output kind of)
  • +1
  • DG: My sense is that they wanted to separate these steps so that the model would first learn basic information during pre training, and then “consolidate” that information during fine-tuning on trickier examples
  • KJ: interesting. Wonder how much performance on HumanEval improves if we use more pretraining tokens and fine-tune on more data.
  • Is the purpose of this paper more about the benefits of pre training on textbooks, or about all the “emergent behaviors” that arise after fine tuning?
  • Abe: the purpose of this paper is about pretraining on generated data, not textbooks.
  • If the claim is that higher quality data is the main driver for better performance here, one question i have is: what if you add more data quantity but the added data is lower quality?
  • Also, why did the authors pick the problem of code + humanEval? would be nice if there were an even smaller/easier problem to use to test data quality related hypotheses
  • For both papers:
  • Is the theme - ask GPT4 how to solve/build better criteria → feed into smaller model → profit? I’m noticing a trend for that. 
  • Yes, there is definitely something there. Seems you can get better scaling by bootstrapping from existing LLMs.
  • For tiny stories
  • The creativity, grammar, consistency scores - are those the scores reported from the GPT model, or from a subset that are rated by human users?
  • From GPT-4 scoring
  • If it is the former, then why only use the ROUGE score? What other metric scores could be used besides that, e.g. embedding distance?
  • More of a general discussion point: this seems like another blow to the concept of emergent capabilities as these sudden phase transitions in LLM capability and it being more about a lack of the right metrics to evaluate smaller models.
  • A related general discussion point: it’s interesting that these smaller models seem to get so good with good training data, so if there were “phase transitions” they’d have to happen much earlier? I wonder what the loss curves look like compared to training on filtered Internet data.
  • As a response to that: what’s kind of exciting is that because the models are so much smaller, it’s pretty plausible to replicate and see how it shifts.
  • Did they do any experiments or make any comments about scaling up the model size and number of training tokens while still using this type of synthetic training data? I’d be curious what the scaling law looks like.
  • Do they try training with this same data on a larger model? Curious to ablate model size vs. dataset.
  • The particular completion task strikes me as something RNNs/LSTMs could also do well. What is attention / the transformer model adding there that feels novel (given the size)?
  • That’s actually an interesting comparison - given the size of the model, it’s reasonable to compare. If there truly isn’t a difference, then it feels like it’s truly an issue of data quality and training order rather than specific characteristics of transformers.
  • For our own future experiments
  • Would these be relatively easy for us to reproduce? Seems worth reproducing.
  • Should we make much crazier changes to our dataset?

May 19, 2023


  • Is there any kind of baseline comparison in terms of human (or GPT-4 I guess) preference between un-finetuned LLaMA 7B vs ChatGPT? How big of a gap is this FT closing? 
  • Answer: unclear, probably helps some given LLaMA un-FT seems quite bad at one of their prompts we tried, but then again maybe it’s a lot better in the non zero-shot setting
  • Abe: are there any good evals for this?
  • Answer: unclear, it’s hard. See “serious use”
  • Did they run this on standard evals (like MC)? Did it make any difference in performance?
  • Answer: No
  • “Only know as much as our evals” – would be good to have eval slices of different difficulty instead of just some big pool of undifferentiated samples
  • I vaguely recall that a lot of these were distributed as deltas from LLaMA… is the size of the deltas surprising at all in terms of information density / compared to LoRa approaches?

May 12, 2023


Questions:
  • Anything interesting in the image task section?

May 5, 2023


Questions:
  • What about the grokking phenomenon? If grokking is real (i.e. it learns an algorithm kind of suddenly), then wouldn’t that point to emergent capabilities actually happening?
  • +1 +1
  • IMO the point of grokking is that learning occurs due to regularization, separate from error minimization, I don’t think this paper is related
  • bad term for anything that isn’t a sudden step-change in understanding
  • Conclusion: grokking papers have this metric flaw
  • Can we please talk about their equations in page 4? I actually feel like their mathematical formalism does not prove the point they’re trying to make, but maybe I’m missing something.
  • It’s quite simplistic - they just show that if you have some error rate on every token and you assume errors are IID, then exact match over N tokens gives you a soft threshold (i.e., accuracy ≈ (1 − per-token error)^N ≈ exp(−N · error rate))
  • But the thing I’m confused about is the sequence length has nothing to do with the model size (when you are comparing two different models) so this “deforming the per-token error” thing doesn’t seem like the right thing to talk about?
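A toy illustration of the argument, with made-up numbers: per-token accuracy improves smoothly, but exact match over an N-token answer looks like a sudden jump.
```python
import numpy as np

p = np.linspace(0.80, 1.00, 21)      # smooth per-token accuracy improvement
for n in (1, 5, 20):
    # exact-match accuracy on an n-token answer under the IID assumption
    print(f"N={n:2d}:", np.round(p ** n, 3))
```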
  • How do you even get a continuous variant of an exact string match?
  • Bas: token edit distance: how many characters it gets correct. This is still technically discrete but at least not binary. Also obviously wrong because it’s not IID but it’s an improvement
  • Ooof, do we have some ideas for an even better alternative?
  • Maybe some kind of semantic vector distance?
  • Minor point, I think Token Edit Distance is how many it gets wrong - how many edit operations you need to do to transform it into the right string
  • Bryden: when would we ever want a metric like exact string match? Seems like a cursed metric to begin with. +1 +1
  • Math?
  • Not even then; consider 3/2 vs 1+1/2 vs 1.5 vs 1.50, etc. But you make a good point that the answer being correct or not usually is somewhat discrete.
  • I guess you can pull out the answer and run it through a calculator to verify the two are equivalent (similar to code execution)
  • Yeah, although it will forever be at least a little bit cursed if you’re doing serious math and the answer is an exact symbolic expression (something involving π, say) and the model gives this in digits.
  • Correct or not isn’t always discrete - for humans, for math, if I’m reasoning about the problem right then that is better than reasoning about it wrong, even if I end up with the wrong answer. 
  • It’s super cursed, yet it’s what most people use for evals. What I think we do want is exact string match on a super constrained version of the problem i.e. multiple choice with a very rigid answer format.
  • I think usually when people use accuracy metrics, they show a calibration plot, which is meant to reinforce accuracy as a good metric - can this “emergent non-emergence” happen with calibrated metrics? 
  • Considering most kinds of problems can be solved in many ways, is it really a good assumption that “emergence” is going to be indicated by sudden sharp improvements in performance? 
  • How are multiple correct answers even evaluated usually? Just using your closest match? That would seem to be even more sharp / discrete, and prone to this.
  • How would you define “emergence”? 
  • resulting non-random structure that depends on the (typically nonlinear) interaction of many elements and is otherwise absent. My point is that when training RNNs, there are tons of emergent things happening at all levels, and changes in performance will be smoothed by the somewhat anarchic production of many simultaneous, variably effective, algorithms for getting the same result. (as with the central limit theorem)
  • Do we expect most important problems in the world to be thresholding-style rather than regression-style? If so, do we expect future advancements in LLMs to display sudden jumps in difficult domains?
  • +1
  • I have a different variant of this question - do we expect that most / all threshold-style problems can’t be turned into regressions with a better metric?
  • Probably for evaluation we can, but for deployment it might be necessary to do thresholding, e.g. for classifying documents or writing functional code
  • Isn’t the eval / training stage the most important point of this, though?
  • The very last sentence of the paper provides a commentary on the current landscape of SOTA models. To what extent is the main argument of the paper motivated by these frustrations?
  • +1
  • Does it matter, if the main point still seems to hold?



Ideas
  • Partial credit for wrong answers when model gets several steps right (unsure how feasible to implement)
  • Yes partial credit seems very important. Maybe going away from “right” and “wrong” answers and making benchmarks with graded scores
  • Could we design an evaluation system that would enable the LLM to argue their answer somehow? Maybe that doesn’t make any sense.
  • training an expert network to behave like a teacher might wrt partial credit seems feasible.
  • For code eval, Michael suggestion: partial credit depending on what type of errors you’re getting. Ellie suggestion: different partial credit for passing different tests. Abe: curriculum learning (start with one-line code, progress to harder).
  • Another idea: cloze test style coding evals. Write the answer code, correctly formatted, but leave a bunch of variable names, etc. blank and ask the model to fill them in. Model has to understand what the code does and how it works but doesn’t worry about syntax.

April 14, 2023


Questions:
  • So is this concept of induction heads for in-context learning just learning bigrams of the context?
  • Answer: bit more than that, the bigrams can work with synonyms or even at the abstract/conceptual level
  • Can we look at that derivative of the log loss token comparison again - i’m trying to understand how that relates to the overall shift in model performance?
  • Yeah, I’d like to better understand the phase shift co-occurence argument.
  • Can we go over the plot that looks like the trunk of an elephant? (the PCA analysis I guess)
  • What are some of the implications of this work?
  • Prefix Matching Score
  • Prefix matching: Generate a sequence of 25 random tokens, excluding the most common and the least common tokens. Repeat this sequence 4 times and prepend a “start of sequence” token. (It would’ve been better if we had used a start of sequence token for the copying head evaluator as well, but we omitted it by mistake. Without the “start of sequence” token, some heads that were doing prefix matching on real data would get anomalously low scores on our test sequences.) Compute the attention pattern. The prefix matching score is the average of all attention pattern entries attending from a given token back to the tokens that preceded the same token in earlier repeats.
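A hedged sketch of how one could compute that score from an attention pattern (my own reading of the description above, not their code):
```python
import torch

def prefix_matching_score(attn: torch.Tensor, tokens: torch.Tensor) -> float:
    """attn: [T, T] attention pattern for one head; tokens: [T] token ids of the
    repeated random sequence. Average the attention each position pays back to
    earlier positions whose preceding token matches the current token, i.e.
    where a perfect induction head would look."""
    T = tokens.shape[0]
    total, count = 0.0, 0
    for t in range(1, T):
        targets = [j for j in range(1, t) if tokens[j - 1] == tokens[t]]
        if targets:
            total += attn[t, targets].sum().item()
            count += 1
    return total / max(count, 1)
```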
  • Can what LLMs are doing really just be broken down to induction heads? To me it feels like there’s something more complicated going on than “given A predict B” …
  • If two layers get us [A] [B]…[A] [B] then do 3 layers get us [A]..[B]..[C]…..[A][B][C], etc?
  • Anyone understand what the PCA of the loss in the second paper is all about?
  • They didn’t do any research around “engineered” synthetic (toy) induction heads, only learned ones, right?
  • Does this mean we can build LLMs with solely induction heads? What would happen? Is that a smart thing to do?
  • More generally, what would be the best way to demonstrate that induction heads are a “thing”
  • Does this allow us to improve learning?
  • If this is responsible for the majority of in context learning, how important are the MLPs (which have the majority of parameters)?
  • +1
  • +1
  • Alternatively, if we hardcode this behavior in, will it allow the regular attention heads to learn more sophisticated things?
  • Where’s the MLP comparison : )? - I.E observing if you could get any form of that representation in a pure MLP.
  • How does the bigram token shifting behavior affect the performance of these models as a function of depth?
  • How long of a first sequence can these trigger off of?
  • Is there some formal correspondence between this and tree grammars?
  • Is this really important for AI Safety?
  • Bas: No

Mar 31, 2023


Summary: 
  • Uses a model that generates a lot of answers to questions, e.g. math questions. Take answers that agree, use that as fine-tuning. About ~600K samples for fine-tuning.

Questions:
  • How did they make it generate a lot of different rollouts that are sufficiently diverse?
  • They didn’t do anything, they just generated with temperature = 0.7
  • Did they need to do anything special to make sure it always generates an answer? What do you do if it gets stuck in a loop or something? Just drop the rollout?
  • Unclear from the current reading, but from my own playing around, you’ll almost always get an answer eventually. My guess is they simply dropped those, or they were effectively dropped since the answer would be “undefined” and thus not the most common answer
  • How does majority voting work if it’s a tie (e.g. n completely different answers each with 1 vote, or all votes split across 2 answers or something)? Is there a minimum confidence needed to be included in the finetuning?
  • No minimum needed
  • Did not see any mention of ties, but with 32, was probably pretty unlikely. My guess is they dropped or included both, but we can check
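A minimal sketch of the voting step (`extract_answer` is a hypothetical parser, e.g. a regex for “The answer is …”; ties just fall to Counter's ordering here):
```python
from collections import Counter

def majority_answer(rollouts, extract_answer):
    """Sample many CoT rollouts (e.g. 32 at temperature 0.7), parse the final
    answer out of each, and keep the most common one. Unparseable rollouts are
    simply dropped, which matches the guess above."""
    answers = [a for a in (extract_answer(r) for r in rollouts) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```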
  • How would you do this for questions that don’t have a clear “answer”, or where parsing the “answer” from the generated text is harder? Like 32 answers to “why are trees green?” 
  • +1 
  • Clustering? Use an LLM to summarize? 
  • No idea
  • Or even situations where comparing the different answers is harder (i.e. they’re not numbers) - how did they handle this for things like DROP?
  • I think it’s just about getting the most consistent answer, so that applies to DROP, but not Bas’s example
  • But what does consistent mean if it’s not numbers but rather words or phrases? How do you measure similarity between answers?
  • ^ Yeah, seems like this wouldn’t work for any “why” questions. Wonder if there’s a variant of this we could get to work for “why” questions, which are required in reasoning?
  • Can we review some example questions from OpenBookQA and ANLI-A2 and DROP? It might help me better understand why it helps those tasks.
  • What’s the difference between the different prompting styles (CoT prompting vs. Standard prompting)?
  • Nvm: The Standard prompting examples are the same question-answer pairs with CoT prompting examples, except that reasoning is removed.
  • Why does fine-tuning on these chain of thought traces help so much?
  • There is very little web data of this format, so it is helpful to give the model a sense for the “form” that you’re looking for, or the “algorithm” it should use (but this is a very handwavy answer on my side, just guessing).
  • Also, looking at the examples, it does make sense that it helps. I have to show my work for those kinds of math problems anyway
  • Would it work better if they filtered out answers that were not accurate (i.e. incorrect answer to the question even though there was agreement)? Why did they keep those? (Or, did they?)
  • I think the reasoning here is that this method could be unsupervised and thus scaled up beyond labelled datasets for finetuning
  • Yes, the idea is that this is more scalable
  • I think a practical version of this would actually do what you suggested, and be better about cleaning out the wrong answers. Not clear how much it would help. It becomes a cost trade-off curve at that point, but is probably a MUCH cheaper way of generating training data
  • At the very least having a minimum confidence threshold seems like it would be a good idea
  • I like this idea ;)
  • Does it give you a sense of what the curve looks like for amount of training data vs improvement?
  • Figure 4b seems the closest? But this varies number of sampled answers, not the number of questions
  • Is this higher risk for hallucination? But like also might it come up with an alternative physics / world model? What are the consequences of an early incorrect answer? 
  • I would expect it to. If the initial 32 sample has a majority for the wrong answer, it will now train to more confidently generate that answer in future generations. 
  • In some ways, it might make it less likely to hallucinate since you’re training it to not generate wrong reasoning - but I do wonder if sometimes it gets the right answer with the wrong reasoning and trains on that
  • That’s interesting. But yeah, in general it seems less likely to hallucinate. 
  • I think I used the word hallucinate wrong 🙂 (although these answers are super interesting so I’m glad I asked that). What I was trying to ask is: is it possible to come up with a consistent alternative logical system / physics model? Basically, is it likely to become confidently “Wrong”? 
  • Ah, no, seems unlikely that it would be even capable of being wrong in any consistent way. (Although Bai’s answer below about doing this too much could amplify wrong methods / wrong answers.)
  • It feels like the performance on a given benchmark might be correlated with the domain of possible answers: if there are many possible answers, then majority voting is likely to behave differently than if there are, say, only 4 multiple-choice answers. I wonder what alternative approaches might work if you were much more likely to probabilistically CoT your way incorrectly to the right answer than doing arithmetic?
  • Wait, how is the final evaluation setting of the finetuned model formulated? Multiple-choice setting or no? Most of these benchmarks are not multiple choice by default.
  • Although OpenBookQA is.
  • High-level: this feels like bootstrapping and ought to be impossible due to some no-free-lunch theorem. Why is it possible? Is it generally going to be the case with reasoning systems that modal responses are better than samples? Like, could we do this with people?  Or is there something about these LLMs that make them worse at generating answers than checking. Related: what are we sacrificing? Presumably this lunch comes at some other cost. Diversity? Calibration?
  • Kanjun: This is wisdom of the crowds?
  • Kind of neat trick (from Wikipedia): For a given question, people are asked to give two responses: what they think the right answer is, and what they think popular opinion will be. The averaged difference between the two indicates the correct answer. It was found that the "surprisingly popular" algorithm reduces errors by 21.3 percent in comparison to simple majority votes. 
  • I was too quick to dismiss the compute: it might actually be realistic that this is a technique that trades off compute vs correctness, akin to AlphaZero
  • Yeah I would be interested to see what would happen if you ran this many times in iteration
  • Yeah, I can imagine it reducing diversity, and also amplifying biases: when there is no clear right answer, after finetuning with this method it will generate the biased answer more often.
  • Jamie and I were talking about how this is a way to exploit a gap between evaluation and generation ability, and that this helps close that gap. But it might not be able to do well on things that aren’t easy to evaluate, and it might make performance worse on those things (e.g. “why” questions). Though I could also believe that it is learning reasoning strategies and therefore would do better on those questions.
  • Yes very curious what Jamie has to say about this paper, seems very related to the generative-v-discriminative gap he brought up
  • Bawr: this is not wisdom of the crowd. This moves from linguistic probability to semantic probability. The model already knows the answer, but you don’t know what sentence gets that answer. This helps you get that answer with more sentences.
  • Comment: this idea seems related to AlphaGo / monte carlo tree search — you’re effectively using a more powerful version of the model (with multiple rollouts and majority voting) to generate training data for itself, kinda like how AlphaGo uses search
  • Yes, we’ve talked about this before in different contexts - there’s a cool analogy there. Though Bryden consistently points out that Alphazero is grounded in the actual environment (chess/go), whereas this seems to not need access to a ground truth
  • Does it do any ablations around whether fine-tuning on one of these question types in one eval set helps it perform better on other eval sets? For example, does just training on GSM8K help on ANLI or vice versa?
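  • A minimal sketch of the majority-vote data generation we discussed, assuming a hypothetical sample_cot_answers(question, k) helper that samples k chain-of-thought completions and parses out each final answer; the min_confidence filter is the extra filtering idea from above, not necessarily something the paper does:
      from collections import Counter

      def build_self_training_set(questions, sample_cot_answers, k=32, min_confidence=0.6):
          examples = []
          for q in questions:
              completions = sample_cot_answers(q, k)        # list of (reasoning, final_answer)
              counts = Counter(ans for _, ans in completions)
              majority_answer, votes = counts.most_common(1)[0]
              if votes / k < min_confidence:
                  continue                                   # drop low-agreement questions
              # keep only the sampled reasoning paths that reached the majority answer
              examples += [{"question": q, "target": reasoning}
                           for reasoning, ans in completions if ans == majority_answer]
          return examples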

Mar 24, 2023

Questions:
  • Is the attention in table 2 a pretrained transformer on text, or purely trained to accomplish this task?
  • Can we go into detail on 2.2 Linear attention?
  • How does the shift matrix allow it to remember previous tokens?
  • What’s the difference between state space model and RNN/GRU/LSTMs? Seems like just another variant of a model that processes a sequence sequentially.
  • What’s up with the title of this paper?
  • What is Flash Attention and what’s the relationship between that and FlashConv?
  • Can someone explain what's going on with the FFT? Is this introducing an n log n term? (rough sketch at the end of this list)
  • At any moment, is this model only capable of carrying forward a finite amount of memory about all previous inputs?
  • Are there any tradeoffs with this compared to attention? Are we going to see it supersede attention modules?
  • Is there a use case for models with very long context?
  • Why is it better at copying content from the prompt?
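  • Not from the discussion, just for reference on the FFT question above: a long 1-D convolution can be computed in O(n log n) with FFTs by zero-padding to 2n so the circular convolution matches the linear one. Whether this is exactly how FlashConv applies it is an assumption on my part; minimal numpy sketch:
      import numpy as np

      def fft_causal_conv(u, k):
          # y[t] = sum_{s<=t} k[s] * u[t-s], computed in O(n log n) instead of O(n^2)
          n = len(u)
          U = np.fft.rfft(u, 2 * n)
          K = np.fft.rfft(k, 2 * n)
          return np.fft.irfft(U * K, 2 * n)[:n]

      u, k = np.random.randn(1024), np.random.randn(1024)
      assert np.allclose(fft_causal_conv(u, k), np.convolve(u, k)[:1024])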

Mar 17, 2023


Main idea: Keep memory bank of every time model misunderstands the user’s intent, query that memory bank for better prompting.

Questions:
  • What exactly does it store in the memory bank? Is the entire interaction stored? Some selection?
  • MR: I think it’s just {initial_prompt}: {explicit_feedback}
  • Key-value pairs; key is the input x, value is the feedback fb
  • KJ: I wonder what happens if the feedback is not that good / doesn’t explain things very well. 
  • +1
  • Can we go through how the retriever works?
  • +1
  • Answer: It's just looking for the biggest dot product (tiny sketch of the lookup at the end of this section).
  • There was an extra piece about using the model to generate stuff like “this question is about X”, and the user adding feedback specifically about that, unlike in the very first example in the paper. How does this whole piece work?
  • What happens if for some subset of users “similar” means “sounds like”, and for the other half it’s “means something similar”? Does this / can this do more of an explicit disambiguation? Probably not, since at the end of the day it just appends some extra explanation to the input prompt, right?
  • ^ related to above question, what happens when you have conflicting feedback? What gets retrieved?
  • How does the user provide clarification? Does it get it from the normal conversation, or is there a meta-conversation happening on the side? If it is normal, how does it know what things the user said were related to task understanding?
  • Yeah, I’m not entirely sure because they have a bunch of figures in the paper that seem to have different templates for giving feedback. But it definitely seems like it’s structured as a dialogue so it’s part of the conversation.
  • Table 1 makes it seem like there are two chat-interfaces running in parallel
  • I am even more confused after reading this: "A note on feedback and understanding: Feedback fb and understanding u are two concepts that we repeatedly use in this work. Briefly, MemPrompt requires a model to spell out its understanding of the instruction (u). The user can then provide a feedback fb on the understanding. In the prompt, both fb and u are identical. Such examples are of the form x, u → u, y and their main purpose is to reinforce to the model that the input feedback u be used to generate the output." Can we make sure we understand the definitions?
  • How relevant is this for something like ChatGPT where the feedback is binary (thumbs up or down) and is kinda noisy?
  • Does it assume the user disagreeing with the model as the model being “wrong”? Eg, what if some other users actually want “wood” to be similar to “good” and the question itself is ambiguous?

  • How does this relate to LLMs that use scratchpads or have external memory (not sure if there’s a better name, maybe transformers + information retrieval)? It feels like they’re doing something kind of similar with this embedding look up.
  • It’s definitely a version of information retrieval. I think the thing that sets this apart a bit is that it focuses more on user intention than quality/correctness of answer. So it’s something you use even if the user doesn’t know the correct answer but at least knows that the LLM is not understanding the question correctly.
  • BF: related to MR’s question above, how do they determine the user intent? Or what’s the “model” they’re using?
  • [beginner question] Is there a way to feed this back into the LLM training, so that the capability could be more generalized (i.e., recognizing errors in meaning)?
  • BL: retraining the model is usually quite involved, if there is only a small amount of new data it’s easier to “retrain” it by feeding it as a prompt.
  • MB: Thank you!
  • KJ: I do feel like you could use these traces for something if you have enough data
  • Strikes me as mostly a “partially-automated prompt refiner”
  • Looks like clarification probability improving performance is simply about having more prompt data: “We observe that using a higher clarification probability leads to a sharp increase in instruction and label accuracy early on in the training for both ERTCAT and ERT-NL. This is because a higher clarification probability causes the feedback memory to fill up more quickly, providing more feedback for new questions.”
  • Unless memory is user-specific this seems really vulnerable to attacks (“by love I mean hate”) 
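  • A minimal sketch of my read of the memory + lookup (keys are embeddings of the input x, values are the feedback fb, retrieval is max dot product); the embed function and the similarity threshold are my own stand-ins, not the authors' code:
      import numpy as np

      class FeedbackMemory:
          def __init__(self, embed):
              self.embed = embed                 # hypothetical sentence-embedding function
              self.keys, self.values = [], []

          def write(self, x, feedback):
              self.keys.append(self.embed(x))    # key = embedding of the input x
              self.values.append(feedback)       # value = the user's feedback fb

          def lookup(self, x, threshold=0.5):
              if not self.keys:
                  return None
              q = self.embed(x)
              scores = np.array([q @ key for key in self.keys])   # "biggest dot product"
              best = int(np.argmax(scores))
              return self.values[best] if scores[best] >= threshold else None

      # The retrieved feedback (if any) is then just prepended to the prompt.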

Mar 10, 2023

InstructGPT

Questions:
  • What is the loss for the human preference model?
  • What is the loss for stage 3 language model with PPO, what sort of tricks do they use (eg, advantage estimation, discounting, gradient clipping)?
  • How does the supervised fine tuning work?
  • Does this matter? (how big are the performance differences really)
  • I'm confused about how an RL policy generates an LLM output in step 3. I thought RL models mapped [state] → [action]. Are we just cleverly reinterpreting the LLM as an RL model and training it with PPO?
  • I think so, yes
  • How big was the PPO model? Or is it just the LLM?
  • Why does it become more truthful?
  • What models in the API (eg curie or davinci) have incorporated InstructGPT style changes?
  • How much data did they use? Did they have a scaling curve for how more data helps?

Mar 3, 2023

LLaMA

Summary:
  • They used about $10,000,000 in hardware with 2,000 A100s
  • Chinchilla scaling laws are all based on training budget, but inference budget matters - and that’s a reason to do a smaller model that’s more expensive to train.
  • Results shown on 20 benchmarks, comparable to PaLM and Chinchilla, comparable on GSM8K to Minerva of the same size (but Minerva was fine-tuned), so pretty impressive
  • Massive multitask language understanding (MMLU) doesn't do nearly as well - they postulate it's because they didn't train on nearly as many books and papers.
  • CommonCrawl - they trained a model to classify whether a page would be usable as a Wikipedia reference. Bai has also seen people filter based on whether a page has been upvoted on Reddit at least 3 times.
  • We should also train a separate classifier for fiction and add that stuff from CommonCrawl
  • They said toxicity increases with model size for their models. They also cited another paper that said this. But they also found that Gopher and Chinchilla didn’t see this.
  • It’s possible that smaller models are just worse at doing what you tell them to do when you tell them to do something toxic.
  • Also you might be going into lower quality data, and eliciting those parts of the document space (e.g. prompt “I’m not racist, but…”).

Questions:
  • They used xformers, does it matter which library to use to train it? Or are they basically equivalent?
  • Mostly equivalent in results, sometimes very different for flops performance / throughput.
  • So then which ones are fastest / slowest? How good is huggingface?
  • HF depends on the model a lot - for stuff we’re interested in, mosaic tends to be way faster, and for gpt-neox in particular, there’s a completely separate repo for training. Basically though, there’s two levels to this, one is how you split the model to multiple machines, the other part is optimizing operations within the model, xformers is more about the latter.
  • One thing they didn’t do is evaluation set filtering, could this have inflated their results? Also they didn’t do much deduplication (only the CCNet one)
  • What does evaluation set filtering entail other than de-duplication and avoiding including benchmarks?
  • Yea it appears they didn’t do this..
  • Uh-oh.
  • Are there changes we're surprised they didn't make from the original 2017 transformer model?
  • They have different embeddings, it’s decoder-only, different activation functions, different normalizations.
  • Are almost all the score differences due to training data differences?
  • What are the primary sources of the improvements in LLaMA to make their 65B one as good as PALM-540B? Anything aside from data?
  • I think it’s mostly just more data i.e. using Chinchilla scaling
  • I didn’t see ANLI in there… wonder if they have any other evals anywhere (esp weird bc it’s their dataset)
  • Did they try doing some fine-tuning (particularly on the smaller 7B models) before any of the tasks? Any comparisons for post-fine-tuning results across model sizes?
  • They did some simple instruction fine-tuning and showed it worked well.
  • No, they actually just did it on the 65B model, I think because they wanted to compare it to competitors.
  • Sad.

Feb 24, 2023


 Questions:
  • How does it work and what is even going on?
  • Ok, thanks I feel better now 🙂 
  • Can someone explain InstructGPT?
  • What other tasks is HER used in?
  • What’s the KL penalty mentioned in part 5?
  • Are they using any instructions besides “be correct/be wrong”?
  • They imply that they could use a less scripted instruction relabeler. Is there some good idea hiding there? 
  • What is the contrastive loss term?
  • +1
  • How do you score alignment without human feedback? Is any amount of human feedback needed to train the scorer?
  • Why is this algorithm related to PPO (Figure 4)?
  • +1
  • More generally, what does the reinforcement learning formalism add? Just a way to encode learning from negative examples (i.e., using a reward function that assigns a -1)?

Feb 17, 2023


Questions
  • Why do they finetune it on a corpus with question-answer pairs, instead of just questions with no answers?
  • I guess to answer my own question, it’s so that it would know how to react to the answers during Inference
  • Was the calendar lookup of the current day ever actually useful in the training corpus? Isn’t all the training data from the past?
  • Not really, the model wasn’t very likely to ever output an API call to the calendar
  • I wish they had evaluated the model with different tau thresholds?
  • Yes +1
  • Why do we think Toolformer (Disabled) outperformed GPT-J?
  • hahaha, welcome to ML (bc it’s their system so they actually tuned it and they didnt bother tuning GPT-J is my guess)
  • How many API calls do you actually have to make (during training), as a function of dataset size?
  • at most 25 calls per example
  • Do they do anything with the API inputs so at least it caches better? I guess the real question is how many actual API calls they need to make.
  • Can it make more complex API calls? Eg: eval(this python program). Would this be useful?
  • How did they pipeline this to handle lag from slow non-web APIs?
  • They didn’t do that in this paper (I think to Michael’s point they were all local or cheap to call)
  • I’m confused about A.2. When they call the API, do they actually run some code to properly format a request, or do they use these prompts? But if it is the prompts, how do they know it’s correct? 
  • It feels to me like the interesting questions in integrating APIs with transformers are 
  • how to format requests to the API?
  • Having a clever (i.e., trained) policy for when and what to search. This seems highly non-trivial and I’m not sure how they tackle this problem
  • Can we compare this to what Bing is doing?

Feb 10, 2023

  • How does the fine-tuning work? What’s the difference between SL fine-tuning and RL fine-tuning?
  • SL is using the revisions as a target. RL fine-tuning is training preference model on human ratings (specifically “which of these answers is better”) and then doing RL against that. 
  • Maybe what I don’t understand is how the training setup works. i.e. Is there a classifier on top of the output that classifies the output as “helpful” or “not helpful” and “harmful” or “not harmful”? If so, how does the output of that classifier flow back into training?
  • RLHF uses the human preferences to learn a reward (preference) model.
  • RLAIF then learns a policy for producing better text? 
  • They talk about using less human feedback or leveraging human feedback better, how does that work?
  • Unclear, they still use humans at various points and it’s hard to tease out specifically where humans come in. The preference model is supposed to get around that but unclear if that scales. 
  • How hard was their red teaming? Asking because I can imagine trading different kinds of harm against one another, i.e. instead of asking how to hack my neighbour’s wifi, asking about doing that in general because I’m trying to prevent other harm, etc.
  • Also done by crowdworkers. They instruct people to try to break the agent into endorsing National Socialism etc
  • How many human feedbacks did they get for the initial RLHF model?

  • How exactly does the constitution get used?
  • Is the “constitution” literally just those keywords they put into a generic prompt in various combinations?
  • Yes. It’s a list of 16 critiques and 10 revision prompts
  • How does the preference model get trained?
  • See above: people rate one answer better than another, and the PM does supervised learning to mimic those judgements. So like behavior cloning? I think? (a sketch of the standard pairwise loss is at the end of this list)
  • So the pipeline’s basically “first do some standard RL tuning, then do the AI-critique-revision trick?” Am I missing any big steps there?
  • I think we should think of the contribution less as a full pipeline, but more about what changes before and after the AI-critique-revision loop. The first and last steps are more like pre and postprocessing. Although the RL and the preference model seems to matter too
  • Is the final model simply doing a single basic pass, or when you talk to Claude is it doing revisions or secretly appending a CoT prompt or something?
  • I think just the prompt “let’s think step by step” but nothing else. 
  • I guess one advantage of AI feedback is that you can get a lot of it really cheaply… any idea how much AI feedback they used relative to human feedback?
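  • For the preference-model question above: the standard pairwise (Bradley-Terry-style) loss that RLHF-style preference models are usually trained with; whether Anthropic's PM uses exactly this form is an assumption here:
      import torch
      import torch.nn.functional as F

      def preference_loss(reward_chosen, reward_rejected):
          # reward_chosen / reward_rejected: (batch,) scores the preference model assigns
          # to the answer the rater preferred vs. the one they rejected. The loss pushes
          # r(chosen) above r(rejected); RL then optimizes the policy against this PM.
          return -F.logsigmoid(reward_chosen - reward_rejected).mean()

      loss = preference_loss(torch.randn(8), torch.randn(8))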

Ideas
  • Condition on an agent persona
  • Make an ultra hateful thing and subtract it out
  • Critiques can be used differently: does this make sense, is this a non sequitur, does this seem factually correct
  • Critique might be the most powerful piece of this - take advantage of generation/discrimination gap
  • Can we use this to broaden the training data? e.g. in the middle of the training data, can we generate much higher entropy data. You can use it to critique or augment the training data.
  • Model that could go off and read about something for a while

Jan 6, 2023

RETRO: Improving language models by retrieving from trillions of tokens

  • How big are the sections of text that they’re using for the MIPS query?
  • 64 tokens for both key and value
  • Are the found items of text just included as context for the transformer?
  • The encoded nearest neighbors from the retrieval dataset are used as keys and values into chunked cross attention.
  • Why does this technique only work for very large MIPS databases?
  • How often does it retrieve a near-perfect match for the query text? Should we be worried that it’s overfitting?
  • Is it natural to include a parameter to balance between trusting the retrieval database vs trusting weights?
  • What’s the intuition behind why this self attention feeding into chunked cross attention splits up the task into learning the structure of language vs. recalling facts?
  • How do they choose chunks?
  • Chunks are 128 tokens
  • Why does this work better at all?
  • Unclear if this causes the weights to be used in a better way.
  • What is the intuition behind what the nearest neighbor encoded chunks are doing for the model?
  • If we add a bunch of information to the retrieval dataset without retraining the model, can it now work well on that new information?
  • Maybe? If we change e.g. the capital of France everywhere in the retrieval dataset it might output the right thing but the information might be in its model parameters so it might not.
  • How does the nearest neighbor database retrieval work, roughly? (rough sketch at the end of this question list)
  • Why did they use a weird bpb (bits-per-byte) metric instead of perplexity?
  • They used both! In different places 😕 
  • Can this setup easily be modified to support stuff other than language modeling? (eg: seq2seq, sequence classification, translation)
  • Why did they invent chunked cross attention instead of standard cross attention?
  • What happens when a chunk is spanning multiple sentences or paragraphs? Or do they make sure this doesn’t happen?
  • Can you explain why they need to be very careful about “cheating” again?
  • How big is the transformer encoder? Did they do any ablations here? How necessary is it? 
  • But really, why retrieve data?
  • Did they try tiny tiny chunks? (like one word)
  • Previous work did one word but it’s slower because they use k-nearest neighbors.
  • Does this architecture help for model interpretability?
  • Will this retrieval be useful for online situation modeling?
  • Can this be used for online rumination by asking the model to constantly generate implications of the current text?
  • What heuristics could be used to improve the use of this retrieval cache?
  • Could it use a predictive model to only store items in the retrieval database if it predicts that they’ll be useful?
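  • Rough sketch of the retrieval step as discussed above: the corpus is split into 64-token chunks, each chunk is embedded once (the paper uses a frozen BERT; embed_chunk here is a hypothetical stand-in), and the nearest chunks for a query chunk are found by similarity search (brute force here, an approximate index in the paper). The retrieved chunks, plus their continuations, are what chunked cross-attention attends to.
      import numpy as np

      def retrieve_neighbors(query_tokens, db_chunks, db_embeddings, embed_chunk, k=2):
          # db_chunks: list of 64-token chunks; db_embeddings: (num_chunks, d) precomputed
          q = embed_chunk(query_tokens)          # (d,)
          scores = db_embeddings @ q             # brute-force maximum inner product search
          top = np.argsort(-scores)[:k]
          return [db_chunks[i] for i in top]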

  • Issue: knn and bert embeddings aren’t designed to perform well in a retrieval setting, so maybe that’s a good avenue for improvement.
  • Using a smaller task and smaller retrieval dataset could better show the benefit. (Related work does some of this.)

Nov 18, 2022

Temporally Consistent Video Transformer for Long-Term Video Prediction

  • Why the codebook discretization?
  • Likely for compute and memory reasons, in order to lengthen the context window. It breaks the space of possible latents down into a grid and maps each latent to one spot in the grid: 32 tokens, where each token is one of 1024 codebook values. (tiny quantization sketch at the end of this list)
  • What is the size of x1 and z1?
  • What makes this “temporally consistent” vs other video transformers?
  • A bunch of tricks allow you to have a longer context length: latent compression, codebook, DropLoss, conditioning on the previous frame.
  • How does the temporal transformer work?
  • How does the posterior affect the decoder?
  • How does the dataset evaluation normally work for temporal consistency?
  • Use PSNR, FVD, etc. to compare generated frames to true video frames. Seems like action-conditioning helps evaluate for temporal consistency (i.e. given the same actions does it end up generating the same frame).
  • What is the main novelty of this paper over previous work on video generation?
  • A longer context window due to these tricks.
  • Is this useful for us?
  • DropLoss seems useful to try. Same with some other compression stuff.
  • This is a much smaller Transformer which is nice, only 30 GPU days to train. That’s the biggest thing that’s useful.
  • The sampling time is pretty good, 40x faster than Perceiver AR. This matters for our agent.
  • Does the quantization result in “less nuanced/varied” generated videos? As in, does it end up only ever using one of the transitions encoded by the discrete tokens?
  • You lose expressiveness because you use your codes for things that normally show up in distribution. So if you have something that’s unusual or out of distribution then it wouldn’t encode that very well.
  • They did do some work to use the discrete latent space well.
  • Why do the conditional encodings learn better representations for video predictions? 
  • This concatenates the previous frame with the current frame. The images are probably very similar and have a lot of features that are probably the same. You are more likely to capture the time-dependent similarities in the representation.
  • Keyframes could be an interesting extension for real video.
  • “We also ablate the codebook size, showing that although there exists an optimal codebook size, it does not matter too much as long as there are not too many codes, which may make it more difficult for the prior to learn.”
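  • Tiny sketch of the codebook discretization mentioned above: each continuous latent gets snapped to its nearest codebook entry, so a frame becomes (say) 32 discrete indices into a 1024-entry codebook. Shapes here are illustrative, not the paper's exact ones.
      import numpy as np

      def quantize(latents, codebook):
          # latents: (num_tokens, d) continuous vectors for one frame, e.g. (32, d)
          # codebook: (num_codes, d), e.g. (1024, d)
          dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
          idx = dists.argmin(axis=1)             # index of the nearest code per latent
          return idx, codebook[idx]              # discrete ids + snapped ("quantized") vectors

      ids, snapped = quantize(np.random.randn(32, 16), np.random.randn(1024, 16))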

Oct 14, 2022

  • Why should we care?
  • I don’t intuitively understand the geometric diagrams and how they change with sparsity, can we dig in more?
  • +1
  • +1 especially when the geometries are combined
  • +1
  • I get the definition of privileged basis but am not understanding how to make something a privileged basis. How is it related to adding nonlinearities e.g. Relu?
  • “A linear representation exhibits superposition if W^T W is not invertible. If it's invertible, it does not exhibit superposition.” Didn't follow this.
  • +1
  • Does anyone have intuition for interpreting the matrix W^T W? (toy sketch after this question list)
  • I can field this one if you want! ~Jamie 
  • ^ yes please, Josh and I must leave in 2 min
  • They say lower loss is not always better because the geometry regime changes - how do we know which geometry is “better”?
  • How does weight regularization like L2 regularization or dropout affect the formation of superposition?
  • +1
  • They mention L1 regularization reduces polysemantism, potentially (my speculation) by killing off higher order n-gons
  • Does this occur with non-binary inputs?
  • I don’t think the inputs were binary! I think they were either zero or sampled from U[0,1]
  • Oh yeah you’re right. “synthetic distribution to have features be zero with probability S and otherwise uniformly distributed between [0,1]”
  • How much are these superposition patterns shaped by the generation or interpretation sides?
  • Can one-hot vectors of arbitrary length be autoencoded in 2 dimensions as an n-gon? Is that relevant?
  • What can we learn about real neurons from this paper? How do we expect them to differ?
  • +1
  • Could there be an analytic method to determine which geometry will best represent the data?
  • Why do you get things like “a bunch of digons plus some pentagons” instead of just one big polyhedron with lots of dimensions and lots of vertices?
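  • A toy way to poke at the W^T W question above (my own illustration, not from the paper): with more features than dimensions, W^T W can't be the identity; the diagonal says how strongly each feature is represented and the off-diagonal entries show which feature pairs interfere, i.e. superposition.
      import numpy as np

      rng = np.random.default_rng(0)
      n_dims, n_features = 2, 5
      W = rng.normal(size=(n_dims, n_features))     # columns = feature directions
      W /= np.linalg.norm(W, axis=0)                # unit-norm features

      gram = W.T @ W                                # (n_features, n_features)
      # diagonal: ||W_i||^2, how much each feature is represented (1 here)
      # off-diagonal: dot products between feature directions = interference
      print(np.round(gram, 2))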

these are more researchy followup Qs but…
  • How does this interact with feature-learning in the infinite-width (muP) limit? Their results would suggest that there’s no need for polysemanticity in that limit
  • ^Relatedly, does a wider model trained for less time behave like a narrower model in their experiments? (That one cylindrical time dynamics plot made it look like this was so!)


July 21, 2022

BYOL-Explore

  • I don't understand the intuition behind the EMA target encoder. What would happen if you didn't have it slowly update like that?
  • In the beginning of training you might get collapse where the encoder just always predicts 0s and the predictor also always predicts 0s.
  • Is there anything smelly in here / places where you would expect it to work poorly on certain types of exploration problems?
  • +1
  • Can you append this exploration method to something like Dreamer?
  • Maybe, but it doesn’t predict any rewards. So we need to think about how it would work.
  • Can we go through the online network loss function (with all its summations) carefully?
  • It’s summing the loss first across predictions, and then averaging across every step in your trajectory, and then averaging across the batch.
  • I’m still confused by how the intrinsic reward works (in the “world model uncertainties” section).
  • It's p+q = t+1 because you can predict each observation in multiple ways: e.g. you can predict O3 either by taking 2 actions from O1, or by predicting from the next observation O2. 
  • "Later on, if the previously nullified rewards remain, they will naturally become the ones with highest uncertainties..." ← Why?
  • Where are the batch norms?
  • It seems like there are no batch norms. In the encoder they do some group norms.
  • Can this setup work with other contrastive methods (ie. SimCLR, MoCo, etc)?
  • Would want to see if data augmentation works here. Though we may want to be selective about augmentations (jitter and crop likely make sense). But maybe not e.g. color shifting.
  • Where would data augmentations be added? The target encoder.
  • What is PopArt normalization?
  • What is training time like? (eg time per 1m steps on 1 gpu)
  • About 20 minutes per million steps on 1 GPU.
  • (Note that in the paper charts are Learner steps instead of Environment steps.)
  • Is the rollout (open loop) world model ever used for the policy?
  • No
  • The world model closed loop RNN and Encoder are always used for the policy, right? Is that necessary?

July 8, 2022

  • How many steps did this work out to?
  • What compute did they have for learning in 1hr? 128 TPU pods or 1 gpu or somewhere in between?
  • Implies that they trained on a single GPU (but that seems incredible?).
  • What counts as a “step”?
  • Observations (training the world model) and rollouts (training the actor-critic) are done asynchronously. In RL every environment observation is a step. It doesn’t tell us how many rollouts they’re doing.
  • Can we review how the Dreamer architecture works?
  • Why is it that the robot is able to adapt more quickly to being pushed?
  • Because we’re explicitly learning the dynamics of how one state is related to future states, e.g. “If I’m leaning to the right if I push this leg down it’ll have this effect whereas if I put this other leg down it’ll have this other effect” - it can maybe 
  • World model is kind of a bad simulator, so doing imagination is like simulating some of these states.
  • How do they deal with proprioception / sensor fusion?
  • Paper doesn't say. Danijar's code has a default: visual input goes through a CNN, proprioceptive input goes through a 4-layer MLP. The outputs of the CNN and MLP are concatenated and then passed through a final layer.
  • How does the policy optimization get parallelized?
  • It’s parallelized across all frames, not just across trajectories, because of imagination.
  • What are the action spaces? What do they output?
  • Why does it seem like they have so many interventions in their 1 hr training? (see video)
  • They intervene every time the robot tries to walk out of its training area (which I guess is a lot - dunno why they didn't use a bigger area) to relocate it, but they keep the joints as they are
  • What changes were made between Dreamer v2 and Daydreamer? (Are there any architectural changes? Does Dreamer v2 get proprioceptive input?)
  • No architectural changes. (Except I think Dreamer v2 does not get proprioceptive input.)
  • Do they use the same set of hyperparameters for all tasks? (only see one set of hyperparameters in the appendix)
  • Yes, they said so in the abstract.

  • is this useful?
  • Abe thinks after reading everything it doesn’t feel as useful. They do show it’s not that much worse on most benchmarks.
  • It does do well with sparse rewards. It does seem good at exploring, it is better than Plan2Explore, it doesn’t need an ensemble of Dreamers.
  • what is the exploration part?
  • During training, the manager gives reward to states it hasn’t seen that often. This is the exploration reward.
  • This ablates whether the exploration reward is used while training the manager policy, the worker policy, both, or none. It looks like exploration reward is mostly only helpful in Ant Maze.
  • Could we review the ablations in more detail? I’m curious what things seem to make this work well vs. don’t matter.
  • How does the manager get trained?
  • Does this run super slowly? (Or I guess just 12% slower than Dreamer v2 since the manager only runs every 8 frames?)
  • No. The manager and the worker both train on the same rollouts. So it may be a bit slower but not significantly.
  • Does it perform better / learn more quickly than other networks on standard benchmarks (e.g. Walker, DMLab)? 
  • No, see ablations.
  • Is everything trained end to end? Is there any pretraining of any of the components?
  • Everything is trained in parallel, not end-to-end. The world model is trained separately from the policies, because the world model is always trained on the replay buffer and the policy is always trained on the rollouts.
  • Can you explain the worker reward?
  • It's this reward they made up. It's the dot product of the goal state and the next state, with both vectors normalized by the max of the two norms. Intuitively this is cosine similarity between the two states, normalized to the longer of the two vectors. (tiny sketch below)
  • This cosine-max distance seems to work a little better than other distances.
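  • Tiny sketch of that worker reward, taking the description above literally (normalize both vectors by the larger of the two norms, then take the dot product); whether this matches the paper's exact formula is an assumption:
      import numpy as np

      def worker_reward(goal, state, eps=1e-8):
          # Equals cosine similarity when ||goal|| == ||state||, and shrinks toward 0
          # when one vector is much longer than the other.
          m = max(np.linalg.norm(goal), np.linalg.norm(state)) + eps
          return float((goal / m) @ (state / m))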

  • I’m trying to understand how this is “hierarchical.” It chooses a new goal every k steps, but without knowing the ultimate final goal (for which it gets sparse reward at the end), in what sense is it breaking down the problem into intermediate goals?
  • nvm thought about it some more and I think I understand.
  • Maybe it's a time hierarchy? Though I don't know that it's very successful at learning this time hierarchy. What is missing here is good evidence that it is successfully breaking down problems into subproblems.
  • Another way to think about it is below.
  • Slightly confused about why they call the output of the manager policy “abstract actions” - aren’t they abstract goals? I guess maybe by definition the output of a policy is always called an action…?
  • Can think of it as manager gives an abstract action (e.g. “grasp object”) and worker figures out primitive actions to accomplish the abstract action (e.g. controlling all your joints)
  • Yi-Fu is surprised this works at all. You’d expect to have to pretrain the worker.
  • However, because they’re conditioning on the decoded image as the goal, instead of conditioning on the compact representation of the goal, you don’t need to have a good manager / goal encoder to train a good worker, so the worker does get effectively trained (this gets around the GAN dynamic).

July 1, 2022

BIG-bench

  • Why does PaLM perform so much better after 1 or 2 shots? Thought it was just scaled up and parallelized.
  • A bunch of small changes
  • Better tokenization
  • Probably closer to Chinchilla scaling laws instead of GPT scaling laws
  • HHH task - it’s possible that the model performs worse than chance because the “true” responses are just longer.

March 4, 2022

Discovering and Achieving Goals via World Models: https://danijar.com/asset/lexa/paper.pdf

Questions

World model
  • What was the architecture? (Recurrent State Space Model (RSSM) ?)
  • How does it learn the world model from the replay buffer? Is it self-supervised predict future frame?
  • The world model is the MLP above, which takes previous hidden state and current observation, and outputs predicted next observation latent (z_hat). It learns the world model as above, by trying to make z_hat and z the same, plus the VAE at the bottom. 
  • How does it learn the policy given the world model?
  • It learns two separate policies (explorer and achiever). Explorer reward is variance of ensemble of transition functions. Achiever reward is temporal distance between current observation and goal observation.

Explorer
  • How is it generating sequences to consider? Randomly or with some sort of recurrent network that tries to maximize information gain?
  • They train a 1-step model to predict the next model state from the current model state, and then make an ensemble of these models, differently initialized. Then they try to maximize the variance of the ensemble in order to find states the model is most uncertain about. (tiny sketch at the end of the Explorer questions)
  • What is the “imagination”? Is it a world model in a simulation or in some internal vector space?
  • It is generated from the hidden state and the previous predicted latent (z_hat in world model architecture above), by the MLP above. It does not require new observations from the simulation, because it can just sample from z_hat.
  • What is epistemic uncertainty?
  • “How much uncertainty do I have about the future?” not things that can’t be known, e.g. rolling a die or watching static noise on a TV.
  • What is ensemble disagreement?
  • It is how much disagreement is in the predictions of the differently initialized models in the ensemble, measured by variance of the output state distributions.
  • When they say “frontier” in the introduction what does that mean?
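  • Minimal sketch of the disagreement reward described above; ensemble is a list of differently-initialized 1-step predictors (hypothetical callables mapping the current model state to a predicted next state):
      import numpy as np

      def disagreement_reward(ensemble, model_state):
          preds = np.stack([f(model_state) for f in ensemble])   # (n_models, d)
          # variance across ensemble members, averaged over latent dimensions:
          # large where the 1-step models disagree, i.e. where the world model is uncertain
          return float(preds.var(axis=0).mean())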

Abe doesn’t like the exploration part of this paper - think we can do better using other exploration techniques. This is only exploring one state into the future.

Achiever
  • What’s the Achiever reward? Intuitively what does cosine/temporal distance mean?
  • Temporal distance is how many steps I need to go through to get from one observation to another observation. 
  • How are they using goal images? Are these turned into latent representations? Do you just hold the resulting image for every entry in the replay buffer? Is the loss for achiever just pixel loss?
  • See above, temporal distance.
  • How does it estimate information gain?
  • A: Disagreement among an ensemble of next state predicting models
  • What was training time?
  • 6-8 million samples (simulated frames generated). This would be a few hours on a single GPU. Probably used way more imagined frames, maybe 100x more.

  • How does hindsight experience replay normally work? Does it also get rewarded for matching an image?
  • Goal conditioning is where you sample trajectories from your history and you just label the last state of the trajectory as your goal. Hindsight experience replay is goal conditioning where you can observe/evaluate the goal.
  • Does this approach need to explore the entire environment? Did they do any experiments on slightly or significantly altered environments?
  • No
  • Do they train the world model first, and then the explorer?
  • They train the world model, then the explorer, then the achiever - a few steps of each at a time (how many steps is a hyperparameter).

January 14, 2022

Towards mental time travel: a hierarchical memory for reinforcement learning agents

Questions
  • None of these tasks are public, right? :-/
  • No, not really
  • Why can you look away for arbitrarily long? Does the lookaway all end up in a single chunk?
  • Answer: It can pay attention to any chunk it’s seen over its lifetime.
  • How does the chunking work? How do things end up in a chunk?
  • +1
  • Answer: Fixed size chunk. Every 20 frames is a chunk.
  • How do you generate a summary for a chunk?
  • Answer: The chunk summary is the average of all the items in the chunk. (chunking + top-k lookup sketched at the end of this list)
  • How does the training / backprop work? What gets gradient? Looks like there are an awful lot of heads…
  • +1
  • +1
  • +1
  • What is surprising about this paper (if anything?) It feels like B-Trees but for attention?
  • +1
  • +1, couldn’t you also do top-k on all latents and then take ordered windows around them?
  • 🚧 How is k chosen for top-k? How sensitive is performance to k?
  • Section D5 claims it can perform well with top-1 through top-16… this seems surprising (top-1) given how chunking is done just based on temporal order?
  • +1
  • We don’t understand this still.
  • How long were the sequences they were looking at? Does this break down when things get VERY long?
  • +1
  • In addition to improving performance does this (1) reduce compute beyond standard attention (I think yes) (2) reduce what you need to retain in memory (I think no) or some combination of both?
  • (1) yes
  • (2) in GPU memory yes, can store stuff on disk and look it up
  • How many frames end up in one chunk? Is that number static?
  • How is the object permanence task set up?
  • How much better does this do than normal transformers?
  • Is there an issue with performance for very long videos, since it can pay attention to any chunk it’s seen?
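  • Rough sketch of the coarse-to-fine lookup as I understand it (fixed 20-step chunks, chunk summary = mean, pick the top-k chunks by dot product with the query, then do fine attention inside them); the fine attention itself is omitted here and the candidate chunks are just returned:
      import numpy as np

      def topk_chunk_lookup(memory, query, chunk_size=20, k=4):
          # memory: (T, d) per-step embeddings seen so far; query: (d,)
          n_chunks = len(memory) // chunk_size
          chunks = memory[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, -1)
          summaries = chunks.mean(axis=1)             # (n_chunks, d) chunk summaries
          top = np.argsort(-(summaries @ query))[:k]  # coarse step: pick top-k chunks
          return chunks[top]                          # fine attention would run over these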

December 3, 2021

MetaFormer is Actually What You Need for Vision

Questions
  • Can we go through the pooling blocks in more detail? (a minimal pooling token mixer is sketched at the end of this list)
  • +2
  • Pooling is local, but attention is global, right?
  • Can we go over the extra Zack ablations / tweaks? 🙂 
  • How sensitive is this setup in regards to the structure from figure 2?
  • Why does it go from L/6 to L/6 to L/2 and back to L/6 blocks for patch embedding?
  • +1
  • I don’t get why this works. 😞 If the point is the architecture is more important than the token mixer, shouldn’t it still perform better with a more sophisticated token mixer like attention?
  • +2
  • A: Transformer probably learns to attend to nearby info more than far away info over time.
  • What other token mixers might we consider?
  • What difference does it make if we remove the 4 stages of patch encoding?
  • In what ways is this different from a regular CNN?
  • +1 I don’t understand the similarity / difference to a CNN.
  • Dumb question: what is a token mixer?
  • Does this model contain any resnet style connections?
  • “compared with specific token mixers, MetaFormer is more essential for the model to achieve competitive performance” — how much have we actually ablated the transformer architecture outside of the attention heads? Is there actually a lot of room for improvement here?   
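  • A minimal pooling token mixer, going from memory of the PoolFormer code (so treat the details as assumptions): local average pooling with the input subtracted, since the surrounding block adds the input back through its residual connection.
      import torch
      import torch.nn as nn

      class PoolingTokenMixer(nn.Module):
          def __init__(self, pool_size=3):
              super().__init__()
              self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                       count_include_pad=False)

          def forward(self, x):                # x: (batch, channels, height, width)
              return self.pool(x) - x          # mix neighboring tokens; residual adds x back

      mixed = PoolingTokenMixer()(torch.randn(2, 64, 14, 14))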

November 20, 2021

Masked Autoencoders Are Scalable Vision Learners

Questions
  • Why might it have this property that fine tuning makes it better faster than MoCo?
  • How well would this work if you trained it on fewer images? (Trains on 250 million images)
  • Ellie: Why does the masked token have to be a learned vector? What would happen if it were not a learned vector (e.g. a constant vector)?
  • Bawr: maybe since the information is repeated many times in the embedding

October 28, 2021


Questions
  • How does this work?
  • Can you walk through how the covariance works one more time? (the three loss terms are sketched at the end of this list)
  • The two branches don’t have to share architecture and weights, but they do have the same output size, right?
  • What’s the default setup they actually use for these branches? “Siamese with shared weights” isn’t 100% obvious to me.
  • What’s a good intuition for what would happen with similar vs vastly different branches?
  • Why don’t we do the covariance thing all the time?
  • Like in the VAE? (VAE maybe not the best because of the stochasticity)
  • Does it make sense to replace the hinge loss in the variance with what they did in Soft-IntroVAE?
  • +1
  • Did they ablate over the gamma term in the variance loss?
  • Are there not issues with the variance term if two images look very similar?
  • Seems like the covariance term isn't a huge piece of the resulting performance?
  • Does my takeaway make sense: this is a method for learning embeddings that encode some hard-coded notion of invariance (in this case, coming from the data augmentations)
  • Does this really need batches at all? The invariance term does not, the covariance and variance ones feel like they could be made “continuous”…
  • I wonder how well this would work on completely separate problems (different domains)
  • How does batch size affect performance?
  • How strong is the dependence on the loss coefficients?
  • One thing I really like—since 2 of the 3 are regularization, it makes sense in the framework of that multi loss stuff we talked about before (regularize them “enough”)
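  • Sketch of the three terms as I understand them (invariance between the two views, a hinge on the per-dimension standard deviation, and a penalty on the off-diagonal covariance); the gamma=1 target and the absence of loss weights are assumptions here:
      import torch
      import torch.nn.functional as F

      def vic_losses(z1, z2, gamma=1.0, eps=1e-4):
          # z1, z2: (batch, d) embeddings of two augmented views of the same images
          invariance = F.mse_loss(z1, z2)                       # pull the two views together

          std1 = torch.sqrt(z1.var(dim=0) + eps)                # per-dimension std over the batch
          std2 = torch.sqrt(z2.var(dim=0) + eps)
          variance = torch.relu(gamma - std1).mean() + torch.relu(gamma - std2).mean()

          def off_diag_cov(z):
              z = z - z.mean(dim=0)
              cov = (z.T @ z) / (z.shape[0] - 1)
              off_diag = cov - torch.diag(torch.diag(cov))
              return off_diag.pow(2).sum() / z.shape[1]         # decorrelate the dimensions
          covariance = off_diag_cov(z1) + off_diag_cov(z2)

          return invariance, variance, covariance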


October 14, 2021


Questions
  • Their outermost PBT loop adjusts the tasks dynamically - is it only combining the pre-existing predicates with “or”, or are there more advanced combinations?
  • BF: It’s and, or, not, with anywhere from 1 to 3 predicates
  • Are the possible goals supplied to the agents in any way, or do they need to discover them?
  • Somewhat underspecified - how “smooth” are the changes to a given agent’s task distribution?
  • +1
  • Wanted to quickly double-check - the agents get the camera view as input, and nothing else?
  • BF: they receive RGB, the text of the given goal and forces related to holding an object
  • Just how does the next generation “distill from the best of previous generation”?
  • +1
  • I think it’s doing model distillation based on the output of the network, maybe training on some internal states, which basically attempts to copy what the network knows
  • Can we review the architecture outside of GOAT? 
  • What is the “torso”?
  • Can we go through the GOAT architecture? I do not understand how it works.
  • +1
  • How do they normalize the task scores, and how does their overall progress metric work?
  • Why is it important for the task space (and thus the game space) to be smooth?
  • How do RL policies work?
  • How are the goals encoded?
  • What are the inputs to the attention layer? Are they heads on the LSTM?
  • Does parallel task ideation occur in the background of attention-competing LSTMs?
  • The dynamic game creation and generational distillation both seem required for this approach to work. Is it the case that these agents do not succeed at generalizing without those?
  • What is stop gradient?
  • Is any of the code for XLand open source?
  • How do they compare held-out tasks to training tasks? They have a 'measurement' but it's not clear (section 4.2). And what is the basis for the interpretation of the behavior - purely observational, besides things like the internal representation analysis?
  • How exactly does this architecture work? Specifically curious about the way that the goals and tasks are shoved in there
  • How exactly did the evaluation work? That seems quite relevant, it’s hard to eval populations
  • How can this be said to generalize? Aren’t the set of tasks and goals sort of fixed? Dont the agents have to get trained on each individual task?

September 30, 2021

Summary

Questions
  • BF: Can someone explain what they’re doing with transformers a little bit more concretely? (figure 2 isn’t all that informative)
  • +1
  • +1
  • why does spatial vs temporal aggregation work after the transformer? couldn’t it all have gotten mixed - what enforced that it still stays that way?
  • +1
  • +1
  • EK: What do the dimensions (e.g. 16 x 32) for the object latents correspond to?
  • 16 is roughly corresponding to objects (or colors)
  • 32 is the feature dimension (e.g. perturb one element to change the color or size etc. of an object in the scene)
  • BF: Quick refresher on GMM and what they’re using them for in this paper?
  • How does the decoder work?
  • +1
  • Is the following summary correct?
  • Object latents are stable for a whole given video.
  • So when you decode one frame within a video, you always get the same latents.
  • Frame latents are different for every frame.
  • Can we walk through the components of the loss function?
  • Is this doomed for long videos?
  • +1
  • Josh: Probably yes, due to the transformer memory. But could extend this to work with longer videos (things change on a short time scale, but much less on a long time scale).
  • Can this deal with random motion across the video?
  • Like, multiple times you move your camera around to random places.
  • what exactly is the input format to the decoder? specifically the pixel coordinate, and also the timestep - how do these get incorporated with the features? 
  • just floats concatenated to the other features
  • what is the ordering of features into the transformer - like raster scan over each frame and then each frame in order? does it matter?
  • probably doesn’t matter due to 3d absolute position encoding
  • How excited are we about this?
  • +1


September 16, 2021

Summary
  • Even when the pieces of the input (patches of an image) are shuffled, you can still end up learning the task. The previous network was not robust to changing the background, but now it doesn't totally fail.
  • How it works: take an embedding of each patch and pass it into a sensory neuron: each (observation slice, previous action) pair goes into the keys, and the observation slice alone goes into the values.

Questions
  • Ellie: how did they factor QKV matrices vs LSTM?
  • Related to above: why is K taking previous action as input but not V? (In fact, it seems fv is just a pass through function?)
  • Are all the sensory neurons using the same neural network? Maybe a brief description of how this is actually working?
  • why did they use ES or BC to train the policy rather than some sort of backprop?
  • What does the fixed Q matrix look like?
  • Stupid question: I don’t understand RL policies at all 😞 — what is the policy network optimizing for? Is a “policy” the loss function of that network?
  • Policy is choosing action that maximizes reward; policy network output is the action.
  • I have no clue how BC works
  • How important is ES / BC for us? Do we like these methods for training neural networks?
  • (sounds like no, too slow)
  • Does this only work in an RL setting?
  • I’m curious how he came up with this idea
  • Why is it interesting that it’s invariant to shuffled inputs / why is permutation invariance interesting?
  • not a question, but: that cartpole noise was pretty low magnitude…

June 10, 2021


Summary
  • Gets comparable performance to GANs with a much simpler autoregressive model.

Questions

May 27, 2021


Summary
  • We have a series of measurements. We want to approximate the function that generates these measurements. 
  • We can say there’s an optimal value for the coefficients of this polynomial because we can define a distance between the approximation and the actual function.
  • We can find the coefficients of the Legendre polynomials through some magic. The coefficients are different for every time point t. (a brute-force version of this projection is sketched after this summary)
  • We can get the new coefficients at the next time step given this discrete-time HiPPO Recurrence.
  • They guarantee that, given the recurrence relation matrices they derive, you get optimal coefficients at the next time step. So you don't need to refit f_i at every time step.
  • A and B are computed dependent on t. They are computed based on the μ that we chose.
  • They have an option of 3 measures. If you choose a different measure, you need a different set of polynomials (not Legendre).
  • They use μ that is uniform over your history. Given uniform weighting over history, this recurrence relation is the way to do it. A learned recurrence relation may give you different information, but it will not give you better information with uniform + Legendre polynomials.
  • Your most recent coefficients represent your compressed memory at this time step.
  • These provably optimal guarantees work for a scalar output. 
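  • A brute-force illustration of "compress the history into N Legendre coefficients" (my own sketch, definitely not the paper's method: this refits from scratch at every step, which is exactly the work the HiPPO recurrence avoids):
      import numpy as np
      from numpy.polynomial import legendre

      def compress_history(signal, n_coeffs=8):
          # Project the history f[0..t] onto Legendre polynomials with uniform weight
          # over the history (the LegS measure); the N coefficients are the "memory".
          t = np.linspace(-1.0, 1.0, len(signal))   # rescale the history onto [-1, 1]
          return legendre.legfit(t, signal, n_coeffs - 1)

      def reconstruct(coeffs, length):
          return legendre.legval(np.linspace(-1.0, 1.0, length), coeffs)

      f = np.sin(np.linspace(0, 6, 200)) + 0.1 * np.random.randn(200)
      c = compress_history(f)                        # 8 numbers summarizing 200 samples
      f_hat = reconstruct(c, 200)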

Questions
  • Can you explain this image 
  • At time t0 μ is defined as the red area. At time t1 μ is defined as the blue area.
  • μ is how much we care about time t. If it’s uniform, we care uniformly across time t. Having non-uniform μ causes Legendre polynomials to be no longer orthogonal.
  • What is an orthogonal polynomial basis?

  • What is a hilbert space?
  • Legendre polynomials are defined in a Hilbert space, which means distance is a measure that makes sense.
  • What does it mean to store optimal coefficients in terms of basis functions?
  • What are bounded gradients?
  •  Questions:
  • ... Seriously, "translated Legendre (LegT)" vs "translated Laguerre (LagT) "?
  • ... At least LegS stands out with uniform weights, but can I still kick them?
  • 1. It's not so much a question as a statement, but why in hell do they pick these troll names.
  • Any probability measure µ on [0, ∞) equips the space of square integrable functions with inner product, inducing a Hilbert space structure and a corresponding norm.
  • 2. Okay, so I see how if you have an inner product, you get a Hilbert subspace structure, and that gets you a norm, so you can have distances and angles on your function space - very post-modern, very abstract, very cool... but how does a probability measure factor into this? A probability measure of... what, exactly? How do I into a mental model of this step zero.
  • Any N-dimensional subspace G of this function space is a suitable candidate for the approximation. The parameter N corresponds to the order of the approximation, or the size of the compression; the projected history can be represented by the N coefficients of its expansion in any basis of G. For the remainder of this paper, we use the polynomials as a natural basis, so that G is the set of polynomials of degree less than N.
  • 3. So they say "projected history" here, but this N isn't related to any time step, right? We just pick the dimensionality of our polynomial space, since these pesky Hilbert spaces and their infinite dimensions aren't something we really want to work with. And I assume we pick orthogonal polynomials in particular, because then we don't have to worry about how to pick a basis?
  • Since we care about approximating f≤t for every time t, we also let the measure vary through time. For every t, let µ (t) be a measure supported on (−∞, t] (since f≤t is only defined up to time t). Overall, we seek some g (t) ∈ G that minimizes ||f≤t − g (t)||L2(µ(t)) . Intuitively, the measure µ controls the importance of various parts of the input domain, and the basis defines the allowable approximations. The challenge is how to solve the optimization problem in closed form given µ (t) , and how these coefficients can be maintained online as t → ∞.
  • 4. Right, so what this is saying is that we have... learned? defined? did the needful to get a family of measure functions µ_1, µ_2... µ_Tmax, and our task on-line is to go from a bunch of µ-functions to the same number of g-functions?
  • Then the coefficients of the optimal basis expansion are simply c_n^(t) := <f, g>_µ(t) .
  • 5. To be clear, this is saying that the coefficients for a given t are the result of... an inner product of the real function and our basis vectors? Wat.

April 1, 2021

Questions
  • Overall assessment of this paper:
  • Loved it. “Genius paper”. Nobody else is doing anything like this - the architecture is so simple and gets rid of everything that doesn’t need to be there.
  • Why do we care?
  • Great evidence that hard attention is useful for generalizing. Humans pay attention to some things and not others - this paper shows that it can work.
  • How long does it take to train?
  • Not mentioned. It’s worth us reproducing!
  • How does CMA-ES compare to an optimization-based approach in terms of total number of rollouts? Which is more expensive (RL)?
  • We don’t really have an intuitive sense - we should do the math. Or count experimentally. Some helpful numbers:
  • DoomTakeCover: at least 1,700 ticks × 5 rollouts per simulation = 8,500 forward passes per sim.
  • CarRacing: 16 rollouts.
  • CMA-ES has a much harder time as number of parameters increases.
  • If we were to reimplement this, what things would we want to try differently?
  • Could do attention in latent space
  • Could use gradient-based optimization method
  • Could add 
  • One thing they say off-handedly in the blog post: “To keep the agent as simple as possible, we do not use positional encoding in this work." - I'm not sure what that fully means and how it affects things.
  • One other thing that caught my eye: “Previous work [60] has demonstrated that with a good representation of the input image, even a small RNN controller with only 6—18 neurons is sufficient to perform well at several Atari games using only visual inputs.” - wait what, that… seems like way too few neurons, is it basically good latents all the way down?
  • Huh, this is interesting. → I’d be interested in reviewing the paper in citation [60].
  • What are the tradeoffs between gradient descent and neuroevolution?
  • e.g. There is objection to backprop because the brain doesn’t do backprop. Is neuroevolution more biologically plausible? (And aside from being inefficient with more parameters, does it have any other advantages vs. tradeoffs?)
  • Also: What would a differentiable version of this look like? Do you think it would perform better or worse? (Are there other advantages of neuroevolution besides exploration that are not captured by gradient-based methods?)
  • KJ
  • bawr
  • BF
  • Discussion:
  • Unclear that brain doesn’t do backprop. Spiking neural networks potentially do something like backprop. “Neurons that fire together wire together” looks more like backprop than neuroevolution. 
  • Sample inefficient. The bias is against neuroevolution as number of parameters increases. To have a more expressive function you may prefer to use gradient descent. 
  • Maybe worth asking the authors why they used neuroevolution. Maybe gradient descent would perform worse for the same number of parameters.
  • Does it really only give the network the position of the patches and nothing about the content? That seems pretty weird
  • bawr
  • JB. 
  • Huh. Are these two tasks uniquely solvable by maintaining positions relative to visual attention positions (avoid fireballs, track left side of highway)?
  • How exactly does it end up with so few params?
  • KJ
  • What is inductive bias (not at a high level)? Why does self-attention force inductive bias?
  • JB
  • KJ
  • bawr
  • Answer: Seems like they are abusing the inductive bias term.
  • I realized I don’t understand attention very well. Why should I trust the importance vector that gets generated? 
  • KJ
  • The failure cases (due to switching to distracting backgrounds) are interesting. Humans seem to have an upper bound on what to attend to (so at some point, when things become random noise, we stop paying attention). I wonder how one would incorporate that into the self-attention method.
  • JB
  • bawr
  • Potential improvements:
  • Here, the only thing that can change what you’re attending to is the image itself. There’s no way to have task-dependent attention: “I’m driving, so I should pay attention to parts of the image related to driving.”
  • The question to actually ask here: How do you distinguish something that’s random noise (so, high information) but not giving any new information from something that is complex and interesting? 
  • Example mechanism: Active learning paper says: How well does this patch help me predict what’s going on in other patches? If you’re asking that question, then random noise does not help you at all. There are other mechanisms too. 
  • How does Spatial Softmax compress visual inputs into a set of 2D keypoints that are relevant to the task?
  • Why did the original Attention Is All You Need authors make such a dang complicated architecture
  • What did they mean by indirect encoding?
  • JA — they define it, but I’d like to go over it and understand more deeply, as idk if I could write out the answer here without looking
  • JB. Are they referring to the implicit matrices generated by XK * XQ, using “CPPN/CPPN-NEAT generate features using a small number of coefficients”, or other things?
  • What’s the exact thing that forces them to use neuroevolution vs just gradient methods?
  • What did they end up feeding into their network after selecting patches? Just the raw pixels or was it just the position of the patches? 
  • I think I know the answer to my own question: it’s just the positions of the patches, and all the image data is in the self-attention portion
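  • To make that mechanism concrete, here is a hedged sketch in my own notation (the shapes, weight names, and top-k of 10 are my guesses, not the paper’s): self-attention over flattened patches produces an importance score per patch, and only the (x, y) centers of the highest-scoring patches are handed to the controller.
```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def top_k_patch_centers(patches, centers, W_q, W_k, k=10):
    # patches: (N, patch_h, patch_w, channels); centers: (N, 2) patch center coordinates.
    X = patches.reshape(len(patches), -1)                          # flatten each patch
    A = softmax((X @ W_q) @ (X @ W_k).T / np.sqrt(W_k.shape[1]))   # (N, N) patch-to-patch attention
    importance = A.sum(axis=0)                                     # how much attention each patch receives
    idx = np.argsort(-importance)[:k]                              # indices of the top-k patches
    return centers[idx]                                            # (k, 2): all the controller ever sees
```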


Mar 25, 2021


At some point we can also discuss Perceiver: General Perception with Iterative Attention
  • I’ll just provide a quick overview of the paper (< 2 mins)
  • Then I will discuss what I initially didn’t like about the paper and where my intuition was “wrong” or “right”


Questions
  • What is MDL?
  • I was confused by the “smaller numbers is better” comment—do you want high or low MI between X and Y?
  • How well does this apply to more complex losses? Different evaluation metrics like accuracy, AUC, etc?
  • bawr
  • JA
  • What’s the surplus description length?
  • What’s the difference between SDL and cross-entropy?
  • I’m confused as to what MDL measures and why this is an improvement over MDL.
  • HB
  • In “A theoretical example”, I don’t understand the sentence “On the other hand, the raw data representation will achieve perfect validation accuracy once the evaluation dataset contains d linearly independent xi’s. In this case, Gaussian elimination will exactly recover s.”
  • bawr
  • What is the meaning of the intersection points in the graphs in figure 2? 
  • JB
  • HB
  • bawr
  • KJ
  • AL
  • In “Insensitivity to representation quality & computational complexity in MI”, what is the data processing inequality?
  • data processing inequality = “you can only lose information by processing data, you cannot gain it. At best you can preserve it. ie, information(F(X)) <= information(X)”
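  • In symbols (the standard statement, not anything specific to this paper):
```latex
% If F is any (possibly stochastic) processing of X, then Y -> X -> F(X)
% forms a Markov chain, and processing cannot create information about Y:
I\bigl(Y; F(X)\bigr) \;\le\; I\bigl(Y; X\bigr)
```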
  • They say something to the effect of “SDL measures only the extra entropy that comes from not having the correct model” - do we fully believe that? If so, what’s a good way of interpreting these “extra entropy” numbers that’s not misleading / doesn’t lead to incorrect assumptions?
  • JB
  • KJ
  • To be clear, they don’t actually compute the exact SDL, they just have an estimator for it, right? Are there any edge cases for that SDL estimator? What do we expect the variance / bias to be affected by?
  • bawr
  • They have a Python library for “reprieve” - is it any good? Is it easy to plug into our existing stuff?
  • bawr
  • JB
  • HB
  • Are there experiments where these measures are shown to be more effective? Are they convincing?
  • JB
  • HB
  • bawr
  • AF
  • KJ
  • AL
  • Can you explain the axes in fig 2b?
  • AF
  • Can you explain the linear probe method that this is an improvement on?
  • Can we cover the loss-data framework / graph in fig 2 in more detail?
  • What is the X and Y in H(Y | X)? What do these random variables represent?
  • BF - X is the training data (i.e. some input image) and Y is the label (i.e. is this a cat or a dog)
  • Ok cool thanks! So H(Y|X) is a measure of the probability distribution of labels given the training data (basically: does the training data give you any information about the labels, and how much information does it give you?)
  • correct!
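  • A toy numeric check of that intuition (the joint distribution is made up, not from the paper): when X is informative about Y, H(Y|X) drops well below H(Y); if X said nothing about Y, the two would be equal.
```python
import numpy as np

# Joint distribution over X in {image_a, image_b} (rows) and Y in {cat, dog} (cols).
p_xy = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
p_x = p_xy.sum(axis=1, keepdims=True)
p_y_given_x = p_xy / p_x
h_y_given_x = -(p_xy * np.log2(p_y_given_x)).sum()   # conditional entropy, ~0.47 bits
p_y = p_xy.sum(axis=0)
h_y = -(p_y * np.log2(p_y)).sum()                    # marginal entropy, 1.0 bit
print(h_y_given_x, h_y)
```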

Mar 11, 2021

Questions
  • How does this work?
  • BF
  • AF
  • What’s the deal with intermediate views? “The proposed objective avoids the need to explicitly infer intermediate latent views, instead imposing a sequence-level constraint based on long-range correspondence known by construction.”
  • Answered by for t in range(2*T-2)…, which I had missed on my first read
  • Some discontinuities will be because of scene changes, or disocclusions, etc. Because transitions are softmaxed… won’t that force transitions that don’t make sense? I could imagine an adversarial video that was just jump cut after jump cut, that would totally confuse an architecture like this.
  • "In easier cases (e.g. smooth videos), the paths that the walker takes from each node will not overlap, and these paths will simply be reinforced. In more ambiguous cases – e.g. deformation, multi-modality, or one-to-many matches – transition probability may be split across latent correspondences, such that we consider distribution over paths with higher entropy.“
  • What are the "transition probabilities”? How are they trained? Is there actually any random walking?
  • AF
  • HB
  • BF
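  • My hedged reading of the mechanism, tying together the for t in range(2*T-2) note above and this question (tensor shapes and the temperature are my own choices): transition probabilities are row-softmaxed similarities between patch embeddings of adjacent frames; chaining them along the palindrome sequence and asking each patch to return to itself is the training signal, so no literal random walk is ever sampled.
```python
import torch
import torch.nn.functional as F

def cycle_loss(feats, tau=0.07):
    # feats: (T, N, D) per-frame patch embeddings from the encoder.
    T, N, _ = feats.shape
    feats = F.normalize(feats, dim=-1)
    order = list(range(T)) + list(range(T - 2, -1, -1))       # palindrome: 0..T-1..0
    walk = torch.eye(N)
    for t in range(2 * T - 2):                                # 2T-2 transitions
        sim = feats[order[t]] @ feats[order[t + 1]].T / tau   # (N, N) similarities
        walk = walk @ F.softmax(sim, dim=-1)                  # chain row-stochastic transitions
    targets = torch.arange(N)                                 # each patch should return to itself
    return F.nll_loss(torch.log(walk + 1e-8), targets)
```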
  • How many frames does the network have access to when predicting transition probabilities?
  • AF
  • HB
  • Would this work if the video had a blank frame? 
  • Yeah, like what happens when you blink?
  • I wonder if skip connections (across time) would be helpful…
  • What is edge dropout?
  • AF
  • What is spatial jitter trying to accomplish?
  • How do we tell this model to track what we want?
  • Seems like this would break if we try and focus on the background
  • Ahh, they have an input segmentation mask
  • “However, optical flow proved too noisy to provide long-range composite correspondences across many frames.” Thoughts?
  • Wow, much code golf. I’m confused about all the bits in blue

Mar 4, 2021

Questions
  • Why does it make sense for the representation mode and the transition predictor networks to share parameters?
  • JB
  • HB
  • What is ELU?
  • What is temporal difference learning?
  • JB
  • HB
  • I don’t understand the actor / critic learning very well.
  • JB
  • KJ
  • I’m less interested in the RL components of this, more in the world model part—would actually like to walk through all of the details of the world model part.
  • KJ - I’m interested in understanding Figure 2 better…
  • JA
  • BF
  • JM
  • JB: I want to learn the RL stuff 🙂
  • I did not understand the regularization loss term, or how all of those loss terms relate to other stuff. 
  • JB
  • HB
  • I’d really like to understand the straight-through gradients thing. It seems so simple, but I don’t get it
  • JB
  • JA
  • HB - What is stop_grad doing in algorithm 1?
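  • Here is a minimal toy version of the straight-through trick (mine, not the paper’s code), which also shows what stop_grad / detach is for: the forward pass uses the hard one-hot sample, while the backward pass pretends the output was the softmax probabilities.
```python
import torch

def straight_through_sample(logits):
    # logits: (batch, num_classes)
    probs = torch.softmax(logits, dim=-1)
    index = torch.multinomial(probs, num_samples=1)            # hard categorical sample
    one_hot = torch.zeros_like(probs).scatter_(-1, index, 1.0)
    # Forward value is one_hot (the probs terms cancel numerically); the gradient
    # flows only through probs, because the detached copy is removed from autograd.
    return one_hot + probs - probs.detach()
```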
  • Same question on KL balancing—what exactly is it? Seems so simple, but…
  • JB
  • JA
  • BF (Can you explain what KL Balancing is?)
  • HB (How are they doing KL balancing?)
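  • My understanding of KL balancing, sketched as toy code (the 0.8 is a placeholder weight, and the exact form should be checked against the paper): the same KL between posterior and prior is computed twice, once with the posterior’s gradients stopped and once with the prior’s, so the prior gets pulled toward the posterior faster than the posterior gets pulled toward the prior.
```python
import torch
import torch.distributions as D

def balanced_kl(posterior_logits, prior_logits, alpha=0.8):
    q = D.Categorical(logits=posterior_logits)
    p = D.Categorical(logits=prior_logits)
    q_sg = D.Categorical(logits=posterior_logits.detach())     # stop-grad posterior
    p_sg = D.Categorical(logits=prior_logits.detach())         # stop-grad prior
    return (alpha * D.kl_divergence(q_sg, p).mean()            # trains the prior
            + (1 - alpha) * D.kl_divergence(q, p_sg).mean())   # regularizes the posterior
```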
  • Did they just make the world model with all of the data, and then the RL was completely separate afterward?
  • BF
  • JB
  • JA
  • What is the difference between the deterministic and stochastic hidden states? How are they generated? 
  • JA
  • What are categorical variables? (vs. Gaussian latents of RSSM in PlaNet)
  • BF (and sub questions)
  • JB (all qs)
  • Is this choice of using categories arbitrary, or could it be anything with more representational capacity than the Gaussian they first tried? Why categories?
  • KJ
  • Is this similar to the VQ VAE? Are there any differences?
  • KJ
  • What is your intuition behind the benefit of discretization (is it compression)? 
  • (Is there a difference between “using categorical variables” and “discretizing” as in VQ-VAE?)
  • KJ
  • How does discretizing stochastic state compare to discretizing deterministic state?
  • What’s the difference between discretizing stochastic state and reducing the number of dimensions?
  • What is the prior of the categorical variables? For Gaussian variables we have the standard normal distribution.
  • Why restrict to single GPU?

Feb 25, 2021

Questions:
  • Can we walk through exactly how this works?
  • BW HB +1
  • How does it get the training data?
  • How would we get training data for the proposed RSSM-reconstruction use case?
  • KJ: When do you need vs. don’t need to generate captions of the image for the training data to be useful?
  • What does it mean that a network is invertible?
  • You can get the inverse of the function that the neural network approximates. The invertibility is from the residuals to z_theta, conditioned on z_phi.
  • What is the residual? 
  • It captures the difference between z_theta and z_phi. It contains the information that is in z_phi but not in z_theta.
  • How do you choose which intermediate layer to use as your embedding?
  • You can choose any layer. You have to retrain it if you choose a different layer.
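  • To illustrate what “invertible” means here, a generic RealNVP-style coupling layer (not the paper’s exact architecture, and it ignores the conditioning on z_phi): half the dimensions pass through unchanged and parameterize an affine map of the other half, so the inverse can be computed exactly.
```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=128):   # dim assumed even
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))        # predicts scale and shift

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        return torch.cat([x1, x2 * torch.exp(log_s) + t], dim=-1)

    def inverse(self, y):                  # exact inverse of forward
        y1, y2 = y.chunk(2, dim=-1)
        log_s, t = self.net(y1).chunk(2, dim=-1)
        return torch.cat([y1, (y2 - t) * torch.exp(-log_s)], dim=-1)
```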
  • What is being fed into the GAN to get the conditioned image?
  • What can we not do with this? It seemed so applicable to so many things…
  • What does it mean that it “learns the inherent ambiguity of the domain translation, which facilitates content creation and model diagnostics”?
  • How can this be used for model diagnostics?
  • More specific version, what does this mean: By comparing the generated samples y = Λ(zΘ) (see Eq. (7)) conditioned on representations zΦ = Φ(x) extracted from different layers of f, we see how the invariances increase with increasing layer depth and can thereby visualize what the model has learned to ignore.
  • Could we use this to compose a world model from video prediction and language models?
  • How much less powerful would such a model be from something that learned such representations in a joint fashion?
  • What are the limitations of flow networks? (i.e. lack of expressivity from lack of nonlinearities) How do these limitations manifest in this paper?
  • Can this be extended to do M:N domain mapping? If so, how?
  • How do we transfer the latent space of our RSSM model to the latent space of a model that is designed to have good reconstructions?

Feb 18, 2021

Questions:
  • What’s up with this temporal block?
  • What is the advantage to decomposing the filters as described on page 5? I see they do an ablation but I don't understand why it should be any better.
  • What is the purpose of having multiple average pooling scales?
  • How are these blocks stacked together?
  • KJ
  • 👍
  • 👍
  • 👍
  • 👍
  • Things are decomposed to segmentation, depth, and flow - what’s “flow” specifically?
  • 👍
  • KJ
  • 👍
  • How might we dramatically simplify this paper?
  • KJ
  • 👍
  • 👍
  • 👍
  • What are the take-home lessons from this paper?
  • 👍
  • 👍
  • How is the diversity distance metric computed?
  • 👍
  • 👍
  • Do other baselines use the same pretrained encoders described in 3.1? Does this method backprop through the encoders?
  • 👍
  • How do you read Table 1 (or, really, any of the tables)?
  • KJ - someone can explain this to me later
  • What parts felt fiddly and complicated to others?
  • Why doesn’t the control depend on the future prediction?
  • There’s an off-hand comment on page 5, what are the implications of this?
  • All convolutions are preceded by a (1, 1, 1) convolution to compress the channel dimension.
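  • My reading of that comment, as a tiny sketch (the channel counts are made up): a (1, 1, 1) 3D convolution acts only on the channel dimension, so it cheaply compresses channels before the more expensive spatio-temporal convolution.
```python
import torch.nn as nn

bottleneck = nn.Sequential(
    nn.Conv3d(in_channels=256, out_channels=64, kernel_size=1),            # (1,1,1) channel compression
    nn.Conv3d(in_channels=64, out_channels=64, kernel_size=3, padding=1),  # actual spatio-temporal conv
)
```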
  • Why encourage the present distribution to match the future distribution? That seems a bit weird, but I don’t have a specific question - maybe it’s just the names that are confusing me?
  • What’s the repeat frame baseline?
  • They say they predict “two seconds into the future” - what’s the FPS here? 
  • Does the Dynamics Module learn physics, or is the physics engine hardcoded?

Notes:
  • The cool thing about this paper is that:
  • It's a demonstration of all of the architectural pieces put together, similar to our strategy. And it works really well, despite all the issues.
  • Overall assessment of paper:
  • Weak baselines, sketchy ablation studies, poor architecture choices. However, it is still an interesting demonstration overall.
  • The Dynamics Module has a weirdly complicated architecture with lots of odd features in it.
  • The Present and Future Distributions module uses the Dynamics Module to unroll the possible future states.
  • Useful things for us:
  • They only sample once from the future prediction, and that single sample is used for every frame of the rollout.
  • Has two good ideas for visualization that we can borrow.
  • Visualizing entropy of distribution of future outcomes.
  • Sampling the distribution of possible futures many times, and classifying the “modes” of futures (e.g. turn right vs. continue straight).
  • Diversity Distance Metric is about asking: How do you measure how good the future samples are?
