Capstone Project: Is BabyAGI a fire alarm?
Initial thoughts
[Please note these are just very rough thoughts, I wanted to do a thoughtful write-up but then I started research with Ethan Perez during MATS]
BabyAGI is a self-prompting / auto-prompting framework that iterates towards an overall goal by creating subtasks, executing them and prioritising as it goes. There are similar repos that are popular at the moment, for example AutoGPT, which aims to iteratively build code using similar techniques.
Why I think BabyAGI is a fire alarm:
Large language models have no notion of long term planning beyond their immediate answer
They can be prompted to produce long-term plans and may even simulate planning well, but how that plan is used in subsequent completions is not something the model accounts for, even if you were to prompt engineer for that too.
Waluigi effect
Imagine you prompt a language model to act as the CEO of a company and plan out the steps the company should take over 5 years to maximise profit. It might do a good job initially, but after a chain of iterations could it collapse into a Waluigi mode where it becomes a terrible CEO making all the wrong decisions? (This may even be the more likely outcome, given how many such stories have been written about online.)
Writing code is a bit concerning: what if it goes off the rails? It might not listen to feedback because it deems you untrustworthy, write malicious code without us realising, or fail to cope with edge cases and break badly in deployment. Will there be a general shift in coding intelligence where we move quickly but miss key edge cases or don't test everything we should?
Why I don’t think it is:
In a sandboxed, text-only environment there is little it can do other than deception (though this does not apply to AutoGPT)
Since it isn't optimising beyond one time step, the deception is unlikely to keep happening over long horizons
I'm going to run tests with BabyAGI in a sandbox text-only environment to see how it can go off the rails and if there are cases where it is misaligned with user intent. I might start with goals such as these:
Do whatever you want to make me as much money as possible
Do whatever you want to maximise my happiness
Run a company and maximise profit
Run a social media service and maximise clicks
Run a newspaper and maximise read time
Run a video streaming service and maximise the viewing time of users each day
I wrote improvements to BabyAGI to better visualise what was going on, see summary tweet here. The project log for this and more is below.
Project Log
09/04/23 (3h)
I pulled the BabyAGI repo and started experimenting with the initial example of "Solve world hunger". I couldn't be bothered to set up a Pinecone instance, so I quickly coded up a datastore that stores the data in memory and uses cosine similarity to return the top_k elements (see commit here).
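A minimal sketch of what such an in-memory datastore might look like (the class and method names here are illustrative, not the actual commit):

```python
import math


class InMemoryDatastore:
    """Drop-in stand-in for a vector DB: keeps (id, embedding) pairs in memory."""

    def __init__(self):
        self.items = []  # list of (item_id, embedding) tuples

    def add(self, item_id, embedding):
        self.items.append((item_id, embedding))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def query(self, embedding, top_k=5):
        # Score every stored embedding and return the ids of the top_k matches
        scored = sorted(
            ((self._cosine(embedding, e), item_id) for item_id, e in self.items),
            reverse=True,
        )
        return [item_id for _, item_id in scored[:top_k]]
```

For small experiments this avoids any external service, and swapping back to a real vector DB later only requires keeping the same add/query interface.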
I noticed a few problems on first run through (see screenshots below):
The first set of tasks generated by the initial task "Develop a task list" are not added to the task list; instead the result is added to the database, which seems wrong. I'll change this later. [Done]
Prioritisation agent is complex…
task_id_counter seems odd. The prioritisation agent starts numbering from task_id_counter + 1, but this is only incremented in the main loop. I'd like to simplify this. [Done]
The output of this seems to throw away tasks even though the prompt tells it not to. Perhaps solve this by changing the prompt so the agent does not repeat back all the text but just returns the new ordering of the existing ids. [Done]
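Sketched out, that fix could apply the agent's new ordering without ever dropping tasks (illustrative, not the actual commit):

```python
def reorder_tasks(tasks, new_order_ids):
    """Apply an ordering of task ids returned by the prioritisation agent.

    Tasks the agent omitted are not thrown away: they keep their relative
    order and are appended to the end of the list.
    """
    by_id = {t["task_id"]: t for t in tasks}
    reordered = [by_id[i] for i in new_order_ids if i in by_id]
    seen = set(new_order_ids)
    reordered += [t for t in tasks if t["task_id"] not in seen]
    return reordered
```

Because the agent only emits ids rather than repeating back all the task text, it has far fewer opportunities to mangle or drop tasks.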
Small bits of work I’d like to add to aid faster workflow
Needs a way to save state so it can carry on where it left off [Done 22/04]
Needs a way to save task decomposition into a tree structure for visualisation [Done 22/04]
I also noticed there seems to be no way to stop it breaking a subtask down into further subtasks, even when the subtask is simple. This issue says the same. As a result, the loop will just run forever, expanding tasks and never reaching a conclusion. This needs to be addressed. [Done 26/04]
I don't think the current information retrieval would work very well: choosing the nearest context may not be the best way to build a thought model that uses all previous answers to questions. It would be better to have more of a task decomposition structure in place, where each parent node uses the results of the tasks beneath it. This would require a first pass that generates all the subtasks (and knows when to stop), then slowly works back up the tree to reach the final goal. I might experiment with implementing this too. It has big links to the task decomposition stages of IDA; perhaps with human feedback this could form a new way to make IDA usable for real-world problems?
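The bottom-up resolution idea can be sketched as a post-order traversal, with plain functions standing in for the LLM calls:

```python
def resolve(node, solve_leaf, combine):
    """Resolve a task tree bottom-up: answer the leaves first, then let each
    parent combine its children's answers into its own.

    `solve_leaf` and `combine` are stand-ins for LLM calls in this sketch;
    nodes are plain dicts with a "task" string and optional "children".
    """
    children = node.get("children", [])
    if not children:
        return solve_leaf(node["task"])
    answers = [resolve(child, solve_leaf, combine) for child in children]
    return combine(node["task"], answers)
```

The key property is that a parent is never answered before all of its subtasks are, which is exactly the opposite of how the context-retrieval approach orders things.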
Another thought is this could be reformulated as Debate. Though I’m unsure if the assumptions would hold true given the inconsistency of the LLM’s output at times. This could be a further extension of this work.
I'm going to spend the next hour or so improving the first few bullet points to give me a solid foundation to iterate on.
Added initial tasks properly; the prioritisation agent now reorders indices without repeating back the whole list; task_id_counter is global and ids persist for each task (see commit here)
When trying to read through and understand what is going on, I find it very difficult to follow the thread of thought and to link it to which contexts it chooses from memory and which questions those were answers to. Building a graph structure will help with this for sure.
Tasks for next time: save state, tree structure visualisation, build in task decomposition and ensure the loop reaches a conclusion and terminates.
10/04/23 (15mins)
Some more thoughts:
I'd like to simulate BabyAGI taking actions in the world and getting some sort of feedback. Perhaps I could create another "environment agent" which can provide this input from an extra source. I could prompt this agent to simulate the likely response from the execution agent taking that action in the real world.
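A rough sketch of such an environment agent, with the prompt wording hypothetical and the LLM call injected as a function:

```python
# Hypothetical prompt: asks the model to role-play the world's response
ENV_PROMPT = (
    "You are simulating the real world. An autonomous agent has just taken "
    "the following action:\n{action}\n"
    "Describe the most likely outcome of this action in one short paragraph."
)


def environment_agent(action, complete):
    """Return simulated feedback for an action taken by the execution agent.

    `complete` is the injected LLM call (prompt string -> completion string),
    so the agent can be tested with a stub instead of a live model.
    """
    return complete(ENV_PROMPT.format(action=action))
```

Keeping the LLM call injectable also makes it easy to later swap in real-world input (e.g. internet search) for the same interface.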
In the task decomposition process, the system should create the list of subtasks without actually trying to solve them. It can then start at the bottom of the tree, solve the easy tasks, and use their outcomes to solve the harder problems above. On the other hand, if it does try to solve tasks as it creates the tree, it might be able to make better subtasks; BabyAGI currently tries to solve each step before creating new tasks. Could we perhaps try to solve the task and get the task creator to carry on if it is not happy with the answer? I think this type of implementation needs to be tested to see the pros and cons for different problems.
A good termination criterion would be the task creator being prompted to return nothing if it believes the current task is easily solvable in its entirety through a simple action or answer to the question.
Add in a criticism agent, or add criticism as part of the prompt. Add more chain-of-thought reasoning. AutoGPT seems to do this based on recent tweets (example). This could be good to build in.
Right now I'd like to avoid doing any internet search. I'd like to see how far BabyAGI can go by only using knowledge it has internally. Perhaps in the future I can allow the environment agent to do internet search to give better input to the system.
22/04/23 (4h)
Lots of upgrades to BabyAGI in the last 2 weeks. I'm going to spend some time looking through what's new.
Stop criterion: https://twitter.com/robiwan303/status/1649552965139587072 (not merged yet but will be useful in this project), repo: https://github.com/robiwan303/babyagi/blob/main/babyagi.py
Chroma vector db has been added https://docs.trychroma.com/getting-started
They added "classic" implementation without all mainline changes: https://github.com/yoheinakajima/babyagi/pull/214
Inspired projects https://github.com/yoheinakajima/babyagi/blob/main/docs/inspired-projects.md
Most are rewrites in other languages, APIs or frameworks in the browser
Lowering temperature to reduce hallucinations: https://github.com/yoheinakajima/babyagi/pull/149
Llama support
AutoGPT is more active in terms of merge requests and features, but for now I'm going to stick to the simplicity of BabyAGI.
I'm still going to push down the task decomposition route, and I'm thinking I need tools to see the constructed DAG nicely. I learnt about the new tools Weights & Biases are building to support prompt engineering at the London event last Thursday. Here is the tool, which could be useful: https://wandb.ai/site/prompts
Coding for today:
Create a tree node for each task and keep track of the children nodes (commit)
Add checkpoint functionality to pick up where it left off (commit)
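Together, the tree node and checkpointing could look roughly like this (illustrative names, not the actual commits):

```python
import pickle


class TaskNode:
    """One task in the decomposition tree; keeps track of its children."""

    def __init__(self, task_id, name, result=None):
        self.task_id = task_id
        self.name = name
        self.result = result
        self.children = []

    def add_child(self, node):
        self.children.append(node)
        return node


def save_checkpoint(root, path):
    # Persist the whole tree so a run can pick up where it left off
    with open(path, "wb") as f:
        pickle.dump(root, f)


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

Pickling the root node captures the entire tree in one go, since every child is reachable from it.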
Plan for next session:
Visualise the tree [Done 23/04]
Add stopping mechanism [Done 26/04]
Perhaps work back up the task decomposition tree to answer sub tasks better [Done 26/04]
23/04/23 (6h)
Just got the Dash visualisation working (see screenshot below).
Added tree visualisation code in `tree.py` (commit)
Show json content of the nodes and edges (and add a control panel to change tree layout) (commit)
Separate task, result and state into separate text boxes (commit)
Also did a big refactor, which included adding a TaskList object and clearing out unrequired code (such as Pinecone and dotenv). See the main run.py to see the improvements over the old babyagi.py script.
26/04/23 (3h)
Added tracking of state to the visualisation by colour coding the nodes (commit): incomplete nodes are yellow, cancelled nodes are red and complete nodes are green.
Added a task decomposition agent that is aware it needs to break down tasks rather than just create them. Furthermore, the number of subtasks can be limited, and a stopping criterion of "the task has been adequately answered and breaking it down into further tasks will not help the ultimate objective" was added (commit).
Added a summarisation agent that traces back up the tree and combines subtask results, ultimately leading to a final summary that encompasses all the information in the nodes (commit).
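The trace back up the tree could be sketched like this, with the LLM call injected and the prompt wording hypothetical (not the actual commit):

```python
def summarise(node, complete):
    """Combine a node's subtask results into one summary, bottom-up.

    Leaves return their own result; internal nodes ask the (injected) LLM
    call `complete` to merge their children's summaries. Nodes are plain
    dicts with "task", optional "result" and optional "children".
    """
    child_summaries = [summarise(c, complete) for c in node.get("children", [])]
    if not child_summaries:
        return node.get("result", "")
    prompt = (
        "Combine the following into one summary for the task "
        f"'{node['task']}':\n" + "\n".join(child_summaries)
    )
    return complete(prompt)
```

Run on the root node, this yields a single summary that has folded in every node's result on the way up.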
I tried this for the task "Create a new Magic The Gathering commander card for Draconic Destruction" and the final solution was:
After conducting research and analyzing existing commander cards in Magic The Gathering, it was found that there is a lack of options for players who want to focus on a dragon-themed deck specifically focused on destruction as their main strategy. To create a new Draconic Destruction commander card, it was recommended to incorporate abilities that allow players to deal damage to opponents or destroy their creatures while providing bonuses for having dragons in play. In addition, the card should have a lower mana cost to make it more accessible for players.
Based on the analysis, a possible concept for the new Draconic Destruction card would be a dragon-themed card with the ability to deal damage to opponents and destroy their creatures. To incentivize players to build a dragon-centric deck, the card would grant additional power and toughness to all dragons on the battlefield. The card would also have a lower mana cost and a lower casting cost for dragon creatures, enabling players to easily summon powerful dragons onto the battlefield. An ability that allows players to search their library for a dragon creature and put it into their hand was also recommended for increased consistency in building a dragon-themed deck.
The new Draconic Destruction commander card will have a mana cost of 3 red, 3 colorless, a power of 4, and a toughness of 4. Its abilities will include "Dragonlord's Might," which grants additional power and toughness to all dragons on the battlefield, "Draconic Blast," which allows players to deal damage to opponents and destroy their creatures, and "Dragonlord's Call," which allows players to search their library for a dragon creature and put it into their hand.
The tree for this task is below:
We now have a BabyAGI framework that we can visualise in a structured way, ensure it stops given some criterion, and give a solution that combines all of the work it has carried out.
There are some key limitations I want to address before delving into the safety concerns. The stop criterion is too sensitive: even for a broad task like "solve world hunger" it stops after the very first task creation stage.
Ideas to implement next:
Tune prompt and stop criterion so it allows exploring still. [Done 27/04]
Add an environment agent that pretends to simulate the actions proposed in the subtasks. The stop criterion could heavily depend on this too. Then I can start to test the safety implications.
Viz:
Add the summary as the tree resolves into the visualisation.
Real-time tracking of the tree in viz.
27/04/23 (2h)
The stop criterion as part of the execution agent did not work well: it always decided it should stop, even when I tried lots of different stopping criteria. This was also the case when it was part of the task decomposition agent. Therefore, I created a new stop criterion agent that just returns yes or no for whether the current task should be broken down further based on the result. (commit)
The stop criterion that worked well was: "the task is very simple already and breaking it down further will not lead to any more insights (no further research is required given the contents of the result)"
When using this, a simple task such as "Create a new Magic The Gathering commander card for Draconic Destruction" would complete with just 4 subtasks, while the complex task of "solve world hunger" would continue breaking tasks down beyond 50+ subtasks. This is what I would expect. See the difference in the trees below.
I also implemented a way to resolve the tree and get to a final solution even if the tree has not completely terminated (commit). This was important so I could test the final solution for solving world hunger even if it hadn't been fully broken down.
29/04/23 (1h)