Submitted by Lesterpaintstheworld t3_11ccqjr in singularity
Lesterpaintstheworld OP t1_ja2mobc wrote
Reply to comment by turnip_burrito in Raising AGIs - Human exposure by Lesterpaintstheworld
At this stage this is actually surprisingly easy. People have to intentionally be very manipulativr and creative to get ChatGPT to "behave badly" now. Without those "bad actors", this behavior would almost never happen.
One easy way to do that is to preface each prompt with a reminded of values / objectives / personality. Every thought is then colored with this. The only moment I had alignment problems is when I made obvious mistakes in my code.
I'm actually working on making the ACE like me less, because he has a tendency to take everything I say as absolute truths ^^
turnip_burrito t1_ja2ngmw wrote
That's good.
Maybe also in the future, for an extra layer of safety, when you can several LLMs together, you can use separate LLMs "judges". The judges can have memory refreshed every time you interact with the main one, and can screen the main LLM for unwanted behavior. They can do this by taking the main LLM's tentative output string as their own input, and use that to stop the main LLM from misbehaving.
Lesterpaintstheworld OP t1_ja2nq8n wrote
Whoo, forks & merges, with a consensus layer. I like that
DizzyNobody t1_ja2pthy wrote
What about running it in the other direction: have the judge LLMs screen user input/prompts. If the user is being mean or deceptive, their prompts never make it to the main LLM. Persistently "bad" users get temp banned for increasing lengths of time, which creates an incentive for people to behave when interacting with the LLM.
turnip_burrito t1_ja2q7t6 wrote
That's also interesting. It's like building a specialized "wariness" or "discernment" layer into the agent.
This really makes one wonder which kinds of pre-main and post-main processes (like other LLMs) would be useful to have.
DizzyNobody t1_ja2uka9 wrote
I wonder if you can combine the two - have a judge that examines both input and output. Perhaps this is one way to mitigate the alignment problem. The judge/supervisory LLM could be running on the same model / weights as the main LLM, but with a much more constrained objective - prevent the main LLM from behaving in undesirable ways either by moderating its input and even by halting the main LLM when undesirable behaviour is detected. Perhaps it could even monitor the main LLM's internal state, and periodically use that to update its own weights.
turnip_burrito t1_ja6re1h wrote
I think if we had the right resources, this would make a hell of a research paper and conference talk.
Viewing a single comment thread. View all comments