Submitted by jackfaker t3_126wg0o in MachineLearning
Has there been much recent literature on leveraging natural language internal interfaces in AI systems, or what are your thoughts in this space? Perhaps not yet implementation, but at least discussion on how it pertains to consciousness and AI alignment?
Examples of potential implementations:
- LLM with a final gate that chooses to output the next word to the external reader or to an internal stream. LLM output is a function of both the internal stream and the external prompt.
- Collection of LLMs that engage in a dialogue based on external prompt. External output is a function of a final LLM that parses this dialogue + the external prompt.
- Loss function includes a critic that evaluates the grammatical structure of a hidden state (after passed through a decoder) at at specific points in the network, encouraging the network to propagate information via natural language.
Link to consciousness:
- Simple non-conscious organisms evolve to communicate.
- Internal feedback loops sporadically develop in the brains of these organisms.
- Feedback loops that encode information via primitive natural language proliferate, as the brain already has substantial evolutionary pressure to interpret and emit information in natural language.
- The memory writing function of the brain could plausibly be similar regardless if the natural language was originally sourced from an external stimuli or an internal feedback loop.
- Hypothesize that consciousness is feedback loops in the brain through natural language + writing to memory.
Link to AI Safety/Alignment:
Consider a human being who had their stream of consciousness monitored for the entire duration of their life. If an AI system's intelligence is heavily derived from a natural language internal interface, then it becomes possible to observe its thought process in real time and shut it off when this thought process becomes substantially unaligned with humans. Of course this process would be incredibly nuanced in practice, but at surface level this approach appears infinitely more tenable that hoping to diagnose the alignment of an AI system from only an analysis of its inputs and outputs.
grotundeek_apocolyps t1_jecdbtm wrote
"AI alignment" / "AI safety" are not credible fields of study.