The Art of Taking Turns: Multi-Player AI Interfaces

We’ve all been in a conversation with a trigger-happy interlocutor, firing off every errant thought and opinion that floats through their mind. The same goes for a conversational partner who shows barely any interest, uttering next to nothing for the entire discussion; the only way to get them to join the group is to address them directly. And, of course, we’re all acquainted with the under-confident participant whose contributions land at awkward moments, squeezed into the slightest of openings. The nuance of conversational rules isn’t something most people spend much time thinking about explicitly. It’s a set of skills picked up over time, through practice and experience. We learn that while there aren’t concrete rules per se, there are widely accepted norms and etiquette. Creating and codifying these heuristics will be one of artificial intelligence’s next big frontiers in making multi-player interactions possible and frictionless.

Large language models today focus heavily on written and spoken words. But human communication isn’t just verbal; it’s embedded within an interactional exchange of multimodal signals. In other words, communication is layered: tone, timing, gesture, and expression all carry meaning. To design more lifelike, collaborative AI, we need to understand how these layers work together. This is where syntax and prosody come into play: the unwritten rules behind the human understanding of speech and dialogue. Syntax, the structure behind our sentences, helps us anticipate when someone might be wrapping up a thought or gearing up for another. Prosody covers the rest: rising intonation might signal a question, a long pause might mean the speaker is conceding the floor, and tiny, barely perceptible breaks can tell us whether they will continue or yield.

We also rely heavily on social cues. Much of our communication is non-verbal: eye contact, body orientation, and micro-expressions (like a raised eyebrow) tell us who’s being addressed or who’s invited to speak next. A question might not be directed at everyone, and yet most of us know when it’s our turn to speak. And if things go awry, we know when to deploy repair mechanisms (clarifications, restarts, and polite interruptions) to get things back on track.

It goes without saying that our AIs don’t have access to any of this (yet). One of the most vexing problems in multi-player AI interaction is turn-taking, especially when more than one bot is in the conversation. Who should answer, and in what order or priority? Should bots answer based on their specialization, their personality parameters, or some other consideration? How will a bot choose when to stay quiet, when to interject, and whether to offer a full analysis or just a short overview, saving the complete answer for when humans are less involved?

The technology is already here, but it will take a lot of effort to make it widely implementable and scalable. Current AI systems follow “turn-based” logic (one user speaks, then one AI responds) rather than the more fluid, overlapping “turn-taking” we see in human group conversations. When multi-player situations arise, where people speak in rapid succession, interrupt, or talk over one another, AI systems quickly become overwhelmed, often struggling with attribution and simultaneous inputs. The context window also fills up rapidly as contributors multiply. The AI needs logic for when to speak up and when to remain silent, which is not always clear-cut. Google’s MUCA (multi-user chat assistant) framework has made strides here, framing “what to say, when to respond, and who to address” as the three key decision dimensions.
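To make those dimensions concrete, here is a minimal sketch of how a single bot might separate the three decisions in code. It is an illustration of the idea, not MUCA’s actual architecture: the prompts, the `generate` callable, and the data shapes are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class TurnDecision:
    """One response decision, split along the three dimensions."""
    should_respond: bool    # when: speak now, or stay silent
    addressee: str | None   # who: a specific participant, or None for the whole group
    content: str | None     # what: the drafted reply, if any

def decide_turn(chat_history: list[dict], bot_name: str, generate) -> TurnDecision:
    """Ask the model, in order: should we speak, to whom, and what do we say?

    `generate` is any callable that takes a prompt string and returns text;
    it stands in for whatever LLM client is in use.
    """
    transcript = "\n".join(f"{m['speaker']}: {m['text']}" for m in chat_history)

    # When: a cheap yes/no call before committing to a full generation.
    speak = generate(
        f"You are {bot_name} in a group chat.\n{transcript}\n"
        "Should you respond to the latest message? Answer yes or no."
    ).strip().lower().startswith("y")
    if not speak:
        return TurnDecision(False, None, None)

    # Who: pick an addressee, or the whole group.
    addressee = generate(
        f"{transcript}\nAs {bot_name}, who should your reply address? "
        "Name one participant, or say 'everyone'."
    ).strip()

    # What: only now pay for the full reply.
    content = generate(f"{transcript}\nReply as {bot_name}, addressing {addressee}.")
    return TurnDecision(True, None if addressee.lower() == "everyone" else addressee, content)
```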

And yet, we’re still not quite there. As Matt Webb points out, “The thing is multi-player is hard. The tech is solved but the design is still hard.” He tried three approaches to get AI to respect the guardrails of group-chat interactions, none of which fully worked. First, he tried a “shouldReply pattern”: a quick call to an LLM with the context of the chat before the full call to generate a response (or not). The bots would all respond at once, there was no allowance for “personality,” and there was no coordination amongst the bots; the conversation generally felt stilted and unnatural. He then tried a “more nuanced shouldReply discriminator for multi-bot” situations: a centralized decision-maker for all the bots, which would choose which of them would respond and when. This attempt resembled a city council meeting, again producing odd, stilted conversation. A central decision-maker wouldn’t do anyway: if the bots are hosted on different platforms, the “discriminator” won’t have visibility into each bot’s infrastructure to determine its intent, its planned response, how “eager” or appropriate its reply would be, or its personality parameters.
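For a sense of what a centralized decision-maker in the spirit of that second attempt might look like, here is a rough sketch. The prompt wording and the `generate` callable are assumptions, not Webb’s code.

```python
def pick_responder(chat_history: list[dict], bots: list[str], generate) -> str | None:
    """Return the name of the single bot that should reply next, or None for silence."""
    transcript = "\n".join(f"{m['speaker']}: {m['text']}" for m in chat_history)

    answer = generate(
        "Group chat so far:\n"
        f"{transcript}\n\n"
        f"The available bots are: {', '.join(bots)}.\n"
        "Which single bot, if any, should reply to the latest message? "
        "Answer with exactly one name from the list, or 'none'."
    ).strip()

    return answer if answer in bots else None
```

Note what the discriminator cannot see: each bot’s planned reply, eagerness, or personality. It is choosing on names and chat history alone, which is exactly the visibility problem described above.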

The final effort returned to the basics of human interaction outlined at the beginning of this post: a way to communicate body language and other kinds of “side-channel communication” to negotiate who speaks when. Turn-allocation rules are our silver bullet; they could be the way to get a bot to determine for itself when and how to reply. In natural conversation, the current speaker has the right, and even the obligation, to select the next speaker, as outlined in A Simplest Systematics for the Organization of Turn-Taking for Conversation by Sacks, Schegloff, and Jefferson. The “current selects next” rules are the default group of strategies; the “self-selection” strategies follow when no next speaker is named. A bot would first determine whether it has been appointed as the next speaker, and whether there is a follow-up after the initial question. Then it would check whether it would be interrupting (this is easier done by bypassing the full conversation history and simply checking the most recent participant exchanges). Lastly, the self-selection rules come into play: the bot assesses whether it has something to contribute, how confident it is in that contribution, and whether speaking fits its programmed personality profile, each rated on a scale from 0 to 9 (0 being no).
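Below is a minimal sketch of that rule ordering for a single bot, assuming the chat history is a list of speaker/text dictionaries. The combination rule at the end (requiring every self-selection score to clear 5) and the KNOWN_BOTS roster are assumptions for illustration, not Webb’s implementation.

```python
# Names the bot treats as fellow bots; a hypothetical roster for this sketch.
KNOWN_BOTS = {"research-bot", "scheduler-bot"}

def allocate_turn(chat_history: list[dict], bot_name: str,
                  relevance: int, confidence: int, eagerness: int) -> bool:
    """Decide, for one bot, whether it may take the next turn.

    Rules are applied in order, loosely following Sacks, Schegloff, and
    Jefferson: (1) current selects next, (2) don't interrupt, (3) self-select.
    The 0-9 inputs stand in for model- or config-derived scores.
    """
    last = chat_history[-1]

    # Rule 1: current selects next. If the latest message names this bot,
    # it has both the right and the obligation to reply.
    if bot_name.lower() in last["text"].lower():
        return True

    # Rule 2: avoid interrupting. Look only at the most recent exchanges; if
    # the humans are still mid-exchange with one another, stay quiet.
    recent = chat_history[-3:]
    if len(recent) >= 2 and all(m["speaker"] not in KNOWN_BOTS for m in recent):
        return False

    # Rule 3: self-selection. Speak only if the bot has something to contribute,
    # is confident in it, and speaking suits its personality profile.
    return min(relevance, confidence, eagerness) > 5
```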

The algorithm that ultimately determines whether and when a bot should reply, in Webb’s model, calculates an “enthusiasm score.” This is a composite of the turn-taking parameters above: whether the bot is currently engaged in conversation on the topic at hand (returning 0 for no and 9 for yes), plus a self-selection score based on how confident the bot is that it has something relevant to say. Once all the bots in the chat have returned their scores, only those scoring above 5 are considered, and ultimately selected, to reply. The approach works because it mirrors a real-world pattern: typically, the most enthusiastic person grabs the reins of the conversation next.
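A sketch of that scoring and selection step follows. The post doesn’t specify how the engagement signal and the self-selection score are combined, so averaging them here is an assumption, as are the bot names in the example.

```python
def enthusiasm_score(engaged: bool, self_selection: float) -> float:
    """Composite enthusiasm score in the spirit of Webb's model.

    `engaged` is whether the bot is already part of the current topic
    (0 for no, 9 for yes); `self_selection` is the bot's 0-9 confidence
    that it has something relevant to say.
    """
    engagement = 9.0 if engaged else 0.0
    return (engagement + self_selection) / 2

def choose_speakers(scores: dict[str, float], threshold: float = 5.0) -> list[str]:
    """Keep only the bots scoring above the threshold, most enthusiastic first."""
    eligible = [(name, s) for name, s in scores.items() if s > threshold]
    return [name for name, _ in sorted(eligible, key=lambda pair: pair[1], reverse=True)]

# Example: three bots report their scores; only those above 5 are invited to reply.
scores = {
    "research-bot": enthusiasm_score(True, 8),    # 8.5
    "scheduler-bot": enthusiasm_score(False, 7),  # 3.5
    "banter-bot": enthusiasm_score(True, 3),      # 6.0
}
print(choose_speakers(scores))  # ['research-bot', 'banter-bot']
```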

As AI becomes more integrated into our collaborative workflows and social spaces, the ability to gracefully manage multi-party dialogue will be essential. Designing AI that can negotiate conversational norms won’t just improve usability; it will make these systems feel less like tools and more like teammates. This is an area ripe for invention, interruption, and innovation. For those so inclined, we recommend a deep dive into the mechanics and rules of human systems of exchange, linguistics, and conversational norms. There’s still so much to consider here: company or group culture, infrastructure and reliability, interface clarity, scalability and cost, and shared mental models, to name a few. Happy building.
