Methodological Gaps in AI User Experience Research
Understanding Our Emerging Relationships with AI

Over the past four months, I've spoken with more than 50 professionals as part of our interview process for AI user experience research (UXR) roles at Maincode. During these conversations with experts from diverse backgrounds in Human-Computer Interaction (HCI), several recurring methodological questions emerged around how we study AI systems, particularly large language models (LLMs).
These conversations uncovered real limitations in current UXR methodologies when applied to conversational AI. While some of the questions we heard are beginning to be addressed by research, others remain wide open. This post shares the questions we encountered, links them to emerging work where available, and outlines our early thinking about how we might approach the gaps that remain.
Questions That Emerged (and Where They're Leading Us)
1. How do we capture the emotional dimensions of AI interaction?
One of the most surprising patterns we noticed was how often users responded to LLMs with emotional expressions, saying things like “That’s right, great work,” despite knowing these systems aren’t sentient. This phenomenon came up repeatedly in interviews, and it seemed clear that traditional methodologies weren’t built to capture this kind of interaction.
There is some research starting to tackle this. For instance, Li et al. (2023) found that emotion prompts can significantly improve LLM performance. In another study with 106 human evaluators, LLM-generated responses were rated as more empathic than human-written ones.
But these insights haven’t yet made their way into how we evaluate UX. Emotional engagement with AI appears categorically different from engagement with traditional tools, and it’s still methodologically underexplored.
What if we began exploring ways to study emotional engagement over time, combining psychometric instruments with physiological monitoring to trace how users relate to AI across repeated interactions? There’s an intuition that something new is happening here, something that doesn't show up in isolated moments but only reveals itself in the unfolding relationship. We’re wondering what patterns might emerge if we treated this as an evolving emotional arc rather than a static interaction.
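To make that concrete, here is a minimal sketch of what a longitudinal record might look like in practice. Everything in it is an assumption on our part: the `SessionRecord` fields, the short attachment scale, and skin conductance as the physiological proxy are placeholders for whatever instruments a real study would actually validate. The point is only that pairing a repeated self-report with a repeated physiological measure lets you estimate a per-participant trend rather than a single snapshot.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class SessionRecord:
    """One participant-session: a self-report score plus a physiological proxy.
    Both measures and their scales are hypothetical placeholders."""
    participant_id: str
    session_index: int          # 0, 1, 2, ... across repeated interactions
    attachment_score: float     # e.g., mean of a short Likert scale (1-7)
    mean_scl: float             # e.g., mean skin conductance level (microsiemens)

def engagement_trend(sessions: list[SessionRecord]) -> float:
    """Least-squares slope of attachment_score over session_index.
    A positive slope suggests the relationship is deepening over time."""
    xs = [s.session_index for s in sessions]
    ys = [s.attachment_score for s in sessions]
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den if den else 0.0

# Toy usage: one participant whose self-reported attachment creeps upward.
records = [
    SessionRecord("p01", 0, 3.2, 4.1),
    SessionRecord("p01", 1, 3.8, 4.4),
    SessionRecord("p01", 2, 4.5, 4.3),
]
print(f"attachment trend per session: {engagement_trend(records):+.2f}")
```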
2. What does the uncanny valley feel like in text, and how do we measure it?
Another recurring theme was the discomfort users sometimes feel when LLMs respond “too well.” It reminded us of the uncanny valley concept, but the original theory was built around humanoid robots, not text interfaces. So we started asking: What is the uncanny valley of language?
There is some precedent. Ciechanowski et al. (2019) showed that people preferred simpler chatbots over overly complex ones, possibly due to the uncanny effect. Kim et al. (2022) found evidence for two distinct valleys rather than a single dip, suggesting a more complex emotional topology, with both positive and negative reactions, than the original curve implies.
Still, most studies don’t address text-based AI. This is a methodological gap we’re eager to explore.
What if the discomfort people feel when talking to LLMs isn’t just about aesthetics, but something deeper and more cognitive? Could it come from a kind of dissonance, where users know they’re interacting with a machine but still respond emotionally, almost involuntarily? If that’s the case, what would it take to measure that tension in a meaningful way? We’re curious whether exploring this across different tasks and user types might reveal patterns that existing methods have overlooked.
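As a thought experiment, measurement could start very simply: show participants responses written at different levels of human-likeness, collect eeriness ratings, and look for the characteristic non-monotonic dip. The conditions and numbers in the sketch below are invented purely to illustrate the shape of the analysis, not findings.

```python
# Hypothetical mean "eeriness" ratings (1-7) for responses written at increasing
# levels of human-likeness; both the conditions and the numbers are invented.
human_likeness = ["templated", "plain", "fluent", "near-human", "indistinguishable"]
mean_eeriness = [2.1, 2.0, 2.6, 4.3, 3.0]

# A text-based uncanny valley would show up as a non-monotonic bump:
# discomfort rising just short of full human-likeness, then easing off.
peak = human_likeness[mean_eeriness.index(max(mean_eeriness))]
print(f"Eeriness peaks at the '{peak}' condition")
```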
3. How do we evaluate power dynamics and information asymmetry in AI interaction?
A third question that surfaced was about power. Unlike earlier tools, LLMs don't just serve; they interpret, suggest, and adapt. And because their internals are opaque, users often don't realize just how much influence the system has.
Xu (2023) flagged this in their work, noting that emotional attachment and unrealistic expectations are common. Other studies have shown that LLMs tailor their language to users, potentially triggering relational patterns that resemble mentorship or therapy.
That is a lot of power in one direction. And yet, most UX methods assume parity or user control. Clearly, that assumption no longer holds.
What if we looked more closely at how these asymmetries show up in real-world usage? What might it mean for a user to unknowingly shift their behavior based on an AI's tone, pacing, or adaptation strategy? Could studying that dynamic reveal forms of influence we don’t yet have tools to detect? It might require entirely new methods, approaches designed to surface the subtler psychological effects of systems that feel responsive, but are ultimately opaque.
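One concrete, if crude, starting point might be to quantify how quickly a user's own language drifts toward the system's. The sketch below is an assumption-laden toy: `alignment_curve` just measures lexical overlap between each user reply and everything the assistant has said so far, which is nowhere near a validated measure of influence, but it shows the kind of turn-by-turn signal such a study could track.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word tokens; deliberately crude for a sketch."""
    return set(re.findall(r"[a-z']+", text.lower()))

def alignment_curve(turns: list[tuple[str, str]]) -> list[float]:
    """For each (assistant_msg, user_reply) pair, the share of the user's
    words that already appeared in the assistant's prior messages.
    A rising curve hints that the user is adopting the system's vocabulary."""
    assistant_vocab: set[str] = set()
    curve = []
    for assistant_msg, user_reply in turns:
        assistant_vocab |= tokens(assistant_msg)
        user_words = tokens(user_reply)
        if user_words:
            curve.append(len(user_words & assistant_vocab) / len(user_words))
    return curve

# Toy conversation: the user starts echoing the assistant's framing ("scope").
conversation = [
    ("One trade-off is scope versus speed.", "I just want it done fast."),
    ("Narrowing the scope keeps the trade-off manageable.", "Okay, let's narrow the scope."),
]
print(alignment_curve(conversation))
```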
4. How can we evaluate systems that evolve over time?
We also heard questions about adaptability. Many current UX methods are built for static systems: evaluate it once, get your answers, move on. But LLMs change. They evolve with users. That makes traditional evaluations unreliable or obsolete too quickly.
These limitations aren’t just theoretical. Standard methods often require significant time and resourcing, and they struggle to keep pace with systems that shift in response to user input. While techniques like synthetic UXR offer some promise for improving efficiency, they still fall short of capturing the full complexity and unpredictability of real human behavior.
What if we started experimenting with more longitudinal approaches, looking at how user behaviors and AI behaviors shift in parallel over time? Could this co-evolution be where the richest insights into trust, reliance, and adaptation actually live? It’s possible that by observing these interactions unfold across weeks or months, we would start to see patterns and transformations that traditional, point-in-time studies simply miss.
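As a sketch of what that could look like with real logs, imagine pairing a weekly user-side metric with a weekly system-side metric and asking whether they move together. The two series below, an edit rate and a "tailoring" score, are invented placeholders; the only point is that co-evolution is something you can see only across a series, never in a single session. (`statistics.correlation` requires Python 3.10+.)

```python
from statistics import correlation  # Python 3.10+

# Hypothetical aggregates over six consecutive weeks of a diary-plus-logs study.
# edit_rate: fraction of AI output the user rewrites before using it (user side).
# tailoring: how far the model's responses drift from its week-1 style, 0-1 (AI side).
edit_rate = [0.62, 0.55, 0.47, 0.40, 0.36, 0.31]
tailoring = [0.05, 0.12, 0.22, 0.31, 0.38, 0.44]

# If reliance rises (edits fall) as the system tailors itself more, the two
# series should be strongly negatively correlated, something a one-off
# usability test could never show.
print(f"edit_rate vs tailoring: r = {correlation(edit_rate, tailoring):+.2f}")
```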
5. How do we assess users’ mental models of AI?
Finally, one of the thorniest issues is mental models. Users rarely understand how LLMs work. But that incomplete understanding shapes how they use, trust, and interpret the system.
Jiang et al. (2024) emphasized the need for interdisciplinary methods to assess this. Other studies showed that flawed mental models directly impact decision-making and trust calibration.
Yet most tools for evaluating mental models weren’t built for AI, and certainly not for the scale or complexity of LLMs.
What if we began systematically collecting evidence on a pattern we've so far only seen informally: that most users operate with strikingly inaccurate assumptions, not just about what the system does, but about what it knows, how it adapts, and who ultimately controls it? What might we learn if we prototyped mixed-method studies to surface these hidden beliefs and track how they evolve through repeated interaction? There's a hunch here that user misunderstandings aren't just common but structurally embedded, and if that's true, we may need an entirely new toolkit to make those models visible and measurable.
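One way to prototype this, sketched below under heavy assumptions, is a short belief probe scored against a reference key and re-administered across sessions. The probe statements, the "correct" answers, and the scoring are all placeholders rather than validated items; the aim is only to show how hidden mental models could be made trackable over time.

```python
# Hypothetical belief probes about how the system works, with reference answers.
# The statements and "correct" answers here are placeholders, not validated items.
PROBES = {
    "The assistant looks up my past conversations from other apps.": False,
    "The assistant can update its own knowledge while we talk.": False,
    "The assistant's answers can change if I rephrase the same question.": True,
}

def mental_model_accuracy(responses: dict[str, bool]) -> float:
    """Share of probes where the participant's belief matches the reference key."""
    scored = [PROBES[q] == a for q, a in responses.items() if q in PROBES]
    return sum(scored) / len(scored) if scored else 0.0

# One participant, probed after their first and fifth sessions.
session_1 = {
    "The assistant looks up my past conversations from other apps.": True,
    "The assistant can update its own knowledge while we talk.": True,
    "The assistant's answers can change if I rephrase the same question.": True,
}
session_5 = dict(session_1, **{
    "The assistant can update its own knowledge while we talk.": False,
})
print(mental_model_accuracy(session_1), mental_model_accuracy(session_5))
```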
Where We're Headed
What’s striking across all of these questions is how little of the existing UXR toolkit fits the problem space. Emotional engagement, power asymmetries, evolving systems: these aren't edge cases; they're the essence of AI interaction.
At Maincode, we’re not discouraged by these gaps. We’re energized. These are the kinds of problems that require a mix of scientific rigor and creative openness. We’re drawing from psychology, anthropology, systems thinking, and experimental design to build a research toolkit that’s up to the task.
We’re also not trying to solve it alone. Many of these questions are foundational, and answering them will take collaboration. If you’re working on similar challenges or have ideas about how to study them, we’d love to be in conversation.
We believe that getting this right matters, not just for usability, but for the kind of human–AI relationships we want to create: ones grounded in understanding, empowerment, and mutual respect.
References
Ciechanowski, L., Przegalinska, A., Magnuski, M., & Gloor, P. (2019). In the shades of the uncanny valley: An experimental study of human–chatbot interaction. Future Generation Computer Systems, 97, 23–34.
Jiang, T., Sun, Z., Fu, S., & Lv, Y. (2024). Human-AI interaction research agenda: A user-centered perspective. Data and Information Management, 8(4), 100078.
Kim, B., de Visser, E. J., & Phillips, E. (2022). Two uncanny valleys: Re-evaluating the uncanny valley across the full spectrum of real-world human-like robots. Computers in Human Behavior, 133, 107625.
Li, C., Wang, J., Zhang, Y., Zhu, K., Hou, W., Lian, J., Luo, F., Yang, Q., & Xie, X. (2023). Large language models understand and can be enhanced by emotional stimuli. arXiv preprint arXiv:2307.11760.
Xu, W. (2023). AI in HCI design and user experience. Human-Computer Interaction in Intelligent Environments, 141–170.