How do you adopt AI intentionally in a Voice of Customer programme?
Intentional AI adoption means matching the right type of AI to the right job, not putting everything through generative AI. Wordnerds' data-science team shares the framework they use to decide.
TL;DR
Intentional AI adoption means matching the right type of AI to the right job, not putting everything through generative AI. In this Wordnerds webinar, the data-science team shares the two-axis assessment framework it uses to decide where generative AI belongs in a Voice of Customer workflow.
Head of Data Science Hugh Volpe frames the decision on two axes: how subjective the right answer is, and how easily you can catch a mistake. Using a general chatbot to analyse feedback lands in the risky zone, and his benchmarking found generative AI roughly 200 times slower and 500 times more expensive than the methods it would replace.
To make it safe and affordable, Wordnerds uses Google's Gemini to label a small sample, then distils that into a tiny custom model that applies a theme across a whole dataset in about ten minutes. Senior Machine Learning Engineer Damani Richards is candid that building this took roughly six months and a full team, with the production work around 80 percent of the effort.
Head of Product Stephanie Clish closes on fragmentation: when every tool bolts on its own AI, you get conflicting insights and no single source of truth, the case for analysing all your feedback in one specialist tool with one methodology.
Why watch this webinar?
Hugh demystifies the AI jargon in plain English, then shows exactly where generative AI quietly earns its place and where it becomes a liability you will not catch in time. Damani is refreshingly candid about what building actually costs once the fun ideation phase ends. Stephanie names the mess of bolted-on AI that most insight teams are already living with. It is the grounding conversation to have before your next "let's just put it in Copilot".
Duration: 58 minutes.
What this webinar covers
This session tackles the AI conversations every insight team is now having: "let's just put it in Copilot", "let's build this ourselves", and "we already have AI insights in our other tools". Rather than answer with hype or hand-waving, the Wordnerds data-science and product teams show the reasoning they apply internally when they decide whether, and how, to use AI in a Voice of Customer platform.
The throughline is a practical assessment framework for generative AI, illustrated against a live demo of Wordnerds' new definition-led themes, plus an honest behind-the-scenes account of what it took to build that feature. You will hear why generative AI is powerful but slow, expensive and prone to confident mistakes, and how Wordnerds combines it with a small, fast, distilled model to get the best of both.
By the end, you will have three concrete tests to take back to your own stack: is the right AI doing the right job, can non-technical people use and maintain what you build, and are your insights joined up enough to make confident decisions.
Natalie Grant | Head of Customer Success | Wordnerds
Natalie leads Customer Success at Wordnerds and hears customer challenges daily. She hosted this session and framed the three AI adoption conversations the panel set out to answer.
Hugh Volpe | Head of Data Science | Wordnerds
Hugh has been building with AI since before the current wave of hype. He leads data science at Wordnerds and walked the audience through the assessment framework the team uses to decide where generative AI belongs.
Damani Richards | Senior Machine Learning Engineer | Wordnerds
Damani built enterprise NLP at Cantor before joining Wordnerds. As Senior Machine Learning Engineer he led the build of the new definition-led themes feature and shared the real cost of taking it from idea to production.
Stephanie Clish | Head of Product | Wordnerds
Stephanie shapes Wordnerds' product direction. She closed the session on why fragmented, bolted-on AI creates conflicting insights, and what a specialist tool does differently.
What is intentional AI adoption?
Intentional AI adoption means choosing the right type of AI for the right job, instead of routing everything through generative AI because it is the technology in the headlines. Wordnerds' Head of Customer Success, Natalie Grant, framed the webinar around the three conversations insight teams keep having: "let's just put it in Copilot", "let's build this ourselves", and "we already have AI insights in our other tools".
The key reframe from the session is that most of the hype of the last few years is specifically about generative AI, one narrow branch of a much broader field that also includes machine learning and deep learning. Being intentional means being able to tell those apart and ask whether the flashy option actually serves your goal.
The practical takeaway Wordnerds offered is an action you can take this week: audit how you are using generative AI now, and check whether each use passes a simple suitability test before you scale it. AI moves fast and the opportunities are real, but being intentional is what turns that into value rather than noise.
How do you decide where to use generative AI?
Wordnerds uses a two-axis framework, presented by Head of Data Science Hugh Volpe, to judge any generative-AI use case. The vertical axis asks how subjective the correct answer is; the horizontal axis asks how easily a human can spot a mistake. Where a task lands tells you how much risk you are taking on and what guardrails you need.
A tool that drafts a reply for a support agent to review sits in a comfortable zone: the answer is subjective, and the expert reading it can catch any error. A public self-diagnosis app sits in the riskiest corner: there is a single correct answer and the user cannot tell when it is wrong. The same medical model handed to doctors moves back into safety, because experts can spot mistakes.
Wordnerds places using a general chatbot to analyse customer feedback in the risky zone. There is a right answer, and errors are hard to catch: if you have to re-read all the data to check the tool, it has not saved you anything. The framework is not about banning risky uses; it is about redesigning them until they are safe.
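The quadrant logic above can be sketched in a few lines of Python. This is an illustration only: the 0-to-1 scores, the 0.5 threshold, and the zone labels "manageable" and "needs guardrails" are assumptions made for the example. The session itself named only the comfortable and risky zones and did not attach numbers to either axis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    subjectivity: float        # 0.0 = one correct answer, 1.0 = many acceptable answers
    mistake_visibility: float  # 0.0 = errors go unnoticed, 1.0 = a reviewer spots them


def assess(use_case: UseCase, threshold: float = 0.5) -> str:
    """Place a use case in one of four zones of the two-axis framework."""
    subjective = use_case.subjectivity >= threshold
    visible = use_case.mistake_visibility >= threshold
    if visible and subjective:
        return "comfortable"
    if visible:
        return "manageable"        # one right answer, but an expert can check it
    if subjective:
        return "needs guardrails"  # many right answers, but nobody is checking
    return "risky"                 # one right answer and mistakes go unseen


# Hypothetical scores for the examples discussed in the session.
examples = [
    UseCase("draft reply reviewed by a support agent", 0.9, 0.9),
    UseCase("reply sent automatically", 0.9, 0.2),
    UseCase("public self-diagnosis app", 0.1, 0.1),
    UseCase("same diagnosis model used by doctors", 0.1, 0.9),
    UseCase("general chatbot analysing feedback", 0.2, 0.2),
]

zones = {case.name: assess(case) for case in examples}
for name, zone in zones.items():
    print(f"{name}: {zone}")
```

Scoring a use case this way is less about the exact numbers than about forcing the two questions to be answered separately before a tool is rolled out.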
Why not just use Copilot or ChatGPT for feedback analysis?
Generative AI is genuinely impressive, but Hugh Volpe's benchmarking showed why it is the wrong default for feedback analysis at scale. For the tasks Wordnerds tested, generative AI ran roughly 200 times slower and 500 times more expensive than the machine-learning approaches it would replace, with a correspondingly heavier environmental cost because you are largely paying for energy.
That expense is not academic. It means you cannot simply push a million phone calls a month through a chatbot, and you certainly cannot go back and re-analyse five years of history every time you think of a new thing to look for. The practical effect is artificial limits, exactly the limits Wordnerds does not want on your themes.
There is also the reliability problem. Generative models are built to produce coherent-sounding language, so they have a particular habit of giving very convincing wrong answers, what the industry calls hallucinations. Feedback analysis lands in the framework's risky zone precisely because those mistakes are hard to spot. Notably, Wordnerds uses Google's Gemini rather than Copilot for its own generative-AI steps, behind proper guardrails.
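To see why the multipliers quoted above turn into hard limits at scale, here is a small back-of-the-envelope calculation. The 200x and 500x figures come from the session; the baseline time and cost per item are invented for illustration and should be replaced with your own measurements.

```python
# Multipliers quoted in the session for the tasks Wordnerds benchmarked.
SLOWDOWN = 200
COST_MULTIPLIER = 500

# Hypothetical baseline for a conventional machine-learning model.
BASELINE_SECONDS_PER_ITEM = 0.001
BASELINE_COST_PER_ITEM = 0.00001  # arbitrary currency units


def project(items: int) -> dict:
    """Project processing time and cost for both approaches."""
    base_hours = items * BASELINE_SECONDS_PER_ITEM / 3600
    base_cost = items * BASELINE_COST_PER_ITEM
    return {
        "baseline_hours": base_hours,
        "generative_hours": base_hours * SLOWDOWN,
        "baseline_cost": base_cost,
        "generative_cost": base_cost * COST_MULTIPLIER,
    }


monthly = project(1_000_000)             # a million phone calls a month
five_years = project(1_000_000 * 12 * 5)  # re-analysing five years of history

for label, result in (("one month", monthly), ("five years", five_years)):
    print(label, {key: round(value, 2) for key, value in result.items()})
```

Under these made-up baselines, five years of history costs roughly 17 hours and 600 units on the conventional model, against roughly 3,300 hours and 300,000 units through generative AI. The absolute numbers are placeholders; the gap between them is the point.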
How did Wordnerds build its definition-led themes feature?
Wordnerds' new definition-led themes, demoed live in the session and rolling out to customers from November, replace about fifteen minutes of manual tagging per theme with a definition you write in plain language. Behind the scenes, Google's Gemini does the equivalent of around 24 hours of training work, then shows you what it understood so you can correct it.
The engineering trick, explained by Damani Richards, is distillation. Gemini tags a small sample of data, and that behaviour is distilled into a very small custom neural network, far smaller than a language model. An embeddings model turns text into numbers as data arrives, so the heavy lifting is done up front and a theme can then be applied across a whole dataset in about ten minutes.
Getting there took roughly four months of experimentation, including a consultant deep-learning specialist for the distillation step, and another two months to build the production tool, about six months in total. The result is comparable accuracy to a full LLM, at a fraction of the time and cost.
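The distillation pattern described above (an expensive teacher labels a small sample, a cheap student learns to copy it, and only the student runs over the full dataset) can be sketched as follows. Everything here is a stand-in chosen to keep the example self-contained: `teacher_label` is a hypothetical keyword rule in place of the Gemini call, `embed` is a toy bag-of-words vector in place of a real embeddings model, and the student is a single-layer logistic model, not the custom network Wordnerds built.

```python
import math


def teacher_label(text: str) -> int:
    """Stand-in for the expensive LLM call (hypothetical keyword rule)."""
    words = text.lower().split()
    return int(any(word in words for word in ("mouldy", "expired", "stale")))


def build_vocab(texts):
    """Assign an index to every word seen in the labelled sample."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def embed(text, vocab):
    """Toy embedding: one slot per known word, unknown words ignored."""
    vector = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vector[vocab[word]] = 1.0
    return vector


def train_student(vectors, labels, epochs=300, lr=0.5):
    """Fit a tiny logistic model to reproduce the teacher's labels."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            error = y - 1.0 / (1.0 + math.exp(-z))
            bias += lr * error
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights, bias


def student_predict(text, vocab, weights, bias):
    x = embed(text, vocab)
    return int(bias + sum(w * xi for w, xi in zip(weights, x)) > 0)


# 1. The teacher labels only a small sample.
sample = [
    "the bread was mouldy", "milk expired last week",
    "stale crisps in the multipack", "yoghurt was mouldy and expired",
    "great service at the till", "delivery arrived on time",
    "the staff were friendly", "love the new store layout",
]
sample_labels = [teacher_label(text) for text in sample]

# 2. The student is trained once on that sample.
vocab = build_vocab(sample)
weights, bias = train_student([embed(t, vocab) for t in sample], sample_labels)

# 3. Only the cheap student touches the rest of the data.
full_dataset = ["mouldy bread again", "friendly staff today"]
predictions = [student_predict(t, vocab, weights, bias) for t in full_dataset]
print(predictions)
```

The design choice worth noting is that the teacher is called eight times in total, however large `full_dataset` grows, which is where the speed and cost savings come from.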
What does it really cost to build AI in-house?
If you are weighing "let's build this ourselves", Damani Richards' honest answer is that the idea is the cheap part. A rough version that worked for one or two themes existed on two laptops in about two weeks. Turning that into something reliable, secure and usable was roughly 80 percent of the work and the bulk of a six-month timeline.
That production phase needed skills well beyond two engineers. Wordnerds brought in front-end and user-interface developers to make the tool intuitive, infrastructure experts so it scaled without falling over, security specialists to make sure customer text data never leaked or trained outside models, and product managers to keep the build genuinely usable.
The cost does not stop at launch, either. Once a tool ships, there is continuous commitment: answering user questions, fixing inevitable bugs, updating the technologies underneath it, and keeping pace with new advances. As Damani put it, the dream of building and the reality of maintaining are two different things, and the maintenance takes a bigger team than the people who had the original idea.
Does Wordnerds use your data to train AI models?
No. In the closing Q&A, Head of Data Science Hugh Volpe answered this directly: Wordnerds does not use any customer data, or the public data in your projects, to train large language models. It is a question the team gets often, and a legitimate one.
The distinction Hugh drew is between consumer and enterprise use of the same underlying technology. If you paste data into a free ChatGPT account, that data may well be used to train the model. Wordnerds instead uses a properly configured enterprise setup within Google, where the model sees the data, returns its answer, and then the data is gone. The setup is locked down and documented for that single purpose.
That control is part of what intentional adoption looks like in practice. The same generative-AI capability can be a serious data-governance risk or a safe, auditable component, depending entirely on how it is set up. For a Voice of Customer programme handling personal information inside free-text feedback, that configuration is not a detail; it is the whole point.
Full Webinar Transcript
Natalie Grant: Beyond the hype, intentional AI adoption for VoC. This conversation is to help you lead intentional AI adoption with confidence. I'll get started with some introductions. So first of all, if you've not met me before, I'm Nat. I'm the Head of Customer Success at Wordnerds and I hear your challenges daily.
Joining us today we've got Hugh, our Head of Data Science, who has been building with AI before the hype. We've got Damani, who is our Senior Machine Learning Engineer and built enterprise NLP at Cantor before joining Wordnerds earlier this year. And finally we've got Steph, our Head of Product, who helps shape our future direction.
Before we get started, some housekeeping for today. Do introduce yourself, pop your name and where you're joining us from in the chat, and feel free to share what the weather's like where you are. It's pretty grey and gloomy here in Sunderland. At the end we're going to have a live Q&A with our expert panel, so do add your questions in the chat anytime. We'll flag them during the conversation and cover them all at the end in one go. During the Q&A we're only going to use public data. You will see a live demo later on, and just to assure you, we're only going to be using public data for that too.
At the end, we're going to be sharing a recording of the full session, straight after the session today. But later on in the week you're going to get a resource pack with each chapter broken down and some additional resources to help you put this into practice. Finally, your pets are welcome and in fact encouraged, because during our run-throughs our pets kept joining us.
So let's see who we've got joining us. We've got Joseph from Sage. We've got Rob in Northumberland too. Lots of people from Sage joining today, Amy from True Potential, Alan from the DWP. So fab to see so many familiar names.
We've got a packed agenda today. So if you're thinking "why am I here?", well, these are the conversations that you've been telling us you're having. What's really interesting is that a lot of our customers, especially the ones that have been with us for years, were actually the first people in their organisations to bring in AI. They were the innovators. But now everybody in their organisation is talking about AI. And these are those adoption conversations: "let's just put it in Copilot", "let's build this ourselves", "we have AI insights in other tools". And you've also got colleagues who feel like it's all hype and want to wait and see.
But what does this mean for you? This is the dream analysis machine that I know you're building. You want to get all of your data from any data source into one place for that scalable, robust and consistent analysis. And when it comes to text analytics in particular, what really matters is speed, flexibility, depth and accuracy. Why does this matter? Because ultimately what you want to be able to do is make confident decisions, improve ROI and really make a difference.
So this is the agenda for our journey today and how we're going to unpack those conversations one by one. Starting with Hugh, we're going to talk about that conversation around "let's just put it in Copilot". Hugh's going to demystify AI for us, share our assessment framework for intentional AI adoption, and give you a demo of our new definition-led themes that are rolling out on the platform next month. Then Damani is going to talk about the conversation around "let's build this ourselves". He'll take us behind the scenes of our recent build, how we use our assessment framework, and share the hidden cost of innovation. Finally, Steph is going to answer that question around having AI insights in other tools. So over to you, Hugh.
Hugh Volpe: Thanks, Nat. So as I said, I'm going to attempt to give you our take on how we'd respond to "let's just put it in Copilot" as an idea. I'm going to have to do that by explaining some AI terms to begin with, because otherwise everything else I say is going to be quite confusing for half the audience.
So, these are the terms we'd like to use in the presentation. The first one is artificial intelligence. It's probably the easiest one to understand. We're just talking about computer systems that can perform complex tasks normally done by humans: reasoning, decision-making, creativity and so on. It's an umbrella term for a lot of different things. You'll find it's used in a way that suggests it's something very narrow, and it's not. It's a very broad field with a lot of different approaches in it.
Within that you have machine learning. That is one approach to artificial intelligence. Essentially it's when you're using data and algorithms to train computers to do the AI. Imagine a little black box. You take loads of real-world examples on one side and the outputs you're expecting on the other, and you have a process that automatically goes through and tweaks all the little dials on the side of the box to make it consistently output a good answer. You don't really need to think about what's in the box. To be honest, it's better if you don't.
Within machine learning there's a specific subset, deep learning. I'm going to say "neural network" but I'm not going to tell you what a neural network is, because it's such a long thing to describe and you don't really need to know. Deep learning is essentially one of the things that's really exploded over the last decade or so, where people found that having just enormous layers of neural networks in this little black box was able to achieve really quite astounding things. When people talk about AI now, they're often talking about this section, but it's not all of AI. AI is much broader.
Within deep learning you have a very specific thing that is now the thing you more or less always hear about: generative AI. Generative AI by its nature is not normally categorising something. It produces new stuff. In our case that would normally be text; it can be images. If you imagine tools like Copilot for Microsoft, OpenAI's ChatGPT or Google's Gemini, you're probably imagining a chat interface. Those chat interfaces all sit a layer above a core model, a rather large black box running in the background doing the intelligence part. But it's worth understanding you don't have to use generative AI like that. Embracing it does not mean having a chat interface everywhere. It can sit in the background doing things without you even appreciating that's how it worked.
Other terms you might hear are things like Transformers. Transformers were an architectural change to how the insides of the black box work that came about in about 2018, and that allowed the explosion in performance we've seen. Large language models are just the name of the underlying deep-learning network. And natural language processing is the broad term for having AI that enables computers to understand human language; it isn't specifically a generative, deep-learning or machine-learning thing, it can be anywhere in those circles.
So bearing in mind what I'm trying to address, why don't we just use Copilot for this stuff? First I want to introduce our journey to this point so you understand where we're coming from. Wordnerds has been doing something broadly in the category of Voice of Customer since about 2019, when we made the platform that current customers log into. That platform started in 2019, right as Transformers were released. So our platform is built as an AI-first tool. It was never built around the idea that you might manually code a load of data or build a big taxonomy of keywords. But 2019 is not the point at which large language models became static. Every couple of years there's a new big wave of something we've had to adopt and figure out whether it actually adds value.
For example, around 18 months to two years ago was the first point at which we could realistically put the equivalent of ChatGPT in your platform. We chose not to do it then. The kind of thing it would do is give you a little chat interface to your help docs, which is cool but doesn't fundamentally change what it's like to use a text-analysis platform for Voice of Customer. So we didn't do it. But we've been working through what needs to be true to make this worthwhile, and at what point we should release features that include it.
So let me take you through the way we think about how to adopt generative AI into a Voice of Customer tech stack. If you understand broadly the good and bad things about generative AI, there are some amazingly good things. The thing that stands out above all to me is its ability to take context about you and your particular situation to tailor your results. That's possible in traditional machine learning, but it involves far more customisation per use case, in a way that's doable but expensive. It's also much more understandable from a human point of view because you interact in text, and it can be a very low barrier to getting something that functions. Those are really important advantages.
But you have to pair that with the problems. Generative AI is slow by comparison to the alternatives. For the particular things we looked at, we're talking about something like 200 times slower. It's not a bit slower, it's a lot slower. It's also incredibly expensive by comparison; 500 times is the number we came out with. It depends what you're doing, but it's a lot more expensive, not a little. Part of that also means it's worse for the environment, because you're using a lot more resources, largely energy.
What that expense really means is you couldn't use it to drop in and replace the things we feel you should be able to do. You'd end up with artificial limits. In our case, if you're a customer you'll understand the concept of themes in our platform. We don't want to put a limit saying you can have two new themes a month, or that you can only apply them for a certain period. We want you to apply them to all your data. We don't want you unable to compare Black Friday this year to last year because it's past a 12-month cut-off. So the expense question is really: what limits do we have to put in place to make it affordable?
And then hallucinations. All systems make mistakes; humans make mistakes. There's no version of this where things are perfect. Generative AI has a particularly funny relationship with making mistakes: it has a tendency to come out with very convincing-sounding incorrect answers. That's just the nature of these networks. They're basically built to predict the next word, to make coherent-sounding language. The fact they can answer questions at all is almost miraculous, but that same nature leads them to hallucinate. They're getting better, but it's not going away, so it's a thing you need to learn to manage.
So what I've got for you is our checklist for how we treat generative AI as part of a wider system. The first thing to address is hallucinations and unreliability. On screen you'll see a quadrant graph. Up the side we've got "how subjective is the answer?", and across, "how easy is it to spot the mistake?".
For example, if you're a customer-support person and you have a tool that automatically drafts your response to an incoming email, that's really great. An email is entirely subjective; there are lots of correct ways to say the right thing. And if it's drafting for you, you can spot the mistake. That lands in a nice space, top of the graph: likely to work well, and when it goes wrong it doesn't matter as much because it's easy to spot.
How does that change if you put exactly the same technology into a tool that automatically responds instead of letting you read it? You've moved it into a less-easy-to-spot region, so you've introduced risk. It doesn't mean you can't do it, but you should be aware it'll go wrong in unpleasant ways, and you'll have to build safety nets and guardrails around it.
To illustrate further, imagine a medical self-diagnosis app, say the NHS releases one where everyone can chat to it and it tells them what's wrong. On the subjectivity scale that's really low: there's a correct answer about whether you have an illness, and you're not going to know it's wrong, because that's the reason you're using the app. That puts you firmly in the very risky category, bottom left. But the exact same idea given to doctors moves back, because they can spot the mistakes.
So where would we put just using Copilot for feedback analysis? We've put that in the risky category. There is a correct answer, and you don't want a tool making unfounded recommendations that just sound nice. And it's not easy to spot, because fundamentally you use the tool to save you time. If you have to read everything to check it was right, it hasn't done its job.
So we try to break this down further. Which bit is manageable, which bit can we turn into something scalable and reliable? That's where we picked out categorisation of the data. Could you use Copilot, or in our case Gemini, the Google equivalent, to categorise data for us? That initially comes into the risky category. But that's where the interesting bit starts: how do you change it so it's no longer risky?
The other checklist items are easier. Can it run efficiently? Mostly in Voice of Customer, people have a volume of data where you can't just throw generative AI at it. You can't put a million phone calls a month through Copilot and expect a reasonable, efficient result, and you certainly can't retrospectively go back over five years. The thing we find really underutilised is using generative AI to supervise slightly older, very effective models, bringing the benefit of what generative AI can do into them, as a supervising layer over a machine-learning network. A smaller black box that's very efficient and can run at scale.
The final bit is how do you test it? This sounds basic, but putting a couple of things through ChatGPT and saying "look, it works" is not the end of the story. Does it work for somebody who uses slightly different language because they're from a different part of the country, or has a different version of the problem? When we test this stuff, we test with tens of thousands of examples, then change the prompt a bit and test it all again, because you get a different result. So finding a problem you can actually verify is working properly is a big part of this.
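As an illustration of the testing loop described here, the sketch below scores two variants against the same labelled examples, so that a change is judged on the whole set, not on a couple of hand-picked inputs. The keyword classifiers are hypothetical stand-ins for two prompt wordings, and the five examples stand in for the tens of thousands mentioned in the session.

```python
def make_variant(keywords):
    """Build a stand-in classifier; a real test would call the model with a prompt."""
    def classify(text: str) -> int:
        return int(any(keyword in text.lower() for keyword in keywords))
    return classify


def accuracy(classify, examples) -> float:
    """Share of labelled examples the classifier gets right."""
    correct = sum(classify(text) == label for text, label in examples)
    return correct / len(examples)


# Labelled test set, including a regional phrasing ("gone off").
examples = [
    ("the bread was mouldy", 1),
    ("milk expired last week", 1),
    ("the ham had gone off", 1),
    ("great service at the till", 0),
    ("delivery arrived on time", 0),
]

variants = {
    "narrow wording": make_variant(["mouldy", "expired"]),
    "broader wording": make_variant(["mouldy", "expired", "gone off"]),
}

# Re-run the full test set for every variant, every time one changes.
results = {name: accuracy(classify, examples) for name, classify in variants.items()}
for name, score in results.items():
    print(f"{name}: {score:.0%}")
```

The narrow variant looks fine on the first two examples and only fails on the phrasing it was never tried against, which is the failure mode a quick manual check misses.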
I've done a run-through of how we think about these problems. I'm now going to try to demo the feature we built, how we do categorisation taking advantage of generative AI. So bear with me while I get that running.
Right, you should hopefully be seeing that. I'll do an example here. As Nat said, we're using public data, so this is all Twitter / X data on the retail space. I'm going to talk about outdated products.
The flow you previously went through when you trained a theme was to manually go through for about 15 minutes tagging individual bits of data. It would present the things it thought you needed to answer so it could get an impression of what to do. That's firmly from the machine-learning background: all our real-world data, and then twiddling dials. What we're doing now is, instead of you showing it what you want by finding or making up examples, you tell it what you want. So here we're going to go for "out of date products, mouldy".
We're giving it a prompt here. What I've written is not the definition; it's a chart, essentially, to tell Gemini "this is roughly what I want". It's going to take that, understand your data, and turn it into a definition that will consistently work. It will then apply that over all of your data, doing the equivalent of about 24 hours of your training while you did 15 minutes. To move us across that graph, from the risky category into something where you can spot the mistakes, it produces a set of results to help you understand how it's performing.
Initially, it's rewritten the definition into something it thinks makes sense. Your job, as the expert who knows exactly what you want, is to read that and check you agree. It's likely to add information to make it clear what you actually want; that's your opportunity to fix it if it's got the wrong end of the stick. It also shows you how effectively it thinks it can categorise that. In this case it thinks it could do pretty well, and reckons about one and a half percent of your data is in that category. It shows you the data that, from its point of view, is in that category, and this is your opportunity to add exclusions. Most companies have their own version of what each of these categories is. It'll also show you the middle ground, where it goes from thinking "this is what you want" to "this isn't", and that's where you'll see your problems and clarify.
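One simple way to surface that "middle ground" is to rank items by how close their score sits to the decision threshold. The sketch below assumes each item already carries a 0-to-1 score from some classifier; the scores, the 0.5 threshold and the 0.15 band width are invented for illustration, and the session does not say how the platform selects these examples.

```python
def estimated_share(scored, threshold=0.5):
    """Fraction of items the classifier would put in the theme."""
    return sum(score >= threshold for _, score in scored) / len(scored)


def borderline(scored, threshold=0.5, width=0.15):
    """Items whose score sits near the threshold, closest first."""
    near = [(text, score) for text, score in scored
            if abs(score - threshold) <= width]
    return sorted(near, key=lambda item: abs(item[1] - threshold))


# Hypothetical classifier scores for an "out of date products" theme.
scored = [
    ("bread was mouldy", 0.97),
    ("yoghurt past its date", 0.58),
    ("old-fashioned packaging", 0.46),
    ("great service", 0.03),
]

share = estimated_share(scored)
to_review = borderline(scored)

print(f"estimated share in theme: {share:.0%}")
for text, score in to_review:
    print(f"review: {text} ({score})")
```

Reviewing the borderline items first concentrates the expert's time on the cases most likely to expose a vague or mistaken definition.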
And that's basically it. This is how you train a theme. You can go and correct it, or just manually write the definition however you want if the AI isn't playing that day. But what this allows is very rapidly trained themes of a much higher quality. The themes it's training are much better than the ones currently on the platform; there's a major upgrade in performance. And we're rolling out a much faster pipeline, so even on a larger dataset we're expecting tens of minutes to apply a theme. So we're really excited to see what people do with this. That's it for me. Back to you, Nat.
Natalie Grant: Amazing, thanks so much for sharing that, Hugh. That is pretty much the first peek any of our customers have seen of that new feature. It's going to be rolled out next month, and we're really looking forward to your feedback. Just to recap some key points, what really came through to me was that idea of using the right AI for the right job. A lot of the hype over the last couple of years is really about generative AI, and that's why using the assessment framework can help you identify what use of generative AI makes sense for your organisation. Damani is now going to talk us through the build journey for that feature, to give you a realistic look at what that involves. Over to you.
Damani Richards: Thanks, Nat. So, as Nat said, I'm going to talk a little about our experience of building a tool like this, to answer that question of "can we build it ourselves?". I want to tell you what's involved. I've often been the person advocating to build things, because I like building cool things. But sometimes the fun part isn't the whole journey, so I really want to talk about our full journey with this, and give you an idea of the amount of work that's gone into building something like this.
As with everything, it starts with an idea. We keep up to date with advancements in technology, and obviously with ChatGPT, Gemini and other large language models, they've really shot up in terms of how much they can do, how accurate they are, and the impressive tasks they're capable of. So we thought, why not try and categorise data using LLMs and the power in them to improve what we already had?
There are some problems with this. Our customers have huge amounts of data, and when you've got a large amount of data going through, these AI models can be quite slow. Even though their APIs are generally fast, if you're sending a lot of data through it slows down, and it becomes expensive and bad for the environment. And when you think about that quadrant Hugh showed, this problem naturally sits in the risky area. There's generally a right answer to a categorisation, but because of the huge amounts of data it's hard to spot problems if you present it in a random way. So we wanted to address those problems with what we built.
Our solution was to distil the power of these huge LLMs down into a really small, lightweight model. The model is a neural network, but we're talking about a very, very small, lightweight model. The benefit is that it reduces the cost, improves the speed massively, and has a better impact on the environment. And we wanted to move out of that risky quadrant into a more comfortable position by designing a tool and a user interface that made it easy for our customers to identify where their definitions might be going wrong or getting confused, and refine that.
Before we could do any of that, we needed to experiment and prove that what we had in our heads could actually be done. The first phase was experimenting and proving it was a viable idea. There were two stages. The first was seeing if we could get one of these large language models to be good enough, better than what we were doing before. We decided on Gemini in the end, but at the start we kept our minds open. We looked at different open-source models, and the cost of hosting those ourselves versus using an API.
There's also the issue that if you ask a model the exact same question with the exact same data in a slightly different way, the results can vary massively. Even asking it in exactly the same way twice, the results might change, because they're designed to be creative. That's not always the best thing. There are other parameters we needed to tweak, like temperature, which tells it how creative we want it to be, and how much data we give it at one time; too much can confuse it, too little makes the process slow. So we really wanted to work out the best way of getting an LLM to do this categorisation for us.
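For readers unfamiliar with temperature, the sketch below shows the generic mechanism used when sampling from a language model: the model's scores are divided by the temperature before being turned into probabilities, so a low value concentrates probability on the top choice and a high value spreads it out. The scores and label names are invented, and this says nothing about how Gemini is implemented internally.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate labels.
labels = ["in theme", "not in theme", "unsure"]
scores = [2.0, 1.0, 0.5]

cool = softmax_with_temperature(scores, temperature=0.2)
warm = softmax_with_temperature(scores, temperature=2.0)

for name, p_cool, p_warm in zip(labels, cool, warm):
    print(f"{name}: T=0.2 -> {p_cool:.3f}, T=2.0 -> {p_warm:.3f}")
```

For a categorisation task, a lower temperature is the usual starting point, because the same input should give the same label on every run.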
When we found a solution we were confident was a lot better than before, we wanted to distil it into a smaller model. In that first phase it was mainly me and Hugh experimenting and discussing our results. But when it got to distilling, we brought in a consultant data scientist called Dean, an expert in deep learning, who helped lead that second step.
To do this we needed to turn our text into numbers, so we used an embeddings model, which takes the sentence and turns it into a list of numbers that represents, mathematically, what that sentence means. That's then passed into one of these really small models. We used the results from a small set of data from the LLM to train these models to basically copy what the LLM was doing, but in a much more efficient and quick way. We tested numerous strategies. The whole process took maybe around four months to get to a point where we were confident this was the right solution.
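The distillation step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the architecture the team built: the "student" here is a single-layer logistic regression written in plain Python, the three-number embeddings are invented, and `teacher_labels` stands in for the labels an LLM would give a small sample for one theme.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_student(embeddings, teacher_labels, epochs=500, lr=0.5):
    """Fit a tiny 'student' model to copy labels produced by a larger 'teacher'."""
    dim = len(embeddings[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(embeddings)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for vec, label in zip(embeddings, teacher_labels):
            pred = sigmoid(sum(w * x for w, x in zip(weights, vec)) + bias)
            err = pred - label  # gradient of the log loss w.r.t. the score
            for i, x in enumerate(vec):
                grad_w[i] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, vec):
    """Return True if the student thinks the text belongs to the theme."""
    return sigmoid(sum(w * x for w, x in zip(weights, vec)) + bias) >= 0.5

# Invented embeddings for four sentences, plus the teacher's verdicts
# (1 = belongs to the theme, 0 = does not).
sample_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]
teacher_labels = [1, 1, 0, 0]

weights, bias = train_student(sample_embeddings, teacher_labels)
print(predict(weights, bias, [0.85, 0.15, 0.05]))
```

Once trained, the student only needs a handful of multiplications per sentence, which is where the speed and cost savings over calling an LLM for every row would come from.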
When we were there, the results were really encouraging. What we'd built was a lot cheaper than just using an LLM and a lot faster: we can apply it to a whole dataset in 10 minutes. The new system was also a lot more accurate than what we had previously. We wanted to make sure we were improving, not bringing out something worse or just similar.
At that point, though, what we had only really existed on my laptop and Hugh's laptop. We wanted to put it into the hands of our customers and let other people within Wordnerds play with it. So we needed to build it as an actual production tool. That started with prototyping and testing a new user interface, thinking about the best way to present the data so people creating themes can really see what's going wrong, what's going right, and make tweaks in a clever way, moving it out of that risky quadrant into a more comfortable place.
We needed to build and deploy the necessary cloud infrastructure, because we don't want a tool with a long installation process that only technical people can handle. We want anyone who can use a web browser to use it. And we needed to ensure it was secure. When you're handling text data, there's a lot of personal information that might sit inside it. You don't want your data going out to people you don't want to see it. So we needed security up to standard, following the regulations we follow within Wordnerds. We also needed to integrate it cleverly with our existing platform and code base, so we weren't breaking anything or causing major disruption. That took maybe another two months, to get to where we are now, where we've deployed an internal test version. We've got a full customer rollout coming in November.
So the whole process took about six months. And there's a cost associated with that. The first phase, about four months, took a lot of my and Hugh's time experimenting, and we had to bring in our consultant data scientist. But when we moved on to building the production tool, that required skills that weren't necessarily within my or Hugh's role. It required people with user-interface and front-end development skills, people who are experts in infrastructure to ensure it scaled correctly so it doesn't fall down, security specialists to ensure that anywhere we hand over data there's no potential of it leaking or of models taking it to train themselves, and product managers to guide the vision and make sure the product was actually usable. That's a whole team of people, and a significant cost.
So while the dream of building something is great, and I love building things like this, the ideation phase is only really a small bit of it. We could have got to a rough version on my laptop that worked for one or two themes in a couple of weeks. But to have something that worked for loads of different data types, reliably, that we were confident was doing exactly what we needed in the variety of situations it's used in, took a lot more. It took about four months of experimentation to be happy with the model, and then the development of the tool, the user interface, the security and compliance on top of that. That bit was 80 percent of the work; the idea is quick to get to, getting it to a finished tool was the hard bit.
And once we've built this, we're not going to dust off our hands and say "tool's done". There's a continuous commitment. When you use it, you'll have questions. Inevitably bugs will arise. There'll be updates to the technologies underneath it, and advancements we want to stay ahead of. Troubleshooting and all those things are an additional job that continues long after we've built the tool, and it takes a bigger team than just me and Hugh to maintain. So the beginning part is really fun, that's the bit of my job I love, but there's a lot of hard work that comes after. The dream and the reality are two slightly different things. Without the team supporting us and the resources allocated, it wouldn't have been manageable for a team of two. So that's the journey we went on to build what Hugh just presented. I hope that gives you context about what's needed. Back to you, Nat.
Natalie Grant: Amazing, thanks so much, Damani. I know a lot of you have been talking about your teams looking at building tools in-house, so that chapter will be a great one to forward on so they can see Damani's real look at the process. Pulling out a few things that stood out: that question around whether non-technical people can use the product at the end is so important, because the power of a tool like that is getting it into the hands of the people doing the analysis. And that question around whether your team can maintain it, not just security checks but keeping on top of innovations. So thank you again, Damani.
Our final conversation is around AI insights being in other tools. Talk us through that, Steph, our Head of Product. Over to you.
Stephanie Clish: Thanks, Nat. Right, let's dive in. So, with AI developments being so accessible now, the tendency across products is to bolt it on and let users apply it to whatever they want. But this causes lots of confusion around whether it's appropriate for text analytics in Voice of Customer. For example, we're seeing it added to lots of data-gathering tools, your surveys, your feedback platforms. These tools are brilliant at what they're built for, which is gathering the data. But the AI has been added to do a slightly different function, namely analysing that data. And as users, we don't know how robust these AI features actually are. Have they just tacked on an LLM to chat with you, with no real guardrails or methodology? Or is it genuinely well thought out? If we think back to Hugh's matrix, have they built it so it sits in that top-right of the quadrant, or is it landing in those more risky areas?
When you're evaluating whether to use these built-in features, helpful questions to ask yourself are: how do I know it's looked at all of my data and not just a sample? How has it decided to categorise my feedback, and can I influence that? Can I track specific conversations I care about, or am I stuck with whatever it decides to show me? Am I able to get to the root cause of problems, or am I seeing surface-level insights? Because if we go all the way back to the goals Nat mentioned at the beginning, depth, flexibility, accuracy and speed, with these add-on features, understanding what's happening under the hood is super important. You might ask it a question and it'll give you an answer, but you don't know how much of your data has been considered, how it arrived at the answer, and whether it would adjust based on more business context. It's this lack of transparency and control that tends to be where you have to make compromises.
As well as those considerations, you're still in a situation where you're using different AI across different tools to analyse your data. By using each tool to analyse the data it brings in, you've got different AI models using different analysis methodologies, and this causes conflicting insights from different sources. The result is no single source of truth, with departments each proposing different priorities for what needs fixing. There are obviously blind spots, because you're not bringing all that data together to see everything as a whole. And naturally teams are arguing over whose data's right.
So if you can relate to any of this, it's likely a specialist tool might be helpful for you. With specialist tools for your text analytics, you're able to hit those goals without compromising on any of them. You can bring all of your data together and analyse it consistently with one methodology. This gives you the ability to do proper prioritisation and gather the evidence you need to take to the board. You're able to get depth and scale within your programme to find real insights that help you change the customer experience for the better. And you also benefit from the continuous innovation that happens within these specialist tools.
So Damani's just walked you through our example of definition-led themes. That's come from us applying new technology to this problem. As a product team we're always looking to develop the platform, and this involves leveraging new tech. The real plus point is you get the benefit of that without any of the maintenance burdens Damani mentioned, the infrastructure, the security, all those overheads. With specialist tools, flexibility always tends to be a consideration; it's certainly one of the core pillars we focus on. So you also get the benefit of being able to mould it to your particular use case, but without any of the commitments that come with building it yourself. Hopefully that shows you the difference and the benefits you can get from intentional AI adoption in your VoC programme. I'll hand back over to you, Nat.
Natalie Grant: Fab, thank you so much, Steph. I know that conversation in particular is super noisy at the moment, as more and more tools add on AI insights, and it's really hard to understand what it all means. So now we're going to go back to where we started, look at those key learnings and what that means for you. We're also going to have that live Q&A, so please keep your questions coming in the chat.
If we go back to the start, these are the conversations we're looking to help you navigate and lead in your organisation after today: "let's just put it in Copilot", "let's build it ourselves", "we have AI insights in other tools", and one we haven't got onto yet, people who think it's all hype and want to wait and see.
So where are we now? I think the key takeaways are using the right AI for the right job, and bearing in mind that most of the hype you're hearing is about generative AI specifically. So using that assessment framework for adopting generative AI can be super helpful. An action you might take away today is to audit how you're using generative AI now and ask: does it pass that test? When it comes to building, build with AI when it makes sense. Consider: can non-technical people actually use what you're building, can your team maintain it, and can you keep on top of innovations? What Hugh and Damani walked through is our latest version of themes; our last version was out in the last 18 months, and I'm sure there'll be something else in the next 18 months. It's a constant job to keep on top of the latest technology. And finally, how is the AI in your tech stack really working for you? Are the right tools and right AI doing the right job, is it joined up, and ultimately can you make confident decisions?
Final thought on "it's all hype, let's wait and see". There is a lot of hype, but don't let that stop you. AI advancements present amazing opportunities and you don't want to miss out, but it is moving fast. Being intentional is the key to getting value, and that's what we're hoping we've helped you do today.
Do keep those questions coming in the chat; we're going to move over to them shortly. Hugh, Damani and Steph will all be available to take your questions. And just a reminder that we'll be sending out the recording today straight after the session, and later in the week you'll get a full resource kit including that assessment framework and all of these sections broken down into chapters. We're also going to be rolling out the definition-led themes next month and we'll be in touch about that. So over to our Q&A.
Going first to a question from Theo: "The ease and effort to train looks great, it's similar to what we currently use. In terms of time, can I assume this will also take a lot less time to train, or is it just about the accuracy?" Hopefully it's both. Who wants to take that?
Stephanie Clish: Should I take it? Hi Theo. So, this is definitely better on both counts, both the speed and the accuracy. That's certainly been the case during our internal testing. We do have a bit more internal testing to go, where we're hoping to increase the speed of training it even more. The whole point of the manual tagging process Hugh described, which I know you're familiar with, is that we're removing that completely. So it's really just about refining that definition. As soon as you get to a good definition where you can see your feedback is being categorised correctly, you can enable it, and you also get the speed of it being applied to your data. Hopefully that answered your question.
Natalie Grant: Thanks so much, Steph. Okay, next we've got a question from Anthony, this looks like a meaty one, I reckon this is one for you, Damani: "This is super cool and really interesting, thank you. You mentioned the outcome of your exploration and testing was that a hybrid approach performed best. Did this involve using Gemini to train your custom-built model, or training an open-source SLM, for example?"
Damani Richards: Yeah, I can answer that. So our final approach was a combination. We use Gemini to help us tag a small amount of data, and then we distil that performance into a really small neural network, one that we custom-built. We experimented with a few different architectures and structures, and different ways of distilling it into a much smaller neural network. But it's a very small neural network that we're able to use, which is why we get the massive improvement in speed and cost. And we don't really need to compromise on accuracy, which is one of the positive things we found: we could get comparable results to just using an LLM on its own, with a much smaller, more lightweight neural network. I wouldn't even call it a language model; it's a very, very small neural network.
Hugh Volpe: I'll add one thing. The reason we didn't end up going down the small-language-model approach is that our benchmarking suggested we were never going to apply a theme at the kind of speed we want. From our point of view, the way this should work is somebody figures out what they want to measure, makes it, sets it off, and knows the answer in about 10 minutes. It shouldn't be a thing where we come back tomorrow. We couldn't make a small language model work that fast. So we have a very small network making the predictions, but it bases that on pre-generated embeddings of all the text. We've done the heavy lifting as the data arrives in our platform, because data normally arrives as a weekly upload, or trickles in through an API. We do the heavy lifting at the start with a sizeable sentence-embedding model, so that by the time you want to apply a theme, we already have the meaty bit done, and we can rapidly apply small models on top. That's what gives us the power to do it so quickly.
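The "heavy lifting at the start" pattern Hugh describes can be sketched like this. It is an illustration only: `EmbeddingStore`, `toy_embed` and the two hand-picked theme weights are hypothetical names and values, and the keyword-counting function merely stands in for a real sentence-embedding model.

```python
class EmbeddingStore:
    """Caches one embedding per document so the slow step runs once, at ingestion."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache = {}
        self.embed_calls = 0  # counts how often the expensive step ran

    def ingest(self, doc_id, text):
        """Embed a document when it arrives; skip it if already stored."""
        if doc_id not in self._cache:
            self._cache[doc_id] = self._embed_fn(text)
            self.embed_calls += 1

    def apply_theme(self, weights, bias):
        """Score every cached embedding with a small linear head; no re-embedding."""
        return {
            doc_id: sum(w * x for w, x in zip(weights, vec)) + bias > 0
            for doc_id, vec in self._cache.items()
        }

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: counts two keywords."""
    words = text.lower().split()
    return [float(words.count("late")), float(words.count("refund"))]

store = EmbeddingStore(toy_embed)
store.ingest("a", "my parcel was late again")
store.ingest("b", "please refund my order")

# Two different themes reuse the same cached embeddings.
late_theme = store.apply_theme([1.0, 0.0], -0.5)
refund_theme = store.apply_theme([0.0, 1.0], -0.5)
print(late_theme, refund_theme, store.embed_calls)
```

However many themes are applied afterwards, the embedding function runs only once per document, which is the property that makes applying a new theme fast.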
Natalie Grant: Perfect, thanks so much Hugh and Damani for explaining that. Next, I've got a couple of questions about the definition-led themes. "Can't wait to try." You and me both, Isla. "Can I keep my old themes, and if not, what will that migration process look like?" Damani, is that one for you as well?
Damani Richards: I can take that one. The goal is that eventually we want all of our themes to move over to these new definition-led themes. They are better; we've done a lot of testing to validate that. We've looked at what happens if we move our older themes across, and we've come up with a migration process. We take the data you'd already tagged, and we use generative AI to develop a definition based on what you've tagged already, so you won't need to go and rewrite all these definitions if you've got loads of themes. Then it generates the definition and does the regular process that happens for new definition-led themes: we tag a small sample of data, distil that into one of these lightweight models, and it works in exactly the same way. And what we should see, for the vast majority of cases, is an improvement in the accuracy of those themes as well.
Natalie Grant: Perfect, thanks so much, Damani, that all sounds really good. Okay, our last question, and I reckon this is one for Hugh: "Interesting, can I check, having also seen your new definition-led theme tech, do you use my data to train LLMs?"
Hugh Volpe: Yeah, it's a great question. The short answer is no, we don't. We're not using anyone's data that they brought to us, or the public data you see in your projects, to train the LLMs. It's a great question, because if you're just using ChatGPT through a normal free account, it absolutely will be training that LLM on it. But that's not what we're doing. We're using a properly set up bit of technology within Google to do this for this kind of purpose, where the LLM we're using sees the data, gives us the answer, and then it's gone, and that's all locked down and well documented. So that's a very legitimate concern, but not an issue here.
Natalie Grant: Perfect, thanks so much. And as you said, it's a really legitimate question, actually one we get quite a lot, so thank you for going through that. Well, it doesn't look like there are any more questions, and there's just two minutes left before three, so I'll let you go and get yourself another cup of tea before your next meeting. Thanks again for spending your Tuesday afternoon with us. And remember to keep an eye on your inbox, both for the recording today and the full resource pack later this week. Thank you.
About Wordnerds
Wordnerds is a UK-first Voice of Customer platform that makes customer feedback a strategic asset across the whole organisation. We ingest feedback from surveys, complaints, reviews, calls and social, apply transparent AI to surface themes and drivers, and integrate the output natively into Microsoft Power BI — so operational teams, boards and regulators see what customers are saying without logging into another platform. We specialise in housing associations, transport operators, utilities and regulated UK sectors where auditable evidence is non-negotiable. Beyond the software, our Nerd-assisted service layer co-designs themes, frameworks and integrations so insight actually drives action, not reports that go nowhere.