Can I use Copilot or ChatGPT to analyse my customer feedback?
General-purpose AI like Microsoft Copilot gives fast, confident answers about your customer feedback, then quietly changes them. Watch Wordnerds test Copilot against 12,000 rows of housing data.
TL;DR
Copilot and ChatGPT can summarise customer feedback, but they cannot reliably quantify it. When Wordnerds asked Microsoft Copilot to size the issues in over 12,000 rows of synthetic housing feedback, it returned confident percentages that changed every time it ran — and pointed to spreadsheet rows that did not contain what it claimed.
Pete Daykin, Wordnerds' CEO, explains why: generative AI is built on sentence embeddings, so it is brilliant at spotting patterns in language but not built to count, reason or tell you how many people said a thing — it guesses, and sounds certain doing it. Wordnerds keeps the embeddings but adds the two things Copilot is missing: a human in the loop to decide what counts, and a structured, auditable layer that turns language into numbers you can defend to a board.
That structured layer is what lets Wordnerds report feedback in Microsoft Power BI — drilling from any theme down to the verbatim comment behind it, ranking issues by what actually moves a TSM or satisfaction score, even predicting which complaints will escalate. The honest answer is not "AI can't help"; it is to use the right kind of AI for the right job — Gen AI for spotting language patterns, a specialist structured layer for counting, auditing and prioritising.
Why watch this webinar?
Pete and analyst Stella Dooris test Microsoft Copilot against a 12,000-row housing dataset, so you watch the confident-but-wrong moment happen rather than take our word for it: the percentages that shift overnight, the citations to rows that say something else, the "row zero" that does not exist. Then Stella shows the same questions answered in Power BI, traceable to the verbatim. If your leadership team has ever asked "why don't we just use Copilot?", this is the 57 minutes that answers them.
Duration: 57 minutes.
What this webinar covers
This webinar started with a real question. A prospective customer took a Wordnerds proof of concept to their senior leadership team, who reasonably asked: "we've already got Copilot — why don't we just use that?" Rather than argue, Wordnerds set up a head-to-head and recorded it.
Using a synthetic dataset — a fictitious landlord, Acme Housing, with over 12,000 rows of tenant satisfaction measure (TSM) feedback, generated so no real tenant data appears on a webinar — Pete Daykin feeds the data into Microsoft Copilot and asks it to size issues, score sentiment and recommend actions. It looks impressive, then falls apart under questioning: the same prompt returns different numbers on different days, and the rows it cites do not back up its claims.
The second half is the alternative. Pete explains the sentence-embedding science underneath generative AI in plain terms, then shows how Wordnerds adds a human and a structured layer on top. Analyst Stella Dooris takes a live audience-chosen topic — repairs — through both Copilot and Power BI so you can compare what each actually returns. It closes with a long, practical Q&A.
Sarah Wilson | Account Manager | Wordnerds
Sarah is an account manager at Wordnerds and chaired this session, framing the central question and running the polls and live Q&A.
Pete Daykin | CEO | Wordnerds
Pete is Wordnerds' CEO. He led the live Copilot challenge and walked the audience through the sentence-embedding science that explains why general-purpose AI struggles with deterministic feedback analysis.
Stella Dooris | Insights Analyst | Wordnerds
Stella works in the Wordnerds analyst team. She ran the live head-to-head, putting the audience-chosen topic through both Copilot and a Wordnerds Power BI dashboard.
Can Copilot or ChatGPT analyse customer feedback accurately?
Copilot and ChatGPT can produce a fast, plausible summary of customer feedback, but they are unreliable at the numbers a board acts on. In this webinar, Microsoft Copilot was asked the same question about Acme Housing's security issues on two different days. On day one it reported faulty locks at 15.3 per cent, broken security doors at 18.7 per cent and concerns about safety at 22.1 per cent. On day two, same prompt, same data, it returned 8.5 per cent, 6.2 per cent and 10.3 per cent.
It got worse on detail. Asked to list the spreadsheet rows behind a claim, Copilot cited rows that turned out to be about energy efficiency, not accessibility, and at one point offered "row zero" — which does not exist. Challenged on a claim that 20 per cent of residents mentioned accessibility (around 2,500 of 12,288 rows), it re-evaluated down to 2.6 per cent. That is not a rounding error; it is the difference between a real problem and a non-problem.
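The two checks described above, turning a claimed percentage into a row count and then reading the cited rows, can be sketched in a few lines of Python. This is a minimal illustration, not Wordnerds' tooling: the function names, the keyword list and the three sample rows are hypothetical, and keyword matching is only a crude stand-in for a trained classifier. The figures 12,288, 20 per cent and 2.6 per cent are the ones quoted in the webinar.

```python
def implied_count(total_rows: int, percent: float) -> int:
    """Convert a claimed percentage into the number of rows it implies."""
    return round(total_rows * percent / 100)


def audit_cited_rows(rows: dict[int, str], cited: list[int], keywords: list[str]) -> dict[int, str]:
    """Label each cited row as 'missing', 'supported' or 'unsupported'."""
    verdicts = {}
    for number in cited:
        text = rows.get(number)
        if text is None:
            # The row number does not exist in the data at all (e.g. "row zero").
            verdicts[number] = "missing"
        elif any(keyword in text.lower() for keyword in keywords):
            verdicts[number] = "supported"
        else:
            verdicts[number] = "unsupported"
    return verdicts


# Hypothetical sample rows, numbered from 1 as in a spreadsheet.
sample_rows = {
    1: "The ramp to my flat is too steep for my wheelchair.",
    2: "New windows and insulation have cut my energy bills.",
    3: "The lift has been broken for weeks and I cannot manage the stairs.",
}
# Hypothetical keywords standing in for "accessibility".
accessibility_terms = ["wheelchair", "ramp", "lift", "stairs"]

print(implied_count(12_288, 20))   # rows implied by the 20 per cent claim
print(implied_count(12_288, 2.6))  # rows implied by the revised 2.6 per cent
print(audit_cited_rows(sample_rows, [0, 1, 2], accessibility_terms))
```

Run twice on the same data, this gives the same answer both times, which is the property the Copilot output lacked.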
Why is generative AI unreliable at counting themes in feedback?
Generative AI is unreliable at counting because it is not doing arithmetic — it is predicting plausible language. As Pete explains, tools like Copilot are built on sentence embeddings inside a large language model: every word and pattern of words is mapped by how closely it associates with others, so the model is superb at suggesting what plausibly comes next. That is why it writes good marketing copy or generates a recognisable Midjourney kitten.
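The idea of words sitting on a map by association can be shown with a toy example. The sketch below assumes made-up two-dimensional coordinates for four words; real embeddings are learned from large amounts of text and have far more dimensions, so treat the numbers and the `nearest` helper as hypothetical.

```python
import math

# Hypothetical 2-D coordinates, invented for illustration only.
toy_map = {
    "cat": (0.90, 0.80),
    "kitten": (0.85, 0.90),
    "dog": (0.70, 0.60),
    "house": (0.10, 0.20),
}


def nearest(word: str, vocabulary: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Rank every other word by straight-line distance from `word`, closest first."""
    origin = vocabulary[word]
    distances = [
        (other, math.dist(origin, point))
        for other, point in vocabulary.items()
        if other != word
    ]
    return sorted(distances, key=lambda pair: pair[1])


for other, distance in nearest("cat", toy_map):
    print(f"{other}: {distance:.2f}")
```

Nothing in this ranking counts anything. Proximity tells you which words tend to belong together, not how many rows of feedback mention them.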
Wordnerds borrows the analyst Benedict Evans' framing: Gen AI is strong at tasks with no single right answer and weak at deterministic tasks where an answer is either right or wrong. Counting how many tenants raised damp and mould is deterministic, so the model approximates — and sounds confident regardless. Two things are missing: a human to decide what actually counts as an issue, and a structured layer that turns language into numbers you can audit. Without them, you get a confident summary you cannot defend to a board.
How does Wordnerds analyse customer feedback differently?
Wordnerds starts with the same sentence embeddings, then adds the human and the structure that general-purpose AI leaves out. An analyst gives the model an example of what they are looking for (say, repairs) and the embeddings surface contextually similar comments even when they share no words ("they didn't turn up" and "I waited in all morning and it was a no-show"). The human confirms what belongs, and in about 10 to 15 minutes trains a classification model that is roughly 85 to 95 per cent accurate. Automated F1 score checks report figures like 89 or 92 per cent, so the tolerance is always visible.
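For readers unfamiliar with the F1 figure mentioned above: it is the harmonic mean of precision (how many of the comments the model flagged really belong) and recall (how many of the comments that belong the model found). The webinar does not say how Wordnerds computes its checks, so the sketch below is only the standard textbook formula applied to ten hypothetical comments where a human and a model each decide whether the comment is about repairs.

```python
def f1_score(human_labels: list[bool], model_labels: list[bool]) -> float:
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(human_labels, model_labels))
    true_pos = sum(1 for human, model in pairs if human and model)
    false_pos = sum(1 for human, model in pairs if not human and model)
    false_neg = sum(1 for human, model in pairs if human and not model)
    if true_pos == 0:
        # No correct positive predictions: precision and recall are both zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical verdicts on ten comments: is this comment about repairs?
human = [True, True, True, True, True, False, False, False, False, False]
model = [True, True, True, True, False, True, False, False, False, False]

print(f"F1: {f1_score(human, model):.2f}")
```

Here the model misses one repairs comment and wrongly flags one other, so precision and recall are both 4 out of 5.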
Around 130 housing classifications come out of the box, arranged in a hierarchy. You can see that 878 comments mention doors, or that 843 specifically describe a long time to repair — then drill all the way down to the verbatim comment driving each number, the step Copilot could not complete.
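What makes that drill-down possible is that every number stays attached to the rows behind it. A minimal sketch of the idea follows, assuming each comment has already been classified into a theme path. The data structure, the function names and the four sample comments are hypothetical and are not taken from the Wordnerds platform.

```python
from collections import defaultdict

# Hypothetical classified comments: (row number, theme path, verbatim).
classified = [
    (101, ("Safety", "Building security", "Doors"), "The communal door has not locked for a month."),
    (102, ("Safety", "Building security", "Doors"), "Front door entry system is broken again."),
    (103, ("Safety", "Damp and mould"), "Black mould is spreading in the bathroom."),
    (104, ("Maintenance and repairs", "Long time to repair"), "Still waiting three weeks for the boiler to be fixed."),
]


def build_index(comments):
    """Index row numbers under every level of their theme path."""
    index = defaultdict(list)
    for row, path, _ in comments:
        for depth in range(1, len(path) + 1):
            index[path[:depth]].append(row)
    return index


def drill_down(comments, index, path):
    """Return the (row, verbatim) pairs behind the count for one theme path."""
    rows = set(index.get(tuple(path), []))
    return [(row, text) for row, _, text in comments if row in rows]


index = build_index(classified)
print(len(index[("Safety",)]))                      # count at the top level
print(len(index[("Safety", "Building security")]))  # count one level down
for row, text in drill_down(classified, index, ("Safety", "Building security", "Doors")):
    print(row, text)
```

Because each count is just the length of a list of row numbers, anyone can ask to see the rows and read the comments for themselves.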
What can you do with feedback analysis that Copilot can't?
Once feedback is in a structured, auditable layer, Wordnerds reports it in Microsoft Power BI — and that unlocks work generative AI cannot do. Volume and sentiment over time are a given. More usefully, correlation analysis produces a priority order: of everything tenants raise, which themes statistically drag a TSM or satisfaction score down most, so you know what to fix first rather than facing a flat list. Some customers go further into prediction — Guinness reached about 86 per cent accuracy forecasting whether a complaint would escalate to the Housing Ombudsman, within a couple of weeks of starting.
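The priority order can be illustrated with a small correlation calculation. The webinar does not specify which statistic is used, so this sketch assumes a plain Pearson correlation between monthly theme mentions and a monthly satisfaction score. All the figures are invented, and `pearson` and `priority_order` are hypothetical names.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical six months of data: satisfaction score and mentions per theme.
satisfaction = [72, 70, 65, 68, 60, 58]
mentions = {
    "long time to repair": [40, 45, 60, 50, 75, 80],
    "contractor behaviour": [30, 28, 35, 31, 33, 36],
    "grounds maintenance": [20, 25, 18, 22, 21, 19],
}


def priority_order(theme_mentions: dict[str, list[float]], score: list[float]) -> list[tuple[str, float]]:
    """Rank themes so the one most strongly tied to a falling score comes first."""
    ranked = [(theme, pearson(counts, score)) for theme, counts in theme_mentions.items()]
    return sorted(ranked, key=lambda pair: pair[1])  # most negative correlation first


for theme, r in priority_order(mentions, satisfaction):
    print(f"{theme}: {r:+.2f}")
```

A strongly negative value means the score tends to fall in months when the theme is mentioned more. Correlation on its own does not prove the theme causes the drop, so the ranking is a starting point for investigation rather than a verdict.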
The other thing you control is what counts. Pete's B&Q example makes it concrete: when a retailer that sells toilets analyses customers who "couldn't find the toilet", the difference between not finding the facilities and not finding the product is subtle and decisive. Deciding that boundary is a human judgement Copilot cannot make for you.
Full Webinar Transcript
Sarah Wilson: So thank you so much for joining us everybody. This is the Gen AI for Voice of the Customer webinar. You've got three varieties of nerds today. So that's me. I'm Sarah Wilson. I'm an account manager at Wordnerds. I don't always just hang around holding on to chairs in case you're wondering — that was a professional photo shoot. The other guys did me dirty and did a more casual one, but yeah, that's me.
We've also got Pete. Most of you have probably seen Pete before — he gets around. He's our CEO. And as you put here, he's definitely not Helen. She's our head of account management, on maternity leave and coming back in May. I'm definitely counting down the days, I'm not going to lie. I'm very excited to have her back. And then we've also got Stella. Stella's new to the Wordnerds team. She's been with us for a couple of months now, and we're all absolutely amazed at how quick she is and how she's just picked it all up so quickly.
She works in the analyst team now. We thought, to welcome her to her first webinar, we'd give her a live demonstration, which would make anybody nervous. So yeah, you're welcome. Now, what do we do? We've got lots of familiar faces on the call, and familiar names at least. But for people joining us who perhaps haven't heard of Wordnerds before, we are a customer feedback analytics platform. We work across four main spaces.
Really big in the social housing space, also financial services, one of our original use cases with retail, and we've recently made forays into travel and hospitality. So Hotel Planet, one of our newest clients, I think they're on the call as well, so hey. You can see the kind of breadth of clients we've got — some really cool clients and some amazing use cases from very data-obsessed people. And that leads me nicely on to the question of today's webinar.
So, can I use Copilot or ChatGPT to analyse my customer feedback? As you can probably imagine, now that I've just said we're a customer feedback analytics platform, it's a slightly leading question. I'm never in a million years going to come out and say yes, you can use ChatGPT for your customer feedback — not least because Pete's threatened he'll sack me if I say things like that in public. All jokes aside, we're really hoping you can leave today's webinar a little less ignorant than you came.
I don't know about you, but I'm going to be honest: Gen AI has definitely intimidated me — working out what it can do, what it can't do, what I should be using it for and what I shouldn't. It's just a bit of a minefield, and I really struggled to understand its capabilities. I definitely don't want to boost Pete's ego any more than it's already been boosted, but I'm really hoping you leave today's session feeling a lot more confident about the solution, what it's good for and what it's not good for.
And whether or not you see the value in the Wordnerds platform, we just want to make sure you understand the different choices out there, what they're good for, and really what's best for your business. Now, the question is: are we faithful, or are we a traitor? We absolutely love this show. I'm sure we've got some fans out there too, so we really want to find out.
Do you agree? Do you see the value in it? Do you see some of the problems we're going to pose? You can let us know at the end if you think we're faithful or a traitor. For an agenda, we've got about 40 minutes planned and then plenty of time for questions at the end. We always say this works so much better as a dialogue and not a monologue, especially with Pete's voice — it can get a little bit repetitive. So definitely jump in, ask questions in the chat, make it interactive, engage with our polls.
We're going to start with a challenge. Pete's going to outline why this is so difficult, which platform you should use, how you set about doing this in the first place, and the dataset we're going to be using. We are using synthesised housing data for this webinar today. Ironically, it's been made by Claude — we gave Claude some prompts and it pulled back some synthesised data.
And Pete's going to walk you through that dataset using the Gen AI approach and unpick a little bit more how it works, what it's good for and what it's not so good for. Then we're going to bring in the Wordnerds approach — what we do, how we take the best bits of AI and turn them into analysis. And then we're going to have a live head-to-head. We'll bring Stella back and find out what Wordnerds found and how that compares to the Gen AI findings.
For that, we ask that you have a think about potential topics you might look at in a housing dataset. We'll come to a point and ask you to pop some ideas in the chat, and we'll pick one live. From there we'll go into a Q&A, so start thinking about your questions. We'll be giving out a care package as well. And if you get to the end of this and think you want to find out more, you can have a chat with us directly.
So, we've got a poll popping up now, hopefully. We'd love to start with a baseline of how you're feeling about Gen AI — have you used it before, how do you think this will work out? Give your answer there, and we'll take another poll at the end and see if your perceptions and expectations have changed. And guys, be honest — there are no wrong answers here. We're all very new to this approach. I'm going to hand over to Pete.
Pete Daykin: Thank you, Saz, for that only mildly abusive introduction. I would never give you a P45 — who would I get to do all of my work? But I'm very much looking forward to my one-to-one this afternoon after that. Hello everybody, thanks so much for turning up. There's loads of you here today. We are always staggered at how and why so many of you choose to hang out with us on a Thursday morning at these things.
We know there are lots of different calls on your time and we don't take it lightly, so thank you so much for turning up. As Saz says, we thought we'd make this fun by turning it into a live challenge. The challenge is: can I use Copilot or ChatGPT to analyse my customer feedback? We could easily have said Gemini or Claude or any of the other Gen AI platforms as well.
So, in terms of unpacking the challenge, the first thing we need to do is look at which one of these. We've selected Copilot as the one we're going to use today, for one simple reason. This whole webinar came from an actual question a potential customer asked us. We had a client signing up for a proof of concept; they went to their senior leadership team to ask for some budget, and the SLT, not unreasonably, said: well, we've already got Copilot, why don't we just get Copilot to do it?
And they came back and said, why don't we just get Copilot to do it? So we said, that's a great question. We thought we'd do something like this as a means to explore it. Most people we deal with are on the Microsoft stack; most have the easiest access to Copilot, and it tends to be where most people start. So right or wrong, that's what we've chosen. If you feel indignant with rage that we've picked the wrong one, feel free to let us know, and we can rerun this session using another Gen AI platform of choice.
On the dataset: Saz mentioned it's a synthetic dataset. Clearly, GDPR, we can't be showing people's real data on a webinar, so we've made some up. We've created a fictitious housing association we've called Acme Housing. I know, we're not imagination nerds, we are Wordnerds. It's not the most imaginative name in the world, but it'll do. Acme Housing has 12,288 rows of TSM survey data. If you're not in the housing industry, TSM stands for tenant satisfaction measures. There's a whole regulatory push about making landlords really understand the feedback of their residents, and it's basically a survey asking residents what they think of their landlord.
Although it's synthetic, it's based on real datasets. We take issues that are in the real datasets we see, and we got Gen AI to create a certain number of bits of feedback on each of the key issues. It's not perfect — it has all of the problems associated with Gen AI. When we first did it, the stuff that came out was very formal: "dear housing association, I wish to protest about the way I've been treated." So we said, can you be a bit more relaxed and laid back, please? And it went, "yo man, my boiler repair was whack." So there's all kinds of weird things in between formal and too informal. If you see any weird verbatim, that's all that is.
What do we mean by analyse customer feedback? Generally it means two things. The two questions insights analysts in most large organisations get are: what are our customers telling us generally — can we put numbers on things, can we see what our big issues are, what's the sentiment? And then the wonderful question people usually ask at 4:35 on a Friday afternoon: what does our data tell us about X? Where you then have to go through all of your data and come up with a useful answer. And being a Wordnerds webinar, we thought we'd get you to decide what topic you'd like us to look into for this live head-to-head. So jump on the chat and put in the kind of question somebody might ask you — what does our data tell us about? Pick any subject you want.
There's loads of stuff coming through: repairs, tenant engagement, ASB, damp and mould. Complaints and damp are a very big part of TSM. Stella, go on, I'm going to be generous. What would you like?
Stella Dooris: I'm not known for my decision making. Repairs seems to be the most talked about, so we'll do repairs.
Pete Daykin: Yeah, okay. Everybody happy with that? If anybody's really unhappy, please put it in the chat and we will probably ignore you. Right, Stella, you go and have a look at this stuff. She'll be tapping away in the background, and we'll bring Stella back in in about 10 or 15 minutes.
When we initially conceived this webinar, we thought we'd do it completely live, put some stuff in Copilot and play it back as we went. We found very quickly though that it has a bit of a lag — it can take 30 seconds or so to come back with stuff, and on a webinar that's going to be really boring. So instead we picked an issue, went in and did some bits and pieces, and we're going to take you through what we did, then explain why that happens, and then show you the alternative: a specialist tool like Wordnerds.
Our opening gambit: we've got these 12,000 lines of data, and we put in a prompt. We said, hello Copilot, I hope you're having a good day. We are always very nice to Gen AI — when the robots take over the world, we do not want to be first against the wall come the revolution. So we're extremely polite. I've got a dataset, it's for UK social housing landlord Acme Housing, please can you analyse it for me. There are three columns: the published date, the content, which is the verbatim, and where it's come from. And the key things we asked it to pull out are the size of each issue, the sentiment of the key themes, and crucially how we can improve what we actually do with this data.
And this is what came back. The first thing was the key themes — it picked out maintenance and repairs, communication and customer service, security and safety, community and environment, and a couple of others. For each it went into some of the sub-issues. We thought, that's a pretty good summary of key themes, that's useful to know at a top level. Then we asked for size and sentiment, and it came back with maintenance and repairs as the most frequently mentioned issue — but the sizing was too general, the sentiment too vague, no numbers on it. That's fine, that's our fault at this stage because we haven't asked it properly.
We also asked for recommendations, and it said "improve maintenance response — ensure repair requests are addressed promptly and appointments are kept." Now, I'm sure no housing association has ever thought about making sure repair requests are addressed promptly. So clearly that's extremely general advice — self-evidently it's not going to help. We also didn't like the thing at the bottom that said "AI-generated content may be incorrect." We would love a little disclaimer at the bottom of Wordnerds reports saying Wordnerds reports may be incorrect, and we'll see how happy our customers are about that.
So we went back and said, this is awesome, you are a legend — we like to make Gen AI feel good about itself. However, please provide some figures to back it up. For the issue of security, what percentage of respondents mentioned each of the common issues — faulty locks, broken security doors, concerns about safety — and are they increasing or decreasing over time? And what came back was really good: faulty locks 15.3 per cent, broken security doors 18.7 per cent, concerns about safety 22.1 per cent. This is exactly the kind of stuff we need. And over time it said mentions had been relatively stable with a slight increase recently.
At this point we're thinking, should we just pack up, go home and get a different job? We asked for a visual representation, and it said, here's a visual representation — tried to load a graph, and couldn't. We found that hit and miss; sometimes it does, sometimes it doesn't. So we tried again, and it said, something went wrong, please try again later — which I'm now adopting as a mantra in the Wordnerds office.
So we had a nice night, came back the next day, and tried it all again, exactly the same prompt. Except this time, when it did the summary, we got back: faulty locks 8.5 per cent, broken security door 6.2 per cent, concerns about safety 10.3 per cent. And we were like, hang on a minute — yesterday it said 15, 18 and 22. There's a disparity. It basically appears to have made these numbers up.
So we said, you mentioned 15.3 per cent of residents mentioned faulty locks, that seems high — please list the rows in the spreadsheet where residents mention faulty locks. And it said, sorry, looks like there's an error in my analysis, I'll re-evaluate. So we waited, and we waited some more, and nothing happened. We said, hi, you've gone really quiet — if you're busy we can pop back later, hate to interrupt you if you're hanging out with the other Gen AIs at the pub. It apologised again, then went quiet again, and we ended up in a death spiral of it apologising and promising to do stuff and never doing it.
Which was weird, because the next day, with a different topic, it was able to give us the rows where it thought we were talking about something — in that case accessibility. It came back and said, here are all the row numbers where accessibility is mentioned. Being the tedious data nerds we are, we checked: row nine, yep, definitely accessibility. Row 33 was not — it was about energy efficiency in the flat, new windows and insulation, nothing resembling accessibility. We asked it to check and it agreed it did not mention accessibility, so it said, let me go and recheck the dataset and provide the correct row numbers, and it came back with something completely different.
We also went back and said, you said 20 per cent mentioned accessibility. There are 12,288 lines, so 20 per cent would be about 2,458, is that correct? It said yes. And then it re-evaluated its initial 20 per cent down to 2.6 per cent, which is a massive delta. This is not slightly wrong, this is completely wrong. So what do we surmise about how well Copilot answered our question? Firstly, it's really quick. Secondly, it's really confident. It gives you plausible summaries, but it doesn't like doing the hard analysis, and it's prone to talking total bollocks. Our conclusion is basically that Copilot probably went to private school.
But why does it do that? There is a reason, and hilariously you can ask Copilot itself why it isn't good at this kind of thing, and it'll tell you. To save you the bother, we'll run through it. It does get a little technical, but not too technical. All of this stuff, by the way — like all writing on AI — we steal off more intelligent people than ourselves. There's a guy online called Benedict Evans, Ben Evans; I recommend his blogs. He has a contention that Gen AI is really good at tasks where there's no wrong answer, but really bad at deterministic tasks where there is either a right or a wrong answer.
So he says, look, there's Midjourney — the image generation models. The early ones at the top are pretty scary kittens, they verge into gremlin territory, but they are objectively kittens on a keyboard. The ones at the bottom, the latest models, are much cuter. It's improving, and it can't be wrong — it gets that there's a kitten on the keyboard, just like it's great at writing a job description or your marketing copy, because you can look at it and go, I like that or I don't, but it's still marketing copy.
The reason it doesn't work for analysing customer feedback is the way Gen AI is based on sentence embeddings in large language models. Ben Evans says Gen AI doesn't actually think — it can't do reasoning, it can't work out whether something is right or wrong, and it won't tell you how many people think a particular thing, because it doesn't know. It's not built for that. What it is built for is spotting patterns in language, and when it sees a selection of words, putting forward other words that are plausible next.
Imagine a large language model whose job is to map the connections between all the different words in the English language. It takes words — living being, feline, human, gender, royalty — and for each it analyses loads of language and looks for how often they're associated. It gives a score for how closely associated words are across loads of criteria. How close are cat and kitten in regard to gender? In regard to feline? Much closer on feline than gender, but still close-ish, because at least they have a gender, unlike, say, a pencil. Then it condenses all of that into a map.
In this case it's a 2D map, but really it works on thousands and thousands of dimensions in a weird virtual space. On the 2D map, cat and kitten are really close together. Dog is pretty close to both, clearly not as close. But they're all closer together than, say, a house, which is at least still a noun but not a living being. It does this for every word, and for patterns of words, and patterns of patterns of words, so that what comes back is an approximation of what it thinks might come next.
So clearly the problem is it's not thinking. What can we do to make it better, because a lot of that is really useful? Copilot starts with this map of the language — sentence embeddings — then takes your customer feedback, references one against the other, and presents things you might find interesting straight from those embeddings. Two things are missing. One is a human. Your idea of what constitutes a damp and mould issue might be different to somebody else's. We all have unique geographies and dialects — people in the north of England talk very differently from people in Devon and Cornwall — and somewhere you always need a human to say this is interesting, this isn't. You can't fully automate this; so much of it is contextual.
We might have tried to solve an issue a thousand times before, so we're not interested in people talking about it — but there's a part we can fix, so we are interested in that. You need a human in the loop. And that human needs to be able to quickly take all that understanding from sentence embeddings and turn it into a structured layer: putting numbers on things, putting things in pots. This is a damp and mould issue, this is not. The sentiment of this is 25, of this is 70 out of 100. Once you've got that, Gen AI is then really good at taking that data and writing a report from it — but it can't do it directly from the source data.
Very quickly, how does Wordnerds go about that process? We start with the same sentence embeddings — they're really useful. We get people's data in, map it against those embeddings, and then get a human to give us an example of something they want to look for, in this case repairs. Based on the relationships between words the large language models understand, it pulls stuff back: we think this is similar, we think this is similar. You've said "people didn't turn up to do our repair"; somebody else said "I waited in all morning and it was a no-show." They haven't used any of the same words, but contextually they're the same, and Gen AI is great for spotting that.
We put all of those in front of a human and get them to say, yep, that's in the dataset, no, that isn't. The process takes about 10 to 15 minutes, and within that time you can train a classification model to an accuracy somewhere between 85 and 95 per cent. It's really important we're clear: it's never 100 per cent. Words are weird, humans are weirder. One good thing is it can tell you how accurate it is — we've got automated tests so you can see the degree of tolerance you're dealing with.
When we've got that, we stick them in a dashboard that's hierarchical. Within safety there's a bunch of themes we've trained — damp and mould, building security, health and safety. We can put an exact number on things: nearly 10 per cent of that dataset talks about building security. Within that you drill down further — doors. Copilot picked this up, it said some people talked about doors, but we can see exactly how many, 878, and exactly what their sentiment was. Click on doors and it goes into what people are saying, all the way down to the verbatim driving the thing you've selected. We've got about 130 classifications for housing associations available out of the box, and you can train your own. So you can drill down to the root cause of any issue.
That then spits out into a reporting platform. We now exclusively use Power BI — it's just really good at it. Volume and sentiment over time is a given. One of our asks was understanding what to do; volume and sentiment give you a big list, but they won't tell you where to start. What we can use in Power BI is things like correlation analysis. Depending on what you want to solve for — TSM score, customer satisfaction, NPS — we can look at what the drivers are statistically: of all the things people talk about, what correlates so that when mentions go up, the score goes down. That gives you a batting order: the biggest issue to solve is this, the second biggest is that.
Some of our customers — I think Chris from Guinness is on — are doing amazing stuff using this predictively. When a complaint comes in, can we see whether it's going to be escalated to the Ombudsman? Spoiler alert: yes, within a couple of weeks they got to about 86 per cent accuracy on that. And it's all in real time and available to everyone, because it's in Power BI — you're not having to bother the insights team. So that's why a specialist tool can do more than a Gen AI out of the box. I'm going to bring Stella back to take us through what she did with Copilot and what she found, and then the BI dashboard.
Stella Dooris: Yes, hello, I'm Stella. I've got some cool stuff to show, so I'll share my screen. To start, I went onto Copilot, put in the data, and said, what can you tell me about the repairs? It gave some key issues, which was pretty useful — delays, quality of work. And I said, can you tell me how many comments? And it said 3,700, which is where I got a bit nervous, because that's actually quite close to what we found on the platform. So, uh-oh.
Then I said, can you give me the row numbers? And it did my favourite thing I've seen Copilot do — it gave me row number zero, which I don't think is a thing. I'm not great on Excel, but I think it starts at one, so we can rule that one out. Then I checked the other numbers. If I go down to row 14, which it told me, it said "pleased with the efforts" — nothing about repairs. And row 23 said "disappointed with the lack of diversity." I also noticed that's not 3,000 rows. So I asked it to do it again, and it did the same numbers, zero again. I asked again because that's still not 3,000 rows, and it did the same thing again. So that's as far as I got — it was frustrating me, I had to leave it.
So then I went onto Power BI, which is a bit more comfortable to watch. I went onto repairs. Repairs is quite general, so I started general. Down the right is where our crossover themes are — the main themes that come up when people talk about repairs: contractor behaviour, quality of repair, handled in a reasonable time. That's useful to get a number on those crossover themes. I can also look at the cross table, where I've got repairs on the column and the whole category of maintenance and repairs. You can see what proportion of comments about repairs are talking about, say, contractor behaviour, 41 per cent, quality of repair, 40 per cent, only temporary, not much.
I'm also able to go into home fixtures and fittings and see what people are talking about when they mention repairs: is it doors, things like that. We can see 15 per cent of mentions of repairs talk about doors, kitchens 12.5 per cent, windows 5.4 per cent. That's quite useful to see where people are talking about the repairs. I can then look at it over time, on risers and fallers, which tracks month to month the mentions of specific themes. Repairs is quite general, so I went to "long time to repair" to get more specific. Obviously it's a synthetic dataset so it won't show much, but with a real dataset you'd often see a spike in a month, say for damp and mould, which is seasonal and rises in wetter months.
Finally I went back to the platform to see our verbatim. It goes down into repairs and then into long time to repair. On the left-hand side are the themes that cross over, like doors. On the right-hand side is where we get the verbatim. So I click long time to repair, and these are 843 comments that talk specifically about long time to repair. For example, "repair is still unresolved, two visits so far, but issues persist." So you're able to really drill down into your data and see the verbatim that Copilot unfortunately couldn't. That's me — stop sharing.
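What Stella's walkthrough relies on is that every count stays attached to the rows behind it. The sketch below shows that idea in miniature, under loud assumptions: the theme labels are hand-assigned and hypothetical, only three of the comment texts are quoted from the session, and `spreadsheet_row`, `rows_with_theme` and `crossover_shares` are illustrative names, not Wordnerds or Power BI functions. It also shows one way a "row zero" could arise, through a zero-based position being reported as if it were a spreadsheet row. The session does not confirm that this is what Copilot did.

```python
from collections import Counter

# Hypothetical labelled comments. List positions are zero-based, and the
# theme labels are invented for this example.
comments = [
    {"text": "Repair is still unresolved, two visits so far, but issues persist.",
     "themes": {"repairs", "long time to repair"}},
    {"text": "Pleased with the efforts.", "themes": {"staff"}},
    {"text": "The contractor was rude and left a mess.",
     "themes": {"repairs", "contractor behaviour"}},
    {"text": "Disappointed with the lack of diversity.", "themes": {"diversity"}},
    {"text": "Waited six weeks for the door to be fixed.",
     "themes": {"repairs", "long time to repair", "doors"}},
]


def spreadsheet_row(position, header_rows=1):
    """Convert a zero-based list position to a 1-based spreadsheet row number."""
    return position + header_rows + 1


def rows_with_theme(items, theme):
    """Spreadsheet rows of every comment labelled with the given theme."""
    return [spreadsheet_row(i) for i, c in enumerate(items) if theme in c["themes"]]


def crossover_shares(items, base_theme):
    """Share of base-theme comments that also carry each other theme."""
    base = [c for c in items if base_theme in c["themes"]]
    if not base:
        return {}
    counts = Counter(t for c in base for t in c["themes"] if t != base_theme)
    return {theme: n / len(base) for theme, n in counts.items()}


repair_rows = rows_with_theme(comments, "repairs")
shares = crossover_shares(comments, "repairs")

print("Rows mentioning repairs:", repair_rows)
for theme, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{theme}: {share:.0%} of repairs comments")
```

Because the row numbers come from the same filter that produces the percentages, anyone can open the sheet and check a figure against the verbatim.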
Sarah Wilson: I think that was brilliant, thank you. I know when I was 22 I was travelling around Asia, not manipulating Power BI dashboards. That was brilliant. There's some really good chat in the comments too, so thanks for that, Chris, around row zero. Let me reshare and we can continue. We're going to jump into a Q&A — this is the time you put questions in the chat. Ideally make some difficult questions for Pete; that's my favourite kind of Q&A.
Any questions about AI and how the Wordnerds solution copes with TSM data? Any more general questions? We're also going to pop up a second poll. I was just having a look at the poll answers, and it's really interesting because a lot of people have tried AI and the majority thought it was really great, so I'm super interested to see how this pans out in the second poll. One of the key things I'm always told from housing associations in particular is "I don't feel confident in the data." When you send to the board and want change, you've really got to be certain of the data, and that's a big worry.
If you've looked at this and think you'd like to do something a bit more granular, to be more confident, we can definitely help. I'll give you a quick breakdown of how that looks. We start with a chat with me — I'm a friendly person, not too salesy, probably could do with being a bit more salesy. We'll spend 20 minutes chatting about personal things, 10 minutes about your fit, your role, any challenges, making it a personalised consultation. Then we can jump straight into the platform — about 45 minutes, so you've spent less than a couple of hours at this point. You can see the solution, the Power BI dashboards, completely free and no obligation, as are all these steps.
And then, if you'd like to see a bit more, we've got a really nice route into giving it a try. A proof of concept costs £5,575. For that we analyse typically a couple of years of historical data; it takes four weeks to get the report back, and we're looking at slots now from April. We've had a very nice run recently of people signing up, which is great news for me and Pete — not so good news for Stella. Longer term, clients who've been with us a bit longer are on a subscription, generally starting from around 21k for the year. That's a combination of either a SaaS solution where you do all the work yourself, or a managed service where you also get reports, fully supported with the customer access team. Have we got some questions in the chat? Let me jump in.
We've got loads of questions, I'm so excited. We've got 15 minutes left, so hopefully we can get through a few. Pete, do you want to take this one or shall I?
Pete Daykin: Whatever you want, you take it. There's a great one from Chris, and I'm really interested in the chat about Python. So you do this one, I'll do the next one.
Sarah Wilson: Thanks, Joe — super good question. So: in the framework, can you analyse and compare data from multiple sources at the same time, and how can you compare the different parts of the surveys? We absolutely can. Most clients have a holistic view where they can see all the different sources, but also broken down by project and by survey type. And if you're tracking a tenant ID associated with different people, we track that across the different surveys. So you can see building safety in one survey, see it across another survey, see how the sentiment changes, and then see a holistic view across all your surveys of overall sentiment. So the answer is yes. Right, we'll go to Chris's.
Pete Daykin: Hi Chris, great result for Palace the other night, I was thinking about you. The question: if you changed your prompt to specify that you're writing a PhD, so no hypothesis can go unsupported by references and it must not bluff, would it have given you better answers? There was some chat earlier about whether row zero is because it was doing it in Python. I don't know the answer to that; we'll check. I am not a data scientist; Chris very much is.
I think, for the purposes of this, we've done the kind of prompting most people would do. There's a lot of talk at the minute about agentic AI, and there are new research and reasoning models coming out all the time. We're finding some of the sub-models do better jobs of this. And there's no doubt that in a certain amount of time Gen AI will be much better at this. A lot of the classification work we do, it will eventually be able to do — and I'll say eventually is probably relatively soon, maybe reliable in 12, 18 months, two years.
One of the issues is that it's not always right. You can tell it to be correct, but that doesn't mean it will be; they still have the disclaimer saying it might be wrong. There's also the issue of who decides what's in a dataset. One of our customers is B&Q. We were trying to train some data, because they wanted to know whether customers can find the toilet in their stores. That's an easy question for most people, but for them it was complicated by the fact that they sell toilets. So often when people said they couldn't find the toilet, the difference between not being able to find the toilets (the customer facilities) and not being able to find the toilets (the products on the shelves) is very subtle but very important. It means a completely different thing. The advantage of controlling what's in the dataset and what isn't is the difference between something making sense and not.
Just one part of this is the classification, one part is putting numbers on it. The really interesting stuff is how you present it, how people consume it, and what you then do with it — the prioritisation, the correlation analysis, the predictive stuff. So yes, it's getting better, you could prompt it better, you could find ways of getting models to check each other. It still wouldn't be accurate — it would be much closer, but you wouldn't know when it wasn't, and that's part of the problem. If you go to the board and say this is a really big problem we need to solve, and then somebody checks the data and it's wrong, you look like the idiot — you've just spent ten million quid sorting it out and it's not based on good data.
Sarah Wilson: Thanks, Pete, really good answer. We've had some really good questions and we'll try to get through as many as we can. So, Stephen asked: given that Gen AI can give you rubbish that sounds good, how do other competitors in the industry who use this in their products consider this?
Pete Daykin: I guess you'll have to ask them. We tend not to come up against other software providers that much when we're talking to people. Most of the time the thing we're replacing is manual coding or agencies, and a lot of those agencies use manual coding. Manual coding is interesting — it's a pain to do, takes ages, it's expensive, and it's not that good. It's very good at top-level stuff, but when you start breaking it down, getting humans to do 130 different classification models and put sentiment on it, and then you come up with something new and have to go back through all your old data again — it's just not good.
Some of our competitors do similar things to us, creating that structured layer in between. What we do differently is we get the human to do that: we've got a graphical user interface that lets you train that stuff yourself. We also do a lot more with unsupervised learning that we haven't had time to go into today: we pull out the natural topics in the language without you having to train anything, which isn't something Gen AI or AI generally is very good at yet. My co-founder, Steve, is a recovering linguist who deals with the structure of language, and we use a combination of AI and linguistics and some other stuff to do that a bit better. There are other tools available, please do speak to them, we love a fair fight. One of the reasons we do a proof of concept is to say to people, if you want to try us out, try us out. We're arrogant enough to believe we can go head to head with other solutions and come back with things that are different, better and more useful. We might be wrong; you can decide. Are we faithful or a traitor?
Sarah Wilson: Next question. I mentioned the learning and fine-tuning process in Wordnerds — how does this work in practice? And it links to a question further down: if we've got an insights team that can build something similar using Power BI, then why would we need the help of Wordnerds? That ties in nicely to the human aspect of the AI. Pete, do you want to touch on that?
Pete Daykin: I'll do the insights team one first. Obviously if you've got an insights team and tools available, you can totally build this yourself — just like with WordPress, you don't need an agency to build your website. Why do you use an agency? Because what takes you three months takes them three weeks; they do it all the time, better, deeper and quicker, because there are subtleties. The stuff your insights team can build is what's out of the box. When they get to things like sarcasm, or use of emojis, if the tools they're using don't deal with that well, it's very difficult to find workarounds. We've got teams of people on this five days a week; we've been working at this problem for six or seven years, and all the problems you'll come up against, we've already seen.
The other interesting thing is that BI analysts are the new bottleneck. A couple of years ago web devs were the people you could never get development out of; now everybody's moving everything into BI, and some people are waiting three, four, five months for a dashboard. So yes, you can totally do this yourself, have a go — and if you run into problems, come and speak to us. We do this as a self-serve thing, but also as a managed service, and we can flex in and out around your insights team, helping with stuff they can't necessarily do.
In terms of the fine-tuning process itself — the training into buckets is context-dependent, they're literally called context themes. Two people can say the same thing without using any of the same language, and the sentence embedding is great at finding that. Once a human has trained it and said yes, those two are the same, it makes that association and learns from it. There are also times when a keyword is really useful — place names, product names — so we give you the opportunity to do both.
Does it learn from itself over time? It doesn't, because that would take out the human in the loop. What it does do is built-in accuracy checking — we use something called F1 accuracy. There are two main ways we can be wrong: we can miss something that should have been in the dataset, or we can misclassify something as being in it that shouldn't. When you train a theme beyond a certain level, it'll come back and say this is 89 per cent accurate, this is 92 per cent accurate, so you always know how accurate your theme is over time. Sometimes that changes as new data comes in and language changes; it's a 10 to 15-minute job to retrain something. On sentiment, if we've got something wrong you can flag it — that doesn't retrain the rest of the set, because that can confuse it, but it goes straight to Hugh and Damani, our data scientists, and they look into it. We've found all kinds of weird and wonderful reasons why we've misclassified sentiment that are sometimes hilarious.
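The "F1 accuracy" Pete mentions is presumably the standard F1 score, which combines the two kinds of error he lists: comments wrongly put into a theme (which lowers precision) and comments that should have been in it but were missed (which lowers recall). The session does not say how Wordnerds computes its figure, so the sketch below is only the textbook definition, with a hypothetical function name and made-up labels.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall and F1 for one theme, from parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)   # put in the theme wrongly
    false_neg = sum(1 for p, a in pairs if not p and a)   # missed from the theme

    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical check of a "long time to repair" theme on eight comments.
model_says = [True, True, True, False, False, True, False, False]
human_says = [True, True, False, True, False, True, False, False]

precision, recall, f1 = precision_recall_f1(model_says, human_says)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so a theme only scores well if it avoids both kinds of mistake.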
Sarah Wilson: That was really good, thanks Pete. I think, Aldwyn, where you made that spelling mistake of "damper mold" — that's a good example of the way it'll pick it up. It can pick up multiple different spellings of the same word, and people often spell things wrong, so it's a good way of bringing those together. We've probably got time for one more question, and I really like this one: is your use of software a way for companies to do data analyst roles, or is it purely meant to work alongside and complement a specialist skill set?
Pete Daykin: I love this question, it's one we get asked all the time. The majority of people we work with use us because they want to spend less time going through data manually and more time working on the solutions to these problems. There's a bunch of things we do with data science teams — Chris at Guinness with his data science team, we're doing the classification piece and helping with the pre-processing of the data, and they're then taking all of this and doing interesting stuff with it. With all of these things, we're not replacing anybody.
There are going to be two types of company in the future: companies that embrace AI and companies that cease to exist. Everybody is going to be bringing AI into everything they do; eventually it's going to be second nature, and we're all going to have to do more with less — get further with fewer people, get deeper with less resource. What this does is take away a lot of the heavy lifting and let the analysts and insights teams do the higher-value stuff: the persuading, the action planning, the co-creating with your customers, the monitoring over time — the stuff that's actually going to move the needle on making residents' and customers' lives better.
Sarah Wilson: Brilliant. Well, we're at time, guys. I really enjoyed that, so thank you so much for your attention, and the questions were great. If you want to have a chat with us, just click on the link or drop me an email — I'm just sarah@wordnerds.ai — and we can continue the conversation. There were a couple of questions we didn't get to, so I'll send a little message with some answers. You know where we are. We've got another one coming on another hot topic, Power BI, so keep an eye out in your inbox for that invite. Thank you so much, really enjoyed that, I'd love your feedback if you've got any. Thanks guys, thanks Stella, thanks Pete.
About Wordnerds
Wordnerds makes customer feedback a strategic asset for the whole organisation, not just the insight team. We ingest feedback from surveys, complaints, reviews, calls and social; apply transparent, explainable AI to surface themes and drivers; and deliver the insight directly into Microsoft Power BI, where operational teams already work. We're built for UK housing associations, transport operators and regulated sectors that need auditable evidence, not a black box.