The Truth About AI Search Optimization Tools
If you're a B2B marketer using an AI search optimization tool, you're probably making decisions based on the data it gives you:
- Which prompts to track
- Where you stand against competitors
- Whether your strategy is working
But what if that data isn't reliable?
This is the biggest problem with most AI search tools today. And the place where it shows up most is in your visibility score, the most important metric in AI visibility. It's supposed to be your north star, the one number that tells you how you're doing in AI search and where to go next.
But most tools calculate it in a way that makes it completely unreliable. And if you can't rely on your north star metric, you have no idea where you're actually headed.
So let's talk about why this happens.
The base of every AI visibility tool is prompt tracking
Every tool out there, at its core, does the same thing. It tracks prompts.
What that means is: you give it a list of prompts, it runs those prompts, and it tells you how often you show up in the answers. That number is your visibility score.
That sounds simple, but here's what makes it tricky. Visibility score is not like website traffic. If 500 people visited your website today, that's 500. It doesn't change based on how you measure it.
Visibility score is different. It completely changes based on what prompts you're tracking.
Track different prompts, get a different score. Track more of the right prompts, score goes up. Track more of the wrong ones, score goes down.
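To make that concrete, here is a minimal sketch of the arithmetic. It assumes the simplest possible definition, the share of tracked prompts where your brand appears in the answer. The prompts and results are made up for illustration, and real tools may calculate the score differently.

```python
def visibility_score(results: dict[str, bool]) -> float:
    """Share of tracked prompts whose answer mentions the brand (0.0 to 1.0)."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


# Hypothetical data: prompt -> did the brand appear in the answer?
base = {
    "best crm for startups": True,
    "crm with email automation": True,
    "top crm for enterprise": False,
    "crm pricing comparison": False,
}
# Extra prompts where the brand happens to show up.
extra_wins = {"crm for small sales teams": True, "simple crm for founders": True}
# Extra prompts where it does not.
extra_misses = {"what is a crm": False, "history of crm software": False}

print(visibility_score(base))                               # 0.5
print(round(visibility_score({**base, **extra_wins}), 2))   # 0.67
print(round(visibility_score({**base, **extra_misses}), 2)) # 0.33
```

Same brand, same week, same strategy. The only thing that changed between those three numbers is the list of prompts being tracked.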
This means the tool you use, and how it tracks prompts, matters a lot. More than most people realize.
Before we go through all the ways prompt tracking can go wrong, it helps to know what a good score looks like.
So what does a reliable visibility score actually look like?
Think of it like share of voice from your SEO days.
Out of all the searches that happen in your category, how many of them include you in the answer? That percentage is your visibility score.
And for buying-related prompts, a brand usually shows up in a ChatGPT answer because ChatGPT is recommending that brand. So your visibility score closely tracks how often you're getting recommended.
As more people use LLMs to make purchase decisions, the total number of these searches is also growing. So you're not just competing for a fixed pool. The pool is getting bigger.
Your visibility score determines what percentage of that growing pool you're capturing.
If your visibility score is going up, your MQLs, SQLs, and pipeline should go up too. Not in a perfectly straight line, but the direction should match.
If your visibility score is all over the place, going up one week and down the next without anything changing in your strategy, that's not a strategy problem. That's a data problem.
7 reasons why prompt tracking fails
Reason 1: LLMs don't always give the same answer
Ask ChatGPT the same question twice and you might get two different answers. One time it recommends companies A, B, and C. The next time it's B, C, and D. Sometimes the list changes completely. This is just how LLMs work: variability is built in.
So if your tool doesn't know how to handle this, your visibility score is going to jump around all the time. Not because anything changed in your strategy, but because the model gave a different answer that day. We've written about exactly how this works and what it means for your data in our blog on whether AI search tools are even useful.
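One common way to smooth out this kind of noise is to run each prompt several times and look at how often you show up across runs, instead of trusting a single answer. The sketch below uses made-up answers for one prompt. It illustrates the idea and is not a description of how any particular tool handles it.

```python
def mention_rate(runs: list[list[str]], brand: str) -> float:
    """Fraction of repeated runs of one prompt in which the brand was named."""
    if not runs:
        return 0.0
    hits = sum(1 for answer in runs if brand in answer)
    return hits / len(runs)


# Hypothetical: the same prompt run four times, with the brands each answer named.
runs = [
    ["A", "B", "C"],
    ["B", "C", "D"],
    ["A", "C", "D"],
    ["A", "B", "D"],
]

print(mention_rate(runs[:1], "A"))   # 1.0  (only the first run was sampled)
print(mention_rate(runs[1:2], "A"))  # 0.0  (only the second run was sampled)
print(mention_rate(runs, "A"))       # 0.75 (all four runs)
```

A tool that samples once would report company A as either always visible or never visible for this prompt, depending on the day. Across four runs the picture is steadier.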
Reason 2: You're tracking prompts where no vendor can ever be recommended
Some prompts will never recommend a vendor. Some are too broad, some are informational, and some are just not the kind of question where ChatGPT will say "you should check out this company." Your team might not know this upfront, because it's not always obvious.
But if your tool doesn't flag it, those prompts get tracked anyway and drag your visibility score down for no reason. Before you start tracking, your tool should tell you: "Hey, this prompt is never going to recommend a vendor, so don't bother."
Reason 3: Too many variations of the same prompt
Prompts are not like keywords. Keywords are finite, but prompts are effectively infinite. If you're not careful, you end up tracking twenty versions of one type of prompt and only two versions of another. If you happen to show up a lot in the first type, your visibility score looks higher. Not because you're doing better, but because of which prompts you chose to track.
The reverse is true too. Track more of the prompts where you don't show up and your score drops, without you doing anything differently. This is why plain prompt tracking doesn't really work in B2B. The score becomes a reflection of what you tracked, not how you're actually performing.
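Here is a rough sketch of that skew, using the twenty-versus-two example above. It compares the plain score with one possible correction: averaging the hit rate per prompt type, so each type counts once no matter how many variations you track. The category names and results are hypothetical, and this is just one way to balance the mix.

```python
from collections import defaultdict


def naive_score(results: list[tuple[str, bool]]) -> float:
    """Plain share of prompts where the brand appeared."""
    if not results:
        return 0.0
    return sum(1 for _, mentioned in results if mentioned) / len(results)


def balanced_score(results: list[tuple[str, bool]]) -> float:
    """Average of per-category hit rates, so each prompt type counts once."""
    by_category: dict[str, list[bool]] = defaultdict(list)
    for category, mentioned in results:
        by_category[category].append(mentioned)
    if not by_category:
        return 0.0
    rates = [sum(hits) / len(hits) for hits in by_category.values()]
    return sum(rates) / len(rates)


# Hypothetical: 20 variations of one prompt type (18 hits), 2 of another (0 hits).
results = (
    [("comparison", True)] * 18
    + [("comparison", False)] * 2
    + [("pricing", False)] * 2
)

print(round(naive_score(results), 2))     # 0.82
print(round(balanced_score(results), 2))  # 0.45
```

The plain score says you're visible 82% of the time. Once both prompt types get equal weight, it's 45%, because you never show up in one of the two types at all.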
Reason 4: Database tools vs. real-time data collection
Some GEO tools don't actually run prompts for you. They give you access to a database of prompts they've already run. That sounds convenient, but it means your score can move without you doing anything. If the tool adds new prompts to its database and you happen to show up in them, your score goes up. If you don't, it goes down.
On top of that, if your industry isn't well covered in their database, you're missing a big chunk of the prompts that actually matter for you, and you'd never even know. One easy way to tell the difference: if you get results the moment you search, the tool is most likely pulling from a saved database. If it takes time to run, it's most likely going out and collecting real-time data.
Reason 5: API tracking vs. what actually shows up in ChatGPT
A lot of tools use the model's API to track answers. They run a call, get a response, and that becomes your data. The problem is that the answer you get from the API can be different from what actually shows up when a real user types that same question into ChatGPT.
So if your tool is only using the API, your visibility score is based on something your buyers may never actually see. You want a tool that tracks what real users see, not just what the API returns.
Reason 6: How regions are handled
If your business operates in multiple countries, you want to know how you show up in each of those regions. But here's how some tools handle regional tracking, including some of the biggest tools out there: they just add the region inside the prompt, like "In Poland, what is the best project management tool?" That's not how it works in real life. When someone in Poland searches on ChatGPT, they're not typing "in Poland" every time.
The tool should actually run the search from that region, the way a real user in that country would, and the difference in answers can be significant. So if your tool is just adding the country name to the prompt, the regional data you're looking at isn't accurate.
Reason 7: Starting with too few prompts
A lot of tools tell you to start small: track 20 prompts, or 30, or 40. That advice made some sense in SEO, but it doesn't work in GEO. Everything in your AI search strategy is about prioritization. Which listicles should you reach out to? Which topics should you create content on? Where should you spend your budget? To answer any of those questions well, you need to see the full picture, and you can only see that if you're tracking everything. If you're still building out your approach, our GEO strategy guide walks through how to set this up properly.
Most tools cap you because plain prompt tracking becomes unmanageable at scale. But if you break prompts down into seed keywords and expand from there, you can map out the entire industry without it becoming impossible to manage. That's the only way to make good prioritization decisions.
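As a rough illustration of the seed keyword idea, the sketch below expands a short list of seeds into prompts using a few question templates, and keeps every prompt grouped under its seed. The seeds, templates, and function name are all hypothetical. A real setup would need far more of both, and this is not a description of any specific tool's method.

```python
def expand_prompts(seeds: list[str], templates: list[str]) -> dict[str, list[str]]:
    """Build prompts per seed keyword so coverage stays organized by seed."""
    return {seed: [t.format(seed=seed) for t in templates] for seed in seeds}


# Hypothetical seed keywords and question templates.
seeds = ["project management software", "time tracking tool"]
templates = [
    "What is the best {seed} for small teams?",
    "Which {seed} should an agency use?",
    "What are the top {seed} options for enterprise?",
]

prompts = expand_prompts(seeds, templates)
for seed, group in prompts.items():
    print(f"{seed}: {len(group)} prompts")
    for prompt in group:
        print(f"  {prompt}")
```

Because every prompt stays attached to its seed, you manage a handful of seeds instead of hundreds of loose prompts, and you can see at a glance which seeds are over- or under-represented.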
Not sure if your current tool is giving you reliable data? Book a call with us and we'll show you whether your visibility score can actually be trusted.