Good bot, bad bot: Using AI and ML to solve data quality problems

More than 40% of all website traffic in 2021 wasn’t even human.

This may sound alarming, but it’s not necessarily a bad thing; bots are essential to the functioning of the internet. They make our lives easier in ways that aren’t always obvious, like delivering push notifications about promotions and discounts.

But of course there are bad bots, and they account for almost 28% of all website traffic. From spam and account takeovers to scraping of personal information and malware, it’s usually how people deploy bots that separates the good from the bad.

With the release of accessible generative AI like ChatGPT, it will become more difficult to discern where bots end and humans begin. These systems are getting better at reasoning: GPT-4 passed the bar exam in the top 10% of test takers, and bots have even defeated CAPTCHA tests.


In many ways, we could be on the cusp of a critical mass of bots on the internet, and that could be a serious problem for consumer data.

The existential threat

Companies spend around $90 billion on market research each year to decipher trends, customer behavior and demographics.

But even with this direct line to consumers, innovation failure rates are appalling. Catalina projects that the failure rate for consumer packaged goods (CPG) is a frightening 80%, while the University of Toronto found that 75% of new grocery products fail.

What if the data these innovators rely on was riddled with AI-generated responses and didn’t actually represent the thoughts and feelings of a consumer? We would be living in a world where companies lack the critical resources to inform, validate and inspire their best ideas, causing failure rates to skyrocket, a crisis they cannot afford right now.

Bots have been around for a long time, and for the most part, market research has relied on manual processes and instinct to analyze, interpret, and weed out low-quality respondents.

But while humans are exceptional at making sense of data, we are incapable of distinguishing bots from humans at scale. The reality for consumer data is that the emerging threat of large language models (LLMs) will soon outpace the manual processes we use to identify bad bots.

Bad bot, meet the good bot

Where bots can be a problem, they could also be the answer. By creating a layered approach using AI, including deep learning or machine learning (ML) models, researchers can build systems that separate out low-quality data and rely on good bots to do the work.

This technology is ideal for detecting subtle patterns that humans can easily miss or fail to understand. And if managed correctly, these processes can feed ML algorithms that constantly evaluate and cleanse data to keep its quality AI-proof.

Here’s how:

Create a quality measure

Instead of relying solely on manual intervention, teams can ensure quality by creating a scoring system that identifies common bot tactics. Building a quality measure requires some subjectivity. Researchers can establish guardrails for responses across several factors. For example:

  • Spam probability: Are responses made up of inserted or cut-and-pasted content?
  • Gibberish: A human response may contain brand names, proper nouns, or misspellings, but it generally tracks toward a cogent answer.
  • Skipping recall questions: While AI can adequately predict the next word in a sequence, it cannot replicate personal memories.

These checks can be subjective; that’s the point. Now more than ever, we must be skeptical of data and build systems to standardize quality. By applying a point system to these traits, researchers can compile a composite score and remove low-quality data before moving on to the next layer of verification.
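As a rough illustration of such a point system, the sketch below sums penalty points for each trait and drops responses whose composite score reaches a cut-off. The point values, the cut-off, and every name here (`Response`, `PENALTIES`, `REJECT_AT`) are assumptions made for the example; the article does not prescribe them, and the three boolean flags are assumed to come from upstream checks that are not shown.

```python
from dataclasses import dataclass

# Hypothetical point values; a real team would tune these on its own data.
PENALTIES = {"spam": 3, "gibberish": 2, "skipped_recall": 2}
REJECT_AT = 4  # assumed cut-off: composite scores at or above this are removed


@dataclass
class Response:
    respondent_id: str
    looks_pasted: bool    # spam signal: inserted or cut-and-pasted content
    is_gibberish: bool    # answer never arrives at a coherent point
    skipped_recall: bool  # personal-memory question left unanswered


def composite_score(r: Response) -> int:
    """Sum the penalty points for every trait the response exhibits."""
    score = 0
    if r.looks_pasted:
        score += PENALTIES["spam"]
    if r.is_gibberish:
        score += PENALTIES["gibberish"]
    if r.skipped_recall:
        score += PENALTIES["skipped_recall"]
    return score


def filter_responses(responses):
    """Keep only responses whose composite score is below the cut-off."""
    return [r for r in responses if composite_score(r) < REJECT_AT]
```

A single weak signal (for example, one skipped recall question) does not remove a respondent under these assumed weights; it takes a combination of traits to cross the cut-off.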

Look at the quality behind the data

With the rise of human-like AI, bots can slip under the radar if quality scores are the only check. This is why it is imperative to layer these signals with data about the response itself. Real people take the time to read, reread, and analyze before responding; bad actors often don’t, so it’s important to look at response-level data to understand bad-actor trends.

Factors such as response time, repetition, and insightfulness go beyond the surface level to analyze the nature of the responses. If responses are too fast, or if nearly identical responses are recorded within a survey (or across surveys), that can be a telltale sign of low-quality data. Finally, go beyond nonsensical answers and identify what makes a response insightful: by looking critically at factors such as response length and the number of adjectives used, you can weed out the lowest-quality responses.
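The response-level signals described above could be approximated along these lines. This is a minimal sketch using only Python's standard library; the thresholds (`MIN_SECONDS`, `DUPLICATE_RATIO`) are assumed values, and word count stands in as a crude proxy for insight, since counting adjectives would need a part-of-speech tagger that is left out here.

```python
from difflib import SequenceMatcher
from typing import List, Set

MIN_SECONDS = 5.0      # assumed floor for a plausible human answer
DUPLICATE_RATIO = 0.9  # assumed similarity at which answers count as near-identical


def too_fast(seconds: float) -> bool:
    """Flag answers submitted faster than the assumed human floor."""
    return seconds < MIN_SECONDS


def near_duplicates(answers: List[str]) -> Set[int]:
    """Return the indexes of answers that closely match another answer."""
    flagged: Set[int] = set()
    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            ratio = SequenceMatcher(
                None, answers[i].lower(), answers[j].lower()
            ).ratio()
            if ratio >= DUPLICATE_RATIO:
                flagged.update((i, j))
    return flagged


def word_count(answer: str) -> int:
    """Crude length measure: number of whitespace-separated words."""
    return len(answer.split())
```

The pairwise comparison is quadratic in the number of answers, which is acceptable for a single survey question but would need a cheaper technique (such as hashing normalized text) to compare across many surveys.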

By looking beyond the obvious data, we can establish trends and build a consistent model from high-quality data.

Let AI do the cleaning for you

Ensuring high-quality data is not a “set it and forget it” process; it requires constant moderation and ingestion of good and bad data to hit the moving target that is data quality. Humans play an integral role in this flywheel: they configure the system, then review the data to spot patterns that influence the standard, and feed those features back into the model, including rejected items.
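One very simple way to picture this feedback loop: human reviewers label a sample of scored responses as good or bad, and the rejection cut-off is then re-derived from those labels. The sketch below is an assumption-laden stand-in for a real ML retraining step; `best_cutoff` and the `(score, is_bad)` pair format are hypothetical.

```python
def best_cutoff(labelled):
    """Pick the score cut-off that agrees most often with human reviewers.

    labelled: list of (score, is_bad) pairs, where is_bad is the
    reviewer's verdict. A response is rejected when score >= cut-off.
    """
    candidates = sorted({score for score, _ in labelled})

    def agreement(cut):
        # Count the labelled items where the rule and the reviewer agree.
        return sum((score >= cut) == is_bad for score, is_bad in labelled)

    return max(candidates, key=agreement)
```

Rerunning this whenever reviewers label a new batch, rejected items included, lets the threshold drift with the data instead of staying fixed at its initial setting.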

Your existing data isn’t immune either. It should not be set in stone, but held to the same rigorous standards as new data. By regularly cleaning normative databases and historical benchmarks, you can ensure that each new piece of data is measured against a high-quality benchmark, enabling more agile and confident decision-making at scale.

Once these scores are available, this methodology can be scaled across regions to identify high-risk markets where manual intervention might be needed.
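To make the regional roll-up concrete, the following sketch computes the share of rejected responses per region and lists the regions above a chosen threshold. The region codes, the `(region, was_rejected)` record format, and the default threshold are all illustrative assumptions.

```python
from collections import defaultdict


def rejection_rate_by_region(records):
    """records: iterable of (region, was_rejected) pairs."""
    totals = defaultdict(int)
    rejected = defaultdict(int)
    for region, was_rejected in records:
        totals[region] += 1
        if was_rejected:
            rejected[region] += 1
    return {region: rejected[region] / totals[region] for region in totals}


def high_risk(rates, threshold=0.5):
    """Regions whose rejection rate exceeds the (assumed) threshold."""
    return sorted(region for region, rate in rates.items() if rate > threshold)
```

Regions returned by `high_risk` would be the candidates for the manual intervention mentioned above.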

Fight bad AI with a good AI

The market research industry is at a crossroads; data quality is getting worse, and bots will soon make up an even bigger share of internet traffic. Time is short, and researchers must act fast.

But the solution is to fight bad AI with good AI. This will set a virtuous flywheel spinning: the system gets smarter as the models ingest more data. The result is a continuous improvement in data quality. More importantly, it means that companies can trust their market research to make better strategic decisions.

Jack Millership is the data expertise lead at Zappi.

