Synthetic Data – The Good, The Bad, and The Ugly

In today’s rapidly evolving market research landscape, synthetic data has emerged as a buzzworthy solution to some age-old problems. Researchers are always looking for ways around challenges like hard-to-reach audiences, long fieldwork times, and bots and bad actors muddying the waters. But what exactly is synthetic data, and should we be embracing it with open arms or approaching it with caution?
What is Synthetic Data?
Simply put, synthetic data is artificially generated information that mimics real-world data. In market research, it’s AI-generated survey responses designed to mirror actual human responses. Think of it as a digital stand-in for human participants – created to fill gaps and boost response rates where needed.
The most compelling use case for synthetic data is representing hard-to-reach populations. We’ve all been there – your research requires input from a specific demographic that’s proving nearly impossible to recruit. Maybe it’s high-income executives, residents of sparsely populated towns or counties, or members of a very niche interest group. Synthetic data offers a potential solution to ensure these voices aren’t missing from your analysis.
The Good
There are some genuine benefits to incorporating synthetic data in your research strategy:
- It helps represent otherwise underrepresented audiences, potentially making your research more inclusive.
- It can dramatically reduce fieldwork time – no more extended waiting periods while you hunt for those elusive respondents.
- When implemented well, it can effectively mirror high-level topline data from real human responses.
The Bad
However, synthetic data comes with significant limitations:
- No matter how sophisticated, AI-generated responses lack genuine human insight – they’re approximations, not authentic voices.
- Synthetic data relies on assumptions and historical trends. It’s subject to the biases of the models and the people who build them, and it can’t adapt to changes in the market such as shifting economic conditions or emerging trends.
- There’s a clear danger of misuse. It can be tempting to over-rely on synthetic data to reduce costs or accelerate timelines, but doing so can lead to inaccurate conclusions and bad recommendations.
The Ugly
Recent research by Strat7, detailed in their report “Putting Synthetic Data to the Test,” raises serious concerns about synthetic data quality. The report, which I encourage you to read in full, highlights several troubling issues:
- The synthetic responses evaluated by the Strat7 researchers lacked logical consistency. At the individual question level they closely mirrored the human data, but comparing answers across multiple questions revealed contradictions. Think of it this way: a synthetic respondent might claim to be a vegetarian in one answer, then name cheeseburgers as their favorite food later in the same survey, which obviously doesn’t make sense!
- The synthetic data showed a “bunching effect”, with responses clustering in the middle of scale questions rather than showing the natural spread toward the extremes that human data typically presents. (A short sketch after this list illustrates how both this and the consistency issue above might be spotted in practice.)
- Perhaps most concerning for research professionals, synthetic data produced entirely different key drivers for purchasing behavior compared to human responses – potentially leading to misguided strategic recommendations.
- The researchers found synthetic data was unsuitable for segmentation analysis, one of the most valuable applications of market research.
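As a rough illustration of how these quality problems might surface in practice, here is a minimal sketch in Python using pandas. The data, column names, and values are hypothetical toy examples of my own, not anything taken from the Strat7 study: one check flags logically contradictory answers across questions, the other compares how often each source uses the extremes of a scale question.

```python
import pandas as pd

# Toy data only; the column names ("diet", "favorite_food", etc.) are
# hypothetical and not drawn from the Strat7 report.
df = pd.DataFrame({
    "source":          ["human", "human", "synthetic", "synthetic"],
    "diet":            ["vegetarian", "omnivore", "vegetarian", "omnivore"],
    "favorite_food":   ["salad", "cheeseburger", "cheeseburger", "pizza"],
    "purchase_intent": [1, 5, 3, 3],  # a 1-5 scale question
})

# Check 1: cross-question consistency.
# Flag respondents whose answers contradict each other, e.g. "vegetarian"
# in one question but "cheeseburger" as favorite food in another.
contradictions = df[(df["diet"] == "vegetarian") & (df["favorite_food"] == "cheeseburger")]
print(contradictions[["source", "diet", "favorite_food"]])

# Check 2: the "bunching" effect.
# Compare how often each source picks the scale extremes (1 or 5); a sample
# that almost never does so is clustering in the middle of the scale.
extreme_rate = df["purchase_intent"].isin([1, 5]).groupby(df["source"]).mean()
print(extreme_rate)
```

On a real project you would use far larger samples and proper statistical tests, but even simple checks like these can surface the kinds of issues described above.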
My Take
I’m not particularly surprised by these findings. While AI continues to make remarkable strides, it’s no replacement for authentic human perspectives. Marketing research is all about uncovering what makes people tick and why: how they’ll respond to specific advertisements and messages, why they purchase one brand over another, where the pain points are in their existing customer journeys. That is something synthetic data simply cannot replicate.
Like most AI applications, synthetic data has its place – but that place should be carefully defined and limited. The Strat7 researchers recommend capping synthetic data at no more than 5% of your overall sample and using it exclusively to supplement underrepresented demographic groups. That seems like sound advice to me.
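To make that cap concrete, here is a small, hypothetical sketch of how you might apply it: given the human completes achieved per group and your quota targets, it works out how many synthetic top-ups could be added to fill shortfalls without synthetic responses exceeding 5% of the final blended sample. The function, group labels, and numbers are my own illustration, not a formula from the Strat7 report.

```python
# Hypothetical helper for the "no more than 5% synthetic" guideline.
def synthetic_topup_allowance(human_counts, targets, cap=0.05):
    # How many completes each group is short of its target.
    shortfalls = {g: max(targets[g] - human_counts.get(g, 0), 0) for g in targets}
    needed = sum(shortfalls.values())
    # If synthetic must be at most `cap` of the blended sample (human + synthetic),
    # then synthetic <= cap * human / (1 - cap).
    human_total = sum(human_counts.values())
    max_synthetic = int(cap * human_total / (1 - cap))
    return shortfalls, min(needed, max_synthetic)

human   = {"execs": 12, "rural": 30, "gen_pop": 900}   # completes achieved
targets = {"execs": 50, "rural": 50, "gen_pop": 900}   # quota targets
print(synthetic_topup_allowance(human, targets))
# -> ({'execs': 38, 'rural': 20, 'gen_pop': 0}, 49)
```

In this toy case the 5% cap binds before the shortfalls are fully covered, which is exactly the kind of trade-off the guideline is meant to force.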
Will synthetic data improve? Absolutely. The technology will undoubtedly become more sophisticated, generating increasingly realistic responses. But I remain convinced that it should only complement human research in specific circumstances and never be used as a replacement.
My recommendation – proceed with caution. Synthetic data might be a useful tool in specific situations, but it’s no substitute for the rich, nuanced insights that come from real human participants.
Posted by
Stewart Law, Research Manager