James Arthur, CTO, Hazy

We often hear about how artificial intelligence has come a long way since the advent of large datasets, made possible by the rise of social media and our increasing reliance on digital solutions for everyday life. This is undoubtedly true, but in reality there are many other factors at play, including improvements in the quality of data, as well as advances in the way that algorithms are trained.

Between the data and the Deep Blue

Many people point to the moment that IBM’s Deep Blue defeated chess world champion Garry Kasparov as the turning point for AI. No doubt, it was a remarkable demonstration of the superior capabilities of computing power over the human brain, sparking worldwide fears of a Kubrickian future where the human race is subjugated and enslaved by a ruling class of evil, robotic overlords. But oh how far we have come.

Deep Blue was trained using a dataset of 700,000 Grandmaster chess games, enabling it to algorithmically determine which potential move had the highest mathematical chance of winning the game. Compare this with AlphaZero, the AI computer program developed by Google’s DeepMind, which learnt to play better than anybody else - human or machine - with no access to external data, simply by using the rules of the game.

AlphaZero was trained solely by playing against itself, thus generating its own synthetic data. After nine hours of training, it had mastered the game with a playing style unlike that of any other player or traditional chess engine.
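The self-play idea is simple enough to sketch in a few lines of Python. The toy below is our own illustration, not DeepMind’s code: it swaps chess for tic-tac-toe and uses random players rather than a learning agent, but the principle is the same - the program plays against itself, and every board state, labelled with the game’s final outcome, becomes a synthetic training example with no external data required.

```python
import random

# Win lines for a 3x3 board indexed 0..8.
WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
        (0, 3, 6), (1, 4, 7), (2, 5, 8),
        (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if someone has three in a row, else None."""
    for a, b, c in WINS:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_game(rng):
    """Play one random game; return a list of (state, outcome) pairs."""
    board = [""] * 9
    states = []
    player = "X"
    while True:
        moves = [i for i, cell in enumerate(board) if not cell]
        if not moves:
            return [(s, "draw") for s in states]
        board[moves[rng.randrange(len(moves))]] = player
        states.append(tuple(board))
        if winner(board):
            return [(s, f"{player} wins") for s in states]
        player = "O" if player == "X" else "X"

# Generate a synthetic dataset purely by self-play.
rng = random.Random(0)
dataset = [pair for _ in range(1000) for pair in self_play_game(rng)]
print(len(dataset), "synthetic training examples")
```

A real system like AlphaZero would of course use the current best network (not random play) to pick moves, and feed the labelled states back into training - but the data-generation loop has the same shape.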

“It’s like discovering the secret notebooks of some great player from the past.”

Within 24 hours, AlphaZero had gone on to do the same for the games of shogi and Go, demonstrating the ease with which the program could achieve superhuman skill through self-play alone.

Quality over quantity

For AI, high-quality “small datasets” - i.e. well-curated, up-to-date and reliable data - are much more useful than enormous sets of poor-quality data. But good-quality data can be extremely difficult, time-intensive and expensive to procure in the real world - and in our current climate of cyberattacks and data breaches, real data (or “historical data”) is increasingly at risk of being compromised.

This is where synthetic data (which we’ve written about before) has real world application. It has all the same properties as historical data but without the same degree of sensitivity or procurement difficulties. Another benefit is that the AI has complete control over the input and output of its processes, never having to leave its “digital space” while the data is generated, analysed and tested.

For this reason, synthetic data is particularly useful in data-limited situations where specific data is hard to capture or there are significant privacy risks. In fact, computer scientists have already begun using this approach to deal with medical and military data.


For us at Hazy, the most exciting application of synthetic data is when it is combined with anonymised historical data (e.g. identifiable features are removed or masked) to create brand new hybrid data. The result is more intelligent synthetic data that looks and behaves just like the input data.
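To make the idea concrete, here is a deliberately crude sketch in Python - an illustration only, not Hazy’s actual pipeline. Identifiers are stripped from a toy table of historical records, a simple multivariate Gaussian is fitted to the remaining columns, and brand-new rows are sampled that share the statistical shape of the input:

```python
import numpy as np

# Illustrative sketch only - not Hazy's pipeline. Anonymise a toy
# "historical" table, fit a simple generative model, sample new rows.
rng = np.random.default_rng(42)

# Toy historical records: a customer ID plus two numeric attributes.
ids = np.arange(1000)
historical = np.column_stack([
    rng.normal(40, 10, 1000),          # age
    rng.normal(55_000, 12_000, 1000),  # income
])

# Step 1: anonymise - identifying features never enter the model.
del ids

# Step 2: fit a simple generative model (multivariate Gaussian) to
# the remaining columns. Real generators are far more sophisticated.
mean = historical.mean(axis=0)
cov = np.cov(historical, rowvar=False)

# Step 3: sample brand-new synthetic rows with the same structure.
synthetic = rng.multivariate_normal(mean, cov, size=1000)
print(synthetic.shape)  # (1000, 2)
```

No synthetic row corresponds to any real individual, yet aggregate statistics (means, correlations) carry over - which is exactly the property that makes hybrid data useful downstream.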

An easy way to visualise this is by visiting the website This Person Does Not Exist, which uses a dataset of real celebrity faces to produce a new image of a randomly generated human with each refresh of the page. In this example, the faces are the historical data and the synthetic human faces are the hybrid data.

One particular hybridisation technique utilises Generative Adversarial Networks (GANs), which are particularly adept at identifying structure in datasets that would otherwise elude human brains. The technology made headlines very recently when Nvidia announced that it had used GANs to develop a tool that can turn Microsoft Paint-style sketches into photorealistic images of landscapes.
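The adversarial idea behind GANs can be sketched in miniature. The toy below is purely illustrative - real GANs use deep networks in a framework such as PyTorch - but it shows the two-player loop: a tiny linear generator tries to mimic one-dimensional “real” data, while a tiny logistic discriminator tries to tell the two apart, each trained with hand-derived gradients.

```python
import numpy as np

# Toy 1-D GAN: a linear generator vs. a logistic discriminator.
# Purely illustrative - real GANs use deep networks and autodiff.
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

a, b = 0.1, 0.0   # discriminator D(x) = sigmoid(a*x + b)
w, c = 1.0, 0.0   # generator     G(z) = w*z + c, with z ~ N(0, 1)
lr, batch = 0.02, 128

for step in range(1500):
    real = rng.normal(4.0, 1.25, batch)  # the "historical" data
    z = rng.normal(0.0, 1.0, batch)
    fake = w * z + c                     # the synthetic data

    # Discriminator step: push D(real) -> 1 and D(fake) -> 0.
    d_real, d_fake = sigmoid(a * real + b), sigmoid(a * fake + b)
    grad_a = -np.mean((1 - d_real) * real) + np.mean(d_fake * fake)
    grad_b = -np.mean(1 - d_real) + np.mean(d_fake)
    a -= lr * grad_a
    b -= lr * grad_b

    # Generator step: push D(fake) -> 1 (non-saturating GAN loss).
    d_fake = sigmoid(a * fake + b)
    upstream = -(1 - d_fake) * a         # dLoss/dG(z)
    w -= lr * np.mean(upstream * z)
    c -= lr * np.mean(upstream)

samples = w * rng.normal(0.0, 1.0, 1000) + c
print("generator output mean:", round(samples.mean(), 2))
```

Neither player is ever shown the other’s parameters; the generator improves only because the discriminator keeps finding the structure that gives the fakes away - the same dynamic, scaled up enormously, behind face generators and Nvidia’s sketch-to-landscape tool.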

Whether you should utilise fully synthetic, partially synthetic or hybridised data depends entirely on each specific use case. And because the applications are so vast - from stress-testing security systems to helping identify rare diseases - it’s crucial that data scientists are given tools that select the best possible generation algorithms for the job.


In the age of GDPR, data privacy has never been so important - and many organisations are losing utility from their datasets due to more restricted methods of data collection and pseudonymisation techniques. Synthetic data generation offers a solution with minimal privacy risks and maximum utility.

No doubt, it’s a great tool for testing models and, ultimately, for accelerating AI innovation. As it currently stands, though, it can be difficult to accurately synthesise the outliers that prove insightful in real-world datasets (e.g. instances of rare disease). In fact, the US Census Bureau carries a strict disclaimer on its website about false insights drawn from synthetic data. It seems that for some problems, you just can’t beat real data… yet.

So although it’s not the one-size-fits-all solution that some people are touting it to be, synthetic data generation and GANs are rapidly becoming some of the most popular new tools in the data science toolkit.