Training AI with Fake Data: A Flawed Solution?
The word 'fake' is loaded with negative connotations. It brings to mind inferior quality, unscrupulous motives and the feeling of being short-changed. In the world of machine learning and artificial intelligence development though, the nature of what it means to be fake is more complex.
Machine learning, at its foundation, is built on comprehensive datasets. The process of creating algorithms that can begin making reliable predictions is reliant on having access to reams of data. Such data doesn't come quickly, or cheaply.
For tech giants like Google, Apple and Amazon, this doesn't pose too much of an issue. They have a virtually limitless supply of diverse data streams, creating the perfect ecosystem for researchers to train their algorithms. For smaller companies, particularly startups, the access to these resources is severely limited, or non-existent.
It is under these circumstances that the case for the use of synthetic or 'fake' data begins to emerge. Just as 'real world' data sources and methods of obtaining them are diverse, so it is for synthetic data. When researching the topic, you'll often find its definition condensed – that synthetic data is created algorithmically, mimicking traditionally harvested datasets in order to train machine learning models.
Both the demand for synthetic data and the scope of its potential applications are immense, but so is the gap between the resources that different researchers and tech companies can devote to producing it. BDJ spoke to an expert in the field, Andreas Mueller, lecturer at the Columbia Data Science Institute and author of 'Introduction to Machine Learning with Python', for an insight into the depth of synthetic data production.
New Data Through Augmentation
Grouping synthetic data under a single moniker is misleading, because the field contains distinct subsets that differ in how the data is produced and how it is applied. The first, Andreas explains, is the creation of new data through data augmentation. This is common in image and video analysis, where new data examples are created from known, established ones.
“Say one of them is cars,” Andreas explains. “We can create a ‘new’ image of a car by mirroring an existing image (a left-to-right flip). Cars are (mostly) mirror-symmetric, so if you mirror an image of a car, it will still be a car.
“Standard learning systems for computer vision (convolutional neural networks) don't have a built-in way to model mirror symmetry, so this new example will provide new learning information, without the need to collect more data.
“Similarly, you can change the crop, and, in some cases, the rotation of an image to get a new example. These transformations change the image, but we still know the right label – if it was a car before, it's still a car.”
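The transformations Andreas describes can be sketched in a few lines. The following is a minimal illustration only, assuming a toy grayscale image stored as nested lists; the names `mirror`, `crop` and `augment` are hypothetical helpers invented for this example, not part of any particular library, and a real project would more likely lean on an image library.

```python
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def mirror(image: Image) -> Image:
    """Left-to-right flip: reverse the pixels in every row."""
    return [row[::-1] for row in image]


def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Cut out a height x width window whose corner is at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def augment(image: Image, label: str) -> List[Tuple[Image, str]]:
    """Return the original plus transformed copies, all keeping the same label."""
    h, w = len(image), len(image[0])
    return [
        (image, label),
        (mirror(image), label),
        (crop(image, 0, 0, h - 1, w - 1), label),  # top-left crop
        (crop(image, 1, 1, h - 1, w - 1), label),  # bottom-right crop
    ]


# A made-up 3x3 "car" image, used purely for illustration.
toy_car = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
]
examples = augment(toy_car, "car")
```

One labelled image becomes four labelled examples here. Whether each transformation is appropriate still depends on the data: the label has to remain true after the change.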
Subtle augmentation effectively creates 'free' and potentially infinite data streams, Andreas adds. The relative simplicity of the augmentation is another plus for this data synthesising method.
This doesn't mean that researchers can afford to be complacent with this 'mirror/flip' implementation though. There are potential hiccups stemming from issues like human cultural distinctions that could render data labels incorrect, depending on the region where the machine learning program is being used.
“The only thing you need to be careful about is that the transformation actually reflects something that may occur in the data. Depending on the country, there might be norms on who stands left and right in a wedding photo, so mirroring might create the bride and groom in the ‘wrong’ positions,” Andreas explains.
What’s Supposed to Happen… And What’s Not
It is possible to take the synthetic data model a step further. Certain developmental projects can forgo using 'real world' training data to kickstart their machine learning models and, instead, create fully artificial datasets for supervised learning.
Andreas points out that this is feasible in certain areas where previous research has put in place reliable models of what to expect in a given scenario.
“In physics, say particle physics, there are very good models of what's supposed to happen, and we can use computer simulations to create data with known outcomes,” says Andreas. “We would run a full simulation of an experiment with known parameters, and also simulate what sensor readings we would get from this experiment.
“Then we can build a machine learning model to estimate the (in the real world) unknown parameters from the sensor data. We can then apply this model trained on synthetic data on real sensor data, and determine the unknown parameters for this real data.”
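The workflow can be illustrated with a deliberately simple sketch. Everything in it is an assumption made for the example: the "experiment" is a made-up process whose sensor readings grow linearly with one unknown parameter `theta`, the "machine learning model" is a one-variable least-squares line, and the "real" readings are themselves simulated as a stand-in, because no real data is available here. It is not a description of any actual physics pipeline.

```python
import random
from statistics import mean
from typing import List, Tuple

TIMES = [1.0, 2.0, 3.0]  # hypothetical moments at which the sensor is read


def simulate(theta: float, rng: random.Random) -> List[float]:
    """Toy simulator: noisy sensor readings for a known parameter theta."""
    return [theta * t + rng.gauss(0.0, 0.1) for t in TIMES]


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


# 1. Run many simulations with known parameters (the synthetic training set).
rng = random.Random(0)
thetas = [rng.uniform(1.0, 5.0) for _ in range(500)]
features = [mean(simulate(theta, rng)) for theta in thetas]

# 2. Train a model that maps sensor readings back to the parameter.
slope, intercept = fit_line(features, thetas)

# 3. Apply the model to "real" readings. These are a stand-in generated
#    with a hidden value of 3.0; in practice they would come from the sensor.
real_readings = simulate(3.0, random.Random(1))
estimate = slope * mean(real_readings) + intercept
```

The estimate should land close to the hidden value of 3.0, but only because the simulator matches the process that produced the "real" readings, which is exactly the condition Andreas describes.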
Again, it is with the unpredictability of human behaviour that potential limitations in synthetic data can crop up. Andreas explains that: “In physics, we often know a model. In situations that involve humans, like social sciences or customer behaviour, we usually don't have a good model. There is not really a good model that can accurately create synthetic customers with known properties and known behaviours.”
In most cases it is therefore prudent to collect some degree of 'real' data in order to validate the machine learning model.
Creating Fake Data
The production of synthetic data can be taken another step further by actually creating a simulated environment in which a reinforcement learning algorithm can operate, and therefore generate data streams based on its actions.
The key issue is the complexity of the simulated environment that is needed to train the algorithm. A series of well-publicised matches between champion Go players and artificial intelligence programs has resulted in comprehensive victories for the latter. Andreas points out that the relative simplicity of a Go board allowed researchers to create a simulated version for their programs to train on.
“Simulating a Go board is really very easy,” says Andreas. “By having the agent play itself, the researchers were also able to simulate the opponent (this was probably not that easy to actually make work). Given this 'simulation', the agent can play many more games than would be possible in the real world (it needs millions) in a very short time, and can become very good.”
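To give a feel for how a simulated game turns into a data stream, here is a toy sketch. It makes heavy assumptions: a far simpler game than Go is used (a single-pile take-away game in which players remove one to three stones and whoever takes the last stone wins), and both sides play at random rather than learn. It only shows the data-generation side of self-play, not the approach used by any actual Go program; `self_play_game` and `generate_dataset` are hypothetical names.

```python
import random
from typing import List, Tuple

# (stones before the move, stones taken, 1 if the mover went on to win else 0)
Record = Tuple[int, int, int]


def self_play_game(start: int, rng: random.Random) -> List[Record]:
    """Play one game in which a random policy plays against itself."""
    stones, player = start, 0
    moves = []  # (player, stones before the move, stones taken)
    while stones > 0:
        taken = rng.randint(1, min(3, stones))
        moves.append((player, stones, taken))
        stones -= taken
        player = 1 - player
    winner = moves[-1][0]  # whoever took the last stone wins
    return [(before, taken, int(who == winner)) for who, before, taken in moves]


def generate_dataset(games: int, start: int = 10, seed: int = 0) -> List[Record]:
    """Simulate many games and collect every move as a labelled example."""
    rng = random.Random(seed)
    data: List[Record] = []
    for _ in range(games):
        data.extend(self_play_game(start, rng))
    return data


dataset = generate_dataset(1000)
```

A thousand simulated games yield thousands of labelled moves almost instantly. A learning agent would replace the random choice with its own policy and update that policy from the recorded outcomes.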
While the work around Go is impressive, when the necessary simulation becomes more complex than a game board, the implications of relying on the generation of synthetic data quickly become apparent.
Chelsea Finn is a PhD candidate at Berkeley, and will be joining the faculty at Stanford in 2019. Specialising in machine learning and its intersection with robotic perception and control, she echoed Andreas's view that research areas like self-driving cars are potentially compatible with synthetic data use:
“If you want a system to learn how to drive,” Chelsea explains, “you don't want to have the system experience going through a major crash – learning what a crash is like, that it is bad and how to avoid it. You don't want it to drive off a cliff or learn how to drive off cliffs! This is a setting where I think that having at least a mental model of how the world works – this may not be synthetic – and what would happen if you drive towards the cliff can be helpful.”
Chelsea points out though that the logistics of creating these more complex simulated environments for machine learning processes is a challenge in itself. She notes that “if you have a very simple setting like you want a robot to pick up an object and to learn how to grasp a variety of different objects, it's relatively simple to set up that environment in simulation and to import a lot of different objects.
“But, ultimately we want robots and we want machine learning systems to operate in very complex real world environments, and creating those synthetic environments is a huge engineering burden.”
A Useful Tool
Aside from these specialised limitations, Chelsea cites examples where synthetic data shows itself to be tremendously useful. “I think that graphics engines are very good, and I think that, for example, getting synthetic images of objects and using that to augment training data is also good.”
The potential for synthetic data usage is clear across numerous applications, but it is not a universal fix-all. As current computing frameworks stand, we are not at the stage where the complexities of real-world environments can be simulated with the ease and accuracy needed to make synthetic data feasible by itself.
“I think that people might take for granted the fact that for these simple settings it was pretty easy to generate synthetic data and it's pretty easy to generate synthetic environments,” Chelsea says. “However, as you increase the complexity, it's going to be significantly more challenging to create those environments and at some point it would become unfeasible.”
As machine learning and artificial intelligence development continues to progress, the complementary relationship of ‘real’ and synthetic data is one that is bound to continue. When faced with high computational costs and questions of fidelity for synthetic data usage, as Chelsea says, “Without a doubt, the real world experience is something that you can continuously rely on.”
When it comes to success in using synthetic data, a perfect example is the Berlin startup Spil.ly. They were creating an augmented reality app that transformed users' bodies by means of creative filters, similar to Snapchat and Instagram's selfie options. In order to make it work, the developers needed to use machine learning algorithms to track human bodies in videos.
With an ambitious product and a lack of resources available to collect the necessary tens or hundreds of thousands of hand-labelled images, the team began generating their own images. They used techniques similar to those used in the production of video games and movie graphics.
The digital humans that resulted from this process weren't necessarily lifelike, but they were sufficient for the algorithm to learn, and the team ended up with around 10 million images.
Illustrations by Kseniya Forbender
To contact the editor responsible for this story:
Margarita Khartanovich at firstname.lastname@example.org