The time is now for the synthetic data revolution

We’re going into the final sprint of the year. Time will fly by as we get stuck into the festive season, and before we know it, we’ll have a brand new year stretching out ahead of us. 

Not that I like to wish time away, but I can’t wait for 2022. That’s because it looks set to be a huge year for synthetic data. Forbes included it in a list of the five biggest data science trends for 2022, while Gartner put synthetic data at number one in its top strategic predictions for 2022 and beyond. 

If we’re talking about the passing of time, going from founding Mindtech four years ago to the brink of a synthetic data revolution feels like it’s happened in the blink of an eye. 

It seems fitting, though, given that time, or rather time-saving, is one of the driving forces behind synthetic data development. Did you know it takes a long and laborious 20 weeks to gather and annotate the 100,000 real-world images required to train a visual AI system to see and understand the world as a human does? 

That’s roughly 80 percent of a machine learning project’s time spent just preparing data for something novel, like training a system to pick out a lost child in a busy shopping mall. It takes even longer to help a delivery robot service safely navigate spaces where children are playing, leaving little room for network development or for gleaning insights from the data.

It’s time for a sea change in the way we view data and how we train AI. Synthetic data derived from computer-generated images and video is easily as good as, and sometimes better than, data that comes from real-world images, and it can shrink the process of gathering and analyzing data from months to hours. 

All this, without any impact on the AIs being trained. That’s because to an AI, there is no ‘real’ or ‘synthetic’; there’s only the data we give it. 

The technology is ready; it’s time for us humans to stop seeing synthetic data as secondary, and start to understand the opportunity in our hands to scale the AI industry exponentially.

Challenging the ‘Big Four’

It doesn’t matter if they’re start-ups, scale-ups, or global enterprises; teams trying to secure the high-quality images required to train a new AI system will all be up against the ‘Big Four’: Apple, Amazon, Facebook, and Google. Google’s engineers alone have access to more than 4 trillion images stored in Google Photos. 

These major players tend to restrict access to this wealth of potential training data because it hands them a competitive advantage when developing new products and monetizing their datasets. Even they’re not totally immune to the issues the industry faces, though: searching through trillions of images to find the relevant ones is non-trivial, and once found, they still need annotating. 

Every company also has to navigate the challenge of more readily enforced data privacy regulations, including the EU’s General Data Protection Regulation (GDPR). Just ask Facebook/Meta, which recently announced it will delete its facial recognition system and database amid continuing regulatory uncertainty around the technology. The move comes shortly after the release of the damning ‘Facebook papers’ and a number of lawsuits over the technology.

What companies are left with is a scarcity of real-world visual data that only the very largest tech companies can work around, reducing competition and, ultimately, the quality of the AI systems on the market.

Can synthetic data level the playing field? 

If we want the best AI, then we need a competitive landscape filled with businesses of all sizes spurring each other on. Realizing that vision requires three things: democratized access to training data, training data that meets privacy regulations, and data that can be annotated faster. 

Synthetic data meets these three demands. It gives machine learning engineers the ability to create photo-realistic 3D worlds and extract unlimited data to fuel and train their visual AI models. 

They can use synthetic data creation platforms built for AI training to generate the 100,000 high-quality images needed in a couple of days, instead of months. 

And because the data is computer-generated, there are no privacy concerns, while biases that exist in real-world visual data can be eliminated too. In the virtual world, it is much easier to represent different ethnicities, age groups, and variation in clothing color or sex. And as data changes over time, it’s easier to reflect this in a virtual environment and avoid data drift degrading an AI model’s performance.

Enhanced accuracy and flexible training scenarios 

Findings show that synthetic data enhances a machine learning model’s accuracy too. Last year, McKinsey revealed that 49 percent of the highest-performing AI companies are already using synthetic data to train their AI models.

Along with this, hundreds of thousands of corner cases or scenarios (camera location modeling, different lighting, and other variables) that would be hard to create in the real world can be quickly and easily created in a 3D virtual environment. Extreme ‘nightmare’ scenarios, such as gun crime, can also be simulated risk-free to create the kind of data that’s difficult to come by from real-world sources.

We’re already seeing multiple uses: in healthcare to train machines to monitor patients recovering from surgery; in security and safety systems to detect suspicious objects or unusual patterns of behavior inside shopping centers or sports arenas; or in training delivery drones that need to understand the world around them.

Real-world data can’t be totally counted out yet; research suggests good training results come from datasets made up of 90 percent synthetic data and 10 percent real-world data. But we’re getting close: the accuracy of an AI model trained using 80 percent synthetic data is close to that of one fueled entirely by data taken from the real world, according to Deloitte.

Gartner recently predicted that 60 percent of data used for AI and data analytics projects will be synthetic by 2024, and that by 2030, synthetic data will have completely overtaken real data in AI models. Eight years might seem some way off but, take it from me, that time will fly by.

Steve Harris, CEO, Mindtech
