Synthetic data to train machine learning models may be key in building stakeholder trust in AI

Andriy Onufriyenko/Getty Images

Companies can’t avoid working with data, but management of that data can pose serious challenges.

Customer and other personal data keep escaping, courtesy of breaches that surged 78% last year in the U.S., hitting a record 3,205. Total victims? An eye-popping 353 million.

And don’t forget the trust issues created by using real-world data to train AI. That hasn’t worked out so well for accident-prone autonomous cars, or for reliably racist chatbots.

Part of the solution? Synthetic data.

To be clear, synthetic data isn’t fake. In fact, it can be better than the real thing. Let me explain, with help from executives at a pair of synthetic data providers.

Synthetic data falls into two buckets, says Yashar Behzadi, founder and CEO of San Francisco–based Synthesis AI.

Structured data is what you find in database tables from industries like banking and health care. Let’s say a hospital doesn’t want to expose any patient data. “What you can essentially do is create a copy of that data that has all the statistical properties, but none of the actual information or data,” Behzadi says. “That allows folks to then work on it or share it and take it outside of specific safety bounds.”
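
To make the "same statistics, none of the actual records" idea concrete, here is a minimal sketch in Python. It is not how Synthesis AI, MOSTLY AI, or any other vendor works; commercial tools use far richer generative models. The table, the column meanings (age and systolic blood pressure), and the function names are all hypothetical, and the sketch assumes two numeric columns that are roughly jointly Gaussian. Sampling from fitted statistics like this does not by itself guarantee privacy.

```python
import math
import random
import statistics

def fit_gaussian(rows):
    """Summarize two numeric columns: means, standard deviations, correlation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    rho = max(-1.0, min(1.0, cov / (sx * sy)))  # clamp against rounding error
    return mx, my, sx, sy, rho

def sample_synthetic(params, n, rng):
    """Draw n new rows from a bivariate Gaussian with the fitted statistics."""
    mx, my, sx, sy, rho = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 into the second column so the correlation is preserved.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho * rho) * z2)
        rows.append((x, y))
    return rows

# Hypothetical "real" patient table: (age, systolic blood pressure).
real_rows = [(34, 118), (51, 131), (46, 127), (62, 142),
             (29, 115), (70, 150), (55, 135), (41, 122)]

params = fit_gaussian(real_rows)
synthetic_rows = sample_synthetic(params, 5000, random.Random(0))

# The synthetic table should show similar summary statistics.
print("real:     ", fit_gaussian(real_rows))
print("synthetic:", fit_gaussian(synthetic_rows))
```

The synthetic rows are freshly drawn numbers rather than copies of any patient's record, yet an analyst would see roughly the same averages, spread, and relationship between the columns.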

Then there’s unstructured data—images and video used by applications based on computer vision. That’s where Synthesis plays, using CGI and generative AI to create data that helps train the systems behind technologies such as identity verification, extended reality (XR), and driver monitoring.

For example, if a facial recognition model is trained without a balanced dataset, it might have biases against dark-skinned or older people. To avoid that, Synthesis builds digital humans and uses them to generate high-quality data. “We can easily represent every ethnicity, every age, every different demographic to ensure our systems are completely bias-free,” says Behzadi, whose customers include Fortune 500 companies. “If it’s synthetic, it’s completely privacy-compliant as well.”

Alexandra Ebert is chief trust officer of Vienna-headquartered MOSTLY AI, which provides AI-generated, structured synthetic data for banks, insurers, telecoms, and health care companies. “They have plenty of existing data, but of course, it’s privacy-sensitive,” says Ebert, who runs an online course on synthetic data. “What they want to use synthetic data for is to basically anonymize it so that they’re out of scope from privacy laws.”

One of MOSTLY’s clients, bank Erste Group, likes synthetic data because it’s considered superior to traditional anonymization methods, which can leave ways to piece the original data back together.

Synthetic data is taking off. By this year, 60% of the data used to train AI models will be synthetic, Gartner has predicted. That’s a huge jump from just 1% in 2021.

With help from generative AI, it’s now possible to create sophisticated simulations using unstructured synthetic data, Behzadi notes. Because that data is easier and cheaper to generate than real data, some applications will explode, he reckons. Rather than spend billions deploying fleets, autonomous vehicle makers can build simulations that include so-called edge cases, like a child running in front of a car. Another use: creating digital doubles of robots.

Ebert highlights data augmentation—using a synthetic data generator to create information that wasn’t in the original data set. For instance, a bank could take that approach to better understand fraud cases.
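
As a rough illustration of augmentation, the Python sketch below pads out a handful of fraud records by interpolating between random pairs of them. It is loosely in the spirit of the SMOTE oversampling technique, simplified here to pick random pairs rather than nearest neighbors, and it is not a description of MOSTLY AI's generator. The records, the column meanings (transaction amount and number of transactions in the past hour), and the function name are hypothetical.

```python
import random

def augment_minority(minority_rows, n_new, rng):
    """Create n_new synthetic rows by interpolating between random pairs of rows."""
    if len(minority_rows) < 2:
        raise ValueError("need at least two rows to interpolate between")
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority_rows, 2)  # two distinct real records
        t = rng.random()                     # position along the line from a to b
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

# Hypothetical fraud records: (amount, transactions in the past hour).
fraud_rows = [(950.0, 2.0), (1200.0, 3.0), (870.0, 1.0), (1500.0, 4.0)]

new_fraud = augment_minority(fraud_rows, 20, random.Random(42))
print(len(fraud_rows), "real fraud rows,", len(new_fraud), "synthetic ones")
```

Each synthetic row sits somewhere between two real fraud cases, so a model sees more examples of the rare class. The trade-off is that interpolation cannot invent fraud patterns that lie outside the range of the originals.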

She also sees a chance for companies to democratize data by launching internal synthetic data hubs. The goal: “to go from synthetic data as a resource that belongs to the high priests of data science within an organization to data that is used by everyone.”

That would be real progress.

Nick Rockel
nick.rockel@consultant.fortune.com

This story was originally featured on Fortune.com