Training vs Generating Synthetic Data

Dr. Muckai Girish

[Dr. Muckai Girish is co-founder & CEO of Rockfish Data – https://www.rockfish.ai]


Our planet is abuzz with generative AI these days, and the moment is being referred to as the Fourth Industrial Revolution. It feels as though hardly a conversation goes by without a mention of generative AI tools such as ChatGPT or Gemini.

Though this has unleashed a game-changing wave across the industry, data remains a fundamental bottleneck in our ability to develop, train, and deploy machine learning models, and in letting businesses and consumers realize the full benefits these models promise.

This is where synthetic data comes to the rescue. Privacy-compliant, high-fidelity synthetic data has the potential to propel the generative AI industry forward by removing these data constraints. According to Allied Market Research, the global synthetic data market is expected to grow to $3.5B by 2031, an impressive 35% CAGR.

Thanks to massive advances in generative AI technologies, we can now generate data that closely mimics the statistical properties of real data. These models can also generate additional scenarios, which are exceptionally useful. By construction, and with the application of privacy techniques, synthetic data can offer privacy guarantees across the entire spectrum of needs, for regulated and unregulated industries alike.

State-of-the-art generative AI models for synthetic data involve a training phase followed by a generation phase, logically separating the two functions. The available source data is used to train a model, and the trained model is then used to generate synthetic data. GANs (Generative Adversarial Networks), Transformers, and diffusion models have all proven effective, depending on the data type and enterprise needs.
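To make the two-phase split concrete, here is a minimal sketch in which a toy multivariate Gaussian stands in for a real GAN, Transformer, or diffusion model; the `fit`/`sample` names and the tabular data layout are illustrative assumptions, not any particular product's API.

```python
import numpy as np

def fit(source_data: np.ndarray) -> dict:
    """Training phase: estimate model parameters from the real data.
    A multivariate Gaussian stands in for a heavier generative model."""
    return {
        "mean": source_data.mean(axis=0),
        "cov": np.cov(source_data, rowvar=False),
    }

def sample(model: dict, n_rows: int, seed: int = 0) -> np.ndarray:
    """Generation phase: draw synthetic rows from the trained model only;
    the original source data is never touched here."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(model["mean"], model["cov"], size=n_rows)

# Train once on the available source data...
real = np.random.default_rng(42).normal(size=(10_000, 4))
model = fit(real)

# ...then generate as much synthetic data as needed, later and elsewhere.
synthetic = sample(model, n_rows=50_000)
```

The key design point is that only `fit` ever sees the source data; everything downstream works from the model alone.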

Much like any deep learning model, training these models effectively and in a reasonable time often requires GPUs, albeit not the high-end GPUs that are currently in unprecedented demand; mid-range GPUs are typically the most cost-effective choice for this phase. Once a model is trained on a specific set of source data, the resulting model and its parameters represent that data and can be used to generate synthetic data. For instance, if we use a week's e-commerce transactions to train the generative model, the resulting model parameters represent that week's data, and we can use them to generate synthetic data for that week. In fact, we can generate as much synthetic data as needed. Generation consumes roughly an order of magnitude less computing resources than training.
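Continuing the toy sketch above, the snippet below applies the same idea to a hypothetical week of e-commerce transactions: training summarizes the full week into a small parameter set, after which any volume of synthetic data can be drawn without revisiting the source rows. The column layout and row counts are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)

# A hypothetical week of e-commerce transactions:
# columns = order value, item count, hour of day, checkout latency.
week_of_transactions = rng.normal(loc=[40.0, 2.0, 14.0, 0.3],
                                  scale=[15.0, 1.0, 5.0, 0.1],
                                  size=(100_000, 4))

# Training phase: the whole week is summarized by a small parameter set.
params = {
    "mean": week_of_transactions.mean(axis=0),
    "cov": np.cov(week_of_transactions, rowvar=False),
}

# Generation phase: any volume can be drawn from the parameters alone --
# here, ten times more synthetic rows than the source week contained.
synthetic_week = rng.multivariate_normal(params["mean"], params["cov"],
                                         size=1_000_000)
```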

If we do this for every week, we end up with a model for every week, assuming the model parameters are stored. It turns out that storing the data in this form has a number of benefits: reduced storage, lower compliance risk, reduced risk of data leakage, stronger data security, and so on. Moreover, whenever needed, the enterprise can generate synthetic data from any of the stored models. Appropriate privacy postures can be adopted depending on the context and use case of the generated synthetic data.
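A minimal sketch of that storage pattern, reusing the toy Gaussian model from above: each week's source data is reduced to a small parameter file, and any stored week can later be re-expanded into synthetic data on demand. The directory layout, week labels, and file format are illustrative assumptions.

```python
import os
import numpy as np

rng = np.random.default_rng(0)

def store_week(week_label: str, transactions: np.ndarray,
               directory: str = "models") -> str:
    """Training phase per week: keep only the fitted parameters, not the raw rows."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"{week_label}.npz")
    np.savez(path,
             mean=transactions.mean(axis=0),
             cov=np.cov(transactions, rowvar=False))
    return path

def generate_week(week_label: str, n_rows: int,
                  directory: str = "models") -> np.ndarray:
    """Generation phase, whenever needed: sample from the stored parameters."""
    params = np.load(os.path.join(directory, f"{week_label}.npz"))
    return np.random.default_rng().multivariate_normal(
        params["mean"], params["cov"], size=n_rows)

# Store one small model per week instead of the raw transactions.
for week in ["2024-W01", "2024-W02", "2024-W03"]:
    raw = rng.normal(size=(500_000, 4))          # stand-in for that week's transactions
    path = store_week(week, raw)
    print(week, os.path.getsize(path), "bytes")  # kilobytes vs. megabytes of raw rows

# Later, regenerate synthetic data for any stored week on demand.
synthetic_w2 = generate_week("2024-W02", n_rows=100_000)
```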

One can also stitch together a synthetic dataset that represents a specific set of properties and events, and create scenarios and variations in the data that are often needed for effective training of ML models.
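One way such stitching and scenario creation could look in the toy setup above: combine draws from several stored weekly models, then perturb the parameters to create a variation such as a simulated demand spike. The mixing proportions and spike factors are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_from(params: dict, n: int) -> np.ndarray:
    """Draw n synthetic rows from a stored weekly parameter set."""
    return rng.multivariate_normal(params["mean"], params["cov"], size=n)

# Two hypothetical stored weekly models (parameters only, as above).
week1 = {"mean": np.array([40.0, 2.0, 14.0, 0.3]),
         "cov": np.diag([225.0, 1.0, 25.0, 0.01])}
week2 = {"mean": np.array([55.0, 3.0, 15.0, 0.4]),
         "cov": np.diag([300.0, 1.5, 20.0, 0.02])}

# Stitch a dataset that is 70% week-1 traffic and 30% week-2 traffic...
stitched = np.vstack([sample_from(week1, 70_000), sample_from(week2, 30_000)])

# ...and add a scenario variation: a demand spike with doubled order values.
spike = dict(week1, mean=week1["mean"] * np.array([2.0, 1.5, 1.0, 1.0]))
scenario = np.vstack([stitched, sample_from(spike, 10_000)])
```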

By decoupling the training and generation phases of the synthetic data process, one can make effective use of the available models and computing resources and achieve significantly better outcomes for the target use cases.

 
