By Dr. Muckai Girish

(MuckAI Girish is the co-founder and CEO of the generative AI company Rockfish Data, which is developing a state-of-the-art synthetic data workbench for data scientists. The views in this column are his own.)
As we wade through this incredible era dominated by Artificial Intelligence (AI), we can’t help but notice that not everything is perfect. In particular, bias is inherent in the datasets we rely on to make AI transform everything, and that bias can end up upending our lives. AI models in general, and Generative AI (GenAI) models in particular, are trained on available datasets. If that data turns out to be biased, then every prediction, classification or recommendation the model makes could be biased as well, with a massive bullwhip effect downstream. Data remains a fundamental bottleneck to businesses and consumers realizing the full potential of AI’s expected benefits.
We can get a ride in just a few minutes with a tap on the phone; we can get almost anything delivered soon after we think of it. Data, however, is something we find lacking almost always, and this is especially true for enterprises. Though businesses collect a dizzying amount of data, it is increasingly hard for teams to obtain the specific data they need for applications such as training AI/ML models, testing software, sharing data with collaborators and creating digital twins. This is driven primarily by data sparsity and data-sharing constraints. Unfortunately, data scientists make do with whatever data they can get their hands on, which is often heavily biased.
Bias in data can arise from many factors, including response/activity bias, selection bias, system drift, societal bias and omitted-variable bias (see https://towardsdatascience.com/survey-d4f168791e57).
Bias can be intended or unintended. The key is to find ways to recognize and tackle unintended bias. Intended bias, if driven by malicious intent, may require regulations and compliance frameworks, and is beyond the scope of this article. Bias can also reside in the model or in the data. Bias in models can be mitigated by tweaking the model itself; the more interesting and difficult problem is bias in the data, and that is what we look at in this article.
The first step in dealing with bias is to recognize it. One should look at the input data to see whether it covers the various scenarios at a reasonable level – in other words, data visualization. As any above-average statistician would say, “always plot your data first”. One should follow this up with statistical analysis. For example, we can study the output of the AI model and compare it against the expected or normal outcome. Another way is to feed a curated subset of data into the model, observe the outcome and compare it to what should have been expected. Realizing that there is bias, and which parts of the data exhibit it, is key to doing something about it.
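This recognition step can be sketched in a few lines of Python. The dataset, field names and threshold below are purely illustrative assumptions; the idea is simply to tabulate how often each group appears and flag groups whose share deviates sharply from what we would expect.

```python
from collections import Counter

# Hypothetical dataset: the "region" field of 1,000 records.
# In practice this would come from your own data pipeline.
records = (["urban"] * 900) + (["rural"] * 100)

counts = Counter(records)
total = sum(counts.values())
proportions = {k: v / total for k, v in counts.items()}

# Quick imbalance check (illustrative threshold): flag any group
# whose share deviates strongly from a uniform split.
expected = 1 / len(counts)
flagged = {k: p for k, p in proportions.items()
           if abs(p - expected) > 0.25}

print(proportions)  # {'urban': 0.9, 'rural': 0.1}
print(flagged)      # both groups deviate by 0.4 from the 0.5 expectation
```

A histogram or bar chart of `proportions` is the “plot your data first” step; the threshold check is a crude stand-in for a proper statistical test of representativeness.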
The second step is to measure bias. The best way to measure bias is to observe its effect on the model: the model’s outcome with respect to each subpopulation in the dataset is a key indicator of the impact of the data bias.
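One common way to quantify this effect, sketched below with made-up predictions and group labels, is to compare the model’s positive-outcome rate across subpopulations. The gap between the rates (sometimes called a demographic parity gap) is one simple measure of how unevenly the model treats the groups; the data here is illustrative only.

```python
# Hypothetical model predictions (1 = favorable outcome) paired
# with a sensitive attribute; names and values are illustrative.
preds  = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]

def selection_rate(preds, groups, group):
    """Fraction of favorable outcomes within one group."""
    members = [p for p, g in zip(preds, groups) if g == group]
    return sum(members) / len(members)

rate_a = selection_rate(preds, groups, "a")  # 4/5 = 0.8
rate_b = selection_rate(preds, groups, "b")  # 1/5 = 0.2
parity_gap = abs(rate_a - rate_b)            # 0.6

print(f"demographic parity gap: {parity_gap:.2f}")
```

A gap near zero suggests the two groups receive favorable outcomes at similar rates; a large gap, as in this toy example, points back to bias in the training data.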
Finally, we have to find a way to mitigate bias. Bias in datasets can be mitigated by using the right data, rather than all the data, and synthetic data is a powerful way to make this happen. Synthetic data tools allow a data scientist to produce an unbiased dataset: they can generate data conditioned on specific scenarios, tailored to exactly or almost exactly offset the bias in the data, whether that bias lies in one field or in several. Moreover, they allow the creation of more or fewer data points with a given characteristic, alleviating the inherent bias in the original dataset.
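A minimal sketch of the rebalancing idea is below. It oversamples the underrepresented group by duplication, using entirely made-up records; a real synthetic data tool would instead generate new, statistically similar records conditioned on the underrepresented scenario, rather than copying existing ones.

```python
import random

random.seed(0)

# Hypothetical imbalanced dataset (9 urban records, 1 rural).
data = ([{"region": "urban", "income": 70}] * 9
        + [{"region": "rural", "income": 40}])

rural = [r for r in data if r["region"] == "rural"]
urban = [r for r in data if r["region"] == "urban"]

# Conditional augmentation sketch: add records for the
# underrepresented group until both groups are equal in size.
needed = len(urban) - len(rural)
augmented = data + random.choices(rural, k=needed)

balance = sum(1 for r in augmented
              if r["region"] == "rural") / len(augmented)
print(f"rural share after augmentation: {balance:.2f}")  # 0.50
```

Duplication is the crudest form of this; the point of conditional synthetic generation is to add the same statistical weight without repeating identical rows.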
In addition to visualization and statistical analysis of the synthetic and augmented datasets, one can see the positive impact on the model by measuring and comparing the outcomes.
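That before-and-after comparison can be as simple as tracking a group-wise metric across the two training runs. The accuracy numbers below are illustrative assumptions, not measured results; the pattern to look for is a shrinking gap between the best- and worst-served groups.

```python
# Hypothetical group-wise accuracy before and after retraining
# on the augmented dataset; all numbers are illustrative.
before = {"urban": 0.92, "rural": 0.61}
after  = {"urban": 0.91, "rural": 0.84}

gap_before = max(before.values()) - min(before.values())
gap_after  = max(after.values()) - min(after.values())

print(f"accuracy gap before: {gap_before:.2f}")  # 0.31
print(f"accuracy gap after:  {gap_after:.2f}")   # 0.07
```

A smaller gap after augmentation, ideally without much loss on the majority group, is the signal that the mitigation actually worked.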
According to Allied Market Research, the global synthetic data market is expected to grow to $3.5B by 2031 at an impressive 35% CAGR. Thanks to massive advances in generative AI technologies, we can now generate synthetic data tailored to any needs. By being able to mitigate bias, we can ensure that GenAI technologies have a positive and lasting impact on our society.