João Victor - Synthetic Data Generation

Synthetic Data Generation


João Victor · Jun 18, 2021

One of the topics that has been heating up the Machine Leaning(ML)/Artificial Intelligence(AI) market lately is the Synthetic Data. But what exactly is that ? Synthetic Data is computer generated data, in order to get rid of the big problems that common data brings. Over the years the data is costing to much money and it's coming with the copyright problem, that is killing the companies. And worst of all, we don't have enough data to train ML/AI models.


All the data needs to be labeled and that's require manual process, that costs money, time and, furthermore, it is error prone. Getting back to the copyright problem a little, we have so much data on internet, but doesn't mean you can get them to train your model. Speaking for real? You are taking a serious risk doing this.


That's where the Synthetic Data comes in, you might be thinking: "But how can I do this?". It's simple, let's do it by steps, one of the famous strategies is the data augmentation: it consists in get real data and use it as basis to generate new data. But the central idea of synthetic data is: create it from scratch, it sounds difficult, but, now a days we have so much tools that can do it for us. Blender and Unity belong to these tools - the first one uses python as scripting language. With this two tools you may create very realistic images, in fact, it will simulate so many different cases that the data distribution will be large and really look like the real world. Below, you can see the comparison between the real world and the synthetic one. I will confess to you that my eyes are more pleased to see the synthetic world.

Stephan R. Richter, Hassan Abu AlHaija, and Vladlen Koltun.
"Enhancing Photorealism Enhancement" arXiv:2105.04619 (2021)

You can also see a video of BMW Factory Digital Twin to understand a bit more how it's work.

    Some synthetic data advantages:

    1. Automatic annotation.

    2. Control over the amount of examples and class balancing.

    3. Don't need to be related to real people (privacy).

    4. Copyright.

    5. Data democratization.

In conclusion, we already can say that's the common way to use data is in the past. The faster companies adapt to creating synthetic data, they will be saving money and time. This is the new "new".

Share


References



  • Hallison Paz. (2021). "Geração de Dados Sintéticos para Aprendizado de Máquina. Seminário especial do Visgraf" https://visgraf.github.io/syntheticlearning/