The following paper (and video) from CIDR 2022 comes from the Stanford AI group. It's a short paper laying out data management opportunities for foundation models. It doesn't offer solutions, but the list of challenges is worth walking through.
What is a foundation model?
AI has been shifting from a model-centric world, where data scientists carefully engineer and label features in each domain to train a model, to a data-centric one, where massive amounts of unlabeled, varied data are used to train large models that can be applied to many tasks (e.g., GPT-3).
New Challenges with foundation models
This brings several new challenges. The first is data integration: repeatably ingesting large volumes (petabytes) of multi-modal (text, images, etc.) structured and unstructured data to keep training these models. Tracking the provenance of all that data is a challenge in itself.
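As a rough illustration of the provenance side, here is a minimal sketch (the schema and the source URL are hypothetical, not from the paper) of recording where each ingested batch came from, its modality, and a content hash so a training run can later be audited or reproduced:

```python
import hashlib
import time

def provenance_record(source, modality, payload: bytes):
    """Record enough metadata to trace a training batch back to its source.

    The field set here is an assumption for illustration; real pipelines
    track far more (licenses, transformations, versions, etc.).
    """
    return {
        "source": source,                                   # e.g. a crawl or bucket path
        "modality": modality,                               # "text", "image", ...
        "sha256": hashlib.sha256(payload).hexdigest(),      # content fingerprint
        "ingested_at": time.time(),                         # ingestion timestamp
    }

# Hypothetical example: one text document from a (made-up) crawl location.
rec = provenance_record("s3://corpus/web-crawl", "text", b"example document")
```

A content hash like this is the cheapest piece of provenance: it lets you detect duplicates across sources and confirm that a later retraining run saw byte-identical data.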
The second is data monitoring: given the large amounts of unlabeled data used to train these models, it's hard to correlate that data with model performance drift, and to help engineers pinpoint the slices of data responsible for performance problems.
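One common building block for this kind of monitoring (my illustration, not something the paper prescribes) is to summarize each unlabeled example with a single score, say an embedding norm or a model loss, and compare the score distribution of a recent window against a reference window. A minimal sketch using a population stability index (PSI) over histogram bins:

```python
import math
from collections import Counter

def psi(reference, recent, bins=10, eps=1e-6):
    """Population stability index between two score samples.

    Near 0 when the distributions match; grows as they diverge.
    """
    lo = min(min(reference), min(recent))
    hi = max(max(reference), max(recent))
    width = (hi - lo) / bins or 1.0  # guard against zero range

    def fractions(sample):
        # Bucket each score, clamp the top edge into the last bin,
        # and add eps so the log below is always defined.
        counts = Counter(min(int((x - lo) / width), bins - 1) for x in sample)
        return [counts.get(b, 0) / len(sample) + eps for b in range(bins)]

    ref, rec = fractions(reference), fractions(recent)
    return sum((b - a) * math.log(b / a) for a, b in zip(ref, rec))

# Identical distributions score near 0; a shifted window scores much higher.
stable = psi([i / 100 for i in range(100)], [i / 100 for i in range(100)])
shifted = psi([i / 100 for i in range(100)], [0.5 + i / 200 for i in range(100)])
assert stable < 0.1 < shifted
```

The hard part the paper points at is not computing a statistic like this, but attributing a drift alarm back to the specific slices of unlabeled data that caused it.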
The last is model patching. Compared to maintaining many individual models, engineers now patch a single large model to fix undesirable behaviors without introducing regressions on other tasks. Since the model architecture is essentially fixed, work is needed on suggesting fixes to the data rather than to the model itself.
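To make the "fix the data, not the model" idea concrete, here is a toy sketch (every name here is hypothetical; `retrain` is a stand-in for fine-tuning, not a real API): a patch is a set of corrected examples, and it is only accepted if a regression suite covering other tasks still passes.

```python
def retrain(examples):
    """Toy stand-in for fine-tuning: later examples override earlier ones."""
    table = dict(examples)
    return lambda x: table.get(x, "<unknown>")

def patch_model(train_data, patch, regression_suite):
    """Apply a data patch, keeping it only if no other task regresses."""
    candidate = retrain(train_data + patch)
    if all(candidate(x) == y for x, y in regression_suite):
        return candidate           # patch accepted
    return retrain(train_data)     # patch rejected: it broke other behavior

# One wrong label in the base data gets corrected by a data patch,
# while the regression suite guards the model's other behavior.
base = [("2+2", "4"), ("capital of France", "Rome")]
fixed = patch_model(base,
                    patch=[("capital of France", "Paris")],
                    regression_suite=[("2+2", "4")])
assert fixed("capital of France") == "Paris"
assert fixed("2+2") == "4"
```

The open question the paper raises is how to automate the first step: identifying which training data to patch for a given bad behavior, rather than hand-writing corrections as above.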
Conclusion
Overall this is a pretty short paper that doesn't really need a summary, but I think it's a good reminder that the ML industry is still early: a lot of the work is still evolving, and the ongoing engineering and research keeps improving how the industry operates.
There is more interesting work on how foundation models will impact not just data management but also applications and related tooling, which we will hopefully cover in future posts.