Abstract

The cost of creating large, accurately labeled datasets can challenge the pretraining of large multi-modal models, sometimes leading to the use of large datasets with noisy, machine-generated pseudo-labels. Some pretraining techniques may not effectively use the weak supervisory signal from these imperfect labels for certain downstream tasks. A system is described for iteratively pretraining a model using strong input masking. In this approach, a teacher model can generate pseudo-labels for a large dataset. A student model can then be trained to predict these labels from heavily masked inputs, for example, images with occluded patches and text with missing words. This process can be repeated, with the student model becoming the teacher for a subsequent iteration. The technique may improve a model's robustness to label noise and can be used to produce a shared model backbone for multiple tasks while potentially reducing reliance on large-scale, human-verified datasets.
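The iterative loop described above could be sketched as follows, assuming PyTorch. The model constructor, the patch-based masking function, the masking ratio, and the data loader are illustrative placeholders, not details from the original disclosure.

```python
# Minimal sketch, assuming PyTorch: teacher labels clean inputs, student learns
# those pseudo-labels from heavily masked inputs, and the student becomes the
# teacher for the next round. All names and hyperparameters are hypothetical.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


def mask_inputs(images: torch.Tensor, mask_ratio: float = 0.75) -> torch.Tensor:
    """Occlude a large fraction of 16x16 image patches (strong input masking)."""
    b, c, h, w = images.shape
    ph, pw = h // 16, w // 16
    keep = (torch.rand(b, 1, ph, pw, device=images.device) > mask_ratio).float()
    keep = F.interpolate(keep, size=(h, w), mode="nearest")
    return images * keep


def pretrain_iteratively(model_fn, unlabeled_loader, num_iterations=3, epochs=1):
    """Iterative pseudo-label pretraining with strong input masking."""
    teacher = model_fn()  # initial teacher, e.g., trained on a small clean set
    for _ in range(num_iterations):
        student = model_fn()
        opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
        teacher.eval()
        for _ in range(epochs):
            for images in unlabeled_loader:
                with torch.no_grad():
                    # Teacher produces machine-generated pseudo-labels on unmasked inputs.
                    pseudo = teacher(images).argmax(dim=-1)
                # Student must predict the same labels from heavily masked inputs.
                logits = student(mask_inputs(images))
                loss = F.cross_entropy(logits, pseudo)
                opt.zero_grad()
                loss.backward()
                opt.step()
        # The trained student becomes the teacher for the next iteration.
        teacher = copy.deepcopy(student)
    return teacher
```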

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
