About Global AI Media Alliance
The Global AI Media Alliance (GAIMA) was established through mutual trust among its founding members. As the world comes to rely on AI, cross-platform content creation, media production and broadcasting, and business models will undergo rapid and far-reaching change.
GAIMA aims to link businesses and industries involved with AI media, including:
- AI content application technologies
- Media production
- Brand marketing
GAIMA will focus on AI production and broadcasting, media content, and brand marketing. Its work includes:
- Building a fast, active interaction platform for the AI media industry
- Promoting reliable AI media content and services
- Organizing AI-related events
AI media covers several topics, including text-to-image and text-to-video technologies.
Text-To-Image
A text-to-image model generates an image from text input. Before the rise of machine learning, text-to-image tools worked by arranging existing images related to the text input into a collage.[1][2] Machine learning-based text-to-image models have been developed since the mid-2010s, but they gained prominence in 2022 for producing outputs comparable to real human-made art. Examples of text-to-image models are OpenAI’s DALL-E 2, Google Brain’s Imagen, and Stability AI’s Stable Diffusion.
Many text-to-image model architectures have been developed over the years, but they are generally composed of two stages:
- Text encoding: usually done with transformer models, though it can also be done with a recurrent neural network such as a long short-term memory (LSTM) network.
- Image generation: this step has generally used conditional generative adversarial networks, though diffusion models have recently become a popular method as well. Rather than directly training a model to output a high-resolution image from text input, a popular technique is to train a model to generate low-resolution images and then use auxiliary deep learning models to upscale them and fill in finer details.
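As a rough illustration of this two-stage structure, the following is a minimal sketch in PyTorch: a small transformer text encoder pools a prompt into an embedding, and a toy GAN-style conditional generator maps noise plus that embedding to a low-resolution image. The class names, layer sizes, and pooling choice are assumptions made for illustration and do not correspond to any particular published model; the upscaling stage is omitted.

```python
# Minimal sketch of the two-stage text-to-image structure described above (assumes PyTorch).
# TextEncoder and ConditionalGenerator are illustrative names, not from any real model.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encodes token ids into a single text embedding with a small transformer."""
    def __init__(self, vocab_size=10000, dim=256, num_layers=2, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids):
        x = self.embed(token_ids)        # (batch, seq_len, dim)
        x = self.encoder(x)              # contextualized token embeddings
        return x.mean(dim=1)             # pool to one vector per prompt

class ConditionalGenerator(nn.Module):
    """Maps a noise vector plus the text embedding to a low-resolution image (GAN-style)."""
    def __init__(self, noise_dim=64, text_dim=256, out_res=64):
        super().__init__()
        self.out_res = out_res
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 3 * out_res * out_res),
            nn.Tanh(),
        )

    def forward(self, noise, text_emb):
        x = torch.cat([noise, text_emb], dim=-1)   # condition generation on the prompt
        return self.net(x).view(-1, 3, self.out_res, self.out_res)

# Usage: encode a (toy) tokenized prompt, then generate a 64x64 image conditioned on it.
tokens = torch.randint(0, 10000, (1, 12))          # pretend-tokenized text prompt
text_emb = TextEncoder()(tokens)
image = ConditionalGenerator()(torch.randn(1, 64), text_emb)
print(image.shape)  # torch.Size([1, 3, 64, 64])
```

In a full system, separately trained super-resolution models would then upscale this low-resolution output and fill in finer detail, as described above.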
Text-to-image models are trained on large datasets of text-image pairs that are often scraped from the web. Commonly used datasets include:
- COCO (Common Objects in Context): released by Microsoft in 2014, COCO consists of around 123,000 images representing a diversity of objects. Each image has five captions written by human annotators.
- Oxford-102 Flowers and CUB-200 Birds: smaller datasets of around 10,000 images each, with topics limited to flowers or birds.[3]
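To make the text-image pair format concrete, here is a minimal sketch of a PyTorch-style dataset that flattens multi-caption annotations (as in COCO’s five captions per image) into individual (image, caption) training pairs. The JSON layout and field names ("image_path", "captions") are hypothetical and not the actual COCO annotation format.

```python
# Sketch of a text-image pair dataset; the annotation file layout is an assumption.
import json
from torch.utils.data import Dataset
from PIL import Image

class TextImagePairs(Dataset):
    """Yields (image, caption) pairs, one entry per caption of each image."""
    def __init__(self, annotation_file, transform=None):
        with open(annotation_file) as f:
            records = json.load(f)   # [{"image_path": ..., "captions": [...]}, ...]
        # Flatten so each of an image's captions becomes its own training example.
        self.pairs = [(r["image_path"], c) for r in records for c in r["captions"]]
        self.transform = transform

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        path, caption = self.pairs[idx]
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, caption

# Usage (hypothetical file): dataset = TextImagePairs("annotations.json")
```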
In 2022, Google Brain reported positive results with its Imagen model from using, as the text encoder, a large language model trained separately on a text-only corpus and subsequently frozen. This marks a departure from the standard approach in text-to-image model training.[4]
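The idea can be sketched as follows: load a text encoder pretrained on a text-only corpus, freeze its weights, and use its embeddings to condition the image-generation stages, which remain trainable. This sketch uses Hugging Face’s T5 encoder with the small "t5-small" checkpoint as a stand-in (the Imagen paper uses a much larger frozen T5 encoder); it is not Google’s actual training code.

```python
# Sketch of the "frozen text encoder" idea, assuming the Hugging Face transformers library.
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
text_encoder = T5EncoderModel.from_pretrained("t5-small")

# Freeze every parameter of the language model; only the image stages would be trained.
for param in text_encoder.parameters():
    param.requires_grad = False
text_encoder.eval()

with torch.no_grad():
    inputs = tokenizer("a corgi playing a flute", return_tensors="pt")
    text_embeddings = text_encoder(**inputs).last_hidden_state  # (1, seq_len, d_model)

# text_embeddings would then condition the trainable diffusion and upscaling models.
print(text_embeddings.shape)
```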
Text-To-Video
Text-to-video models evolved from text-to-image models, which are modified so that they can generate videos from text prompts. Examples are CRAFT, Microsoft’s GODIVA[5], Google’s Imagen Video, and Meta’s Make-A-Video[6].
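In the simplest terms, extending a text-conditioned image generator to video means producing a stack of frames instead of a single image. The toy sketch below adds a time dimension to the generator’s output; it is purely illustrative, and none of the models listed above work exactly this way.

```python
# Toy sketch: a text-conditioned generator that emits a sequence of frames.
import torch
import torch.nn as nn

class TextToFrames(nn.Module):
    def __init__(self, text_dim=256, noise_dim=64, num_frames=16, res=32):
        super().__init__()
        self.num_frames, self.res = num_frames, res
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 2048),
            nn.ReLU(),
            nn.Linear(2048, num_frames * 3 * res * res),
            nn.Tanh(),
        )

    def forward(self, noise, text_emb):
        x = torch.cat([noise, text_emb], dim=-1)
        # Reshape the flat output into (batch, frames, channels, height, width).
        return self.net(x).view(-1, self.num_frames, 3, self.res, self.res)

video = TextToFrames()(torch.randn(1, 64), torch.randn(1, 256))
print(video.shape)  # torch.Size([1, 16, 3, 32, 32])
```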
Text-to-video models are trained on datasets of video clips, with each clip labeled with a particular topic. Datasets that have been used include:
- HowTo100M[7]: a dataset of narrated instructional videos in which content creators teach complex tasks such as handcrafting, cooking, and personal care. It contains 136 million video clips taken from 1.2 million YouTube videos, spanning a total duration of about 15 years of footage and 23,000 activities from various domains. Every clip has a corresponding narration taken from the video’s subtitles. HowTo100M was used to train the GODIVA model.
- Kinetics Human Action Video Dataset (Kinetics)[8]: a dataset focused on human action recognition, containing around 500,000 ten-second video clips taken from YouTube videos. Each clip is labeled with an action class; the dataset covers 600 action classes, each with at least 600 clips.
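To illustrate the clip-level supervision these datasets provide, the sketch below shows one way a single record could be represented, with either an action class (Kinetics-style) or an aligned narration (HowTo100M-style). The field names are assumptions for illustration, not the datasets’ actual schemas.

```python
# Illustrative record structure for labeled video clips; field names are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class VideoClip:
    video_id: str             # source YouTube video identifier
    start_sec: float          # clip start time within the source video
    end_sec: float            # clip end time
    label: Optional[str]      # action class (Kinetics-style supervision)
    narration: Optional[str]  # subtitle text aligned to the clip (HowTo100M-style)

clips = [
    VideoClip("abc123", 12.0, 22.0, label="kneading dough", narration=None),
    VideoClip("xyz789", 40.5, 47.0, label=None, narration="now fold the edges inward"),
]
```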
Aside from the datasets mentioned above, many text-to-video models are also trained on custom datasets assembled from publicly available videos.
References
[1] Agnese, Jorge; Herrera, Jonathan; Tao, Haicheng; Zhu, Xingquan (October 2019). “A Survey and Taxonomy of Adversarial Neural Networks for Text-to-Image Synthesis”. arXiv:1910.09399 [cs.CV]. https://arxiv.org/pdf/1910.09399.pdf
[2] Zhu, Xiaojin; Goldberg, Andrew B.; Eldawy, Mohamed; Dyer, Charles R.; Strock, Bradley (2007). “A text-to-picture synthesis system for augmenting communication”. AAAI. 7: 1590–1595. https://www.aaai.org/Papers/AAAI/2007/AAAI07-252.pdf
[3] Frolov, Stanislav; Hinz, Tobias; Raue, Federico; Hees, Jörn; Dengel, Andreas (December 2021). “Adversarial text-to-image synthesis: A review”. Neural Networks. 144: 187–209. https://www.sciencedirect.com/science/article/pii/S0893608021002823
[4] Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Ghasemipour, Seyed Kamyar Seyed; Karagol Ayan, Burcu; Mahdavi, S. Sara; Gontijo Lopes, Rapha; Salimans, Tim; Ho, Jonathan; Fleet, David J.; Norouzi, Mohammad (23 May 2022). “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”. https://arxiv.org/abs/2205.11487
[5] Narain, Rohit. “Smart Video Generation from Text Using Deep Neural Networks”. DataToBiz. https://www.datatobiz.com/blog/smart-video-generation-from-text/
[6] “Introducing Make-A-Video: An AI system that generates videos from text”. Meta AI. https://ai.facebook.com/blog/generative-ai-text-to-video/
[7] “What is HowTo100M?”. https://www.di.ens.fr/willow/research/howto100m/
[8] Kay, Will; et al. (2017). “The Kinetics Human Action Video Dataset”. arXiv:1705.06950. https://arxiv.org/abs/1705.06950