I have been continuously following the computer vision (CV) and image synthesis research scene at Arxiv and elsewhere for around five years, so trends become evident over time, and they shift in new directions every year.
Therefore as 2024 draws to a close, I thought it appropriate to take a look at some new or evolving characteristics in Arxiv submissions in the Computer Vision and Pattern Recognition section. These observations, though informed by hundreds of hours studying the scene, are strictly anecdata.
The Ongoing Rise of East Asia
By the end of 2023, I had noticed that the majority of the literature in the ‘voice synthesis’ category was coming out of China and other regions in east Asia. At the end of 2024, I have to observe (anecdotally) that this now applies also to the image and video synthesis research scene.
This does not mean that China and adjacent countries are necessarily always outputting the best work (indeed, there is some evidence to the contrary); nor does it take account of the high likelihood in China (as in the west) that some of the most interesting and powerful new developing systems are proprietary, and excluded from the research literature.
But it does suggest that east Asia is beating the west by volume, in this regard. What that’s worth depends on the extent to which you believe in the viability of Edison-style persistence, which usually proves ineffective in the face of intractable obstacles.
There are many such roadblocks in generative AI, and it is not easy to know which can be solved by addressing existing architectures, and which will need to be reconsidered from zero.
Though researchers from east Asia seem to be producing a greater number of computer vision papers, I have noticed an increase in the frequency of ‘Frankenstein’-style projects – initiatives that constitute a melding of prior works, while adding limited architectural novelty (or possibly just a different type of data).
This year a far higher number of east Asian entries (primarily Chinese or Chinese-involved collaborations) seemed to be quota-driven rather than merit-driven, significantly lowering the signal-to-noise ratio in an already over-subscribed field.
At the same time, a greater number of east Asian papers have also engaged my attention and admiration in 2024. So if this is all a numbers game, it’s not failing – but neither is it cheap.
Increasing Volume of Submissions
The volume of papers, across all originating countries, has evidently increased in 2024.
The most popular publication day shifts throughout the year; at the moment it is Tuesday, when the number of submissions to the Computer Vision and Pattern Recognition section is often around 300-350 in a single day, in the ‘peak’ periods (May-August and October-December, i.e., conference season and ‘annual quota deadline’ season, respectively).
Beyond my own experience, Arxiv itself reports a record number of submissions in October of 2024, with 6000 total new submissions, and the Computer Vision section the second-most submitted section after Machine Learning.
However, since the Machine Learning section at Arxiv is often used as an ‘additional’ or aggregated super-category, this argues for Computer Vision and Pattern Recognition actually being the most-submitted Arxiv category.
Arxiv’s own statistics certainly depict computer science as the clear leader in submissions.
Stanford University’s 2024 AI Index, though not yet able to report on the most recent statistics, also emphasizes the notable rise in submissions of academic papers around machine learning in recent years.
Diffusion>Mesh Frameworks Proliferate
One other clear trend that emerged for me was a large upswing in papers that deal with leveraging Latent Diffusion Models (LDMs) as generators of mesh-based, ‘traditional’ CGI models.
Projects of this type include Tencent’s InstantMesh3D, 3Dtopia, Diffusion2, V3D, MVEdit, and GIMDiffusion, among a plenitude of similar offerings.
This emergent research strand could be taken as a tacit concession to the ongoing intractability of generative systems such as diffusion models, which only two years ago were being touted as a potential substitute for all the systems that diffusion>mesh models are now seeking to populate; this relegates diffusion to the role of a tool in technologies and workflows that date back thirty or more years.
Stability.ai, originators of the open source Stable Diffusion model, have just released Stable Zero123, which can, among other things, use a Neural Radiance Fields (NeRF) interpretation of an AI-generated image as a bridge to create an explicit, mesh-based CGI model that can be used in CGI arenas such as Unity, in video-games, augmented reality, and in other platforms that require explicit 3D coordinates, as opposed to the implicit (hidden) coordinates of continuous functions.
Images generated in Stable Diffusion can be converted to rational CGI meshes. Here we see the result of an image>CGI workflow using Stable Zero123. Source: https://www.youtube.com/watch?v=RxsssDD48Xc
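The difference between explicit 3D coordinates and the implicit coordinates of a continuous function can be illustrated with a toy sketch. This is not how Stable Zero123 or any NeRF works internally; the names `sphere_sdf`, `VERTICES`, `FACES` and `move_vertex` are hypothetical and chosen purely for illustration. A signed distance function stands in for the implicit representation, and a hand-written octahedron stands in for the mesh.

```python
import math


def sphere_sdf(point, radius=1.0):
    """Implicit surface: signed distance from `point` to a sphere at the origin.

    Negative inside, zero on the surface, positive outside. There is no
    vertex list to index; the shape can only be queried point by point.
    """
    x, y, z = point
    return math.sqrt(x * x + y * y + z * z) - radius


# Explicit surface: an octahedron stored as addressable vertices and faces,
# the kind of data a CGI package or game engine expects (illustrative values).
VERTICES = [
    (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0), (0.0, -1.0, 0.0),
    (0.0, 0.0, 1.0), (0.0, 0.0, -1.0),
]
FACES = [
    (0, 2, 4), (2, 1, 4), (1, 3, 4), (3, 0, 4),
    (2, 0, 5), (1, 2, 5), (3, 1, 5), (0, 3, 5),
]


def move_vertex(vertices, index, offset):
    """Return a copy of `vertices` with one vertex translated by `offset`."""
    moved = list(vertices)
    moved[index] = tuple(c + d for c, d in zip(vertices[index], offset))
    return moved


if __name__ == "__main__":
    # The implicit form answers "how far is this point from the surface?"
    print(sphere_sdf((0.5, 0.0, 0.0)))
    # The explicit form lets a user grab vertex 4 and pull it upward directly.
    print(move_vertex(VERTICES, 4, (0.0, 0.0, 0.5))[4])
```

In the explicit case a user can address and edit a single vertex; in the implicit case the surface has to be sampled and converted first, which is the bridging step that image-to-mesh pipelines automate.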
3D Semantics
The generative AI space distinguishes between 2D and 3D implementations of vision and generative systems. For instance, facial landmarking frameworks, though representing 3D objects (faces) in all cases, do not all necessarily calculate addressable 3D coordinates.
The popular FANAlign system, widely used in 2017-era deepfake architectures (among others), can accommodate both these approaches.
So, just as ‘deepfake’ has become an ambiguous and hijacked term, ‘3D’ has likewise become a confusing term in computer vision research.
For consumers, it has typically signified stereo-enabled media (such as movies where the viewer has to wear special glasses); for visual effects practitioners and modelers, it provides the distinction between 2D artwork (such as conceptual sketches) and mesh-based models that can be manipulated in a ‘3D program’ like Maya or Cinema4D.
But in computer vision, it simply means that a Cartesian coordinate system exists somewhere in the latent space of the model – not that it can necessarily be addressed or directly manipulated by a user; at least, not without third-party interpretative CGI-based systems such as 3DMM or FLAME.
Therefore the notion of diffusion>3D is inexact; not only can any type of image (including a real photo) be used as input to produce a generative CGI model, but the less ambiguous term ‘mesh’ is more appropriate.
However, to compound the ambiguity, diffusion is needed to interpret the source photo into a mesh, in the majority of emerging projects. So a better description might be image-to-mesh, while image>diffusion>mesh is an even more accurate description.
But that’s a hard sell at a board meeting, or in a publicity release designed to engage investors.
Evidence of Architectural Stalemates
Even compared to 2023, the last 12 months’ crop of papers exhibits a growing desperation around removing the hard practical limits on diffusion-based generation.
The key stumbling block remains the generation of narratively and temporally consistent video, and maintaining a consistent appearance of characters and objects – not only across different video clips, but even across the short runtime of a single generated video clip.
The last epochal innovation in diffusion-based synthesis was the advent of LoRA in 2022. While newer systems such as Flux have improved on some of the outlier problems, such as Stable Diffusion’s former inability to reproduce text content inside a generated image, and overall image quality has improved, the majority of papers I studied in 2024 were essentially just moving the food around on the plate.
These stalemates have occurred before, with Generative Adversarial Networks (GANs) and with Neural Radiance Fields (NeRF), both of which failed to live up to their apparent initial potential – and both of which are increasingly being leveraged in more conventional systems (such as the use of NeRF in Stable Zero123, see above). This also appears to be happening with diffusion models.
Gaussian Splatting Research Pivots
It seemed at the end of 2023 that the rasterization method 3D Gaussian Splatting (3DGS), which debuted as a medical imaging technique in the early 1990s, was set to suddenly overtake autoencoder-based systems in human image synthesis challenges (such as facial simulation and recreation, as well as identity transfer).
The 2023 ASH paper promised full-body 3DGS humans, while Gaussian Avatars offered massively improved detail (compared to autoencoder and other competing methods), together with impressive cross-reenactment.
This year, however, has been relatively short on any such breakthrough moments for 3DGS human synthesis; most of the papers that tackled the problem were either derivative of the above works, or failed to exceed their capabilities.
Instead, the emphasis on 3DGS has been on improving its fundamental architectural feasibility, leading to a rash of papers that offer improved 3DGS exterior environments. Particular attention has been paid to Simultaneous Localization and Mapping (SLAM) 3DGS approaches, in projects such as Gaussian Splatting SLAM, Splat-SLAM, Gaussian-SLAM, and DROID-Splat, among many others.
Those projects that did attempt to continue or extend splat-based human synthesis included MIGS, GEM, EVA, OccFusion, FAGhead, HumanSplat, GGHead, HGM, and Topo4D. Though there are others besides, none of these outings matched the initial impact of the papers that emerged in late 2023.
The ‘Weinstein Era’ of Test Samples Is in (Slow) Decline
Research from south east Asia in general (and China in particular) often features test examples that are problematic to republish in a review article, because they feature material that is a little ‘spicy’.
Whether this is because research scientists in that part of the world are seeking to garner attention for their output is up for debate; but for the last 18 months, an increasing number of papers around generative AI (image and/or video) have defaulted to using young and scantily-clad women and girls in project examples. Borderline NSFW examples of this include UniAnimate, ControlNext, and even very ‘dry’ papers such as Evaluating Motion Consistency by Fréchet Video Motion Distance (FVMD).
This follows the general trends of subreddits and other communities that have gathered around Latent Diffusion Models (LDMs), where Rule 34 remains very much in evidence.
Celebrity Face-Off
This type of inappropriate example overlaps with the growing recognition that AI processes should not arbitrarily exploit celebrity likenesses – particularly in studies that uncritically use examples featuring attractive celebrities, often female, and place them in questionable contexts.
One example is AnyDressing, which, besides featuring very young anime-style female characters, also liberally uses the identities of classic celebrities such as Marilyn Monroe, and current ones such as Anne Hathaway (who has denounced this kind of usage quite vocally).
In western papers, this particular practice has been notably in decline throughout 2024, led by the larger releases from FAANG and other high-level research bodies such as OpenAI. Critically aware of the potential for future litigation, these major corporate players seem increasingly unwilling to represent even fictional photorealistic people.
Though the systems they are creating (such as Imagen and Veo2) are clearly capable of such output, examples from western generative AI projects now trend towards ‘cute’, Disneyfied and extremely ‘safe’ images and videos.
Face-Washing
In the western CV literature, this disingenuous approach is particularly in evidence for customization systems – methods which are capable of creating consistent likenesses of a particular person across multiple examples (i.e., like LoRA and the older DreamBooth).
Examples include orthogonal visual embedding, LoRA-Composer, Google’s InstructBooth, and a multitude more.
However, the rise of the ‘cute example’ is seen in other CV and synthesis research strands, in projects such as Comp4D, V3D, DesignEdit, UniEdit, FaceChain (which concedes to more realistic user expectations on its GitHub page), and DPG-T2I, among many others.
The ease with which such systems (such as LoRAs) can be created by home users with relatively modest hardware has led to an explosion of freely-downloadable celebrity models at the civit.ai domain and community. Such illicit usage remains possible through the open sourcing of architectures such as Stable Diffusion and Flux.
Though it is often possible to punch through the safety features of generative text-to-image (T2I) and text-to-video (T2V) systems to produce material banned by a platform’s terms of use, the gap between the restricted capabilities of the best systems (such as RunwayML and Sora), and the unlimited capabilities of the merely performant systems (such as Stable Video Diffusion, CogVideo and local deployments of Hunyuan), is not really closing, as many believe.
Rather, these proprietary and open-source systems, respectively, threaten to become equally useless: expensive and hyperscale T2V systems may become excessively hamstrung due to fears of litigation, while the lack of licensing infrastructure and dataset oversight in open source systems could lock them entirely out of the market as more stringent regulations take hold.
First published Tuesday, December 24, 2024