I generated 400k unique avatars in 50 days with my personal GPU and it only cost me 65 dollars in electricity. The results are at the end of the post.
In our anonymous chat app, we want users to be able to interact while staying anonymous if they wish. But this raises one of the most obvious problems of anonymous applications: users without profile pictures. I was tired of seeing generic silhouettes for these users. I needed to modernize the app so that every user always has a profile picture, making the product feel more "full of life".
We wanted some degree of customization for our avatars, but we clearly couldn't afford to spend thousands of dollars on advanced 3D modeling software for custom avatars. So we went with the cheap (open source) and steerable Stable Diffusion route: pre-generating a set of 400k unique avatars with varying combinations of hair, accessories, gender, and age.
This was one of the rare cases where I had no doubts about which solution to implement. Stability AI had recently open-sourced Stable Diffusion XL, and a company called Segmind had released a faster distilled variant of it: SSD-1B. I quickly downloaded the tensors and wrote a couple of very funny prompts. There were no refusals: the model simply generated what I asked for in under 15 seconds. It was the ideal tool for building an army of avatars.
Stable Diffusion has both weaknesses and strengths. These low-parameter models are somewhat difficult to keep consistently adherent to your prompt, but they bring several advantages: they are usually uncensored, they are extremely fast to run, and, most importantly, they support negative prompts. We can tell the model both what to generate and what not to generate.
I needed to create a variety of cartoon-style avatars, and for that I had to generate a variety of prompts. First I had to test which adjectives and nouns the model could reproduce reliably and consistently, and find ways to create characters with varied faces, all while using a wide range of negative prompts. Here are some negative prompt examples:
[tiling:out of frame:cropped:scary], photorealistic, [mutation:bad anatomy:bad proportions:distorted:disfigured:deformed], [glitch:duplicate:duplicated], [text:logo:watermark:words:digits:autograph:signature]
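To make this concrete, here is a minimal sketch of a single generation call using the Hugging Face diffusers library. The model ID (segmind/SSD-1B), resolution, and step count come from this post; the guidance settings, the sample prompt, and the flattened form of the negative prompt are my assumptions, not the exact production values:

```python
# Sketch: one avatar generation with SSD-1B via diffusers.
# The negative prompt list mirrors the categories shown above, flattened into
# the comma-separated form that diffusers expects.

NEGATIVE_PROMPT = ", ".join([
    "tiling", "out of frame", "cropped", "scary", "photorealistic",
    "mutation", "bad anatomy", "bad proportions", "distorted",
    "disfigured", "deformed", "glitch", "duplicate", "duplicated",
    "text", "logo", "watermark", "words", "digits", "autograph", "signature",
])

def generate_avatar(prompt: str, seed: int):
    # Heavy imports are kept local so this module loads without a GPU stack.
    import torch
    from diffusers import StableDiffusionXLPipeline  # SSD-1B uses the SDXL pipeline

    # In a real batch run the pipeline is loaded once, not per call.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "segmind/SSD-1B", torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")
    generator = torch.Generator("cuda").manual_seed(seed)  # reproducible seed
    return pipe(
        prompt=prompt,
        negative_prompt=NEGATIVE_PROMPT,
        width=768, height=768,
        num_inference_steps=16,
        generator=generator,
    ).images[0]

# Usage (requires a CUDA GPU and the model download):
#   image = generate_avatar("cartoon avatar of a smiling person", seed=42)
#   image.save("avatar.webp")
```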
To generate sufficiently varied faces, I had to research which famous characters the model had "seen" in its training data, then combine the first names of various celebrities with last names from others (without crossing genders or age ranges, but deliberately crossing ethnicities and skin tones) to obtain unique faces that retain very few traits of the original person. This requires assigning a weight to individual words in the prompt to place extra emphasis on particular adjectives, the same way it's done in the negative prompt.
Each combination was repeated 4 times with different seeds, for a total of 400k unique avatars.
To ensure that the avatars are truly coherent, I wrote a control script around a face detector (FaceDet fp16) that discards avatars containing duplicate faces, no face at all, or a face that is too large, too small, or not sufficiently centered.
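The filtering rules can be expressed independently of the detector itself. In this sketch the detector is abstracted as a list of bounding boxes; the box format and every threshold value are my assumptions, since the post doesn't specify them:

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Face bounding box in pixels, top-left origin."""
    x: float
    y: float
    w: float
    h: float

def keep_avatar(faces: list[Box], img_w: int = 768, img_h: int = 768,
                min_frac: float = 0.15, max_frac: float = 0.6,
                center_tol: float = 0.2) -> bool:
    """Apply the control-script rules: exactly one face, reasonable size,
    roughly centered. All thresholds are illustrative."""
    if len(faces) != 1:                       # no face, or duplicate faces
        return False
    f = faces[0]
    frac = f.w / img_w                        # face width relative to image
    if not (min_frac <= frac <= max_frac):    # too small or too large
        return False
    cx = (f.x + f.w / 2) / img_w              # normalized face center
    cy = (f.y + f.h / 2) / img_h
    return abs(cx - 0.5) <= center_tol and abs(cy - 0.5) <= center_tol

print(keep_avatar([Box(250, 250, 260, 260)]))  # one centered mid-size face -> True
print(keep_avatar([]))                         # no face -> False
```

In production this predicate would sit behind the real detector, dropping rejected images before the WebP conversion step.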
Generating all 400k avatars with SSD-1B fp16 at 768x768 with 16 steps took around 50 continuous days on an RTX 3090 and consumed approximately 420 kWh. The resulting JPEG images totaled 25 GB; they were converted to WebP and uploaded to Cloudflare R2 for use in production.
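Those headline numbers are self-consistent, which a quick back-of-the-envelope check confirms. The electricity rate is not stated in the post, so it is derived here from the cost and the consumption:

```python
# Sanity-checking the run's figures (all inputs come from the post).
images = 400_000
days = 50
kwh = 420
cost_usd = 65

per_image_s = days * 24 * 3600 / images   # average seconds per avatar
rate = cost_usd / kwh                     # implied electricity price, $/kWh
avg_watts = kwh * 1000 / (days * 24)      # average draw over the whole run

print(round(per_image_s, 1))  # 10.8  -> under the ~15 s per image mentioned earlier
print(round(rate, 3))         # 0.155 -> implied $/kWh
print(round(avg_watts))       # 350   -> a plausible sustained RTX 3090 draw
```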
Finally, I built an avatar creator into my Android and iOS social apps so users can customize their avatar by choosing from the pre-generated ones. The main limitation is that changing any attribute also changes the face. To force one consistent face across different accessories or hairstyles, you may need to use ComfyUI.
To see this feature working in production, you can visit my portfolio and download any of my social apps.