I wanna doooominate CivitAI. Since that will probably require large-scale training on a server with multiple GPUs, Diffusers is an unbeatable choice. My local GPU has only 16GB of VRAM, which is too little for fine-tuning the whole SDXL network, so I decided to train a LoRA first, simply to gather practical experience. I won’t rent a server until everything, including my Diffusers skills, is ready.

The code I mention can be found in the repo below. Feel free to take it if it works for you.

Project_I_Tools repository

Gathering the Dataset

This LoRA aims to reproduce Mika Pikazo’s colorful art style. The dataset covers her works from 2018 onward (the year the target style started to emerge), about 300 images in total. Downloading them from pixiv manually wouldn’t take more than 10 minutes, but I still tried to do it automatically with gallery-dl. That makes sense because downloading the dataset for the final model (probably larger than 100k images) one by one would be unthinkable, and I don’t want to prepare everything on-site when training the final one.
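
For the record, the download step boils down to a single gallery-dl call. The sketch below shows how it could be invoked from a notebook cell; the artist URL is a placeholder and the subprocess wrapper is my own choice, not the exact code in the repo. pixiv access also typically needs a one-time `gallery-dl oauth:pixiv` run to store a refresh token.

```python
# Sketch only (not the repo's exact code): fetch an artist's pixiv gallery with
# gallery-dl from a Jupyter cell. By default, files are saved under ./gallery-dl/.
import subprocess

# Placeholder: substitute the artist's real pixiv user id.
ARTIST_URL = "https://www.pixiv.net/en/users/<artist_id>/artworks"

# Assumes pixiv credentials are already configured, e.g. via `gallery-dl oauth:pixiv`.
subprocess.run(["gallery-dl", ARTIST_URL], check=True)
```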

GPT claims that the philosophy of ‘verify feasibility before executing in a terminal’ is widely embraced in the community. I believed it, so you’ll notice that most of my code was written in Jupyter notebooks.

Dataset Preprocessing

  1. Deduplication (Notebooks/Dedup.ipynb)
  • MD5 byte comparison
  • Perceptual hashing
  • CNN-based similarity

Frankly, this workflow isn’t very effective. With a CNN threshold even slightly below 90%, it groups tons of completely different images into clusters, while still failing to catch pairs derived from cropping or resizing (I’ll try to work out a mechanism to catch those cases and raise the threshold later on).
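
For reference, here is roughly what the three passes look like with imagededup plus plain hashlib; the directory path and thresholds are illustrative, not the exact settings in Dedup.ipynb.

```python
# Illustrative sketch of the three dedup passes; paths and thresholds are placeholders.
import hashlib
from collections import defaultdict
from pathlib import Path

from imagededup.methods import PHash, CNN

IMAGE_DIR = "dataset/raw"  # hypothetical location of the downloaded images

# 1) MD5 byte comparison: byte-identical files share a digest.
by_md5 = defaultdict(list)
for path in Path(IMAGE_DIR).iterdir():
    if path.is_file():
        by_md5[hashlib.md5(path.read_bytes()).hexdigest()].append(path)
exact_dupes = {h: paths for h, paths in by_md5.items() if len(paths) > 1}

# 2) Perceptual hashing: catches near-identical images (re-encodes, minor edits).
phasher = PHash()
phash_dupes = phasher.find_duplicates(image_dir=IMAGE_DIR, max_distance_threshold=8)

# 3) CNN embeddings: the ~90% similarity cutoff discussed above.
cnn = CNN()
cnn_dupes = cnn.find_duplicates(image_dir=IMAGE_DIR, min_similarity_threshold=0.9)
```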

Powered by:

imagededup repository
  2. Cleaning up

For this stage I simply cleaned up the whole dataset by hand, since there are fewer than 1k images. For the main model we might need to:

  • Eliminate images that contain too much text
  • Get rid of overly low-resolution images

You might wonder why I gave up on automation for this section. The reason: cleaning a 1k dataset and cleaning a 100k dataset are totally different jobs. When the dataset is large ‘enough’, the kinds of images that need to be removed become overwhelming and can’t be squeezed into a single Jupyter notebook. The bullets above only list the situations I ran into during this experiment; as noted, they are just a few examples.
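
That said, at least the resolution check is easy to automate later. Below is a minimal sketch with Pillow, assuming a hypothetical folder layout and an arbitrary 768-pixel cutoff; filtering text-heavy images would additionally need an OCR pass, which I’m leaving out here.

```python
# Minimal sketch (paths and cutoff are my own placeholders, not part of the repo):
# flag images whose shorter side falls below a minimum resolution.
from pathlib import Path
from PIL import Image

IMAGE_DIR = Path("dataset/cleaned")   # hypothetical location
MIN_SHORT_SIDE = 768                  # illustrative cutoff for SDXL-scale training

low_res = []
for path in IMAGE_DIR.glob("*"):
    if not path.is_file():
        continue
    try:
        with Image.open(path) as im:
            if min(im.size) < MIN_SHORT_SIDE:
                low_res.append(path)
    except OSError:
        low_res.append(path)  # unreadable or corrupt files get flagged too

print(f"{len(low_res)} images below {MIN_SHORT_SIDE}px on the shorter side")
```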