Style Aligned Image Generation via Shared Attention
Amir Hertz*1, Andrey Voynov*1, Shlomi Fruchter, and Daniel Cohen-Or
1 Google Research    2 Tel Aviv University
Abstract
Large-scale Text-to-Image (T2I) models have rapidly
gained prominence across creative fields, generating visu-
ally compelling outputs from textual prompts. However,
controlling these models to ensure consistent style remains
challenging, with existing methods necessitating fine-tuning
and manual intervention to disentangle content and style. In
this paper, we introduce StyleAligned, a novel technique de-
signed to establish style alignment among a series of gener-
ated images. By employing minimal ‘attention sharing’ dur-
ing the diffusion process, our method maintains style con-
sistency across images within T2I models. This approach
allows for the creation of style-consistent images using a
reference style through a straightforward inversion opera-
tion. Our method’s evaluation across diverse styles and text
prompts demonstrates high-quality synthesis and fidelity,
underscoring its efficacy in achieving consistent style across
various inputs.
1. Introduction
Large-scale Text-to-Image (T2I) generative models [43,
45,] have emerged as an essential tool across creative
disciplines, such as art, graphic design, animation, archi-
tecture, gaming and more. These models show tremendous
capabilities of translating an input text into an appealing vi-
sual result that is aligned with the input description.
An envisioned application of T2I models revolves around the rendition of various concepts in a way that shares a consistent style and character, as though all were created by the same artist and method (see Fig. 1). While proficient in aligning with the textual description of the style, state-of-the-art T2I models often create images that diverge significantly in their interpretations of the same stylistic descriptor, as depicted in Fig. 2.
Recent methods mitigate this by fine-tuning the T2I
model over a set of images that share the same style [16,].
This optimization is computationally expensive and usually
*Equal contribution. Equal advising.

Figure 1 panel prompts: “Toy train...”, “Toy airplane...”, “Toy bicycle...”, “Toy car...”; style suffixes: “...drawing, vector art.”, “...colorful, macro photo.”, “...BW logo, high contrast.”, “...poster, illustration.”
Figure 1.Style aligned image set generation.By fusing the fea-
tures of the toy train image (left) during the diffusion process, we
can generate an image set of different content that shares the style.
requires human input in order to find a plausible subset of
images and texts that enables the disentanglement of con-
tent and style.
We introduce StyleAligned, a method that enables consistent style interpretation across a set of generated images (Fig.). Our method requires no optimization and can be applied to any attention-based text-to-image diffusion model. We show that adding minimal attention sharing operations along the diffusion process, from each generated image to the first one in a batch, leads to a style-consistent set. Moreover, using diffusion inversion, our method can be applied to generate style-consistent images given a reference style image, with no optimization or fine-tuning.

Figure 2. Standard text-to-image vs. StyleAligned set generation. Given the style description “minimal origami”, standard text-to-image generation (top) results in an unaligned image set, while our method (bottom) can generate a variety of style-aligned content.
We present our results over diverse styles and text
prompts, demonstrating high-quality synthesis and fidelity
to the prompts and reference style. We show diverse ex-
amples of generated images that share their style with a
reference image that can possibly be a given input image.
Importantly, our technique stands as a zero-shot solution,
distinct from other personalization techniques, as it oper-
ates without any form of optimization or fine-tuning. For
our code and more examples, please visit the project page
style-aligned-gen.github.io
2. Related Work
Text-to-image generation. Text-conditioned image generative models [10,,] show unprecedented capabilities of generating high quality images from text descriptions.
In particular, T2I diffusion models [41,,] are pushing
the state of the art and they are quickly adopted for differ-
ent generative visual tasks like inpainting [5,], image-
to-image translation [61,], local image editing [12,],
subject-driven image generation [48,] and more.
Attention Control in diffusion models.Hertz et al. [20]
have shown how cross and self-attention maps within the
diffusion process determine the layout and content of the
generated images. Moreover, they showed how the atten-
tion maps can be used for controlled image generation.
Other studies have leveraged modifications in attention lay-
ers to enhance the fidelity or diversity of generated im-
ages [11,], or apply attention control for image edit-
ing [8,,,,,]. However, in contrast to prior
approaches that primarily enable structure-preserving im-
age editing, our method excels at generating images with
diverse structures and content while maintaining a consis-
tent style interpretation.
Style Transfer. Transferring a style from a reference image to a target content image is a well-studied subject in computer graphics. Classic works [13,,,] rely on optimization of handcrafted features and texture resampling algorithms from an input texture image, combined with structure constraints of a content image. With the progress of deep learning research, another line of works utilizes deep neural priors for style transfer optimization using deep features of pre-trained networks [18,], or injects attention features from a style image to a target one [4]. More re-
lated to our approach, Huang et al. [26] introduced a real
time style transfer network based on Adaptive Instance Nor-
malization layers (AdaIN) that are used to normalize deep
features of a target image using deep features statistics of
a reference style image. Follow-up works employ the AdaIN layer for additional unsupervised learning tasks, like style-based image generation [29] and image-to-image translation [27,].
T2I Personalization. To generalize T2I models to new visual concepts, several works developed different optimization techniques over a small collection of input images that share the same concept [16,,,]. In instances where the collection shares a consistent style, the acquired concept becomes the style itself, affecting subsequent generations. Closest to our work is StyleDrop [55], a style personalization method that relies on fine-tuning lightweight adapter layers [24] at the end of each attention block in a non-autoregressive generative text-to-image transformer [10]. StyleDrop can generate a set of images in the same style by training the adapter layers over a collection of images that share that style. However, it struggles to generate a consistent image set of different content when training on a single image.
Our method can generate a consistent image set without an optimization phase and without relying on several images for training. To skip the training phase, recent works developed dedicated personalization encoders [17,,,,] that can directly inject new priors from a single input image into the T2I model. However, these methods encounter challenges in disentangling style from content, as they focus on generating the same subject as in the input image.
3. Method overview
In the following section, we start with an overview of the T2I diffusion process and, in particular, its self-attention mechanism (Sec. 3.1). We continue by presenting our attention-sharing operation within the self-attention layers, which enables style-aligned image set generation.

Figure 3. Style Aligned Diffusion. Generation of images with a style aligned to the reference image on the left. In each diffusion denoising step, all the images except the reference perform a shared self-attention with the reference image. Panel prompts: “Toy train...”, “Toy airplane...”, “Toy piano...”, “Toy house...”, “Toy boat...”, “Toy drum set...”, “Toy car...”, “Toy kitchen...”.

Figure 4. Shared attention layer. The target image attends to the reference image by applying AdaIN over its queries and keys using the reference queries and keys, respectively. Then, we apply shared attention, where the target features are updated by both the target values V_t and the reference values V_r. Diagram elements: Reference Features, Target Features, AdaIN over Q_t and K_t, Scaled Dot-Product Attention.
3.1. Preliminaries
Diffusion models [23,] are generative latent variable models that aim to model a distribution pθ(x0) that approximates the data distribution q(x0) and is easy to sample from. Diffusion models are trained to reverse the diffusion “forward process”:

x_t = √(α_t) x_0 + √(1 − α_t) ε,   ε ∼ N(0, I),

where t ∈ [0, ∞) and the values of α_t are determined by a scheduler such that α_0 = 1 and lim_{t→∞} α_t = 0. During inference, we sample an image by gradually denoising an input noise image x_T ∼ N(0, I) via the reverse process:

x_{t−1} = μ_{t−1} + σ_t z,   z ∼ N(0, I),

where the value of σ_t is determined by the sampler and μ_{t−1} is given by

μ_{t−1} = (√(α_{t−1}) / √(α_t)) x_t + ( √(1 − α_{t−1}) − √(α_{t−1}) √(1 − α_t) / √(α_t) ) ε_θ(x_t, t),

where ε_θ(x_t, t) is the output of a diffusion model parameterized by θ.

Moreover, this process can be generalized to learning a conditional distribution using an additional input condition. This leads to text-to-image (T2I) diffusion models, where the output of the model ε_θ(x_t, t, y) is conditioned on a text prompt y.
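For concreteness, the reverse process above can be written as a short procedure. The following is a minimal sketch (our illustration, not the paper's released code), assuming the noise prediction eps = ε_θ(x_t, t, y) has already been computed and the cumulative α values are available as Python floats; the name ddim_step is ours.

```python
import math
import torch

def ddim_step(x_t: torch.Tensor, eps: torch.Tensor,
              alpha_t: float, alpha_prev: float, sigma_t: float = 0.0) -> torch.Tensor:
    # Predicted clean image implied by the forward process x_t = sqrt(a_t) x_0 + sqrt(1 - a_t) eps.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_t) * eps) / math.sqrt(alpha_t)
    # Deterministic mean mu_{t-1} (matches the equation above when sigma_t = 0).
    mu_prev = math.sqrt(alpha_prev) * x0_pred + \
        math.sqrt(max(1.0 - alpha_prev - sigma_t ** 2, 0.0)) * eps
    # Stochastic term sigma_t * z of the reverse process.
    return mu_prev + sigma_t * torch.randn_like(x_t)
```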
Self-Attention in T2I Diffusion Models.State-of-the-
art T2I diffusion models [7,,] employ a U-Net architec-
ture [46] that consists of convolution layers and transformer
attention blocks [60]. In these attention mechanisms, deep image features ϕ ∈ R^{m×d_h} attend to each other via self-attention layers and to a contextual text embedding via cross-attention layers.

Figure 5. Ablation study – qualitative comparison. Each pair of rows shows two sets of images generated by the same set of prompts (“A firewoman...”, “A farmer...”, “A unicorn...”, “Dino...”, each “...in minimal flat design illustration”) using different configurations of our method (Full Attention Sharing, W.O. Query-Key AdaIN, StyleAligned (full)), and each row in a pair uses a different seed. Sharing the self-attention between all images in the set (bottom) results in some diversity loss (style collapse across many seeds) and content leakage within each set (colors from one image leak to another). Disabling the queries–keys AdaIN operation results in less consistent image sets compared to our full method (top), which keeps both diversity between different sets and consistency within each set.

Our work operates at the self-attention layers, where deep features are updated by attending to each other. First, the features are projected into queries Q ∈ R^{m×d_k}, keys K ∈ R^{m×d_k}, and values V ∈ R^{m×d_h} through learned linear layers. Then, the attention is computed by the scaled dot-product attention:

Attention(Q, K, V) = softmax( QK^T / √d_k ) V,

where d_k is the dimension of Q and K. Intuitively, each image feature is updated by a weighted sum of V, where the weights depend on the correlation between the projected query q and the keys K. In practice, each self-attention layer consists of several attention heads, and the residual is computed by concatenating and projecting the attention heads' output back to the image feature space d_h:

ϕ̂ = ϕ + Multi-Head-Attention(ϕ).
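As a reference point for the shared-attention variants introduced next, here is a single-head sketch of the scaled dot-product self-attention update described above (actual T2I models use multiple heads and fused implementations); to_q, to_k, to_v, and to_out stand in for the layer's learned linear projections and are our placeholder names, not modules from a specific codebase.

```python
import torch

def self_attention(phi: torch.Tensor, to_q, to_k, to_v, to_out) -> torch.Tensor:
    # phi: image features of shape (batch, m, d_h); to_* are learned linear layers.
    q, k, v = to_q(phi), to_k(phi), to_v(phi)
    d_k = q.shape[-1]
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    scores = q @ k.transpose(-1, -2) / d_k ** 0.5
    out = scores.softmax(dim=-1) @ v
    # Residual update: phi_hat = phi + projected attention output.
    return phi + to_out(out)
```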
3.2. Style Aligned Image Set Generation
The goal of our method is to generate a set of images I_1 ... I_n that are aligned with an input set of text prompts y_1 ... y_n and share a consistent style interpretation with each other. For example, see the generated image set of toy objects in Fig. 3, which is aligned to the input text on top. A naïve way to generate a style-aligned image set of different content is to use the same style description in the text prompts. As can be seen in Fig. 2, generating different images using a shared style description of “in minimal origami style” results in an unaligned set, since each image is unaware of the exact appearance of other images in the set during the generation process.
The key insight underlying our approach is the utiliza-
tion of the self-attention mechanism to allow communica-
tion among various generated images. This is achieved by
sharing attention layers across the generated images.
Formally, let Q_i, K_i, and V_i be the queries, keys, and values projected from the deep features ϕ_i of image I_i in the set. Then, the attention update for ϕ_i is given by

Attention(Q_i, K_{1...n}, V_{1...n}),    (1)

where K_{1...n} = [K_1; K_2; ...; K_n] and V_{1...n} = [V_1; V_2; ...; V_n] are the concatenations of the keys and values of all the images in the set.
However, we have noticed that enabling full attention sharing may harm the quality of the generated set. As can be seen in Fig. 5, full attention sharing results in content leakage among the images; for example, the unicorns got green paint from the generated dino in the set. Moreover, full attention sharing results in less diverse sets for the same set of prompts; see the two sets at the bottom of Fig. 5 compared to the sets above.
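The following sketch shows how the fully shared attention of Eq. 1 can be realized inside one self-attention call, assuming the per-image queries, keys, and values have already been projected into shape (n, heads, m, d); this is our illustration of the operation, not the released implementation.

```python
import torch

def fully_shared_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (n, heads, m, d) for the n images generated in one batch.
    n, h, m, d = k.shape
    # Build K_{1..n} and V_{1..n} by concatenating the tokens of all images,
    # then expose the same concatenation to every image in the batch.
    k_all = k.permute(1, 0, 2, 3).reshape(h, n * m, d).unsqueeze(0).expand(n, -1, -1, -1)
    v_all = v.permute(1, 0, 2, 3).reshape(h, n * m, v.shape[-1]).unsqueeze(0).expand(n, -1, -1, -1)
    # Every image attends to the tokens of all n images (the source of content leakage).
    scores = q @ k_all.transpose(-1, -2) / d ** 0.5   # (n, heads, m, n*m)
    return scores.softmax(dim=-1) @ v_all
```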
To restrict the content leakage and allow diverse sets, we share the attention with only one image in the generated set (typically the first in the batch). That is, target image features ϕ_t attend to themselves and to the features of only one reference image in the set using Eq. 1. As can be seen in Fig. 5, attending to a single reference image in the set results in diverse sets that share a similar style. However, in that case, we have noticed that the styles of the different images are not well aligned. We suspect that this is due to low attention flow from the reference to the target image.

Figure 6. Quantitative Comparison. We compare the results of the different methods (blue marks) and our ablation experiments (orange marks) in terms of text alignment (CLIP score) and set consistency (DINO embedding similarity). Data points: T2I Reference, SDRP (SDXL), SDRP (unofficial), DB–LoRA, IP-Adapter, ELITE, BLIP–Diff., Ours (full), Ours (W.O. AdaIN), Ours (Full Attn. Share).
As illustrated in Fig. 4, to enable a balanced attention to the reference, we normalize the queries Q_t and keys K_t of the target image using the queries Q_r and keys K_r of the reference image via the adaptive instance normalization operation (AdaIN) [26]:

Q̂_t = AdaIN(Q_t, Q_r),   K̂_t = AdaIN(K_t, K_r),

where the AdaIN operation is given by

AdaIN(x, y) = σ(y) ( (x − μ(x)) / σ(x) ) + μ(y),

and μ(x), σ(x) ∈ R^{d_k} are the mean and the standard deviation of the queries and keys across the different pixels. Finally, our shared attention is given by

Attention(Q̂_t, K_{rt}, V_{rt}),

where K_{rt} = [K_r; K̂_t] and V_{rt} = [V_r; V_t].
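The operations above can be sketched as follows, assuming the target and reference queries, keys, and values are already projected into shape (heads, m, d); this is a simplified illustration under those assumptions rather than the exact released code.

```python
import torch

def adain(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # AdaIN(x, y): normalize x with its own statistics, then re-scale/shift with y's.
    # Mean/std are taken over the token (pixel) axis, giving per-channel statistics in R^{d_k}.
    mu_x, std_x = x.mean(dim=-2, keepdim=True), x.std(dim=-2, keepdim=True) + eps
    mu_y, std_y = y.mean(dim=-2, keepdim=True), y.std(dim=-2, keepdim=True)
    return std_y * (x - mu_x) / std_x + mu_y

def shared_attention(q_t, k_t, v_t, q_r, k_r, v_r):
    # Target queries and keys are AdaIN-normalized with the reference statistics.
    q_hat = adain(q_t, q_r)
    k_hat = adain(k_t, k_r)
    # K_rt = [K_r; K_hat_t], V_rt = [V_r; V_t]: the target attends to itself and the reference.
    k_rt = torch.cat([k_r, k_hat], dim=-2)
    v_rt = torch.cat([v_r, v_t], dim=-2)
    d_k = q_hat.shape[-1]
    scores = q_hat @ k_rt.transpose(-1, -2) / d_k ** 0.5
    return scores.softmax(dim=-1) @ v_rt
```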
4. Evaluations and Experiments
We have implemented our method over Stable Diffusion
XL (SDXL) [41] by applying our attention sharing overall
70self-attention layers of the model. The generation of a
four images set takes29seconds on a singleA100GPU.
Notice that since the generation of the reference image is
Table 1. User evaluation for style aligned image set generation. In each question, the user was asked to select which of two image sets is better in terms of style consistency and match to the text descriptions (see Sec. 4.2). We report the percentage of judgments in favor of StyleAligned over 800 answers per baseline (2400 in total).

Baseline: StyleDrop (unofficial MUSE) | StyleDrop (SDXL) | DreamBooth–LoRA (SDXL)
In favor of StyleAligned: 85.2% | 67.1% | 61.3%
not influenced by other images in the batch, we can generate
larger sets by fixing the prompt and seed of the reference
image across the set generation.
For example, see the sets in Fig..
Evaluation set. With the support of ChatGPT, we have generated 100 text prompts describing different image styles over four random objects. For example, “{A guitar, A hot air balloon, A sailboat, A mountain} in papercut art style.” For each style and set of objects, we use our method to generate a set of images. The full list of prompts is provided in the appendix.
Metrics.To verify that each generated image contains
its specified object, we measure the CLIP cosine similar-
ity [42] between the image and the text description of the
object. In addition, we evaluate the style consistency of
each generated set, by measuring the pairwise average co-
sine similarity between DINO VIT-B/8 [9] embeddings of
the generated images in each set. Following [47,], we
used DINO embeddings instead of CLIP image embeddings
for measuring image similarity, since CLIP was trained with
class labels and therefore it might give a high score for dif-
ferent images in the set that have similar content but with
a different style. On the other hand, DINO better distin-
guishes between different styles due to its self-supervised
training.
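The two metrics can be computed roughly as follows; clip_image_embed, clip_text_embed, and dino_embed are placeholders for CLIP and DINO ViT-B/8 encoders (their loading is left out), so this is an evaluation sketch rather than the exact protocol code.

```python
import itertools
import torch
import torch.nn.functional as F

def evaluate_set(images, object_prompts, clip_image_embed, clip_text_embed, dino_embed):
    # Text alignment: mean CLIP cosine similarity between each image and its object prompt.
    img_e = F.normalize(clip_image_embed(images), dim=-1)        # (n, d)
    txt_e = F.normalize(clip_text_embed(object_prompts), dim=-1) # (n, d)
    clip_score = (img_e * txt_e).sum(dim=-1).mean()
    # Set consistency: average pairwise cosine similarity of DINO embeddings.
    dino_e = F.normalize(dino_embed(images), dim=-1)             # (n, d)
    pairs = list(itertools.combinations(range(len(images)), 2))
    dino_score = torch.stack([(dino_e[i] * dino_e[j]).sum() for i, j in pairs]).mean()
    return clip_score.item(), dino_score.item()
```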
4.1. Ablation Study
The quantitative results are summarized in Fig. 6, where a placement toward the top-right of the chart means better text similarity and style consistency, respectively. As a reference, we report the score obtained by generating the set of images using SDXL (T2I Reference) with the same seeds and without any intervention. As can be seen, our method achieves a much higher style consistency score at the expense of text similarity. See the qualitative comparison in Fig. 2.
In addition, we compared our method to two additional variants of the shared attention, as described in Sec. 3.2. The first variant uses full attention sharing (Full Attn. Share), where the keys and values are shared between each pair of images in the set. In the second variant (W.O. AdaIN), we omit the AdaIN operation over the queries and keys. As expected, the Full Attn. Share variant results in higher style consistency and lower text alignment. As can be seen

Figure 7. Qualitative comparison to personalization based methods. Columns: Reference image, StyleAligned, StyleDrop (SDXL), DreamBooth–LoRA (SDXL). Rows show generated sets for object prompts such as “A cute bear”, “A moose”, “A baby penguin”, “A cute koala”, “A woman walking her dogs”, “A friendly robot”, “Cherry blossom”, “Read in the park”, “A camera”, “A tall hill”, “A cabin”, “A saxophone”, “A compass”, “Scones”, “Socks”, “A wise owl”, “Full moon”, “A book”, “An armchair”, “A laptop”, “A guitar”, “A hot air balloon”, “A mountain”, “A sailboat”, “A duck”, “A ship”, “A rocket”, and “A pineapple”.

Figure 8. Varying level of attention sharing. By reducing the number of shared attention layers, i.e., allowing only self-attention in part of the layers, we can get more varied results (bottom rows) at the expense of style alignment (top row). Columns: 100% shared attn., 50% shared attn., 10% shared attn. Reference image prompt: “A woman jogging in a flat illustration round logo.” Target prompts: “Girl playing chess”, “Boy playing domino”, “Man surfing”, “Woman rowing”.
in Fig. 5, Full Attn. Share harms the overall quality of the image sets and the diversity across sets. Moreover, our method without the use of AdaIN results in much lower style consistency. Qualitative results can be seen in Fig. 5.
4.2. Comparisons
For baselines, we compare our method to T2I personalization methods. We trained StyleDrop [55] and DreamBooth [47] over the first image in each set of our evaluation data, and use the trained personalized weights to generate the additional three images in each set. We use a public unofficial implementation of StyleDrop¹ (SDRP–unofficial) over a non-autoregressive T2I model. Due to the large quality gap between the unofficial MUSE model² and the official one [10], we follow StyleDrop and implement an adapter model over SDXL (SDRP–SDXL), where we train a low-rank linear layer after each feed-forward layer in the model's attention blocks. For training DreamBooth, we adapt the LoRA [25,] variant (DB–LoRA) over SDXL using the public huggingface–diffusers implementation³. We follow the hyperparameter tuning reported in [55] and train both SDRP–SDXL and DB–LoRA for 400 steps to prevent overfitting to the style training image.

¹ github.com/aim-uofa/StyleDrop-PyTorch
² github.com/baaivision/MUSE-Pytorch
³ github.com/huggingface/diffusers
As can be seen in the qualitative comparison in Fig. 7, the image sets generated by our method are more consistent across style attributes like color palette, drawing style, composition, and pose. Moreover, the personalization-based methods may leak the content of the training reference image (on the left) when generating the new images; for example, see the repeated woman and dogs in the results of DB–LoRA and SDRP–SDXL in the second row, or the repeated owl in the bottom row. Similarly, because of the content leakage, these methods obtain lower text similarity scores and higher set consistency scores compared to our method.
We also apply three encoder-based personalization methods, ELITE [64], IP–Adapter [66], and BLIP–Diffusion [32], over our evaluation set. These methods receive as input the first image in each set and use its embeddings to generate images with other content. Unlike the optimization-based techniques, these methods operate in a much faster feed-forward diffusion loop, like our method. However, as can be seen in Fig. 6, their performance for style aligned image generation is poor compared to the other baselines. We argue that current encoder-based personalization techniques struggle to disentangle the content and the style of the input image. We supply qualitative results in the appendix.
Figure 9. Style aligned image generation to an input image. Given an input reference image (left column) and a text description, we first apply DDIM inversion over the image to get the inverted diffusion trajectory x_T, x_{T−1}, ..., x_0. Then, starting from x_T and a new set of prompts (a car, a cat, a cactus), we apply our method to generate new content (right columns) with a style aligned to the input.

User Study. In addition to the automatic evaluation, we conducted a user study over the results of our
method, StyleDrop (unofficial MUSE), StyleDrop (SDXL),
and DreamBooth–LoRA (SDXL). In each question, we randomly sample one of the evaluation examples and show the user the 4-image sets produced by our method and by another method (in a random order). The user had to choose which set is better in terms of style consistency and text alignment. A screenshot of the user study format is provided in the appendix. Overall, we collected 2400 answers from 100 users using the Amazon Mechanical Turk service. The results are summarized in Tab. 1, where we report the percentage of judgments in our favor. As can be seen, most participants favored our method by a large margin. More information about our user study can be found in the appendix.
4.3. Additional Results
Style Alignment Control. We provide a means of control over the style alignment to the reference image by applying the shared attention over only part of the self-attention layers. As can be seen in Fig. 8, reducing the number of shared attention layers results in a more diverse image set, which still shares common attributes with the reference image.
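One simple way to expose this control is to pick which self-attention layers use the shared attention. The sketch below is our own illustration with a uniform-spacing heuristic that the paper does not specify; it returns the indices of the layers to share for a given ratio.

```python
import torch

def select_shared_layers(num_layers: int, share_ratio: float) -> set:
    # share_ratio = 1.0 shares all layers (all 70 SDXL self-attention layers in our
    # setting); smaller ratios leave plain self-attention in the remaining layers.
    num_shared = round(num_layers * share_ratio)
    if num_shared == 0:
        return set()
    idx = torch.linspace(0, num_layers - 1, num_shared).round().long()
    return set(idx.tolist())

# Example: share attention in only half of the 70 self-attention layers.
shared = select_shared_layers(70, 0.5)
# Inside the diffusion U-Net, layer i would then call shared_attention(...) if
# i is in `shared`, and plain self_attention(...) otherwise.
```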
StyleAligned from an Input Image. To generate images style-aligned to an input image, we apply DDIM inversion [56] using a provided text caption. Then, we apply our method to generate new images in the style of the input using the inverted diffusion trajectory x_T, x_{T−1}, ..., x_0 for the reference image. Examples are shown in Fig. 9, where we use BLIP captioning [33] to get a caption for each input image. For example, we used the prompt “A render of a house with a yellow roof” for the DDIM inversion of the top example and replaced the word house with other objects to generate the style-aligned images of a car, a cat, and a cactus. Notice that this method does not require any optimization. However, DDIM inversion may fail [36] or result in an erroneous trajectory [28]. More results and analysis are provided in the appendix.
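A minimal sketch of deterministic DDIM inversion under the notation above; eps_model is a placeholder for the text-conditioned denoiser and alphas is the cumulative schedule with alphas[0] ≈ 1, so this illustrates the idea rather than reproducing a specific implementation.

```python
import math
import torch

@torch.no_grad()
def ddim_invert(x0: torch.Tensor, eps_model, alphas, prompt_emb):
    # Returns the inverted trajectory x_0, x_1, ..., x_T used to seed the reference image.
    traj, x = [x0], x0
    for t in range(1, len(alphas)):
        a_cur, a_next = alphas[t - 1], alphas[t]
        eps = eps_model(x, t - 1, prompt_emb)   # noise prediction at the current level
        # Estimate x_0 from the current latent, then re-noise it to the next (higher) level.
        x0_pred = (x - math.sqrt(1.0 - a_cur) * eps) / math.sqrt(a_cur)
        x = math.sqrt(a_next) * x0_pred + math.sqrt(1.0 - a_next) * eps
        traj.append(x)
    return traj
```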
Shared Self-Attention Visualization. Figure 10 visualizes the self-attention probabilities from a generated target image to the reference style image. In each of the rows, we pick a point on the image and depict the associated probabilities map for the token at this particular point. Notably, the probabilities mapped onto the reference image are semantically close to the query point location. This suggests that the self-attention token sharing does not perform a global style transfer, but rather matches the styles in a semantically meaningful way [4]. In addition, Figure 11 shows the three largest components of the average shared attention maps of the rhino image, encoded in RGB channels. Note that the shared attention map is composed of both self-attention and cross-image attention to the giraffe. As can be seen, the components highlight semantically related regions like the bodies, heads, and the background in the images.

Figure 10. Self-attention probabilities maps from different generated image locations (Key locations column: “A car ...”, “A bull ...”) to the reference train image with the target style (top-left, “A train, simple wooden statue”).

Figure 11. Principal components of the shared attention map. On the right, we visualize the principal components of the shared attention map between the reference giraffe and the target rhino generated images. The three largest components of the shared maps are encoded in RGB channels.
StyleAligned with Other Methods. Since our method doesn't require training or optimization, it can be easily combined on top of other diffusion-based methods to generate style-consistent image sets. Fig. 12 shows such examples, where we combine our method with ControlNet [67], DreamBooth [48], and MultiDiffusion [6]. More examples and details about the integration of StyleAligned with other methods can be found in the appendix.
5. Conclusions
We have presented StyleAligned, which addresses the
challenge of achieving style-aligned image generation
within the realm of large-scale Text-to-Image models.
By introducing minimal attention sharing operations with
AdaIN modulation during the diffusion process, our method

Figure 12. StyleAligned with other methods. On top, StyleAligned is combined with ControlNet to generate style-aligned images conditioned on depth maps. In the middle, our method is combined with MultiDiffusion to generate panorama images that share multiple styles (left and right references). On the bottom, style-consistent and personalized content is created by combining our method with pre-trained personalized DreamBooth–LoRA models.
successfully establishes style-consistency and visual coher-
ence across generated images. The demonstrated efficacy
of StyleAligned in producing high-quality, style-consistent
images across diverse styles and textual prompts under-
scores its potential in creative domains and practical appli-
cations. Our results affirm StyleAligned's capability to faith-
fully adhere to provided descriptions and reference styles
while maintaining impressive synthesis quality.
In the future we would like to explore the scalability and
adaptability of StyleAligned to have more control over the
shape and appearance similarity among the generated im-
ages. Additionally, due to the limitation of current diffu-
sion inversion methods, a promising direction is to leverage
StyleAligned to assemble a style-aligned dataset which then
can be used to train style condition text-to-image models.
6. Acknowledgement
We thank Or Patashnik, Matan Cohen, Yael Pritch, and
Yael Vinker for their valuable inputs that helped improve
this work.
References
[1] https://huggingface.co/diffusers/controlnet-depth-sdxl-1.0, 2023.
[2] https://huggingface.co/thibaud/controlnet-openpose-sdxl-1.0, 2023.
[3] https://huggingface.co/docs/diffusers/api/pipelines/panorama, 2023.
[4]
Elor, and Daniel Cohen-Or. Cross-image attention for zero-
shot appearance transfer, 2023.,
[5]
latent diffusion.ACM Trans. Graph., 42(4), jul 2023.
[6]
Multidiffusion: Fusing diffusion paths for controlled image
generation. 2023.,,
[7]
Wang, Linjie Li, LongOuyang, JuntangZhuang, JoyceLee,
YufeiGuo, WesamManassra, PrafullaDhariwal, CaseyChu,
YunxinJiao, and Aditya Ramesh. Improving image gener-
ation with better captions. 2023.
[8]
aohu Qie, and Yinqiang Zheng. MasaCtrl: tuning-free mu-
tual self-attention control for consistent image synthesis and
editing. InProceedings of the IEEE/CVF International Con-
ference on Computer Vision (ICCV), pages 22560–22570,
October 2023.
[9] ´e J´egou,
Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg-
ing properties in self-supervised vision transformers. InPro-
ceedings of the International Conference on Computer Vi-
sion (ICCV), 2021.
[10]
José Lezama, Lu Jiang, Ming Yang, Kevin P. Murphy,
William T. Freeman, Michael Rubinstein, Yuanzhen Li, and
Dilip Krishnan. Muse: Text-to-image generation via masked
generative transformers. InInternational Conference on Ma-
chine Learning, 2023.,
[11]
Daniel Cohen-Or. Attend-and-excite: Attention-based se-
mantic guidance for text-to-image diffusion models.ACM
Transactions on Graphics (TOG), 42:1 – 10, 2023.
[12]
Matthieu Cord. Diffedit: Diffusion-based semantic image
editing with mask guidance. InThe Eleventh International
Conference on Learning Representations, 2022.
[13]
texture synthesis and transfer. InSeminal Graphics Papers:
Pushing the Boundaries, Volume 2, pages 571–576. 2023.

Reference Image Figure 13.Various remarkable places depicted with the style taken from Bruegels’“The Tower of Babel”.
Top row: Rome Colosseum, Rio de Janeiro, Seattle Space Needle.
[14]
by non-parametric sampling. InProceedings of the sev-
enth IEEE international conference on computer vision, vol-
ume 2, pages 1033–1038. IEEE, 1999.
[15]
Aleksander Holynski. Diffusion self-guidance for control-
lable image generation.arXiv preprint arXiv:2306.00986,
2023.
[16]
Amit Haim Bermano, Gal Chechik, and Daniel Cohen-or. An
image is worth one word: Personalizing text-to-image gener-
ation using textual inversion. InThe Eleventh International
Conference on Learning Representations, 2022.,
[17]
Gal Chechik, and Daniel Cohen-Or. Encoder-based domain
tuning for fast personalization of text-to-image models.ACM
Transactions on Graphics (TOG), 42(4):1–13, 2023.
[18]
age style transfer using convolutional neural networks. In
Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 2414–2423, 2016.
[19]
Dimitris N. Metaxas, and Feng Yang. Svdiff: Com-
pact parameter space for diffusion fine-tuning.ArXiv,
abs/2303.11305, 2023.,
[20]
Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt im-
age editing with cross attention control.arXiv preprint
arXiv:2208.01626, 2022.
[21]
Yael Pritch, and Daniel Cohen-or. Prompt-to-prompt image
editing with cross-attention control. InThe Eleventh Inter-
national Conference on Learning Representations, 2022.
[22]
Curless, and David H Salesin. Image analogies. InSem-
inal Graphics Papers: Pushing the Boundaries, Volume 2,
pages 557–570. 2023.
[23]
sion probabilistic models. InProc. NeurIPS, 2020.
[24]
Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona
Attariyan, and Sylvain Gelly. Parameter-efficient transfer

learning for nlp. InInternational Conference on Machine
Learning, pages 2790–2799. PMLR, 2019.
[25]
Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-
rank adaptation of large language models. InInternational
Conference on Learning Representations, 2021.,
[26]
real-time with adaptive instance normalization. InProceed-
ings of the IEEE international conference on computer vi-
sion, pages 1501–1510, 2017.,
[27]
Multimodal unsupervised image-to-image translation. In
Proceedings of the European conference on computer vision
(ECCV), pages 172–189, 2018.
[28]
Michaeli. An edit friendly ddpm noise space: Inversion and
manipulations.arXiv preprint arXiv:2304.06140, 2023.,
[29]
generator architecture for generative adversarial networks. In
Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 4401–4410, 2019.
[30]
Shechtman, and Jun-Yan Zhu. Multi-concept customization
of text-to-image diffusion. InProceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
pages 1931–1941, 2023.
[31]
Kyunghyun Yoon. Directional texture transfer. InPro-
ceedings of the 8th International Symposium on Non-
Photorealistic Animation and Rendering, pages 43–48,
2010.
[32]
Diffusion: Pre-trained subject representation for con-
trollable text-to-image generation and editing.ArXiv,
abs/2305.14720, 2023.,,
[33]
Blip: Bootstrapping language-image pre-training for unified
vision-language understanding and generation, 2022.
[34]
Aila, Jaakko Lehtinen, and Jan Kautz. Few-shot unsuper-
vised image-to-image translation. InIEEE International
Conference on Computer Vision (ICCV), 2019.
[35]
jun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided
image synthesis and editing with stochastic differential equa-
tions. InICLR, 2021.
[36]
Daniel Cohen-Or. Null-text inversion for editing real im-
ages using guided diffusion models. InProceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 6038–6047, 2023.,
[37]
Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and
Mark Chen. Glide: Towards photorealistic image generation
and editing with text-guided diffusion models. InInterna-
tional Conference on Machine Learning, 2021.
[38]
Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, and
Trevor Darrell. Shape-guided diffusion with inside-outside
attention. InProceedings of the IEEE/CVF Winter Confer-
ence on Applications of Computer Vision, 2024.
[39]
Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image
translation. InACM SIGGRAPH 2023 Conference Proceed-
ings, pages 1–11, 2023.
[40]
Elor, and Daniel Cohen-Or. Localizing object-level shape
variations with text-to-image diffusion models.ArXiv,
abs/2303.11306, 2023.
[41]
Tim Dockhorn, Jonas Muller, Joe Penna, and Robin Rom-
bach. SDXL: Improving latent diffusion models for high-
resolution image synthesis.ArXiv, abs/2307.01952, 2023.,
3,
[42]
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen
Krueger, and Ilya Sutskever. Learning transferable visual
models from natural language supervision. InInternational
Conference on Machine Learning, 2021.
[43]
Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever.
Zero-shot text-to-image generation. InInternational Confer-
ence on Machine Learning, pages 8821–8831. PMLR, 2021.
1
[44]
Esser, and Björn Ommer. High-resolution image synthesis
with latent diffusion models.2022 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pages
10674–10685, 2021.
[45]
Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models. InProceedings of
the IEEE/CVF conference on computer vision and pattern
recognition, pages 10684–10695, 2022.
[46]
net: Convolutional networks for biomedical image segmen-
tation. InMedical Image Computing and Computer-Assisted
Intervention–MICCAI 2015: 18th International Conference,
Munich, Germany, October 5-9, 2015, Proceedings, Part III
18, pages 234–241. Springer, 2015.
[47]
Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine
tuning text-to-image diffusion models for subject-driven
generation. InProceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 22500–
22510, 2023.,
[48]
Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein,
and Kfir Aberman. Hyperdreambooth: Hypernetworks for
fast personalization of text-to-image models.arXiv preprint
arXiv:2307.06949, 2023.,,
[49]
diffusion fine-tuning.https : / / github . com /
cloneofsimo/lora, 2022.

[50]
Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mo-
hammad Norouzi. Palette: Image-to-image diffusion mod-
els.ACM SIGGRAPH 2022 Conference Proceedings, 2021.
2
[51]
Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour,
Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans,
et al. Photorealistic text-to-image diffusion models with deep
language understanding.Advances in Neural Information
Processing Systems, 35:36479–36494, 2022.
[52]
Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour,
Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans,
et al. Photorealistic text-to-image diffusion models with deep
language understanding.Advances in Neural Information
Processing Systems, 35:36479–36494, 2022.,
[53]
stantbooth: Personalized text-to-image generation without
test-time finetuning.ArXiv, abs/2304.03411, 2023.
[54]
and Surya Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. InInternational Confer-
ence on Machine Learning, pages 2256–2265. PMLR, 2015.
3
[55]
Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang,
Glenn Entis, Yuanzhen Li, et al. Styledrop: Text-to-image
generation in any style.arXiv preprint arXiv:2306.00983,
2023.,,
[56]
ing diffusion implicit models. InInternational Conference
on Learning Representations, 2020.
[57]
Key-locked rank one editing for text-to-image personaliza-
tion.SIGGRAPH 2023 Conference Proceedings, 2023.,
13
[58]
Dekel. Splicing vit features for semantic appearance transfer.
InProceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 10748–10757, 2022.
2
[59]
Dekel. Plug-and-play diffusion features for text-driven
image-to-image translation. InProceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
pages 1921–1930, 2023.
[60]
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Il-
lia Polosukhin. Attention is all you need. In I. Guyon,
U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
wanathan, and R. Garnett, editors,Advances in Neural Infor-
mation Processing Systems, volume 30. Curran Associates,
Inc., 2017.
[61]
Sketch-guided text-to-image diffusion models.arXiv
preprint arXiv:2211.13752, 2022.

Figure 14. Text-to-image generation with explicit style description (“…, in style of Bruegels’ “the Tower of Babel” Painting”). Unlike our approach, this fails to produce fine and style-aligned results. See Fig.
[62]
man. P+: Extended textual conditioning in text-to-image
generation.ArXiv, abs/2303.09522, 2023.,
[63]
Sheikh. Convolutional pose machines. InCVPR, 2016.
[64] https://github.com/csyxwei/ELITE, 2023.
[65]
Zhang, and Wangmeng Zuo. ELITE: Encoding visual con-
cepts into textual embeddings for customized text-to-image
generation.ArXiv, abs/2302.13848, 2023.
[66]
Adapter: Text compatible image prompt adapter for text-to-
image diffusion models.ArXiv, abs/2308.06721, 2023.,,
13
[67]
conditional control to text-to-image diffusion models. In
ICCV, 2023.,
Appendix
A. StyleAligned from an Input Image
Figure 13 shows an example of style transfer of Pieter Bruegel's “The Tower of Babel” to multiple places around the world. As for the prompt, we always use the place's name followed by “Pieter Bruegel Painting”, e.g., “Rome Coliseum, Pieter Bruegel Painting”. Even though the original masterpiece is known to the model, it fails to reproduce its style with only text guidance. Fig. 14 shows some of the places generated with the direct instruction to resemble the original painting, without self-attention sharing. Notably, the model fails to produce an accurate style alignment with the original picture.
Further examples of style transfer from real reference images are presented in Figures 17 and 18.

We also noticed that once the style transfer is performed from an extremely famous image, the default approach may sometimes completely ignore the target prompt, generating an image almost identical to the reference. We suppose that this happens because the outputs of the denoising model for the famous reference image have very high confidence and activation magnitudes. Thus, in the shared self-attention, most of the attention is taken by the reference keys. To compensate for this, we propose a simple trick of rescaling the attention scores. In the self-attention sharing mechanism, for some fixed scale λ < 1, we rescale the query–key products, producing the new scores λ·⟨Q, K_r⟩; we apply this rescaling only to the reference image keys. First, this suppresses extra-high keys. Also, this makes the attention scores more uniformly distributed, encouraging the generated image to capture style aggregated from the whole reference image. Fig. 15 shows the effect of varying the scale for the particularly popular reference “Starry Night” by Van Gogh. Notably, without rescaling, the model generates an image almost identical to the reference, while the scale relaxation produces a plausible transfer.
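A sketch of this rescaling inside the shared attention, reusing the shapes from the shared_attention sketch in Sec. 3.2; the default value 0.75 is simply the factor shown in Fig. 15, and the function name is ours.

```python
import torch

def rescaled_shared_attention(q_hat, k_r, k_hat_t, v_r, v_t, lam: float = 0.75):
    # Scores against the reference keys are multiplied by lam < 1, which damps the
    # over-confident activations of very famous reference images and spreads the
    # attention more uniformly over the reference.
    d_k = q_hat.shape[-1]
    ref_scores = lam * (q_hat @ k_r.transpose(-1, -2)) / d_k ** 0.5
    tgt_scores = (q_hat @ k_hat_t.transpose(-1, -2)) / d_k ** 0.5
    attn = torch.cat([ref_scores, tgt_scores], dim=-1).softmax(dim=-1)
    return attn @ torch.cat([v_r, v_t], dim=-2)
```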
B. Integration with Other Methods
Below, we show different examples where our method
can provide style aligned image generation capability on top
of different diffusion-based image generation methods.
Style Aligned Subject Driven Generation. To use our method on top of a personalized diffusion model, first, given a collection of images (3–6) of the personalized content, we follow DreamBooth–LoRA training [25,], where the attention layers are fine-tuned via low-rank adaptation (LoRA) weights. Then, during inference, we apply our method by sharing the attention of the personalized generated images with a generated reference style image. During this process, the LoRA weights are used only for the generation of the personalized content. Examples of style aligned personalized images are shown in Fig. 12. The results of our method on top of different personalized models are shown in Fig. 19, where we trained a personalized model over the image collection on top and generated the personalized content with the reference images on the left.
It can be seen that in some cases, like in the backpack
photos on the right, the subject in the image remains in
the same style as in the original photos. This is a known
limitation of training-based personalization methods [57]
which we believe can be improved by applying our method
over other T2I personalization techniques [19,] or more
careful search for training hyperparameters that allow better
generalization of the personalized model to different styles.
Style Aligned MultiDiffusion Image Generation. Bar-Tal et al. [6] presented MultiDiffusion, a method for generating images at any resolution by aggregating diffusion predictions of overlapping square crops.

Figure 15. Reference attention rescaling factor variation (1 → 0.75) used for extremely popular reference image assets.

Our method can be
used on top of MultiDiffusion by enabling our shared at-
tention between the crops to a reference image that is gen-
erated in parallel. Fig. 20 shows panorama images generated with MultiDiffusion in conjunction with our method, using the public implementation of MultiDiffusion over Stable Diffusion V2 [3]. Notice that compared to a vanilla MultiDiffusion image generation (small images in the bottom right corners of Fig. 20), our method not only enables the generation of style aligned panoramas but also helps to preserve the style within each image.
StyleAligned with Additional Conditions. Lastly, we show how our method can be combined with ControlNet [67], which enriches the conditioning signals of diffusion text-to-image generation to include additional inputs, like a depth map or pose. ControlNet injects the additional information by predicting residual features that are added to the diffusion image features output by the down and middle U-Net blocks. Similar to the previous modifications, we apply StyleAligned image generation by sharing the attention of the ControlNet-conditioned images with a reference image that isn't conditioned on an additional input. Fig. 21 shows a style aligned image set (different rows) conditioned on depth maps (different columns) using the ControlNet depth encoder over SDXL [1]. Fig. 22 shows another set (different rows) conditioned on pose estimations obtained by OpenPose [63] (different columns) using the ControlNet pose encoder over SDXL [2].
C. Additional Comparisons
We provide additional comparisons of our method to
encoder-based text-to-image personalization methods and
editing approaches over the evaluation set presented in Sec-
tion 4 in the main paper. Table 2 summarizes the full quantitative results presented in the paper and here.
Encoder Based Approaches. As reported in the paper, we compare our method to encoder-based text-to-image personalization methods: BLIP-Diffusion [32], ELITE [64], and IP-Adapter [66]. These methods train an image encoder and fine-tune the T2I diffusion model to be conditioned on visual input. Fig. 23 shows a qualitative comparison on the same set shown in the paper (Fig. 7). As can be
seen, our image sets are more consistent and aligned to the
reference. Notice that, currently, only IP-Adapter provides
an encoder model for Stable Diffusion XL (SDXL). Nev-

Table 2. Full quantitative comparison for style aligned image generation. We evaluate the generated image sets in terms of text alignment (CLIP score) and set consistency (DINO embedding similarity). ±X denotes the standard deviation of the score across 100 image set results.

Method | Text Alignment (CLIP ↑) | Set Consistency (DINO ↑)
StyleDrop (SDXL) | 0.272±0.04 | 0.529±0.15
StyleDrop (unofficial MUSE) | 0.271±0.04 | 0.301±0.14
DreamBooth-LoRA (SDXL) | 0.276±0.03 | 0.537±0.17
IP-Adapter (SDXL) | 0.281±0.03 | 0.44±0.13
ELITE (SD 1.4) | 0.253±0.03 | 0.481±0.13
BLIP-Diffusion (SD 1.4) | 0.245±0.04 | 0.475±0.12
Prompt-to-Prompt (SDXL) | 0.283±0.03 | 0.454±0.18
SDEdit (SDXL) | 0.274±0.03 | 0.453±0.16
StyleAligned (SDXL) | 0.287±0.03 | 0.51±0.14
StyleAligned (W.O. AdaIN) | 0.289±0.03 | 0.428±0.14
StyleAligned (Full Attn.) | 0.28±0.03 | 0.55±0.15
ertheless, BLIP-Diffusion and ELITE struggle to produce
consistent image sets that match the text descriptions.
Figure 16. Quantitative comparison to zero-shot editing approaches. We compare the results of the different methods in terms of text alignment (CLIP score) and set consistency (DINO embedding similarity). Data points: SDEdit 70%, SDEdit 80%, SDEdit 90%; P2P 20%, P2P 30%, P2P 40%; StyleAligned (Ours) 80%, 90%, 100%.
Zero Shot Editing Approaches. Other baselines that can be used for style aligned image set generation are diffusion-based editing methods applied over the reference images. However, unlike our method, these methods assume structure preservation of the input image. We report the results of two diffusion-based editing approaches, SDEdit [35] and Prompt-to-Prompt (P2P) [21], in Fig. 16. Notice that, similar to our method, these methods provide a level of control that trades off between alignment to the text and alignment to the input image. To get higher text alignment, SDEdit can be applied over an increased percentage of diffusion steps, and P2P can reduce the number of attention injection steps. Our method can achieve higher text alignment, as described in Section 4 in the main paper, by using our shared attention over only a subset of the self-attention layers. Fig. 16 compares the different methods; as can be seen, only our method achieves text alignment while preserving high set consistency.
D. User Study and Evaluation Settings
As described in the main paper, we generate the images
for evaluation using a list of 100 text prompts where each
prompt describes 4 objects in the same style. The full list is
provided at the end of supplementary materials. We eval-
uated the results of the different methods using the auto-
matic CLIP and DINO scores and through user evaluation.
The format of the user study is provided in Fig. 24, where the user has to select between the results of two methods. For
each method from StyleDrop (SDXL), StyleDrop (unoffi-
cial Muse), and DreamBooth-LoRA (SDXL), we collected
800 answers compared to our results. In total, we collected
2400 answers from 100 participants.

Figure 17. Samples of the proposed style transfer technique applied to a variety of different images and target prompts (Space rocket, Boy riding a bicycle, Matterhorn mountain, Mime artist, Seattle needle, Twin boys on a balcony).

Figure 18. Samples of the proposed style transfer technique applied to a variety of different images and target prompts (A bear eating honey, Cats on a roof in Paris, Woman selling fruits, Wolf howling to the moon, Man fishing from a boat, Snowman).

Figure 19. Personalized T2I diffusion with StyleAligned. Each row shows a style aligned image set using the reference image on the left, applied on a different personalized diffusion model, fine-tuned over the personalized content on top. The top two rows were generated using the prompt “[my subject] in the style of a beautiful papercut art.” The bottom two rows were generated using the prompt “[my subject] in beautiful flat design.”, where [my subject] is replaced with the subject name.

Figure 20. MultiDiffusion with StyleAligned. The panoramas were generated with MultiDiffusion [6] using the text prompt beneath each panorama and the left image as reference (“A poster in a flat design style.”, “A poster in a papercut art style.”). Panorama prompts: “Houses in a flat design style.”, “Mountains in a flat design style.”, “Giraffes in a flat design style.”, “A village in a papercut art style.”, “Futuristic city scape in a papercut art style.”, “A jungle in a papercut art style.” The small images in the bottom right corners are the results of MultiDiffusion without our method.

Figure 21. ControlNet Depth with StyleAligned (depth condition and reference image shown in the figure).

Figure 22. ControlNet pose with StyleAligned (pose condition and reference image shown in the figure).

Figure 23. Qualitative comparison to encoder-based personalization methods. Columns: Reference image, StyleAligned, IP-Adapter (SDXL), ELITE (SD 1.4), BLIP Diffusion (SD 1.4). Rows show generated sets for object prompts such as “A cute bear”, “A moose”, “A baby penguin”, “A cute koala”, “A woman walking her dogs”, “A friendly robot”, “Cherry blossom”, “Read in the park”, “A fireman”, “A camera”, “A tall hill”, “A cabin”, “A saxophone”, “A sunflower”, “A compass”, “A bottle of wine”, “Scones”, “A muffin”, “Socks”, “A wise owl”, “Full moon”, “A book”, “An armchair”, “A laptop”, “A guitar”, “A hot air balloon”, “A mountain”, “A sailboat”, “A duck”, “A ship”, “A rocket”, and “A pineapple”.

Figure 24. Screenshot from the user study. Each row of images represents the result obtained by a different method. The user had to assess which row is better in terms of style alignment and text alignment.

List of prompts for our evaluation set generation:
1 .{A h o u s e , A t e m p l e , A dog , A l i o n}i n s t i c k e r s t y l e .
2 .{F l o w e r s , G o l d e n G a t e b r i d g e , A c h a i r , T r e e s , An a i r p l a n e}i n w a t e r c o l o r p a i n t i n g s t y l e .
3 .{A v i l l a g e , A b u i l d i n g , A c h i l d r u n n i n g i n t h e p a r k , A r a c i n g c a r}i n l i n e d r a w i n g s t y l e .
4 .{A phone , A k n i g h t on a h o r s e , A t r a i n p a s s i n g a v i l l a g e , A t o m a t o i n a bowl}i n c a r t o o n l i n e d r a w i n g s t y l e .
5 .{S l i c e s o f w a t e r m e l o n a n d c l o u d s i n t h e b a c k g r o u n d , A f o x , A bowl w i t h c o r n f l a k e s , A model o f a t r u c k}i n 3 d
r e n d e r i n g s t y l e .
6 .{A mushroom , An E l f , A d r a g o n , A d w a r f}i n g l o w i n g s t y l e .
7 .{A t h u m b s up , A crown , An a v o c a d o , A b i g s m i l e y f a c e}i n g l o w i n g 3 d r e n d e r i n g s t y l e .
8 .{A b e a r , A moose , A c u t e k o a l a , A b a b y p e n g u i n}i n k i d c r a y o n d r a w i n g s t y l e .
9 .{An o r c h i d , A V i k i n g f a c e w i t h b e a r d , A b i r d , An e l e p h a n t}i n wooden s c u l p t u r e .
1 0 .{A p o r t r a i t o f a p e r s o n w e a r i n g a h a t , A p o r t r a i t o f a woman w i t h a l o n g h a i r , A p e r s o n d a n c i n g , A p e r s o n f i s h i n g}
i n o i l p a i n t i n g s t y l e .
1 1 .{A woman w a l k i n g a dog , A f r i e n d l y r o b o t , A woman r e a d i n g i n t h e p a r k , C h e r r y b l o s s o m}i n f l a t c a r t o o n i l l u s t r a t i o n
s t y l e .
1 2 .{A b i t h d a y c a k e , The l e t t e r A , An e s p r e s s o m a c h i n e , A C a r}i n a b s t r a c t r a i n b o w c o l o r e d f l o w i n g smoke wave d e s i g n .
1 3 .{A f l o w e r , A p i a n o , A b u t t e r f l y , A g u i t e r}i n m e l t i n g g o l d e n 3 d r e n d e r i n g s t y l e .
14. {A train, A car, A bicycle, An airplane} in minimalist round BW logo.
15. {A rocket, An astronaut, A man riding a snowboard, A pair of rings} in neon graffiti style.
16. {A teapot, A teacup, A stack of books, A cozy armchair} in vintage poster style.
17. {A mountain range, A bear, A campfire, A pine forest} in woodblock print style.
18. {A surfboard, A beach shack, A wave, A seagull} in retro surf art style.
19. {A paintbrush, A sunflower field, A scarecrow, A rustic barn} in a minimal origami style.
20. {A cityscape, Hovering vehicles, Dragons, Boats} in cyberpunk art style.
21. {A treasure box, A pirate ship, A parrot, A skull} in tattoo art style.
22. {Music stand, A vintage microphone, A turtle, A saxophone} in art deco style.
23. {A tropical island, A mushroom, A palm tree, A cocktail} in vintage travel poster style.
24. {A carousel, Cotton candy, A ferris wheel, Balloons} in retro amusement park style.
25. {A serene river, A rowboat, A bridge, A willow tree} in 3D render, animation studio style.
26. {A retro guitar, A jukebox, A chess piece, A milkshake} in 1950s diner art style.
27. {A snowy cabin, A sleigh, A snowman, A winter forest} in Scandinavian folk art style.
28. {A bowl with apples, A pencil, A big armor, A magical sunglasses} in fantasy poison book style.
29. {A kiwi fruit, A set of drums, A hammer, A tree} in Hawaiian sunset painting style.
30. {A guitar, A hot air balloon, A sailboat, A mountain} in paper cut art style.
31. {A coffee cup, A typewriter, A pair of glasses, A vintage camera} in retro hipster style.
32. {A board of backgammon, A shirt and pants, Shoes, A cocktail} in vintage postcard style.
33. {A roaring lion, A soaring eagle, A dolphin, A galloping horse} in tribal tattoo style.
34. {A pizza, Candles and roses, A bottle, A chef} in Japanese ukiyo-e style.
35. {A wise owl, A full moon, A magical chair, A book of spells} in fantasy book cover style.
36. {A cozy cabin, Snow-covered trees, A warming fireplace, A steaming cup of cocoa} in hygge style.
37. {A bottle of wine, A scone, A muffin, Pair of socks} in Zen garden style.
38. {A diver, Bowl of fruits, An astronaut, A carousel} in celestial artwork style.
39. {A horse, A castle, A cow, An old phone} in medieval fantasy illustration style.
40. {A mysterious forest, Bioluminescent plants, A graveyard, A train station} in enchanted 3D rendering style.
41. {A globe, An airplane, A suitcase, A compass} in travel agency logo style.
42. {A Persian cat playing with a ball of wool, A man skiing down the hill, A train at the station, A bear eating honey} in cafe logo style.
43. {A book, A quill pen, An inkwell, An umbrella} in educational institution logo style.
44. {A hat, A strawberry, A screw, A giraffe} in mechanical repair shop logo style.
45. {A notebook, A running shoe, A robot, A calculator} in healthcare and medical clinic logo style.
46. {A rubber duck, A pirate ship, A rocket, A pineapple} in doodle art style.
47. {A trumpet, A fish bowl, A palm tree, A bicycle} in abstract geometric style.
48. {A teapot, A kangaroo, A skyscraper, A lighthouse} in mosaic art style.
49. {A ninja, A hot air balloon, A submarine, A watermelon} in paper collage style.
50. {A saxophone, A sunflower, A compass, A laptop} in origami style.
51. {A penguin, A bicycle, A tornado, A pineapple} in abstract graffiti style.
52. {A magician's hat, A UFO, A roller coaster, A beach ball} in street art style.
53. {A cactus, A shopping cart, A child playing with cubes, A camera} in mixed media art style.
54. {A snowman, A surfboard, A helicopter, A cappuccino} in abstract expressionism style.
55. {A robot, A cupcake, A woman playing basketball, A sunflower} in digital glitch art style.
56. {A treehouse, A disco ball, A sailing boat, A cocktail} in psychedelic art style.
57. {A football helmet, A playmobil, A truck, A watch} in street art graffiti style.
58. {A cabin, A leopard, A squirrel, A rose} in pop art style.
59. {A bus, A drum, A rabbit, A shopping mall} in minimalist surrealism style.
60. {A frisbee, A monkey, A snake, Skates} in abstract cubism style.
61. {A piano, A villa, A snowboard, A rubber duck} in abstract impressionism style.
62. {A laptop, A man playing soccer, A woman playing tennis, A rolling chair} in post-modern art style.
63. {A cute puppet, A glass of beer, A violin, A child playing with a kite} in neo-futurism style.
64. {A dog, A brick house, A lollipop, A woman playing on a guitar} in abstract constructivism style.
65. {A kite surfing, A pizza, A child doing homework, A person doing yoga} in fluid art style.
66. {Ice cream, A vintage typewriter, A pair of reading glasses, A handwritten letter} in macro photography style.
67. {A gourmet burger, A sushi, A milkshake, A pizza} in professional food photography style for a menu.
68. {A crystal vase, A pocket watch, A compass, A leather-bound journal} in vintage still life photography style.
69. {A sake set, A stack of books, A cozy blanket, A cup of hot cocoa} in miniature model style.
70. {A retro bicycle, A sun hat, A picnic basket, A kite} in outdoor lifestyle photography style.
71. {A group of hikers on a mountain trail, A winter evening by the fire, A hen, A person enjoying music} in realistic 3D render.
72. {A tent, A person knitting, A rural farm scene, A basket of fresh eggs} in retro music and vinyl photography style.
73. {A giraffe, A blanket, A fork and knife, A pile of candies} in cozy winter lifestyle photography style.
74. {A wildflower, A ladybug, An igloo in Antarctica, A person running} in bokeh photography style.
75. {A coffee machine, A laptop, A person working, A plant on the desk} in minimal flat design style.
76. {A camera, A fireman, A wooden house, A tall hill} in minimal vector art style.
77. {A person texting, A person scrawling, A cozy chair, A lamp} in minimal pastel colors style.
78. {A smartphone, A book, A dinner table, A glass of wine} in minimal digital art style.
79. {A brush, An artist painting, A girl holding umbrella, A pool table} in minimal abstract illustration style.
80. {A pair of running shoes, A motorcycle, Keys, A fitness machine} in minimal monochromatic style.
81. {A compass rose, A cactus, A zebra, A blizzard} in woodcut print style.
82. {A lantern, A tricycle, A seashell, A swan} in chalk art style.
83. {Magnifying glass, Gorilla, Airplane, Swing} in pixel art style.
84. {Hiking boots, Kangaroo, Ice cream cone, Hammock} in comic book style.
85. {Horseshoe, Vintage typewriter, Snail, Tornado} in vector illustration style.
86. {A lighthouse, A hot air balloon, A cat, A cityscape} in isometric illustration style.
87. {A compass, A violin, A palm tree, A koala} in wireframe 3D style.
88. {Beach umbrella, Rocket ship, Fox, Waterfall} in paper cutout style.
89. {Tree stump, Harp, Chameleon, Canyon} in blueprint style.
90. {Elephant, UFO toy, Flamingo, Lightning bolt} in retro comic book style.
91. {Robot, Temple, Jellyfish, Sofa} in infographic style.
92. {Microscope, Giraffe, Laptop, Rainbow} in geometric shapes style.
93. {Teapot, Dragon toy, Skateboard, Storm cloud} in cartoon line drawing style.
94. {Crystal ball, Carousel horse, Hummingbird, Glacier} in watercolor and ink wash style.
95. {Feather quill, Satellite dish, Deer, Desert scene} in dreamy surreal style.
96. {Map, Saxophone, Mushroom, Dolphin} in steampunk mechanical style.
97. {Anchor, Clock, Globe, Bicycle} in 3D realism style.
98. {Clock, Helicopter, Whale, Starfish} in retro poster style.
99. {Binoculars, Bus, Pillow, Cloud} in bohemian hand-drawn style.
100. {Rhino, Telescope, Stool, Panda} in vintage stamp style.
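Each set above follows the same template: four object descriptions in braces followed by a shared style suffix. As a minimal sketch only (this is not the paper's released evaluation code, and the helper name expand_prompt_set is hypothetical), the listing below illustrates how one bracketed set expands into the four concrete prompts that are generated together as a single style-aligned batch.

    # Minimal sketch (hypothetical helper, not the authors' code): expand a prompt
    # set of the form "{object_1, ..., object_4} in <style suffix>" into the four
    # concrete prompts generated together as one style-aligned batch.
    def expand_prompt_set(objects, style_suffix):
        """Return one prompt per object, each ending with the shared style suffix."""
        return [f"{obj} in {style_suffix}" for obj in objects]

    # Example, using set 3 from the list above:
    prompts = expand_prompt_set(
        ["A village", "A building", "A child running in the park", "A racing car"],
        "line drawing style.",
    )
    # -> ["A village in line drawing style.", "A building in line drawing style.", ...]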