Published as a conference paper at ICLR 2020
THE CURIOUS CASE OF
NEURAL TEXT DEGENERATION

Ari Holtzman†‡   Jan Buys§†   Li Du†   Maxwell Forbes†‡   Yejin Choi†‡
†Paul G. Allen School of Computer Science & Engineering, University of Washington
‡Allen Institute for Artificial Intelligence
§Department of Computer Science, University of Cape Town
{ahai,dul2,mbforbes,yejin}@cs.washington.edu, jbuys@cs.uct.ac.za
ABSTRACT
Despite considerable advances in neural language modeling, it remains an open question what the best decoding strategy is for text generation from a language model (e.g. to generate a story). The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, maximization-based decoding methods such as beam search lead to degeneration — output text that is bland, incoherent, or gets stuck in repetitive loops.
To address this we propose Nucleus Sampling, a simple but effective method to draw considerably higher quality text out of neural language models than previous decoding strategies. Our approach avoids text degeneration by truncating the unreliable tail of the probability distribution, sampling from the dynamic nucleus of tokens containing the vast majority of the probability mass.
To properly examine current maximization-based and stochastic decoding methods, we compare generations from each of these methods to the distribution of human text along several axes such as likelihood, diversity, and repetition. Our results show that (1) maximization is an inappropriate decoding objective for open-ended text generation, (2) the probability distributions of the best current language models have an unreliable tail which needs to be truncated during generation and (3) Nucleus Sampling is currently the best available decoding strategy for generating long-form text that is both high-quality — as measured by human evaluation — and as diverse as human-written text.
                    
  
Figure 1: Even with substantial human context and the powerful GPT-2 Large language model, Beam Search (size 32) leads to degenerate repetition (highlighted in blue) while pure sampling leads to incoherent gibberish (highlighted in red). When b ≥ 64, both GPT-2 Large and XL (774M and 1542M parameters, respectively) prefer to stop generating immediately after the given context.
1 INTRODUCTION
On February 14th 2019, OpenAI surprised the scientific community with an impressively high-quality article about Ovid's Unicorn, written by GPT-2.¹ Notably, the top-quality generations obtained from the model rely on randomness in the decoding method, in particular through top-k sampling that samples the next word from the top k most probable choices (Fan et al., 2018; Holtzman et al., 2018; Radford et al., 2019), instead of aiming to decode text that maximizes likelihood.

¹ https://openai.com/blog/better-language-models/
In fact, decoding strategies that optimize for output with high probability, such as beam search, lead
to text that is incredibly degenerate, even when using state-of-the-art models such as GPT-2 Large,
as shown in Figure 1. This may seem counter-intuitive, as one would expect that good models
would assign higher probability to more human-like, grammatical text. Indeed, language models
do generally assign high scores to well-formed text, yet the highest scores for longer texts are often
generic, repetitive, and awkward. Figure 2 exposes how different the distribution of probabilities
assigned to beam search decoded text and naturally occurring text really are.
Perhaps equally surprising is the right side of Figure 1, which shows that pure sampling — sampling
directly from the probabilities predicted by the model — results in text that is incoherent and almost
unrelated to the context. Why is text produced by pure sampling so degenerate? In this work we
show that the “unreliable tail” is to blame. This unreliable tail is composed of tens of thousands of
candidate tokens with relatively low probability that are over-represented in the aggregate.
To overcome these issues we introduce Nucleus Sampling (§3.1). The key intuition of Nucleus Sampling is that the vast majority of probability mass at each time step is concentrated in the nucleus, a small subset of the vocabulary that tends to range between one and a thousand candidates. Instead of relying on a fixed top-k, or using a temperature parameter to control the shape of the distribution without sufficiently suppressing the unreliable tail, we propose sampling from the top-p portion of
the probability mass, expanding and contracting the candidate pool dynamically.
In order to compare current methods to Nucleus Sampling, we compare various distributional properties of generated text to the reference distribution, such as the likelihood of veering into repetition and the perplexity of generated text.
Figure 2: The probability assigned to tokens generated by Beam Search and humans, given the same context. Note the increased variance that characterizes human text, in contrast with the endless repetition of text decoded by Beam Search.
The latter reveals that text generated by maximization or top-k sampling is too probable, indicating a lack of diversity and divergence in vocabulary usage from the human distribution. On the other hand, pure sampling produces text that is significantly less likely than the gold, corresponding to lower generation quality.
Vocabulary usage and Self-BLEU (Zhu et al., 2018) statistics reveal that high values of k are needed to make top-k sampling match human statistics. Yet, generations based on high values of k often have high variance in likelihood, hinting at qualitatively observable incoherency issues. Nucleus Sampling can easily match reference perplexity through tuning the value of p, avoiding the incoherence caused by setting k high enough to match distributional statistics.
Finally, we perform Human Unified with Statistical Evaluation (HUSE; Hashimoto et al., 2019) to jointly assess the overall quality and diversity of the decoding strategies, which cannot be captured using either human or automatic evaluation alone. The HUSE evaluation demonstrates that Nucleus Sampling is the best overall decoding strategy. We include generated examples for qualitative analysis – see Figure 3 for a representative example, and further examples in the appendix.²

² Code and all generations are available at https://github.com/ari-holtzman/degen
2 BACKGROUND
2.1 TEXT GENERATION DECODING STRATEGIES
A number of recent works have alluded to the disadvantages of generation by maximization, which
tend to generate output with high grammaticality but low diversity (Kulikov et al., 2019; Holtzman
et al., 2018; Fan et al., 2018). Generative Adversarial Networks (GANs) have been a prominent
research direction (Yu et al., 2017; Xu et al., 2018), but recent work has shown that when qual-
ity and diversity are considered jointly, GAN-generated text fails to outperform generations from
language models (Caccia et al., 2018; Tevet et al., 2019; Semeniuta et al., 2018). Work on neural dialog systems has proposed methods for diverse beam search, using a task-specific diversity scoring function or constraining beam hypotheses to be sufficiently different (Li et al., 2016a; Vijayakumar et al., 2018; Kulikov et al., 2019; Pal et al., 2006). While such utility functions encourage desirable
properties in generations, they do not remove the need to choose an appropriate decoding strategy,
and we believe that Nucleus Sampling will have complementary advantages in such approaches.
Finally, Welleck et al. (2020) begin to address the problem of neural text degeneration through an
“unlikelihood loss”, which decreases training loss on repeated tokens and thus implicitly reduces
gradients on frequent tokens as well. Our focus is on exposing neural text degeneration and provid-
ing a decoding solution that can be used with arbitrary models, but future work will likely combine
training-time and inference-time solutions.
2.2 OPEN-ENDED VS DIRECTED GENERATION
Many text generation tasks are defined through (input, output) pairs, such that the output is a constrained transformation of the input. Example applications include machine translation (Bahdanau
et al., 2015), data-to-text generation (Wiseman et al., 2017), and summarization (Nallapati et al.,
2016). We refer to these tasks as directed generation. Typically encoder-decoder architectures
are used, often with an attention mechanism (Bahdanau et al., 2015; Luong et al., 2015) or using
attention-based architectures such as the Transformer (Vaswani et al., 2017). Generation is usually
performed using beam search; since output is tightly scoped by the input, repetition and generic-
ness are not as problematic. Still, similar issues have been reported when using large beam sizes
(Koehn & Knowles, 2017) and more recently with exact inference (Stahlberg & Byrne, 2019), a
counter-intuitive observation since more comprehensive search helps maximize probability.
Open-ended generation, which includes conditional story generation and contextual text continua-
tion (as in Figure 1), has recently become a promising research direction due to significant advances
in neural language models (Clark et al., 2018; Holtzman et al., 2018; Fan et al., 2018; Peng et al.,
2018; Radford et al., 2019). While the input context restricts the space of acceptable output genera-
tions, there is a considerable degree of freedom in what can plausibly come next, unlike in directed
generation settings. Our work addresses the challenges faced by neural text generation with this
increased level of freedom, but we note that some tasks, such as goal-oriented dialog, may fall
somewhere in between open-ended and directed generation.
3 LANGUAGE MODEL DECODING
Given an input text passage as context, the task of open-ended generation is to generate text that forms a coherent continuation from the given context. More formally, given a sequence of m tokens x_1 ... x_m as context, the task is to generate the next n continuation tokens to obtain the completed sequence x_1 ... x_{m+n}. We assume that models compute P(x_{1:m+n}) using the common left-to-right decomposition of the text probability,

P(x_{1:m+n}) = \prod_{i=1}^{m+n} P(x_i \mid x_1 \ldots x_{i-1}),    (1)

which is used to generate text token-by-token using a particular decoding strategy.
Maximization-based decoding. The most commonly used decoding objective, in particular for directed generation, is maximization-based decoding. Assuming that the model assigns higher probability to higher quality text, these decoding strategies search for the continuation with the highest likelihood.
Figure 3: Example generations continuing an initial sentence. Maximization and top-k truncation
methods lead to copious repetition (highlighted in blue), while sampling with and without tempera-
ture tends to lead to incoherence (highlighted in red). Nucleus Sampling largely avoids both issues.
Since finding the optimum argmax sequence from recurrent neural language models or
Transformers is not tractable (Chen et al., 2018), common practice is to use beam search (Li et al.,
2016b; Shen et al., 2017; Wiseman et al., 2017). However, several recent studies on open-ended
generation have reported that maximization-based decoding does not lead to high quality text (Fan
et al., 2018; Holtzman et al., 2018).
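As a concrete illustration, the sketch below implements greedy (argmax) decoding, the simplest maximization-based strategy. Here `next_token_logits` is a hypothetical stand-in for any left-to-right language model interface (e.g. a GPT-2 wrapper); it is not part of the paper's released code.

```python
import numpy as np

def greedy_decode(next_token_logits, context_ids, max_new_tokens=200, eos_id=None):
    # next_token_logits: callable mapping a list of token ids to a logit vector
    # over the vocabulary (hypothetical model interface, an assumption here).
    ids = list(context_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)      # shape: (vocab_size,)
        next_id = int(np.argmax(logits))     # maximization: always take the argmax token
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids
```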
3.1 NUCLEUS SAMPLING
We propose a new stochastic decoding method: Nucleus Sampling. The key idea is to use the shape of the probability distribution to determine the set of tokens to be sampled from. Given a distribution P(x | x_{1:i-1}), we define its top-p vocabulary V^{(p)} ⊆ V as the smallest set such that

\sum_{x \in V^{(p)}} P(x \mid x_{1:i-1}) \geq p.    (2)
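A minimal single-step sketch of Nucleus Sampling following Equation 2, operating on a probability vector over the vocabulary; variable names are illustrative rather than taken from the released implementation, and the truncated distribution is rescaled to sum to one before sampling.

```python
import numpy as np

def nucleus_sample(probs, p=0.95, rng=None):
    # probs: probability vector over the vocabulary (softmax of the model's logits).
    rng = rng if rng is not None else np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]            # token ids sorted by descending probability
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)
    # Smallest prefix whose cumulative mass reaches p: the nucleus V^(p) of Eq. 2.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus_ids = order[:cutoff]
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize within the nucleus
    return int(rng.choice(nucleus_ids, p=nucleus_probs))
```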
sub-optimal across varying contexts. As illustrated on the left of Figure 5, in some contexts the head of the next word distribution can be flat across tens or hundreds of reasonable options (e.g. nouns or verbs in generic contexts), while in other contexts most of the probability mass is concentrated in one or a small number of tokens, as on the right of the figure. Therefore if k is small, in some contexts there is a risk of generating bland or generic text, while if k is large the top-k vocabulary will include inappropriate candidates which will have their probability of being sampled increased by the renormalization. Under Nucleus Sampling, the number of candidates considered rises and falls dynamically, corresponding to the changes in the model's confidence region over the vocabulary which top-k sampling fails to capture for any one choice of k.
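For contrast, a sketch of a single top-k sampling step under the same conventions as the nucleus sketch above: the candidate pool always has exactly k tokens, however flat or peaked the distribution is, and renormalization boosts every surviving token.

```python
import numpy as np

def top_k_sample(probs, k=40, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    top_ids = np.argsort(probs)[::-1][:k]                 # fixed-size candidate pool
    top_probs = probs[top_ids] / probs[top_ids].sum()     # renormalize over the k survivors
    return int(rng.choice(top_ids, p=top_probs))
```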
3.3 SAMPLING WITH TEMPERATURE
Another common approach to sampling-based generation is to shape a probability distribution through temperature (Ackley et al., 1985). Temperature sampling has been applied widely to text generation (Ficler & Goldberg, 2017; Fan et al., 2018; Caccia et al., 2018). Given the logits u_{1:|V|} and temperature t, the softmax is re-estimated as

p(x = V_l \mid x_{1:i-1}) = \frac{\exp(u_l / t)}{\sum_{l'} \exp(u_{l'} / t)}.    (4)
Setting t ∈ [0, 1) skews the distribution towards high probability events, which implicitly lowers the mass in the tail distribution. Low temperature sampling has also been used to partially alleviate the issues of top-k sampling discussed above, by shaping the distribution before top-k sampling (Radford et al., 2018; Fan et al., 2018). However, recent analysis has shown that, while lowering the temperature improves generation quality, it comes at the cost of decreasing diversity (Caccia et al., 2018; Hashimoto et al., 2019).
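A sketch of temperature-shaped sampling following Equation 4; `logits` is any real-valued score vector over the vocabulary, and the function is illustrative rather than the paper's code.

```python
import numpy as np

def temperature_sample(logits, t=0.7, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / t
    scaled -= scaled.max()                                # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()         # re-estimated softmax of Eq. 4
    return int(rng.choice(len(probs), p=probs))
```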
4 LIKELIHOOD EVALUATION
4.1 EXPERIMENTAL SETUP
While many neural network architectures have been proposed for language modeling, including LSTMs (Sundermeyer et al., 2012) and convolutional networks (Dauphin et al., 2017), the Transformer architecture (Vaswani et al., 2017) has been the most successful in the extremely large-scale training setups in recent literature (Radford et al., 2018; 2019). In this study we use the Generatively Pre-trained Transformer, version 2 (GPT2; Radford et al., 2019), which was trained on WebText, a 40GB collection of text scraped from the web.³ We perform experiments using the Large model (762M parameters). Our analysis is based on generating 5,000 text passages, which end upon reaching an end-of-document token or a maximum length of 200 tokens. Texts are generated conditionally, conditioned on the initial paragraph (restricted to 1-40 tokens) of documents in the held-out portion of WebText, except where otherwise mentioned.

³ Available at https://github.com/openai/gpt-2-output-dataset
4.2 PERPLEXITY
Our first evaluation is to compute the perplexity of generated text using various decoding strategies, according to the model that is being generated from. We compare these perplexities against that of the gold text (Figure 6). Importantly, we argue that the optimal generation strategy should produce text which has a perplexity close to that of the gold text: Even though the model has the ability to generate text that has lower perplexity (higher probability), such text tends to have low diversity and get stuck in repetition loops, as shown in §5 and illustrated in Figure 4.
We see that perplexity of text obtained from pure sampling is worse than the perplexity of the gold. This indicates that the model is confusing itself: sampling too many unlikely tokens and creating context that makes it difficult to recover the human distribution of text, as in Figure 1. Yet, setting the temperature lower creates diversity and repetition issues, as we shall see in §5. Even with our relatively fine-grained parameter sweep, Nucleus Sampling obtains closest perplexity to human text, as shown in Table 1.
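A sketch of this evaluation, assuming the per-token log-probabilities of a generated continuation have already been obtained from the model (the model call itself is omitted).

```python
import math

def perplexity(token_log_probs):
    # token_log_probs: natural-log probabilities log P(x_i | x_<i) of each generated token.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)
```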
Method                  Perplexity  Self-BLEU4  Zipf Coefficient  Repetition %  HUSE
Human                        12.38        0.31              0.93          0.28     -
Greedy                        1.50        0.50              1.00         73.66     -
Beam, b=16                    1.48        0.44              0.94         28.94     -
Stochastic Beam, b=16        19.20        0.28              0.91          0.32     -
Pure Sampling                22.73        0.28              0.93          0.22  0.67
Sampling, t=0.9              10.25        0.35              0.96          0.66  0.79
Top-k=40                      6.88        0.39              0.96          0.78  0.19
Top-k=640                    13.82        0.32              0.96          0.28  0.94
Top-k=40, t=0.7               3.48        0.44              1.00          8.86  0.08
Nucleus p=0.95               13.13        0.32              0.95          0.36  0.97

Table 1: Main results for comparing all decoding methods with selected parameters of each method. The numbers closest to human scores are in bold except for HUSE (Hashimoto et al., 2019), a combined human and statistical evaluation, where the highest (best) value is bolded. For Top-k and Nucleus Sampling, HUSE is computed with interpolation rather than truncation (see §6.1).
4.3 NATURAL LANGUAGE DOES NOT MAXIMIZE PROBABILITY
One might wonder if the issue with maximization is a search error, i.e., there are higher quality sentences to which the model assigns higher probability than to the decoded ones, and beam search has just failed to find them. Yet Figures 2 & 6 show that the per-token probability of natural text is, on average, much lower than text generated by beam search. Natural language rarely remains in a high probability zone for multiple consecutive time steps, instead veering into lower-probability but more informative tokens. Nor does natural language tend to fall into repetition loops, even though the model tends to assign high probability to this, as seen in Figure 4.
Why is human-written text not the most probable text? We conjecture that this is an intrinsic property of human language. Language models that assign probabilities one word at a time without a global model of the text will have trouble capturing this effect. Grice's Maxims of Communication (Grice, 1975) show that people optimize against stating the obvious. Thus, making every word as predictable as possible will be disfavored. This makes solving the problem simply by training larger models or improving neural architectures using standard per-word learning objectives unlikely: such models are forced to favor the lowest common denominator, rather than informative language.
5 DISTRIBUTIONAL STATISTICAL EVALUATION
5.1 ZIPF DISTRIBUTION ANALYSIS
In order to compare generations to the reference text, we begin by analyzing their use of vocabulary. Zipf's law suggests that there is an exponential relationship between the rank of a word and its frequency in text. The Zipfian coefficient s can be used to compare the distribution in a given text to a theoretically perfect exponential curve, where s = 1 (Piantadosi, 2014).
[Figure 6 plots conditional PPL of generations against each decoding parameter: beam width for Beam Search, temperature for Sampling, k for Top-k (t=1.0 and t=0.7), and p for Nucleus, with human PPL shown for reference.]
Figure 6: Perplexities of generations from various decoding methods. Note that beam search has unnaturally low perplexities. A similar effect is seen using a temperature of 0.7 with top-k as in both Radford et al. (2019) and Fan et al. (2018). Sampling, Top-k, and Nucleus can all be calibrated to human perplexities, but the first two face coherency issues when their parameters are set this high.
Figure 7: A rank-frequency plot of the distributional differences between n-gram frequencies of human and machine text. Sampling and Nucleus Sampling are by far the closest to the human distribution, while Beam Search clearly follows a very different distribution than natural language.
Figure 8: Self-BLEU calculated on the unconditional generations produced by stochastic decoding methods; lower Self-BLEU scores imply higher diversity. Horizontal blue and orange lines represent human self-BLEU scores. Note how common values of t ∈ [0.5, 1] and k ∈ [1, 100] result in high self-similarity, whereas “normal” values of p ∈ [0.9, 1) closely match the human distribution of text.
Figure 7 shows the vocabulary distributions along with estimated Zipf coefficients for selected parameters of different decoding methods. As expected, pure sampling is the closest to the human distribution, followed by Nucleus Sampling. The visualization of the distribution shows that pure sampling slightly overestimates the use of rare words, likely one reason why pure sampling also has higher perplexity than human text. Furthermore, lower temperature sampling avoids sampling these rare words from the tail, which is why it has been used in some recent work (Fan et al., 2018; Radford et al., 2019).
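A sketch of one common way to estimate the Zipf coefficient, fitting a least-squares line to the log-log rank-frequency curve; the paper does not spell out its exact fitting procedure, so this is an assumption rather than the authors' implementation.

```python
from collections import Counter
import numpy as np

def zipf_coefficient(tokens):
    # Sort word frequencies in descending order and fit log(freq) ~ -s * log(rank).
    counts = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(counts) + 1, dtype=float)
    slope, _ = np.polyfit(np.log(ranks), np.log(counts), 1)
    return -slope    # s = 1 for a theoretically perfect Zipfian distribution
```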
5.2 SELF-BLEU
We follow previous work and compute Self-BLEU (Zhu et al., 2018) as a metric of diversity. Self-BLEU is calculated by computing the BLEU score of each generated document using all other generations in the evaluation set as references. Due to the expense of computing such an operation, we sample 1000 generations, each of which is compared with all 4999 other generations as references. A lower Self-BLEU score implies higher diversity. Figure 8 shows that Self-BLEU results largely follow that of the Zipfian distribution analysis as a diversity measure. It is worth noting that
Figure 9: We visualize how often different decoding methods get “stuck” in loops within the first 200 tokens. A phrase (minimum length 2) is considered a repetition when it repeats at least three times at the end of the generation. We label points with their parameter values except for t and p, which follow the x-axis. Values of k greater than 100 are rarely used in practice and values of p are usually in [0.9, 1); therefore Nucleus Sampling is far closer to the human distribution in its usual parameter range. Sampling with temperatures lower than 0.9 severely increases repetition. Finally, although beam search becomes less repetitive according to this metric as beam width increases, this is largely because average length gets shorter as b increases (see Appendix A).
very high values of k and t are needed to get close to the reference distribution, though these result in unnaturally high perplexity (§4).
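A sketch of the Self-BLEU computation using NLTK's sentence-level BLEU-4; the 1000-vs-4999 subsampling described above is omitted, and the smoothing choice is an assumption rather than the paper's exact setup.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(generations):
    # generations: list of tokenized generations (each a list of tokens).
    smooth = SmoothingFunction().method1
    scores = []
    for i, hypothesis in enumerate(generations):
        references = generations[:i] + generations[i + 1:]   # all other generations
        scores.append(sentence_bleu(references, hypothesis,
                                    weights=(0.25, 0.25, 0.25, 0.25),
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)                          # lower means more diverse
```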
5.3 REPETITION
One attribute of text quality that we can quantify is repetition. Figure 9 shows that Nucleus Sampling and top-k sampling have the least repetition for reasonable parameter ranges. Generations from temperature sampling have more repetition unless very high temperatures are used, which we have shown negatively affects coherence (as measured by high perplexity). Further, all stochastic methods face repetition issues when their tuning parameters are set too low, which tends to over-truncate, mimicking greedy search. Therefore we conclude that only Nucleus Sampling satisfies all the distributional criteria for desirable generations.
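A sketch of the repetition check described in the Figure 9 caption (a phrase of length at least 2 repeated at least three times at the end of the generation); the exact implementation details here are assumptions.

```python
def ends_in_repetition(tokens, min_phrase_len=2, min_repeats=3):
    # True if some phrase of length >= min_phrase_len repeats at least
    # min_repeats times at the very end of the token sequence.
    n = len(tokens)
    for phrase_len in range(min_phrase_len, n // min_repeats + 1):
        phrase = tokens[n - phrase_len:]
        blocks = [tokens[n - (r + 1) * phrase_len : n - r * phrase_len]
                  for r in range(min_repeats)]
        if all(block == phrase for block in blocks):
            return True
    return False
```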
6 HUMAN EVALUATION
6.1 HUMAN UNIFIED WITH STATISTICAL EVALUATION (HUSE)
Statistical evaluations are unable to measure the coherence of generated text properly. While the
metrics in previous sections gave us vital insights into the different decoding methods we compare,
human evaluation is still required to get a full measure of the quality of the generated text. However,
pure human evaluation does not take into account the diversity of the generated text; therefore we use
HUSE (Hashimoto et al., 2019) to combine human and statistical evaluation. HUSE is computed by
training a discriminator to distinguish between text drawn from the human and model distributions,
based on only two features: The probability assigned by the language model, and human judgements
of typicality of generations. Text that is close to the human distribution in terms of quality and
diversity should perform well on both likelihood evaluation and human judgements.
As explored in the previous sections, the current best-performing decoding methods rely on truncation of the probability distribution, which yields a probability of 0 for the vast majority of potential tokens. Initial exploration of applying HUSE directly led to top-k and Nucleus Sampling receiving scores of nearly 0 due to truncation, despite humans favoring these methods. As a proxy, when generating the text used to compute HUSE, we interpolate (with mass 0.1) the original probability distribution with the top-k and Nucleus Sampling distribution, smoothing the truncated distribution.
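A sketch of this smoothing step, assuming per-step probability vectors; `alpha` is the interpolation mass returned to the full distribution (0.1 in the paper), and the function name is illustrative.

```python
import numpy as np

def smooth_truncated(original_probs, truncated_probs, alpha=0.1):
    # Mix the truncated (top-k or nucleus) distribution with the model's original
    # distribution so no token is left with exactly zero probability.
    mixed = (1.0 - alpha) * np.asarray(truncated_probs) + alpha * np.asarray(original_probs)
    return mixed / mixed.sum()      # renormalize against rounding drift
```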
For each decoding algorithm we annotate 200 generations for typicality, with each generation receiving 20 annotations from 20 different annotators. This results in a total of 4000 annotations per decoding scheme. We use a KNN classifier to compute HUSE, as in the original paper, with k = 13 neighbors, which we found led to the highest accuracy in discrimination. The results in Table 1 show that Nucleus Sampling obtains the highest HUSE score, with Top-k sampling performing second best.
6.2 QUALITATIVE ANALYSIS
Figure 3 shows representative example generations. Unsurprisingly, beam search gets stuck in a
repetition loop it cannot escape. Of the stochastic decoding schemes, the output of full sampling is
clearly the hardest to understand, even inventing a new word “umidauda”, apparently a species of
bird. The generation produced by Nucleus Sampling isn't perfect – the model appears to confuse
whales with birds, and begins writing about those instead. Yet, top-k sampling immediately veers off into an unrelated event. When top-k sampling is combined with a temperature of 0.7, as is commonly
done (Radford et al., 2019; Fan et al., 2018), the output devolves into repetition, exhibiting the classic
issues of low-temperature decoding. More generations are available in Appendix B.
7 CONCLUSION
This paper provided a deep analysis into the properties of the most common decoding methods for
open-ended language generation. We have shown that likelihood maximizing decoding causes repe-
tition and overly generic language usage, while sampling methods without truncation risk sampling
from the low-confidence tail of a model's predicted distribution. Further, we proposed Nucleus Sampling as a solution that captures the region of confidence of language models effectively. In future work, we wish to dynamically characterize this region of confidence and include a more semantic
utility function to guide the decoding process.
ACKNOWLEDGMENTS
This research was supported in part by NSF (IIS-1524371), the National Science Foundation Gradu-
ate Research Fellowship under Grant No. DGE1256082, DARPA CwC through ARO (W911NF15-
1-0543), DARPA MCS program through NIWC Pacific (N66001-19-2-4031), the South African Centre for Artificial Intelligence Research, and the Allen Institute for AI.
REFERENCES
David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. Proceedings of the 2015 International Conference on Learning Representations, 2015.
Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. Language GANs falling short. In Critiquing and Correcting Trends in Machine Learning: NeurIPS 2018 Workshop, 2018. URL http://arxiv.org/abs/1811.02549.
Yining Chen, Sorcha Gilroy, Andreas Maletti, Jonathan May, and Kevin Knight. Recurrent neural networks as weighted language recognizers. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2261–2271, New Orleans, Louisiana, June 2018.
Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. Neural text generation in stories using entity representations as context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2250–2260, New Orleans, Louisiana, June 2018.
Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 933–941, 2017.
Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pp. 889–898, 2018.
Jessica Ficler and Yoav Goldberg. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pp. 94–104, 2017.
H Paul Grice. Logic and conversation. In P Cole and J L Morgan (eds.), Speech Acts, volume 3 of Syntax and Semantics, pp. 41–58. Academic Press, 1975.
Tatsunori B. Hashimoto, Hugh Zhang, and Percy Liang. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. Learning to write with cooperative discriminators. In Proceedings of the Association for Computational Linguistics, 2018.
Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pp. 28–39, 2017.
Ilya Kulikov, Alexander H Miller, Kyunghyun Cho, and Jason Weston. Importance of search and evaluation strategies in neural dialogue modeling. International Conference on Natural Language Generation, 2019.
Jiwei Li, Will Monroe, and Dan Jurafsky. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562, 2016a.
Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1192–1202, 2016b.
Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421, 2015.
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 280–290, 2016.
Chris Pal, Charles Sutton, and Andrew McCallum. Sparse forward-backward using minimum divergence beams for fast training of conditional random fields. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, volume 5, May 2006.
Nanyun Peng, Marjan Ghazvininejad, Jonathan May, and Kevin Knight. Towards controllable story generation. In Proceedings of the First Workshop on Storytelling, pp. 43–49, New Orleans, Louisiana, June 2018. doi: 10.18653/v1/W18-1505.
Steven T Piantadosi. Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5):1112–1130, 2014.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf. Unpublished manuscript.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, February 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. Unpublished manuscript.
Stanislau Semeniuta, Aliaksei Severyn, and Sylvain Gelly. On accurate evaluation of GANs for language generation. arXiv preprint arXiv:1806.04936, 2018.
Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pp. 6830–6841, 2017.
Felix Stahlberg and Bill Byrne. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3347–3353, 2019.
Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association, 2012.
Guy Tevet, Gavriel Habib, Vered Shwartz, and Jonathan Berant. Evaluating text GANs as language models. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2241–2247, 2019.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. Diverse beam search for improved description of complex scenes. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
Sam Wiseman, Stuart Shieber, and Alexander Rush. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2253–2263, Copenhagen, Denmark, September 2017.
Jingjing Xu, Xuancheng Ren, Junyang Lin, and Xu Sun. Diversity-promoting GAN: A cross-entropy based generative adversarial network for diversified text generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3940–3949, Brussels, Belgium, October 2018.
Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. Texygen: A benchmarking platform for text generation models. SIGIR, 2018.
A BEAM WIDTH EFFECT
Figure 10: The total number of trigrams produced by Beam Search with varying beam widths,
with gold (human) data for comparison. Note how the average length of generations goes down
linearly with beam width, while the number of distinct trigrams stays constant and extremely low in
comparison to gold data.
B EXAMPLE GENERATIONS
We include a set of examples for further qualitative comparison.

Figure 11: More example generations from an initial tag line. All generations available at https://github.com/ari-holtzman/degen
Figure 12: More example generations from an initial tag line. Note that Pure Sampling and Nucleus Sampling are the only algorithms that can escape the repetition loop, with Nucleus Sampling's generation far closer in style to the ground truth text. All generations available at https://github.com/ari-holtzman/degen
Figure 13: More example generations from an initial tag line. All generations available at https://github.com/ari-holtzman/degen