diff --git a/paper/gtt.tex b/paper/gtt.tex
index a2f11a7afab5624832446a3bc518ff85074c9fa6..2dd4f8a685509e5aa812e9888b5c04360f98af14 100644
--- a/paper/gtt.tex
+++ b/paper/gtt.tex
@@ -410,9 +410,9 @@ However, the dichotomy between gradual and optional typing is not as
 firm as one might like.
 %
 There have been many different proposed semantics of run-time type
-checking: ``transient'' cast semantics~\cite{vitousekswordssiek:2017}
+checking: ``transient'' cast semantics~\citep{vitousekswordssiek:2017}
 only checks the head connective of a type (number, function, list,
-\ldots), ``eager'' cast semantics~\cite{herman2010spaceefficient} checks
+\ldots), ``eager'' cast semantics~\citep{herman2010spaceefficient} checks
 run-time type information on closures, whereas ``lazy'' cast
 semantics~\citep{findler-felleisen02} will always delay a type-check on
 a function until it is called (and there are other possibilities, see
@@ -485,8 +485,7 @@ common.
 %
 The second property we take for granted is that the language satisfies
 the \emph{dynamic gradual guarantee}~\cite{refined} (``graduality'')---a
-now standard correctness theorem of gradual typing that most newly
-proposed gradual languages seek to satisfy--- which constraints how
+strong correctness theorem of gradual typing--- which constraints how
 changing type annotations changes behavior.  Graduality says that if we
 change the types in a program to be ``more precise''---e.g., by changing
 from the dynamic type to a more precise type such as integers or
@@ -736,8 +735,8 @@ types and downcasts with lazy/computation types, and that the modalities
 relating values and computations induce the downcasts for eager/value
 types and upcasts for lazy/computation types.  Moreover, this analysis
 articulates an important behavioral property of casts that was proved
-operationally for call-by-value in \cite{newahmed18} but missed for
-call-by-name in \cite{newlicata2018-fscd}: upcasts for eager types and
+operationally for call-by-value in \citet{newahmed18} but missed for
+call-by-name in \citet{newlicata2018-fscd}: upcasts for eager types and
 downcasts for lazy types are both ``pure'' in a suitable sense, which
 enables more refactorings and program optimizations.  In particular, we
 show that these casts can be taken to be (and are essentially forced to
@@ -788,7 +787,7 @@ Then
 for the extended CBPV and define a step-indexed biorthogonal logical
 relation that interprets the ordering relation on terms as contextual
 error approximation, which underlies the definition of graduality as
-presented in \cite{newahmed18}.
+presented in \citet{newahmed18}.
 %
 Combining these theorems gives an implementation of the term
 language of GTT in which $\beta, \eta$ are observational equivalences
@@ -853,7 +852,7 @@ The main contributions of the paper are as follows.
 \begin{shortonly}
 \textbf{Extended version:} An extended version of the paper, which
 includes the omitted cases of definitions, lemmas, and proofs is
-available as \citet{newlicataahmed19:extended}.
+available in \citet{newlicataahmed19:extended}.
 \end{shortonly}
 
 % While gradual typing has been researched quite extensively by proving
@@ -1451,7 +1450,7 @@ types will arise from type dynamism and casts.
 The \emph{type dynamism} relation of gradual type theory is written $A
 \ltdyn A'$ and read as ``$A$ is less dynamic than $A'$''; intuitively,
 this means that $A'$ supports more behaviors than $A$.
-\cite{newahmed18,newlicata2018-fscd} analyze this as the existence of an \emph{upcast}
+\citet{newahmed18,newlicata2018-fscd} analyze this as the existence of an \emph{upcast}
 from $A$ to $A'$ and a downcast from $A'$ to $A$ which form an
 embedding-projection pair (\emph{ep pair}) for term error approximation
 (an ordering where runtime errors are minimal): the upcast followed by the
@@ -1550,7 +1549,7 @@ considered.  However, we can arrive at a first proposal by considering
 how previous work would be embedded into CBPV.
 %
 In the previous work on both CBV and
-CBN~\cite{newahmed18,newlicata2018-fscd} every type dynamism judgement
+CBN~\citep{newahmed18,newlicata2018-fscd} every type dynamism judgement
 $A \ltdyn A'$ induces both an upcast from $A$ to $A'$ and a downcast
 from $A'$ to $A$.
 %
@@ -1858,7 +1857,7 @@ typically because they are identical), which is a syntactic judgement for contex
 
 Finally, we assert some term dynamism axioms that describe the behavior
 of programs.  The cast universal properties at the top of
-Figure~\ref{fig:gtt-term-dyn-axioms}, following~\cite{newlicata2018-fscd}, say that
+Figure~\ref{fig:gtt-term-dyn-axioms}, following~\citet{newlicata2018-fscd}, say that
 the defining property of an upcast from $A$ to $A'$ is that it is the
 least dynamic term of type $A'$ that is more dynamic that $x$, a ``least
 upper bound''.  That is, $\upcast{A}{A'}{x}$ is a term of type $A'$ that
@@ -4594,7 +4593,7 @@ introduction forms for $\dynv$ are exactly the introduction forms for
 all of the value types (unit, pairing,$\texttt{inl}$, $\texttt{inr}$, $\texttt{force}$), while
 elimination forms are all of the elimination forms for computation types
 ($\pi$, $\pi'$, application and binding); such ``bityped'' languages
-are related to \cite{girard01locussolum,zeilberger09thesis}.
+are related to \citet{girard01locussolum,zeilberger09thesis}.
 \begin{shortonly}
   In the extended version, we give an extension of GTT axiomatizing this
   implementation of the dynamic types.
@@ -4700,7 +4699,7 @@ lazy product type $\with$.
 Normally, these are not included in this way, but rather sums are
 encoded by making each case use a fresh constructor (using nominal
 techniques like opaque structs in Racket) and then making the sum the
-union of the constructors, as argued in \cite{siekth16recursiveunion}.
+union of the constructors, as argued in \citet{siekth16recursiveunion}.
 %
 We leave modeling this nominal structure to future work, but in
 minimalist languages, such as simple dialects of Scheme and Lisp, sum
@@ -5384,7 +5383,7 @@ translate a heterogeneous term dynamism to a homogeneous inequality
 \end{theorem}
 
 \begin{longonly}
-  Relative to previous work on graduality (\cite{newahmed18}),
+  Relative to previous work on graduality \citep{newahmed18},
 the distinction between complex value upcasts and complex stack
 downcasts guides the formulation of the theorem; e.g. using upcasts in
 the left-hand theorem would require more thunks/forces.  
@@ -6741,7 +6740,7 @@ strictness).
 \end{longonly}
 %
 \ This translation clarifies the behavioral meaning of complex values and
-stacks, following \cite{munchmaccagnoni14nonassociative,
+stacks, following \citet{munchmaccagnoni14nonassociative,
   fuhrmann1999direct}, and therefore of upcasts and downcasts.
 \begin{longonly}
 This is related to completeness of focusing: it moves inversion rules
@@ -7068,7 +7067,7 @@ $\kw{split}$ are not evaluation contexts in the operational semantics).
 
 \end{longonly}
 
-Levy~\cite{levy03cbpvbook} translates \cbpvstar\/ to \cbpv, but not does not prove
+\citet{levy03cbpvbook} translates \cbpvstar\/ to \cbpv, but not does not prove
 the inequality preservation that we require here, so we give
 an
 alternative translation for which this property is easy to
@@ -7112,7 +7111,7 @@ the same type.
 The \emph{de-complexification} procedure is defined as follows.
 %
 We note that this translation is not the one presented in
-\cite{levy03cbpvbook}, but rather a more inefficient version that, in CPS
+\citet{levy03cbpvbook}, but rather a more inefficient version that, in CPS
 terminology, introduces many administrative redices.
 %
 Since we are only proving results up to observational equivalence
@@ -7839,7 +7838,7 @@ in figure
 \end{figure}
 \fi
 
-This is morally the same as in \cite{levy03cbpvbook}, but we present
+This is morally the same as in \citet{levy03cbpvbook}, but we present
 stacks in a manner similar to Hieb-Felleisen style evaluation
 contexts\iflong(rather than as an explicit stack machine with stack frames)\fi.
 %
@@ -7987,7 +7986,7 @@ least preorders so we write $\apreorder$ rather than $\sim$.
 \noindent Three important relations arise as liftings: Equality of results lifts to observational equivalence
 ($\ctxize=$).  The preorder generated by $\err \ltdyn R$ (i.e. the other
 three results are unrelated maximal elements) lifts to the notion of
-\emph{error approximation} used in \cite{newahmed18} to prove the
+\emph{error approximation} used in \citet{newahmed18} to prove the
 graduality property ($\ctxize\ltdyn$).  The preorder generated by
 $\diverge \preceq R$ lifts to the standard notion of \emph{divergence
   approximation} ($\ctxize\preceq$).
@@ -7998,7 +7997,7 @@ The most famous use of lifting is for observational equivalence,
 which is the lifting of equality of results ($\ctxize=$), and we will show that
 $\equidyn$ proofs in GTT imply observational equivalences.
 %
-However, as shown in \cite{newahmed18}, the graduality property is
+However, as shown in \citet{newahmed18}, the graduality property is
 defined in terms of an observational \emph{approximation} relation
 $\ltdyn$ that places $\err$ as the least element, and every other
 element as a maximal element.
@@ -8157,7 +8156,7 @@ approximation, i.e., $M \ctxize= N$ if and only if $M \ctxize\preceq
 N$ and $N \ctxize\preceq M$, and since $\preceq$ has $\diverge$ as
 a least element, we can use a step-indexed relation to prove it.
 %
-As shown in \cite{newahmed18}, a similar trick works for error
+As shown in \citet{newahmed18}, a similar trick works for error
 approximation, but since $\ltdyn$ is \emph{not} an equivalence
 relation, we decompose it rather into two \emph{different} orderings:
 error approximation up to divergence on the left $\errordivergeleft$ and
@@ -8251,8 +8250,8 @@ So using Lemmas~\ref{lem:ctx-commutes-conjunction}, \ref{lem:ctx-commutes-dual}
 \end{proof}
 
 As a corollary, the decomposition of contextual equivalence into diverge
-approximation in \cite{ahmed06:lr} and the decomposition of dynamism in
-\cite{newahmed18} are really the same trick:
+approximation in \citet{ahmed06:lr} and the decomposition of dynamism in
+\citet{newahmed18} are really the same trick:
 \begin{corollary}[Contextual Decomposition] ~~~ \label{cor:contextual-decomposition}
   \begin{enumerate}
   \item $\ctxize= \mathbin{\Leftrightarrow} \ctxize{\preceq} \wedge
@@ -9373,7 +9372,7 @@ implementation theorems
 (e.g. $\upcast{\mathtt{list}(A)}{\mathtt{list}(A')} \equidyn
 \mathtt{map}\upcast{A}{A'}$).  Additionally, since more efficient cast
 implementations such as optimized cast calculi (the lazy variant in
-\cite{herman2010spaceefficient}) and threesome
+\citet{herman2010spaceefficient}) and threesome
 casts~\cite{siekwadler10zigzag}, are equivalent to the lazy contract
 semantics, they should also be models of GTT, and if so we could use GTT
 to reason about program transformations and optimizations in them.
@@ -9449,7 +9448,7 @@ x. \Omega$ will not.
 %
 This is a crucial feature of Haskell and is a major source of
 differences between implementations of lazy contracts, as noted in
-\cite{Degen2012TheIO}.
+\citet{Degen2012TheIO}.
 %
 We can understand this difference by using a different translation
 into call-by-push-value: what Levy calls the ``lazy paradigm'', as
@@ -9464,7 +9463,7 @@ With this embedding and the uniqueness theorem, GTT produces a
 definition for lazy casts, and the definition matches the work of
 \citet{XuPJC09} when restricting to non-dependent contracts.
 
-Some recent work \cite{greenmanfelleisen:2018} gives a spectrum of
+\citet{greenmanfelleisen:2018} gives a spectrum of
 differing syntactic type soundness theorems for different semantics of
 gradual typing.
 %
@@ -9664,7 +9663,7 @@ interpretation.
 %
 It may be of interest to develop a more general notion of model of GTT
 for which we can prove soundness and completeness theorems, as in
-\cite{newlicata2018-fscd}.
+\citet{newlicata2018-fscd}.
 %
 A model would be a strong adjunction between double categories where
 one of the double categories has all ``companions'' and the other has