diff --git a/paper/gtt.tex b/paper/gtt.tex
index 92c0b72343f28e2a55ba3cc8a38fa66e61dbcdc2..30c5da4d09deca12ae35a854c31bd20d8b03aeec 100644
--- a/paper/gtt.tex
+++ b/paper/gtt.tex
@@ -66,7 +66,7 @@
 %% Citation style
 %% Note: author/year citations are required for papers published as an
 %% issue of PACMPL.
-% \citestyle{acmauthoryear}   %% For author/year citations
+\citestyle{acmauthoryear}   %% For author/year citations
 \setcitestyle{nosort}
 
 \title{Gradual Type Theory}
@@ -410,9 +410,9 @@ However, the dichotomy between gradual and optional typing is not as
 firm as one might like.
 %
 There have been many different proposed semantics of run-time type
-checking: ``transient'' cast semantics~\cite{vitousekswordssiek:2017}
+checking: ``transient'' cast semantics~\citep{vitousekswordssiek:2017}
 only checks the head connective of a type (number, function, list,
-\ldots), ``eager'' cast semantics~\cite{herman2010spaceefficient} checks
+\ldots), ``eager'' cast semantics~\citep{herman2010spaceefficient} checks
 run-time type information on closures, whereas ``lazy'' cast
 semantics~\citep{findler-felleisen02} will always delay a type-check on
 a function until it is called (and there are other possibilities, see
@@ -458,8 +458,7 @@ the other does not) results.
 The logic axiomatizes the equational properties gradual
 programs should satisfy, and offers a high-level syntax for proving
 theorems about many languages at once:
-if a particular operational
-semantics models gradual type theory, then it satisfies all
+if a language models gradual type theory, then it satisfies all
 provable equivalences/approximations.
 %
 Due to its type-theoretic design, different axioms of program
@@ -485,8 +484,7 @@ common.
 %
 The second property we take for granted is that the language satisfies
 the \emph{dynamic gradual guarantee}~\cite{refined} (``graduality'')---a
-now standard correctness theorem of gradual typing that most newly
-proposed gradual languages seek to satisfy--- which constraints how
+strong correctness theorem of gradual typing--- which constraints how
 changing type annotations changes behavior.  Graduality says that if we
 change the types in a program to be ``more precise''---e.g., by changing
 from the dynamic type to a more precise type such as integers or
@@ -736,8 +734,8 @@ types and downcasts with lazy/computation types, and that the modalities
 relating values and computations induce the downcasts for eager/value
 types and upcasts for lazy/computation types.  Moreover, this analysis
 articulates an important behavioral property of casts that was proved
-operationally for call-by-value in \cite{newahmed18} but missed for
-call-by-name in \cite{newlicata2018-fscd}: upcasts for eager types and
+operationally for call-by-value in \citet{newahmed18} but missed for
+call-by-name in \citet{newlicata2018-fscd}: upcasts for eager types and
 downcasts for lazy types are both ``pure'' in a suitable sense, which
 enables more refactorings and program optimizations.  In particular, we
 show that these casts can be taken to be (and are essentially forced to
@@ -788,7 +786,7 @@ Then
 for the extended CBPV and define a step-indexed biorthogonal logical
 relation that interprets the ordering relation on terms as contextual
 error approximation, which underlies the definition of graduality as
-presented in \cite{newahmed18}.
+presented in \citet{newahmed18}.
 %
 Combining these theorems gives an implementation of the term
 language of GTT in which $\beta, \eta$ are observational equivalences
@@ -812,7 +810,6 @@ Our modular proof architecture allows us to easily prove correctness
 of $\beta, \eta$ and graduality for all of these interpretations.
 
 \textbf{Contributions}
-
 The main contributions of the paper are as follows.
 \begin{enumerate}
 \item We present Gradual Type Theory in section \ref{sec:gtt}, a simple
@@ -854,7 +851,7 @@ The main contributions of the paper are as follows.
 \begin{shortonly}
 \textbf{Extended version:} An extended version of the paper, which
 includes the omitted cases of definitions, lemmas, and proofs is
-available as \citet{newlicataahmed19:extended}.
+available in \citet{newlicataahmed19:extended}.
 \end{shortonly}
 
 % While gradual typing has been researched quite extensively by proving
@@ -1452,7 +1449,7 @@ types will arise from type dynamism and casts.
 The \emph{type dynamism} relation of gradual type theory is written $A
 \ltdyn A'$ and read as ``$A$ is less dynamic than $A'$''; intuitively,
 this means that $A'$ supports more behaviors than $A$.
-\cite{newahmed18,newlicata2018-fscd} analyze this as the existence of an \emph{upcast}
+\citet{newahmed18,newlicata2018-fscd} analyze this as the existence of an \emph{upcast}
 from $A$ to $A'$ and a downcast from $A'$ to $A$ which form an
 embedding-projection pair (\emph{ep pair}) for term error approximation
 (an ordering where runtime errors are minimal): the upcast followed by the
@@ -1490,35 +1487,35 @@ domain~\cite{newahmed18,newlicata2018-fscd}.
   \begin{mathpar}
     \framebox{$A \ltdyn A'$ and $\u B \ltdyn \u B'$}
     
-    \inferrule*[right=VTyRefl]{ }{A \ltdyn A}
+    \inferrule*[lab=VTyRefl]{ }{A \ltdyn A}
 
-    \inferrule*[right=VTyTrans]{A \ltdyn A' \and A' \ltdyn A''}
+    \inferrule*[lab=VTyTrans]{A \ltdyn A' \and A' \ltdyn A''}
               {A \ltdyn A''}
 
-    \inferrule*[right=CTyRefl]{ }{\u B \ltdyn \u B'}
+    \inferrule*[lab=CTyRefl]{ }{\u B \ltdyn \u B'}
 
-    \inferrule*[right=CTyTrans]{\u B \ltdyn \u B' \and \u B' \ltdyn \u B''}
+    \inferrule*[lab=CTyTrans]{\u B \ltdyn \u B' \and \u B' \ltdyn \u B''}
               {\u B \ltdyn \u B''}
 
-    \inferrule*[right=VTyTop]{ }{A \ltdyn \dynv}
+    \inferrule*[lab=VTyTop]{ }{A \ltdyn \dynv}
 
-    \inferrule*[right=$U$Mon]{\u B \ltdyn \u B'}
+    \inferrule*[lab=$U$Mon]{\u B \ltdyn \u B'}
               {U B \ltdyn U B'}
 
-    \inferrule*[right=$+$Mon]{A_1 \ltdyn A_1' \and A_2 \ltdyn A_2' }
+    \inferrule*[lab=$+$Mon]{A_1 \ltdyn A_1' \and A_2 \ltdyn A_2' }
               {A_1 + A_2 \ltdyn A_1' + A_2'}
 
-    \inferrule*[right=$\times$Mon]{A_1 \ltdyn A_1' \and A_2 \ltdyn A_2' }
+    \inferrule*[lab=$\times$Mon]{A_1 \ltdyn A_1' \and A_2 \ltdyn A_2' }
               {A_1 \times A_2 \ltdyn A_1' \times A_2'}
 \\
-    \inferrule*[right=CTyTop]{ }{\u B \ltdyn \dync}
+    \inferrule*[lab=CTyTop]{ }{\u B \ltdyn \dync}
 
-\inferrule*[right=$F$Mon]{A \ltdyn A' }{ \u F A \ltdyn \u F A'}
+\inferrule*[lab=$F$Mon]{A \ltdyn A' }{ \u F A \ltdyn \u F A'}
 
-\inferrule*[right=$\with$Mon]{\u B_1 \ltdyn \u B_1' \and \u B_2 \ltdyn \u B_2'}
+\inferrule*[lab=$\with$Mon]{\u B_1 \ltdyn \u B_1' \and \u B_2 \ltdyn \u B_2'}
               {\u B_1 \with \u B_2 \ltdyn \u B_1' \with \u B_2'}
 
-\inferrule*[right=$\to$Mon]{A \ltdyn A' \and \u B \ltdyn \u B'}
+\inferrule*[lab=$\to$Mon]{A \ltdyn A' \and \u B \ltdyn \u B'}
           {A \to \u B \ltdyn A' \to \u B'}
 \begin{longonly}
 \\
@@ -1551,7 +1548,7 @@ considered.  However, we can arrive at a first proposal by considering
 how previous work would be embedded into CBPV.
 %
 In the previous work on both CBV and
-CBN~\cite{newahmed18,newlicata2018-fscd} every type dynamism judgement
+CBN~\citep{newahmed18,newlicata2018-fscd} every type dynamism judgement
 $A \ltdyn A'$ induces both an upcast from $A$ to $A'$ and a downcast
 from $A'$ to $A$.
 %
@@ -1667,14 +1664,14 @@ our design choice is forced for all casts, as long as the casts between ground t
     
     \inferrule*[lab=TmDynRefl]{ }{\Gamma \ltdyn \Gamma \mid \Delta \ltdyn \Delta \vdash E \ltdyn E : T \ltdyn T}
     \qquad
+    \inferrule*[lab=TmDynVar]
+    { }
+    {\Phi,x \ltdyn x' : A \ltdyn A',\Phi' \vdash x \ltdyn x' : A \ltdyn A'}
+    \\\\
     \inferrule*[lab=TmDynTrans]{\Gamma \ltdyn \Gamma' \mid \Delta \ltdyn \Delta' \vdash E \ltdyn E' : T \ltdyn T' \\\\
       \Gamma' \ltdyn \Gamma'' \mid \Delta' \ltdyn \Delta'' \vdash E' \ltdyn E'' : T' \ltdyn T''
     }
     {\Gamma \ltdyn \Gamma'' \mid \Delta \ltdyn \Delta'' \vdash E \ltdyn E'' : T \ltdyn T''}
-    \\\\
-    \inferrule*[lab=TmDynVar]
-    { }
-    {\Phi,x \ltdyn x' : A \ltdyn A',\Phi' \vdash x \ltdyn x' : A \ltdyn A'}
     \qquad
     \inferrule*[lab=TmDynValSubst]
     {\Phi \vdash V \ltdyn V' : A \ltdyn A' \\\\
@@ -1859,7 +1856,7 @@ typically because they are identical), which is a syntactic judgement for contex
 
 Finally, we assert some term dynamism axioms that describe the behavior
 of programs.  The cast universal properties at the top of
-Figure~\ref{fig:gtt-term-dyn-axioms}, following~\cite{newlicata2018-fscd}, say that
+Figure~\ref{fig:gtt-term-dyn-axioms}, following~\citet{newlicata2018-fscd}, say that
 the defining property of an upcast from $A$ to $A'$ is that it is the
 least dynamic term of type $A'$ that is more dynamic that $x$, a ``least
 upper bound''.  That is, $\upcast{A}{A'}{x}$ is a term of type $A'$ that
@@ -4595,7 +4592,7 @@ introduction forms for $\dynv$ are exactly the introduction forms for
 all of the value types (unit, pairing,$\texttt{inl}$, $\texttt{inr}$, $\texttt{force}$), while
 elimination forms are all of the elimination forms for computation types
 ($\pi$, $\pi'$, application and binding); such ``bityped'' languages
-are related to \cite{girard01locussolum,zeilberger09thesis}.
+are related to \citet{girard01locussolum,zeilberger09thesis}.
 \begin{shortonly}
   In the extended version, we give an extension of GTT axiomatizing this
   implementation of the dynamic types.
@@ -4701,7 +4698,7 @@ lazy product type $\with$.
 Normally, these are not included in this way, but rather sums are
 encoded by making each case use a fresh constructor (using nominal
 techniques like opaque structs in Racket) and then making the sum the
-union of the constructors, as argued in \cite{siekth16recursiveunion}.
+union of the constructors, as argued in \citet{siekth16recursiveunion}.
 %
 We leave modeling this nominal structure to future work, but in
 minimalist languages, such as simple dialects of Scheme and Lisp, sum
@@ -5385,7 +5382,7 @@ translate a heterogeneous term dynamism to a homogeneous inequality
 \end{theorem}
 
 \begin{longonly}
-  Relative to previous work on graduality (\cite{newahmed18}),
+  Relative to previous work on graduality \citep{newahmed18},
 the distinction between complex value upcasts and complex stack
 downcasts guides the formulation of the theorem; e.g. using upcasts in
 the left-hand theorem would require more thunks/forces.  
@@ -6742,7 +6739,7 @@ strictness).
 \end{longonly}
 %
 \ This translation clarifies the behavioral meaning of complex values and
-stacks, following \cite{munchmaccagnoni14nonassociative,
+stacks, following \citet{munchmaccagnoni14nonassociative,
   fuhrmann1999direct}, and therefore of upcasts and downcasts.
 \begin{longonly}
 This is related to completeness of focusing: it moves inversion rules
@@ -7069,7 +7066,7 @@ $\kw{split}$ are not evaluation contexts in the operational semantics).
 
 \end{longonly}
 
-Levy~\cite{levy03cbpvbook} translates \cbpvstar\/ to \cbpv, but not does not prove
+\citet{levy03cbpvbook} translates \cbpvstar\/ to \cbpv, but not does not prove
 the inequality preservation that we require here, so we give
 an
 alternative translation for which this property is easy to
@@ -7113,7 +7110,7 @@ the same type.
 The \emph{de-complexification} procedure is defined as follows.
 %
 We note that this translation is not the one presented in
-\cite{levy03cbpvbook}, but rather a more inefficient version that, in CPS
+\citet{levy03cbpvbook}, but rather a more inefficient version that, in CPS
 terminology, introduces many administrative redices.
 %
 Since we are only proving results up to observational equivalence
@@ -7840,7 +7837,7 @@ in figure
 \end{figure}
 \fi
 
-This is morally the same as in \cite{levy03cbpvbook}, but we present
+This is morally the same as in \citet{levy03cbpvbook}, but we present
 stacks in a manner similar to Hieb-Felleisen style evaluation
 contexts\iflong(rather than as an explicit stack machine with stack frames)\fi.
 %
@@ -7988,7 +7985,7 @@ least preorders so we write $\apreorder$ rather than $\sim$.
 \noindent Three important relations arise as liftings: Equality of results lifts to observational equivalence
 ($\ctxize=$).  The preorder generated by $\err \ltdyn R$ (i.e. the other
 three results are unrelated maximal elements) lifts to the notion of
-\emph{error approximation} used in \cite{newahmed18} to prove the
+\emph{error approximation} used in \citet{newahmed18} to prove the
 graduality property ($\ctxize\ltdyn$).  The preorder generated by
 $\diverge \preceq R$ lifts to the standard notion of \emph{divergence
   approximation} ($\ctxize\preceq$).
@@ -7999,7 +7996,7 @@ The most famous use of lifting is for observational equivalence,
 which is the lifting of equality of results ($\ctxize=$), and we will show that
 $\equidyn$ proofs in GTT imply observational equivalences.
 %
-However, as shown in \cite{newahmed18}, the graduality property is
+However, as shown in \citet{newahmed18}, the graduality property is
 defined in terms of an observational \emph{approximation} relation
 $\ltdyn$ that places $\err$ as the least element, and every other
 element as a maximal element.
@@ -8158,7 +8155,7 @@ approximation, i.e., $M \ctxize= N$ if and only if $M \ctxize\preceq
 N$ and $N \ctxize\preceq M$, and since $\preceq$ has $\diverge$ as
 a least element, we can use a step-indexed relation to prove it.
 %
-As shown in \cite{newahmed18}, a similar trick works for error
+As shown in \citet{newahmed18}, a similar trick works for error
 approximation, but since $\ltdyn$ is \emph{not} an equivalence
 relation, we decompose it rather into two \emph{different} orderings:
 error approximation up to divergence on the left $\errordivergeleft$ and
@@ -8252,8 +8249,8 @@ So using Lemmas~\ref{lem:ctx-commutes-conjunction}, \ref{lem:ctx-commutes-dual}
 \end{proof}
 
 As a corollary, the decomposition of contextual equivalence into diverge
-approximation in \cite{ahmed06:lr} and the decomposition of dynamism in
-\cite{newahmed18} are really the same trick:
+approximation in \citet{ahmed06:lr} and the decomposition of dynamism in
+\citet{newahmed18} are really the same trick:
 \begin{corollary}[Contextual Decomposition] ~~~ \label{cor:contextual-decomposition}
   \begin{enumerate}
   \item $\ctxize= \mathbin{\Leftrightarrow} \ctxize{\preceq} \wedge
@@ -9357,7 +9354,7 @@ Corollary~\ref{cor:contextual-decomposition} that this coincides with
 contextual equivalence.
 \end{longproof}
 
-\section{Related and Future Work}
+\section{Discussion and Related Work}
 \label{sec:related}
 
 In this paper, we have given a logic for reasoning about gradual
@@ -9374,7 +9371,7 @@ implementation theorems
 (e.g. $\upcast{\mathtt{list}(A)}{\mathtt{list}(A')} \equidyn
 \mathtt{map}\upcast{A}{A'}$).  Additionally, since more efficient cast
 implementations such as optimized cast calculi (the lazy variant in
-\cite{herman2010spaceefficient}) and threesome
+\citet{herman2010spaceefficient}) and threesome
 casts~\cite{siekwadler10zigzag}, are equivalent to the lazy contract
 semantics, they should also be models of GTT, and if so we could use GTT
 to reason about program transformations and optimizations in them.
@@ -9395,7 +9392,7 @@ sections \ref{sec:contract}, \ref{sec:complex}, \ref{sec:operational}.
 We conjecture that simple call-by-value and call-by-name gradual
 languages are also models of GTT, by extending the translation of
 call-by-push-value into call-by-value and call-by-name in the appendix
-to Levy's monograph \cite{levy03cbpvbook}.
+of Levy's monograph \cite{levy03cbpvbook}.
 %
 In order for the theorem to apply, the language must validate an
 appropriate version of the $\eta$ principles for the types.
@@ -9404,10 +9401,10 @@ So for example, a call-by-value language that has reference equality
 of functions does \emph{not} validate even the value-restricted $\eta$
 law for functions, and so the case for functions does not apply.
 %
-It is a well-known issue that the lazy semantics of function casts is
-not compatible with the refinement property that graduality models in
-the presence of pointer equality, and our uniqueness theorem provides
-a different perspective on this phenomenon
+It is a well-known issue that in the presence of pointer equality of
+functions, the lazy semantics of function casts is not compatible with
+the graduality property, and our uniqueness theorem provides a
+different perspective on this phenomenon
 \cite{findlerflattfelleisen04,chaperonesimpersonators, refined}.
 %
 However, we note that the cases of the uniqueness theorem for each
@@ -9416,9 +9413,9 @@ specification of casts and the $\beta,\eta$ principles for the
 particular connective, and not on the presence of any other types,
 even the dynamic types.
 %
-So while a call-by-value language that has reference equality of
-functions might still have the $\eta$ principle for strict pairs, so
-that case of the theorem still applies.
+So even if a call-by-value language may have reference equality
+functions, if it has the $\eta$ principle for strict pairs, then the
+pair cast must be that of theorem \ref{functorial-casts}.
 
 Next, we consider the applicability to non-eager languages.
 %
@@ -9430,13 +9427,13 @@ restriction for call-by-value function $\eta$.
 %
 We are not aware of any call-by-name gradual languages, but there is
 considerable work on \emph{contracts} for non-eager languages,
-especially Haskell, \cite{hinzeJeuringLoh06,XuPJC09}.
+especially Haskell \cite{hinzeJeuringLoh06,XuPJC09}.
 %
 However, we note that Haskell is \emph{not} a call-by-name language in
 our sense for two reasons.
 %
 First, Haskell uses call-by-need evaluation where results of
-computations are memoized. However when only considering Haskell's
+computations are memoized. However, when only considering Haskell's
 effects (error and divergence), this difference is not observable so
 this is not the main obstacle.
 %
@@ -9450,7 +9447,7 @@ x. \Omega$ will not.
 %
 This is a crucial feature of Haskell and is a major source of
 differences between implementations of lazy contracts, as noted in
-\cite{Degen2012TheIO}.
+\citet{Degen2012TheIO}.
 %
 We can understand this difference by using a different translation
 into call-by-push-value: what Levy calls the ``lazy paradigm'', as
@@ -9465,7 +9462,7 @@ With this embedding and the uniqueness theorem, GTT produces a
 definition for lazy casts, and the definition matches the work of
 \citet{XuPJC09} when restricting to non-dependent contracts.
 
-Some recent work \cite{greenmanfelleisen:2018} gives a spectrum of
+\citet{greenmanfelleisen:2018} gives a spectrum of
 differing syntactic type soundness theorems for different semantics of
 gradual typing.
 %
@@ -9473,7 +9470,7 @@ Our work here is complementary, showing that certain program
 equivalences can only be achieved by certain cast semantics.
 
 \citet{Degen2012TheIO} give an analysis of different cast semantics
-for contracts in lazy languages, specifically based on Haskell, so
+for contracts in lazy languages, specifically based on Haskell, i.e., 
 call-by-need with \seq.
 %
 They propose two properties ``meaning preservation'' and
@@ -9493,7 +9490,7 @@ non-contracted term.
 Completeness, on the other hand, requires that when a contract is
 attached to a value that it is \emph{deeply} checked.
 %
-The two properties are incompatible because for instance a pair of a
+The two properties are incompatible because, for instance, a pair of a
 diverging term and a value can't be deeply checked without causing the
 entire program to diverge.
 %
@@ -9523,12 +9520,13 @@ Because of this, the theorems proven in that paper are more closely
 related to our model construction in section
 \ref{sec:contract}.
 %
-More specifically, many of the lemmas proven in the extended version
-of the paper have direct analogues in Henglein's work.
+More specifically, many of the properties of casts needed to prove
+theorem \ref{thm:axiomatic-graduality} have direct analogues
+in Henglein's work, such as the coherence theorems.
 %
 We have not included these lemmas in the paper because they are quite
-similar to lemmas proven in \citet{newahmed18}; see there for a
-more detailed comparison.
+similar to lemmas proven in \citet{newahmed18}; see there for a more
+detailed comparison, and the extended version of this paper for full proof details \citep{newlicataahmed19:extended}.
 %
 Finally, we note that our assumption of compositionality, i.e., that
 all casts can be decomposed into an upcast followed by a downcast, is
@@ -9665,7 +9663,7 @@ interpretation.
 %
 It may be of interest to develop a more general notion of model of GTT
 for which we can prove soundness and completeness theorems, as in
-\cite{newlicata2018-fscd}.
+\citet{newlicata2018-fscd}.
 %
 A model would be a strong adjunction between double categories where
 one of the double categories has all ``companions'' and the other has
@@ -9757,16 +9755,16 @@ structure~\cite{ahman+16fiberedeffects}.
 %% recursive types, polymorphism, and state, where both equational
 %% reasoning and gradualization are more difficult.
 
-\newpage
+%% \newpage
 
 \paragraph{Acknowledgments}
 
-We thank Gabriel?, Sam TH?, Ron?, MSFP people? FSCD people? for helpful
-discussions about this work.  We thank the anonymous reviewers for
-helpful feedback on a previous version of this article.  This material
-is based on research sponsored by the NSF? FIXME and the United States
-Air Force Research Laboratory under agreement number FA9550-15-1-0053
-and and FA9550-16-1-0292.
+We thank Gabriel Scherer, Kenji Maillard, Ron Garcia, anyone else? for
+helpful discussions about this work.  We thank the anonymous reviewers
+for helpful feedback on this article.  This
+material is based on research sponsored by the NSF? FIXME and the
+United States Air Force Research Laboratory under agreement number
+FA9550-15-1-0053 and and FA9550-16-1-0292.
 %% The U.S. Government is
 %% authorized to reproduce and distribute reprints for Governmental
 %% purposes notwithstanding any copyright notation thereon.