Preparing a document for arXiv submission

20 Nov, 2023

Different people have different advice for how to prepare tex source for arxiv.org submissions. Here is mine! This is the system I use, and it has three basic ideas.

Upload separate files all at once in a gzipped tar bundle.
Use a checklist to avoid common mistakes.
Copy/paste arXiv metadata, to minimize the risk of typos!

I have a shell script that implements these things, and I'll describe it below, but probably each person will want their own version to take into account different preferences, variations between disciplines, etc.

arXiv shell script

Here is the script I use. If the main tex file is newpaper.tex, in a directory called newpaper, then I put this script in a subdirectory newpaper/arxiv and run it from the main directory. It begins with a tar command to create a gzipped tar bundle of the necessary files. Then it uses pdfinfo (one of the Poppler command-line utilities) to extract some metadata from the compiled document newpaper.pdf. The file listing, a checklist for finalizing the submission, and the metadata are written to newpaper/arxiv/prep_log.txt, which is then printed out to the terminal for copy/pasting.

#!/bin/sh
##
## run from main tex dir, with prep script in arxiv subdir
##
filebase="newpaper"
version="1"
archivename="arxiv/${filebase}_v${version}.tar.gz"
logfile="arxiv/prep_log.txt"
echo "Compressing to: $archivename" > $logfile

tar -cvzf $archivename \
    parts/abstract-newpaper.tex \
    parts/applications.tex \
    parts/technical.tex \
    parts/background.tex \
    $filebase.bbl \
    $filebase.tex >> $logfile
## printf, because echo does not treat "\n" the same way in every sh
printf 'done.\n\n' >> "$logfile"

## note: the \| in grep and the \+ and \n in sed below are GNU extensions
docinfo=$(pdfinfo $filebase.pdf | grep 'Title\|Author\|Pages')
title=$(echo "$docinfo" | grep 'Title' | sed "s/: \+/:\n/")
authors=$(echo "$docinfo" | grep 'Author' | sed "s/: \+/(s):\n/")
pagecount=$(echo "$docinfo" | grep 'Pages' | sed "s/Pages: \+//")
mscinfo=$(pdfinfo -custom $filebase.pdf | grep 'MSC2020' | sed "s/MSC2020: \+//")



cat >> $logfile << EOF
======================
arXiv Submission Checklist:
 -[X] set MSC classes
 -[ ] out of draft mode
 -[ ] remove comments
 -[ ] set date (dd fullmonth yyyy)
 -[X] choose primary/secondary subject areas for submission
 -[ ] set/verify metadata

======================
Subject Areas:
 primary: math.CT
 secondary: math.QA

======================
Metadata
$title
$authors

Abstract:
EOF
cat parts/abstract-newpaper.tex >> $logfile

printf '\n\n' >> "$logfile"
cat >> $logfile << EOF
Comments:
$pagecount pages.

Report number:

Journal reference:

DOI:

ACM class:

MSC class:
$mscinfo

EOF

cat $logfile

The script could conceivably be split into several smaller ones, but I prefer to have a single script that I run just before uploading. This is my best way to ensure that the files and metadata are the complete and final versions.
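To double-check that the bundle really is complete, one option is to list the archive after it is written and compare it against the files that should be there. The following is only a minimal sketch: the function name missing_from_archive is hypothetical and not part of the script above, and the file names in the example call are taken from the tar command in the script.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the script above):
# report every expected file that is absent from a gzipped tar archive.
# Usage: missing_from_archive ARCHIVE FILE...
missing_from_archive() {
    archive="$1"
    shift
    # tar -tzf lists the member names of a gzipped archive, one per line.
    listing=$(tar -tzf "$archive") || return 2
    status=0
    for f in "$@"; do
        # -x matches whole lines only, -F treats the name as a fixed string.
        if ! printf '%s\n' "$listing" | grep -qxF -- "$f"; then
            echo "missing: $f"
            status=1
        fi
    done
    return $status
}

# Example call, with names assumed from the script above:
# missing_from_archive arxiv/newpaper_v1.tar.gz newpaper.tex newpaper.bbl
```

The function prints nothing and returns 0 when every listed file is in the archive, so it could be called right after the tar command.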

I typically edit the checklist as I complete the various items. I copied the script above when we were close to finishing the document, but not quite there, which is why some items are marked with X (done) and some are not. I make a new version of the script for each of my arXiv submissions, resetting the particular files, checklist, etc. as necessary for each separate document.
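Some checklist items can be backed up by a mechanical check. As one illustration, here is a minimal sketch for the "out of draft mode" item. The function name in_draft_mode is hypothetical, and the pattern is a heuristic that only looks for a draft option on the \documentclass line.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the script above):
# succeed if the \documentclass line of the given tex file still has
# "draft" among its bracketed options.
in_draft_mode() {
    # \\ matches a literal backslash, \[ a literal opening bracket, and
    # [^]]* any run of characters other than a closing bracket.
    grep -q '^[[:space:]]*\\documentclass\[[^]]*draft' "$1"
}

# Example call, with the file name assumed from the script above:
# if in_draft_mode newpaper.tex; then
#     echo "still in draft mode"
# fi
```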

Setup in tex source

The script above pulls metadata from the compiled pdf. To make all this work, I use hyperref to embed the necessary data into the pdf. Here's an excerpt from the preamble of my main tex file.

Note that I typically use the amsart document class; I haven't tried this with other classes. I expect something similar should work, but it may need slight adjustments.

\documentclass[draft]{amsart}

\usepackage{hyperref}

\title{Exciting new results that we prove}

\author{Anne Author}

\author{Othello Author}

\date{\today}


% keywords for p1 footnote and pdf metadata
\newcommand{\printkwds}{functor coherence, braided monoidal, symmetric monoidal, pseudomorphism coherence, pseudomorphism classifier}
\keywords{\printkwds} %p1 footnote

%% MSC
\newcommand{\printmsc}{18C15 (Primary); 18D20, 18M05, 18M15, 18N15, 19D23 (Secondary)}
\subjclass[2020]{\printmsc}


\hypersetup{ % set from other metadata
  pdfusetitle,
  pdfauthor={\authors},
  pdfkeywords={\printkwds},
  pdfinfo={
   MSC2020={\printmsc}
  }
}

Note, in particular, that the macros \printkwds and \printmsc each allow their data to be used in two different places without being written out twice. This prevents errors whenever that data needs to be updated.

Check the hyperref documentation for further explanation of the metadata fields. The MSC2020 field is a custom key, and others can be set similarly using the pdfinfo option.

Similarly, I keep the abstract in its own file, so that the exact same content can be used in the latex source (via \input) and in the shell script. The abstract should avoid math symbols and custom macros, because it will be used in both the HTML arXiv page and the plain-text arXiv email.
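A quick search of the abstract file can flag characters that usually signal math or macros. This is a minimal sketch, not part of the script above: the function name check_abstract is hypothetical, and looking for backslashes and dollar signs is only a heuristic.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the script above):
# flag lines of an abstract file that contain a backslash or a dollar
# sign, which usually indicate a TeX macro or inline math.
check_abstract() {
    # grep -n prints each matching line with its line number; inside a
    # bracket expression both the backslash and the dollar sign are literal.
    if grep -n '[\\$]' "$1"; then
        echo "warning: possible math or macros in $1"
        return 1
    fi
    echo "ok: no backslashes or dollar signs in $1"
}

# Example call, with the path assumed from the script above:
# check_abstract parts/abstract-newpaper.tex
```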

Script output

The script above, when run in my "newpaper" project directory, produces the following output. Here, I can check:

which files are included in the tar.gz archive,
progress on my submission checklist, and
arxiv metadata for copy/pasting into the arxiv submission web form.

Compressing to: arxiv/newpaper_v1.tar.gz
parts/abstract-newpaper.tex
parts/applications.tex
parts/technical.tex
parts/background.tex
newpaper.bbl
newpaper.tex
done.

======================
arXiv Submission Checklist:

 -[X] set MSC classes
 -[ ] out of draft mode
 -[ ] remove comments
 -[ ] set date (dd fullmonth yyyy)
 -[X] choose primary/secondary subject areas for submission
 -[ ] set/verify metadata

======================
Subject Areas:
 primary: math.CT
 secondary: math.QA

======================
Metadata

Title:
Exciting new results that we prove

Author(s):
Anne Author and Othello Author

Abstract:
This is the abstract of our paper. It says what we do, and why it's important. It includes a hint about applications, and a bit about technicalities.


Comments:
77 pages.

Report number:

Journal reference:

DOI:

ACM class:

MSC class:
18C15 (Primary); 18D20, 18M05, 18M15, 18N15, 19D23 (Secondary)

Uploading to arxiv

When the document is finalized and ready to upload, I compile the document one final time and then run the script above. On the document upload part of the submission, I choose the gzipped tar bundle. The arXiv server will automatically unpack the .tar.gz file, including any subdirectory structure that was included in the tar bundle.
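Because the metadata is read from the compiled pdf, a stale pdf would give stale metadata. One way to guard against that is to check whether any source file is newer than the pdf before running the script. This is a minimal sketch: the function name stale_sources is hypothetical, and the file names in the example are assumed from the script above.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the script above):
# print every given source file, or file under a given directory, that
# was modified more recently than the pdf.
# Usage: stale_sources PDF SOURCE...
stale_sources() {
    pdf="$1"
    shift
    # find's -newer test is true for files modified after the reference file.
    find "$@" -type f -newer "$pdf"
}

# Example use, with names assumed from the script above:
# if [ -n "$(stale_sources newpaper.pdf newpaper.tex parts)" ]; then
#     echo "sources changed since the last compile; recompile first"
# fi
```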

On the part of the form for metadata, I copy/paste from the script output. This helps avoid typos and accidental omissions. The format for MSC classes, in particular, is the one recommended by the arXiv help page for MSC classes. That same help page also describes the other metadata fields.

