Preparing a document for arXiv submission
Different people have different advice on how to prepare TeX source for arxiv.org submissions. Here is mine! This is the system I use, and it has three basic ideas.
- Upload separate files all at once in a gzipped tar bundle.
- Use a checklist to avoid common mistakes.
- Copy/paste arXiv metadata, to minimize the risk of typos!
I have a shell script that implements these things, and I'll describe it below, but probably each person will want their own version to take into account different preferences, variations between disciplines, etc.
arXiv shell script
Here is the script I use. If the main tex file is newpaper.tex, in a directory called newpaper, then I put this script in a subdirectory newpaper/arxiv and run it from the main directory newpaper. It begins with a tar command to create a gzipped tar bundle of the necessary files. Then it uses pdfinfo (a command-line tool from the Poppler utilities) to extract some metadata from the compiled document newpaper.pdf. Finally, it writes a checklist for finalizing the submission, followed by the metadata, to newpaper/arxiv/prep_log.txt and prints that file to the terminal for copy/pasting.
    #!/bin/sh
    ##
    ## run from main tex dir, with prep script in arxiv subdir
    ##

    filebase="newpaper"
    version="1"
    archivename="arxiv/${filebase}_v${version}.tar.gz"
    logfile="arxiv/prep_log.txt"

    # bundle the source files; tar -v lists each file, and the list goes to the log
    echo "Compressing to: $archivename" > "$logfile"
    tar -cvzf "$archivename" \
        parts/abstract-newpaper.tex \
        parts/applications.tex \
        parts/technical.tex \
        parts/background.tex \
        "$filebase.bbl" \
        "$filebase.tex" >> "$logfile"
    # printf, because echo does not treat "\n" the same way in every shell
    printf 'done.\n\n' >> "$logfile"

    # pull metadata from the compiled pdf
    # (the \| , \+ and \n below assume GNU grep and GNU sed)
    docinfo=$(pdfinfo "$filebase.pdf" | grep 'Title\|Author\|Pages')
    title=$(echo "$docinfo" | grep 'Title' | sed "s/: \+/:\n/")
    authors=$(echo "$docinfo" | grep 'Author' | sed "s/: \+/(s):\n/")
    pagecount=$(echo "$docinfo" | grep 'Pages' | sed "s/Pages: \+//")
    mscinfo=$(pdfinfo -custom "$filebase.pdf" | grep 'MSC2020' | sed "s/MSC2020: \+//")

    cat >> "$logfile" << EOF
    ======================
    arXiv Submission Checklist:
    -[X] set MSC classes
    -[ ] out of draft mode
    -[ ] remove comments
    -[ ] set date (dd fullmonth yyyy)
    -[X] choose primary/secondary subject areas for submission
    -[ ] set/verify metadata
    ======================
    Subject Areas:
    primary: math.CT
    secondary: math.QA
    ======================
    Metadata
    $title
    $authors
    Abstract:
    EOF
    cat parts/abstract-newpaper.tex >> "$logfile"
    printf '\n' >> "$logfile"
    cat >> "$logfile" << EOF
    Comments: $pagecount pages.
    Report number:
    Journal reference:
    DOI:
    ACM class:
    MSC class: $mscinfo
    EOF

    # show everything in the terminal for copy/pasting
    cat "$logfile"
The script could conceivably be split into several smaller ones, but I prefer to have a single script that I run just before uploading. This is my best way to ensure that the files and metadata are the complete and final versions.
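The "complete" part of that check can be made explicit by confirming that every file exists before the archive is built. Here is a minimal sketch of such a check. It is not part of the script above, and the function name require_files is made up for illustration.

```shell
# Hypothetical helper (not in the script above): verify that every
# listed file exists before it goes into the tar bundle.
require_files() {
    missing=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            # report each missing file on stderr, but keep checking the rest
            echo "missing: $f" >&2
            missing=1
        fi
    done
    # exit status 0 if all files exist, 1 otherwise
    return "$missing"
}
```

A call such as `require_files newpaper.tex newpaper.bbl newpaper.pdf || exit 1`, placed before the tar command, would stop the run early instead of producing an incomplete bundle.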
I typically edit the checklist as I complete the various items. I copied the script above when we were close to finishing the document, but not quite done, which is why some items are marked with X (done) and some are not. I make a new version of the script for each of my arXiv submissions, resetting the particular files, checklist, etc. as necessary for each separate document.
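Two of the checklist items, draft mode and leftover comments, lend themselves to a mechanical check. The sketch below is only an illustration: it assumes the draft option sits on an uncommented \documentclass line, it counts only lines that begin with %, and both function names are hypothetical.

```shell
# Hypothetical helpers (not in the script above).

# Succeeds if an uncommented \documentclass line lists the draft option.
in_draft_mode() {
    grep -q '^[[:space:]]*\\documentclass\[[^]]*draft' "$1"
}

# Prints the number of lines whose first non-blank character is %.
# Trailing comments after text on the same line are not counted.
count_comment_lines() {
    grep -c '^[[:space:]]*%' "$1"
}
```

For example, `in_draft_mode newpaper.tex && echo "still in draft mode"` gives a quick reminder before the checklist is printed.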
Setup in tex source
The script above pulls metadata from the compiled pdf. To make all this work, I use hyperref to embed the necessary data into the pdf. Here's an excerpt from the preamble of my main tex file.
Note that I typically use the amsart document class, and the excerpt relies on commands it provides, such as \keywords and \subjclass. I haven't tried this with other classes. I expect something similar should work, but it may need slight adjustments.
    \documentclass[draft]{amsart}
    \usepackage{hyperref}

    \title{Exciting new results that we prove}
    \author{Anne Author}
    \author{Othello Author}
    \date{\today}

    % keywords for p1 footnote and pdf metadata
    \newcommand{\printkwds}{functor coherence, braided monoidal, symmetric monoidal, pseudomorphism coherence, pseudomorphism classifier}
    \keywords{\printkwds} %p1 footnote

    %% MSC
    \newcommand{\printmsc}{18C15 (Primary); 18D20, 18M05, 18M15, 18N15, 19D23 (Secondary)}
    \subjclass[2020]{\printmsc}

    \hypersetup{ % set from other metadata
      pdfusetitle,
      pdfauthor={\authors},
      pdfkeywords={\printkwds},
      pdfinfo={
        MSC2020={\printmsc}
      }
    }
Note, in particular, that the macros \printkwds and \printmsc allow their data to be printed in two different places each, without having to write it out twice. This prevents the two copies from getting out of sync whenever that data needs to be updated.
Check the hyperref documentation for further explanation of the metadata fields. The MSC2020 field is a custom key, and others can be set similarly using the pdfinfo option.
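To confirm that a custom key such as MSC2020 actually landed in the compiled pdf, the output of pdfinfo can be filtered for that one field. The following is a rough sketch using only POSIX sed. The function name pdf_field is hypothetical, and it assumes the key contains no characters that are special in a regular expression.

```shell
# Hypothetical helper (not in the script above): read pdfinfo-style
# "Key:   value" lines on stdin and print the value for the key in $1.
pdf_field() {
    # -n with the p flag prints only lines where the substitution matched
    sed -n "s/^$1:[[:space:]]*//p"
}
```

With the setup above, `pdfinfo -custom newpaper.pdf | pdf_field MSC2020` should print the MSC string; an empty result suggests the key was not written to the pdf.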
Similarly, I keep the abstract in its own file, so that the exact same content can be used in the latex source (via \input) and in this script. The abstract should avoid math symbols and custom macros, because it will be used in both the html arxiv page and the text arxiv email.
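A quick scan of the abstract file can catch stray math or macros before they reach the submission form. This is a minimal sketch under a simple assumption: any dollar sign or backslash is treated as suspect. The function name check_abstract is made up for illustration.

```shell
# Hypothetical helper (not in the script above): flag abstract lines
# containing a dollar sign or a backslash, i.e. likely math or macros.
check_abstract() {
    if [ ! -r "$1" ]; then
        echo "cannot read: $1" >&2
        return 2
    fi
    # grep -n prints each offending line with its line number
    if grep -n '[$\\]' "$1"; then
        echo "WARNING: possible TeX markup in $1" >&2
        return 1
    fi
    echo "OK: no TeX markup found in $1"
}
```

Running `check_abstract parts/abstract-newpaper.tex` prints the offending lines, if any, so they can be reworded in plain text.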
Script output
The script above, when run in my "newpaper" project directory, produces the following output. Here, I can check:
- which files are included in the tar.gz archive,
- progress on my submission checklist, and
- arxiv metadata for copy/pasting into the arxiv submission web form.
    Compressing to: arxiv/newpaper_v1.tar.gz
    parts/abstract-newpaper.tex
    parts/applications.tex
    parts/technical.tex
    parts/background.tex
    newpaper.bbl
    newpaper.tex
    done.

    ======================
    arXiv Submission Checklist:
    -[X] set MSC classes
    -[ ] out of draft mode
    -[ ] remove comments
    -[ ] set date (dd fullmonth yyyy)
    -[X] choose primary/secondary subject areas for submission
    -[ ] set/verify metadata
    ======================
    Subject Areas:
    primary: math.CT
    secondary: math.QA
    ======================
    Metadata
    Title:
    Exciting new results that we prove
    Author(s):
    Anne Author and Othello Author
    Abstract:
    This is the abstract of our paper. It says what we do, and why it's important. It includes a hint about applications, and a bit about technicalities.

    Comments: 77 pages.
    Report number:
    Journal reference:
    DOI:
    ACM class:
    MSC class: 18C15 (Primary); 18D20, 18M05, 18M15, 18N15, 19D23 (Secondary)
Uploading to arXiv
When the document is finalized and ready to upload, I compile the document one final time and then run the script above. On the document upload part of the submission, I choose the gzipped tar bundle. The arXiv server will automatically unpack the .tar.gz file, including any subdirectory structure that was included in the tar bundle.
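Before uploading, it can be reassuring to list what is actually inside the bundle, including the subdirectory paths. Here is a small sketch that uses tar's standard list mode. The function names list_bundle and bundle_has are hypothetical and not part of the script above.

```shell
# Hypothetical helpers (not in the script above).

# Print the paths stored in a gzipped tar bundle, one per line,
# without extracting anything.
list_bundle() {
    tar -tzf "$1"
}

# Succeed if the bundle in $1 contains exactly the path given in $2.
bundle_has() {
    # -x matches whole lines, -F treats the path as a fixed string
    list_bundle "$1" | grep -qxF "$2"
}
```

For instance, `list_bundle arxiv/newpaper_v1.tar.gz` should show entries such as parts/abstract-newpaper.tex, with the parts/ prefix intact.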
On the part of the form for metadata, I copy/paste from the script output. This helps avoid typos and accidental omissions. The format for MSC classes, in particular, is the one recommended by the arXiv help page for MSC classes. That same help page also describes the other metadata fields.