My workflow: Managing and munging PDFs
Dealing with PDFs is something I do every day as someone working in software, especially given that I tend toward both research and lower-level work where papers and datasheets rule.
I think that the humble PDF is one of my favourite file formats besides text:
- You can give someone one and it will work
- Vectors work great in it
- Old files also just work
- Anywhere in the continuum between "digital-native output" to "a scan" can be represented and worked with nicely
- Search is typically pretty great when you have the right document since they tend to be large so CTRL-F can go very far
That said, "not being a text file" does sometimes make some tasks difficult, metadata is often dubious, and I am usually drowning in a mountain of PDFs at all times.
Most of the stuff described in this post can probably be done with Adobe Acrobat, but it is not available for my computer. All of the tools described below have packaging in the AUR or main repos on Arch and are not hard to run on other operating systems.
Fixing PDFs
There's several tools I regularly use for fixing up PDFs off the internet, since it's unfortunately common that they come in with bad metadata, or in other problematic forms.
Page numbering
PDF supports switching page numbering midway through the document, for instance, if the front-matter is numbered in Roman numerals and the main content is in Arabic numerals. Too often, large PDFs that run across my desk don't have this set up properly, so the page numbers are annoyingly offset.
You can fix this with the "page numbering" feature of jPDF Tweak.
Document outline
PDF has a great feature called "document outline" or "bookmarks", which lets you include the table of contents in searchable form that will show up in the sidebar of good PDF viewers.
Unfortunately, many PDFs don't have these set up, which makes big documents a hassle to work with as you have to jump back and forth between the table of contents page and the rest of the document to find things. Fortunately, these can be fixed.
There are three main tools that are useful for bookmarks hacking:
- jPDF Tweak, a multi-tool for doing various metadata hacking.
- JPdfBookmarks, a powerful bookmarks-specific editor.
- HandyOutliner, a small tool mostly useful to turn textual tables of contents into bookmarks.
Hyperlinked table of contents
This is the most convenient case: the author put in a hyperlinked table of contents, but somehow the tooling didn't create a document outline. If this happens, you can get a perfect outline with almost no work.
Use the "Extract links from current page and add them as bookmarks" button in JPdfBookmarks to deal with this. It will do as it says: just grab all the hyperlinks and turn them directly into a document outline.
This is great since generally the hyperlinks will have correct page positions and so the outline will go to the right spot on the page in addition to going to the right page.
Textual table of contents
If you can cleanly get or create a table of contents such as the following:
I. Introduction 1
1. Introduction 3
1.1. Software deployment 3
1.2. The state of the art 6
1.3. Motivation 13
1.4. The Nix deployment system 14
1.5. Contributions 14
1.6. Outline of this thesis 16
1.7. Notational conventions 17
Then the best bet is probably to use HandyOutliner to ingest that table of contents as text and create bookmarks.
Often copy support in PDF tables of contents is pretty awful (and I can only imagine it does horrors to screen readers), so it may need some serious amount of cleanup in a text editor, as was the case for me while making an outline for Eelco Dolstra's PhD thesis on Nix.
Another way this can be done is with the "Bookmarks" tab in jPDF Tweak and importing a CSV you make.
Such a CSV looks like so:
1;O;Acknowledgements;3
1;O;Contents;5
1;O;I. Introduction;9
2;O;1. Introduction;11
3;O;1.1. Software deployment;11
3;O;1.2. The state of the art;14
The columns are:
- Depth
- Open ("O" if the level in the tree should start opened, else "")
- Data
- Page number. You can also put coordinates at the end if truly motivated.
Encrypted PDFs
These are annoying. You can strip the encryption with qpdf
:
qpdf --decrypt input.pdf output.pdf
Pages are in the wrong order/PDFs need merging
Imagine that you have been fighting a scanner to scan some document and the software for it is bad and doesn't show previews large enough to make out the page numbers. Exasperated, you just save the PDF knowing the pages are in the wrong order and spread over multiple files.
For this, use pdfarranger, which makes it easy to reorder pages as desired.
Having too many PDFs in my life
Directory full of PDFs to search
Relatable problem! Use pdfgrep:
pdfgrep -nri 'somequery' .
Too many bloody PDFs; overflowing disorganized directories
Academics have this problem and equally have solutions: Use Zotero or similar research document management software to categorize and tag documents.
Getting more of them
As I have student credentials, I can use the University library to get documents. However, getting authenticated to publisher sites is annoying: I often don't use the University library's search system since it can have poor results, but the login pages for individual publisher sites are confusing as well.
UBC uses OpenAthens for their access control on publisher sites. They have a rather nice uniform redirector service that can log in and redirect back to sites: https://docs.openathens.net/libraries/redirector-link-generator
I made a little bookmarklet to authenticate to publisher sites:
javascript:void(location.href='https://go.openathens.net/redirector/ubc.ca?url='+location.href)
It's also possible to use a well-known Web site to "acquire" papers, which is often more convenient than the silly barriers that publishers use to extract profits from keeping publicly-funded knowledge unfree (paper authors are paid nil by journals), even with legitimate access. If one were to use such a hypothetical Web site, it is easiest to use by putting the DOI of papers into it.
Also, paper authors probably have copies of their papers, and are typically happy to send them to you for free if you email them.