Most scientific articles in PDF form sadly don't have good embedded metadata [0], so that, combined with the DOI issue, makes this not very useful (at least for the journals I read).
I also would have implemented this with a simpler shell script calling exiftool[1] and pdftotext[2], but hey, it's fun to have a Python-based implementation :)
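Roughly what I have in mind (just a sketch, assuming Poppler's pdftotext, a reasonably recent exiftool, and that the DOI, if there is one, shows up in the first page's text):

  #!/bin/sh
  # usage: ./pdfmeta.sh some-paper.pdf
  # Dump whatever embedded metadata the file actually has, as JSON.
  exiftool -json "$1"
  # Pull the first page's text and look for something DOI-shaped.
  pdftotext -f 1 -l 1 "$1" - | grep -oE '10\.[0-9]{4,9}/[^[:space:]]+' | head -n 1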
If publishers listened to their readers we would have had 100% Open Access ten years ago. Traditional academic publishers do not listen and don't care.
Even if I could convince, say, PLOS to do something about this, it wouldn't change much. We need all, or at least the majority, of publishers to provide good embedded metadata, not just an isolated one or two. I don't see a good mechanism for making that happen, sadly.
Do you mean having the DOIs of all referenced papers included in the PDF metadata somehow? In my experience, authors don't know LaTeX well enough to do that. Half of them don't even read the submission guidelines that clearly say not to put page numbers on the camera-ready paper; they're never going to understand complicated metadata commands...
On a related note, these past couple of weeks I've found myself wanting to import several years' worth of accumulated PDFs into a BibTeX file. This has involved metadata extraction, text scraping, querying Google Scholar (good, but rate-limited) and CrossRef (no limit, but not as accurate).
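The CrossRef side of that boils down to two curl calls, roughly (the title and DOI below are placeholders, and not every DOI resolves to BibTeX this way):

  # Guess a DOI from a scraped title via the CrossRef REST API.
  curl -s 'https://api.crossref.org/works?query=Some+paper+title&rows=1' | jq -r '.message.items[0].DOI'
  # Turn a DOI into a BibTeX entry using DOI content negotiation.
  curl -sL -H 'Accept: application/x-bibtex' 'https://doi.org/10.xxxx/xxxxx'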
I've written a very rough guide to the approaches I've taken so far at http://chriswarbo.net/essays/pdf-tools.html , with a bunch of links to external tools, some NixOS package definitions, commandline snippets and descriptions of Emacs macros.
Not quite the same problem as the author's, but the tools and scripts I've been using can do similar things :)
This is really neat! For work I've found myself exploring the tech around PDFs from time to time, and I find it strangely fascinating. It's like a shim on top of something old and ugly that enables integration with much more modern systems.
Some quick feedback (and a shameless plug):
The CLI should output JSON; it would be nice to combine it with a CLI JSON parser such as jq.
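For example (the command name, flag and JSON key here are hypothetical, just to show the shape of the pipeline):

  # hypothetical: a --json flag on the tool, piped straight into jq
  pdf-metadata-tool --json paper.pdf | jq -r '.doi'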
Shameless plug: I've been working on a PDF CLI aimed at making it easier to programmatically fill out PDF forms: https://github.com/adelevie/pdfq. It provides an interface and some wrappers on top of the main pdf form-filling tool, pdftk. For example, you can get json out of a pdf form like this:
pdftk hello.pdf dump_data_fields | pdfq
Or you can generate FDF from a json file:
cat hello.json | pdfq json_to_fdf
You can also fill a PDF without touching any FDF code.
PDF is less proprietary than most people think. It is an ISO standard, after all, and while it is a bit complicated, it does solve the problem of making "printable" documents produced by all sorts of tools available online.
Slightly off-topic, pardon me, but does anyone have good tips on how to remove PDF security?
Let me clarify why: I frequently come across datasheets (e.g. for flash memory ICs) that have security enabled for some strange reason. Nothing secret, just plainly downloaded from the Internet. I can open and print them, but not highlight or add remarks.
The existing solutions I've found so far are inadequate, since they typically amount to 'download this obscure-sounding executable', 'upload and convert on this sketchy possibly-malware-injecting website', or printing the entire thing to a new PDF document (e.g. via PDF Creator), which makes the text un-highlightable.
I don't mind anything involving hex-editing, some node.js or python-lib, or chanting and dancing, as long as it gets the job done.
I just want to be able to highlight and copy text :(
Nice! Thanks for that! If you only knew how much I've looked for something like this, and, not to brag or anything, my internet-search-skills are prettttttty sharp.
For others' information (from Wikipedia): "Evince used to obey the DRM restrictions of PDF files, which may prevent copying, printing, or converting some PDF files, however this has been made optional, and turned off by default[...]"
DOIs are great for humans; unfortunately they'll take you to the publisher's webpage, and I don't know of a standard way of getting an actual PDF from a DOI. Maybe with PLOS: I know they're good at serving up different versions (XML with a different Accept header, IIRC).
Searching the page for something that looks like a download-PDF button and trying that might get you 80% of the way there, along with at least giving the user the remaining DOI URLs to visit themselves.
This is actually relevant to a lot of problems I see, so if anyone has a general solution (or even a 90% solution), I'm all ears :)
The standard way of getting the actual PDF from a DOI, when it's a Crossref DOI (which it probably is), is to use the full-text link available in the Crossref API.
Publishers are still getting round to including full-text links in their metadata, but there are 16,000,000 DOIs with such data. Not all of them are open access, however.
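For a given DOI (placeholder below), the lookup looks roughly like this; the link array is only present when the publisher has actually deposited full-text links:

  # Fetch the Crossref record for a DOI and pull out any deposited full-text links.
  curl -s 'https://api.crossref.org/works/10.xxxx/xxxxx' | jq '.message.link'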
When a PDF has Crossref CrossMark, the DOI is embedded in the metadata (I can't say how, but I can find out).
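In the meantime, dumping all of the XMP with exiftool and grepping for anything DOI-shaped is a quick way to check a given file; a rough sketch:

  # -a keeps duplicate tags, -G1 shows which XMP namespace each value came from.
  exiftool -a -G1 -XMP:all crossmarked.pdf | grep -i doi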
Maybe consider a Google Scholar integration? Its search results sometimes include links to full-text PDFs. Even when they don't, extracted links to publisher websites could be helpful for a batch review of referenced articles.
Or maybe I should ask: how does any tool besides Adobe LiveCycle Designer handle XFA? I've been using Apache Tika with Apache Solr to index PDF documents, and it worked quite well until I tried to index XFA-based documents and got nothing. Well, I got metadata, but not the content.
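If anyone is debugging the same thing, running tika-app directly on a suspect file shows whether there's any extractable text at all before Solr gets involved (file names here are just illustrative):

  # Plain-text body (comes back empty for my XFA forms) and the metadata Tika does see.
  java -jar tika-app.jar --text suspect-form.pdf
  java -jar tika-app.jar --metadata suspect-form.pdf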
[0] http://rossmounce.co.uk/2012/12/31/pdf-metadata-why-so-poor/
[1] http://www.sno.phy.queensu.ca/~phil/exiftool/
[2] http://poppler.freedesktop.org/