Most scientific articles in PDF form sadly don't have good embedded metadata [0], so that, combined with the DOI issue, makes this not very useful (at least for the journals I read).
I also would have implemented this with a simpler shell script calling exiftool[1] and pdftotext[2], but hey, it's fun to have a Python-based implementation :)
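Roughly what I have in mind (just a sketch, assuming Poppler's pdftotext, a reasonably recent exiftool, and that the DOI, if there is one, shows up in the first page's text):

  #!/bin/sh
  # usage: ./pdfmeta.sh some-paper.pdf
  # Dump whatever embedded metadata the file actually has, as JSON.
  exiftool -json "$1"
  # Pull the first page's text and look for something DOI-shaped.
  pdftotext -f 1 -l 1 "$1" - | grep -oE '10\.[0-9]{4,9}/[^[:space:]]+' | head -n 1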
If publishers listened to their readers we would have had 100% Open Access ten years ago. Traditional academic publishers do not listen and don't care.
Even if I could convince, say, PLOS to do something about this, it wouldn't change much. We need all, or at least the majority, of publishers to provide good embedded metadata, not just an isolated one or two. I don't see a good mechanism for making that happen, sadly.
Do you mean having the DOIs of all referenced papers included in the PDF metadata somehow? In my experience, authors don't know LaTeX well enough to do that. Half of them don't even read the submission guidelines that clearly say not to put page numbers on the camera-ready paper; they're never going to understand complicated metadata commands...
On a related note, these past couple of weeks I've found myself wanting to import several years' worth of accumulated PDFs into a BibTeX file. This has involved metadata extraction, text scraping, querying Google Scholar (good, but rate-limited) and CrossRef (no limit, but not as accurate).
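The CrossRef side of that boils down to two curl calls, roughly (the title and DOI below are placeholders, and not every DOI resolves to BibTeX this way):

  # Guess a DOI from a scraped title via the CrossRef REST API.
  curl -s 'https://api.crossref.org/works?query=Some+paper+title&rows=1' | jq -r '.message.items[0].DOI'
  # Turn a DOI into a BibTeX entry using DOI content negotiation.
  curl -sL -H 'Accept: application/x-bibtex' 'https://doi.org/10.xxxx/xxxxx'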
I've written a very rough guide to the approaches I've taken so far at http://chriswarbo.net/essays/pdf-tools.html , with a bunch of links to external tools, some NixOS package definitions, commandline snippets and descriptions of Emacs macros.
Not quite the same problem as the author's, but the tools and scripts I've been using can do similar things :)
This is really neat! For work I've found myself exploring the tech around PDFs from time to time, and I find it strangely fascinating. It's like a shim on top of something old and ugly that enables integration with much more modern systems.
Some quick feedback (and a shameless plug):
The CLI should output JSON; it would be nice to combine it with a CLI JSON parser such as jq.
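For example (the command name, flag and JSON key here are hypothetical, just to show the shape of the pipeline):

  # hypothetical: a --json flag on the tool, piped straight into jq
  pdf-metadata-tool --json paper.pdf | jq -r '.doi'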
Shameless plug: I've been working on a PDF CLI aimed at making it easier to programmatically fill out PDF forms: https://github.com/adelevie/pdfq. It provides an interface and some wrappers on top of the main pdf form-filling tool, pdftk. For example, you can get json out of a pdf form like this:
pdftk hello.pdf dump_data_fields | pdfq
Or you can generate FDF from a json file:
cat hello.json | pdfq json_to_fdf
You can also fill a PDF without touching any FDF code.
PDF is less proprietary than most people think. It is an ISO standard, after all, and while it is a bit complicated, it does solve the problem of making "printable" documents produced by all sorts of tools available online.
Slightly off-topic, pardon me, but does anyone have good tips on how to remove PDF security?
Let me clarify why: I frequently come across datasheets (e.g. for flash memory ICs) that have security enabled for some strange reason. Nothing secret, just plainly downloaded from the Internet. I can open and print them, but not highlight or add remarks.
The existing solutions I've found so far are inadequate, since they typically amount to 'download this obscure-sounding executable', 'upload and convert on this sketchy possibly-malware-injecting website', or printing the entire thing to a new PDF document (e.g. via PDF Creator), which makes the text un-highlightable.
I don't mind anything involving hex-editing, some node.js or python-lib, or chanting and dancing, as long as it gets the job done.
I just want to be able to highlight and copy text :(
Nice! Thanks for that! If you only knew how much I've looked for something like this, and, not to brag or anything, my internet-search-skills are prettttttty sharp.
For others' information (from Wikipedia): "Evince used to obey the DRM restrictions of PDF files, which may prevent copying, printing, or converting some PDF files, however this has been made optional, and turned off by default[...]"
DOIs are great for humans; unfortunately they'll take you to the publisher's webpage, and I don't know of a standard way of getting an actual PDF from a DOI. Maybe with PLOS: I know they're good at serving up different versions (XML with a different Accept header, IIRC).
Searching the page for something that looks like a download-PDF button and trying that might get you 80% of the way there, along with at least giving the user the remaining DOI URLs to visit themselves.
This is actually relevant to a lot of problems I see, so if anyone has a general solution (or even a 90% solution), I'm all ears :)
The standard way of getting the actual PDF from a DOI, when it's a Crossref DOI (which it probably is), is to use the full-text link available in the Crossref API.
Publishers are still getting round to including full-text links in their metadata, but there are 16,000,000 DOIs with such data. Not all of them are open access, however.
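For a given DOI (placeholder below), the lookup looks roughly like this; the link array is only present when the publisher has actually deposited full-text links:

  # Fetch the Crossref record for a DOI and pull out any deposited full-text links.
  curl -s 'https://api.crossref.org/works/10.xxxx/xxxxx' | jq '.message.link'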
When a PDF has Crossref CrossMark, the DOI is embedded in the metadata (I can't say how, but I can find out).
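In the meantime, dumping all of the XMP with exiftool and grepping for anything DOI-shaped is a quick way to check a given file; a rough sketch:

  # -a keeps duplicate tags, -G1 shows which XMP namespace each value came from.
  exiftool -a -G1 -XMP:all crossmarked.pdf | grep -i doi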
Maybe consider a Google Scholar integration? Its search results sometimes include links to full-text PDFs. Even when they don't, extracted links to publisher websites could be helpful for a batch review of referenced articles.
Or maybe I should ask: how does any tool besides Adobe LiveCycle Designer handle XFA? I've been using Apache Tika with Apache Solr to index PDF documents, and it worked quite well until I tried to index XFA-based documents and got nothing. Well, I got metadata, but not the content.
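If anyone is debugging the same thing, running tika-app directly on a suspect file shows whether there's any extractable text at all before Solr gets involved (file names here are just illustrative):

  # Plain-text body (comes back empty for my XFA forms) and the metadata Tika does see.
  java -jar tika-app.jar --text suspect-form.pdf
  java -jar tika-app.jar --metadata suspect-form.pdf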
[0] http://rossmounce.co.uk/2012/12/31/pdf-metadata-why-so-poor/
[1] http://www.sno.phy.queensu.ca/~phil/exiftool/
[2] http://poppler.freedesktop.org/