s on, here comes a rant.
Pre-rant side note… I know it’s been quite quiet around here (aside from heroic @cscanlon holding down the checkin fort!). I suppose this is probably somewhat to do with the start of a new semester for some of the folks here… but it does seem to be the case that activity is directly related to how many recent posts there are, so I’m going to go ahead and just start posting things myself, as embarrassing as it is to see my icon-y face on every new post! I hope you all feel free to start your own topics, no matter how half-baked your idea!
I hate PDFs.
I mean, I hate the PDF format.
PDFs are good at typography and layout, and really that’s it. It is true that PDFs can do essentially anything that an author (or designer) can imagine in terms of visual appearance. You can merge images and text in interesting ways, use any typeface, use fancy column layouts, and so forth.
But a PDF is itself almost like an image once the content is “frozen in”. The problem with PDFs is that their content is interspersed into formatting information in a manner that is essentially one-way — once the information is in the PDF, it can’t come out of the PDF. At least, not without a lot of work.
This is a known problem:
https://filingdb.com/b/pdf-text-extraction
If you would like to spend your time reading other people complaining about PDFs, consider this thread on a website full of programmers and people who run websites. PDFs have been around practically forever, and people still have to fight the format.
The very idea that text has to be “extracted” from a document format is sort of weird. If it’s an output format, why should it be that whatever the original “thing” was that was “formatted” is inevitably lost once the formatted document is created?
Yet, the fact of the matter is that we are funneling a lot of our work into these output formats. That, in and of itself, is not horrible: it’s nice to have a beautiful version of our content for those who want to read it that way. The problem starts when we think of that beautiful visual version as the canonical version of the document. PDFs are searchable only in a very cursory way, they can’t adapt to devices with different screen sizes, and they’re not even legible to many users (for instance, users with partial vision or blindness).
I guess the take-home message I’m trying to get to is that we should think about a PDF as just one of many output formats. A PDF is almost never “the data”. “The data” should be stored in some more fundamental way. Exactly what that way should be is a topic for another discussion (I myself lean toward .json but .xml could work as well, I guess).
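To make the "one source, many outputs" idea a bit more concrete, here's a rough sketch in Python. Everything in it is made up for illustration: the JSON field names (`title`, `sections`, `heading`, `paragraphs`) and the helper functions `load_document`, `to_plain_text`, and `to_html` are my own assumptions, not any standard schema or library. The point is only that the structured record is the thing you keep, and each rendering (a PDF would be one more) is generated from it.

```python
import json
from html import escape

# Hypothetical canonical record: the field names are an assumption
# chosen for this sketch, not an established schema.
CANONICAL = """
{
  "title": "Why I hate PDFs",
  "sections": [
    {
      "heading": "Output vs. data",
      "paragraphs": [
        "A PDF is one rendering of the content.",
        "The structured record is the data."
      ]
    }
  ]
}
"""


def load_document(raw):
    """Parse the canonical JSON text into a plain Python dict."""
    return json.loads(raw)


def to_plain_text(doc):
    """Render the document as plain text, one line per heading or paragraph."""
    lines = [doc["title"], ""]
    for section in doc["sections"]:
        lines.append(section["heading"])
        lines.extend(section["paragraphs"])
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


def to_html(doc):
    """Render the same document as a minimal HTML fragment."""
    parts = [f"<h1>{escape(doc['title'])}</h1>"]
    for section in doc["sections"]:
        parts.append(f"<h2>{escape(section['heading'])}</h2>")
        parts.extend(f"<p>{escape(p)}</p>" for p in section["paragraphs"])
    return "\n".join(parts)


doc = load_document(CANONICAL)
plain = to_plain_text(doc)
html_out = to_html(doc)

print(plain)
print(html_out)
```

Nothing is "extracted" here: both outputs come straight from the same record, so adding another format just means writing another small render function.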
</rant>