PDFs are horrible

:rescue_worker_helmet:s on, here comes a rant.

Pre-rant side note… I know it’s been quite quiet around here (aside from the heroic @cscanlon holding down the checkin fort! :european_castle:). I suppose this probably has something to do with the start of a new semester for some of the folks here, but it also seems that activity is directly related to how many recent posts there are, so I’m going to go ahead and just start posting things myself, as embarrassing as it is to see my icon-y face on every new post! I hope you all feel free to start your own topics, no matter how half-baked the idea! :bulb:

I hate PDFs.

I mean, I hate the PDF format.

PDFs are good at typography and layout, and really that’s it. It is true that PDFs can do essentially anything that an author (or designer) can imagine in terms of visual appearance. You can merge images and text in interesting ways, use any typeface, use fancy column layouts, and so forth.

But a PDF is itself almost like an image once the content is “frozen in”. The problem with PDFs is that their content is interspersed with formatting information in a way that is essentially one-way: once the information is in the PDF, it can’t come back out of the PDF. At least, not without a lot of work.
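Just to make that concrete, here’s roughly what “getting the text back out” looks like in practice. A minimal sketch, assuming you have the pypdf library installed; the filename is just a placeholder, not anything from a real project.

```python
# Minimal sketch: pull text back out of a PDF with pypdf.
# ("paper.pdf" is a placeholder filename.)
from pypdf import PdfReader

reader = PdfReader("paper.pdf")
for page in reader.pages:
    # extract_text() returns one flat string per page: no headings,
    # no footnote boundaries, no table structure, no gloss alignment,
    # just characters in (roughly) reading order.
    print(page.extract_text())
```

And that flat string is the good case; anything structured about the original document has to be reverse-engineered from it.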

This is a known problem:

https://filingdb.com/b/pdf-text-extraction

If you would like to spend your time reading other people complaining about PDFs, consider this thread on a website full of programmers and people who run websites. PDFs have been around practically forever, and people still have to fight the format.

The very idea that text has to be “extracted” from a document format is sort of weird. If it’s an output format, why should it be that whatever the original “thing” was that was “formatted” is inevitably lost once the formatted document is created?

Yet, the fact of the matter is that we are funneling a lot of our work into these output formats. That, in and of itself, is not horrible: it’s nice to have a beautiful version of our content for those who want to read it that way. The problem starts when we treat that beautiful visual version as the canonical version of the document. PDFs are searchable only in a very cursory way, they can’t adapt to devices with different screen sizes, and they’re not even legible to many users (users with partial vision or blindness, for instance).

I guess the take-home message I’m trying to get to is that we should think of a PDF as just one of many output formats. A PDF is almost never “the data”. “The data” should be stored in some more fundamental way; exactly what that way should be is a topic for another discussion (I myself lean toward .json, but .xml could work as well, I guess).
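To make that less abstract, here’s a toy sketch of the “data first, outputs second” idea. The field names and the record are invented for illustration, not a real schema.

```python
# Toy sketch: the canonical version is a structured data file, and the
# PDF / HTML / plain-text versions are all derived views of it.
# (Field names and values are invented placeholders.)
import json

record = {
    "example_id": "ex-001",
    "title": "…",
    "body": "…",   # the actual content lives here, not in the PDF
}

with open("ex-001.json", "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False, indent=2)

# A PDF (or HTML page, or plain-text version) would then be generated
# *from* this file, so nothing ever needs to be "extracted" back out
# of the output format.
```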

</rant>.



A demonstration of the shortcomings of PDF that is especially relevant to us on this forum: Xia et al. 2016, wherein they attempt to extract interlinear glossed text from PDFs and run into a lot of problems:

if a language line contains non-ASCII characters, the line [of IGT] may be jumbled up by the off-the-shelf PDF-to-text converter and consequently is not included as part of IGT.

ODIN uses an off-the-shelf converter to convert PDF documents into text format and the converter sometimes wrongly splits a[n] [IGT] line into two lines. One such an example is Fig. 2, where the language line is incorrectly split into two lines by the converter, as indicated by the CR (for ‘‘corruption’’) tag for lines 875 and 876.

Text extraction shouldn’t be so hard from a “portable document format”!


Oh wow, yeah, ODIN. I have spent a lot of time looking at that project and reading that literature. They were, I think, really on the right path in the sense of what their goal was: they pretty much knew what the nested structure should be, and IMHO that model is still pretty close to spot-on.
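Just to sketch what I mean by “nested structure” (field names invented here, not ODIN’s actual schema), a single IGT example as data might look something like this:

```python
# One IGT example as nested data rather than three formatted lines.
# (Invented field names, placeholder values.)
igt = {
    "source": {"document": "some-paper.pdf", "page": None},
    "words": [
        {"form": "…", "gloss": "…"},   # one entry per word; form and gloss aligned by construction
        {"form": "…", "gloss": "…"},
    ],
    "translation": "…",
}
```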

It’s interesting that the original version of ODIN took a totally different approach. Rather than converting PDF to text, they tried to find LaTeX source files on the web and then parse them for the standard LaTeX markup (I can’t recall what the LaTeX packages are called… expdx or something like that? Shows you how many articles I have published!)

http://odin.linguistlist.org/

Not sure where in the papers they talk about that, but presumably it didn’t work very well if they switched to PDF-to-text conversion.

I mean, this right here is already game over for a linguist, really. Can’t handle Unicode? Gulp.

But like, parsing LaTeX and parsing the text output of a PDF-to-text converter are different flavors of the same problem: the data was never entered as structured data in the first place. Formatted text isn’t a database; it’s an output format.
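In other words, rendering is a one-way function over the data. A little sketch, reusing the same invented record shape as above:

```python
# Rendering a structured IGT record into the familiar display lines is
# trivial and entirely one-way. (Invented record shape, placeholder values.)
def render_igt(igt):
    lang_line = "  ".join(w["form"] for w in igt["words"])
    gloss_line = "  ".join(w["gloss"] for w in igt["words"])
    return f"{lang_line}\n{gloss_line}\n'{igt['translation']}'"

example = {
    "words": [{"form": "…", "gloss": "…"}, {"form": "…", "gloss": "…"}],
    "translation": "…",
}
print(render_igt(example))

# Going the other direction -- recovering the word/gloss pairing from
# whitespace-aligned text that a converter may have re-wrapped or split --
# is exactly the hard part ODIN had to fight.
```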

Thanks for your comment! It’s been pretty lonely around here :sob:

(By the way @lgessler, I’d love to know what you think about the application brainstorm thread :brain::cloud_with_lightning_and_rain: )