PDFs are horrible

Oh wow, yeah, ODIN. I have spent a lot of time looking at that project and reading that literature. They were, I think, really on the right path in the sense of what their goal was — they pretty much knew what the nested structure should be, and IMHO that model is pretty close to spot-on still.

It’s interesting that the original version of ODIN took a totally different approach. Rather than converting PDF to text, they tried to find LaTeX files on the web and parse those for the standard LaTeX gloss markup (I can’t recall what the LaTeX packages are called… expdx or something like that? Shows you how many articles I have published!)

http://odin.linguistlist.org/

Not sure where in the papers they talk about that, but presumably it didn’t work very well, given that they switched to PDF-to-text.

I mean, this right here is already game over for a linguist, really. Can’t handle Unicode? Gulp.

But like, parsing LaTeX and parsing the text output of PDF-to-text are different flavors of the same problem: the data was never entered as structured data in the first place. Formatted text isn’t a database, it’s an output format.
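A toy sketch of what I mean (not anything ODIN actually did — the example sentence and the extracted text are made up). In gb4e-style LaTeX, a `\gll` gloss pairs the n-th word with the n-th gloss by construction; in text extracted from a PDF, that pairing survives only as whitespace, which you then have to guess back:

```python
# Hypothetical illustration: markup encodes alignment structurally,
# extracted text does not.

# gb4e-style interlinear gloss: word/gloss pairing is explicit in the markup.
latex_gloss = r"\gll Der Hund bellt \\ the dog barks \\"

words, glosses = (part.split() for part in
                  latex_gloss.replace(r"\gll", "").split(r"\\")[:2])
print(list(zip(words, glosses)))
# [('Der', 'the'), ('Hund', 'dog'), ('bellt', 'barks')]

# After PDF extraction, the same gloss is just two lines of text.
# Re-pairing now relies on token counts lining up, which breaks as
# soon as a gloss contains spaces or the typesetting wraps a line.
extracted = "Der Hund bellt\nthe dog barks"
line1, line2 = extracted.splitlines()
print(list(zip(line1.split(), line2.split())))
# Same pairs here — but only because nothing in this toy example
# disturbed the whitespace.
```

That’s the whole game: the LaTeX source at least still has the structure in it, while the PDF route is reverse-engineering layout.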

Thanks for your comment! It’s been pretty lonely around here :sob:

(By the way @lgessler, I’d love to know what you think about the application brainstorm thread :brain::cloud_with_lightning_and_rain: )