🗄 Let’s create a sample data repo!

What if we created a repo in our GitHub organization for sample documentary data that is consistently structured (and thus comparable) and available for reuse?

Some criteria:

  • The data have to be open-access and suitable for reuse
  • They don’t need to be formatted in any particular way (we can convert them if it’s not too much work; anything already digital is better than a PDF or a .jpg that has to be re-transcribed!)
  • They shouldn’t be gigantic — any or all of the following would suffice:
    • a few short-to-medium length texts
    • a wordlist or small dictionary
    • some metadata
    • grammatical abbreviations
    • phonological inventories
  • Citation required, of course
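To make "consistently structured" a bit more concrete, here is a rough Python sketch of what one entry might look like. Everything in it is an assumption for the sake of discussion: the `SampleDataset` class, its field names, and the `problems` helper are hypothetical, not an agreed-on format for the repo, and the example values are placeholders rather than real data.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SampleDataset:
    """One sample dataset. All field names here are hypothetical."""
    language: str
    citation: str   # required: where the data came from
    license: str    # should permit reuse
    texts: list[str] = field(default_factory=list)
    wordlist: dict[str, str] = field(default_factory=dict)       # form -> gloss
    abbreviations: dict[str, str] = field(default_factory=dict)  # e.g. "PST" -> "past"
    phonemes: list[str] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)


def problems(ds: SampleDataset) -> list[str]:
    """Return a list of ways the dataset fails the criteria above."""
    issues = []
    if not ds.citation.strip():
        issues.append("missing citation")
    if not ds.license.strip():
        issues.append("missing license")
    # Any one kind of content is enough; none at all is a problem.
    if not (ds.texts or ds.wordlist or ds.abbreviations
            or ds.phonemes or ds.metadata):
        issues.append("no content")
    return issues


# Placeholder values only, not real data.
example = SampleDataset(
    language="(language name)",
    citation="(full citation goes here)",
    license="(open license name)",
    abbreviations={"PST": "past", "1SG": "first person singular"},
)
print(problems(example))  # prints []
```

A check like `problems` could run over every contribution, so that nothing lands in the repo without a citation and a license.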

The content could come from existing sources (yesterday’s post on the CoCoON Archive, for example, might be a starting point). The point would be not to create an archive per se, but to create some data for testing various kinds of user interfaces in documentation.

I have some material of my own that I could contribute, but I need to sift through it and make sure all the citations are in order.

(There’s no data in the repo yet; updates soon!)

https://github.com/docling-forum/docling-data
