🗄 Let’s create a sample data repo!

pathall · July 9, 2021, 6:25pm

What if we created a repo in our GitHub organization of sample documentary data that is consistently structured (and thus comparable) and available for reuse?

Some criteria:

The data have to be open-access and suitable for reuse
They don’t need to be formatted in any particular way (we can convert them if it’s not too much work; anything already digital is better than a PDF or a .jpg that has to be re-transcribed!)
They shouldn’t be gigantic — any or all of the following would suffice:
- a few short-to-medium length texts
- a wordlist or small dictionary
- some metadata
- grammatical abbreviations
- phonological inventories
Citation required, of course
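
To make "consistently structured" a bit more concrete, here is a minimal sketch in Python of what one entry might look like, plus a check for the criteria above. Everything in it is an assumption for discussion rather than a decided format: the `SampleDataset` class, its field names, and the `problems` helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SampleDataset:
    """One hypothetical entry in the sample data repo (all names are illustrative)."""
    citation: str                                        # required, per the criteria
    metadata: dict = field(default_factory=dict)         # e.g. language name, source
    texts: list = field(default_factory=list)            # a few short-to-medium texts
    wordlist: list = field(default_factory=list)         # wordlist or small dictionary
    abbreviations: dict = field(default_factory=dict)    # grammatical abbreviations
    phonemes: list = field(default_factory=list)         # phonological inventory


def problems(dataset: SampleDataset) -> list:
    """Return a list of problems; an empty list means the entry meets the criteria."""
    found = []
    if not dataset.citation.strip():
        found.append("missing citation")
    # "Any or all" of the components would suffice, so require at least one.
    components = [dataset.metadata, dataset.texts, dataset.wordlist,
                  dataset.abbreviations, dataset.phonemes]
    if not any(components):
        found.append("no content")
    return found
```

An entry with a citation and at least one component (say, a wordlist with a single placeholder item) would give `problems(entry) == []`, while an entry with nothing filled in would report both problems. Keeping every component optional follows the "any or all" wording, and only the citation is treated as mandatory.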

The content could come from existing sources (yesterday’s post on the CoCoON Archive, for example, might be a starting point). The point would be not to create an archive per se, but to create some data for testing various kinds of user interfaces in documentation.

I have some materials of my own that I could contribute, but I need to sift through them and make sure all the citations are in order.

(There’s no data there as yet, updates soon!)

https://github.com/docling-forum/docling-data