The Signpost

In focus

Tens of thousands of freely available sources flagged

Over the weekend of 25 November, citation templates received some updates. One change, in particular, goes a long way toward flagging freely available resources. Here's a short history of what was needed for the most recent changes to fully pay off.

Step 1: access locks get rolled out

In October 2016, so-called "access locks" were deployed in CS1 and CS2 templates (see Signpost coverage). After a few RfCs on visual appearance, things settled on the current scheme:

Green lock – to indicate a full version of a source that is freely accessible, with no conditions
Gray lock – to indicate a full version of a source that is freely accessible, with some conditions (e.g. free registration is required, only the first 5 reads are free, etc.)
Red lock – to indicate a full version of a source that is not freely accessible (e.g. a paid subscription is required).

Step 2: bots get involved

Access locks for always-free resources, like papers hosted on arXiv or papers with PMCIDs, were automatically rolled out. But the main identifier for scientific articles is the sometimes-free DOI, which requires the presence of |doi-access=free to signal that a particular DOI link is free to read.

For those unfamiliar with DOIs, they are roughly the equivalent of what ISBNs are for books, and usually point to individual academic papers published in peer-reviewed journals. Their structure is 10.xxxxx/foobar, with the 10.xxxxx part being the DOI prefix, identifying who has registered the DOI in question. DOI registrants can be access platforms like JSTOR (10.2307), individual journals like Notre Dame Journal of Formal Logic (10.1305), or publishers like the IEEE (10.1109).
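For readers who prefer code, here is a minimal Python sketch of the prefix/suffix split described above. The function name split_doi and the REGISTRANTS table are hypothetical illustrations, not part of any Wikipedia tool; the three registrants are the ones named in the text.

```python
def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into (prefix, suffix) at the first slash."""
    prefix, sep, suffix = doi.strip().partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI of the form 10.xxxxx/suffix: {doi!r}")
    return prefix, suffix


# Registrant examples taken from the text above (illustrative, far from complete).
REGISTRANTS = {
    "10.2307": "JSTOR",
    "10.1305": "Notre Dame Journal of Formal Logic",
    "10.1109": "IEEE",
}

if __name__ == "__main__":
    prefix, suffix = split_doi("10.2307/foobar")
    print(prefix, "->", REGISTRANTS.get(prefix, "unknown registrant"))
```

Splitting at the first slash only is deliberate: the suffix is whatever the registrant chose and may itself contain slashes.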

While the initial roll-out of DOI access locks was done manually and semi-automatically with WP:AWB, OA Bot greatly assisted in flagging free-to-read resources on select articles. However, OA Bot tends to be user-activated on specific articles, rather than systematically crawling every article on Wikipedia.

One way to find swathes of free DOIs is to identify DOI prefixes belonging to known open-access publishers. For example, 10.3389 belongs to the (in)famous Frontiers Media, while 10.3390 belongs to the equally controversial MDPI. It's then a simple matter to have Citation bot flag them. It worked pretty well for the big publishers, so an effort was made to identify more open-access DOI prefixes, and the bot was updated accordingly.
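As a rough illustration of prefix-based flagging, the Python sketch below adds |doi-access=free to a single citation template when its DOI starts with a known open-access prefix. It is not Citation bot's actual code: the function name flag_free_doi, the regular expression, and the DOI suffixes in the usage example are all assumptions made for the example, and real wikitext (nested templates, comments, empty parameters) needs far more careful parsing.

```python
import re

# The two prefixes named in the text: Frontiers Media and MDPI.
KNOWN_FREE_PREFIXES = {"10.3389", "10.3390"}

# Matches "|doi=VALUE" but not "|doi-access=..." (hypothetical, simplified pattern).
DOI_PARAM = re.compile(r"\|\s*doi\s*=\s*([^|}\s]+)")
ACCESS_PARAM = re.compile(r"\|\s*doi-access\s*=")


def flag_free_doi(template: str) -> str:
    """Return the template with |doi-access=free added when the DOI prefix is known free."""
    if ACCESS_PARAM.search(template):
        return template  # a doi-access parameter is already present; leave it alone
    match = DOI_PARAM.search(template)
    if match is None:
        return template  # no DOI in this citation
    prefix = match.group(1).partition("/")[0]
    if prefix not in KNOWN_FREE_PREFIXES:
        return template  # prefix not known to be fully open access
    end = match.end()
    # Insert the flag directly after the |doi= parameter.
    return template[:end] + " |doi-access=free" + template[end:]


if __name__ == "__main__":
    # The DOI suffix here is made up for the example.
    print(flag_free_doi("{{cite journal |title=Example |doi=10.3390/example123 |year=2020}}"))
```

The sketch never touches a citation that already carries a doi-access parameter, which keeps it safe to run repeatedly over the same article.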

Step 3: search and flag

Targeted Citation bot runs were done from database dumps, rather efficiently to begin with. But while database scans are good at finding articles containing specific DOI prefixes, they are bad at finding articles containing unflagged DOIs with these prefixes. This means that if, hypothetically, 92% of all articles with MDPI DOIs were already flagged, 92% of the processing spent on articles with MDPI prefixes would be wasted. As of writing, that's 12,151 articles, meaning well over 11,000 articles would be processed for nothing to catch the other ~1,000. And the next time, if 98% are already flagged, the run will be even more inefficient.
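The arithmetic behind that estimate can be checked with a few lines of Python. The 12,151 figure comes from the text; the 92% and 98% shares are the hypothetical values used there, and the function name wasted_work is made up for this sketch.

```python
def wasted_work(total_articles: int, flagged_share: float) -> tuple[int, int]:
    """Return (articles processed for nothing, articles that still need a flag)."""
    wasted = round(total_articles * flagged_share)
    return wasted, total_articles - wasted


if __name__ == "__main__":
    for share in (0.92, 0.98):
        wasted, useful = wasted_work(12_151, share)
        print(f"{share:.0%} flagged: {wasted} wasted, {useful} useful")
```

At 92% flagged, roughly 11,200 articles are scanned to fix fewer than 1,000; at 98% the useful share drops to a few hundred.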

Luckily, with the recent update to the CS1 and CS2 citation templates, we have a solution: Category:CS1 maint: unflagged free DOI. This is a category that specifically tracks whether a citation has a) a known free DOI prefix and b) a DOI that has not been flagged as free. As of writing, a bit over 16,000 Wikipedia articles have been identified and processed. Here's an example edit: flagging 2 DOIs with prefix 10.3847, belonging to the American Astronomical Society. Here's another: flagging 4 DOIs with prefixes 10.1186, associated with BioMed Central journals, and 10.1073, associated with Proceedings of the National Academy of Sciences of the United States of America.
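The condition behind a category named "unflagged free DOI" can be sketched as a simple predicate. The real check lives in the Lua code of the CS1 module; the Python below is only an illustration, with a hypothetical function name, a citation modelled as a plain dict of parameters, made-up DOI suffixes in the usage example, and a prefix set limited to the three prefixes mentioned in the text.

```python
# Prefixes mentioned in the example edits above (illustrative subset only).
KNOWN_FREE_PREFIXES = {"10.3847", "10.1186", "10.1073"}


def is_unflagged_free_doi(params: dict[str, str]) -> bool:
    """True when a citation has a known free DOI prefix but no |doi-access=free."""
    doi = params.get("doi", "").strip()
    prefix = doi.partition("/")[0]
    has_free_prefix = prefix in KNOWN_FREE_PREFIXES
    is_flagged = params.get("doi-access", "").strip() == "free"
    return has_free_prefix and not is_flagged


if __name__ == "__main__":
    # DOI suffixes are made up for the example.
    print(is_unflagged_free_doi({"doi": "10.3847/example"}))
    print(is_unflagged_free_doi({"doi": "10.3847/example", "doi-access": "free"}))
```

Because the predicate is false as soon as the flag is present, a bot working from such a category only ever visits articles that still need an edit, which is what removes the wasted runs described earlier.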

The hope is to have the category mostly cleared by the end of December, after which it will contain only new additions. Those should be easily handled by daily bot runs.

Where to next?

About 2 to 3% of the 16,000 or so articles seem to have a free DOI that is unflagged in Wikidata; these are (mostly) the ones remaining in the category. Sadly, {{cite q}} makes it impossible to deal with this here, or with the many other issues Citation bot is able to correct. Hopefully Wikidata people can look at the updates to the CS1 and CS2 templates, work out what is needed on their side, and update things accordingly.

It should be a relatively straightforward task for someone who understands how Wikidata works. That someone isn't me. But it could be you!

Discuss this story

"a free DOI that is unflagged in Wikidata... Sadly, {{cite q}} makes it impossible to deal with it here"

No, it does not; for example:

{{Cite Q|Q55893751}}

could be changed by the bot to:

{{Cite Q|Q55893751 |doi-access=free}}

and would render as:

John Ruhl; Peter A. R. Ade; John E. Carlstrom; et al. (8 October 2004). "The South Pole Telescope". Proceedings of SPIE. 5498: 11–29. arXiv:astro-ph/0411122. Bibcode:2004SPIE.5498...11R. doi:10.1117/12.552473. ISSN 0277-786X. Wikidata Q55893751.

However, rather than adding metadata to multiple instances of the same citation, it's far more sensible to hold the data on Wikidata, and to render it as part of each citation from there - which is {{Cite Q}}'s purpose.

Those of us working on Cite Q, and on citation metadata on Wikidata, would have appreciated being informed of this initiative when it was being developed, in order that the functionality could be rolled out, and metadata updated (by a bot acting on DOI prefixes in exactly the same manner as described above), in parallel. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:39, 4 December 2023 (UTC)

@Pigsonthewing: Well, consider yourself notified. I would have thought Wikidata people monitored CS1/2 talk pages so that cite Q can remain up to date, but that doesn't seem to have been the case. But also {{cite Q}} and how it interacts with Wikidata is completely obscure (and we really should not be using it, ever), so no one involved expected it to throw errors like this.
Anyway, the current list of registrants can be gotten from the section that starts with
--[[--------------------------< B U I L D _ K N O W N _ F R E E _ D O I _ R E G I S T R A N T S _ T A B L E >--
in Module:Citation/CS1/Configuration/sandbox
Headbomb {t · c · p · b} 21:18, 4 December 2023 (UTC)
Cite Q, and its interactions with Wikidata, are extensively documented. It is not "throwing an error". HTH. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 21:24, 4 December 2023 (UTC)
"throwing a maintenance message" then. As for |doi-access=, {{cite Q}} doesn't mention what its equivalent Wikidata property is. Headbomb {t · c · p · b} 21:31, 4 December 2023 (UTC)
The template that Cite Q wraps throws a maintenance message, because it was changed with no notification to the people who maintain Cite Q. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 21:43, 4 December 2023 (UTC)
Assuming that all papers with a given DOI prefix are open access is not going to work well in practice, since many publishers have mixed approaches to making papers open access that have been published in their journals. I hope you've been automatically confirming this on a per-paper basis rather than just relying on prefixes? Doing this work on Wikidata to start with (and focusing on data that can be individually checked by a script, rather than blanket automatic assumptions) would have been much better. Thanks. Mike Peel (talk) 21:37, 4 December 2023 (UTC)
These are specifically for registrants that have their entire portfolio in open access. MDPI, Frontiers Media, Hindawi, BioMed Central, Athabasca University Press, PeerJ, etc... When they have a mixed portfolio, like IOP Publishing, things don't get flagged. Headbomb {t · c · p · b} 23:42, 4 December 2023 (UTC)

Wikipedia:Wikipedia Signpost/2023-12-04/In_focus