The Signpost

Technology report

Better PDFs, backup plans, and birthday wishes

The existing PDF generator skips tables (List of country calling codes § Tree list pictured)

A new way to export pages to PDF files has been developed. The current method of creating PDFs uses the Offline content generator (OCG) service. However, it is problematic for many articles because tables, including infoboxes, are omitted entirely.

There have been multiple requests for table support since the OCG was introduced in 2014. The issue was also raised in 2015 as part of that year's Community Wishlist Survey and German community technical wishlist. Since then, the German Wikimedia chapter (WMDE) has been leading the initiative on enhancing tables in PDF. It was discussed at the 2016 Wikimania Hackathon, where a solution was proposed: offer an alternative PDF download that replicates the look of the website, using browser-based rendering instead of the OCG's LaTeX-based rendering.

A special page will be used to select which rendering to use.

The new PDF creator uses the Electron Service to render pages (using the Chromium web browser as a back end). When enabled on a wiki that already uses the OCG service, clicking "Download as PDF" in the side menu will display a choice of which service to use. The Electron Service was enabled by default on Meta and German Wikipedia last week, and is planned to be deployed to more wikis later.

A community consultation is open on MediaWiki.org regarding the future of PDF rendering. It is proposed to retire the OCG by August this year, once "core" OCG features are available with the Electron Service. One such feature is the book creator, which collates multiple articles into a single PDF via the Collection extension. However, there are no plans to provide a two-column option, nor any plans to support conversion to plain-text or other file formats. E

Backing up Wikimedia

Concerns were raised earlier this week on the wikimedia-l mailing list about the "back-up plan" for Wikimedia.

The most well-known backups are the data dumps of MediaWiki content. Operations Engineer Ariel Glenn, who focuses on the dumps, doesn't consider them to be a form of backup, though: the dumps contain only public data that is viewable by all, and they run only twice a month.

Glenn further explained that the dumps are currently stored on two servers in the Virginia datacenter, and the most recent ones are also on a third server. They are also mirrored by other organizations, placing copies in California, Illinois, Sweden, and Brazil.

Glenn noted that there are no dumps of images currently. Operations Engineer Filippo Giunchedi said, "We're looking at 120 terabytes of original [files] today." Giunchedi added that files are stored in both the Virginia datacenter and one of the Texas datacenters, so there is some redundancy.

The databases themselves have a high level of redundancy, according to Database Administrator Jaime Crespo. The servers use RAID10, and there are about 20 active database replicas across the Virginia and Texas datacenters with the same content, any of which can be cloned if one server goes down. For cases of accidental data loss, each datacenter has one server whose replica is delayed by 24 hours.
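To illustrate why a delayed replica helps with accidental data loss, here is a minimal conceptual sketch. It assumes a hypothetical list of timestamped changes and simply withholds anything newer than the delay; it is not how the Wikimedia database servers implement replication.

```python
from datetime import datetime, timedelta

# The 24-hour delay comes from the article; everything else is illustrative.
DELAY = timedelta(hours=24)


def applied_on_delayed_replica(changes, now, delay=DELAY):
    """Return the changes a delayed replica would already have applied.

    `changes` is a list of (timestamp, description) pairs. A change is
    applied only once it is at least `delay` old, so a mistake noticed
    within that window has not yet reached the delayed replica.
    """
    return [desc for ts, desc in changes if ts <= now - delay]


# Made-up example data.
now = datetime(2017, 2, 6, 12, 0)
changes = [
    (datetime(2017, 2, 4, 9, 0), "normal edit"),
    (datetime(2017, 2, 6, 11, 0), "accidental table drop"),
]
print(applied_on_delayed_replica(changes, now))
```

In this sketch the accidental change from an hour ago is not yet present on the delayed copy, which is what leaves time to recover the lost data from it.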

As for actual backups, Wikimedia uses Bacula as its backup software.

"As far as content goes, we do perform weekly database dumps and store them in an encrypted format in order to provide a pretty good guarantee we will avoid data leak issues via the backups," Operations Engineer Alexandros Kosiaris said. "We've had no such issues yet, but better safe than sorry."

The backups are stored in the Virginia and Texas datacenters, and are deleted after about 45–50 days for privacy policy compliance, Kosiaris explained.
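The retention rule described above can be illustrated with a minimal sketch. It assumes a hypothetical mapping of backup names to creation dates and a 45-day cutoff (the lower end of the quoted range); it is not a Bacula configuration and does not reflect how Wikimedia's jobs are actually set up.

```python
from datetime import date, timedelta

# Hypothetical retention window; the article only says "about 45-50 days".
RETENTION_DAYS = 45


def expired_backups(backups, today, retention_days=RETENTION_DAYS):
    """Return the names of backups older than the retention window.

    `backups` maps a (hypothetical) backup name to its creation date.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(name for name, created in backups.items() if created < cutoff)


# Illustrative data only; these names and dates are made up.
example = {
    "weekly-2016-12-04": date(2016, 12, 4),
    "weekly-2017-01-15": date(2017, 1, 15),
    "weekly-2017-01-29": date(2017, 1, 29),
}
print(expired_backups(example, today=date(2017, 2, 6)))
```

Keeping the window as a parameter makes it easy to see how a shorter or longer retention period changes which copies would be removed.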

As for improvements, Glenn has been looking for new mirrors for the dumps. Crespo noted that work on selecting a location for a new Asia datacenter is in progress, including discussions with legal. L

Ten years of Twinkling

The popular Twinkle tool (available as a gadget in Special:Preferences) celebrated its tenth birthday on January 21. Originally started as the rollback script "Twinklefluff" by AzaToth, it now automates or simplifies a plethora of common maintenance tasks, including responding to vandalism, tagging articles, welcoming new users, and admin duties. It is likely that over the past decade, millions of edits have been made using Twinkle. Thank you to everyone who has made Twinkle possible; your efforts are very much appreciated! E

Discuss this story

A couple of clarifications; it was probably my fault for not expressing them clearly when I was asked. There are about 20 English Wikipedia core MediaWiki replicas (the number is not fixed; newer servers are continuously being added or upgraded and others decommissioned). There are around 130 core db servers in total for all projects serving wiki traffic, to maximize high availability and performance, and their topology can be seen at: https://dbtree.wikimedia.org/ Some auxiliary (non-core) servers are hidden for clarity. Should a meteorite hit the west coast of US, we could have all wiki projects running on the secondary datacenter in 30 minutes (?); we are trying to get faster and better there. https://blog.wikimedia.org/2016/04/11/wikimedia-failover-test/ https://www.mediawiki.org/wiki/Wikimedia_Engineering/2016-17_Q3_Goals#Technical_Operations

Also, there are 2 (not 1) db servers delayed 24 hours, one per main datacenter; one just happens to be temporarily (for a few weeks) under maintenance, and it is up but not "delayed" after hardware renewal (redundancy helps not only as a recovery method, but also for easier maintenance and less user impact).

In general, backups are something that one never stops working on: there is always room for faster backups, faster recovery, more backups, better verification, more redundancy, etc.

--jynus (talk) 14:55, 6 February 2017 (UTC)

PDF rendering

I'm really excited to learn that better PDF rendering is on the way -- this will be enormously helpful to many projects. I'm curious, will the rendering respect little customizations, e.g. whether one has chosen to show or hide the Table of Contents or collapsed text, or the sort order chosen in sortable tables? Also, is there any related progress on ODT or ePub output? -Pete Forsyth (talk) 21:41, 6 February 2017 (UTC)

It will render exactly as a printout of the page would look if you were an anonymous user (basically, it works just like "Print to PDF" in any modern OS's print dialog). There is no progress on ODT or ePub output (as a matter of fact, it could be argued that we will be further from such a solution, by choosing maintainable simplicity over unmaintainable complexity). —TheDJ (talk • contribs) 12:55, 7 February 2017 (UTC)
TheDJ, currently, "print to PDF" does respect whether or not the TOC is expanded. But if the browser engine doing the rendering is on the server side, will that still be the case? -Pete Forsyth (talk) 19:10, 7 February 2017 (UTC)
No, the rendering is server-side, so it has no idea about the context that your browser keeps. —TheDJ (talk • contribs) 20:07, 7 February 2017 (UTC)
OK. As I suspected...and unfortunate, but difficult to change, I'd imagine. Thanks for the clarification! -Pete Forsyth (talk) 20:08, 7 February 2017 (UTC)
Glad to see the ability to have proper tables in PDFs is now likely. Quite a lot of my work in recent years has been tables and lists, and it's been a real pain not to be able to render them. It means WP readers can't access them easily for study off the web. I have had to place the texts in my word processor and format them there for my private use. Apwoolrich (talk) 11:16, 8 February 2017 (UTC)
I too am looking forward to seeing the glorious PDF function restored to its glory! Can't wait for books to be a thing again! Headbomb {talk / contribs / physics / books} 17:17, 13 February 2017 (UTC)

















Wikipedia:Wikipedia Signpost/2017-02-06/Technology_report