This is the transcript of the talk "ShExStatements: Simplifying Shape Expressions for Wikidata", prepared for WikiWorkshop 2021 on 14th April 2021.
Good day, good afternoon, or good evening to all, depending on your time zone.
First and foremost, I am thankful to the organizers for allowing me to present this talk at WikiWorkshop 2021. I am John Samuel, and my presentation is about ShExStatements: Simplifying Shape Expressions for Wikidata.
In the next slide, I present the main motivation behind this work.
Thanks to the open and collaborative nature of Wikidata, new entities are regularly created, and they need to be validated. WikiProjects play a significant role in guiding contributors and newcomers towards the various possible ways of describing entities from many domains. But how can errors, like the wrong use of properties, cardinalities, or datatypes, be identified? Wikidata property constraints can identify some of them. The recently introduced ShEx (Shape Expressions) for describing entity schemas can be used to identify many more complex errors. However, at the time of speaking, there are fewer than 300 entity schemas for more than 90 million Wikidata items.
This work, started during Wiki Techstorm 2019 in Amsterdam, aimed to reduce the complexity of writing entity schemas or shape expressions. The first and major question was "Is it possible to generate shape expressions from simple CSV statements or files?". Secondly, the solution must take into consideration the work done by numerous WikiProjects. Thirdly, it should be multilingual and help speakers of many languages.
ShExStatements, inspired by QuickStatements, was thus developed. It supports a simple tabular syntax with five columns. A ShExStatements file has two parts: the first part for specifying any prefixes and the second part for describing the shape of entities. In the second part, contributors can specify node names, properties, allowed values, cardinalities, and comments. A demonstration is available at shexstatements.toolforge.org. In the next slide, we will see an example of a TV series, written using ShExStatements.
On Wikidata, a TV series is an instance of Q5398426. It should have zero or more genres, one or more countries of origin, one or more directors, and one or more screenwriters. Vertical bars are used to separate the column values.
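The slide itself is not reproduced in this transcript. As a rough sketch, assuming the statements look like the lines below, the following Python snippet splits each line on vertical bars into the five columns. The lines are reconstructed from the description above, so the node names @tvseries and @genre, the prefixes wd and wdt, and the property IDs P31, P136, P495, P57, and P58 are assumptions. The helper parse_statement and the Statement type are hypothetical names used only for illustration; they are not part of ShExStatements.

```python
from typing import List, NamedTuple


class Statement(NamedTuple):
    """One line of the second (shape) part: five columns."""
    node: str         # node name, e.g. "@tvseries"
    prop: str         # property, e.g. "wdt:P31"
    values: str       # allowed values, a node reference, or "." (assumed: any value)
    cardinality: str  # "*", "+", ... ; empty is assumed to mean exactly one
    comment: str      # free-text comment, may be empty


def parse_statement(line: str) -> Statement:
    """Split one statement on vertical bars and pad missing trailing columns."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) > 5:
        raise ValueError(f"expected at most 5 columns, got {len(parts)}: {line!r}")
    parts += [""] * (5 - len(parts))
    return Statement(*parts)


# Assumed reconstruction of the TV series example described in the talk.
EXAMPLE = """\
@tvseries|wdt:P31|wd:Q5398426
@tvseries|wdt:P136|@genre|*
@tvseries|wdt:P495|.|+
@tvseries|wdt:P57|.|+
@tvseries|wdt:P58|.|+
"""

statements: List[Statement] = [
    parse_statement(line) for line in EXAMPLE.splitlines() if line.strip()
]

for statement in statements:
    print(statement)
```

Padding the missing columns keeps short lines, such as the first one with no cardinality or comment, in the same five-column form as the others.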
As you can see in this example, we can have multiple node names. And as shown in the last line, a genre is an instance of either Q201658 or Q15961987. This last line is an interesting example of how different separators are used: vertical bars separate the columns, while a comma separates multiple values within a column.
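To illustrate the two separators, here is a minimal sketch, under stated assumptions, of how a line with comma-separated values could be turned into a ShEx-style constraint. It is not the actual ShExStatements implementation, and the output layout only approximates the ShEx compact syntax. The function names triple_constraint and render, the node names, and the property IDs are hypothetical choices made for this illustration.

```python
from typing import Dict, List, Tuple


def triple_constraint(line: str) -> Tuple[str, str]:
    """Return (node name, ShEx-style constraint) for one statement line."""
    # Vertical bars separate the columns; only the first four are used here.
    columns = [column.strip() for column in line.split("|")]
    columns += [""] * (4 - len(columns))
    node, prop, values, cardinality = columns[:4]

    # A comma separates multiple allowed values inside the third column.
    allowed = [value.strip() for value in values.split(",") if value.strip()]

    if not allowed or allowed == ["."]:
        value_expr = "."                            # assumed: any value
    elif len(allowed) == 1 and allowed[0].startswith("@"):
        value_expr = "@<" + allowed[0][1:] + ">"    # reference to another node
    else:
        value_expr = "[ " + " ".join(allowed) + " ]"  # set of allowed values

    return node.lstrip("@"), f"{prop} {value_expr}{cardinality}"


def render(shapes: Dict[str, List[str]]) -> str:
    """Group the constraints of each node into one shape block."""
    blocks = []
    for node, constraints in shapes.items():
        body = " ;\n".join("  " + constraint for constraint in constraints)
        blocks.append(f"<{node}> {{\n{body}\n}}")
    return "\n".join(blocks)


# Assumed lines: a TV series points to a genre node, and a genre is an
# instance of either Q201658 or Q15961987.
LINES = [
    "@tvseries|wdt:P136|@genre|*",
    "@genre|wdt:P31|wd:Q201658,wd:Q15961987",
]

shapes: Dict[str, List[str]] = {}
for line in LINES:
    node, constraint = triple_constraint(line)
    shapes.setdefault(node, []).append(constraint)

shex_text = render(shapes)
print(shex_text)
```

Splitting first on vertical bars and only then on commas keeps the two separators independent, so a list of values never changes the number of columns.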
Thank you once again for this opportunity. If you have any questions or remarks, or if you want to point me to other relevant work, please do not hesitate to contact me.