When I started blogging, one of my goals was to make the articles machine-readable. Years have passed, and I am still far from achieving this completely. No doubt, there are technical limitations to achieving full machine readability, but a certain amount of it is still within reach. There are several advantages to using structured data, the major one being the findability of relevant information. I have been blogging for quite a long time, and the number of articles and notes that I have been documenting increases every month. A simple keyword search sometimes does not give me the information that I am looking for.
I have been using a limited amount of structured data in the form of RDFa [1,2] since the beginning for the different sections of an article, with types like WebPage, BreadcrumbList, ListItem, etc. from Schema.org [3]. However, this was quite limited. More data can be added: for example, the author name, the dates of creation, publication and last modification, the title of the article, etc. The following information can be easily obtained for any published article and does not require a lot of effort.
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://johnsamuel.info"
  },
  "articleSection": "blog",
  "name": "Integrating Linked Data",
  "headline": "Integrating Linked Data",
  "description": "Article by John Samuel",
  "inLanguage": "en",
  "author": "John Samuel",
  "datePublished": "2020-05-03 19:04:28",
  "dateModified": "2020-05-03 19:04:28",
  "dateCreated": "2020-05-03 19:04:28",
  "url": "https://johnsamuel.info/en/programming/linkeddata-integration.html",
  "keywords": ["Blog"]
}
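A record like the one above can be produced with a few lines of Python. The following is only a minimal sketch, not the script used for this blog: the function name `build_blog_posting` and its parameters are assumptions made for illustration. Unlike the example above, the sketch writes the dates in ISO 8601 form (with a `T` between date and time), which is the form Schema.org describes for its date properties.

```python
import json
from datetime import datetime


def build_blog_posting(title, url, author, created, modified, published=None,
                       site_url="https://johnsamuel.info", language="en",
                       section="blog", keywords=("Blog",)):
    """Return a Schema.org BlogPosting description as a plain dict.

    Hypothetical helper: the name and the parameters are illustrative only.
    """
    # Assume the article was published when it was created unless told otherwise.
    published = published or created
    return {
        "@context": "http://schema.org",
        "@type": "BlogPosting",
        "mainEntityOfPage": {"@type": "WebPage", "@id": site_url},
        "articleSection": section,
        "name": title,
        "headline": title,
        "description": f"Article by {author}",
        "inLanguage": language,
        "author": author,
        # isoformat() yields e.g. "2020-05-03T19:04:28" (ISO 8601).
        "datePublished": published.isoformat(timespec="seconds"),
        "dateModified": modified.isoformat(timespec="seconds"),
        "dateCreated": created.isoformat(timespec="seconds"),
        "url": url,
        "keywords": list(keywords),
    }


posting = build_blog_posting(
    title="Integrating Linked Data",
    url="https://johnsamuel.info/en/programming/linkeddata-integration.html",
    author="John Samuel",
    created=datetime(2020, 5, 3, 19, 4, 28),
    modified=datetime(2020, 5, 3, 19, 4, 28),
)
print(json.dumps(posting, indent=2))
```

Keeping the record as an ordinary dict until the last moment means that `json.dumps` takes care of quoting and escaping, which is a large part of why JSON-LD is easy to generate.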
As you may have observed, I am using JSON-LD [4,5] (JSON for Linked Data) for this purpose, mainly because this code is easy to generate. The above information can be easily embedded using the following script tag.
<script type="application/ld+json">
....
</script>
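As a rough illustration of the embedding step, the sketch below wraps a dict in such a script element and places it just before the closing `</head>` tag. The function names `to_script_tag` and `embed_json_ld` are hypothetical, and the use of plain string handling instead of an HTML parser is an assumption made to keep the example self-contained.

```python
import json


def to_script_tag(data):
    """Serialise a dict as a JSON-LD <script> element (hypothetical helper)."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so that a string value containing "</script>" cannot
    # close the element early; "\/" is a valid JSON escape for "/".
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


def embed_json_ld(html, data):
    """Insert the JSON-LD element just before the first </head> tag.

    The search is a case-sensitive string match, so this assumes the
    closing tag is written in lowercase.
    """
    head_end = html.find("</head>")
    if head_end == -1:
        raise ValueError("document has no </head> tag")
    return html[:head_end] + to_script_tag(data) + "\n" + html[head_end:]


page = "<html><head><title>Integrating Linked Data</title></head><body></body></html>"
print(embed_json_ld(page, {"@context": "http://schema.org", "@type": "BlogPosting"}))
```

For real pages, an HTML parser would be the more robust choice; the string search is only meant to show where the element goes.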
From a programming point of view, the main challenge was to represent this information correctly. Example snippets can easily be found on Schema.org [3]. Other authors have previously documented their choice of properties [6]. As I am using a version control system, I can easily obtain the creation date and the last modification date of the article. With an HTML parser like BeautifulSoup, I obtain the title of the article. I used the following Python libraries for generating and supporting JSON-LD on this blog, as in the above example for a given blog posting.
- extruct: for extracting metadata (RDFa, JSON-LD, Microdata) from a web page. It can also be used to extract and verify the newly added JSON-LD and RDFa information.
- argparse: for parsing command line arguments, especially to work with one or more files and to support options for extraction or addition of metadata.
- w3lib: for obtaining the base URL.
- pygit2: for obtaining the creation date and the last modification date of a blog post.
- bs4 (BeautifulSoup): for parsing an HTML file and obtaining information like the title of the article.
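To give an idea of how two of these pieces might fit together, here is a minimal sketch of a command line built with argparse, together with a way of reading the creation and last modification dates from the version history. It rests on several assumptions: the flag names `--extract` and `--add` are hypothetical, the dates are read by calling the `git` command-line client through `subprocess` as a stand-in for pygit2, and the sample log output is made up for the demonstration.

```python
import argparse
import subprocess


def build_parser():
    """Command line: one or more HTML files plus a mode (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Add or extract article metadata")
    parser.add_argument("files", nargs="+", help="HTML files to process")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--extract", action="store_true", help="print existing metadata")
    mode.add_argument("--add", action="store_true", help="add JSON-LD metadata")
    return parser


def dates_from_log(log_output):
    """Return (created, modified) from `git log --format=%aI` output.

    git log lists the most recent commit first, so the last line is taken
    as the creation date and the first line as the last modification date.
    """
    dates = [line.strip() for line in log_output.splitlines() if line.strip()]
    if not dates:
        raise ValueError("no commits found for this file")
    return dates[-1], dates[0]


def git_dates(path):
    """Ask the git client for the author dates of the commits touching path."""
    result = subprocess.run(
        ["git", "log", "--follow", "--format=%aI", "--", path],
        capture_output=True, text=True, check=True,
    )
    return dates_from_log(result.stdout)


# Parse an explicit argument list instead of sys.argv for the demonstration.
args = build_parser().parse_args(["--add", "index.html", "about.html"])
print(args.add, args.files)

# Made-up log output, newest commit first.
sample_log = "2020-05-10T09:30:00+02:00\n2020-05-03T19:04:28+02:00\n"
print(dates_from_log(sample_log))
```

Separating `dates_from_log` from the call to git keeps the date logic testable without a repository. Inside a working copy, `git_dates("some-article.html")` would return the two dates for that file.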
My next goal is to go beyond annotating the metadata of an article and to annotate its actual content. Considering the 5-star deployment scheme for Linked Data [7] proposed by Tim Berners-Lee, there is still a lot of work to do for this site. The articles contain unstructured data. Though it is written in HTML, the article content could be made machine-readable if annotated with open standards like RDFa and linked with other linked open data sources.