The Semantic Web Comes of Age

How do you describe a business? What about a person, or an intellectual work? There's an interesting little secret that people in IT likely know, but that doesn't always get to the C-Suite. Programming, at its core, is all about creating models. Sometimes those models are of classes of things, sometimes they better describe processes, but it is rare for a piece of software in your organization not to deal with at least some of the few dozen types of things that are critical to the business.

In large enterprises, it's not at all uncommon for the organization to go through a form of fire drill known as "creating the enterprise data model" (in TLA-speak, "EDM"). This particular ritual is initiated by business analysts who talk in hushed tones about data dictionaries, cardinality rules, associations and constraints. There are almost always diagrams, typically entity relationship diagrams, with lots of boxes and ovals and arrows, all neatly tied up in heated debates about whether UML 1 or UML 2 rules apply and whether JSON or XML schema is the better denormalized form for handling streaming. Blood has been known to be drawn in these encounters. The end result of this, almost invariably, is a big, complex document called a schema, which is then placed in a folder on SharePoint while the programmers merrily ignore everything in it, until they get upset when their applications don't work with the ones across the hallway and realize that they needed to figure out interoperability.

Creating such schemas is often time-consuming, and the process opens up the potential for different groups within an organization to seek solutions that are optimized for their own requirements, even if they are inconvenient for others. For people who work with such data dictionaries - business analysts, taxonomists and ontologists - this struggle has long been an inevitable part of defining an organization's data language, but that doesn't mean it has been an enjoyable part.

Yet one thing emerges for those who deal with the language of data - when you get right down to it, there is actually a pretty minimal subset of things that matter to a business or organization, and these can be modeled in the same way. While there are always variations and additions, most organizations have common structures - divisions, employees, facilities, customers and so forth. While address forms may vary somewhat, most of the structure tends to be uniform. Even paperwork - invoices, purchase orders, calendar events, etc. - has a lot of commonality.

Various organizations have built schemas around these, but because this information is so frequently accessed, Google, Microsoft and Yahoo (later joined by Russian search engine Yandex) came together in 2011 to establish a website called schema.org. Initially, its purpose was to provide a home for various public schemas and microtagging languages, but starting in 2014, its focus shifted to consolidating these languages and creating a set of, well, schemas, that organizations could use to describe their own business languages. This was not the first such effort - in the mid-1990s, a set of "tags" called Dublin Core (after Dublin, Ohio, where much of this work was done) was created under the auspices of NCSA. With the rise of both XML and the Semantic Web, Dublin Core, which focused primarily on publishing information, was refactored as the Dublin Core Information Model (or DCIM). Not surprisingly, schema.org proceeded to slurp most of the DCIM terms into its own specification as well.

Today, there are nearly six hundred distinct types, in areas as diverse as

  • Creative works: CreativeWork, Book, Movie, MusicRecording, Recipe, TVSeries ...
  • Embedded non-text objects: AudioObject, ImageObject, VideoObject
  • Event
  • Health and medical types: MedicalEntity and the types defined under it
  • Organization
  • Person
  • Place, LocalBusiness, Restaurant ...
  • Product, Offer, AggregateOffer
  • Review, AggregateRating
  • Action

These can get into surprising detail - the Organization set itself includes sixty-one distinct properties (some strings of text or numbers, some other object types), and organizations in turn are referenced by dozens of other resource types.

Beyond this core, the automotive industry, health and life sciences, bibliographic information (where most of Dublin Core now resides) and the Internet of Things all have their own industry extension vocabularies, while insurance, financial services and aerospace players are currently looking at adding extensions of their own.

This may sound somewhat geeky, yet another fascinating arena of technology that nonetheless may seem to have little value to businesses, save for one huge aspect - online search. In 2017, both Google and Bing (Microsoft's search engine) announced that they would be supporting the use of embedded smart snippets in web content. A smart snippet is a bit of JSON (a common web standard for data interchange) that uses schema.org terms to identify what a web page contains. Google (and likely all other major search engines) would read the snippet and create a much more comprehensive record about that page than is done now for SEO searching. Smart snippets would have greater weight in search algorithms, and because such snippets could in fact be fairly complex, it would be possible to describe individual resources within these snippets in machine-readable ways.
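To make the idea concrete, here is a minimal sketch in Python of what producing such a snippet might look like. It assumes the common convention of embedding JSON-LD in a script tag of type application/ld+json; the helper name to_snippet and the organization details are placeholders invented for illustration.

```python
import json

def to_snippet(data: dict) -> str:
    """Wrap a JSON-LD dictionary in the script tag used to embed it in an HTML page."""
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'

# Hypothetical page data: the organization name and URL are placeholders.
page = {
    "@context": "https://schema.org",  # says the terms below come from schema.org
    "@type": "Organization",           # the kind of thing being described
    "name": "Example Widgets Inc.",
    "url": "https://www.example.com",
}

snippet = to_snippet(page)
print(snippet)
```

The resulting text would be pasted into (or generated as part of) the page's HTML, where a crawler can pick it up without having to interpret the visible content.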

So, suppose that you have a catalog of products - say books. Normally a search engine scanning a web page for a particular book will use heuristic algorithms to get an idea of what the page is about. However, unless you have an army of SEO experts, chances are pretty good that the heuristic is pretty basic, and will typically rely upon checks of keyword presence and positioning that can be done as quickly as possible (CPU cycles cost money). Most of this is focused on the web page itself, not on the things its content describes.

When a smart snippet is encountered, on the other hand, not only can the search engine get a much better idea of what the web page is about, but it is now able (if the snippets are set up properly) to actually describe the things themselves. When Google reads that snippet, it actually creates a record about a particular book as a book, not simply as content on a web page. If you are looking for books that are in the urban fantasy genre, feature a female doctor protagonist, are typically a two-hour read, and are priced under $3, Google will be able to bring up this particular book. Not only that, but because the book has a unique identifier, different reviewers can weigh in (potentially from multiple platforms) and these reviews can then be linked to this identifier. Other applications can also read this same page and get this same information, and from it add other links that can be picked up by Google or other web applications.
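The kind of query described above can be sketched in a few lines of Python. This is only an illustration, not how any search engine works internally: the two catalog entries, their identifiers and their prices are made up, and the filter covers just three of the criteria (genre, reading time and price). The property names (genre, timeRequired, offers, price) are schema.org terms; the functions hours and matches are hypothetical helpers.

```python
import re

# Hypothetical catalog entries; titles, identifiers and prices are invented.
books = [
    {
        "@context": "https://schema.org",
        "@type": "Book",
        "@id": "https://example.com/books/night-shift",
        "name": "Night Shift at St. Jude's",
        "genre": "Urban Fantasy",
        "timeRequired": "PT2H",  # ISO 8601 duration: two hours
        "offers": {"@type": "Offer", "price": "2.99", "priceCurrency": "USD"},
    },
    {
        "@context": "https://schema.org",
        "@type": "Book",
        "@id": "https://example.com/books/long-road",
        "name": "The Long Road Under",
        "genre": "Urban Fantasy",
        "timeRequired": "PT6H",
        "offers": {"@type": "Offer", "price": "9.99", "priceCurrency": "USD"},
    },
]

def hours(duration: str) -> float:
    """Convert a simple ISO 8601 duration such as 'PT2H' or 'PT1H30M' to hours."""
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", duration)
    if m is None:
        raise ValueError(f"unsupported duration: {duration}")
    return int(m.group(1) or 0) + int(m.group(2) or 0) / 60

def matches(book: dict, genre: str, max_hours: float, max_price: float) -> bool:
    """True when the book is in the genre, short enough and cheap enough."""
    return (
        book.get("genre") == genre
        and hours(book["timeRequired"]) <= max_hours
        and float(book["offers"]["price"]) <= max_price
    )

hits = [b["name"] for b in books if matches(b, "Urban Fantasy", 2, 3.00)]
print(hits)
```

Because every record carries typed, named properties, the filter never has to guess what the page text means; it only compares values.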

This is called a knowledge graph, and it fulfills one of the basic visions of the Semantic Web as Web creator Tim Berners-Lee first described it publicly in a 2001 Scientific American article. This vision relies upon a common language for describing essential things ... the same language that is now emerging from schema.org. Smart snippets use JSON-LD as a carrier, with schema.org terms and relationships describing not only things but how they relate to one another. The LD here stands for Linked Data, which can be thought of as a way of attaching contextual information to resources so that they can be described, and connected to one another, in the virtual world.
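The linking step can be illustrated with a toy example in Python. It assumes two hypothetical sources, a publisher and a review site, that refer to the same book through a shared @id. The URLs are placeholders, and build_graph is an invented helper rather than part of any JSON-LD library; real systems would use a proper JSON-LD processor.

```python
from collections import defaultdict

# Two hypothetical sources describing the same book, identified by a shared @id.
publisher_data = {
    "@id": "https://example.com/books/1",
    "@type": "Book",
    "name": "Example Title",
}
review_data = {
    "@id": "https://reviews.example.org/r/42",
    "@type": "Review",
    "itemReviewed": {"@id": "https://example.com/books/1"},  # link by identifier
    "reviewRating": {"@type": "Rating", "ratingValue": 4},
}

def build_graph(*nodes):
    """Index nodes by @id and record which reviews point at which item."""
    by_id = {}
    reviews_of = defaultdict(list)
    for node in nodes:
        # Statements about the same @id are merged into one record.
        by_id.setdefault(node["@id"], {}).update(node)
        if node.get("@type") == "Review":
            reviews_of[node["itemReviewed"]["@id"]].append(node["@id"])
    return by_id, dict(reviews_of)

by_id, reviews_of = build_graph(publisher_data, review_data)
book_id = "https://example.com/books/1"
print(by_id[book_id]["name"], "is reviewed by", reviews_of[book_id])
```

Neither source needs to know about the other; the shared identifier is what lets the two descriptions join into one graph.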

In a business context, this same principle can be applied to both create and manage corporate knowledge bases. If you are a manufacturer, your catalog of goods also becomes a database. A potential customer (either consumer or business) could read the data for that book directly from the catalog page, rather than having to go through the complex process of setting up a data feed or a repository to extract content. From that JSON-LD the customer can determine the retail price, the wholesale price and relevant shipping information, and use that to place fifty orders for that book through other channels. A blogging or news site can generate JSON-LD smart snippets that describe specific events, and if those events happen to include a reference to your company or CEO, that relationship can be extracted (along with all other relevant information) and made available to your sales people to capitalize on those events, or to your data analysts to help assess the impact of this news on your company. In effect, it makes it far easier for your own organization to create a dedicated mini-Google for retrieving relevant news, as well as making it far easier for other applications and search engines to identify metadata from your own press channels.
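Reading a catalog page as data might look roughly like the following Python sketch, which uses only the standard library. The HTML is a hypothetical stand-in for a seller's page (in practice it would be fetched over the network), the class name JsonLdExtractor is invented for illustration, and only the retail price is shown; wholesale and shipping details would be read the same way if the publisher includes them.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every script block of type application/ld+json."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False

# Hypothetical catalog page with an embedded snippet.
html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Book", "name": "Example Title",
 "offers": {"@type": "Offer", "price": "2.99", "priceCurrency": "USD"}}
</script>
</head><body><h1>Example Title</h1></body></html>
"""

parser = JsonLdExtractor()
parser.feed(html)
book = parser.items[0]
unit_price = float(book["offers"]["price"])
order_total = round(50 * unit_price, 2)  # cost of fifty copies at retail
print(book["name"], order_total, book["offers"]["priceCurrency"])
```

The point of the sketch is that no custom feed or scraper tuned to the page layout is needed; the page itself carries the structured record.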

Similarly, embedding smart snippets into your publishing or digital asset management systems provides a way to future-proof classification. Even if you don't currently have a way of doing anything with smart snippets, the ability to add them (either through manual production or entity extraction programs) ensures that you can categorize your content so that it is not only consistent with internal standards, but also globally accessible when this media is published. Indeed, because schema.org also includes a number of rights management features and terminology, this metadata can be used to better ensure that content goes only to the intended markets, and that inappropriate or unavailable content is not inadvertently distributed, preventing liability headaches.

Communication can only take place when a common language exists. Schema.org may very well become that common language, and as such schema.org, linked data and JSON-LD should absolutely be on the watchlist for your digital transformation strategies.

