The biggest challenge for Google with regard to semantic search is identifying and extracting entities and their attributes from data sources such as websites. This information is usually unstructured and not error-free. The current Knowledge Graph, Google's semantic hub, is largely based on the structured content from Wikidata and the semi-structured data from Wikipedia and other Wikimedia projects.
In this article in my series, I would like to take a closer look at the processing of data from semi-structured data sources such as Wikipedia.
I have already briefly discussed the processing of structured data here.
You can find a detailed collection of articles on the topic of Knowledge Graph, semantic SEO and entities in the associated article series.
Table of contents
1 Processing of semi-structured data
2 The processing of semi-structured data using Wikipedia as an example
3 How Google can use Wikipedia special pages
3.1 List & category pages for classification by entity types and classes
3.2 Special forwarding pages for the identification of synonyms
3.3 Definition pages for the recognition of multiple meanings
4 Databases based on Wikipedia: DBpedia & YAGO
5 Categorization of entities based on key attributes
6 Collecting Attributes with Wikipedia as a Starting Point
7 How is information about entities collected?
8 Information from Wikipedia in Featured Snippets and Knowledge Panel
9 Wikipedia as “Proof of Entity”
11 Wikipedia and Wikidata currently (still) the most important data sources
Processing of semi-structured data
Semi-structured data is information that is not explicitly marked up according to general markup standards such as RDF, schema.org, etc., but has an implicit structure. Structured data can usually be obtained from this implicit structure using workarounds.
Information can be extracted from semi-structured data sources using a template-based extractor. Such an extractor identifies content sections based on the recurring, identical structure of the posts and extracts information from them.
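A minimal sketch of such a template-based extractor might look like the following Python snippet. The infobox wikitext, the attribute names and the regular expression are illustrative assumptions, not the extractor Google actually uses:

```python
import re

# Minimal sketch of a template-based extractor: it relies on the recurring
# "| key = value" structure of a Wikipedia infobox rather than on explicit
# markup standards such as RDF or schema.org. The sample wikitext and the
# attribute names are purely illustrative.
sample_wikitext = """
{{Infobox person
| name        = Ada Lovelace
| birth_date  = 10 December 1815
| occupation  = Mathematician
}}
"""

def extract_infobox_attributes(wikitext: str) -> dict:
    """Turn the implicit 'key = value' structure into attribute-value pairs."""
    pattern = re.compile(r"^\|\s*(?P<key>[\w ]+?)\s*=\s*(?P<value>.+?)\s*$", re.MULTILINE)
    return {m.group("key").strip(): m.group("value").strip()
            for m in pattern.finditer(wikitext)}

print(extract_infobox_attributes(sample_wikitext))
# {'name': 'Ada Lovelace', 'birth_date': '10 December 1815', 'occupation': 'Mathematician'}
```

The point of the sketch is that the recurring "| key = value" pattern, although not an explicit markup standard, is regular enough to be turned into structured attribute-value pairs.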
The processing of semi-structured data using Wikipedia as an example
Wikipedia and comparable sources are very attractive sources of information because every article follows a similar structure and is constantly reviewed by editors, the Wikipedians. In addition, Wikipedia is based on the MediaWiki CMS. This means that the content carries rudimentary markup and can easily be downloaded via XML or SQL dumps or as HTML. This content can therefore be referred to as semi-structured data.
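To illustrate how easily the raw wikitext can be obtained, the following sketch pulls a single article via the MediaWiki Action API; the page title and the User-Agent string are arbitrary examples, and for bulk processing the XML or SQL dumps would be the more typical source:

```python
import json
import urllib.parse
import urllib.request

# Illustrative sketch: fetch the raw wikitext of one article via the
# MediaWiki Action API. The same content is also available in bulk as
# XML/SQL dumps or as rendered HTML.
API_URL = "https://en.wikipedia.org/w/api.php"
params = urllib.parse.urlencode({
    "action": "query",
    "prop": "revisions",
    "rvprop": "content",
    "rvslots": "main",
    "titles": "Ada Lovelace",   # arbitrary example page
    "format": "json",
    "formatversion": "2",
})

request = urllib.request.Request(
    f"{API_URL}?{params}",
    headers={"User-Agent": "example-extractor/0.1"},  # illustrative UA string
)
with urllib.request.urlopen(request) as response:
    data = json.load(response)

wikitext = data["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]
print(wikitext[:300])  # first few hundred characters of semi-structured markup
```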
The structure of a typical Wikipedia article serves as a template for classifying entities by category, identifying attributes and extracting information for featured snippets and knowledge panels. The very similar or identical structure of the individual Wikipedia articles in e.g.