whether entities occur in a piece of content
whether there is a main entity that the content is about
which ontology or ontologies the main entity can be assigned to
to which author or entity the content is to be attributed
in what relationship the entities occurring in the content stand to each other
which properties or attributes are to be assigned to the entities
This is what it could look like:
Possible crawling and indexing in the entity era
Some Google patents also repeatedly talk about an entity database that exists alongside a search index. In relation to Google, this entity database is obviously the Knowledge Graph.
The Google patent "Entity database data aggregation" states:
an entity database storing an entity-relationship graph representing elements in the virtualization environment, wherein:
each of the elements is represented by an entity-type node in the entity-relationship graph,
relationships between the elements are represented by edges between the nodes, and
information regarding each of the entity-type nodes is accessible through a query interface.
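To make the quoted structure more concrete, here is a minimal Python sketch of an entity-relationship graph with typed nodes, edges and a simple query interface. The class names, the relation label "owns" and the example data are assumptions made for illustration; they are not taken from the patent or from Google's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element of the graph. Field names are hypothetical."""
    node_id: str
    entity_type: str  # e.g. "Organization", "Person", "Website"
    attributes: dict = field(default_factory=dict)


class EntityGraph:
    """Toy entity-relationship graph: typed nodes plus labeled edges."""

    def __init__(self):
        self.nodes = {}   # node_id -> Node
        self.edges = []   # (source_id, relation, target_id)

    def add_node(self, node_id, entity_type, **attributes):
        self.nodes[node_id] = Node(node_id, entity_type, attributes)

    def add_edge(self, source_id, relation, target_id):
        # Relationships are only allowed between nodes that already exist.
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append((source_id, relation, target_id))

    def query(self, node_id):
        """Return type, attributes and outgoing relations of one node."""
        node = self.nodes[node_id]
        outgoing = [(rel, dst) for src, rel, dst in self.edges if src == node_id]
        return {
            "type": node.entity_type,
            "attributes": node.attributes,
            "relations": outgoing,
        }


# Hypothetical example data.
graph = EntityGraph()
graph.add_node("acme", "Organization", name="Acme Inc.")
graph.add_node("acme.example", "Website", url="https://acme.example")
graph.add_edge("acme", "owns", "acme.example")
info = graph.query("acme")
```

Here `query` plays the role of the query interface from the quote: it returns everything the graph knows about one node, including its entity type and its relationships to other nodes.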
Google officially describes the Knowledge Graph in a very similar way. What is also interesting here is that the nodes represent entity types rather than the entities themselves.
The patent describes that the relationships between the entity types, the respective attributes, and historical statistics are used to select the entities for delivery in search results. Here is an illustration of what such an entity graph could look like:
I think the most exciting idea from Cindy's post is that various sub-entities such as websites, content or apps can be assigned to a main entity, e.g. a top-level entity such as a person or a company. This approach can also be found in the Google patent "Ranking nodes in a linked database based on node independence" from 2013. It states:
generate one or more clusters of affiliated nodes from the plurality of nodes,
where the affiliated nodes, of each cluster of affiliated nodes, are one or more of:
owned by a common entity, or
controlled by the common entity;
The following illustration from the patent makes it clearer what is meant:
clustering of entities
Elements 415 and 410 represent clusters of different nodes such as documents or websites. These clusters are formed based on the links between the nodes or when it is clear that the nodes are under the control of the same organization or entity.
In other words, ranking component 340 may determine that multiple nodes should be clustered when there is a high probability that all of the nodes are controlled by a single entity.
Decisive criteria for clustering the nodes can be authorship, graph structure, similarity of the content, and manually specified information such as metadata. In this way, elements such as individual posts and other content formats, domains or apps could be assigned to entities such as companies or people. WHOIS information would also be a possible signal, but it has become much harder to access since the GDPR.
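The clustering idea can be sketched in a few lines of Python. The example below groups nodes that share at least one ownership signal, using a simple union-find structure. The signal names, the example domains and the rule "one shared signal is enough" are assumptions for illustration only; the patent speaks of a high probability of common control and does not prescribe this method.

```python
from collections import defaultdict


def cluster_affiliated(nodes):
    """Group node ids that share at least one ownership signal.

    `nodes` maps a node id to a set of signals, e.g. an author name or a
    publisher id. The signals and the merge rule are hypothetical.
    """
    parent = {node_id: node_id for node_id in nodes}

    def find(x):
        # Follow parent links to the cluster root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # signal -> first node that carried it
    for node_id, signals in nodes.items():
        for signal in signals:
            if signal in first_seen:
                union(node_id, first_seen[signal])
            else:
                first_seen[signal] = node_id

    clusters = defaultdict(set)
    for node_id in nodes:
        clusters[find(node_id)].add(node_id)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical nodes with hypothetical ownership signals.
nodes = {
    "acme.example": {"publisher:acme"},
    "blog.acme.example": {"publisher:acme", "author:jane"},
    "jane.example": {"author:jane"},
    "other.example": {"publisher:other"},
}
clusters = cluster_affiliated(nodes)
```

Note that the merge is transitive: jane.example ends up in the same cluster as acme.example only because both share a signal with blog.acme.example. A real system would more likely weight the signals and require a probability threshold before merging.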