Data ingestion and indexing
When working with content objects, Purple Boost expects the following properties:
- id: (str) unique identifier that will be used to access or reference your content
- title: (html-formatted str) title of your content
- link_url: (str) url of the published content
- content: (html-formatted str) body of the content; for an article (the main use case), this should be an HTML string
- [OPTIONAL] seo_title: (html-formatted str) SEO optimized version of the title
- [OPTIONAL] description: (html-formatted str) this can be an overhead or summary introduction to the content
- [OPTIONAL] published: (date str) date of publishing
- [OPTIONAL] last_modified: (date str) date of last modification
- [OPTIONAL] author: (str) name of the content writer
- [OPTIONAL] type: (str) the kind of content, e.g. article, video, ... (default=article)
- [OPTIONAL] sub_type: (str) to further specify the type of your content, e.g. news, blog, ...
- [OPTIONAL] section: (str) main category used to organize the content on a published website
- [OPTIONAL] image_url: (str) url of a featured image or thumbnail for the content
- [OPTIONAL] source: (object {id: str, title: str, href: str}) origin of the content, in case multiple sources are used
- [OPTIONAL] categories: (list of str) a list of categories that the content could belong to
- [OPTIONAL] keywords: (list of str) a list of keywords that can characterize the content
- [OPTIONAL] topic: (list of str) a list of textual representations of the topics covered by the content
- [OPTIONAL] premium: (bool) used to tag content whose availability is restricted to registered readers
- [OPTIONAL] region: (str) localization of the content
- [OPTIONAL] reading_time: (float) estimated time of reading for an article, in minutes
Although many properties are optional, we encourage providing as many of them as possible, because many Purple Boost features work better with more information.
To manually exclude an article from being suggested as a link target, add the tag "acm_ignore" to it. When this tag is present in the categories, keywords, or topic list, the article can still be opened and edited with the Link Optimizer, but it will never be linked to from another article. To exclude articles only in certain cases, you can set up profiles that filter for specific categories, as described in Profile Management.
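As an illustration of the property list above, here is a minimal sketch of a content object written as a Python dictionary, together with a small check for the non-optional properties. The helper name `missing_required` and all sample values are hypothetical; this is not how Purple Boost validates content internally.

```python
from typing import Any

# The four properties that are not marked [OPTIONAL] in the list above.
REQUIRED_FIELDS = ("id", "title", "link_url", "content")


def missing_required(content: dict[str, Any]) -> list[str]:
    """Return the required properties that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not content.get(name)]


# Sample values are made up for illustration only.
article = {
    "id": "article-123",
    "title": "How to index content",
    "link_url": "https://example.com/articles/how-to-index-content",
    "content": "<p>Body of the article as an HTML string.</p>",
    # Optional properties
    "author": "Jane Doe",
    "type": "article",
    "section": "Guides",
    "keywords": ["indexing", "ingestion"],
    "premium": False,
    "reading_time": 3.5,
}

print(missing_required(article))
```

Running such a check before sending content makes it easy to spot records that lack one of the required properties.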
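The exclusion rule can be pictured with a short sketch. It assumes an exact, case-sensitive match on the tag, which the description above does not specify, and the function name `is_excluded_from_linking` is hypothetical.

```python
IGNORE_TAG = "acm_ignore"

# The three list properties in which the tag is looked up.
TAG_FIELDS = ("categories", "keywords", "topic")


def is_excluded_from_linking(content: dict) -> bool:
    """Return True if the tag appears in categories, keywords, or topic."""
    # Assumption: exact, case-sensitive match on the tag value.
    return any(IGNORE_TAG in (content.get(field) or []) for field in TAG_FIELDS)


tagged = {"id": "a1", "keywords": ["politics", "acm_ignore"]}
untagged = {"id": "a2", "categories": ["sports"]}

print(is_excluded_from_linking(tagged))
print(is_excluded_from_linking(untagged))
```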
Purple Boost offers two basic modes of data ingestion: PUSH and PULL:
- PUSH refers to a setup where the CMS actively notifies Purple Boost about changes in data, i.e. whenever a new article is published or an existing published article is modified or deleted.
- PULL refers to a setup where no such active mechanism is possible, and Purple Boost instead periodically fetches content from the CMS.
The simplest use-case is when a user is already using our publishing solution.
We currently support only the PULL mechanism, via the JSON or GraphQL API, which needs to be installed on the corresponding instance.
For the JSON API, enabling basic authentication additionally requires installing the following plugin: https://github.com/WP-API/Basic-Auth.
For the GraphQL API, the associated plugin needs to be installed: https://www.wpgraphql.com/
Since Purple DS HUB instances are based on WordPress, a similar setup should work on any WordPress instance.
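To make the relation between a WordPress post and the content properties listed earlier more concrete, the sketch below maps one post, as returned by the standard WordPress REST API posts endpoint, onto a content object. The input field names (`title.rendered`, `content.rendered`, `excerpt.rendered`, `link`, `date_gmt`, `modified_gmt`) are those of the WordPress REST API; the mapping itself, the function name `wp_post_to_content`, and the assumption that the date strings can be passed through unchanged are illustrative and may differ from what the actual connector does.

```python
from typing import Any


def wp_post_to_content(post: dict[str, Any]) -> dict[str, Any]:
    """Map a WordPress REST API post object onto a content object (illustrative)."""
    return {
        "id": str(post["id"]),  # WordPress ids are integers
        "title": post["title"]["rendered"],
        "link_url": post["link"],
        "content": post["content"]["rendered"],
        # Optional properties; passing the dates through as-is is an assumption.
        "description": post["excerpt"]["rendered"],
        "published": post["date_gmt"],
        "last_modified": post["modified_gmt"],
    }


# Made-up sample in the shape of a WordPress REST API post.
sample_post = {
    "id": 42,
    "link": "https://example.com/2024/01/hello-world/",
    "date_gmt": "2024-01-15T09:30:00",
    "modified_gmt": "2024-01-16T11:00:00",
    "title": {"rendered": "Hello world"},
    "content": {"rendered": "<p>Welcome to the site.</p>"},
    "excerpt": {"rendered": "<p>Welcome.</p>"},
}

print(wp_post_to_content(sample_post))
```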
Similarly to our WP connectors, a custom connector can be implemented specifically for a customer to periodically retrieve data from one or more sources.
There is a variety of possible sources: REST API, GraphQL API, access to data on some cloud storage platform (like S3 or GCS).
Different data formats can also be used, JSON and XML being the most frequent.
Most importantly, each imported piece of content must carry the minimum required metadata (see the list above) for our features to work. The optional properties can additionally be supplied to improve the performance of certain features.
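The core task of such a connector is to normalize records from different formats into the same content object. The sketch below shows the idea for a JSON record and an XML record; the source field names (`headline`, `url`, `body_html`) are invented for the example, since every customer source has its own schema.

```python
import json
import xml.etree.ElementTree as ET


def from_json(raw: str) -> dict:
    """Normalize a JSON record (hypothetical source schema) into a content object."""
    record = json.loads(raw)
    return {
        "id": str(record["id"]),
        "title": record["headline"],
        "link_url": record["url"],
        "content": record["body_html"],
    }


def from_xml(raw: str) -> dict:
    """Normalize an XML record (hypothetical source schema) into a content object."""
    item = ET.fromstring(raw)
    return {
        "id": item.findtext("id"),
        "title": item.findtext("headline"),
        "link_url": item.findtext("url"),
        # Escaped HTML inside the element is unescaped by the parser.
        "content": item.findtext("body_html"),
    }


json_record = (
    '{"id": 7, "headline": "Hello", "url": "https://example.com/7", '
    '"body_html": "<p>Hi</p>"}'
)
xml_record = (
    "<item><id>7</id><headline>Hello</headline>"
    "<url>https://example.com/7</url>"
    "<body_html>&lt;p&gt;Hi&lt;/p&gt;</body_html></item>"
)

print(from_json(json_record) == from_xml(xml_record))
```

Whatever the source format, the downstream features only ever see the normalized properties, which is why the required metadata matters more than the transport.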
The PULL methods described above require periodic re-indexing of the content. To enable continuous indexing of content, customers can use the Indexer API to push new content directly into our system. Content can also be updated and deleted in real time using the corresponding HTTP methods.
When using this mechanism, initial data can be imported in two ways:
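As a rough picture of what pushing content over HTTP can look like, the sketch below only builds the requests and does not send them. The base URL, the path layout, the bearer-token header, and the choice of PUT and DELETE are all placeholders chosen for illustration, not the documented Indexer API; the actual endpoints and methods are described in the Indexer API documentation.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Placeholders, NOT the real Indexer API: replace with the documented values.
INDEXER_BASE_URL = "https://indexer.example.com/contents"
API_TOKEN = "replace-me"


def build_request(
    method: str, content_id: str, content: Optional[dict] = None
) -> urllib.request.Request:
    """Build (but do not send) an HTTP request for one content object."""
    body = json.dumps(content).encode("utf-8") if content is not None else None
    safe_id = urllib.parse.quote(content_id, safe="")
    return urllib.request.Request(
        url=f"{INDEXER_BASE_URL}/{safe_id}",
        data=body,
        method=method,
        headers={
            "Authorization": f"Bearer {API_TOKEN}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


new_article = {
    "id": "article-123",
    "title": "How to index content",
    "link_url": "https://example.com/articles/how-to-index-content",
    "content": "<p>Body of the article.</p>",
}

upsert_request = build_request("PUT", new_article["id"], new_article)
delete_request = build_request("DELETE", new_article["id"])

# Sending would be done with urllib.request.urlopen(upsert_request).
print(upsert_request.get_method(), upsert_request.full_url)
```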
- Initiated by the customer using the bulk import API (preferred).
- Using PULL indexing from a data dump provided by the customer.
See the Indexer API section of the REST API Documentation for more information about importing data.