Data preparation: what you need to know to get the best out of Crystal

crystal is here to help you process your data and get insights out of it to make successful business decisions.

In order to do so, she needs a data source from which to retrieve the information wanted and, to get the best out of her performance, this data needs to be prepared according to certain guidelines.

The higher the data and structure quality is, the better results and impact crystal can bring.

We recommend you to follow our advice to get the best out of your crystal experience.

But before we dive into specific details, let’s see (or review) some basic concepts as an introduction to the perfect crystal experience.

In this article we will see:

  1. crystal foundations: entities, aggregation, aliases, objectives and topics

  2. Data preparation outside crystal

  3. Data manipulation inside crystal

1. crystal foundations: entities, aggregation, aliases, objectives and topics

Entities

Entities are the core of crystal’s functioning: they are essentially the content that you allow the members to explore through the advisor conversation.

There are two types of entities: the main entities and the filter entities.

A main entity is the subject the insight focuses on, according to the objective you set for the analysis. For example, if the question for the advisor would be: “Can you show me total of sales?”, the main focus of this request would be “sales”.

A filter entity is the dimension that you can associate with the main entity to drill down, filter, or deepen your analysis. For example, for our question “Show me total of sales”, a filter entity would be the value of the "country" entity, like “Show me total of sales for Germany”, or a year, like “Show me total of sales in 2019”.

A filter entity should not be confused with a filter value: for example, in our case, a filter entity might be “Country”, which would retrieve the sales for a single country, while a filter value is a specific item in the entity’s data column. In our case, “Germany” would be a specific value for the entity “Country”.

Here’s another example:

“Show me the revenues (main entity) for Italy (filter value) in 2019 (filter value)”.

Aggregation

The aggregation in crystal is based on SQL aggregation functions and it represents how the different values of a main entity can be put together in a specific topic, meaning that the same entity can be aggregated in different ways.

For example, if we have the entity “Sales”, we can either sum the different values of the entity, count them or calculate the average.

If we want to show the user how much money we made, then we would use Sum, if we want to count how many sales transactions we had, we would go for Count and if we want to represent the average sale we would use Average.

Aliases

Aliases are very important for crystal, because they enable the flexible conversational experience. The same entity can be in fact associated with multiple words: “sales” can be also called “revenues”, “turnover”, and so on. Another example: if the entity is “items” words like “units” or “articles” might also be used to refer to it.

These synonyms associated with the same entity are called aliases and they can be used in the conversation to refer to the same entity.

Objectives

An objective is essentially the main purpose of the topic. Each objective can be represented through one or more visualizations that will be chosen by the admins during the phase of topic creation. Each objective has a number of consistent visualizations, that can be charts or other formats (such as ranking tables and lists).

For example, the comparison objective shows the similarity or dissimilarity between different items or facts, like in the question: “Can you compare items sold versus budget by country?” The over time objective, instead, shows how data changed over a specific time period, as in: “Can you show me the sales over time by month for the last year?"

Topics

A topic is an insight on your data that you can get from crystal. It is the combination of one or multiple main entities that can be filtered for one or more filter entities that fulfill an objective and that are surfaced to the user through a visual or textual representation.

How to ask questions to crystal

crystal’s conversation is driven mainly by the following factors:

  • Main entity

  • Objective

  • Filter entity

A good rule of thumb is the following:

  • Show me the objective for the main entity in filter value.

    • For example: “Show me the trend for sales in 2021”.

  • Show me the main entity objective in filter value.

    • For example: “Show me sales over time in Germany”.

You can find all the options and guidelines to follow for the conversation here.

Consequently, it is very important to prepare the data to create topics that have a natural way of requesting them.

Each objective has one or multiple visualizations that best fits the need. Find more details on the visualization here.

Data formatting

In order to display your data correctly, crystal will follow these data formatting conventions and rules.

Supported formats

The data must be formatted with a certain logic to be properly consumed by crystal. Check out this article to know more.

Strings

crystal works with Text fields. It supports most of the commonly used categorical types. Here you can find the complete list of supported data sources. Depending on the connected data source, you could need to transform some of your dataset columns.

For example

Scenario

Query

VARCHAR type

CAST ( column AS varchar [ ( 120 ) ] )

Mind that this is just an example and it needs to be adapted, based on the exact case and technology used.

Special characters

We often encounter qualitative fields, such as product description, address and others, containing special characters that are not properly encoded as Unicode.

To clean these data types, it is more convenient to use the regular function expression.

How to handle

Scenario

Query

publicitÖÖÖÖ

regexp_replace('publicitÖÖÖÖ','[Ö�]+|√†!','à','g');

Empty values

crystal expects empty values to be represented as NULL.

How to handle

Scenario

Query

““

nullif(column, "")

“None”

nullif(column, "None")

Numbers (floats)

The floating numeric from customer data source should follow the standard format that uses the comma as thousand separators (e.g.1,200 for one thousand two hundred) and the dot as decimal separator (e.g. 1.2 for one and two).

  • 1.234 = one and two hundred thirty-four thousandths

  • 1,234 = one thousand two hundred thirty-four

However it can happen quite often to find numeric data that violates these conventions.

How to handle

Scenario

Query

Result

We just need to read it correctly from text type and save it as numeric type.

to_number('1,234,567.009678' , '999G999G999G999D999999'::text)

1234567.009678

Negative sign is at the end of the number.

to_number('123,456,789.01-', '999999999G999D999S'::text);

-123456789.01

In the following example, you can find some common scenarios and how to fix them.

  • If our data already complies with the common standard (dot for decimal and comma for thousands separators):

Scenario

Query

Result

We just need to read it correctly from text type and save it as numeric type.

to_number('1,234,567.009678' , '999G999G999G999D999999'::text)

1234567.009678

Negative sign is at the end of the number.

to_number('123,456,789.01-', '999999999G999D999S'::text);

-123456789.01

  • If our data does NOT comply with the common standard:

Scenario

Query

Result

It uses dot for thousands and comma for decimal separators (eg. 1.200 and 1,2).

NULLIF(replace(replace('5.295,22' , '.'::text, ''::text), ','::text, '.'::text), ''::text)::numeric

5295.22

It uses comma for decimal separator (1,2) and nothing for thousands separator (1200).

NULLIF(replace('2271,02', ','::text, '.'::text), ''::text)::numeric

2271.02

Complex fields types

At the moment crystal is not supporting complex fields like “Json objects“ for Postgres or “LOB“ for Oracle. If in a table such field types are present, crystal will not be able to use them in the topic configuration process, nor will she be able to preview the related data.

Note that, for Oracle, all the data types described here are not supported.

Product limitations

Before we start to dive into the data preparation details, take a look at this article to know more about crystal’s limitations.

2. Data preparation outside crystal

The topic configuration is a key process for the success of the crystal project, that’s why it is extremely important to provide a clean data structure by following some guidelines.

The main entity

A data source is mainly composed of tables that are also made of columns.

Main entities can be configured by using a simple rule of thumb:

One column = one main entity

This is the suggested and recommended approach for most cases. Each column must contain the values or content of a single entity.

For example, let’s consider that we have to create a topic whose main entity is “Sales” and this is our data:

Product

City

Channel

Sales

Chargebacks

Product 1

Milan

Online

$20,56

$0

Product 2

London

Store

$50,36

$0

Product 1

Turin

Online

$97,76

$0

Product 1

New York

Store

$190,12

$0

In this case Sales will be our main entity. Other main entities, such as the chargebacks in our example, can be configured on the related column.

A crucially important factor to keep in mind is that entities’ aliases MUST be unique , therefore we will not be able to name another main entity within crystal “Sales”.

Also note that:

  • If, conceptually, there are multiple columns that refer to generic “sales”, they need to be centralized in the same column and in the same table to enable the entity “sales” to be unique in the conversation and allow the system to identify the source field.

  • if there are multiple sales columns that conceptually have differences (e.g. actual sales & budget sales), they can be different columns and can have different alias names (such as “actual sales“ and “budget sales”).

There is a possible alternative approach that can be used in peculiar cases. Let’s take a table like this as an example:

Product

Indicators

Value

Product 1

Revenues

$0

Product 1

Costs

$0

Product 1

Inventory

$0

In this case, we can use the “value” column as the main entity and the “indicator” column as a filter entity. Nonetheless, this approach is solid when the values within the column “Value” can be logically aggregated with one of the supported functions: otherwise there is the risk of showing inconsistent data to the final user.

Moreover, the conversation structure will be directly affected, as the main entity, and therefore the focus of the topic, will be “value”, rather than the values of the filter dimension (such as, for example, revenues or costs).

The filter entity

If a conceptual dimension (for example, “product”) is present in multiple tables (like, 3), and therefore in multiple columns (also 3), and it has to be used as a filter for topics configured on the three objects, there are two options:

  • We could configure two different entities (for example, “Sales”and “Sales target“) and potentially trigger the disambiguation feature, in the case the request to the advisor is partial or incomplete (for example, by specifying only "sales").

  • Or, we could centralize the filter column in a single object to configure the entity in a single column.

Let’s see an example.

If we assume that “product” is the only filter dimension that we want to share between the two tables, we will have to create a master table and two “data” tables as follows, linked by a key that we will use to join the two tables.

Master table

ID

ProductID

Price

34

2

12

85

6

15

45

3

23

Table 1: Sales targets

ID

ProductID

Target

23

2

45,000

54

6

125,000

57

3

10,000

Table 2: Products

The “Name” column would become the Product filter entity.

ID

Name

2

Apples

6

Pears

3

Bananas

Centralize all entities

Assuming that Product is the only filter dimension that we want to share between the two tables, we can directly create a single table containing all the filters and main entities.

The driver for this decision would be the granularity of the data and the amount of entities that share the same filters.

For example:

Master table: sales

ID

Product

Price

34

Apples

12

85

Pears

15

45

Bananas

23

Table 1: Sales targets

ID

Product

Target

23

Apples

45,000

54

Pears

125,000

57

Bananas

10,000

Table 2: Products

ID

Product

Price

Target

1

Apples

12

45,000

2

Pears

15

125,000

3

Bananas

23

10,000

Calculated columns

Calculated columns are customized columns created dynamically through operations. In the below case, for example, we have the total column which is generated by the system by multiplying price per quantity. At the moment crystal does not support the creation of calculated columns.

ID

ProductID

Price

Quantity

Total

34

34

5

3

15

85

85

10

5

50

45

45

20

6

120

3. Data manipulation inside crystal

Joins

Join can be performed on a single key, meaning that you cannot use multiple fields to join two tables. Moreover, join operations can be performed only if the objects reside in the same data source (such as, database connection).

The joined tables can be further joined to a maximum of 5.

The performance of crystal’s answers depends on the number of the join operations and the performance of the underlying data source (for example, indexing and other database optimization techniques can be performed to speed up the data retrieval and enhance crystal answering performance).

The supported joins are:

  • Inner

  • Left

  • Right

Sorting

You can sort the final result by any column of the connected data, choosing ascending or descending order.

Static filtering

In crystal, static filters correspond to the SQL where clause. It is possible to apply static filtering to the connected data: static filters are applied by default to a topic and they are not to be confused with dynamic filters, applied in the filter step of the topic configuration flow.

You can apply multiple static filters to a topic with a specific operation. For example, you can select “category” as a static filter and say that you want to filter category values that EQUAL “gadget” (meaning that only the “gadget” objects will be taken into account in the conversation), or the “price” as a static filter with operation “greater than” and value 100, and your topic will only retrieve those values which are above 100.

The available options are:

  • Equal

  • Different

  • Empty

  • Not Empty

  • Greater Than

  • Less Than

  • Greater Than Or Equal To

  • Less Than Or Equal To

  • Between

Aggregation functions

You can choose among the following aggregation functions:

  • Count

  • Count distinct

  • Sum

  • Average

We hope this article will help you to prepare and format your data at its best, to give you the most complete and successful crystal experience.

Do you want to know more about topic creation? Check out our Tutorials for admins section on Resources!

If you have any questions about crystal’s features, you have encountered a problem or you would like to share your feedback, contact us at support@igenius.ai.

Last updated