This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Data Management

Within a project, you can create and modify tables, views, and dimensions.

1 - Using Tables and Views

Using and managing tables and views

Tabular data and information in PlaidCloud is stored in Greenplum data warehouses. This provides massive scalability and performance while using well understood and mature technology to minimize risk of data loss or corruption.

In addition, utilizing a data warehouse that operates with a common syntax allows 3rd party tools to connect and explore data directly. Essentially, this makes the PlaidCloud data ecosystem open and explorable while also ensuring industry leading security and access controls.

Tables

Tables hold the physical tabular data throughout PlaidCloud. Individual tables can hold many terabytes of data if needed. Data is stored across many physical servers and is automatically mirrored to ensure data integrity and high availability.

Tables consist of columns of various data types. Using an appropriate data type can help with performance and especially the storage size of your data. PlaidCloud can do a better job of compressing the data if the data is using the most appropriate data type too. This is usually guessed by PlaidCloud but it is also possible to change the data types using the column mappers in workflow steps.

Views

Views act just like tables but don't hold any physical data. They are logical representations of tables derived through a query. Using views can save on storage.

There are some limitations to the use of views though. Just be aware of the following:

  • View Stacking Performance - View stacking (view of a view of a view...etc) can impact performance on very large tables or complex calculations. It might be necessary to create intermediate tables to improve performance.
  • Dashboard Performance - While perfectly fine to publish a view for Dashboard use, for very large tables you may want to publish a table rather than a view for optimal user experience.
  • Dynamic Data - The data in a view changes when the underlying referenced table data changes. This can be both a benefit (everything updates automatically) or an unexpected headache if the desire was a static set of data.

2 - Table Explorer

Table Explorer provides powerful and readily accessible data exploration capabilities

Table Explorer provides a powerful and readily accessible data exploration tool with built in filtering, summarization, and other features to make life easy for people working with large and complex data.

Table Explorer supports exploration on any size dataset so you can use the same tool no matter how much your data grows. It also provides point-and-click filtering along with advanced filter capabilities to zero in on the data you need. The best part is that anywhere in PlaidCloud with tables or views, you can click on those tables and views to explore with Table Explorer. By being fully integrated, data access is only a click away.

The Grid view provides a tabular view of the data. The Details view provides a summary of each column, a count of unique values, and summary statistics for numeric columns.

Data can be exported directly from a filtered set as well as being able to save and share filters with others. Finally, the filters and column settings can be saved directly as a workflow Extract step.

The Grid View

The Grid view provides a tabular view of the data.

Setting the row limit

By default, the row limit is set to 5,000 rows. However, this can be adjusted or disabled entirely.

The rows shown along with the total size of the dataset are shown at the bottom of the table. The information provides three key pieces of information:

  1. The current row count shown based on the row limit applied
  2. The size of the global data after filters are applied
  3. The size of the unfiltered global data

Sorting locally versus globally

The Grid view provides the ability to click on the column header and sort the data based on that column. However, this method is only sorting the dataset that has already been retrieved and is not sorting based on the full dataset. If your retrieved data contains the entire dataset this distinction is immaterial however if your full dataset is larger than what appears in the browser, this may not be the desired sort result.

If you desire to sort the global dataset before retrieving the limited data that will appear in your browser those sorts can be applied to the columns in the Details view by clicking on the Sort icon at the top of each column. An additional benefit of using the global sort approach is that you can apply multiple sorts along with a mix of sort directions.

Quick reference column list

All of the columns in the table or view are shown on the left of the Table Explorer window by default. This column list can be toggled on and off using the column list toggle button.

The column list provides a number of quick access and useful features including:

  • Double clicking an item jumps to the column in the Grid or Details view
  • Control visibility of the column through the visibility checkbox
  • Use multi-select and right-click to include or exclude many columns at once
  • Quickly view the data type of each column using the data type icons
  • View the total column count

The Details View

The Details view provides an efficient way to view the data at a high level and exposes tools to quickly filter down to information with point-and-click operations.

Column data and unique counts

Each column is shown, provided it is currently marked as visible. The column summary displays the top 1,000 unique values by count. The number of unique values shown can be adjusted by selecting the Detailed Rows Displayed selection for a different value.

Managing point-and-click filters

Each column provides for point-and-click filtering by activating the filter toggle at the top of the column. Select the items in the column that you would like to include in the resulting data. Multi-select is supported.

Once you apply a filter, there may be items you wish to remove or to clear the entire column filter without clearing all filters. This is accomplished by selecting the dropdown on the column filter button and unchecking columns or selecting the clear all option at the top.

Managing Summarization

Summarization of the data can be applied by toggling the Summarize button to On. When the Summarize button is activated, each column will display a summarization type to apply. Adjust the summarization type desired for each column.

When the desired summarizations are complete, refresh the data and the summarizations will be applied.

Examples of summarization types are Min, Max, Sum, Count, and Count Distinct.

Finding Distinct Values

Activating the Distinct button will help reduce the data to only a set of unique records. When the Distinct button is active, a Distinct checkbox will appear on each column. Uncheck the columns that DO NOT define uniqueness of the column to the dataset. For example, if you want to find the unique set of customers in a customer order table, you would only want to select the customer column rather than including the customer order number too.

Summary statistics for numeric columns

Integer and numeric columns automatically display summary statistics at the bottom of the column information. This includes:

  • Min
  • Max
  • Mean
  • Sum
  • Standard Deviation
  • Variance

These statistics are calculated on the full filtered dataset.

Copying Data

It is sometimes useful to allow for copying of selected data from PlaidCloud so that it can be pasted into other applications such as a spreadsheet.

From the Copy button in the upper right, there are several copy options available for the data:

  • Copy All - Copies all of the data to the clipboard
  • Copy Selection - Copies the selected data to the clipboard
  • Copy Cell - Copies only the contents of a single cell to the clipboard
  • Copy Column - Copies the full contents of the column to the clipboard

Exporting Data

Exporting data from the Table Explorer interface allows exporting of the filtered data with only the columns visible. You can export in the following formats:

  • Microsoft Excel (xlsx)
  • CSV (Comma)
  • TSV (Tab)
  • PSV (Pipe)

The Download menu also offers the ability to download only the rows visible in the browser. This is based on using the row limit specified.

Additional Actions

Additional useful actions are available under the Actions menu.

Save as Extract Step

When exploring data, it is often in the context of determining how to filter it for a data pipeline process. This often consists of applying multiple filters including advanced filters to zero in on the desired result.

Instead of attempting to replicate all the filters, columns, summarizations, and sorts in an Extract Step, you can simply save the existing Table Explorer settings as a new Extract Step.

Save as View

Similar to saving the current Table Explorer settings as an Extract Step above, you can also save the settings directly as a view.

This can be particularly useful when trying to construct slices of data for reporting or other downstream processes that don't require a a data pipeline.

Manage Saved Filters

You never have to lose your filter work. You can save your Table Explorer settings as a saved filter. Saved filters also include column visibility, summarizations, columns filters, advanced filters, and sorts.

You can also let others use a saved filter by checking the Public checkbox when saving the filter.

From the Actions menu you can also choose to delete and rename saved filters.

Advanced Filters

While point-and-click column filters allow for quick application of filters to zero in on the desired results, sometimes filter conditions are complex and need more advanced specifications.

The advanced filter area provides both a pre-aggregation filter as well as a post-aggregation filter, if Summarize is enabled.

Any valid Python expression is acceptable to subset the data. Please see Expressions for more details and examples.

3 - Using Dimensions (Hierarchies)

Using and managing hierarchical data

PlaidCloud natively manages dimension (i.e. hierarchical) data through our proprietary hierarchy storage system. We decided to construct our own from purpose-built solution because other commercial and open-source solutions seem to present limitations that were not easily overcome.

The hierarchy storage supports not only hierarchical relationships but also properties, aliases, attributes, and values. It is also designed to operate on large structures and perform operations quickly including complex branch and leaf navigation.

Dimensions are managed in the Dimensions tab within each PlaidCloud project configuration area.

Main Hierarchy

Each dimension (i.e. hierarchical dataset) always consists of a main hierarchy. Every member of the hierarchy is represented here.

Having a main hierarchy helps establish the complete set of leaf nodes in the dimension.

Alternate or Attribute Hierarchies

Alternate hierarchies are different representations of the main hierarchy leaf nodes. Alternate hierarchies can consist of a subset of both leaf nodes and roll-up (i.e. folders) in the main hierarchy as well as its own set of unique roll-ups.

This provides for the maximum amount of flexibility by automatically updating alternate hierarchies when children of a roll-up change or to strictly control the alternate hierarchy members by specifying only the leaf nodes required.

Managing Dimensions

Creating a Dimension

From the New button in the toolbar, select New Dimension. Enter in the desired name, directory, and a descriptive memo.

Once you press the Create button the dimension will be created and ready for immediate use.

You can also create a dimension from a workflow using the Dimension Create workflow step.

Deleting a Dimension

To delete an existing dimension, select the dimension record and open the Actions menu in the upper right. Select Delete Dimension.

This will delete the dimension and all underlying data.

You can also delete a dimension from a workflow using the Dimension Delete workflow step.

It is also possible to clear the dimension of all structure, values, aliases, properties, and alternate hierarchies without deleting the dimension by using the Dimension Clear workflow step.

Copying a Dimension

To copy an existing dimension, select the dimension record and open the Actions menu in the upper right. Select Copy Dimension.

This will open a dialog where you can specify the name of the copy. Click the Create Copy button to make a copy of the dimension including values, aliases, properties, and alternate hierarchies.

Sorting a Dimension

The dimension management area makes it easy to move hierarchy members up and down as well as changing parents. It also makes it easy to create and delete members.

However, it can get tedious when manually moving hierarchy items around so you can sort a dimension from a workflow using the Dimension Sort workflow step. This can be a big time saver especially after data loads or major changes.

Loading Dimensions

Since dimensions represent hierarchical data structures, the load process must convey the relationships in the data. PlaidCloud supports two different data structures for loading dimensions:

  • Parent-Child - The data is organized vertically with a Parent column and Child column defining each parent of a child throughout the structure
  • Levels - The data is organized horizontally with each column representing a level in the hierarchy from left to right

In addition to structure, other dimension information can be included in the load process such as values, aliases, and properties.

See the Workflow Step for Dimension Load for more information.

Dimension Property Inheritance

A dimension may inherit a property from an ancestor. To enable inheritance, click the dropdown next to Properties and select Inherited Properties. All child nodes in the dimension will now inherit the propties of its parents.

Usage Notes:

  • Inheritance will happen for all properties in a dimension. You cannot set inheritance on one property but not another.
  • If you change and then delete the value of a child property, it will default back to the parent value. You cannot have a null value when the parent has a value.
  • If you set the value of a child property, its children will inherit the child property instead of the parent.
  • Inheritance will go all the way down to the leaf node.

4 - Publishing Tables

Publishing Tables and Views to allow usage in Dashboard, PlaidXL, and other external reporting

Since data pipelines can generate many intermediate tables and views useful for validation and process checks but not suitable for final results reporting, PlaidCloud provides a Publish process to help reduce the noise when building Dashboards or pulling data in PlaidXL. The Publish process helps clarify which tables and views are final and reliable for reporting purposes.

Publish

From the Tables tab in a PlaidCloud project configuration, find the table you wish to publish for use in dashboards and PlaidXL. Right-click on the table record and select Set Published Table Reporting Name from the menu.

This will open a dialog where you can specify a unique published name. This name does not need to be the same as the table or view name. Enabling a different name is often useful when referencing data sources in dashboards and PlaidXL because it can provide a friendlier name to users.

Once the table or view is published, its published name will appear in the Published As column in the Tables view.

Unpublish

Unpublishing a table or view is similar to the publish process. From the Tables tab in a PlaidCloud project configuration, find the table you wish to publish for use in dashboards and PlaidXL. Right-click on the table record and select Set Published Table Reporting Name from the menu.

When the dialog appears to set the published name, select the Unpublish button. This will remove the table from Dashboard and PlaidXL usage.

The published name will no longer appear in the Published As column.

Renaming

Renaming a table or view is similar to the publish process. From the Tables tab in a PlaidCloud project configuration, find the table you wish to publish for use in dashboards and PlaidXL. Right-click on the table record and select Set Published Table Reporting Name from the menu.

When the dialog appears change the publish name to the new desired name. Press the Publish button to update the name.

The updated name will now appear in the Published As column as well as in Dashboard and PlaidXL.