Data Lakehouse Service

The PlaidCloud Data Lakehouse Service (DLS) provides the speed of a data warehouse combined with the vast storage capability of a data lake. The DLS is based on Databend, a lakehouse suitable for big data analytics and traditional data warehouse operations that also supports data lake operations and is compatible with Apache Iceberg, Apache Hive, and Delta Lake. Its extensive analytical optimizations, array of indexing types, and high compression make it ideal for a wide array of uses.

1: Getting Started
2: Pricing

1 - Getting Started

Getting started with the PlaidCloud Data Lakehouse Service

About

The PlaidCloud Data Lakehouse Service (DLS) stands on the shoulders of great technology. The service is based on Databend, a lakehouse suitable for big data analytics and traditional data warehouse operations while supporting vast storage as a data lake. Its extensive analytical optimizations, array of indexing types, high compression, and native time travel capabilities make it ideal for a wide array of uses.

The PlaidCloud DLS also has the ability to integrate with existing data lakes on Apache Hive, Apache Iceberg, and Delta Lake. This allows for accessing vast amounts of already stored data using a modern and fast query engine without having to move any data.

The PlaidCloud DLS continues our goal of providing the best open source options for our customers to eliminate lock-in while also providing services as turn-key solutions.

Managing, upgrading, and maintaining a data lakehouse requires special skills and investment. Both can be hard to find when you need them. The PlaidCloud service eliminates that need while still providing deep technical access for those that need or want total control.

Key Benefits

Always on

The PlaidCloud DLS provides always-on query access. You don't have to schedule availability or incur additional costs for usage outside the expected time.

This also means there is no first-query delay and no cache to warm up before optimal performance is achieved.

Read and Write the way you expect

The PlaidCloud DLS operates like a traditional database so you don't have to decide which instances are read-only or have special processes to load data from a write instance. All instances support full read and write with no special ETL or data loading processes required.

If you are used to traditional databases, you don't need to learn any new skills or change your applications. The DLS is a drop-in replacement for ANSI SQL-compliant databases. If you are coming from other databases such as Oracle, MySQL, or Microsoft SQL Server, some adjustments to your query logic may be necessary, but not to the overall process.

Since SAP HANA and Amazon Redshift use the PostgreSQL dialect, those seeking a portable alternative will find PlaidCloud DLS a straightforward option.

Economical

With usage-based billing, you only pay for what you use. There are no per-query or extra processing charges. Triple redundant storage, incredible IOPS, wide data throughput, time travel queries, and out-of-band backups are all standard at a reasonable price.

We eliminate the headache of having to choose different data warehousing tiers based on optimizing storage costs. We offer the ability to select how long each table's history is kept live for time travel queries and recovery.

Zero (0) days of time travel creates a transient table that will have no time travel or recovery. This is suitable for intermediate tables or tables that can be reproduced from other data.

You can set tables to have from one (1) to ninety (90) days of time travel. During the time travel window you can issue queries to view data at different snapshots or periods, and you can recover a table as of a point in time into a new table. This is an incredibly powerful capability that surpasses traditional backups because the historical state of a table can be viewed with a simple query rather than having to recover a backup.
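The retention rules above can be sketched in a few lines of Python. This is an illustration only: the helper names `table_kind` and `earliest_recoverable` are hypothetical and not part of any PlaidCloud API, and the sketch assumes the time travel window is counted back from the current time.

```python
from datetime import datetime, timedelta, timezone

MAX_TIME_TRAVEL_DAYS = 90  # upper bound stated in the text above


def table_kind(time_travel_days: int) -> str:
    """Classify a table by its time travel setting (hypothetical helper)."""
    if not 0 <= time_travel_days <= MAX_TIME_TRAVEL_DAYS:
        raise ValueError("time travel must be between 0 and 90 days")
    return "transient" if time_travel_days == 0 else "time-travel"


def earliest_recoverable(now: datetime, time_travel_days: int):
    """Return the oldest point in time that can still be queried, or None.

    ASSUMPTION: the window is counted back from `now`. A transient table
    (0 days) has no history, so None is returned.
    """
    if table_kind(time_travel_days) == "transient":
        return None
    return now - timedelta(days=time_travel_days)


now = datetime(2024, 3, 31, tzinfo=timezone.utc)
print(table_kind(0))                  # transient
print(earliest_recoverable(now, 30))  # 2024-03-01 00:00:00+00:00
```

A table with zero days of time travel returns `None` because there is no history to query or recover from.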

Highly performant

We employ multiple caching strategies to ensure peak performance.

We also extensively tested optimal compute, networking, and RAM configurations to achieve maximum performance. As new technology and capabilities become available, our goal is to incorporate features that increase performance.

Scale out and scale up capable

The ability to both scale up and scale out is essential for a data lakehouse, especially when it is performing analytical processes.

Scaling up means more queries can run simultaneously. This is useful if you have many users or applications that require many concurrent processes.

Scaling out means more compute power can be applied to each query by breaking the data processing up across many CPUs. This is useful on large data where summarizations or other analytical processes such as machine learning, AI, or geospatial analysis are required.

The PlaidCloud DLS allows scale expansion either on-demand or based on pre-defined events/metrics.

Integrated with PlaidCloud Analyze for Low/No Code operations

Analyze, Dashboards, Forms, PlaidXL, and JupyterLab can be quickly connected to any PlaidCloud DLS. This provides point-and-click operations to automate data-related activities and to build beautiful visualizations for reporting and insightful analysis.

From an Analyze project, you can select any DLS instance. This also provides the ability for Analyze projects to switch among DLS instances to facilitate testing and Blue/Green upgrade processes. It also allows quickly restoring an Analyze Project from a DLS point-in-time backup.

Clone

Making a clone of an existing lakehouse creates a complete copy of the source lakehouse. A clone shares nothing with the original lakehouse, so it is a quick way to isolate a complete lakehouse for testing or to keep a live archive as of a specific point in time.

Another important feature is that you can clone a lakehouse to a different data center. This might be desirable if global usage shifts from one region to another, or if having a copy of a lakehouse in various regions for development and testing improves internal processes.

Web or Desktop SQL Client Access

A web SQL console is provided within PlaidCloud. It is a full-featured SQL client that supports most use cases. For more advanced use cases, however, a desktop client or other service may be desired. The PlaidCloud DLS uses standard security and access controls, enabling remote connections and controlled user permissions.

Access options allow quick and easy start-up as well as ongoing query and analytics access. A firewall allows control over external access.

DBeaver provides a nice free desktop option that has a Greenplum driver to fully support PlaidCloud DWS instances. They also provide a commercial version called DBeaver Pro for those that require/prefer use of licensed software.

2 - Pricing

PlaidCloud Data Lakehouse Service Pricing

Usage Based

The cost of a PlaidCloud Data Lakehouse instance is determined by a limited number of factors that you control. All costs incurred are usage based.

The factors that impact cost are:

Concurrency Factor - The size of each compute node in your warehouse instance
Parallelism Factor - The number of nodes in your warehouse instance
Allocated Storage - The number of Gigabytes of storage consumed by your warehouse instance
Network Egress - The number of Gigabytes of network egress. Excludes traffic to PlaidCloud applications within the same region. Ingress is always free.
Time Travel Period - How many days, weeks, or months to retain time travel history on tables

Storage, backups, and network egress are calculated in gigabytes (GB), where 1 GB is 2^30 bytes. This unit of measurement is also known as a gibibyte (GiB).
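As a quick illustration of that unit definition, the following sketch converts a raw byte count into the GB unit described above. The function name `bytes_to_billable_gb` is hypothetical and used only for this example.

```python
BYTES_PER_GB = 2 ** 30  # 1 GB as defined above (equivalent to one gibibyte)


def bytes_to_billable_gb(num_bytes: int) -> float:
    """Convert a byte count to GB using the 2^30 definition (hypothetical helper)."""
    if num_bytes < 0:
        raise ValueError("byte count cannot be negative")
    return num_bytes / BYTES_PER_GB


print(bytes_to_billable_gb(5 * 2 ** 30))  # 5.0
```

Note that one decimal gigabyte (10^9 bytes) is roughly 0.93 GB under this definition, so figures reported by tools that use decimal units will appear slightly larger.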

All prices are in USD. If you are paying in another currency please convert to your currency using the appropriate rate.

Billing is on an hourly basis. The monthly prices shown are illustrative and based on a 730-hour month.
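The relationship between hourly and monthly figures can be sketched as follows. The rate used here is a made-up placeholder, since actual prices are only available on request, and `monthly_estimate` is a hypothetical helper name.

```python
HOURS_PER_MONTH = 730  # illustrative month length used for the monthly prices


def monthly_estimate(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Estimate a monthly charge from an hourly rate (hypothetical helper)."""
    return hourly_rate * hours


# Placeholder rate of 0.25 USD per hour, not an actual PlaidCloud price.
print(monthly_estimate(0.25))  # 182.5
```

Because billing is hourly, passing the actual number of hours an instance existed gives the charge for a partial month.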

Controlling Factors

Concurrency Factor

Compute Type	Hourly Cost (streams/hr)	Monthly Cost (streams/month)
Standard	Contact Us	Contact Us

Concurrency determines how many simultaneous queries are handled by the DLS instance. This is expressed as a number of process streams. There is not a 1:1 relationship between streams and query capacity, since a single stream can handle multiple simultaneous queries. However, as the number of concurrent requests increases, query duration may exceed the desired response time, and an increase in the concurrency factor will help.

From a conceptual standpoint you can view processing streams as vCPUs used to process queries.

The default concurrency factor is 2, which is a good starting point if you are unsure of your needs. It can be adjusted from 1 to 14. If your needs exceed 14, please contact us to increase your concurrency limit.

Parallelism Factor

There is no additional cost per node. The compute cost of the DLS instance is the product of concurrency and parallelism plus the master node.
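One way to read that formula is sketched below. It is an interpretation rather than an official calculator: the per-stream rate is a placeholder, the function names are hypothetical, and the sketch assumes the master node is billed for the same number of streams as one compute node unless told otherwise.

```python
from typing import Optional

DEFAULT_CONCURRENCY = 2  # default concurrency factor stated above
MAX_CONCURRENCY = 14     # standard upper limit stated above


def billable_streams(parallelism: int,
                     concurrency: int = DEFAULT_CONCURRENCY,
                     master_streams: Optional[int] = None) -> int:
    """Streams billed for an instance: concurrency x parallelism + master node."""
    if not 1 <= concurrency <= MAX_CONCURRENCY:
        raise ValueError("concurrency factor must be between 1 and 14")
    if parallelism < 1:
        raise ValueError("an instance needs at least one node")
    if master_streams is None:
        # ASSUMPTION: the master node is sized like a single compute node.
        master_streams = concurrency
    return concurrency * parallelism + master_streams


def hourly_compute_cost(streams: int, rate_per_stream_hour: float) -> float:
    """Hourly compute charge for a given stream count (placeholder rate)."""
    return streams * rate_per_stream_hour


streams = billable_streams(parallelism=4)  # 2 x 4 + 2 = 10 streams
print(hourly_compute_cost(streams, 0.5))   # 5.0 at a placeholder rate of 0.5
```

If the master node is billed differently, pass `master_streams` explicitly instead of relying on the assumed default.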

Parallelism determines how many nodes are in the DLS instance. This is expressed as node count. The number of nodes determines how much compute power can be applied to any single query. By increasing the node count, the computational part of the query can be spread out over many process streams. In addition, the storage throughput is multiplied by the number of nodes, which is very valuable when dealing with large datasets.

For example, if the maximum theoretical write throughput of a single node were 4 TB/sec, a warehouse with 8 nodes would have a theoretical write throughput of 8 x 4 TB/sec = 32 TB/sec. Many factors affect write speed, including compression level, indexes, table storage type, and network overhead, but in general, nodes apply a multiplying factor to data throughput speed.
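The same arithmetic, written as a small sketch. The `efficiency` parameter is a hypothetical stand-in for the real-world factors mentioned above (compression, indexes, network overhead) and is not a measured value.

```python
def theoretical_throughput(per_node_tb_per_sec: float,
                           nodes: int,
                           efficiency: float = 1.0) -> float:
    """Aggregate throughput in TB/sec when each node contributes equally.

    `efficiency` (0 to 1) is a hypothetical discount for real-world overhead.
    """
    if nodes < 1:
        raise ValueError("node count must be at least 1")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return per_node_tb_per_sec * nodes * efficiency


print(theoretical_throughput(4.0, 8))  # 32.0, matching the example above
```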

Allocated Storage

Two types of table storage options are available in a PlaidCloud DLS:

Regional
Multi-Regional

Storage Type	Hourly Cost (GB/hr)	Monthly Cost (GB/month)
Regional	Contact Us	Contact Us
Multi-Regional	Contact Us	Contact Us

Regional

This storage provides triple redundancy across multiple availability zones in a single region. This is suitable for most workloads that do not need geographically distributed redundancy.

Multi-Regional

This storage provides triple redundancy in each region and is stored in two regions. This provides geographical redundancy and fast failover for data requiring the highest availability.

Network Egress

Network egress charges depend on the cloud provider, region, and destination of the traffic. Contact us and we can provide a detailed cost matrix.

Network egress is calculated based on the egress traffic from your PlaidCloud Workspace. For a DLS instance, traffic to PlaidCloud applications in the same region, such as Analyze and Dashboards, is excluded. However, if you are connecting directly to the DLS instance through the external access point, egress charges will apply. In addition, if you access DLS instances from different regions using PlaidCloud applications, then egress charges will apply.

If you connect between DLS instances in the same region using internal network routing there are no egress charges. However, if you connect using the external endpoint then egress charges will apply.
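The egress rules described above reduce to a simple decision, sketched here. The function name `egress_charged` is hypothetical. The text does not cover DLS-to-DLS traffic across regions over internal routing, so the sketch assumes it is charged like other cross-region traffic.

```python
def egress_charged(same_region: bool, via_external_endpoint: bool) -> bool:
    """Return True when egress charges would apply under the rules above.

    ASSUMPTION: any cross-region traffic is charged, including the
    DLS-to-DLS internal-routing case that the text does not cover.
    """
    return via_external_endpoint or not same_region


# PlaidCloud application in the same region over internal routing: no charge.
print(egress_charged(same_region=True, via_external_endpoint=False))  # False
# Direct connection through the external access point: charged.
print(egress_charged(same_region=True, via_external_endpoint=True))   # True
```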

There is no charge for ingress traffic.