This is the multi-page printable view of this section. Click here to print.
Data Warehouse Service
1 - Getting Started
About
The PlaidCloud Data Warehouse Service (DWS) stands on the shoulders of great technology. The service is based on Greenplum, a warehouse suitable for big data analytics and traditional data warehouse operations. It's extensive analytical optimizations, array of indexing types, highly-flexible compression, and availability of both row-based and columnar storage models makes it ideal for wide array of uses.
The PlaidCloud DWS continues our goal of providing the best open source options for our customers to eliminate lock-in while also providing services as turn-key solutions.
Managing, upgrading, and maintaining a data warehouse requires special skills and investment. Both can be hard to find when you need them. The PlaidCloud service eliminates that need while still providing deep technical access for those that need or want total control. Since Greenplum is based on PostgreSQL, it is nearly 100% compatible with current PostgreSQL operations.
Key Benefits
Always on
The PlaidCloud DWS provides always-on query access. You don't have to schedule availability or incur additional costs for usage outside the expected time.
This also means there is no first-query delay and no cache to warm up before optimal performance is achieved.
Read and Write the way you expect
The PlaidCloud DWS operates like a traditional database so you don't have to decide which instances are read-only or have special processes to load data from a write instance. All instances support full read and write with no special ETL or data loading processes required.
If you are used to using traditional databases, you don't need to learn any new skills or change your applications. The DWS is a drop-in replacement for Greenplum as well as a replacement for PostgreSQL, CockroachDB, yugabyteDB and other databases that use the PostgreSQL Wire Protocol. If you are coming from other databases such as Oracle, MySQL or Microsoft SQL Server then some adjustments to your query logic may be necessary but not to the overall process.
Since SAP HANA and Amazon Redshift use the PostgreSQL dialect, those seeking a portable alternative will find PlaidCloud DWS a straightforward option.
Economical
With usage based billing, you only pay for what you use. There are no per-query or extra processing charges. High performance storage with triple redundancy, incredible IOPS, wide data throughput, and out-of-band backups are all standard at a reasonable price.
We eliminate the headache of having to choose different data warehousing tiers based on optimizing storage costs. We offer three different storage options at a table level which all interoperate and can be used together in queries:
- HOT - This is the highest performance storage available and is suitable for analytical data that is frequently accessed or needs to be ultra-responsive
- WARM - This provides cost savings over Hot storage while maintaining good performance and no changes to SQL commands
- COLD - This is the most economical by utilizing cloud storage
Highly performant
While network attached storage has been able to achieve significant performance, it still can't come close to local disk. Using local disks for storage is complicated while operating in cloud environments but our goal was to provide an uncompromising data warehouse service that can achieve the same or better performance as a hand-built data warehouse cluster.
We also extensively tested optimal compute, networking, and RAM configurations to achieve maximum performance. As new technology and capabilities become available, our goal is to incorporate features that increase performance.
Real-time backups without impacting performance
One of the more complex processes with data warehouse clusters is backups. While seemingly simple, achieving a consistent snapshot of data across many nodes while not interfering in the execution of multiple queries is actually quite complex. Doing this without impacting performance of the database is even harder.
Thankfully, you don't have to worry about all that complexity. You can set the frequency of backups you desire and it is all handled automatically for you. While all data is triple redundant, backups are necessary in the event a destructive user action takes place such as accidentally deleting data or dropping a table. Having a backup allows for recovery of that prior state.
Scale out and scale up capable
The ability to both scale up and scale out are essential for a data warehouse, especially when it is performing analytical processes.
Scaling up means more simultaneous queries can occur at once. This is useful if you have many users or applications that require many concurrent processes.
Scaling out means more compute power can be applied to each query by breaking the data processing up across many CPUs. This is useful on large data where summarizations or other analytical processes such as machine learning (MADLib) or geospatial (PostGIS) analysis is required.
The PlaidCloud DWS allows scale expansion either on-demand or based on pre-defined events/metrics.
Integrated with PlaidCloud Analyze for Low/No Code operations
Analyze and Dashboards are quickly connected to any PlaidCloud DWS. This provides point-and-click operations to automate data related activities as well as building beautiful visualizations for reporting and insightful analysis.
From an Analyze project, you can select any DWS instance. This also provides the ability for Analyze projects to switch among DWS instances to facilitate testing and Blue/Green upgrade processes. It also allows quickly restoring an Analyze Project from a DWS point-in-time backup.
Clone
Making a clone of an existing warehouse performs a complete copy of the source warehouse. When a clone is made it has nothing shared with the original warehouse and therefore is a quick way to isolate a complete warehouse for testing or even a live archive at a specific point in time.
Another important feature is that you can clone a warehouse to a different data center. This might be desireable if global usage shifts from one region to another or having a copy of a warehouse in various regions for development/testing improves internal processes.
Restore
A new warehouse instance is easily restored from an existing backup. The backup frequency is adjustable for each warehouse instance. Those backups allow for a point-in-time restoration.
Prioritize queries within the warehouse
The PlaidCloud DWS provides a straightforward way to control the priority of queries within a single DWS instance. Through use of Resource Queues, certain roles can be granted higher priority. This differs from other warehouse services that require separate warehouse instances to delineate different priority access based on resource isolation/dedication.
By using Resource Queues, you can achieve your business requirements (e.g. high priority dashboards for executives) while using a single DWS instance. This allows you to control resource usage and eliminates the need to have large amounts of idle resources dedicated to low usage (high importance) scenarios.
Large number of connectors available
Since PlaidCloud DWS is based on PostgreSQL technology, virtually all PostgreSQL connectors and clients will work out-of-the-box. With a vibrant PostgreSQL community, new capabilities, adapters, and connectors are released frequently.
Some examples:
- Integration with Microsoft PowerBI using the NpgSQL built-in connector
- Connect Tableau using the standard data source setup for PostgreSQL connections
- Apache Superset integration using PostgreSQL connection string
- Qlik integration using the PosgreSQL Connector Package
Foreign table access
Already have data in another database or in cloud storage? No worry, you can connect to it directly and include the data in complex queries such as joins and Common Table Expressions. Use of foreign tables also include predicate push-down so conditions are applied before the data is moved to the DWS instance.
This enables use of existing data sources which means you can choose to gradually migrate them to a DWS instance or choose to keep the data where it exists forever.
Note that performance will not be as good as having the data in the DWS instance since it is subject to network speeds and the speed of the foreign data source operations.
This capability also enables communication across different PlaidCloud DWS instances. While it would be ideal to have all data in a single warehouse instance, there are certainly situations where this is not always practical.
Well understood and mature
While much of data warehousing activity is fairly straightforward, there still remains a large body of work that pushes the bounds of a database. When operating at maximum capacity, many facets come into play including the maturity and optimization of all the underlying processes. Since PlaidCloud DWS is built on very mature technology in use for decades, substantial performance and stability optimizations are in place.
With a well understood and mature technical foundation, there is a far less likelihood of strange failure modes and when unusual events do occur an answer is likely a Google search away.
Tuning queries is sometimes necessary for highly complex queries. There are substantial resources available that help explain, analyze, and optimize queries in PostgreSQL and Greenplum systems. We all wish that the days of hand tuning queries were no longer necessary. The questions we ask of our data and required processing to determine a result can often have orders of magnitude time improvements by adjusting aspects of the query where even the most intelligent query planner will struggle.
When trying to squeeze out the best performance you want to rely on known patterns and examples.
Web or Desktop SQL Client Access
A web SQL console is provided within PlaidCloud. It is a full featured SQL client so it supports most use cases. However, for more advanced use cases, a desktop client or other service may be desired. The PlaidCloud DWS uses standard security and access controls enabling remote connections and controlled user permissions.
Access options allow quick and easy start-up as well as ongoing query and analytics access. A firewall allows control over external access.
DBeaver provides a nice free desktop option that has a Greenplum driver to fully support PlaidCloud DWS instances. They also provide a commercial version called DBeaver Pro for those that require/prefer use of licensed software.
2 - Pricing
Usage Based
The cost of a PlaidCloud Data Warehouse instance is determined by a limited number of factors that you control. All costs incurred are usage based.
The factors that impact cost are:
- Concurrency Factor - The size of each compute node in your warehouse instance
- Parallelism Factor - The number of nodes in your warehouse instance
- Allocated Storage - The number of Gigabytes of storage consumed by your warehouse instance
- Network Egress - The number of Gigabytes of network egress. Excludes traffic to PlaidCloud applications within the same region. Ingress is always free.
- Backup Retention Period - How many days, weeks, or months to retain backups beyond 30 days
Storage, backups, and network egress are calculated in gigabytes (GB), where 1 GB is 2^30 bytes. This unit of measurement is also known as a gibibyte (GiB).
All prices are in USD. If you are paying in another currency please convert to your currency using the appropriate rate.
Billing is on an hourly basis. The monthly prices shown are illustrative based on a 730 hour month.
Controlling Factors
Concurrency Factor
Compute Type | Hourly Cost (streams/hr) | Monthly Cost (streams/month) |
---|---|---|
Standard | Contact Us | Contact Us |
Concurrency determines how many simultaneous queries are handled by the DWS instance. This is expressed as a number of process streams. There is not a 1:1 relationship between streams and query capacity since a single stream can handle multiple simultaneous queries. However, as the number of concurrent requests increase, the query duration may exceed the desired response time and an increase in the concurrency factor will help.
From a conceptual standpoint you can view processing streams as vCPUs used to process queries.
The default concurrency factor is 2, which is a good starting point if you are unsure of your needs. It can be adjusted from 1 to 14. If your needs exceed 14, please contact us to increase your concurrency limit.
Parallelism Factor
There is no additional cost per node. The compute cost of the DWS instance is the product of concurrency and parallelism plus the master node.
Parallelism determines how many nodes are in the DWS instance. This is expressed as node count. The number of nodes determines how much compute power can be applied to any single query. By increasing the node count, the computational part of the query can be spread out over many process streams. In addition, the storage throughput is multiplied by the number of nodes, which is very valuable when dealing with large datasets.
For example, if the maximum theoretical write throughput of a single node was 4 TB/sec, a warehouse with 8 nodes would have a theoretical write throughput of 8 x 4 TB/sec = 32 TB/sec. There are many factors that impact write speed including compression level, indexes, table storage type, network overhead, etc... but in general, nodes apply a multiplying factor to data throughput speed.
Allocated Storage
Three types of table storage options are available in a PlaidCloud DWS:
- Hot
- Warm
- Cold
Storage Type | Hourly Cost (GB/hr) | Monthly Cost (GB/month) |
---|---|---|
Hot | Contact Us | Contact Us |
Warm | Contact Us | Contact Us |
Cold | Contact Us | Contact Us |
These storage options can be applied on a table-by-table basis so you can optimize storage costs within a DWS with no change to existing queries.
Hot Storage
This is the most common storage type for a database. It is the default storage type for data in the DWS instance.
Storage cost is computed based on the allocated Hot storage space for the warehouse instance. Storage is allocated to the warehouse on-demand up to the specified limit set by you. The current limit is 4.5TB per node. If your needs exceed 4.5TB per node, please contact us to increase your node storage limit.
Warm Storage
Warm Storage provides an excellent trade-off between cost and performance. Warm storage is ideal for data used in batch processing, infrequently accessed historical data, or other general data that does not have high performance requirements. Warm storage provides good performance and does not have per node size limits.
Cold Storage
Cold storage is significantly less expensive than both Hot and Warm but it does have limitations. It is not included in the backup snapshots. It has significantly lower performance and is generally not suitable for queries that must be responsive.
However, for low usage or archival data it can provide a substantial cost savings while still enabling real-time access to the data, albeit at a slower query speed. This is a significant improvement over using ETL processes to archive table data and then needing to reconstitute it later when required through additional ETL processes.
For example, if the current and prior year financial data is stored in high performance storage to handle the vast majority of queries, prior years could be stored in Cold storage. When access to several years is needed, exceeding what is in hot storage, then a simple UNION query of the hot data and the cold data will return the full dataset. This eliminates complex data archival processes by keeping all the data readily available in the same DWS instance while optimizing storage costs.
Network Egress
Source Geolocation | Egress (per GB) | Ingress (per GB) |
---|---|---|
Worldwide Locations (Default) | $0.13 | Free |
China Locations (excluding Hong Kong) | $0.26 | Free |
Australia Locations | $0.20 | Free |
Network egress is calculated based on the egress traffic from your PlaidCloud Workspace. In terms of the egress traffic from a DWS instance, traffic to PlaidCloud applications in the same region such as Analyze and Dashboard are excluded. However, if you are connecting directly to the DWS instance through the external access point, egress charges will apply. In addition, if you access DWS instances from different regions using PlaidCloud applications then egress charges will apply.
If you connect between DWS instances in the same region using internal network routing there are no egress charges. However, if you connect using the external endpoint then egress charges will apply.
There is no charge for ingress traffic.
Backup Retention Period
Retention Period | Hourly Cost (GB/hr) | Monthly Cost (GB/month) |
---|---|---|
Scheduled Backups - First 30 Days | Free | Free |
Scheduled Backups - Retention (after 30 days) | $0.000274 | $0.02 |
On-Demand Backup Snapshots | $0.000274 | $0.02 |
By default, all scheduled backups are stored for 30 days free of charge. Setting the retention period beyond 30 days will incur additional storage retention charges. Backup retention storage cost is based on the allocated storage size of the DWS instance when the backup was taken and the duration for which you would like to retain each backup beyond 30 days.
For example, if the DWS instance allocated storage is 200GB and the additional retention period is 7 days, the backup storage cost is computed as 200GB x 7 Days = 1,400 GB Days.
1,400 GB days x 24 hours/day x $0.000274 per GB/hr = $9.20
On-demand backups can be taken at any time and will incur backup storage fees immediately. There is a minimum of 30 days billing applied to on-demand backups even if they are deleted within the 30 days.
By default, on-demand backups do not have a retention period set. If you make on-demand backups without a retention period, you must manually delete the backup or backup storage fees will continue to accrue.
If you put a hold on a backup to prevent deletion when the retention period expires, you must remove that hold or manually delete the backup. If the hold remains you will continue to incur backup storage fees.
Premium Capabilities Included
PlaidCloud DWS provides several additional features as part of each DWS instance that provide valuable capabilities without additional fees. Each DWS instance includes MADLib, PostGIS, and PXF.
The MADLib and PostGIS libraries allow you to perform machine learning and geospatial analysis without moving your data or using other external tools. PXF provides the ability to query external data files, whose metadata is not managed by the database. PXF includes built-in connectors for accessing data that exists inside HDFS files, Hive tables, HBase tables, JDBC-accessible databases and more. Users can also create their own connectors to other data storage or processing engines.
3 - Greenplum Technical Resource Links
Greenplum Introductory Videos
Greenplum Tutorial for Beginners
Greenplum Technical Fundamentals
Greenplum Advanced Usage Concepts
Data Warehouse Modernization with Greenplum
Use of Greenplum with External Tables
Using Greenplum as a Consolidated Data Lake for Analytics
Greenplum Storage Considerations
Understanding Data Distribution in Greenplum