PySpark and Spark Compute Clusters

Build and deploy PySpark applications and use Spark compute clusters for large-scale data analysis

1 - Getting Started with PySpark

Using PySpark in UDFs and Notebooks

PySpark Documentation

PySpark is similar to Pandas but supports distributed computation, so it is not limited by the RAM of a single machine. PySpark is available in both UDFs and Jupyter Notebooks.
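For orientation, the sketch below shows typical PySpark DataFrame usage. The file path and column names are placeholders, and in a notebook or UDF a SparkSession may already be provided for you.

```python
# A minimal sketch of PySpark DataFrame usage (paths and column names are illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a SparkSession; in a notebook or UDF one may already exist.
spark = SparkSession.builder.appName("example").getOrCreate()

# Read a CSV file into a distributed DataFrame (path is a placeholder).
df = spark.read.csv("data/sales.csv", header=True, inferSchema=True)

# Transformations are lazy and execute across the cluster when an action runs.
summary = (
    df.groupBy("region")
      .agg(F.sum("amount").alias("total_amount"))
      .orderBy("total_amount", ascending=False)
)

summary.show()

# Small results can be pulled back into a Pandas DataFrame for local work.
pdf = summary.toPandas()
```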

Spark Cluster

By default, workspaces do not have the Spark cluster enabled. To activate the Spark Cluster, go to the Workspace management app and enable the "Spark Compute Cluster" service.

Once activated, Spark jobs can be submitted to the cluster.
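As an illustrative sketch only, one way to run work on a remote cluster is to build a SparkSession against its master URL; the master address, port, and configuration keys below are assumptions, and the workspace may instead inject a preconfigured session.

```python
# Hypothetical sketch: connect a SparkSession to the workspace's Spark cluster.
# The master URL and resource settings are placeholders; consult the workspace
# documentation for the actual connection details.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("my_spark_job")
    .master("spark://spark.my_workspace.plaid.cloud:7077")  # placeholder master URL
    .config("spark.executor.memory", "4g")                  # example resource setting
    .getOrCreate()
)

# Work submitted through this session runs on the cluster's executors.
df = spark.range(1_000_000).selectExpr("id", "id % 10 AS bucket")
df.groupBy("bucket").count().show()

spark.stop()
```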

The cluster can be monitored from the spark subdomain of the workspace (e.g., https://spark.my_workspace.plaid.cloud).