
PySpark Overview — PySpark 3.5.5 documentation - Apache Spark
PySpark is the Python API for Apache Spark. It enables you to perform real-time, large-scale data processing in a distributed environment using Python. It also provides a PySpark shell for interactively analyzing your data.
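A minimal sketch of that entry point: create a SparkSession (the gateway to the DataFrame and SQL APIs) and run a trivial computation. The app name here is arbitrary.

```python
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; appName is an arbitrary label.
spark = SparkSession.builder.appName("overview-demo").getOrCreate()

df = spark.range(10)  # DataFrame with one "id" column holding 0..9
print(df.selectExpr("sum(id)").first()[0])  # prints 45

spark.stop()
```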
Getting Started — PySpark 3.5.5 documentation - Apache Spark
Quickstart: Spark Connect: launch the Spark server with Spark Connect, connect to the server, and create a DataFrame (a connection sketch follows below).
Quickstart: Pandas API on Spark: object creation, missing data, operations, grouping, plotting, and getting data in/out.
Testing PySpark: build a PySpark application, test your PySpark application, and put it all together.
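The Spark Connect quickstart boils down to pointing a session builder at a running server. A hedged sketch, assuming a Spark Connect server is already listening on the default port 15002 (for example, one started with ./sbin/start-connect-server.sh):

```python
from pyspark.sql import SparkSession

# Connect as a thin client to a Spark Connect server (assumed running).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"]).show()
```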
Installation — PySpark 3.5.5 documentation - Apache Spark
PySpark is included in the official releases of Spark available on the Apache Spark website. For Python users, PySpark also offers pip installation from PyPI. The pip package is typically for local use, or for acting as a client that connects to an existing cluster, rather than for setting up a cluster itself.
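After a pip installation, a quick sanity check is to import the package and print its version. A minimal sketch:

```python
# Assumes `pip install pyspark` has already run in this environment.
import pyspark

print(pyspark.__version__)  # e.g. "3.5.5"
```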
Quick Start - Spark 3.5.5 Documentation - Apache Spark
We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write applications in Java, Scala, and Python. To follow along with this guide, first download a packaged release of Spark from the Spark website.
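In the spirit of that guide, the snippet below mirrors the shell session: in ./bin/pyspark a `spark` session is predefined, so it is created explicitly here only so the code also runs as a standalone script. "README.md" stands in for any local text file.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a text file into a DataFrame with one row per line.
textFile = spark.read.text("README.md")
print(textFile.count())  # number of lines
print(textFile.first())  # first line as a Row
```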
User Guides — PySpark 3.5.5 documentation - Apache Spark
There are also basic programming guides covering multiple languages in the Spark documentation, including the Spark SQL, DataFrames and Datasets Guide; the Structured Streaming Programming Guide; and the Machine Learning Library (MLlib) Guide.
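As a taste of the first guide on that list, a small illustrative Spark SQL snippet: register a DataFrame as a temporary view and query it with SQL (the data and view name are invented for the example).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# Expose the DataFrame to SQL under a temporary view name.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()
```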
Quickstart: DataFrame — PySpark 3.5.5 documentation - Apache Spark
PySpark supports various UDFs and APIs that let users execute Python native functions. See also the latest Pandas UDFs and Pandas Function APIs. For instance, the example below lets users apply pandas Series APIs directly inside a Python native function.
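A sketch of such a pandas UDF: the decorated function receives a pandas Series per batch of rows, so vectorized pandas operations apply directly (the column and data are illustrative).

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])

@pandas_udf("long")  # return type as a DDL string
def plus_one(s: pd.Series) -> pd.Series:
    # Runs on pandas Series batches, not row by row.
    return s + 1

df.select(plus_one("value")).show()
```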
API Reference — PySpark 3.5.5 documentation - Apache Spark
This page gives an overview of all public PySpark modules, classes, functions, and methods. The Pandas API on Spark follows the API specifications of the latest pandas release.
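For example, pyspark.pandas exposes a DataFrame that mirrors pandas while executing on Spark; a minimal sketch (data invented):

```python
import pyspark.pandas as ps

# Looks like pandas, runs on Spark.
psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(psdf.head(2))
print(psdf["a"].sum())  # 6
```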
Overview - Spark 3.5.5 Documentation - Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
Examples - Apache Spark
This page shows you how to use different Apache Spark APIs with simple examples. Spark handles both small and large datasets, and can be used in single-node (localhost) environments or on distributed clusters.
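A classic example in that spirit is a word count; the sketch below runs unchanged on localhost or a cluster ("README.md" is a placeholder path).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Split each line on whitespace, one word per row, then count.
words = (spark.read.text("README.md")
         .select(explode(split(col("value"), r"\s+")).alias("word"))
         .where(col("word") != ""))
words.groupBy("word").count().orderBy(col("count").desc()).show(10)
```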
DataFrame — PySpark 3.5.5 documentation - Apache Spark
DataFrame.mapInPandas(func, schema[, barrier]): Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame.
DataFrame.mapInArrow(func, schema[, barrier]): Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a PyArrow RecordBatch, and returns the result as a DataFrame.
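A brief sketch of mapInPandas: the function takes an iterator of pandas DataFrames (one per batch) and yields pandas DataFrames matching the declared schema (the data and filter are illustrative).

```python
from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

def filter_adults(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Each element is a pandas DataFrame for one batch of rows.
    for pdf in batches:
        yield pdf[pdf.age >= 21]

df.mapInPandas(filter_adults, schema="id long, age long").show()
```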