  1. PySpark Overview — PySpark 3.5.5 documentation - Apache Spark

    Feb 23, 2025 · PySpark is the Python API for Apache Spark. It enables you to perform real-time, large-scale data processing in a distributed environment using Python. It also provides a PySpark shell for interactively analyzing your data.

  2. Getting Started — PySpark 3.5.5 documentation - Apache Spark

    Quickstart: Spark Connect (launch a Spark server with Spark Connect, connect to the server, create a DataFrame); Quickstart: Pandas API on Spark (object creation, missing data, operations, grouping, plotting, getting data in/out); Testing PySpark (build a PySpark application, test it, and put it all together).

  3. Installation — PySpark 3.5.5 documentation - Apache Spark

    PySpark is included in the official releases of Spark available on the Apache Spark website. For Python users, PySpark can also be installed with pip from PyPI. This is usually for local use, or for acting as a client that connects to an existing cluster, rather than for setting up a cluster itself.
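    The pip route described above can be sketched as a minimal setup; this assumes Python and pip are on PATH and that a compatible Java runtime is installed separately for the JVM backend:

    ```shell
    # Install PySpark from PyPI (a sketch, not the only supported route).
    pip install pyspark

    # Optionally pin the documented release and pull the pandas-on-Spark extras.
    pip install "pyspark[pandas_on_spark]==3.5.5"

    # Verify the installation by printing the version.
    python -c "import pyspark; print(pyspark.__version__)"
    ```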

  4. Quick Start - Spark 3.5.5 Documentation - Apache Spark

    We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write applications in Java, Scala, and Python. To follow along with this guide, first download a packaged release of Spark from the Spark website.

  5. User Guides — PySpark 3.5.5 documentation - Apache Spark

    There are also basic programming guides covering multiple languages available in the Spark documentation, including: the Spark SQL, DataFrames and Datasets Guide; the Structured Streaming Programming Guide; and the Machine Learning Library (MLlib) Guide.

  6. Quickstart: DataFrame — PySpark 3.5.5 documentation - Apache …

    PySpark supports various UDFs and APIs that let users execute Python native functions. See also the latest Pandas UDFs and Pandas Function APIs. For instance, the example below lets users use the APIs of a pandas Series directly within a Python native function.
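    As a rough sketch of what such a Python native function looks like, the function below operates on a plain pandas Series with no Spark session involved; the `pandas_udf` registration mentioned in the comment is how PySpark would apply it, and the function name and data here are made up for illustration:

    ```python
    import pandas as pd

    # A Python native function over a pandas Series, applied element-wise.
    # In PySpark, the same function could be registered with
    # pyspark.sql.functions.pandas_udf("long") and applied to a column,
    # with Spark feeding it one pandas Series per Arrow batch.
    def plus_one(s: pd.Series) -> pd.Series:
        return s + 1

    print(plus_one(pd.Series([1, 2, 3])).tolist())  # [2, 3, 4]
    ```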

  7. API Reference — PySpark 3.5.5 documentation - Apache Spark

    API Reference. This page lists an overview of all public PySpark modules, classes, functions and methods. Pandas API on Spark follows the API specifications of the latest pandas release.

  8. Overview - Spark 3.5.5 Documentation - Apache Spark

    Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.

  9. Examples - Apache Spark

    This page shows you how to use different Apache Spark APIs with simple examples. Spark is a great engine for small and large datasets. It can be used with single-node/localhost environments, or distributed clusters.

  10. DataFrame — PySpark 3.5.5 documentation - Apache Spark

    DataFrame.mapInPandas (func, schema[, barrier]): maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame. DataFrame.mapInArrow (func, schema[, barrier])
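    The iterator-of-batches contract described above can be sketched without a Spark session, using plain pandas: the function below has the shape `mapInPandas` expects (an iterator of pandas DataFrames in, an iterator of pandas DataFrames out), with made-up column names and data for illustration:

    ```python
    from typing import Iterator

    import pandas as pd

    # A function with the mapInPandas shape: it receives an iterator of
    # pandas DataFrames (one per Arrow batch) and yields pandas DataFrames.
    def keep_adults(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        for pdf in batches:
            yield pdf[pdf["age"] >= 18]

    # Simulate two batches, as Spark would feed them to the function.
    batches = iter([
        pd.DataFrame({"name": ["a", "b"], "age": [15, 30]}),
        pd.DataFrame({"name": ["c"], "age": [42]}),
    ])
    result = pd.concat(keep_adults(batches), ignore_index=True)
    print(result["name"].tolist())  # ['b', 'c']
    ```

    With a real SparkSession, the same function would be passed to `df.mapInPandas(keep_adults, schema)` along with an output schema, and Spark would handle the batching.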