PySpark Interview Questions

Here, we will discuss PySpark interview questions that interviewers commonly ask for Data Engineer positions.

1. What is PySpark?

PySpark is Python’s API for Apache Spark, enabling distributed data processing.
It handles large datasets using RDDs, DataFrames, SQL, and MLlib, offering scalability, fault tolerance, and in-memory computation via lazy evaluation.

2. PySpark Interview Topics

  • RDDs (immutable, partitioned datasets) vs. DataFrames (optimized via Catalyst).
  • Transformations (e.g., map, filter) vs. Actions (e.g., collect), with lazy evaluation and DAG execution.
  • Cluster architecture: Driver coordinates tasks; executors perform computations.
  • Shuffling during wide transformations (groupBy, join).
  • Optimizations: Caching, partitioning, broadcast variables, and tuning parallelism via repartition.
  • Fault tolerance through RDD lineage.
  • Spark SQL for querying structured data.
  • Handling data skew (salting, bucketing).
  • In-memory processing vs. Hadoop’s disk-based model.
  • UDFs and performance impacts.
  • Spark Streaming basics.
PySpark Interview Questions

1. You need to process a 1 TB dataset on a Spark cluster with 20 nodes, each with 128 GB of memory and 32 cores. What’s the best way to allocate resources?

Step-by-step approach to allocating resources in a Spark cluster for processing a 1 TB dataset:

  1. Understand Your Cluster
    • Each node has 128 GB of memory and 32 cores.
    • Total: 20 nodes available.
  2. Allocate Memory Smartly
    • Keep 10% of memory for Spark’s internal processes.
    • Usable Memory per Node: ~115 GB (128 x 0.9)
  3. Distribute Executors Efficiently
    • Use 3 executors per node to balance memory and processing power.
    • Memory per Executor: ~38 GB (115 GB ÷ 3).
    • Cores per Executor: 5 (to avoid overhead).
  4. Total Executors in the Cluster
    • Total number of executors: 60 (3 executors x 20 nodes).
  5. Optimize Spark Configuration
    • spark.executor.memory: 38g
    • spark.executor.cores: 5
    • spark.executor.instances: 60
    • spark.driver.memory: 64g (adjust as needed)
    • spark.driver.cores: 8
  6. Enable Dynamic Allocation (if workload varies)
    • Allows Spark to automatically adjust resources.
    • spark.dynamicAllocation.enabled = true
    • spark.dynamicAllocation.minExecutors = 30
    • spark.dynamicAllocation.maxExecutors = 60
  7. Adjust Shuffle Partitions
    • Tuning the number of shuffle partitions improves performance.
    • spark.sql.shuffle.partitions: 300 (or based on job needs).
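
As a minimal sketch, the same sizing could be applied when building the SparkSession (the app name is hypothetical; in practice these values, especially driver settings, are usually passed via spark-submit rather than set in code):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("OneTBJob")
    .config("spark.executor.memory", "38g")
    .config("spark.executor.cores", "5")
    .config("spark.executor.instances", "60")
    .config("spark.driver.memory", "64g")     # typically supplied at submit time
    .config("spark.driver.cores", "8")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "30")
    .config("spark.dynamicAllocation.maxExecutors", "60")
    .config("spark.sql.shuffle.partitions", "300")
    .getOrCreate()
)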

2. How would you optimize a PySpark DataFrame operation that involves multiple transformations and is running too slowly on a large dataset?

To optimize a PySpark DataFrame operation with multiple transformations, consider using techniques like caching intermediate results with .cache() or .persist(), minimizing shuffles by using operations like filter and select early, and leveraging built-in functions instead of UDFs.
Additionally, reviewing the execution plan with .explain() can help identify bottlenecks.
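
A minimal sketch of these techniques (the source path and column names are hypothetical):

from pyspark.sql import functions as F

df = spark.read.parquet("s3://bucket/events/")      # hypothetical source
df = df.select("user_id", "amount", "country")      # project only the needed columns early
df = df.filter(F.col("country") == "US")            # filter early to shrink later shuffles

df.cache()                                          # reuse the intermediate result across actions
agg = df.groupBy("user_id").agg(F.sum("amount").alias("total"))
agg.explain()                                       # inspect the physical plan for bottlenecks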

3. Given a large dataset that doesn’t fit in memory, how would you convert a Pandas DataFrame to a PySpark DataFrame for scalable processing?

To convert a large Pandas DataFrame to a PySpark DataFrame, use the spark.createDataFrame(pandas_df) method.
However, if the dataset is too large to fit into memory, consider processing it in chunks (for example with Dask or chunked pandas reads) and converting each chunk to a PySpark DataFrame iteratively, or read the data directly from a distributed file system such as HDFS or S3.
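
A minimal sketch, assuming a SparkSession named spark and a Pandas DataFrame pandas_df that fits in driver memory:

# Arrow speeds up the Pandas-to-Spark conversion
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
sdf = spark.createDataFrame(pandas_df)

# If the data is too large for the driver, read the source directly with Spark instead
sdf = spark.read.parquet("s3://bucket/large_dataset/")   # hypothetical path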

4. You have a large dataset with a highly skewed distribution. How would you handle data skewness in PySpark to ensure that your jobs do not fail or take too long to execute?

To handle data skewness in PySpark, you can use techniques such as salting, where you add a random value to the key to distribute data more evenly across partitions.
Additionally, consider using the sample() function to process a representative subset of the data or adjusting the number of partitions with repartition() to balance the workload.
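
A minimal salting sketch for a skewed aggregation (the DataFrame df and its columns are hypothetical):

from pyspark.sql import functions as F

num_salts = 10
salted = df.withColumn("salt", (F.rand() * num_salts).cast("int"))

# Aggregate in two stages: first on (key, salt), then on the key alone
partial = salted.groupBy("customer_id", "salt").agg(F.sum("amount").alias("partial_sum"))
result = partial.groupBy("customer_id").agg(F.sum("partial_sum").alias("total"))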

5. How do you optimize data partitioning in PySpark? When and how would you use repartition() and coalesce()?

Optimizing data partitioning in PySpark involves using repartition() to increase the number of partitions for better parallelism and coalesce() to decrease the number of partitions without a full shuffle, which is more efficient.
Use repartition() when you need to increase parallelism and coalesce() when reducing partitions post-filtering to avoid unnecessary shuffling.
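
A minimal sketch with a hypothetical DataFrame df and output path:

df_wide = df.repartition(200, "customer_id")    # full shuffle; more partitions, hashed by key
filtered = df_wide.filter(df_wide.amount > 100)
df_narrow = filtered.coalesce(50)               # merge partitions after filtering, no full shuffle
df_narrow.write.parquet("s3://bucket/output/")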

6. Write a PySpark code snippet to calculate the moving average of a column for each partition of data, using window functions.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName("MovingAverage").getOrCreate()
df = spark.createDataFrame([(1, 10), (2, 20), (3, 30)], ["id", "value"])

# Average over the previous row and the current row (a 2-row moving average);
# partitionBy() with no columns treats the whole DataFrame as a single partition
window_spec = Window.partitionBy().orderBy("id").rowsBetween(-1, 0)

df_with_moving_avg = df.withColumn("moving_avg", avg("value").over(window_spec))
df_with_moving_avg.show()

7. How would you handle null values in a PySpark DataFrame when different columns require different strategies (e.g., dropping, replacing, or imputing)?

Handling null values in a PySpark DataFrame can be done using different strategies for different columns.

  1. Use .dropna(subset=[...]) to remove rows that have nulls in required columns.
  2. Use .fillna(value), or a dict of column-to-value mappings, to replace nulls with specific values.
  3. For forward filling, use last() with ignorenulls=True over an ordered window, since PySpark’s fillna() has no method='ffill' option.

You can apply these methods conditionally based on column names.
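
A minimal sketch applying a different strategy per column (the column names are hypothetical):

from pyspark.sql import Window
from pyspark.sql import functions as F

df = df.dropna(subset=["order_id"])                    # drop rows missing a required key
df = df.fillna({"country": "unknown", "amount": 0})    # column-specific replacement values

# Forward fill via a window, since fillna() has no method argument in PySpark
w = (Window.partitionBy("user_id")
     .orderBy("event_time")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df = df.withColumn("status", F.last("status", ignorenulls=True).over(w))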

8. When would you use a broadcast join in PySpark? Provide an example where broadcasting improves performance and explain the limitations.

A broadcast join is beneficial when one of the DataFrames is small enough to fit in memory, allowing it to be sent to all worker nodes, thus reducing shuffling.
For example, if you have a large sales DataFrame and a small lookup table for product details, broadcasting the lookup table can improve performance.
Limitations of broadcast join include the size of the DataFrame being broadcast; if it’s too large, it can lead to memory issues.
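
A minimal sketch, assuming a large sales DataFrame and a small products lookup table:

from pyspark.sql.functions import broadcast

joined = sales.join(broadcast(products), on="product_id", how="left")

# The automatic broadcast threshold (default ~10 MB) can also be tuned
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)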

9. When should you use UDFs instead of built-in PySpark functions, and how do you ensure UDFs are optimized for performance?

Use UDFs when built-in PySpark functions do not meet your needs, particularly for complex transformations.
To optimize UDFs, ensure they are written in a way that minimizes serialization overhead, such as using vectorized UDFs with Pandas UDFs, and avoid using them for operations that can be achieved with built-in functions, as they are generally more efficient.
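
A minimal sketch of a vectorized (Pandas) UDF, which requires PyArrow; the column names are hypothetical:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    # operates on whole batches instead of one row at a time
    return c * 9.0 / 5.0 + 32.0

df = df.withColumn("temp_f", celsius_to_fahrenheit("temp_c"))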

10. Describe the difference between a DataFrame and an RDD in PySpark.

  • A DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database.
    It provides a higher-level abstraction and is optimized for performance.
  • An RDD (Resilient Distributed Dataset) is a lower-level abstraction that represents a distributed collection of objects and lacks the optimizations and convenience features of DataFrames.

11. Describe your experience with PySpark SQL.

PySpark SQL is used to perform complex queries on large datasets, leveraging its ability to integrate SQL-like syntax for querying DataFrames.
It is best for data analysis, transformation, and reporting, enabling efficient querying and manipulation of big data.

12. How do you execute SQL queries on PySpark DataFrames?

To execute SQL queries on PySpark DataFrames, you first register the DataFrame as a temporary view using createOrReplaceTempView().
Then, you can use the spark.sql() method to run SQL queries against that view, returning the result as a DataFrame.
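
A minimal sketch with a hypothetical DataFrame df:

df.createOrReplaceTempView("sales")
result = spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country")
result.show()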

13. Can you explain transformations and actions in PySpark DataFrames?

Transformations are operations that create a new DataFrame from an existing one (e.g., filter, select, groupBy), while actions are operations that trigger computation and return a result (e.g., show, count, collect).
Transformations are lazy, meaning they only execute when an action is called.
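
A quick illustration with a hypothetical DataFrame df; nothing executes until the action on the last line:

subset = df.filter(df.amount > 100).select("user_id", "amount")   # transformations (lazy)
subset.count()                                                    # action: triggers execution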

14. How do you optimize the performance of PySpark jobs?

Performance optimization techniques include using DataFrame APIs over RDDs, caching intermediate DataFrames, optimizing shuffles, using partitioning effectively, and tuning Spark configurations such as executor memory and cores.

15. How do you handle skewed data in PySpark?

Techniques for handling skewed data include salting keys to distribute load evenly, using coalesce() or repartition() to adjust partitions, and employing custom partitioners to manage data distribution.

16. Explain how data serialization works in PySpark.

Data serialization in PySpark refers to converting objects into a format that can be efficiently transmitted over the network or stored. Spark serializes JVM objects with Java serialization or the faster, more compact Kryo serializer, while Python objects are serialized with pickle (and Apache Arrow accelerates Pandas conversions).
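
A minimal sketch of enabling Kryo for JVM-side serialization when creating the session (the app name is hypothetical):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("KryoExample")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)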

17. Discuss the significance of choosing the right compression codec for your PySpark applications.

Choosing the right compression codec can significantly affect storage requirements and I/O performance.
Common codecs include Snappy, Gzip, and LZ4, each with different trade-offs in terms of compression ratio and speed.
Selecting the right codec can enhance data processing performance.
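
A minimal sketch of choosing a codec when writing Parquet (the paths are hypothetical):

df.write.option("compression", "snappy").parquet("s3://bucket/output_snappy/")

# Or set the default codec for all Parquet writes
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")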

18. Have you integrated PySpark with other big data technologies or databases? If so, please provide examples.

PySpark is integrated with technologies like Hadoop for data storage, Kafka for real-time data streaming, and databases like Cassandra and MongoDB for data persistence.
This integration allows for seamless data flow and processing across platforms.

19. How do you handle data transfer between PySpark and external systems?

Data transfer can be handled using connectors and APIs provided by PySpark.
For example, the DataFrame read and write methods can interact with various data sources such as HDFS, S3, and JDBC.
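
A minimal sketch of reading from and writing to external systems (paths, table names, and credentials are hypothetical):

df = spark.read.csv("s3a://bucket/raw/data.csv", header=True, inferSchema=True)
df.write.parquet("hdfs:///warehouse/clean/")

jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://host:5432/mydb")
           .option("dbtable", "public.orders")
           .option("user", "etl_user")
           .option("password", "secret")
           .load())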

20. How do you deal with missing or null values in PySpark DataFrames?

Missing values can be handled using methods like dropna() to remove them or fillna() to replace them with a specified value or statistic (e.g., mean, median).

21. What is broadcasting, and how is it useful in PySpark?

Broadcasting is a mechanism for efficiently sharing a read-only variable (such as a lookup table) with all worker nodes instead of shipping a copy with every task.
This is useful for joining large DataFrames with smaller ones, reducing data transfer overhead and improving performance.

22. Explain the architecture of PySpark. How does it differ from traditional Hadoop MapReduce?

PySpark uses a master-slave architecture in which a driver (master) manages executors (slaves) for distributed processing. It leverages in-memory computation and Directed Acyclic Graph (DAG) execution for iterative tasks, unlike Hadoop MapReduce’s disk-based, two-phase (map-shuffle-reduce) approach. This reduces I/O overhead, enabling faster processing for iterative algorithms (e.g., ML). Spark’s DAG scheduler optimizes workflows, while Hadoop’s rigid map-reduce stages incur latency due to frequent disk writes.

23. What are the main components of PySpark, and what roles do they play in the Spark ecosystem?

The list of PySpark main components is:

  • Spark Core: Manages task scheduling, memory, and RDDs.
  • Spark SQL: Processes structured data via DataFrames.
  • MLlib: Provides scalable ML algorithms.
  • Spark Streaming: Handles real-time data.
  • GraphX: For graph processing.

PySpark wraps these components in Python, enabling distributed computing with Python APIs.

24. How does PySpark handle data partitioning? Can you explain how partitioning affects performance in PySpark applications?

Partitions split data across nodes. Proper partitioning (e.g., using repartition()) minimizes data shuffling, enhancing parallelism.
Skewed partitions cause slower tasks (stragglers), while balanced partitions improve performance. Partitioning is critical for joins/aggregations to avoid network overhead.
Use partitionBy for key-based operations.
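
A minimal sketch with a hypothetical DataFrame df and output path:

df = df.repartition(200, "customer_id")   # hash-partition by the key used in joins/aggregations

df.write.partitionBy("event_date").parquet("s3://bucket/events/")   # partitioned layout on disk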

25. What are some common transformations and actions in PySpark, and when would you use each? Can you provide examples?

Transformations (e.g., map, filter) define lazy operations, building execution plans. Actions (e.g., count, collect) trigger computation.
Use transformations to build workflows and actions to materialize results.
Example:

# Transformation (lazy; sc is the SparkContext, e.g. spark.sparkContext)
rdd = sc.parallelize([1, 2, 3]).map(lambda x: x * 2)

# Action (executes the pipeline)
rdd.count()

26. Explain the concept of RDD (Resilient Distributed Dataset) in PySpark. How does it achieve fault tolerance?

RDDs (Resilient Distributed Datasets) are immutable, distributed collections. Fault tolerance is achieved via lineage: each RDD tracks transformations used to build it.
Lost partitions are recomputed from lineage, avoiding data redundancy.
Example: An RDD from file.map(…).filter(…) can rebuild partitions using the original file and recorded steps.
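
A minimal sketch of inspecting lineage (sc is the SparkContext; the path is hypothetical):

rdd = (sc.textFile("hdfs://path/file.txt")
       .map(lambda line: line.split(","))
       .filter(lambda fields: len(fields) > 1))

lineage = rdd.toDebugString()   # describes the chain of transformations Spark can replay on failure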

27. What are DataFrame and Dataset APIs in PySpark? How do they differ from RDDs, and when would you choose to use them?

DataFrames are structured collections with a schema, optimized via Catalyst. RDDs are low-level and unstructured.
DataFrames enable SQL-like queries and efficient execution (columnar storage); the typed Dataset API is available only in Scala and Java, so in PySpark you work with DataFrames and RDDs.
Use DataFrames for structured data (CSV, JSON) and RDDs for unstructured data or fine-grained control.

28. Discuss the advantages and disadvantages of using PySpark over traditional SQL-based data processing.

Pros: PySpark handles complex ETL, ML, and streaming; scales for big data.
Cons: Overhead for small data; steeper learning curve.
SQL is declarative and simple for queries, but lacks PySpark’s flexibility for advanced workflows.

29. How does PySpark optimize query execution? Can you explain the Catalyst optimizer and its role in PySpark?

Catalyst optimizes query plans via logical/physical optimizations (e.g., predicate pushdown, join reordering).
It converts DataFrame operations into optimized physical plans, improving execution speed.
For example, filtering data early reduces I/O.
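
A minimal sketch of inspecting the optimized plan (the path and columns are hypothetical):

df = spark.read.parquet("s3://bucket/events/")
query = df.filter(df.country == "US").select("user_id", "amount")
query.explain(True)   # shows the logical plans, Catalyst’s optimized plan, and the physical plan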

30. What are the different deployment modes available for PySpark applications, and when would you choose each?

  • Local: Testing on a single machine.
  • Standalone: Spark cluster without resource managers.
  • YARN: Hadoop integration.
  • Kubernetes: Cloud-native deployments.

Choose YARN for Hadoop ecosystems, Kubernetes for cloud scalability, and standalone for simplicity.
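
As a minimal sketch, local mode can be selected directly in code for testing, while cluster modes are normally chosen via spark-submit --master:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("LocalTest").getOrCreate()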

31. Explain how PySpark integrates with other technologies like Hadoop, Hive, and Kafka. Can you provide examples of using PySpark with each of these technologies?

  • Hadoop: Read/write from HDFS.
    sc.textFile("hdfs://path")
  • Hive: Query tables via HiveContext.
    df = spark.sql("SELECT * FROM hive_table")
  • Kafka: Stream data with spark.readStream.format("kafka").

32. Discuss some common performance tuning techniques for PySpark applications. How would you optimize PySpark jobs for better performance?

  • Cache reused DataFrames: df.cache()
  • Avoid wide transformations (e.g., groupByKey).
  • Tune partitions: spark.conf.set("spark.sql.shuffle.partitions", 200)
  • Use broadcast joins for small tables.

33. What are broadcast variables and accumulators in PySpark? How do they contribute to improving performance and managing shared states in distributed computations?

  • Broadcast: Efficiently distribute read-only data (e.g., lookup tables) to executors.
    broadcast_var = sc.broadcast({1: "a"})
  • Accumulators: Aggregate values across workers (e.g., counters).
    accum = sc.accumulator(0); rdd.foreach(lambda x: accum.add(1))