Here, we discuss PySpark interview questions that interviewers commonly ask for Data Engineer positions requiring 1 to 6 years of experience.
1. What is PySpark?
PySpark is Python’s API for Apache Spark, enabling distributed data processing.
It handles large datasets using RDDs, DataFrames, SQL, and MLlib, offering scalability, fault tolerance, and in-memory computation via lazy evaluation.
2. PySpark Interview Topics
- RDDs (immutable, partitioned datasets) vs. DataFrames (optimized via Catalyst).
- Transformations (e.g., map, filter) vs. Actions (e.g., collect), with lazy evaluation and DAG execution.
- Cluster architecture: Driver coordinates tasks; executors perform computations.
- Shuffling during wide transformations (groupBy, join).
- Optimizations: Caching, partitioning, broadcast variables, and tuning parallelism via repartition.
- Fault tolerance through RDD lineage.
- Spark SQL for querying structured data.
- Handling data skew (salting, bucketing).
- In-memory processing vs. Hadoop’s disk-based model.
- UDFs and performance impacts.
- Spark Streaming basics.

3. Interview Questions for PySpark
3.1 Basic PySpark Questions
1. What is PySpark, and how does it fit into the Apache Spark ecosystem?
PySpark is Python’s API for Apache Spark, enabling Python integration with Spark’s distributed computing engine.
Using familiar syntax, Python developers can leverage Spark’s capabilities (batch/stream processing, SQL, ML).
PySpark sits atop Spark’s JVM core, translating Python code into Spark jobs. It integrates seamlessly with Spark’s libraries (Spark SQL, MLlib), bridging Python’s ease of use with Spark’s scalability for big data tasks.
2. Explain the difference between ‘DataFrame’ and ‘RDD’ in PySpark. When would you use each?
RDDs (Resilient Distributed Datasets) are low-level, immutable distributed collections with no schema. DataFrames are structured, schema-enabled abstractions optimized via Catalyst.
Use DataFrames for SQL-like queries, optimizations, and structured data.
RDDs are better suited for unstructured data or custom Python logic that does not fit the DataFrame API (e.g., complex record-level transformations).
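A minimal sketch of creating both abstractions, assuming an active SparkSession named spark (the sample data is illustrative):
# RDD: schema-less, manipulated with low-level functions
rdd = spark.sparkContext.parallelize([("Alice", 30), ("Bob", 25)])
doubled = rdd.map(lambda row: (row[0], row[1] * 2))
# DataFrame: named columns, optimized by Catalyst
df = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])
df.select("name").show()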
3. How do you create a PySpark DataFrame from a CSV file?
Use spark.read.csv with options.
header reads the first row as column names; inferSchema detects data types (this can be slow; prefer explicit schemas for large datasets, as sketched below).
df = spark.read.csv("path.csv", header=True, inferSchema=True)
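For large files, a hedged sketch of supplying an explicit schema instead of relying on inferSchema (the field names and types here are hypothetical; adjust them to your data):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Hypothetical schema for the CSV file
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
df = spark.read.csv("path.csv", header=True, schema=schema)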
4. Explain how Spark SQL integrates with Spark’s core APIs.
Spark SQL unifies Spark’s APIs, allowing SQL queries on DataFrames. DataFrames can be queried as temporary SQL views, interoperating with RDDs.
The Catalyst optimizer applies SQL-like optimizations across all APIs (e.g., DataFrame transformations), enabling hybrid workflows (SQL + Python code).
5. How do you run SQL queries on DataFrames using Spark SQL?
Spark SQL executes queries on DataFrames, enabling ANSI SQL syntax within PySpark.
df.createOrReplaceTempView("table")
result = spark.sql("SELECT * FROM table")
6. What are the different types of joins in PySpark SQL, and how are they implemented?
PySpark supports inner, outer, left, right, and cross joins.
Joins trigger shuffles; optimize using broadcast joins for small datasets.
df1.join(df2, on="key", how="inner")
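A brief sketch of other join types on the same hypothetical df1/df2, assuming a shared column named key:
# Left join keeps all rows from df1, filling unmatched df2 columns with null
left_joined = df1.join(df2, on="key", how="left")
# Full outer join keeps unmatched rows from both sides
outer_joined = df1.join(df2, on="key", how="outer")
# Cross join produces the Cartesian product (use sparingly)
crossed = df1.crossJoin(df2)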
7. How do you perform aggregations in PySpark using ‘groupBy()’ and ‘agg()’?
groupBy() groups rows;
agg() applies aggregate functions (import them from pyspark.sql.functions so they are not shadowed by Python built-ins):
from pyspark.sql import functions as F
df.groupBy("dept").agg(F.sum("salary").alias("total"))
8. Explain the use of PySpark SQL functions like ‘filter()’, ‘select()’, and ‘withColumn()’.
- filter(): Filters rows (WHERE clause).
- select(): Projects columns.
- withColumn(): Adds/updates columns (e.g., df.withColumn("new_col", col("old") * 2)); see the combined sketch below.
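A minimal sketch chaining all three on a hypothetical employees DataFrame df with columns name, dept, and salary:
from pyspark.sql.functions import col
result = (
    df.filter(col("salary") > 50000)             # keep rows matching a condition
      .select("name", "dept", "salary")          # project only the needed columns
      .withColumn("bonus", col("salary") * 0.1)  # derive a new column
)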
9. How do you use the `distinct()` function in PySpark to remove duplicates?
Removes duplicates: df.distinct() or df.dropDuplicates()
distinct() considers all columns; to deduplicate on specific columns, use df.dropDuplicates(subset=["col1", "col2"]).
3.2 Intermediate PySpark Questions
10. What are the benefits of using ‘Broadcast Joins’ in PySpark, and when should you use them?
Broadcast joins replicate the smaller DataFrame to every executor, avoiding a shuffle. Use them when one dataset is small enough to fit in memory (the default auto-broadcast threshold is 10 MB).
Optimizes join performance:
from pyspark.sql.functions import broadcast
df1.join(broadcast(df2), "key")
11. How does partitioning in PySpark affect query performance?
Partitioning distributes data across nodes. Proper partitioning (e.g., by key) reduces shuffles in joins/aggregations, improving parallelism.
Skewed partitions degrade performance.
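As one illustration, partitioning data on disk by a frequently filtered column lets Spark prune partitions at read time (the path and column name are hypothetical):
# Write partitioned by a commonly filtered column
df.write.partitionBy("event_date").parquet("output/events")
# Reads that filter on event_date only scan the matching partitions
events = spark.read.parquet("output/events").filter("event_date = '2024-01-01'")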
12. Explain the purpose of window functions in PySpark, and provide examples of how to use ‘ROW_NUMBER()’ and ‘RANK()’.
Window Functions: Perform row-wise calculations over partitions.
ROW_NUMBER() assigns unique IDs;
RANK() skips values after ties.
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, row_number

window = Window.partitionBy("dept").orderBy("salary")
# rank(): ties share a rank, and a gap follows the tie
df_with_rank = df.withColumn("rank", rank().over(window))
# row_number(): unique sequential number within each partition
df_with_rownum = df.withColumn("row_number", row_number().over(window))
13. How do you optimize PySpark SQL queries when dealing with large datasets?
- Partition data.
- Use broadcast joins.
- Cache reused DataFrames.
- Avoid unnecessary shuffles; tune spark.sql.shuffle.partitions (see the sketch after this list).
- Use columnar formats (Parquet).
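A hedged sketch combining a few of these levers (the values, paths, and column names are illustrative, not recommendations):
# Tune the number of shuffle partitions used by joins/aggregations
spark.conf.set("spark.sql.shuffle.partitions", "200")
# Cache a DataFrame that is reused across several queries
filtered = df.filter("amount > 0").cache()
# Store results in a columnar format for efficient later reads
filtered.write.mode("overwrite").parquet("output/filtered")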
14. What is a ‘cache()’ function in PySpark, and when would you use it?
Persists DataFrame in memory (or disk) for reuse. Use when a DataFrame is accessed multiple times:
# cache the DataFrame in memory
df.cache()
15. How do you handle missing or null values in PySpark using ‘fillna()’, ‘dropna()’, and ‘replace()’?
- fillna(): Replace nulls (e.g., df.fillna(0))
- dropna(): Remove rows with nulls
- replace(): Substitute values (e.g., df.replace("old", "new")); a combined sketch follows below.
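A minimal sketch of all three on a hypothetical DataFrame with columns age and city:
df_filled = df.fillna({"age": 0, "city": "unknown"})  # per-column defaults for nulls
df_clean = df.dropna(subset=["age"])                  # drop rows where age is null
df_fixed = df.replace("N/A", None, subset=["city"])   # treat a sentinel string as null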
16. Explain the difference between ‘repartition()’ and ‘coalesce()’ in PySpark.
- repartition(): Full shuffle to change partition count (e.g., df.repartition(100)).
- coalesce(): Reduces the number of partitions without a full shuffle (e.g., df.coalesce(1)).
3.3 Advanced PySpark Questions
17. What are UDFs (User Defined Functions) in PySpark, and how are they used in SQL queries?
User Defined Functions (UDFs) let you apply custom Python logic inside DataFrame operations and SQL queries.
UDFs may impact performance due to serialization overhead.
from pyspark.sql.functions import udf
my_udf = udf(lambda x: x * 2)          # wrap a Python lambda as a UDF
spark.udf.register("sql_udf", my_udf)  # register it for use in SQL
spark.sql("SELECT sql_udf(col) FROM table")
18. How do you manage large datasets using PySpark’s DataFrame API?
Managing Large Datasets: Use partitioning, efficient file formats (Parquet), predicate pushdown, and avoid collect(). Leverage lazy evaluation and Catalyst optimizations.
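A hedged sketch of these ideas (the paths and column names are hypothetical): read a columnar source, select only the needed columns, push filters toward the source, and write out results instead of collecting them to the driver.
df = spark.read.parquet("warehouse/sales")     # columnar source enables pruning/pushdown
result = (
    df.select("region", "amount")              # column pruning
      .filter("amount > 1000")                 # predicate pushed toward the source
)
result.write.parquet("warehouse/large_sales")  # avoid collect() on large results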
19. What is the purpose of ‘window()’ in PySpark SQL, and how does it compare to regular SQL window functions?
In PySpark SQL, the Window function provides a way to define window specifications, which are used in conjunction with window functions like row_number(), rank(), dense_rank(), sum(), avg(), etc.
-- Window function in SQL
SELECT name, dept, salary,
       ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary) AS row_number
FROM employees;

# Window function in PySpark
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
window_spec = Window.partitionBy("dept").orderBy("salary")
df.withColumn("row_number", row_number().over(window_spec))
20. How do you use PySpark’s Catalyst optimizer to improve the performance of SQL queries?
Catalyst transforms logical plans (SQL/DataFrame ops) into optimized physical plans via rule-based (predicate pushdown) and cost-based optimizations (join reordering).
Tungsten accelerates execution with binary data formats.
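Catalyst runs automatically, but you can inspect the plans it produces; a minimal sketch using explain(), assuming a DataFrame df with dept and salary columns:
# Prints the parsed, analyzed, and optimized logical plans plus the physical plan
df.filter("salary > 50000").groupBy("dept").count().explain(True)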
21. What is the difference between ‘persist()’ and ‘cache()’ in PySpark, and how do they improve performance?
cache() uses default storage (memory).
persist() allows levels (e.g., MEMORY_AND_DISK).
Use to avoid recomputation in iterative workflows.
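A minimal sketch of both, assuming df is reused several times downstream (the dept column is hypothetical):
from pyspark import StorageLevel
# cache(): uses the default storage level
df.cache()
# persist(): choose an explicit storage level; spills to disk if memory is insufficient
agg_df = df.groupBy("dept").count().persist(StorageLevel.MEMORY_AND_DISK)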
22. Explain how Spark SQL’s query optimization works behind the scenes, and how it leverages the Catalyst optimizer and Tungsten execution engine.
Catalyst generates optimized execution plans. Tungsten optimizes memory/CPU via codegen and off-heap data structures, improving speed for structured operations.
23. How do you debug and tune PySpark jobs for performance using the Spark UI and logs?
Use Spark UI to analyze stages/tasks, identify skew (uneven task durations), spills (disk usage), and GC overhead.
Tune via partition sizing, memory configuration (spark.executor.memory), and shuffle settings.
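For example, a hedged sketch of a few common knobs set when building a session (the values are illustrative only, not tuning advice):
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("tuned-job")
    .config("spark.executor.memory", "4g")          # per-executor heap
    .config("spark.executor.cores", "4")            # concurrent tasks per executor
    .config("spark.sql.shuffle.partitions", "200")  # shuffle parallelism
    .getOrCreate()
)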