Here, we will discuss PySpark interview questions that interviewers commonly ask candidates for Data Engineer positions.
1. What is PySpark?
PySpark is Python’s API for Apache Spark, enabling distributed data processing.
It handles large datasets using RDDs, DataFrames, Spark SQL, and MLlib, offering scalability, fault tolerance, in-memory computation, and lazy evaluation.
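As a quick illustration (a minimal sketch with made-up column names), every PySpark program starts from a SparkSession:
from pyspark.sql import SparkSession

# Entry point for DataFrame and SQL functionality
spark = SparkSession.builder.appName("IntroExample").getOrCreate()

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.filter(df.id > 1).show()  # transformations are lazy; show() is the action that triggers execution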
2. PySpark Interview Topics
- RDDs (immutable, partitioned datasets) vs. DataFrames (optimized via Catalyst).
- Transformations (e.g., map, filter) vs. Actions (e.g., collect), with lazy evaluation and DAG execution.
- Cluster architecture: Driver coordinates tasks; executors perform computations.
- Shuffling during wide transformations (groupBy, join).
- Optimizations: Caching, partitioning, broadcast variables, and tuning parallelism via repartition.
- Fault tolerance through RDD lineage.
- Spark SQL for querying structured data.
- Handling data skew (salting, bucketing).
- In-memory processing vs. Hadoop’s disk-based model.
- UDFs and performance impacts.
- Spark Streaming basics.

PySpark Interview Questions for Data Engineers
1. What’s the difference between an RDD, a DataFrame, and a DataSet?
- RDD(Resilient Distributed Dataset)
- It is Spark's fundamental data structure; DataFrames and Datasets are built on top of RDDs.
- If the same set of data needs to be computed again, an RDD can be cached and reused efficiently.
- It's useful when you need to do low-level transformations, operations, and control on a dataset.
- It's more commonly used to manipulate data with functional programming constructs than with domain-specific expressions.
- DataFrame
- It exposes the structure of the data, i.e., rows and columns. You can think of it as a relational database table.
- Optimized execution plan: the Catalyst optimizer is used to create query plans.
- One of the limitations of DataFrames is the lack of compile-time type safety: when the structure of the data is unknown, errors are caught only at runtime.
- Also, if you're working in Python, start with DataFrames and then switch to RDDs if you need more flexibility.
- Dataset (a strongly typed extension of the DataFrame API, available in Scala and Java)
- It uses efficient encoders and, unlike DataFrames, provides compile-time type safety in a structured manner.
- If you want a greater level of type safety at compile time, or if you want typed JVM objects, Dataset is the way to go.
- You can also leverage Datasets when you want to take advantage of Catalyst optimization or benefit from Tungsten's fast code generation. (A short sketch contrasting the RDD and DataFrame styles in PySpark follows below.)
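In PySpark you mostly work with RDDs and DataFrames. Below is a hedged sketch of the same aggregation written both ways, assuming a SparkSession named spark and made-up sample data:
data = [("Sales", 3000), ("Finance", 4000), ("Sales", 4600)]

# RDD style: low-level, functional transformations
rdd = spark.sparkContext.parallelize(data)
print(rdd.reduceByKey(lambda a, b: a + b).collect())

# DataFrame style: declarative, optimized by the Catalyst optimizer
df = spark.createDataFrame(data, ["department", "salary"])
df.groupBy("department").sum("salary").show()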
2. How can you create a DataFrame a) from an existing RDD and b) from a CSV file?
A DataFrame can be created in PySpark either from an existing RDD or by reading a CSV file.
a) From an Existing RDD
- Convert RDD to DataFrame using toDF() (requires a schema or column names):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDtoDF").getOrCreate()

# Sample RDD
rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob")])

# Convert to DataFrame with column names
df = rdd.toDF(["id", "name"])
df.show()
- Using createDataFrame() with explicit schema:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])

df = spark.createDataFrame(rdd, schema)
df.show()
b) From a CSV File
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df.show()
- header=True → Treats first row as column names.
- inferSchema=True → Automatically detects data types.
Alternative (explicit schema):
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True)
])

df = spark.read.csv("path/to/file.csv", header=True, schema=schema)
This ensures type safety and avoids schema inference overhead.
Both methods are widely used in PySpark for efficient DataFrame creation.
3. Explain the use of StructType and StructField classes in PySpark with examples.
The StructType and StructField classes in PySpark are used to define the schema for the DataFrame and create complex columns such as nested struct, array, and map columns.
StructType is a collection of StructField objects that determines column name, column data type, field nullability, and metadata.
PySpark imports the StructType class from pyspark.sql.types to describe the DataFrame’s structure.
The DataFrame’s printSchema() function displays StructType columns as “struct.”
To define individual columns, PySpark provides the StructField class (also in pyspark.sql.types), which takes the column name (String), column type (DataType), whether the column is nullable (Boolean), and optional metadata.
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master("local[1]") \
    .appName('ProjectPro') \
    .getOrCreate()

data = [("James", "", "William", "36636", "M", 3000),
        ("Michael", "Smith", "", "40288", "M", 4000),
        ("Robert", "", "Dawson", "42114", "M", 4000),
        ("Maria", "", "Jones", "39192", "F", 4000)]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)
4. What are the different ways to handle row duplication in a PySpark DataFrame?
There are two ways to handle row duplication in PySpark dataframes. The distinct() function in PySpark is used to drop/remove duplicate rows (all columns) from a DataFrame, while dropDuplicates() is used to drop rows based on one or more columns.
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName('ProjectPro').getOrCreate()

data = [("James", "Sales", 3000),
        ("Michael", "Sales", 4600),
        ("Robert", "Sales", 4100),
        ("Maria", "Finance", 3000),
        ("James", "Sales", 3000),
        ("Scott", "Finance", 3300),
        ("Jen", "Finance", 3900),
        ("Jeff", "Marketing", 3000),
        ("Kumar", "Marketing", 2000),
        ("Saif", "Sales", 4100)]

column = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=data, schema=column)
df.printSchema()
df.show(truncate=False)

# Distinct
distinctDF = df.distinct()
print("Distinct count: " + str(distinctDF.count()))
distinctDF.show(truncate=False)

# Drop duplicates
df2 = df.dropDuplicates()
print("Distinct count: " + str(df2.count()))
df2.show(truncate=False)

# Drop duplicates on selected columns
dropDisDF = df.dropDuplicates(["department", "salary"])
print("Distinct count of department & salary: " + str(dropDisDF.count()))
dropDisDF.show(truncate=False)
5. Explain PySpark UDF with the help of an example.
PySpark UDF (User Defined Function) is used to extend PySpark's built-in capabilities. UDFs in PySpark work similarly to UDFs in conventional databases.
In PySpark, you write a Python function and either wrap it with udf() to use it on DataFrames, or register it with spark.udf.register() to use it in SQL.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName('ProjectPro').getOrCreate()

column = ["Seqno", "Name"]
data = [("1", "john jones"),
        ("2", "tracey smith"),
        ("3", "amy sanders")]

df = spark.createDataFrame(data=data, schema=column)
df.show(truncate=False)

# Defines the convertCase function, which capitalizes the first letter of each word
def convertCase(s):
    resStr = ""
    arr = s.split(" ")
    for x in arr:
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr

# Converting the Python function to a UDF
convertUDF = udf(lambda z: convertCase(z), StringType())
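For completeness, here is a hedged sketch of how the UDF above could then be applied to the DataFrame and registered for use in Spark SQL (column names follow the example above):
from pyspark.sql.functions import col

# Apply the UDF to the Name column of the DataFrame
df.select(col("Seqno"), convertUDF(col("Name")).alias("Name")).show(truncate=False)

# Register the same Python function for use in Spark SQL
spark.udf.register("convertUDF", convertCase, StringType())
df.createOrReplaceTempView("NAME_TABLE")
spark.sql("SELECT Seqno, convertUDF(Name) AS Name FROM NAME_TABLE").show(truncate=False)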
6. Discuss the map() transformation in PySpark DataFrame with the help of an example.
PySpark's map() is an RDD transformation that generates a new RDD by applying a function (commonly a lambda) to each element; to use it on a DataFrame, you first access the underlying RDD via df.rdd.
map() transformations are used to perform operations such as adding a field, changing a value, converting data types, and so on.
Map transformations always produce the same number of records as the input.
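A minimal sketch, assuming a SparkSession named spark already exists, that doubles every number in an RDD:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

# Apply a lambda to every element; the output has the same number of records as the input
doubled = rdd.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8]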
7. What do you mean by ‘joins’ in PySpark DataFrame? What are the different types of joins?
Joins in PySpark combine two DataFrames, and by chaining joins, one may combine several DataFrames.
INNER Join, LEFT OUTER Join, RIGHT OUTER Join, LEFT ANTI Join, LEFT SEMI Join, CROSS Join, and SELF Join are among the SQL join types it supports.
# PySpark join syntax
join(self, other, on=None, how=None)
The join() procedure accepts the following parameters and returns a DataFrame-
‘other’: The join’s right side;
‘on’: the join column’s name;
‘how’: default inner (Options are inner, cross, outer, full, full outer, left, left outer, right, right outer, left semi, and left anti.)
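A minimal sketch of an inner join, using hypothetical employee and department DataFrames (a SparkSession named spark is assumed):
emp = spark.createDataFrame([(1, "James", 10), (2, "Maria", 20)],
                            ["emp_id", "name", "dept_id"])
dept = spark.createDataFrame([(10, "Sales"), (20, "Finance")],
                             ["dept_id", "dept_name"])

# Inner join on the common dept_id column
emp.join(dept, on="dept_id", how="inner").show()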
8. What is PySpark ArrayType? Explain with an example.
PySpark ArrayType is a collection data type that extends PySpark's DataType class, which is the superclass of all data types. All elements of an ArrayType column must be of the same type.
An instance of ArrayType is constructed with the ArrayType() constructor. It accepts an element type and one optional argument, containsNull, which specifies whether elements may be null and defaults to True.
The element type should extend the DataType class in PySpark.
from pyspark.sql.types import StringType, ArrayType

arrayCol = ArrayType(StringType(), False)
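As a hedged sketch, an ArrayType column can then be used in a DataFrame schema like this (the name/languages columns are hypothetical, and a SparkSession named spark is assumed):
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("languages", ArrayType(StringType()), True)
])

data = [("James", ["Java", "Scala"]), ("Maria", ["Python"])]
df = spark.createDataFrame(data, schema)
df.printSchema()
df.show(truncate=False)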
9. What do you understand by PySpark Partition?
PySpark partitions a large dataset into smaller chunks. When we build a DataFrame from a file or table, PySpark creates the DataFrame in memory with a certain number of partitions based on specified criteria.
Transformations on partitioned data run quicker since each partition's transformations are executed in parallel.
PySpark supports both partitioning in memory (DataFrame) and partitioning on disk (file system).
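A hedged sketch of both kinds of partitioning, assuming an existing DataFrame df with a state column (the column name and output path are made up):
# In-memory partitioning: change the number of partitions of a DataFrame
df2 = df.repartition(8)
print(df2.rdd.getNumPartitions())  # 8

# On-disk partitioning: one sub-directory is written per distinct value of "state"
df.write.partitionBy("state").parquet("/tmp/output/by_state")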
10. What is meant by PySpark MapType? How can you create a MapType using StructType?
PySpark MapType accepts two mandatory parameters – keyType and valueType, and one optional boolean argument valueContainsNull.
- Create a MapType with PySpark StructType and StructField.
The StructType() accepts a list of StructFields, each of which takes a fieldname and a value type.
Using this StructType schema, a DataFrame can be constructed as follows:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, MapType

schema = StructType([
    StructField('name', StringType(), True),
    StructField('properties', MapType(StringType(), StringType()), True)
])

spark = SparkSession.builder.appName('PySpark StructType StructField').getOrCreate()

dataDictionary = [('James', {'hair': 'black', 'eye': 'brown'}),
                  ('Michael', {'hair': 'brown', 'eye': None}),
                  ('Robert', {'hair': 'red', 'eye': 'black'}),
                  ('Washington', {'hair': 'grey', 'eye': 'grey'}),
                  ('Jefferson', {'hair': 'brown', 'eye': ''})]

df = spark.createDataFrame(data=dataDictionary, schema=schema)
df.printSchema()
df.show(truncate=False)
11. How can a PySpark DataFrame be converted to a Pandas DataFrame?
The key difference between Pandas and PySpark is that:
PySpark‘s operations are quicker than Pandas’ because of its distributed nature and parallel execution over several cores and computers.
In other words, Pandas uses a single node to do operations, whereas PySpark uses several computers.
Some Scenarios for PySpark and Pandas:
- You’ll need to transfer the data back to Pandas DataFrame after processing it in PySpark so that you can use it in Machine Learning apps or other Python programs.
- To convert a PySpark DataFrame to a Pandas DataFrame, use the toPandas() function, which gathers all records of the PySpark DataFrame and delivers them to the driver program; it should therefore only be used on a small subset of the data, as shown in the sketch below. On a larger dataset, the application can fail with an out-of-memory error.
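A minimal sketch, assuming a small PySpark DataFrame df:
# Collects all rows to the driver and returns a pandas DataFrame
pandas_df = df.toPandas()
print(type(pandas_df))  # <class 'pandas.core.frame.DataFrame'>
print(pandas_df.head())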
12. What is the function of PySpark’s pivot() method?
The pivot() method in PySpark is used to rotate/transpose the values of one column into multiple DataFrame columns; the reverse operation, unpivot, converts those columns back into rows.
Pivot() is an aggregation in which the values of one of the grouping columns are transposed into separate columns containing different data.
To determine the total amount of each product exported to each country, we group by Product, pivot by Country, and sum the Amount.
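The pivot snippet below assumes a DataFrame df with Product, Country, and Amount columns; here is a hedged sketch of such sample data (values are made up, and a SparkSession named spark is assumed):
data = [("Banana", 1000, "USA"), ("Carrots", 1500, "USA"),
        ("Banana", 400, "China"), ("Carrots", 1200, "China")]
df = spark.createDataFrame(data, ["Product", "Amount", "Country"])
df.show()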
pivotDF = df.groupBy("Product").pivot("Country").sum("Amount")
pivotDF.printSchema()
pivotDF.show(truncate=False)
13. In PySpark, how do you generate broadcast variables? Give an example.
Broadcast variables in PySpark are read-only shared variables that are stored and accessible on all nodes in a cluster so that processes may access or use them. Instead of sending this information with each job, PySpark uses efficient broadcast algorithms to distribute broadcast variables among workers, lowering communication costs.
The broadcast(v) function of the SparkContext class is used to generate a PySpark Broadcast. This method accepts the broadcast parameter v.
# Generating a broadcast variable in the PySpark shell
broadcastVariable = sc.broadcast([0, 1, 2, 3])
broadcastVariable.value

# PySpark RDD broadcast variable example
spark = SparkSession.builder.appName('SparkByExample.com').getOrCreate()

states = {"NY": "New York", "CA": "California", "FL": "Florida"}
broadcastStates = spark.sparkContext.broadcast(states)

data = [("James", "Smith", "USA", "CA"),
        ("Michael", "Rose", "USA", "NY"),
        ("Robert", "Williams", "USA", "CA"),
        ("Maria", "Jones", "USA", "FL")]
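To show the broadcast variable being used, here is a hedged continuation of the example above that maps each state code to its full name via the broadcast dictionary:
rdd = spark.sparkContext.parallelize(data)

# Each task looks up the full state name in the broadcast dictionary instead of shipping it per record
result = rdd.map(lambda row: (row[0], row[1], row[2], broadcastStates.value[row[3]]))
print(result.collect())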
14. Under what scenarios are Client and Cluster modes used for deployment?
Cluster mode should be used for deployment if the client machines are not close to the cluster. This avoids the network latency that would occur in client mode for communication between the driver and the executors.
In the case of Client mode, if the machine goes offline, the entire operation is lost.
Client mode can be utilized for deployment if the client computer is located within the cluster. There will be no network latency concerns because the computer is part of the cluster, and the cluster’s maintenance is already taken care of, so there is no need to be concerned in the event of a failure.
15. Write a Spark program to check whether a given keyword exists in a huge text file or not.
def keywordExists(line):
    if line.find("my_keyword") > -1:
        return 1
    return 0

lines = sparkContext.textFile("sample_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)
print("Found" if total > 0 else "Not Found")
16. What is meant by Executor Memory in PySpark?
Each Spark application runs its executors with a fixed number of cores and a fixed heap size. The heap size is the memory available to the Spark executor, controlled by the spark.executor.memory property of the --executor-memory flag.
One executor is assigned on each worker node where Spark operates, and the executor memory is a measure of the memory the application uses on that worker node.
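As a hedged sketch, the executor heap size can be set when building the session (4g is just an example value):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ExecutorMemoryExample")
         .config("spark.executor.memory", "4g")  # heap size per executor; 4g is an example value
         .getOrCreate())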
17. List some of the functions of SparkCore.
The core engine for large-scale distributed and parallel data processing is SparkCore. The distributed execution engine in the Spark core provides APIs in Java, Python, and Scala for constructing distributed ETL applications.
Memory management, task monitoring, fault tolerance, storage system interactions, work scheduling, and support for all fundamental I/O activities are all performed by Spark Core.
Additional libraries on top of Spark Core enable a variety of SQL, streaming, and machine learning applications.
Spark Core manages:
- Fault Recovery.
- Interactions between memory management and storage systems.
- Monitoring, scheduling, and distributing jobs.
- Fundamental I/O functions.
18. What are some of the drawbacks of incorporating Spark into applications?
Even though Spark is a strong data processing engine, there are certain drawbacks to utilizing it in applications:
- When compared to MapReduce or Hadoop, Spark consumes more memory, which may cause memory-related issues.
- Spark can be a constraint for cost-effective processing of large data, since its "in-memory" computations require a lot of RAM.
- When working in cluster mode, files on the path of the local filesystem must be available at the same place on all worker nodes, as the task execution shuffles across different worker nodes based on resource availability.
All worker nodes must copy the files, or a separate network-mounted file sharing system must be installed.
19. How can data transfers be kept to a minimum while using PySpark?
- The process of shuffling corresponds to data transfers. Spark applications run more quickly and reliably when these transfers are minimized.
There are quite a number of approaches that may be used to reduce them. They are as follows:
- Using broadcast variables improves the efficiency of joins between large and small RDDs (see the sketch after this list).
- Accumulators are used to update variable values in a parallel manner during execution.
- Another popular method is to prevent operations that cause these reshuffles.
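As a hedged illustration of the broadcast point above for DataFrames, Spark SQL's broadcast() hint ships the small table to every executor so the join avoids a shuffle (the DataFrames here are hypothetical):
from pyspark.sql.functions import broadcast

large_df = spark.createDataFrame([(1, "CA"), (2, "NY"), (3, "FL")], ["id", "state_code"])
small_df = spark.createDataFrame([("CA", "California"), ("NY", "New York")],
                                 ["state_code", "state_name"])

# Broadcasting the small DataFrame lets the join avoid shuffling the large one
large_df.join(broadcast(small_df), on="state_code", how="left").show()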
20. What are Sparse Vectors? What distinguishes them from dense vectors?
Sparse vectors are made up of two parallel arrays, one for indexing and the other for storing values. These vectors are used to save space by storing non-zero values.
E.g.- val sparseVec: Vector = Vectors.sparse(5, Array(0, 4), Array(1.0, 2.0))
The vector in the above example is of size 5, but the non-zero values are only found at indices 0 and 4.
When there are just a few non-zero values, sparse vectors come in handy. If there are just a few zero values, dense vectors should be used instead of sparse vectors, as sparse vectors would create indexing overhead, which might affect performance.
The following is an example of a dense vector:
val denseVec = Vectors.dense(4405d,260100d,400d,5.0,4.0,198.0,9070d,1.0,1.0,2.0,0.0)
The usage of sparse or dense vectors does not affect the outcomes of calculations, but when they are used incorrectly, they influence the amount of memory needed and the calculation time.
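Since the snippets above are in Scala, here is a hedged PySpark equivalent using pyspark.ml.linalg:
from pyspark.ml.linalg import Vectors

# Sparse vector of size 5 with non-zero values only at indices 0 and 4
sparse_vec = Vectors.sparse(5, [0, 4], [1.0, 2.0])

# Dense vector stores every value explicitly, including the zeros
dense_vec = Vectors.dense([1.0, 0.0, 0.0, 0.0, 2.0])

print(sparse_vec)  # (5,[0,4],[1.0,2.0])
print(dense_vec)   # [1.0,0.0,0.0,0.0,2.0]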
21. What role does Caching play in Spark Streaming?
Spark Streaming is based on dividing a data stream's contents into batches of X seconds, known as DStreams. These DStreams allow developers to cache data in memory, which may be particularly handy if the data from a DStream is used several times.
The cache() function or the persist() method with proper persistence settings can be used to cache data.
For input streams receiving data over the network (such as Kafka, Flume, and others), the default persistence level is configured to replicate the data on two nodes for fault tolerance.
// Cache method
val cacheDf = dframe.cache()

// Persist method
val persistDf = dframe.persist(StorageLevel.MEMORY_ONLY)
Key benefits of caching:
- Cost-effectiveness: Because Spark computations are costly, caching aids in data reuse, which lets computations be reused and lowers the cost of operations.
- Time-saving: By reusing computations, we may save a lot of time.
- More Jobs Achieved: Worker nodes may perform/execute more jobs by reducing computation execution time.
22. What API does PySpark utilize to implement graphs?
Spark RDD is extended with a robust API called GraphX, which supports graphs and graph-based calculations.
The Resilient Distributed Property Graph is an extension of the Spark RDD: a directed multigraph that can have multiple parallel edges.
User-defined attributes are associated with each edge and vertex. Parallel edges allow multiple relationships between the same pair of vertices.
GraphX offers a collection of operators that can allow graph computing, such as subgraph, mapReduceTriplets, joinVertices, and so on.
It also offers a wide number of graph builders and algorithms for making graph analytics chores easier.
23. What is meant by Piping in PySpark?
Apache Spark supports the pipe() function on RDDs, which allows you to assemble distinct portions of jobs that can use any language.
The RDD transformation may be created using the pipe() function, and it can be used to read each element of the RDD as a String.
These may be altered as needed, and the results can be presented as Strings.
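A hedged sketch of pipe(), streaming each partition's elements through an external command (the Unix tr utility here is just an example and assumes a Unix environment):
rdd = spark.sparkContext.parallelize(["hello", "world"])

# Each element is written to the external command's stdin and each output line is read back as a string
upper = rdd.pipe("tr '[:lower:]' '[:upper:]'")
print(upper.collect())  # ['HELLO', 'WORLD']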
24. What steps are involved in calculating the executor memory?
Suppose you have the following details regarding the cluster:
No. of nodes = 10
No. of cores in each node = 15 cores
RAM of each node = 61GB
We use the following method to determine the number of cores per executor:
No. of cores per executor = how many concurrent tasks each executor can handle.
As a rule of thumb, 5 is a good value.
Hence, we use the following method to determine the number of executors:
No. of executors per node = No. of cores per node / cores per executor
= 15 / 5 => 3
Total no. of executors = No. of nodes * No. of executors per node
= 10 * 3 => 30 executors per Spark job
Finally, executor memory ≈ RAM per node / executors per node = 61 GB / 3 ≈ 20 GB per executor (in practice, a margin is left for OS and overhead).
25. Do we have a checkpoint feature in Apache Spark?
Yes, Spark provides a checkpointing API. Checkpointing makes streaming applications more resilient to failures. We can store the data and metadata in a checkpointing directory.
If there's a failure, Spark can retrieve this data and resume from where it left off (a short sketch follows the list below).
In Spark, checkpointing may be used for the following data categories-
- Metadata checkpointing: Metadata means information about information. It refers to storing metadata in a fault-tolerant storage system such as HDFS. You can consider configurations, DStream actions, and unfinished batches as types of metadata.
- Data checkpointing: Because some of the stateful operations demand it, we save the RDD to secure storage. The RDD for the next batch is defined by the RDDs from previous batches, in this case.
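A hedged sketch of RDD checkpointing (the HDFS path is hypothetical, and a SparkSession named spark is assumed); streaming jobs pass a checkpoint directory in a similar way:
# The checkpoint directory should live on fault-tolerant storage such as HDFS
spark.sparkContext.setCheckpointDir("hdfs://namenode:9000/checkpoints")

rdd = spark.sparkContext.parallelize(range(100))
rdd.checkpoint()             # mark the RDD for checkpointing
rdd.count()                  # an action triggers the actual checkpoint
print(rdd.isCheckpointed())  # True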
26. In Spark, how would you calculate the total number of unique words?
# Open the text file as an RDD
lines = sc.textFile("hdfs://Hadoop/user/sample_file.txt")

# A function that converts each line into words
def toWords(line):
    return line.split()

# As a flatMap transformation, run the toWords function on each item of the RDD
words = lines.flatMap(toWords)

# Create a (key, value) pair for each word
def toTuple(word):
    return (word, 1)

wordsTuple = words.map(toTuple)

# Run the reduceByKey() command to add up the counts for each word
def add(x, y):
    return x + y

counts = wordsTuple.reduceByKey(add)

# Each element of counts is a distinct word, so its size is the total number of unique words
print(counts.count())
counts.collect()
27. List some of the benefits of using PySpark.
- PySpark is a specialized in-memory distributed processing engine that enables you to handle data in a distributed fashion effectively.
- PySpark-based programs can be up to 100 times faster than traditional MapReduce applications for in-memory workloads.
- PySpark can handle data from Hadoop HDFS, Amazon S3, and a variety of other file systems.
- Through the use of Streaming and Kafka, PySpark is also utilized to process real-time data.
- You can use PySpark Streaming to ingest and process data from sources such as file systems and network sockets.
- PySpark contains machine learning and graph libraries.