Here, we will discuss Hadoop and Spark interview questions that interviewers commonly ask for Data Engineer positions.
1. What is Hadoop?
Hadoop is an open-source framework for distributed storage (HDFS) and processing (MapReduce) of large datasets across clusters.
It enables scalable, fault-tolerant big data solutions.
2. Hadoop Interview Topics
- HDFS Architecture: NameNode/DataNode roles, replication, block storage.
- MapReduce: Phases (map, shuffle, reduce), job workflow, optimizations.
- YARN: Resource management, job scheduling.
- Ecosystem Tools: Hive, Pig, HBase, Spark integration.
- Fault Tolerance: Data replication, speculative execution.
- Optimization: Combiner/Partitioner usage, handling small files.
- Hadoop vs. Spark: Latency, in-memory processing trade-offs.
- Use Cases: Batch processing, log analysis, ETL workflows.
- Challenges: Data skew, cluster tuning.
- Cloud Integration: AWS EMR, Azure HDInsight.
Hadoop Spark Interview Questions
1. What is Hadoop MapReduce?
The Hadoop MapReduce framework is used to process large datasets in parallel across a Hadoop cluster.
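As a quick sketch, here is a classic word-count job written for Hadoop Streaming in Python; the file names mapper.py and reducer.py and the input/output paths are illustrative assumptions, not part of the standard answer.

```python
#!/usr/bin/env python3
# mapper.py - emits "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - sums the counts for each word (input arrives sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

These scripts would then be submitted with the Hadoop Streaming jar, for example: hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py (jar name and paths assumed).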
2. What are the differences between the relational database and HDFS?
| RDBMS | HDFS |
| --- | --- |
| Relies on structured data; the schema is always known (schema on write). | Follows a schema-on-read policy and can store any kind of data. |
| Provides limited or no parallel processing capabilities. | Processes data distributed across the cluster in a parallel fashion. |
| Licensed software, so you need to pay for it. | Open-source framework, so there is no need to pay for the software. |
| Reads are fast because the schema is already known. | Writes are fast because no schema validation happens during an HDFS write. |
| Used for OLTP (online transaction processing) systems. | Used for data discovery, data analytics, and OLAP systems. |
3. Explain big data and explain 5v’s of big data.
Big data is a term for collections of large and complex datasets that are difficult to process using relational database management tools or traditional data processing applications.
It is difficult to capture, visualize, curate, store, search, share, transfer, and analyze big data.
IBM has defined big data with 5 V's. They are –
- Volume: the scale of data being stored and processed.
- Velocity: the speed at which data is generated and must be handled.
- Variety: the different forms of data – structured, semi-structured, and unstructured.
- Veracity: the uncertainty and quality issues in the data.
- Value: it is good to have access to big data, but unless we turn it into value, it is useless.
4. What is Hadoop and its components?
Apache Hadoop is a framework that provides various services and tools to store and process big data.
It helps in analyzing big data and making business decisions from it, which cannot be done efficiently using traditional systems.
The main components of Hadoop are-
- Storage: HDFS (NameNode, DataNode)
- Processing framework: YARN (ResourceManager, NodeManager)
5. What are HDFS and YARN?
- HDFS:
- HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment.
- It follows master and slave topology.
- NameNode: NameNode is the master node in the distributed environment and it maintains the metadata information for the blocks of data stored in HDFS like block location, replication factors, etc.
- DataNode: DataNodes are the slave nodes, which are responsible for storing data in the HDFS. NameNode manages all the DataNodes.
- YARN:
- YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop, which manages resources and provides an execution environment for the processes.
- ResourceManager: It receives the processing requests, and then passes the parts of requests to corresponding NodeManagers accordingly, where the actual processing takes place. It allocates resources to applications based on the needs.
- NodeManager: NodeManager is installed on every DataNode and it is responsible for the execution of the task on every single DataNode.
6. Tell me about various Hadoop Daemons and their roles in the Hadoop cluster.
HDFS daemons: NameNode, DataNode, and Secondary NameNode.
YARN daemons: ResourceManager, NodeManager, and JobHistoryServer.
JobHistoryServer: It maintains information about MapReduce jobs after the Application Master terminates.
7. Compare HDFS with Network Attached Storage (NAS).
Network-attached storage (NAS) is a file-level computer data storage server connected to a computer network providing data access to a heterogeneous group of clients.
NAS can either be hardware or software which provides services for storing and accessing files.
Hadoop Distributed File System (HDFS) is a distributed filesystem to store data using commodity hardware.
In HDFS, Data Blocks are distributed across all the machines in a cluster, whereas in NAS data is stored on dedicated hardware.
8. List the difference between Hadoop 1.0 vs Hadoop 2.0.
In Hadoop 1.x, “NameNode” is the single point of failure. In Hadoop 2.x, we have Active and Passive “NameNodes”.
If the active “NameNode” fails, the passive “NameNode” takes charge. Because of this, high availability can be achieved in Hadoop 2.x.
In Hadoop 2.x, YARN provides a central resource manager. With YARN, you can now run multiple applications in Hadoop, all sharing a common pool of resources.
MRV2 is a particular type of distributed application that runs the MapReduce framework on top of YARN. Other tools can also perform data processing via YARN, which was a problem in Hadoop 1.x.
9. What are Active and Passive NameNodes?
In a high-availability architecture, there are two NameNodes:
- Active “NameNode” is the “NameNode” which works and runs in the cluster.
- Passive “NameNode” is a standby “NameNode”, which has similar data as the active “NameNode”.
When the active “NameNode” fails, the passive “NameNode” replaces the active “NameNode” in the cluster.
10. Why does one add or remove DataNodes frequently?
A key feature of the Hadoop framework is its use of commodity hardware. However, this leads to frequent DataNode crashes in a Hadoop cluster.
Another feature of the Hadoop framework is the ease of scaling in accordance with rapid growth in data volume. That is why commissioning and decommissioning nodes is routine work for a Hadoop administrator.
11. What happens when two clients try to access the same file in HDFS?
When the first client requests the file for writing, the NameNode grants it a lease on that file. When a second client then asks to write to the same file, the NameNode rejects the request because the lease is already held by the first client – HDFS supports only one writer per file at a time.
12. How does NameNode tackle data node failures?
The NameNode periodically receives a Heartbeat (signal) from each DataNode in the cluster, which implies that the DataNode is functioning properly. If a DataNode stops sending heartbeats for a configured period, the NameNode marks it as dead and re-replicates its blocks to other DataNodes.
13. What will you do when NameNode is down?
Use the file system metadata replica (FsImage) to start a new NameNode.
Then configure the DataNodes and clients so that they recognize the newly started NameNode.
14. What is a checkpoint?
Checkpointing is a process that takes the FsImage and the edit log and compacts them into a new FsImage.
Thus, instead of replaying the edit log, the NameNode can load the final in-memory state directly from the FsImage.
15. How is HDFS fault tolerant?
When data is stored in HDFS, the NameNode has it replicated across several DataNodes; the default replication factor is 3.
If any DataNode fails, the NameNode automatically copies the data to another DataNode from the remaining replicas, so the data stays fault tolerant.
16. Can NameNode and DataNode be commodity hardware?
Not entirely. DataNodes can be commodity hardware, since they only need to store data and are required in large numbers. The NameNode, however, is the master node and holds the metadata for all blocks in memory, so it requires a high-memory, more reliable machine rather than ordinary commodity hardware.
17. Why do we use HDFS for files with large data sets but not when there are a lot of small files?
The NameNode stores the metadata of the file system in RAM. Therefore, the amount of available memory limits the number of files an HDFS file system can hold. In other words, too many files lead to too much metadata, and storing all of that metadata in RAM becomes a challenge. Hence, HDFS works best with a small number of large files rather than a large number of small files.
18. How do you define block, and what is the default block size?
A block is the smallest contiguous location on disk where HDFS stores data.
The default block size is 64 MB in Hadoop 1 and 128 MB in Hadoop 2.
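For illustration, the block size of an individual file can be overridden at write time via the dfs.blocksize property; a minimal Python sketch that shells out to the HDFS CLI (file names and paths are assumptions):

```python
import subprocess

# Upload a local file with a 256 MB block size instead of the cluster default.
# 268435456 bytes = 256 * 1024 * 1024.
subprocess.run(
    ["hdfs", "dfs", "-D", "dfs.blocksize=268435456",
     "-put", "local_bigfile.dat", "/data/bigfile.dat"],
    check=True,
)
```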
19. How do you define Rack awareness in Hadoop?
Rack Awareness is the policy by which the NameNode decides how blocks and their replicas are placed, based on rack definitions. It keeps bulk replication traffic between DataNodes within the same rack where possible, to reduce cross-rack network traffic, while placing at least one replica on a different rack for fault tolerance.
20. What is the difference between HDFS block, and input split?
- The HDFS Block is the physical division of the data while Input Split is the logical division of the data.
- HDFS divides the data into blocks for storage, whereas the InputSplit divides the data logically and assigns each split to a mapper function for processing.
21. Name three modes that Hadoop can run.
- Standalone mode
- Pseudo-distributed mode
- Fully distributed mode
22. What do you know about SequenceFileFormat?
SequenceFileInputFormat is an input format for reading sequence files. It is a compressed binary file format optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.
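For example, in PySpark a sequence file can be written and then read back directly (the path is an assumption):

```python
from pyspark import SparkContext

sc = SparkContext(appName="seqfile-demo")

# Save an RDD of key/value pairs as a sequence file, then read it back;
# reading uses SequenceFileInputFormat under the hood.
pairs = sc.parallelize([("a", 1), ("b", 2)])
pairs.saveAsSequenceFile("/tmp/demo-seqfile")
restored = sc.sequenceFile("/tmp/demo-seqfile")
print(restored.collect())
```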
23. What is Hive?
Apache Hive is a data warehouse system built on top of Hadoop, developed by Facebook, and used for analyzing structured and semi-structured data.
Hive abstracts the complexity of Hadoop MapReduce.
24. What is SerDe in Hive?
The SerDe interface allows you to instruct Hive about how a record should be processed.
A SerDe is a combination of a Serializer and a Deserializer. Hive uses a SerDe (together with a FileFormat) to read and write table rows.
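As an illustrative sketch, a Hive table can be bound to a specific SerDe in its DDL; here the statement is issued through PySpark's Hive support (the table name and location are assumptions; OpenCSVSerde is a SerDe that ships with Hive):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Tell Hive how each row of this external CSV-backed table should be read and written.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_csv (id INT, amount DOUBLE)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    STORED AS TEXTFILE
    LOCATION '/data/sales_csv'
""")
```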
25. Can the default Hive metastore be used by multiple users at the same time?
Derby database is the default Hive Metastore.
Multiple users (processes) cannot access it at the same time. It is mainly used to perform unit tests.
26. What is the default location where Hive stores table data?
The default location where Hive stores table data is inside HDFS in /user/hive/warehouse.
27. What is Apache HBase?
HBase is an open-source, multidimensional, distributed, scalable, and NoSQL database written in Java.
HBase runs on top of HDFS (Hadoop Distributed File System) and provides BigTable (Google) like capabilities to Hadoop.
It is designed to provide a fault-tolerant way of storing the large collection of sparse data sets.
HBase achieves high throughput and low latency by providing faster Read/Write Access on huge datasets.
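As a small sketch of HBase's read/write model from Python, using the third-party happybase client over the HBase Thrift gateway (the host, table, and column-family names are assumptions):

```python
import happybase

# Connect to an HBase Thrift server and write/read one cell.
connection = happybase.Connection("hbase-thrift-host")   # assumed host running the Thrift service
table = connection.table("user_profiles")                # assumed table with column family 'cf'

table.put(b"row-001", {b"cf:name": b"alice"})            # the write lands in the WAL and MemStore first
print(table.row(b"row-001"))                             # low-latency random read by row key
```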
28. What are the components of Apache HBase?
Apache HBase has three major components, i.e. HMaster Server, HBase RegionServer, and Zookeeper.
- Region Server: A table can be divided into several regions. A group of regions is served to the clients by a Region Server.
- HMaster: It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
- ZooKeeper: ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps in maintaining server state inside the cluster by communicating through sessions.
29. What are the components of the Region server?
- WAL: Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t been persisted or committed to the permanent storage.
- Block Cache: It is the read cache. It resides in the Region Server and keeps the frequently read data in memory.
- MemStore: It is the write cache. It stores all the incoming data before committing it to the disk or permanent memory. There is one MemStore for each column family in a region.
- HFile: HFile is stored in HDFS. It stores the actual cells on the disk.
30. What is the difference between HBase and a relational database?

| HBase | Relational Database |
| --- | --- |
| It is schema-less. | It is a schema-based database. |
| It is a column-oriented data store. | It is a row-oriented data store. |
| It is used to store de-normalized data. | It is used to store normalized data. |
| Automated partitioning is done in HBase. | There is no built-in support for partitioning. |
31. What is Apache Spark?
Apache Spark is a framework for real-time data analytics in a distributed computing environment.
It executes in-memory computations to increase the speed of data processing.
It can be up to 100x faster than MapReduce for large-scale data processing, by exploiting in-memory computation and other optimizations.
32. Can you build Spark with any particular Hadoop version?
Yes. Spark can be built against a particular Hadoop version: you select the Hadoop version/profile at build time, or download a Spark package pre-built for that Hadoop version.
33. What is RDD?
RDD is the acronym for Resilient Distributed Datasets – a fault-tolerant collection of elements that can be operated on in parallel.
The data in an RDD is immutable, partitioned, and distributed; RDDs are a core abstraction of Apache Spark.
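A minimal PySpark sketch: transformations build new immutable RDDs lazily, and an action triggers the distributed computation (the input path is an assumption):

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

lines = sc.textFile("hdfs:///data/logs/app.log")    # RDD partitioned across the cluster
errors = lines.filter(lambda l: "ERROR" in l)       # transformation: returns a new immutable RDD
errors.cache()                                      # keep the result in memory for reuse
print(errors.count())                               # action: triggers the actual job
```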
34. Are Hadoop and Bigdata co-related?
- Big Data is an asset, while Hadoop is an open-source software program, which accomplishes a set of goals and objectives to deal with that asset.
- Hadoop is used to process, store, and analyze complex unstructured data sets through specific algorithms and methods to derive actionable insights.
So yes, they are related but are not alike.
35. Why is Hadoop used in big data analytics?
Hadoop allows running many exploratory data analysis tasks on full datasets, without sampling.
Features that make Hadoop an essential requirement for Big Data are –
- Data collection
- Storage
- Processing
- Runs independently.
36. Name some of the important tools used for data analytics.
The important Big Data analytics tools are –
- NodeXL
- KNIME
- Tableau
- Solver
- OpenRefine
- Rattle GUI
- Qlikview
37. What is FSCK?
FSCK (File System Check) is a command used in HDFS. It checks whether any file is corrupt or has missing blocks; note that it only reports problems and does not repair them.
FSCK generates a summary report, which lists the overall health of the file system.
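For illustration, the check can also be triggered from a script (the path /data is an assumption):

```python
import subprocess

# Report files, blocks, and block locations under /data, including missing or corrupt blocks.
result = subprocess.run(
    ["hdfs", "fsck", "/data", "-files", "-blocks", "-locations"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```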
38. What are the different core methods of Reducer?
There are three core methods of a reducer (a lifecycle sketch follows this list) –
- setup(): called once at the start of the task; it is used to configure parameters such as the input data size, distributed cache, and heap size.
- reduce(): called once per key with the associated list of values; it is the heart of the reducer.
- cleanup(): called once at the end of the task to clean up temporary files and release resources.
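The real API is Java's org.apache.hadoop.mapreduce.Reducer; the Python sketch below only mimics the call order of the three methods:

```python
class WordCountReducer:
    """Mimics the Hadoop reducer lifecycle: setup -> reduce (once per key) -> cleanup."""

    def setup(self):
        # Called once before any keys are processed (configuration, caches, counters).
        self.keys_seen = 0

    def reduce(self, key, values):
        # Called once per key with all of that key's values.
        self.keys_seen += 1
        return key, sum(values)

    def cleanup(self):
        # Called once at the end of the task (flush and close resources, remove temp files).
        print(f"processed {self.keys_seen} keys")

# Driving the lifecycle locally, the way the framework would:
reducer = WordCountReducer()
reducer.setup()
for key, values in [("hadoop", [1, 1, 1]), ("spark", [1])]:
    print(reducer.reduce(key, values))
reducer.cleanup()
```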
39. What are the most common Input file formats in Hadoop?
The most common input formats in Hadoop are –
- Key-value input format
- Sequence file input format
- Text input format
40. What are the different file formats that can be used in Hadoop?
File formats used with Hadoop, include –
- CSV
- JSON
- Columnar
- Sequence files
- AVRO
- Parquet file
41. What is commodity hardware?
Commodity Hardware is the basic hardware resource required to run the Apache Hadoop framework.
It is a common term used for affordable devices, usually compatible with other such devices.
42. What do you mean by logistic regression?
Logistic Regression is a classification algorithm used to predict the probability of a binary outcome (e.g., yes/no, 0/1) by modeling the relationship between input features and a log-odds (logit) function.
Unlike linear regression, it uses the sigmoid function to map linear outputs to probabilities between 0 and 1.
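A minimal NumPy sketch of the idea (the weights, bias, and input are made-up values, not a trained model):

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued linear output to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

weights = np.array([0.8, -0.4])            # assumed (not learned) coefficients
bias = -0.1
x = np.array([1.5, 2.0])                   # one example with two features

log_odds = np.dot(weights, x) + bias       # the linear (logit) part
probability = sigmoid(log_odds)            # predicted P(y = 1 | x)
prediction = int(probability >= 0.5)       # threshold to a binary class
print(probability, prediction)
```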
43. Name the port numbers for the NameNode, Task Tracker, and Job Tracker.
- NameNode – Port 50070
- Task Tracker – Port 50060
- Job Tracker – Port 50030
These are the classic default web UI ports (Hadoop 1.x for the Job Tracker and Task Tracker; up to Hadoop 2.x for the NameNode, whose web UI moved to port 9870 in Hadoop 3).
44. Name the most popular data management tools that are used with edge nodes in Hadoop.
The most commonly used data management tools that work with Edge Nodes in Hadoop are –
- Oozie
- Ambari
- Pig
- Flume.
45. What is the block in the Hadoop distributed file system?
When a file is stored in HDFS, the file system breaks it down into a set of blocks, which are stored across DataNodes.
46. What is the functionality of jps command?
The jps command enables us to check if the Hadoop daemons like namenode, datanode, resourcemanager, nodemanager, etc. are running on the machine.
47. What types of biases can happen through sampling?
Three types of biases can happen through sampling, which are –
- Survivorship bias
- Selection bias
- Undercoverage bias.
48. What is the difference between Sqoop and DistCP?
DistCP is used for transferring data between clusters, while Sqoop is used for transferring data between Hadoop and RDBMS, only.
49. How much data is enough to get a valid outcome?
There is no universal threshold: the amount of data required depends on the analysis methods you use and on how strong an effect you need to detect; you need enough data for the chosen method to have a good chance of producing valid, significant results.
50. Is Hadoop different from other parallel computing systems? How?
Yes, it is.
- Hadoop is a distributed file system.
- It allows us to store and manage large amounts of data in a cloud of machines, managing data redundancy.
- The main benefit of this is that since the data is stored in multiple nodes, it is better to process it in a distributed way.
- Each node is able to process the data stored on it instead of wasting time moving the data across the network.
- In a relational database computing system, we can query data in real time, but storing data in tables, records, and columns becomes inefficient when the data is huge.
- Hadoop also provides HBase, a column-oriented database built on top of HDFS, for run-time (random, real-time) queries on rows.
51. What is a Backup Node?
Backup Node is an extended checkpoint node for performing checkpointing and supporting the online streaming of file system edits.
Its functionality is similar to that of the Checkpoint node, but in addition it keeps an up-to-date, in-memory copy of the file system namespace that stays synchronized with the active NameNode by receiving a stream of edits from it.
52. What are the common data challenges?
The most common data challenges are–
- Ensuring data integrity
- Achieving a 360-degree view
- Safeguarding user privacy
- Taking the right business action with real-time resonance
53. How do you overcome the above-mentioned data challenges?
Data challenges can be overcome by –
- Adopting data management tools that provide a clear view of data assessment.
- Using tools to remove any low-quality data.
- Auditing data from time to time to ensure user privacy is safeguarded.
- Using AI-powered tools, or software as a service (SaaS) products to combine datasets and make them usable.
54. What is the hierarchical Clustering algorithm?
Hierarchical clustering builds a hierarchy of clusters, either by repeatedly merging existing clusters (agglomerative, bottom-up) or by repeatedly splitting them (divisive, top-down).
55. What is K- Mean clustering?
K-Means clustering is an unsupervised machine learning algorithm that partitions data into K clusters. It minimizes variance by iteratively assigning points to the nearest cluster centroid and updating centroids until convergence, effectively grouping similar data points.
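A compact NumPy sketch of the assign/update loop (Lloyd's algorithm); the toy data and K = 2 are illustrative:

```python
import numpy as np

def k_means(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assignment step: label each point with its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
    return labels, centroids

data = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.3, 7.9]])
labels, centroids = k_means(data, k=2)
print(labels, centroids)
```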
56. Can you mention the criteria for a good data model?
A good data model –
- Should be easily consumed
- Should scale well with large data changes
- Should offer predictable performance
- Should adapt to changes in requirements.
57. Name the different commands for starting up and shutting down the Hadoop daemons.
- To start all the daemons: ./sbin/start-all.sh
- To shut down all the daemons: ./sbin/stop-all.sh
58. Talk about the different tombstone markers used for deletion purposes in HBase.
There are three main tombstone markers used for deletion in HBase. They are-
- Family Delete Marker – For marking all the columns of a column family.
- Version Delete Marker – For marking a single version of a single column.
- Column Delete Marker – For marking all the versions of a single column.
59. How can big data add value to businesses?
Big Data Analytics helps businesses transform raw data into meaningful and actionable insights that can shape their business strategies.
The most important contribution of Big Data to business is data-driven business decisions.
60. How do you deploy bigdata solution?
We can deploy a big data solution in three stages. They are:
- Data Ingestion: Begin by collecting data from multiple sources – social media platforms, log files, business documents, anything relevant to your business. Data can be extracted either through real-time streaming or in batch jobs.
- Data Storage: Once the data is extracted, store it in HDFS or HBase. HDFS storage is well suited to sequential access, whereas HBase is ideal for random read/write access.
- Data processing: Usually, data processing is done via frameworks like Hadoop, Spark, MapReduce, Flink, and Pig, to name a few.
61. List the different file permissions in HDFS files or directory levels.
There are three user levels in HDFS – Owner, Group, and Others.
For each of the user levels, there are three available permissions (a permission-setting sketch follows this list):
- read (r)
- write (w)
- execute (x).
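For illustration, these permissions can be set with chmod-style commands through the HDFS CLI; a minimal Python wrapper (the directory path, user, and group are assumptions):

```python
import subprocess

# Owner: rwx, Group: r-x, Others: no access  ->  mode 750.
subprocess.run(["hdfs", "dfs", "-chmod", "750", "/user/analytics/reports"], check=True)

# Change the owning user and group of the same directory.
subprocess.run(["hdfs", "dfs", "-chown", "analyst:datateam", "/user/analytics/reports"], check=True)
```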
62. Elaborate on the process that overwrites the replication factor in HDFS.
In HDFS, there are two ways to overwrite the replication factor – on a per-file basis (changing the factor for a single file) and on a per-directory basis (applying a new factor to all files under that directory).
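A small illustrative sketch (the file and directory paths are assumptions):

```python
import subprocess

# Per-file basis: set the replication factor of one file to 2 and wait (-w) until re-replication completes.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "2", "/data/events/part-00000"], check=True)

# Per-directory basis: apply a replication factor of 3 to every file under the directory.
subprocess.run(["hdfs", "dfs", "-setrep", "-R", "3", "/data/events"], check=True)
```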
63. Explain overfitting.
Overfitting refers to a modeling error that occurs when a model fits a limited set of data points too tightly, capturing noise as well as signal, so it performs well on the training data but generalizes poorly to new data.
64. What is feature selection?
Feature Selection refers to the process of extracting only the required features from a specific dataset, which is especially important when data is extracted from disparate sources.
Feature selection can be done via three techniques –
- Filters method
- Wrappers method
- Embedded method.
65. Define Outliers?
Outliers are values that are far removed from the rest of the group; they do not belong to any specific cluster or group in the dataset. The presence of outliers usually affects the behavior of the model.
Here are five outlier detection methods (a simple extreme-value sketch follows the list):
- Extreme value analysis
- Probabilistic analysis
- Linear models
- Information-theoretic models
- High-dimensional outlier detection.
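As a minimal sketch of extreme value analysis, values more than two standard deviations from the mean are flagged (the threshold of 2 is just a common convention, chosen here because the toy sample is small):

```python
import numpy as np

values = np.array([10.2, 9.8, 10.1, 10.3, 9.9, 25.0])   # toy data; 25.0 is the outlier

z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 2]                  # flag extreme values
print(outliers)                                          # -> [25.]
```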
66. How can you handle missing values in Hadoop?
There are different ways to estimate the missing values. These include regression, multiple data imputation, listwise/pairwise deletion, maximum likelihood estimation, and approximate Bayesian bootstrap.