Here we discuss Spark interview questions that interviewers commonly ask, mainly for Data Engineer positions.
1. What is Apache Spark?
Apache Spark is an open-source distributed computing engine for large-scale data processing.
It offers in-memory processing, speed, and ease of use via APIs (Python, Scala, Java, R). Core components: Spark Core (RDDs), Spark SQL, MLlib, Streaming, and GraphX.
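A minimal sketch of putting these pieces together, assuming Spark is available locally; the application name and sample data are purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

object SparkIntro {
  def main(args: Array[String]): Unit = {
    // Local SparkSession for experimentation; on a cluster the master
    // normally comes from spark-submit / the cluster manager instead.
    val spark = SparkSession.builder()
      .appName("spark-intro")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Tiny DataFrame exercising Spark SQL's aggregation on in-memory data.
    val sales = Seq(("US", 100), ("IN", 250), ("US", 75)).toDF("country", "amount")
    sales.groupBy("country").sum("amount").show()

    spark.stop()
  }
}
```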
2. Apache Spark Interview Topics
- RDD vs. DataFrame/Dataset: Immutability, optimizations.
- Lazy Evaluation: Execution plan optimization.
- Transformations/Actions: map(), reduceByKey(), collect().
- Shuffling/Partitioning: Minimize data movement.
- Spark Architecture: Driver/executors, cluster managers (YARN, Kubernetes).
- Fault Tolerance: Lineage, checkpointing.
- Spark Streaming: Micro-batching vs. real-time.
- Optimizations: Caching, broadcast variables.
- Spark vs. Hadoop: Speed, in-memory vs. disk.
- Use Cases: ETL, ML, real-time analytics.
Spark Interview Questions
1. What is Spark?
Spark is a general-purpose, in-memory distributed compute engine. It is storage-agnostic (HDFS, S3, local files, and so on) and can run on different cluster managers (YARN, Kubernetes, Mesos, standalone).
Spark keeps intermediate data in memory where possible, whereas MapReduce writes it to disk between stages. In that sense it is a plug-and-play compute engine.
2. Difference between Spark & MR?
- Performance: Spark was designed to be faster than MapReduce, and by all accounts it is; in some cases Spark can be up to 100 times faster than MapReduce.
- Operability: Spark is easier to program than MapReduce.
- Data Processing: MapReduce and Spark are each well suited to different types of data processing tasks.
- Failure Recovery: Spark recomputes lost partitions from RDD lineage, while MapReduce restarts failed tasks from data persisted on disk.
- Security: Hadoop MapReduce has more mature security integration (Kerberos, ACLs), whereas Spark's security features are off by default and rely largely on the cluster manager.
3. Explain the architecture of spark.
The Spark follows the master-slave architecture. Its cluster consists of a single master and multiple slaves.
The Spark architecture depends upon two abstractions: Resilient Distributed Dataset (RDD) and Directed Acyclic Graph (DAG).
In Spark, there are 2 kinds of operations – Transformations and Actions
Transformations are lazy: when we write them, no actual computation happens; Spark only records them in a DAG (a plan of the computation). Execution starts only when an action is encountered; until then, only entries are made into the DAG.
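A small sketch of laziness in practice, assuming a SparkSession named spark is in scope (as in spark-shell); nothing runs until the final action:

```scala
// Transformations only add nodes to the DAG; no job is launched here.
val numbers = spark.sparkContext.parallelize(1 to 1000000)
val evens   = numbers.filter(_ % 2 == 0)   // transformation: lazy
val doubled = evens.map(_ * 2)             // transformation: lazy

// The action triggers the whole chain to execute as one job.
val total = doubled.count()
println(total)
```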
4. What is RDD?
Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects.
Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
5. How spark achieves fault tolerance?
RDDs provide fault tolerance through the lineage graph.
A lineage graph records the sequence of transformations used to build an RDD, so if a partition is lost it can be recomputed from the original data.
6. What is Transformations & action in spark?
- Transformations: Create new RDDs (Resilient Distributed Datasets) from existing ones. They are lazy — only executed when an action is called. Examples: map(), filter(), flatMap(), groupBy().
- Actions: Trigger computations and return results or save data. They execute transformations. Examples: collect(), count(), reduce(), saveAsTextFile().
7. Difference between Narrow & wide transformations?
- A Narrow Transformation is one in which a single input partition maps to a single output partition. Examples: filter(), map(), flatMap().
- A Wide Transformation is a much more expensive operation because it requires a shuffle of data across partitions. Examples: reduceByKey(), groupBy(). The sketch below contrasts the two.
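A hedged sketch, assuming a SparkSession named spark is in scope; filter/map stay within partitions, while reduceByKey forces a shuffle:

```scala
val words = spark.sparkContext.parallelize(Seq("a", "b", "a", "c", "b", "a"))

// Narrow: each input partition maps to a single output partition, no shuffle.
val pairs = words.filter(_.nonEmpty).map(word => (word, 1))

// Wide: values with the same key must be brought together, causing a shuffle.
val counts = pairs.reduceByKey(_ + _)

counts.collect().foreach(println)
```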
8. What is difference between DAG & Lineage?
- DAG: A DAG (directed acyclic graph) of stages is built as Spark statements are evaluated. Execution happens only when an action is encountered; before that, only entries are made into the DAG.
- Lineage: The lineage graph records the transformations used to build each RDD and is what gives RDDs fault tolerance, since lost partitions can be recomputed from it.
9. What is partition and how spark Partitions the data?
A Partition in simple terms is a split in the input data, so partitions in spark are basically smaller logical chunks or divisions of the input data.
Spark distributes this partitioned data among the different nodes to perform distributed processing on the data.
10. What is spark core?
Spark Core is the foundation of Apache Spark, providing essential functionalities for distributed data processing. It handles:
- Memory Management and fault recovery
- Job Scheduling and task dispatching
- RDD (Resilient Distributed Dataset) abstraction for distributed data processing
- API Support in Java, Scala, Python, and R
11. What is spark driver or driver program?
A Spark driver is the process that creates and owns an instance of SparkContext. It is the cockpit of jobs and tasks execution (using DAGScheduler and Task Scheduler).
12. What is spark executors?
Executors are worker-node processes in charge of running individual tasks in a given Spark job.
They are launched at the beginning of a Spark application and typically run for the entire lifetime of an application.
13. What is worker node?
A Worker Node is a slave node. The master node assigns work, and the worker nodes actually perform the assigned tasks, processing the data stored on them.
14. What is lazy evaluation in spark?
Lazy evaluation in Spark means that the execution will not start until an action is triggered.
Lazy evaluation comes into play with Spark transformations: they are only recorded, not executed, until an action needs their results.
15. What is pair RDD in spark?
- A pair RDD in Spark is an RDD that is a distributed collection of key-value pairs.
- It is a powerful data structure because it supports operating on each key in parallel and regrouping data across the network (for example with reduceByKey or join).
16. Difference between persist() and cache() in spark?
Both persist() and cache() are Spark optimization techniques used to store intermediate data.
cache() stores the data with the default storage level (MEMORY_ONLY for RDDs), whereas with persist() the developer can choose the storage level: in memory, on disk, or a combination, as in the sketch below.
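A minimal sketch, assuming a SparkSession named spark is in scope; the input path is hypothetical:

```scala
import org.apache.spark.storage.StorageLevel

val logs = spark.sparkContext.textFile("/data/logs")   // hypothetical path

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY) on RDDs.
logs.cache()

// persist() lets you choose the storage level explicitly.
val parsed = logs.map(_.toLowerCase)
parsed.persist(StorageLevel.MEMORY_AND_DISK)

println(parsed.count())   // first action materializes and stores the data
println(parsed.count())   // subsequent actions reuse the persisted copy
```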
17. What is serialization and deserialization?
- Serialization is the process of converting an object into a stream of bytes that can be sent over the network or written to storage.
- Deserialization is the reverse process: converting those bytes back into an object the program can read and use.
18. Avoid returning null in scala code
In Scala, it’s best to avoid returning null because it can lead to NullPointerException — defeating Scala’s safer type system. Here are better alternatives:
```scala
def findUser(id: Int): Option[String] = {
  val users = Map(1 -> "Alice", 2 -> "Bob")
  users.get(id) // Returns Some(user) or None
}
```
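A short usage sketch for the function above, showing how callers handle the missing case without ever touching null:

```scala
findUser(1) match {
  case Some(name) => println(s"Found: $name")
  case None       => println("No such user")
}

// Or fall back to a default value in one expression.
val nameOrDefault = findUser(42).getOrElse("unknown")
```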
19. Difference between map() and flatmap()?
- map() applies a function to each element of an RDD and returns the result as a new RDD with exactly one output element per input element; the developer supplies the custom business logic.
- flatMap() is similar to map(), but the function may return 0, 1, or more elements for each input element, and the results are flattened into a single RDD (see the sketch below).
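A small sketch of the difference, assuming a SparkSession named spark is in scope:

```scala
val lines = spark.sparkContext.parallelize(Seq("hello world", "spark is fast"))

// map: exactly one output element per input element.
val lineLengths = lines.map(_.length)        // RDD[Int]: 11, 13

// flatMap: zero or more output elements per input element, flattened into one RDD.
val words = lines.flatMap(_.split(" "))      // RDD[String]: hello, world, spark, is, fast

println(lineLengths.collect().toList)
println(words.collect().toList)
```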
20. What are the various levels of persistence in spark?
Spark has various persistence levels to store RDDs on disk, in memory, or as a combination of both, with different replication levels: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, and OFF_HEAP.
21. What is an accumulator in a spark?
Accumulators are variables used for aggregating information (counters, sums) across the executors.
sparkContext.accumulator() (or the newer longAccumulator()/doubleAccumulator() methods) defines an accumulator variable, and its value property retrieves the result on the driver, as in the sketch below.
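A minimal sketch using the Spark 2.x accumulator API, assuming a SparkSession named spark is in scope; the bad-record counter is illustrative:

```scala
val badRecords = spark.sparkContext.longAccumulator("badRecords")

val raw = spark.sparkContext.parallelize(Seq("1", "2", "oops", "4"))
val parsed = raw.flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}

// The accumulator value is only reliable on the driver after an action has run.
parsed.count()
println(s"Bad records: ${badRecords.value}")
```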
22. What is the broadcast variable in spark?
Broadcast variables in Apache Spark are a mechanism for sharing read-only data across executors.
Without broadcast variables, such data would be shipped to the executors with every task that uses it, which can cause significant network overhead; broadcasting sends it to each executor only once (see the sketch below).
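A minimal sketch, assuming a SparkSession named spark is in scope; the lookup map is illustrative:

```scala
// Small lookup table shipped to every executor once instead of with every task.
val countryNames = Map("US" -> "United States", "IN" -> "India")
val broadcastNames = spark.sparkContext.broadcast(countryNames)

val codes = spark.sparkContext.parallelize(Seq("US", "IN", "US"))
val resolved = codes.map(code => broadcastNames.value.getOrElse(code, "unknown"))

resolved.collect().foreach(println)
```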
23. What is checkpointing in spark?
- Checkpointing stores the RDD physically to HDFS and destroys the lineage that created it.
- The checkpoint file won’t be deleted even after the Spark application is terminated.
- Checkpoint files can be used in subsequent job runs or driver program
24. What is sparkContext?
sparkContext is an entry point to Spark and is defined in org.apache.spark package and used to programmatically create Spark RDD, accumulators, and broadcast variables on the cluster.
Its object sc is the default variable available in spark-shell and it can be programmatically created using sparkContext class.
25. What is Executor memory in Spark?
Each Spark application gets its own executors on the worker nodes. Executor memory is a measure of how much memory of the worker node the application will use per executor.
It is controlled with the spark.executor.memory property or the --executor-memory flag of spark-submit.
26. Explain spark stages.
Spark stages are the physical unit of execution for the computation of multiple tasks. The Spark stages are controlled by the Directed Acyclic Graph (DAG) for any data processing and transformations on the resilient distributed datasets (RDD).
There are two types of stages in spark: ShuffleMapstage and ResultStage.
27. What is spark SQL?
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
It also lets existing Hive queries run, often significantly faster, on existing deployments and data.
28. Difference between RDD vs Dataframe & Dataset in Spark?
- RDD is a low level API whereas DataFrame/Dataset are high level APIs. With RDD, you have more control on what you do.
- A DataFrame/Dataset tends to be more efficient than an RDD. What happens inside Spark core is that a DataFrame/Dataset is converted into an optimized RDD.
29. How is Spark SQL different from HQL & SQL?
- Spark SQL: Spark SQL is a component on top of the Spark Core engine that supports SQL and Hive Query Language without changing any syntax.
- SQL: SQL is a traditional query language that interacts directly with an RDBMS.
- HQL: HQL (Hibernate Query Language) is the object-oriented query language of the Java Hibernate framework; Hibernate translates HQL into SQL and then interacts with the database.
30. What is a catalyst Optimizer?
Catalyst optimizer makes use of some advanced programming language features to build optimized queries.
Catalyst optimizer was developed using the programming construct Scala.
Catalyst Optimizer allows both Rule-based optimization and Cost-based optimization.
31. What is Spark Streaming?
Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources including (but not limited to) Kafka, Flume, and Amazon Kinesis. The processed data can be pushed out to file systems, databases, and live dashboards.
32. What is DStream?
DStream is a continuous stream of data. It receives input from various sources like Kafka, Flume, Kinesis, or TCP sockets.
It can also be a data stream generated by transforming the input stream.
At its core, DStream is a continuous stream of RDD (Spark abstraction).
33. How to create a Micro batch and its benefits?
- In micro-batching, the incoming stream is grouped into small batches (for example, every few seconds), and each batch is processed as a regular Spark job.
- Benefits: it reuses the batch engine and its fault tolerance, gives high throughput, and keeps the programming model simple, at the cost of a small amount of latency.
34. What is windowing in spark streaming?
In Spark Streaming, windowing is a technique used to perform computations over a sliding time window of incoming data streams.
It allows you to apply transformations on data collected over a specific duration, rather than processing each record individually.
35. What is the Scala programming language & its advantages?
Scala is a statically typed language on the JVM that combines object-oriented and functional programming. Its advantages include:
- Easy to pick up.
- Pretty good IDE support.
- Scalability and seamless Java interoperability.
36. What is the difference between Statically typed & Dynamically typed language?
- In statically typed languages, type-checking happens at compile time.
- In dynamically typed languages, type-checking happens at run-time.
37. What is the difference var and val in Scala?
The keywords var and val both are used to assign memory to variables.
var keyword initializes mutable variables.
val keyword initializes immutable variables.
38. What is the difference between == in java and Scala?
In Java (and C#), the == operator on object references tests reference equality, not value equality; you need equals() for that.
In Scala, == tests value equality (it delegates to equals), and reference equality is available through eq.
39. What is Typesafe in scala?
Typesafe means that the set of values that may be assigned to a program variable must fit well-defined and testable criteria.
40. What is type inference in Scala?
With type inference, Scala automatically detects the type of an expression or function result without it being explicitly specified by the user.
41. What is Unit in Scala? What is the difference between java void’s and Scala unit?
Unit is a final class defined in the scala package, i.e. scala.Unit. Unit is similar to Java's void, but with a difference:
Java's void has no value at all, whereas Scala's Unit has exactly one value, written (), which is the one and only value of type Unit.
42. What is a Scala singleton object?
A Scala singleton object is declared using the object keyword instead of class.
No instance is required to call the methods declared inside a singleton object; Scala has no static keyword, and singleton objects take its place.
43. What is a companion object in Scala?
A companion object in Scala is an object that’s declared in the same file as a class and has the same name as the class.
For instance, when the following code is saved in a file named Pizza.scala, the Pizza object is considered to be a companion object to the Pizza class.
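The original snippet is not shown here, so the following is a reconstruction sketch of what such a Pizza.scala file could look like; the fields and factory method are illustrative:

```scala
// Contents of Pizza.scala: class and companion object share the same name and file.
class Pizza(val crustType: String, val toppings: Seq[String]) {
  def describe(): String = s"$crustType crust with ${toppings.mkString(", ")}"
}

object Pizza {
  val DefaultCrust = "thin"

  // Factory method, so callers can write Pizza("cheese") without `new`.
  def apply(toppings: String*): Pizza = new Pizza(DefaultCrust, toppings)
}

val p = Pizza("cheese", "olives")
println(p.describe())   // the class and its companion can also access each other's private members
```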
44. Difference between a singleton object and a class in Scala?
An object is a singleton: an instance of a class that is guaranteed to be unique. For every object in the code, an anonymous class is created, which inherits from whatever classes you declared the object to implement. A class, by contrast, is a blueprint with fields and methods from which any number of instances can be created.
45. What is Scala Map?
Scala map is a collection of key/value pairs. Any value can be retrieved based on its key. Keys are unique in the Map, but values need not be unique.
46. what is Scala set?
Set is a collection that contains no duplicate elements. There are two kinds of Sets, the immutable and the mutable.
47. What is the use of Tuples in Scala?
A tuple is a data structure that can store elements of different data types. It is used for storing and retrieving small, fixed-size groups of values.
In Scala, tuples are immutable and store heterogeneous types of data.
48. What is Scala’s case class?
A Scala case class is like a regular class, except it is good for modeling immutable data. It is also useful in pattern matching; such a class has a default apply() method that handles object construction, as in the sketch below.
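A minimal sketch; the Employee fields are illustrative:

```scala
// Case classes get apply, copy, equals/hashCode, toString and pattern-matching support for free.
case class Employee(id: Int, name: String, dept: String)

val e1 = Employee(1, "Alice", "Data")
val e2 = e1.copy(dept = "ML")        // immutable update: returns a new instance

e2 match {
  case Employee(_, name, "ML") => println(s"$name works in ML")
  case _                       => println("someone else")
}
```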
49. What is the scala option?
Scala Option[T] is a container for zero or one element of a given type. An Option[T] can be either a Some[T] or the None object, which represents a missing value.
50. What is the use case of the App class in Scala?
Scala provides a helper class, called App, that provides the main method. Instead of writing your own main method, classes can extend the App class to produce concise and executable applications in Scala.
51. Difference between terms & types in Scala? Nill, Null, None, Nothing?
- Null: a trait whose single instance is null, similar to Java's null; it is a subtype of all reference types.
- Nil: the empty List of anything, of zero length.
- Nothing: a trait that is a subtype of every other type but has no instances.
- None: the value that represents an absent Option[T]; it is used instead of returning null and risking a NullPointerException.
52. What are traits in Scala?
In Scala, the trait is a collection of abstract and non-abstract methods. You can create a trait that can have all abstract methods or some abstract and some non-abstract methods.
A variable that is declared either by using val or var keyword in a trait gets internally implemented in the class that implements the trait.
53. Difference between Traits and abstract classes in Scala?
| Traits | Abstract Class |
| --- | --- |
| A class can mix in multiple traits (multiple inheritance of behavior). | A class can extend only one abstract class. |
| Constructor parameters are not allowed in a trait (before Scala 3). | Constructor parameters are allowed in an abstract class. |
| Trait code is interoperable with Java only if it contains no implementation. | Abstract class code is fully interoperable with Java. |
| Traits can be mixed into an object instance in Scala. | Abstract classes cannot be added to an object instance. |
54. Difference between Call-by-value and call-by-name parameter?
- With call-by-value, the argument expression is evaluated once, before the function is invoked, and that value is used throughout the function.
- With call-by-name, the expression itself is passed to the function and is evaluated only when, and each time, that parameter is used inside the function (see the sketch below).
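A small sketch of the difference; the expensive() helper is illustrative:

```scala
def expensive(): Int = { println("evaluating"); 42 }

// Call-by-value: the argument is evaluated once, before the body runs.
def byValue(x: Int): Int = x + x      // prints "evaluating" once

// Call-by-name: the expression is re-evaluated each time the parameter is used.
def byName(x: => Int): Int = x + x    // prints "evaluating" twice

byValue(expensive())
byName(expensive())
```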
55. What are Higher order functions in Scala?
A higher-order function in Scala is a function that either takes a function as an argument or returns a function. In other words, a function that works with functions is called a higher-order function.
Higher-order functions enable function composition, lambda/anonymous functions, currying, and so on.
56. What is the Pure function in Scala?
A function is called pure function if it always returns the same result for same argument values and it has no side effects like modifying an argument (or global variable) or outputting something.
57. Explain the Scala anonymous function in Scala.
In Scala, An anonymous function is also known as a function literal. A function that does not contain a name is known as an anonymous function. An anonymous function provides a lightweight function definition.
58. What is closure in Scala?
A closure is a function, whose return value depends on the value of one or more variables declared outside this function.
59. What is currying in Scala?
Currying is the process of converting a function with multiple arguments into a sequence of functions that take one argument.
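A minimal sketch; the add function is illustrative:

```scala
// A curried function takes its arguments one parameter list at a time.
def add(a: Int)(b: Int): Int = a + b

val addTen: Int => Int = add(10) _   // partially applied: the first argument is fixed
println(addTen(5))                   // 15
println(add(2)(3))                   // 5
```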
60. What is an option in Scala? Why do we use it?
Scala Option[T] is a container for zero or one element of a given type. An Option[T] can be either Some[T] or the None object, which represents a missing value. The Option type is used frequently in Scala programs; you can think of it as a type-safe alternative to returning null.
61. What is tail recursion in Scala?
Recursion is a technique where a method breaks a problem into smaller subproblems and calls itself for each of them. A recursive call is a tail call when it is the very last operation in the function; Scala can then optimize it into a loop that uses constant stack space (the @tailrec annotation enforces this), as in the sketch below.
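A minimal sketch; the factorial function is illustrative:

```scala
import scala.annotation.tailrec

// The recursive call is the last operation, so the compiler turns it into a loop
// (constant stack space); @tailrec makes the compiler enforce this.
@tailrec
def factorial(n: Int, acc: BigInt = 1): BigInt =
  if (n <= 1) acc
  else factorial(n - 1, acc * n)

println(factorial(20))
```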
62. What is yield in Scala?
For each iteration of your for loop, yield generates a value that is remembered by the for loop (behind the scenes)
63. Can we work with datasets in Python?
A simple way to get sample datasets in Python is to use the pandas read_csv method to load them directly from the internet. Note that Spark's strongly typed Dataset API is only available in Scala and Java; in PySpark you work with DataFrames instead.
64. How to join two tables using data frames?
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner").show(false)
65. How to remove duplicate records in the Dataframe?
Use distinct() to remove fully duplicate rows from a DataFrame.
Use dropDuplicates() to remove duplicate rows, optionally considering only a subset of columns.
66. How to add columns in the Dataframe?
Use withColumn() to add a new column (or replace an existing one) on a DataFrame, as shown in the sketch below.
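A minimal sketch, assuming a SparkSession named spark is in scope (as in spark-shell); the DataFrame and column names are illustrative:

```scala
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

val df = Seq(("Alice", 50000), ("Bob", 60000)).toDF("name", "salary")

val withExtras = df
  .withColumn("bonus", col("salary") * 0.1)   // derived column
  .withColumn("country", lit("US"))           // constant column

withExtras.show()
```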
67. SQL basics concepts such as Rank, Dense Rank, Row Number?
- RANK() returns the rank of each row within the partition of a result set; rows with equal values get the same rank, and the following rank is skipped (gaps).
- DENSE_RANK() is the same as RANK() except that it returns ranks without gaps.
- ROW_NUMBER() always generates unique, consecutive numbers, even when there are ties.
68. Query to find the 2nd largest number in the table.
SELECT MAX(sal) as Second_Largest FROM emp_test WHERE sal < ( SELECT MAX(sal) FROM emp_test)
69. To find duplicate records in the table?
select a.* from Employee a where rowid != (select max(rowid) from Employee b where a.Employee_num = b.Employee_num);
70. Difference between a List and Tuple?
| LIST | TUPLE |
| --- | --- |
| Lists are mutable. | Tuples are immutable. |
| Iteration over a list is comparatively slower. | Iteration over a tuple is comparatively faster. |
| Lists are better for operations such as insertion and deletion. | Tuples are appropriate for read-only access to elements. |
| Lists consume more memory. | Tuples consume less memory than lists. |
| Lists have several built-in methods. | Tuples have fewer built-in methods. |
| Unexpected changes and errors are more likely. | Being immutable, tuples are less error-prone. |
71. Difference between def and Lambda?
- lambda is a keyword that returns a function object and does not create a ‘name’. Whereas def creates a name in the local namespace.
- lambda functions are good for situations where you want to minimize lines of code as you can create a function in one line of python code.
- lambda functions are somewhat less readable for most Python users.
72. Why bigdata on the cloud preferred these days?
Running big data on the cloud brings the best of open-source software into an easy-to-use, secure environment that is seamlessly integrated and often serverless, with elastic scaling and pay-as-you-go pricing instead of managing clusters in-house.
73. What is AWS EMR?
Amazon Elastic MapReduce (EMR) is an Amazon Web Services (AWS) tool for big data processing and analysis.
Amazon EMR offers the expandable low-configuration service as an easier alternative to running in-house cluster computing.
74. How to write a UDF in the hive?
- Writing a UDF (user-defined function) lets you plug your own processing code into Hive and invoke it from a Hive query. UDFs have to be written in Java, the language Hive itself is written in.
- Create a Java class for the user-defined function that extends org.apache.hadoop.hive.ql.exec.UDF and implements one or more evaluate() methods. Put in your desired logic and you are almost there.
- Package your Java class into a JAR file (I am using Maven).
- Go to Hive CLI, add your JAR, and verify your JARs are in the Hive CLI classpath.
- CREATE TEMPORARY FUNCTION in Hive which points to your Java class.
- Use it in Hive SQL.
75. File formats row based vs column based?
In a row storage format, each record in the dataset has to be loaded and parsed into fields, and then the data for Name is extracted.
With the column-oriented format, it can directly go to the Name column as all the values for that column are stored together.
It doesn’t need to go through the whole record.
76. What is RDD?
Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects.
Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
77. How to join two larger tables in spark?
Spark uses a sort-merge join to join large tables. It hashes each row's join key on both tables and shuffles rows with the same hash into the same partition; the keys are then sorted on both sides and the sort-merge algorithm is applied.
78. What are Bucketed tables?
Bucketing splits a table's data into a fixed number of nearly equally sized files (buckets) based on the hash of a column.
In the table directory, bucket numbering is 1-based and every bucket is a file.
Bucketing is a standalone feature: you can bucket a table without partitioning it.
79. How to read the parquet file format in spark?
DataFrameReader provides parquet() function (spark.read.parquet) to read the parquet files and creates a Spark DataFrame.
80. How to tune spark executor, cores and executor memory?
- Number of cores per executor = number of concurrent tasks an executor can run (about 5 is a common rule of thumb).
- From the total available cores in one node (leaving one for the OS and daemons), we typically arrive at around 3 executors per node.
- The property spark.executor.memory (or --executor-memory) specifies the amount of memory to allot to each executor, usually the node memory minus overhead divided by the executors per node.
81. Default partition size in spark?
The default number of partitions is based on the following. On the HDFS cluster, by default, Spark creates one Partition for each block of the file.
- In Version 1 Hadoop the HDFS block size is 64 MB.
- In Version 2 Hadoop the HDFS block size is 128 MB.
82. Is there any use for running the spark program on a single machine?
Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts.
It is also possible to run these daemons on a single machine for testing.
83. how do find how many resources are available in YARN?
The ResourceManager web UI (or commands such as yarn top) shows the total and available memory and vcores in the cluster.
Resource requests are granted in increments: yarn.resource-types.memory-mb.increment-allocation (default 1024 MB) and yarn.resource-types.vcores.increment-allocation control the step size, and a request that is not a multiple of the increment is rounded up to the nearest increment by the Fair Scheduler.
84. Differences between cluster and client Mode?
- In cluster mode, the driver will get started within the cluster in any of the worker machines. So, the client can fire the job and forget it.
- In client mode, the driver will get started with the client. So, the client has to be online and in touch with the cluster.
85. Explain the dynamic allocation in spark.
Spark dynamic allocation is a feature allowing your Spark application to automatically scale up and down the number of executors.
Only the number of executors is scaled dynamically; the memory size and the number of cores of each executor must still be set explicitly in your application or on the spark-submit command.
86. Difference between partition by and cluster by in a hive?
- In Hive partitioning, the table is divided into a number of partitions, and these partitions can be further subdivided into more manageable parts known as Buckets/Clusters. Records with the same bucketed column will be stored in the same bucket. The “clustered by” clause is used to divide the table into buckets.
- Cluster By is a short-cut for both Distribute By and Sort By. Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.
87. How to choose a partitioning column in the hive? and which column shouldn’t use partition and why?
Choose a column that is frequently used in filter/search queries and has low cardinality.
For example, if you partition by country name, at most about 195 partitions will be created, and that number of directories is manageable for Hive.
On the other hand, do not create partitions on columns with very high cardinality (such as user IDs or timestamps), since that produces a huge number of tiny partitions.
88. How to transfer data from unix system to HDFS?
Use hdfs dfs -put (or -copyFromLocal) to copy a local file into HDFS, e.g.: hdfs dfs -put test /hadoop
89. Can we extract only different data from two different tables?
Using the Join column we can extract data.
SELECT t1.columnname, t2.columnname FROM tablename1 t1 JOIN tablename2 t2 ON t1.columnname = t2.columnname ORDER BY columnname;
90. What is the difference between SQL vs NoSQL?
- SQL databases are vertically scalable while NoSQL databases are horizontally scalable. SQL databases have a predefined schema.
- NoSQL databases use dynamic schema for unstructured data. SQL requires specialized DB hardware for better performance while NoSQL uses commodity hardware.
91. How to find a particular text name in HDFS?
You can use cat command on HDFS to read regular text files.
hdfs dfs -cat /path/to/file.csv
92. Explain about sqoop ingestion process.
Apache Sqoop is a data ingestion tool designed for efficiently transferring bulk data between Apache Hadoop and structured data-stores such as relational databases, and vice-versa.
93. Explain about sort Merge Bucket Join?
A Sort Merge Bucket (SMB) join in Hive is used mainly because it places no limit on file, partition, or table size.
SMB join works best when the tables are large.
In an SMB join the tables are bucketed and sorted on the join columns, and all tables must have the same number of buckets.
94. Explain about tungsten.
Tungsten is a Spark SQL component that provides increased performance by rewriting Spark operations in bytecode, at runtime.
95. How can we join two bigger tables in spark?
Either using Sort Merge Joins if we are joining two big tables, or Broadcast Joins if at least one of the datasets involved is small enough to be stored in the memory of the single executors.
A ShuffleHashJoin is the most basic way to join tables in Spark
96. Explain about left outer join.
The left outer join returns a resultset table with the matched data from the two tables and then the remaining rows of the left table and null from the right table’s columns.
97. How to count the lines in a file by using a Linux command?
Use wc -l, e.g. wc -l filename.log
98. How to achieve map side joins in the hive?
A map-side join is only possible when the smaller table fits in memory (by default smaller than 25 MB, controlled by hive.mapjoin.smalltable.filesize). We can also convert a right outer join to a map-side join in Hive.
99. When we use the select command does it go to reducer in Hive?
A plain SELECT with no aggregation runs as a map-only job; a reducer is used only when aggregation or sorting of the data is required.
100. How to validate the data once the ingestion is done?
Data validation is used as a part of processes such as ETL (Extract, Transform, and Load) where you move data from a source database to a target data warehouse so that you can join it with other data for analysis.
Data validation helps ensure that when you perform analysis, your results are accurate.
Steps to data validation:
- Determine data sample: validate a sample of your data rather than the entire set.
- Validate the database: Before you move your data, you need to ensure that all the required data is present in your existing database.
- Validate the data format: Determine the overall health of the data and the changes that will be required of the source data to match the schema in the target.
Methods for data validation:
- Scripting: Data validation is commonly performed using a scripting language.
- Enterprise tools: Enterprise tools are available to perform data validation.
- Open source tools: Open source options are cost-effective, and if they are cloud-based, can also save you money on infrastructure costs.
101. What is the use of split by the command in sqoop?
split-by in sqoop is used to create input splits for the mapper. It is very useful for the parallelism factor as splitting imposes the job to run faster.
Hadoop MAP Reduce is all about divide and conquer. When using the split-by option, you should choose a column that contains values that are uniformly distributed.
102. Difference between DataFrame vs Datasets?
- A DataFrame gives a schema view of the data; it is an abstraction in which data is organized as named, typed columns, much like a table in a relational database. As with RDDs, execution on DataFrames is lazily triggered.
- Datasets are an extension of DataFrames. They offer two API styles, strongly typed and untyped. A Dataset is by default a collection of strongly typed JVM objects, unlike a DataFrame, and it also uses Spark's Catalyst optimizer.
103. Difference between schema on read vs schema on write?
- Schema on read differs from schema on write because the schema is applied only when reading the data.
- Structure is applied to the data only when it is read, which allows unstructured data to be stored in the system.
- The main advantages of schema on write are precision and query speed.
104. Different types of partition in the hive?
Types of Hive Partitioning-
- Static Partitioning
- Dynamic Partitioning
105. How to find counts based on age group?
SELECT age_group, COUNT(*) FROM table_name GROUP BY age_group;
106. How to find a word in a log file by using pyspark?
```python
input_file = sc.textFile("/path/to/text/file")
pairs = input_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("/path/to/output/")
```
107. Explain about Executor node in spark.
Executors are worker nodes’ processes in charge of running individual tasks in a given Spark job.
They are launched at the beginning of a Spark application and typically run for the entire lifetime of an application.
Once they have run the task they send the results to the driver.
108. Difference between Hadoop & spark?
- Hadoop reads and writes files to HDFS, Spark processes data in RAM using a concept known as an RDD, Resilient Distributed Dataset.
- Hadoop has its own storage system (HDFS), while Spark has no storage system of its own and relies on external storage such as HDFS, which can easily be grown by adding more nodes.
- They both are highly scalable as HDFS storage can go more than hundreds of thousands of nodes.
- Spark can also integrate with other storage systems like S3 bucket.
109. Query to find duplicate value in SQL?
Use the GROUP BY clause to group all rows by the target column (the column you want to check for duplicates).
Then use COUNT(*) with a HAVING clause to keep only groups that occur more than once, e.g. SELECT col, COUNT(*) FROM t GROUP BY col HAVING COUNT(*) > 1;
110. Difference between Row number and Dense Rank in SQL?
- RANK(): generates the rank of each row within an ordered set of values; after a tie, the next rank jumps to the row number of that row, leaving gaps.
- DENSE_RANK(): generates the next consecutive number instead, so there are no gaps after ties. The Spark example below illustrates the difference.
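A hedged Spark (Scala) sketch of the window functions, assuming a SparkSession named spark is in scope; the sample scores are illustrative:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, rank, row_number}
import spark.implicits._

val marks = Seq(("A", 90), ("B", 90), ("C", 85), ("D", 80)).toDF("name", "score")
val w = Window.orderBy(col("score").desc)

marks
  .withColumn("rank", rank().over(w))              // 1, 1, 3, 4  (gap after the tie)
  .withColumn("dense_rank", dense_rank().over(w))  // 1, 1, 2, 3  (no gap)
  .withColumn("row_number", row_number().over(w))  // 1, 2, 3, 4  (always unique)
  .show()
```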
111. What are the various hive optimization techniques?
- Tez-Execution Engine in Hive: Tez Execution Engine – Hive Optimization Techniques, to increase the Hive performance.
- Usage of Suitable File Format in Hive.
- Hive Partitioning.
- Bucketing in Hive.
112. How MapReduce will work? Explain?
MapReduce can perform distributed and parallel computations using large datasets across a large number of nodes.
A MapReduce job usually splits the input dataset and then processes each split independently by map tasks in a completely parallel manner.
The map output is then sorted and fed as input to the reduce tasks. Data flows between them as key-value pairs.
113. How many types of tables are in Hive?
Hive knows two different types of tables: The internal table and the External table. The Internal table is also known as the managed table.
114. How to drop a table in HBase?
Dropping a Table using HBase Shell Using the drop command, you can delete a table.
Before dropping a table, you have to disable it.
hbase(main):018:0> disable 'emp'
0 row(s) in 1.4580 seconds
hbase(main):019:0> drop 'emp'
0 row(s) in 0.3060 seconds
115. What is the difference between Batch and real-time processing?
- In real-time processing, the processor needs to be responsive and active all the time; in batch processing, the processor only needs to be busy when work is assigned to it.
- Real-time processing needs high-end computer architecture and hardware; a normal computer specification can work for batch processing.
- Time to complete the task is very critical in real-time processing.
- Real-time processing is expensive, whereas batch processing is cost-effective.
116. How can Apache Spark be used alongside Hadoop?
Running Alongside Hadoop You can run Spark alongside your existing Hadoop cluster by just launching it as a separate service on the same machines.
To access Hadoop data from Spark, just use an hdfs:// URL (typically hdfs://<namenode>:9000/path, but you can find the right URL on your Hadoop NameNode's web UI).
117. What are data explosion and lateral view in Hive?
In Hive, the lateral view explodes the array data into multiple rows. In other words, the lateral view expands the array into rows. Hive has a way to parse array data type using LATERAL VIEW.
Use LATERAL VIEW with a UDTF to generate zero or more output rows for each input row. explode is one such user-defined table-generating function (UDTF).
118. What are combiner, shuffling, and sorting in Mapreduce?
- Shuffling is the process of transferring the mappers' intermediate output to the reducers. Each reducer gets one or more keys and their associated values.
- In the sorting phase, merging and sorting of the map output takes place.
- A combiner combines key/value pairs with the same key before the shuffle. Each combiner may run zero, one, or multiple times.
119. What is the difference between reduceByKey and GroupByKey?
- groupByKey operates on an RDD of key-value pairs and shuffles all the values for a key across the network before grouping them, so no combining function is required as input.
- reduceByKey is a higher-order method that takes an associative binary operator and merges the values of each key; it combines values locally on each partition before shuffling, which usually makes it much more efficient than groupByKey.
120. What are static and dynamic partitions in Hive?
- Dynamic partition loads the data from non partitioned table. Dynamic Partition takes more time to load data compared to static partition. When you have large data stored in a table then Dynamic partition is suitable.
- Static partitions are preferred when loading big data in a Hive table and they save your time in loading data compared to dynamic partitions. In static partitioning, we need to specify the partition column value in each and every LOAD statement.
121. Udf example in Hive?
A UDF processes one or several columns of one row and outputs one value.
For example: SELECT lower(str) FROM table; for each row in "table", the lower UDF takes one argument, the value of "str", and outputs one value, the lowercase representation of "str".
122. What is Serde in Hive?
The SerDe interface handles both serialization and deserialization and also interprets the results of serialization as individual fields for processing.
It allows Hive to read data from a table and write it back to HDFS in any format; users can write a SerDe for their own data formats.
123. How to check the file size in Hadoop?
You can use the hadoop fs -ls (or hdfs dfs -ls) command to check file sizes; sizes are displayed in bytes, and adding -h prints them in a human-readable format (KB, MB, GB).
124. How to submit the spark Job?
Using –deploy-mode, you specify where to run the Spark application driver program.
- Cluster Mode: In cluster mode, the driver runs on one of the worker nodes, and this node shows as a driver on the Spark Web UI of your application. cluster mode is used to run production jobs.
- Client mode: In client mode, the driver runs locally where you are submitting your application from. client mode is mainly used for interactive and debugging purposes.
Using the –master option, you specify what cluster manager to use to run your application.
Spark currently supports Yarn, Mesos, Kubernetes, Stand-alone, and local.
125. What is vectorization and why is it used?
Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time
126. What are Complex data types in Hive?
ARRAY, MAP, and STRUCT.
127. What is sampling in Hive?
- Prepare the dataset.
- Create a Hive Table and Load the Data.
- Sampling using Random function.
- Create a Bucketed table.
- Load data into the Bucketed table.
- Sampling using bucketing.
- Block sampling in the hive.
128. What are the different types of XML files in Hadoop?
- resource-types.xml
- node-resources.xml
- yarn-site.xml
129. What is case class?
A Case Class is just like a regular class, which has a feature for modeling unchangeable data. It is also constructive in pattern matching.
It is defined with the case modifier; thanks to this keyword, the compiler generates boilerplate (apply, copy, equals, hashCode, toString, pattern-matching support) that would otherwise have to be repeated in many places with little or no alteration.
130. Difference between broadcast and accumulators?
- Broadcast variables: used to efficiently, distribute large values.
- Accumulators: used to aggregate the information of a particular collection.
131. What is the difference between spark context and spark session?
- SparkContext has been available since Spark 1.x versions and it’s an entry point to Spark when you want to program and use Spark RDD. Most of the operations/methods or functions we use in Spark come from SparkContext for example accumulators, broadcast variables, parallelize, and more.
- SparkSession is essentially a combination of SQLContext, HiveContext, and future StreamingContext.
132. What is the operation of DataFrame?
A Spark DataFrame can be said to be a distributed data collection that is organized into named columns and is also used to provide operations such as filtering, computation of aggregations, grouping and also can be used with Spark SQL.
133. Explain why spark is preferred over MapReduce.
The benefits of Apache Spark over Hadoop MapReduce are given below: Processing at high speeds:
The process of Spark execution can be up to 100 times faster due to its inherent ability to exploit the memory rather than using disk storage.
134. What is the difference between partitioning and Bucketing?
Bucketing decomposes data into more manageable or equal parts.
With partitioning, there is a possibility that you can create multiple small partitions based on column values.
If you go for bucketing, you are restricting the number of buckets to store the data.
135. How to handle incremental data in big data?
- Move the existing HDFS data to a temporary folder.
- Run a fresh import in last-modified (incremental) mode.
- Merge this fresh import with the old data saved in the temporary folder.
136. Explain the Yarn Architecture.
Apache YARN framework contains a Resource Manager (master daemon), Node Manager (slave daemon), and an Application Master.
YARN is the main component of Hadoop v2.0. YARN helps to open up Hadoop by allowing it to process and run data for batch processing, stream processing, interactive processing, and graph processing which are stored in HDFS.
In this way, It helps to run different types of distributed applications other than MapReduce.
137. What is incremental sqoop?
Incremental imports mode can be used to retrieve only rows newer than some previously imported set of rows. Append mode works for numerical data that is incrementing over time, such as auto-increment keys
138. Why do we use Hbase and how does it store data?
HBase provides a flexible data model and low-latency access to small amounts of data stored in large data sets.
HBase on top of Hadoop will increase the throughput and performance of distributed cluster setup.
In turn, it provides faster random reads and write operations
139. What are various optimization techniques in the hive?
Apache Hive Optimization Techniques-
- Partitioning.
- Bucketing.
140. Sqoop command to exclude tables while retrieval?
sqoop import-all-tables --connect jdbc:mysql://localhost/sqoop --username root --password hadoop --target-dir '/Sqoop21/AllTables' --exclude-tables <table1>,<table2>
141. How to create sqoop password alias?
sqoop import --connect jdbc:mysql://database.example.com/employees --username dbuser --password-alias mydb.password.alias
Similarly, if the command-line option is not preferred, the password can be stored in a file supplied with the --password-file option.
142. What is sqoop job optimization?
To optimize performance, set the number of map tasks to a value lower than the maximum number of connections that the database supports.
143. What are sqoop boundary queries and split by usage?
- The boundary query is used for splitting the value according to id_no of the database table. To boundary query, we can take a minimum value and maximum value to split the value.
- split-by in sqoop is used to create input splits for the mapper. It is very useful for the parallelism factor as splitting imposes the job to run faster.
144. What is hbase compaction technique and write operation hbase using spark?
HBase Minor Compaction: The procedure of combining the configurable number of smaller HFiles into one large HFile is what we call Minor compaction.
The hbase-spark connector provides an HBaseContext to let Spark interact with HBase.
HBaseContext pushes the configuration to the Spark executors and allows them to maintain an HBase connection per executor.
145. What are hive-managed Hbase tables and how to create them?
Managed tables are Hive-owned tables where the entire lifecycle of the table's data is managed and controlled by Hive.
External tables are tables where Hive has loose coupling with the data.
Replication Manager replicates external tables successfully to a target cluster.
CREATE [EXTERNAL] TABLE foo(…) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' TBLPROPERTIES ('hbase.table.name' = 'bar');
146. How Hbase can be a Distributed database?
Hbase is one of the NoSql column-oriented distributed databases available in Apache Foundation.
HBase gives more performance for retrieving fewer records rather than Hadoop or Hive.
It’s very easy to search for given any input value because it supports indexing, transactions, and updating.
147. What is a hive metastore and how to access that?
Metastore is the central repository of Apache Hive metadata. It stores metadata for Hive tables (like their schema and location) and partitions in a relational database.
It provides client access to this information by using metastore service API.
Hive metastore consists of two fundamental units:
- A service that provides metastore access to other Apache Hive services.
- Disk storage for the Hive metadata, which is separate from HDFS storage.
148. What is Hive Managed and External tables
Managed tables are Hive-owned tables where the entire lifecycle of the tables’ data are managed and controlled by Hive.
External tables are tables where Hive has loose coupling with the data.
Replication Manager replicates external tables successfully to a target cluster.
The managed tables are converted to external tables after replication.
149. How partition can be restored?
Using MSCK REPAIR TABLE <table_name>, which re-adds partitions that exist in HDFS but are missing from the metastore.
150. What is data loading in the hive?
Hive provides us the functionality to load pre-created table entities either from our local file system or from HDFS.
The LOAD DATA statement is used to load data into the hive table.
151. How to automate Hive jobs?
You can use the Hive CLI; it is very easy to do such jobs. Write a shell script in Linux (or a .bat file on Windows) with entries like the following:
$HIVE_HOME/bin/hive -e 'select a.col from tab1 a';
or, if you have a script file:
$HIVE_HOME/bin/hive -f /home/my/hive-script.sql
Make sure you have set $HIVE_HOME in your environment. The script can then be scheduled, for example with cron or Oozie.
152. Where do we run the job in Spark?
The spark-submit script in Spark’s bin directory is used to launch applications on a cluster.
153. How to allocate resources in Spark?
Resources allocation – Dynamic/Static; Upstream or Downstream application.
154. Difference between the Edge node and Data Node?
The majority of the work is assigned to worker nodes, which store most of the data and perform most of the calculations. Edge nodes facilitate communication from end users to the master and worker nodes.
155. What is Hive context?
HiveContext is a superset of SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. And if you want to work with Hive you have to use HiveContext.
156. How to read files from HDFS or other sources in Spark?
Use textFile() and wholeTextFiles() methods of the SparkContext to read files from any file system and to read from HDFS, you need to provide the hdfs path as an argument to the function.
157. How to add a custom schema to rdd?
In Spark, a schema is a StructType containing an array of StructField objects.
Each StructField has 4 parameters: the column name; the data type of that column; a Boolean indicating whether values in the column can be null; and metadata, an optional parameter used to attach additional information about the column. The sketch below builds a schema and applies it to an RDD.
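A minimal sketch, assuming a SparkSession named spark is in scope; the rows and column names are illustrative:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val rowRdd = spark.sparkContext.parallelize(Seq(Row("Alice", 30), Row("Bob", 25)))

val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age",  IntegerType, nullable = true)
))

// Attach the schema to the RDD[Row] to get a DataFrame.
val df = spark.createDataFrame(rowRdd, schema)
df.printSchema()
```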
158. How to convert dataframe to RDD?
To convert a DataFrame to an RDD, simply use the .rdd method: rdd = df.rdd. The caveat is that this returns an RDD of Row objects rather than an RDD of plain values.
To get a regular tuple-based RDD, run: rdd = df.rdd.map(tuple)
159. Difference between case class and class?
A class can extend another class, whereas a case class can not extend another case class (because it would not be possible to correctly implement their equality).
160. What is optimization technique in spark?
Spark optimization techniques are used to modify the settings and properties of Spark to ensure that the resources are utilized properly and the jobs are executed quickly. All this ultimately helps in processing data efficiently.
The most popular Spark optimization techniques are listed below:
- Data serialization: an in-memory object is converted into another format that can be stored in a file or sent over a network.
- Java serialization: the ObjectOutputStream framework is used for serializing objects.
- Kryo serialization: to improve performance, classes should be registered using the registerKryoClasses method.
- Data structure tuning: we can reduce memory consumption in Spark by avoiding Java features that add overhead (wrapper objects, nested structures, many small objects).
- Garbage collection optimization: the G1GC collector is recommended for Spark applications because it handles growing heaps well; GC tuning based on the generated logs is essential to control unexpected application behavior.
161. What is unit data type in Scala?
The Unit type in Scala is used as a return statement for a function when no value is to be returned.
The Unit type can be compared to the void data type of other programming languages like Java.
162. What is boundary query in sqoop?
The boundary query is used for splitting the value according to id_no of the database table.
To boundary query, we can take a minimum value and maximum value to split the value.
To make split using boundary queries, we need to know all the values in the table.
163. What is the use of sqoop eval command?
sqoop eval command allows users to execute user-defined queries against respective database servers and preview the result in the console. So, the user can expect the resultant table data to import.
Using eval, we can evaluate any type of SQL query that can be either DDL or DML statement.
164. How can we decide number of bucketing?
The bucket for a row is determined by hashFunction(bucketingColumn) mod numOfBuckets. numOfBuckets is chosen when you create the table with the CLUSTERED BY ... INTO n BUCKETS clause. The hash function output depends on the type of the column chosen.
165. Is it possible to bucketing and partitioning on same column?
Partitioning and bucketing can be combined on the same table: partitioning divides the data into directories on HDFS (each directory is a partition), and bucketing then splits the data within each partition into files. However, the exact same column cannot be used for both, because the partition column is a virtual column not stored in the data files and so cannot appear in the CLUSTERED BY clause.
166. How to do optimized joins in Hive?
- Use Tez to speed up execution: Apache Tez is an execution engine that makes queries run faster.
- Enable compression in Hive: compression techniques reduce the amount of data being transferred.
- Use the ORC file format: ORC (Optimized Row Columnar) is great when it comes to Hive performance tuning.
167. How to optimize join of 2 big tables?
Use the Bucket Map Join. For that, the number of buckets in one table must be a multiple of the number of buckets in the other table.
It can be activated by executing set hive.optimize.bucketmapjoin=true; before the query.
If the tables don't meet the conditions, Hive will simply perform the normal inner join.
If both tables have the same number of buckets and the data is sorted by the bucket keys, Hive can perform the faster Sort-Merge Join.
168. What are major issues faced in spark development?
- Debugging: although Spark code can be written in Scala, compile-time checks catch only part of the problems; you will encounter many run-time exceptions while running a Spark job, often because of the data: sometimes a data-type mismatch (types are inferred from the data) and sometimes null values. So expect several iterations of run-time debugging.
- Optimization: optimizing Spark code is a job in itself. You need to optimize both the code and the resource allocation; well-written code with good logic often performs badly because of badly done resource allocation.
169. What is dynamic partitioning?
Dynamic partitioning provides flexibility by creating partitions automatically based on the data being inserted into the table, as opposed to static partitioning, where the partition value is specified explicitly in the query.
170. What types of transformations do we perform in spark?
- Narrow transformation
- Wide transformation
171. How to load data in hive table?
- Using Insert Command
- table to table load
172. Difference between Map Vs Map Partition?
mapPartitions is a transformation similar to map.
With map, a function is applied to every element of an RDD, producing one output element per input element in the resulting RDD.
With mapPartitions, the function is applied to each partition of the RDD instead of each element: it receives an iterator over the partition and runs once per partition (see the sketch below).
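A small sketch of the difference, assuming a SparkSession named spark is in scope:

```scala
val rdd = spark.sparkContext.parallelize(1 to 10, numSlices = 2)

// map: the function runs once per element.
val squared = rdd.map(x => x * x)

// mapPartitions: the function runs once per partition and receives an iterator,
// which is handy for per-partition setup such as opening a DB connection once.
val partitionSums = rdd.mapPartitions(iter => Iterator(iter.sum))

println(squared.collect().toList)        // squares of 1..10
println(partitionSums.collect().toList)  // e.g. List(15, 40) with two partitions
```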
173. If we have some header information in a file how to read from it, and how to convert it to dataset or dataframe?
We can set the option header to true while reading the file, as in the sketch below.
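A minimal sketch, assuming a SparkSession named spark is in scope; the file path and the Person columns are hypothetical:

```scala
case class Person(name: String, age: Int)

val df = spark.read
  .option("header", "true")        // treat the first line as column names
  .option("inferSchema", "true")   // or supply an explicit schema instead
  .csv("/path/to/people.csv")

import spark.implicits._
val ds = df.as[Person]             // typed Dataset once the columns match the case class
ds.show()
```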
174. Difference between case class vs Struct type?
In Spark, a case class describes the schema at compile time and is used to create strongly typed Datasets, whereas a StructType (made of StructField values) describes the schema programmatically at runtime and is used for DataFrames. A case class gives compile-time type safety; a StructType can be built or modified dynamically.
175. What is sort by vs Order by in hive?
Hive sort by and order by commands are both used to fetch data in sorted order. The main differences are:
- SORT BY sorts the data within each reducer, so it may use multiple reducers and the final output is only partially ordered. Example: hive> SELECT E.EMP_ID FROM Employee E SORT BY E.empid;
- ORDER BY guarantees a total ordering of the output, which forces all the data through a single reducer. Example: hive> SELECT E.EMP_ID FROM Employee E ORDER BY E.empid;
176. How to increase the performance of Sqoop?
- Controlling the amount of parallelism that Sqoop will use to transfer data is the main way to control the load on your database.
- Using more mappers will lead to a higher number of concurrent data transfer tasks, which can result in faster job completion.
- However, it will also increase the load on the database as Sqoop will execute more concurrent queries.
177. While sqooping some data loss. how to handle that?
First quantify the loss: compare row counts (and checksums of key columns) between the source table and the data landed in HDFS; Sqoop's --validate option can do a basic row-count comparison.
Then re-import only the affected range (for example by re-running the job with a suitable --where filter or the incremental import options) rather than the whole table, and fix the root cause, such as dropped connections, type-conversion errors, or killed mappers, before the next run.
178. How to update record in Hbase table?
Using the put command you can insert or update a record in an HBase table; putting to an existing row key and column overwrites (versions) the old value.
Syntax: put 'table_name', 'row_key', 'column_family:column', 'value'
179. What happens when sqoop fails in between the large data import job?
If a Sqoop job fails partway through, re-running it can cause insert collisions in some cases or duplicated data in others.
Since Sqoop breaks the export process into multiple transactions, a failed export job may leave partial data committed to the database; using a staging table (--staging-table) avoids exposing partial results.
180. What are hadoop components and their services?
- HDFS: the Hadoop Distributed File System is the backbone of Hadoop; it runs on Java and stores the data used by Hadoop applications, and its shell acts as a command interface to interact with Hadoop. The two components of HDFS are the NameNode and the DataNodes. The NameNode manages the file-system namespace, coordinates all DataNodes, and maintains the metadata, recording changes (such as deletions) in the edit log. DataNodes (slave nodes) hold the actual blocks and therefore need large storage space for read and write operations.
- Yarn: It’s an important component in the ecosystem and called an operating system in Hadoop which provides resource management and job scheduling task.
- Hbase: an open-source, non-relational (NoSQL) database that stores all types of data and does not support SQL. It runs on top of HDFS and is written in Java.
- HBase master, Regional Server: The HBase master is responsible for load balancing in a Hadoop cluster and controls the failover. They are responsible for performing administration role. The regional server’s role would be a worker node and responsible for reading, writing data in the cache.
- Sqoop: It is a tool that helps in data transfer between HDFS and MySQL and gives hand-on to import and export data
- Apache spark: It is an open-source cluster computing framework for data analytics and an essential data processing engine. It is written in Scala and comes with packaged standard libraries.
- Apache Flume: a distributed service for collecting large amounts of data from sources (such as web servers), buffering it, and delivering it to HDFS. Its three components are the source, the channel, and the sink.
- MapReduce: It is responsible for data processing and acts as a core component of Hadoop. Map Reduce is a processing engine that does parallel processing in multiple systems of the same cluster.
- Apache Pig: Data Manipulation of Hadoop is performed by Apache Pig and uses Pig Latin Language. It helps in the reuse of code and easy to read and write code.
- Hive: It is an open-source platform for performing data warehousing concepts; it manages to query large data sets stored in HDFS. It is built on top of the Hadoop Ecosystem. the language used by Hive is Hive Query language.
- Apache Drill: Apache Drill is an open-source SQL engine which process non-relational databases and File system. They are designed to support Semi-structured databases found in Cloud storage.
- Zookeeper: It is an API that helps in distributed Coordination. Here a node called Znode is created by an application in the Hadoop cluster.
- Oozie: A Java web application that schedules and manages workflows in a Hadoop cluster. Because it exposes web service APIs, jobs can be controlled from anywhere; it is popular for coordinating multiple jobs effectively.
181. What are important configuration files in Hadoop?
- hadoop-env.sh: Specifies environment variables, such as JAVA_HOME, used by the Hadoop daemons (bin/hadoop).
- core-site.xml: Core runtime settings, most importantly fs.defaultFS (the URI of the default file system) and common I/O settings.
- hdfs-site.xml: HDFS daemon settings, such as the replication factor (dfs.replication) and the NameNode and DataNode storage directories.
- mapred-site.xml: MapReduce runtime settings, such as the framework to use (mapreduce.framework.name).
182. What is rack awareness?
With the rack awareness policy, HDFS places block replicas on different racks, so the failure of a single rack does not cause data loss.
Rack awareness also conserves network bandwidth, because block transfers happen within a rack wherever possible.
It improves cluster performance and provides high data availability.
183. Problem with having lots of small files in HDFS? and how to overcome?
A small file is one that is significantly smaller than the HDFS block size (128 MB by default in Hadoop 2.x). If you are storing small files you probably have a lot of them, and the problem is that HDFS cannot efficiently handle a very large number of files: every file, directory and block is an object held in the NameNode's memory, so millions of small files put pressure on the NameNode and slow down processing. Common ways to overcome this:
- Hadoop Archive
- Sequence files
184. Main difference between Hadoop 1 and Hadoop 2?
Hadoop 1.x is a single-purpose system: it can only be used for MapReduce-based applications. Comparing the components of Hadoop 1.x and 2.x, Hadoop 2.x adds one new component, YARN (Yet Another Resource Negotiator), which separates resource management from processing and allows engines other than MapReduce to run on the cluster.
185. What is block scanner in HDFS?
The Block Scanner is a background process on every DataNode that periodically reads the block replicas stored on that node and verifies their checksums in order to identify corrupt blocks.
The checksum, generated when the data is first written to HDFS, makes it possible to detect corruption that occurs on disk or during data transmission.
186. What do you mean by high availability of name node? How is it achieved?
In Hadoop 2.x there are two NameNodes, one in active state and the other in passive (standby) state at any point of time. High availability is achieved by keeping the standby NameNode in sync with the active one through shared edit logs (for example the Quorum Journal Manager) and by using ZooKeeper-based automatic failover, so the standby can take over immediately if the active NameNode goes down.
187. Explain counters in MapReduce?
A Counter in MapReduce is a mechanism used for collecting and measuring statistical information about MapReduce jobs and events.
Counters keep the track of various job statistics in MapReduce like number of operations occurred and progress of the operation.
Counters are used for Problem diagnosis in MapReduce.
188. Why the output of map tasks are spilled to local disk and not in HDFS?
Execution of map tasks results into writing output to a local disk on the respective node and not to HDFS.
Reason for choosing local disk over HDFS is, to avoid replication which takes place in case of HDFS store operation.
Map output is intermediate output which is processed by reduce tasks to produce the final output.
189. Define Speculative execution?
Speculative execution is an optimization technique in which the system performs work that may not be needed, before it is known whether it will actually be required, in order to avoid the delay of doing it later. In Hadoop, if a task runs much slower than expected, the framework launches a duplicate (speculative) copy of that task on another node and uses the output of whichever copy finishes first, killing the other.
190. Is it legal to set the number of reducer tasks to zero?
Yes, it is legal to set the number of reduce tasks to zero if there is no need for a reducer. In that case the outputs of the map tasks are written directly to HDFS, at the path specified by setOutputPath.
191. Where does the data of hive table gets stored?
Hive stores data at the HDFS location /user/hive/warehouse folder if not specified a folder using the LOCATION clause while creating a table.
192. Why HDFS is not used by hive metastore for storage?
Because HDFS is comparatively slow and write-once, it is not suited to the low-latency, random lookups and frequent updates that the metastore requires. The metadata is therefore kept in a relational database (embedded Derby by default, typically MySQL or PostgreSQL in production) so it can be queried and updated quickly.
193. When should we use sort by and order by?
For a large dataset, SORT BY is preferred: each reducer sorts its own portion of the data before the results are combined, which improves performance (the output is sorted only within each reducer).
With ORDER BY, performance degrades on large datasets because all the data is passed through a single reducer to produce a total ordering, which increases the load and makes the query take longer.
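A minimal sketch of the difference, run here through spark.sql() on a hypothetical sales table (the table and column names are assumptions); in Hive, ORDER BY funnels all rows through a single reducer, while SORT BY only sorts within each reducer:

```scala
// ORDER BY: one global ordering (in Hive, a single reducer sorts everything).
spark.sql("SELECT * FROM sales ORDER BY amount DESC").show()

// SORT BY: each reducer/partition sorts only its own rows, so the overall
// output is only partially ordered but the query scales much better.
spark.sql("SELECT * FROM sales SORT BY amount DESC").show()
```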
194. How hive distribute in the rows into buckets?
Hive assigns a row to a bucket by hashing the bucketing column and taking the hash modulo the number of buckets. Similarly, the DISTRIBUTE BY clause uses the listed columns to distribute rows among reducers:
all rows with the same DISTRIBUTE BY column values go to the same reducer.
195. What do you mean by data locality?
In Hadoop, Data locality is the process of moving the computation close to where the actual data resides on the node, instead of moving large data to computation.
This minimizes network congestion and increases the overall throughput of the system.
196. What are the installation modes in Hadoop?
- Standalone Mode.
- Pseudo-Distributed Mode.
- Fully Distributed Mode.
197. What is the role of combiner in Hadoop?
The combiner plays a key role in reducing network congestion. Acting as a "mini-reducer", its main job is to locally aggregate the output of each mapper before it is passed to the reducer.
198. What is the role of partitioner in Hadoop?
The Partitioner in MapReduce controls the partitioning of the key of the intermediate mapper output.
A hash function over the key (or a subset of the key) is used to derive the partition. The total number of partitions is equal to the number of reduce tasks.
199. Difference between RDBMS and NoSQL?
- RDBMS is a relational database, while NoSQL databases are distributed, non-relational stores with no fixed relations between the data they hold.
- RDBMS works with structured data and a fixed schema (primary keys, foreign keys), whereas NoSQL is designed to also handle semi-structured and unstructured data with flexible schemas.
- RDBMS is scalable vertically and NoSQL is scalable horizontally.
200. What is column family?
A column family is a database object that contains columns of related data. It is a tuple (pair) that consists of a key-value pair, where the key is mapped to a value that is a set of columns.
201. How do reducers communicate with each other?
No. Reducers run in isolation and cannot communicate with each other. Each reducer receives its own partition of the shuffled intermediate key-value pairs and processes it independently; sharing state between reducers is not part of the MapReduce model.
202. Name the components of spark Ecosystem.
- Spark Core
- Spark SQL
- Spark Streaming
- MLlib
- GraphX
203. What is a block report in HDFS?
A block report is the list of all block replicas stored on a DataNode. Each DataNode periodically sends its block report to the NameNode, so the NameNode always has an up-to-date view of where every block in the cluster is located.
204. What is distributed cache?
In computing, a distributed cache extends the traditional idea of a cache on a single machine: it can span multiple servers so that it grows in size and transactional capacity.
In Hadoop, the Distributed Cache is a MapReduce facility for caching read-only files (small data files, jars, archives) that a job needs, copying them to every node so tasks can read them locally instead of fetching them repeatedly over the network.
205. Normalization vs Denormalization?
Normalization is the process of dividing large tables into smaller ones to reduce redundant data; it is carried out to prevent database anomalies. Denormalization is the opposite process of adding redundant data back in order to optimize read performance.
206. How can you optimize the MapReduce jobs?
- Proper configuration of your cluster.
- LZO compression usage.
- Proper tuning of the number of MapReduce tasks.
- Combiner between Mapper and Reducer.
207. What are the advantages of combiner?
- Use of combiner reduces the time taken for data transfer between mapper and reducer.
- Combiner improves the overall performance of the reducer.
- It decreases the amount of data that reducer has to process.
208. What are different schedulers in yarn?
There are three types of schedulers available in YARN: FIFO, Capacity and Fair. FIFO (first in, first out) is the simplest to understand and does not need any configuration.
It runs the applications in submission order by placing them in a queue.
209. Explain Hive metastore and Warehouse?
A Hive warehouse (aka spark-warehouse) is the directory where Spark SQL persists tables whereas a Hive metastore (aka metastore_db) is a relational database to manage the metadata of the persistent relational entities, e.g. databases, tables, columns, partitions.
210. Difference between Hive vs beeline?
The primary difference between the Hive CLI & Beeline involves how the clients connect to ApacheHive.
The Hive CLI connects directly to HDFS and the Hive Metastore, so it can only be used on a host that has access to those services.
Beeline connects to HiveServer2 over JDBC and requires access to only one jar file: hive-jdbc-<version>-standalone.jar.
211. What are temporary tables in hive?
Temporary table is a convenient way for an application to automatically manage intermediate data generated during a complex query.
Rather than manually deleting tables needed only as temporary data in a complex query, Hive automatically deletes all temporary tables at the end of the Hive session in which they are created.
212. What is lateral view?
The LATERAL VIEW statement is used with user-defined table generating functions such as EXPLODE() to flatten the map or array type of a column.
The explode function can be used on both ARRAY and MAP with LATERAL VIEW.
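A minimal sketch in Scala/Spark SQL (the table and column names are hypothetical; an active SparkSession named spark with spark.implicits._ imported is assumed):

```scala
// One row per user with an array of visited pages.
Seq(("u1", Seq("home", "cart")), ("u2", Seq("home")))
  .toDF("user", "pages")
  .createOrReplaceTempView("visits")

// LATERAL VIEW explode() flattens the array: one output row per array element.
spark.sql(
  """SELECT user, page
    |FROM visits
    |LATERAL VIEW explode(pages) p AS page""".stripMargin).show()
```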
213. What is the purpose of view in hive?
Views are similar to tables and are generated based on requirements. Any result set can be saved as a view in Hive; usage is similar to views in SQL.
214. Handling nulls while importing data?
To control how Sqoop writes NULL values during import, put the following options on the Sqoop command line: --null-string (the string to be written for a null value in string columns) and --null-non-string (the string to be written for a null value in non-string columns).
215. How is spark better than Hive?
Hive is a good option for performing SQL-based data analytics on large volumes of data.
Spark, on the other hand, is a general-purpose engine that processes data in memory and also supports streaming, machine learning, and programmatic APIs in addition to SQL.
It provides a faster, more modern alternative to MapReduce, on which classic Hive queries run.
216. Processing of big tables in spark?
Spark uses sort-merge joins to join two large tables. Each row of both tables is hashed on the join key and rows with the same hash are shuffled into the same partition; within each partition the keys on both sides are sorted and the sort-merge algorithm is applied.
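A minimal sketch that forces a sort-merge join between two larger-than-broadcast DataFrames (an active SparkSession named spark is assumed; the data is synthetic):

```scala
// Disable broadcast joins so Spark falls back to a sort-merge join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val left  = spark.range(0, 1000000).withColumnRenamed("id", "key")
val right = spark.range(0, 1000000).withColumnRenamed("id", "key")

// Rows are shuffled by the hash of "key", sorted within each partition on
// both sides, and then merged; explain() shows a SortMergeJoin node.
left.join(right, "key").explain()
```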
217. Benefits of window function in spark?
Spark window functions are used to calculate results such as rank and row number over a window (a range of input rows); they are available by importing org.apache.spark.sql.functions and org.apache.spark.sql.expressions.Window.
218. Difference between window functions and group by?
- GROUP BY collapses rows and only offers aggregate functions, whereas window functions offer aggregate, ranking, and value functions while keeping every input row.
- SQL window functions are efficient and powerful: they not only cover GROUP BY-style aggregation but also advanced analytics such as ranking and value (lead/lag) calculations (see the sketch below).
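A minimal sketch contrasting the two (the data and column names are hypothetical; an active SparkSession with spark.implicits._ imported is assumed):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(("hw", "a", 100), ("hw", "b", 300), ("sw", "c", 200))
  .toDF("dept", "emp", "amount")

// GROUP BY collapses rows: one output row per department.
df.groupBy("dept").agg(sum("amount").alias("total")).show()

// A window function keeps every row and adds a per-department rank.
val w = Window.partitionBy("dept").orderBy(col("amount").desc)
df.withColumn("rank", rank().over(w)).show()
```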
219. How can we add a column to dataframe?
Use withColumn() transformation function.
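A minimal sketch (hypothetical column names, assuming an active SparkSession with spark.implicits._ imported):

```scala
import org.apache.spark.sql.functions.col

val df = Seq(("a", 100.0), ("b", 250.0)).toDF("item", "price")

// Add a new column derived from an existing one.
val withTax = df.withColumn("price_with_tax", col("price") * 1.2)
withTax.show()
```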
220. Difference between logical and physical plan?
- The logical plan describes what output is expected after applying a series of transformations (join, filter, where, groupBy, etc.) to a table, without saying how it will be computed.
- The physical plan decides how: the type of join to use, the order in which the filter, where, groupBy, etc. are executed. This is how Spark SQL works internally.
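A quick way to see both plans is explain(true) (extended mode), sketched here on a hypothetical DataFrame df:

```scala
import org.apache.spark.sql.functions.col

// Prints the parsed and analyzed logical plans, the optimized logical plan,
// and the physical plan that will actually be executed.
df.filter(col("amount") > 100).groupBy("dept").count().explain(true)
```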
221. Benefits of Scala over python?
Scala is a statically typed language, so the compiler catches many errors at compile time; this makes refactoring Scala code easier and safer than Python. Because Python is dynamically typed, more errors surface only at runtime, which makes testing more involved.
222. What Is Schema Enforcement?
Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table’s schema.
223. Benefits of enforce schema over default schema?
Explicitly enforcing a schema (for example, passing a StructType to the DataFrame reader) instead of relying on the default inferred schema has several benefits: Spark does not need an extra pass over the data to infer types, so reads are faster; column types are guaranteed to be what you expect; and malformed records are caught early instead of silently producing wrong types.
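A minimal sketch of enforcing an explicit schema when reading, instead of relying on the default inferred one (the file path and column names are hypothetical; an active SparkSession named spark is assumed):

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("amount", DoubleType, nullable = true)
))

val df = spark.read
  .schema(schema)            // enforced schema: no inference pass, types are fixed
  .option("header", "true")
  .csv("/data/sales.csv")    // hypothetical path
```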
224. What are the challenges faced in spark?
- No space left on device: usually caused by shuffle data spilling to the executors' local disks; try increasing the executor memory so less data is spilled, for example --executor-memory 20G.
- Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: the default spark.sql.broadcastTimeout is 300 seconds (the time Spark waits for a broadcast in broadcast joins). To overcome this, increase the timeout as required, for example --conf "spark.sql.broadcastTimeout=1200".
225. How is Scala is different from other languages?
Scala, when compared to Java, is a relatively new language. Both compile to JVM bytecode, but Scala combines object-oriented and functional programming, whereas Java is primarily object-oriented.
Scala offers enhanced code readability and conciseness.
226. What is functional programming in Scala?
Functional programming is a programming paradigm that uses functions as the central building block of programs. In functional programming, we strive to use pure functions and immutable values.
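A tiny sketch of the style in Scala: immutable values and pure functions composed with higher-order functions.

```scala
val xs = List(1, 2, 3)              // immutable collection
def square(x: Int): Int = x * x     // pure function: no side effects

// map() is a higher-order function; it returns a new list, xs is untouched.
val squares = xs.map(square)        // List(1, 4, 9)
```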
227. What is spark config?
SparkConf allows you to configure common properties (e.g. the master URL and application name), as well as arbitrary key-value pairs, through the set() method.
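A minimal sketch of building a SparkConf (the values are illustrative assumptions, not recommendations):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("local[*]")                 // common property: master URL
  .set("spark.executor.memory", "2g")    // arbitrary key-value pair via set()
  .set("spark.executor.cores", "4")      // cores (parallel tasks) per executor

val sc = new SparkContext(conf)
```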
228. Can we configure CPU cores in spark context?
Yes. The property spark.executor.cores controls the number of cores, and therefore the number of parallel tasks, each executor can use; it can be set through SparkConf (see the sketch above) or spark-submit.
From the driver code, SparkContext connects to the cluster manager (standalone/Mesos/YARN), which allocates the requested cores and memory to the executors.
229. How does partition happen while creating RDD?
When an RDD is created from a file, Spark creates one partition per input split (HDFS block) by default; sc.parallelize() uses spark.default.parallelism unless an explicit number of slices is passed. When you call rdd.repartition(x), Spark performs a full shuffle of the data from the N partitions the RDD currently has to the x partitions you want, distributing records in a round-robin fashion.
230. To rename a column in Dataframe to some other name? how to achieve that?
Use the withColumnRenamed() function to rename a DataFrame column to another name; withColumn() can be used to add, derive, or split columns.
There are many other things that can be achieved with withColumn() and related functions.
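A minimal sketch (hypothetical column names, assuming an active SparkSession with spark.implicits._ imported):

```scala
val df = Seq(("a", 100.0), ("b", 250.0)).toDF("item", "price")

// Rename "price" to "unit_price"; all other columns are left untouched.
val renamed = df.withColumnRenamed("price", "unit_price")
renamed.printSchema()
```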
231. Difference between spark 1.6 and 2.x?
Even though Spark is much faster than Hadoop MapReduce, Spark 1.6.x had some limitations that were addressed in Spark 2.x, including:
- SparkSession as the single entry point
- Faster analysis
- Added SQL features
- MLlib improvements
- New streaming module (Structured Streaming)
- Unified Dataset and DataFrame APIs
232. How do you decide number of executors?
The number of executors depends on the resources (cores and memory) available on each worker: you typically fix the cores and memory per executor, then the number of executors per node is roughly the usable cores on the node divided by the cores per executor, leaving some cores and memory for the OS and Hadoop daemons.
233. How to remove duplicates from an array of elements?
The ways for removing duplicate elements from the array:
- Using extra space
- Constant extra space
- Using Set
- Using Frequency array
- Using HashMap
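A small sketch in Scala:

```scala
val arr = Array(1, 3, 2, 3, 1)

val viaDistinct = arr.distinct       // keeps the first occurrence of each element
val viaSet      = arr.toSet.toArray  // uses a Set as the extra space; order not guaranteed
```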
234. What is the diamond problem in Scala and how do you resolve it?
The diamond problem occurs with multiple inheritance, when a class inherits the same method from two parents and it is ambiguous which implementation should be used.
In Scala, multiple inheritance is achieved through traits, and the ambiguity is resolved by trait linearization: the implementation from the trait mixed in last (the right-most one) takes precedence, or the class can override the method explicitly. (In Java, the equivalent problem with interfaces is solved with default methods and an explicit override.)
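A minimal Scala sketch of the diamond and how linearization resolves it:

```scala
trait Animal { def sound: String = "..." }
trait Dog extends Animal { override def sound: String = "woof" }
trait Cat extends Animal { override def sound: String = "meow" }

// Both Dog and Cat override sound; the trait mixed in last (Cat) takes
// precedence in the linearization, so its implementation wins.
object Pet extends Dog with Cat

println(Pet.sound)   // prints "meow"
```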
235. Diamond problem in scala occurs when child class/object tries to refer?
Multiple parent classes having same method name.
236. For the following code in Scala: lazy val output = { println("Hello"); 1 }; println("Learning Scala"); println(output). What is the result, in proper order?
Learning Scala, Hello, 1 — a lazy val is not evaluated until it is first accessed, so "Hello" is printed only when println(output) forces the evaluation.
237. Suppose we have a series of 9 Mapreduce Jobs, then how many Disk I/Os are needed in total?
18 — each MapReduce job reads its input from disk and writes its output back to disk, so a chain of 9 jobs incurs 9 × 2 = 18 disk I/Os.
238. Which operations are not lazy?
Actions such as collect() and take() are not lazy; they trigger execution of the DAG.
239. Suppose while running spark on hadoop2, input file is of size 800 MB. How many RDD partitions will be created in all ?
7 partitions — on Hadoop 2 the default HDFS block size is 128 MB, and 800 MB / 128 MB = 6.25, so the file occupies 7 blocks and Spark creates one partition per block.
240. Which will help RDDs to achieve resiliency?
RDDs maintain a lineage graph, and RDD contents are immutable; together these allow any lost partition to be recomputed from its source data.
241. When is an RDD said to be materialized?
An RDD is materialized when an action (such as collect()) is called and the RDD is actually computed.
242. Actions are functions applied on RDD, resulting into another RDD.
False — actions return a result to the driver or write data to storage; it is transformations that produce another RDD.
243. Spark transformations & actions are evaluated lazily?
False — transformations are evaluated lazily, but actions trigger immediate execution.
244. What are higher-order functions?
Higher-order functions are functions that take other functions as arguments (or return them); examples are map(), reduce(), and foreach().
245. Which of the below gives a one-to-one mapping between input and output?
map
246. By default spark UI is available on which port?
port 4040
247. What is broadcast variable?
Broadcast variables in Apache Spark are a mechanism for sharing read-only variables across executors. Without broadcast variables, such a variable would be shipped to the executors with every task that uses it, causing network overhead; a broadcast variable is sent to each executor only once and cached there.
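A minimal sketch with a small lookup map (hypothetical data; an active SparkContext named sc is assumed):

```scala
val countryNames = Map("US" -> "United States", "IN" -> "India")
val bcNames = sc.broadcast(countryNames)        // shipped to each executor once and cached

val codes = sc.parallelize(Seq("US", "IN", "US"))
val resolved = codes.map(code => bcNames.value.getOrElse(code, "unknown"))
resolved.collect().foreach(println)
```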
248. What is accumulator?
An accumulator is a shared variable with a single copy kept in the driver program; the executors update it.
Executors can only add to an accumulator and cannot read its value; only the driver can read it. Accumulators are similar to counters in MapReduce.
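A minimal sketch counting malformed records (hypothetical data; an active SparkContext named sc is assumed):

```scala
val badRecords = sc.longAccumulator("badRecords")

sc.parallelize(Seq("1", "oops", "3")).foreach { s =>
  if (scala.util.Try(s.toInt).isFailure) badRecords.add(1)   // executors can only add
}

println(badRecords.value)   // only the driver reads the value; prints 1 here
```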
249. What are the ways to remove duplicates in hive?
- Use Insert Overwrite and DISTINCT Keyword
- GROUP BY Clause to Remove Duplicate
- Use Insert Overwrite with row_number() analytics functions
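For example, the row_number() approach can be sketched as follows (run here through spark.sql(); the table and column names are hypothetical):

```scala
spark.sql("""
  INSERT OVERWRITE TABLE orders_dedup
  SELECT id, amount, updated_at
  FROM (
    SELECT *,
           row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM orders
  ) t
  WHERE rn = 1
""")
```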