PwC Interview Questions

Preparing for a PwC data analyst interview? This curated list of PwC interview questions spans technical expertise, analytical thinking, and soft skills, reflecting PwC’s focus on client impact, data-driven insights, and collaboration.

1. Overview of PwC

PwC, or PricewaterhouseCoopers, is a leading multinational professional services network and one of the Big Four accounting firms. Headquartered in London, it operates in 151 countries with over 360,000 employees. PwC offers a wide range of services including audit, tax, consulting, and advisory, helping global corporations and organizations navigate complex business challenges and achieve growth.

2.1 SQL & Data Manipulation

1. Differentiate INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN, and explain their use cases in complex queries.

SQL JOINs combine data from multiple tables.

INNER JOIN: Returns rows where there’s a match in both tables. Use for precise overlaps (e.g., linking orders to customers).

LEFT JOIN: Returns all rows from the left table and matched rows from the right. Unmatched right rows show NULL. Use to retain all left data (e.g., all products, including unsold ones).

RIGHT JOIN: Mirrors LEFT JOIN but prioritizes the right table. Rarely used; LEFT JOIN is preferred for readability.

FULL OUTER JOIN: Combines all rows from both tables, filling NULL for unmatched sides. Use to analyze mismatches (e.g., unmatched customer or product records).

2. Write a query to identify the top-selling product in each region.

Uses a window function to rank products by sales within each region, then filters to the top result.

WITH ranked_products AS (
  SELECT 
    region, 
    product_id, 
    SUM(sales) AS total_sales,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY SUM(sales) DESC) AS sales_rank
  FROM sales
  GROUP BY region, product_id
)
SELECT region, product_id, total_sales
FROM ranked_products
WHERE sales_rank = 1;

3. Strategies to clean datasets with missing values/outliers.

  • Missing Values:
    • Delete rows/columns if missing data is insignificant.
    • Impute using mean/median (numeric) or mode (categorical).
    • Use COALESCE() in SQL to replace NULL with defaults.
  • Outliers:
    • Remove using statistical thresholds (e.g., Z-score > 3).
    • Cap values at percentile limits (e.g., 99th percentile).
    • Investigate outliers for data entry errors or genuine anomalies.
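
A minimal pandas sketch of both strategies (the DataFrame and column names are illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({"value": [10, 12, np.nan, 11, 500]})

# Impute missing values with the median
df["value"] = df["value"].fillna(df["value"].median())

# Drop rows beyond a Z-score threshold of 3
z = (df["value"] - df["value"].mean()) / df["value"].std()
df_filtered = df[z.abs() <= 3]

# Or cap values at the 99th percentile instead of dropping them
df["value_capped"] = df["value"].clip(upper=df["value"].quantile(0.99))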

4. How to leverage GROUP BY and HAVING for insights.

  • GROUP BY aggregates data into groups (e.g., total sales per category).
  • HAVING filters aggregated results, unlike WHERE, which filters raw rows.
SELECT category, AVG(revenue) AS avg_revenue
FROM sales
GROUP BY category
HAVING AVG(revenue) > 1000;  -- Only categories with avg revenue > 1000

5. Rank salespeople by performance using ROW_NUMBER.

Assigns unique ranks, breaking ties arbitrarily. Use RANK() or DENSE_RANK() for tied rankings.

SELECT 
  salesperson_id,
  total_sales,
  ROW_NUMBER() OVER (ORDER BY total_sales DESC) AS performance_rank
FROM (
  SELECT salesperson_id, SUM(sales) AS total_sales
  FROM transactions
  GROUP BY salesperson_id
) AS sales_summary;

6. Calculate cumulative revenue by month for each product category.

Uses a window function to calculate a running total of revenue per category, ordered chronologically.

SELECT 
  category,
  month,
  SUM(revenue) OVER (
    PARTITION BY category 
    ORDER BY month 
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_revenue
FROM monthly_sales;

7. Retrieve the top 5 products by sales volume, excluding those with zero sales in the past 3 months.

Filter products with sales in the last 3 months using HAVING, then rank by total sales.

WITH recent_sales AS (
  SELECT product_id, SUM(sales_volume) AS total_sales
  FROM sales
  WHERE sale_date >= DATE_SUB(CURRENT_DATE, INTERVAL 3 MONTH)
  GROUP BY product_id
  HAVING SUM(sales_volume) > 0  -- Exclude products with zero recent sales
),
ranked_products AS (
  SELECT product_id, total_sales,
    ROW_NUMBER() OVER (ORDER BY total_sales DESC) AS sales_rank
  FROM recent_sales
)
SELECT product_id, total_sales
FROM ranked_products
WHERE sales_rank <= 5;

8. Identify customers who made purchases in two or more consecutive months.

Use Case: Track loyalty or subscription patterns.
Identify customers with purchases in consecutive months:

WITH monthly_purchases AS (
  SELECT customer_id, DATE_TRUNC('month', purchase_date) AS month
  FROM transactions
  GROUP BY customer_id, DATE_TRUNC('month', purchase_date)
),
lagged AS (
  SELECT customer_id, month,
    LAG(month) OVER (PARTITION BY customer_id ORDER BY month) AS prev_month
  FROM monthly_purchases
)
SELECT DISTINCT customer_id
FROM lagged
WHERE month = prev_month + INTERVAL '1 month';  -- Months are consecutive (avoids false positives from AGE(), e.g., gaps of 1 year 1 month)

9. Calculate monthly user retention rate.

Calculate monthly retention rate (users active in month *t* who were also active in *t-1*):
Formula: Retention Rate = (Retained Users / Previous Month’s Active Users) × 100.

WITH user_activity AS (
  SELECT user_id, DATE_TRUNC('month', login_date) AS month
  FROM logins
  GROUP BY user_id, month
),
retained AS (
  SELECT a.month, COUNT(DISTINCT a.user_id) AS retained_users
  FROM user_activity a
  JOIN user_activity b ON a.user_id = b.user_id
    AND a.month = b.month + INTERVAL '1 month'
  GROUP BY a.month
),
total_users AS (
  SELECT month, COUNT(DISTINCT user_id) AS total_active
  FROM user_activity
  GROUP BY month
)
SELECT r.month, 
  (retained_users::FLOAT / total_active) * 100 AS retention_rate
FROM retained r
JOIN total_users t ON r.month = t.month + INTERVAL '1 month';  

10. Find the nth highest salary dynamically (parameterized).

Find the nth highest salary using a parameterized query (e.g., in PostgreSQL):

CREATE FUNCTION get_nth_highest(n INT) RETURNS TABLE (salary INT) AS $$
BEGIN
  RETURN QUERY
  SELECT DISTINCT e.salary  -- Qualify the column so it doesn't clash with the OUT parameter "salary"
  FROM employees e
  ORDER BY e.salary DESC
  LIMIT 1 OFFSET n - 1;
END;
$$ LANGUAGE plpgsql;  

11. Explain how indexing works and how to choose columns for optimal performance.

  1. How it works: Indexes create a lookup structure (e.g., B-tree) to quickly locate rows.
  2. When to index:
    • Columns in WHERE, JOIN, or ORDER BY clauses.
    • High-cardinality columns (e.g., unique IDs).
    • Avoid over-indexing: Writes slow down due to index updates.
  3. Composite indexes: Use for multi-column filters (e.g., (region, product_id)).
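
One quick way to see an index being chosen (a sketch using Python's built-in sqlite3; the table and index names are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product_id INTEGER, amount REAL)")

# Composite index matching a common multi-column filter
conn.execute("CREATE INDEX idx_region_product ON sales (region, product_id)")

# EXPLAIN QUERY PLAN reveals whether the optimizer picks the index
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT amount FROM sales WHERE region = 'West' AND product_id = 7"
).fetchall()
print(plan)  # detail shows a SEARCH using idx_region_product, not a full table scan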

12. Clarify the difference between HAVING and WHERE, and their use cases.

Use WHERE for row-level conditions, HAVING for aggregate conditions.

  • WHERE: Filters raw rows before aggregation.
SELECT product_id, SUM(sales)
FROM orders
WHERE year = 2023  -- Filters individual rows
GROUP BY product_id;  
  • HAVING: Filters aggregated results after GROUP BY.
SELECT product_id, SUM(sales) AS total_sales
FROM orders
GROUP BY product_id
HAVING SUM(sales) > 1000;  -- Filters grouped results  

2.2 Python

13. How to handle missing data in Pandas.

Handling missing data in Pandas involves identifying, analyzing, and imputing/dropping NaN values:

  • Detection: Use df.isnull().sum() to count missing values per column.
  • Deletion: Drop rows/columns with df.dropna(axis=0, thresh=N) (e.g., remove rows with fewer than N non-null values).
  • Imputation:
    • Numeric: Fill with mean/median (df.fillna(df.mean())).
    • Categorical: Use mode (df.fillna(df.mode().iloc[0])).
    • Time Series: Forward/backward fill (df.ffill() or df.bfill()).
  • Advanced: Use sklearn.impute.SimpleImputer for reusable imputation pipelines (see the sketch below).
# Fill missing age values with the median (plain assignment avoids deprecated inplace chaining)
df['age'] = df['age'].fillna(df['age'].median())
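
A short scikit-learn sketch of the imputer route mentioned above (the column name is illustrative):

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31]})
imputer = SimpleImputer(strategy="median")        # also "mean", "most_frequent"
df[["age"]] = imputer.fit_transform(df[["age"]])  # NaN replaced by the column median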

Best Practice: Avoid dropping data unless missingness is random and insignificant.

14. Differences between Lists and NumPy Arrays, and their use cases.

Lists                          | NumPy Arrays
Native Python, flexible types  | Homogeneous (fixed data type)
Slow for math operations       | Vectorized operations (fast, C-based)
Dynamic resizing               | Fixed size after creation

Use Cases:

  • Lists: Store mixed data (e.g., [1, "text", True]).
  • NumPy Arrays: Math-heavy tasks (e.g., matrix operations, ML training).
import numpy as np  
list_data = [1, 2, 3]  
array_data = np.array([1, 2, 3])  
# Vectorized multiplication (arrays only)  
result = array_data * 2  # [2, 4, 6]  

15. Use Python for large dataset analysis to extract customer insights.

To extract insights efficiently:

  • Optimize Memory: Use df.astype({'column': 'category'}) for low-cardinality strings.
  • Chunking: Process data in batches with pandas.read_csv(chunksize=1000) (see the sketch after this list).
  • Sampling: Debug with df.sample(n=1000).
  • Parallelize: Use Dask or Modin for out-of-core computation.
  • Aggregation: Group data (df.groupby('region')['sales'].sum()) and visualize trends.
  • Example Workflow:
    • Load data in chunks.
    • Filter irrelevant columns/rows.
    • Use df.describe() for summary stats.
    • Identify patterns (e.g., correlation heatmaps).
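
A rough sketch of that chunked workflow (the file and column names are hypothetical):

import pandas as pd

totals = []
# Stream the file in batches instead of loading it all at once
for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
    chunk = chunk[chunk["sales"] > 0]                      # drop irrelevant rows
    totals.append(chunk.groupby("region")["sales"].sum())  # aggregate per chunk

# Combine the per-chunk aggregates into one final result
regional_sales = pd.concat(totals).groupby(level=0).sum()
print(regional_sales)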

16. Perform data transformations like filtering rows or creating new columns.

  • Filter Rows:
    • Boolean Indexing:
      • high_sales = df[df['sales'] > 1000]
    • Query:
      • df.query('region == "West" & revenue > 5000')
  • Create Columns:
    • Direct Assignment:
      • df['profit'] = df['revenue'] - df['cost']
    • Apply Functions:
      • df['name_upper'] = df['name'].apply(lambda x: x.upper())
    • Vectorized Operations:
      • df['discounted_price'] = df['price'] * 0.9

Best Practice: Avoid loops; leverage built-in Pandas methods for speed.
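
Putting these together in one runnable sketch (the data is invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "region": ["West", "East", "West"],
    "revenue": [6000, 4000, 7000],
    "cost": [2500, 3000, 3500],
    "price": [100, 80, 120],
})

high_sales = df.query('region == "West" and revenue > 5000')  # filter rows
df["profit"] = df["revenue"] - df["cost"]                     # new column
df["discounted_price"] = df["price"] * 0.9                    # vectorized math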

17. Write a function to find unique integer pairs summing to a target.

Sort the list to group duplicates, track seen numbers, and ensure pairs are stored in a sorted tuple to avoid redundancy.

def find_unique_pairs(nums, target):  
    seen = set()  
    pairs = set()  
    nums.sort()  # Sort to handle duplicates  
    for num in nums:  
        complement = target - num  
        if complement in seen:  
            # Store pairs in sorted order to avoid duplicates like (a,b) and (b,a)  
            pairs.add((complement, num))  
        seen.add(num)  
    return list(pairs)  

18. Check if a string is a palindrome (ignoring spaces, punctuation, and case).

Example:
Input: “A man, a plan, a canal: Panama” → Output: True

  • Steps:
    • Use regex to strip invalid characters.
    • Compare the cleaned string to its reverse.
import re  

def is_palindrome(s):  
    # Remove non-alphanumeric chars and convert to lowercase  
    cleaned = re.sub(r'[^a-zA-Z0-9]', '', s).lower()  
    return cleaned == cleaned[::-1]  

19. Differences between Deep vs Shallow Copy and scenarios for use.

  1. Shallow Copy:
    • Creates a new object but references the original nested objects.
    • Use copy.copy() or list.copy()
    • Best for simple, flat structures (e.g., list of integers).
  2. Deep Copy:
    • Creates entirely independent copies of nested objects.
    • Use copy.deepcopy()
    • Essential for mutable nested structures (e.g., list of lists).
import copy  

original = [[1, 2], [3, 4]]  
shallow = copy.copy(original)  
deep = copy.deepcopy(original)  

original[0][0] = 99  
# shallow: [[99, 2], [3, 4]]  
# deep: [[1, 2], [3, 4]]  

20. Explain Decorators functionality and provide a practical example.

Decorators wrap functions to extend their behavior without modifying their code.
Practical Example: Timing function execution:

import time  
from functools import wraps  

def timer(func):  
    @wraps(func)  
    def wrapper(*args, **kwargs):  
        start = time.time()  
        result = func(*args, **kwargs)  
        end = time.time()  
        print(f"{func.__name__} took {end - start:.2f} seconds")  
        return result  
    return wrapper  

@timer  
def slow_function():  
    time.sleep(2)  

slow_function()  # Output: "slow_function took 2.00 seconds"  

Use Cases: Logging, caching, access control, or input validation.
Key: Decorators use @wraps to preserve the original function’s metadata (e.g., name, docstring).

2.3 Excel

21. Use Pivot Tables for sales by quarter analysis.

Pivot Tables in Excel streamline quarterly sales analysis by summarizing raw data into actionable insights.

  • Steps:
    • Ensure your dataset has columns like Date, Product, Region, and Sales.
    • Select the data range → Insert → Pivot Table.
    • Drag Date to Rows. Right-click a date → Group → Quarters.
    • Drag Sales to Values (summarize as SUM or AVERAGE).
    • Add Region or Product to Columns/Rows for granular breakdowns.
  • Use Case: Compare Q1 vs Q2 performance across regions or identify top-selling products per quarter.
  • Pro Tip: Use Pivot Charts (e.g., bar/line charts) to visualize trends directly from the Pivot Table.
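
For comparison, the same summary can be sketched in pandas (data and column names are invented; Excel remains the tool the question asks about):

import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-15", "2024-04-10", "2024-05-02"]),
    "Region": ["West", "West", "East"],
    "Sales": [1000, 1500, 900],
})
df["Quarter"] = df["Date"].dt.to_period("Q")

# Quarters on rows, regions on columns, SUM of sales in the values
pivot = pd.pivot_table(df, values="Sales", index="Quarter", columns="Region", aggfunc="sum")
print(pivot)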

22. When to use VLOOKUP, IF statements, INDEX/MATCH.

  • VLOOKUP:
    • Searches for a value in the first column of a table and returns a value from a specified column.
    • Use for simple lookups (e.g., fetch product price from an ID).
    • Limitation: Cannot look to the left.
=VLOOKUP(A2, PriceTable, 3, FALSE)  
  • INDEX/MATCH:
    • Combines INDEX (returns a value at a row/column intersection) and MATCH (finds a value’s position).
    • Use for flexible lookups (leftward searches or dynamic column references).
=INDEX(PriceColumn, MATCH(A2, IDColumn, 0))  
  • IF Statements:
    • Apply conditional logic (e.g., flag high-risk orders: =IF(B2>10000, "High", "Low")).
    • Nest with AND/OR for complex rules.

Rule of Thumb: Use INDEX/MATCH over VLOOKUP for scalability and flexibility.

23. Visualize monthly sales trends with Excel charts.

  • Line Charts: Ideal for showing trends over time. Highlight peaks/valleys with data labels.
  • Bar/Column Charts: Compare monthly sales side-by-side. Use color gradients for emphasis.
  • Sparklines: Mini-charts within cells (via Insert → Sparklines) for quick trend summaries.
  • Dynamic Dashboards:
    1. Link charts to Pivot Tables.
    2. Add slicers (PivotTable Analyze → Insert Slicer) for interactive filtering (e.g., by region).
  • Best Practices:
    • Label axes and add clear titles.
    • Use consistent date formats.
    • Avoid clutter (e.g., limit gridlines, use white space).

Example: A line chart showing monthly revenue with annotations for promotional periods.

Tools: Excel’s Forecast Sheet (under Data tab) can project future trends based on historical data.

2.4 Scenario-Based Analysis

24. Approach to analyze churn and design retention strategies.

  1. Track usage/purchase data and churn triggers.
  2. Segment users via RFM or activity tiers.
  3. Apply predictive models (e.g., logistic regression; see the sketch after this list).
  4. Analyze root causes (surveys/feedback).
  5. Design interventions: discounts, loyalty programs, KPI monitoring.
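
A hedged sketch of step 3 with scikit-learn (the features and data are invented for illustration):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "monthly_logins":  [20, 2, 15, 1, 8, 0, 25, 3],
    "tenure_months":   [24, 3, 12, 2, 9, 1, 30, 4],
    "support_tickets": [0, 4, 1, 5, 2, 6, 0, 3],
    "churned":         [0, 1, 0, 1, 0, 1, 0, 1],
})

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="churned"), df["churned"], test_size=0.25, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)
# Churn probabilities identify at-risk users to target with retention offers
print(model.predict_proba(X_test)[:, 1])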

25. Present insights effectively to non-technical stakeholders.

  1. Use plain language, avoid jargon.
  2. Visualize trends (bar/line charts).
  3. Link insights to revenue/business goals.
  4. Tailor depth (execs = ROI; teams = actions).
  5. Offer clear recommendations (e.g., pricing tiers).

2.5 Communication & Storytelling

26. Share technical insights understandably.

  1. Replace jargon with plain terms.
  2. Use analogies (e.g., “API = restaurant menu”).
  3. Visualize data with charts/diagrams.
  4. Link insights to business goals.
  5. Summarize key takeaways first.

27. Maintain accuracy and clarity in high-pressure situations.

  1. Prepare concise talking points.
  2. Pause to organize thoughts.
  3. Stick to facts; flag assumptions.
  4. Use frameworks (e.g., SBAR: Situation, Background, Assessment, Recommendation).
  5. Verify critical details aloud.

2.6 Guesstimate Questions

28. Estimate annual smartphone sales in India.

  1. Population: ~1.4 billion people.
  2. Smartphone Penetration: ~55% (770 million users).
  3. Replacement Cycle: ~2.5 years → ~308 million annual replacements.
  4. New Users: ~5% of non-users adopt yearly (31.5 million).
  5. Total: ~340 million units (adjusting for affordability, market saturation).
    Refinement: Align with industry reports (~160 million units sold in 2023).
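
The arithmetic, written out (same numbers as the steps above):

population = 1.4e9
users = 0.55 * population                # ~770 million smartphone users
replacements = users / 2.5               # ~308 million replacement purchases
new_users = 0.05 * (population - users)  # ~31.5 million first-time buyers
print(f"{(replacements + new_users) / 1e6:.0f} million units/year")  # ~340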

29. Estimate daily revenue of roadside tea stalls across India.

  1. Stalls: ~2 million (1 per 700 people).
  2. Customers/Day: ~150 per stall (₹15/cup).
  3. Revenue/Stall: ₹2,250 daily.
  4. Total: ₹2,250 × 2 million = ₹4.5 billion daily (~₹1.6 trillion annually).
    Refinement: Adjust for urban/rural divide, seasonal demand.
