Using Libraries: NumPy, Pandas & Matplotlib

One of Python's greatest strengths is its ecosystem of libraries. This chapter covers three essential libraries used in data science, analysis, and visualization: NumPy, Pandas, and Matplotlib.

Why This Chapter Matters

These three libraries are the foundation of Python's data science stack. Understanding them opens doors to machine learning, data analysis, scientific computing, and business intelligence.

NumPy — Numerical Computing

NumPy (Numerical Python) provides a fast, multi-dimensional array object called ndarray and hundreds of mathematical functions.

Installing NumPy

pip install numpy

Creating Arrays

import numpy as np

# From a list
arr = np.array([1, 2, 3, 4, 5])
print(arr)        # [1 2 3 4 5]
print(arr.dtype)  # int64 on most platforms
print(arr.shape)  # (5,)

# 2D array (matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix.shape)  # (2, 3)

# Convenience constructors
zeros = np.zeros((3, 4))            # 3x4 array of zeros
ones = np.ones((2, 3))              # 2x3 array of ones
identity = np.eye(3)                # 3x3 identity matrix
range_arr = np.arange(0, 10, 2)     # array([0, 2, 4, 6, 8])
linspace = np.linspace(0, 1, 5)     # 5 evenly spaced points from 0 to 1, endpoints included
random_arr = np.random.rand(3, 3)   # 3x3 random floats in [0, 1)

Array Operations (Vectorized)

NumPy operations apply element-wise without explicit Python loops. The looping happens in compiled code, which is typically much faster than iterating over a Python list.

arr = np.array([1, 2, 3, 4, 5])

print(arr * 2)    # [2 4 6 8 10]
print(arr + 10)   # [11 12 13 14 15]
print(arr ** 2)   # [1 4 9 16 25]
print(arr > 3)    # [False False False True True]

# Element-wise operations between arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b)         # [5 7 9]
print(a * b)         # [4 10 18]
print(np.dot(a, b))  # 32 (dot product)
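To make the element-wise claim concrete, the sketch below computes the same result twice: once with an explicit Python loop and once with a single vectorized expression. The variable names and the array size are illustrative choices, not part of the chapter, and no timing figures are given because speedups depend on array size and hardware.

```python
import numpy as np

# Illustrative data: 100,000 floats (size chosen arbitrarily)
values = np.arange(100_000, dtype=np.float64)

# Loop version: Python visits every element one at a time
squared_loop = [v * v for v in values.tolist()]

# Vectorized version: one expression, the loop runs in compiled code
squared_vec = values * values

# Both approaches produce the same numbers
print(np.allclose(squared_loop, squared_vec))  # True
```

Timing the two versions with the standard-library timeit module is a quick way to see the difference on your own machine.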

Indexing and Slicing

arr = np.array([10, 20, 30, 40, 50])
print(arr[0])      # 10
print(arr[1:4])    # [20 30 40]
print(arr[-1])     # 50

# Boolean indexing
print(arr[arr > 25])   # [30 40 50]

# 2D indexing
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix[0, :])   # first row: [1 2 3]
print(matrix[:, 1])   # second column: [2 5 8]
print(matrix[1, 2])   # row 1, col 2: 6
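One detail worth knowing about these indexing forms: a basic slice returns a view that shares memory with the original array, while boolean indexing returns a copy. The short sketch below, with illustrative variable names, shows the difference.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Basic slicing gives a view: writing through it changes arr
view = arr[1:4]
view[0] = 99
print(arr)  # [10 99 30 40 50]

# Boolean indexing gives a copy: writing to it leaves arr alone
selected = arr[arr > 25]
selected[0] = -1
print(arr)  # [10 99 30 40 50]

# Use .copy() when you want an independent slice
independent = arr[1:4].copy()
independent[0] = 0
print(arr[1])  # 99
```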

Useful Math Functions

arr = np.array([4, 9, 16, 25])
print(np.sqrt(arr))   # [2. 3. 4. 5.]
print(np.mean(arr))   # 13.5
print(np.std(arr))    # about 7.89 (population standard deviation, ddof=0)
print(np.sum(arr))    # 54
print(np.min(arr))    # 4
print(np.max(arr))    # 25
print(np.sort(arr))   # returns a sorted copy; arr itself is unchanged
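The same functions also work on 2D arrays, where the optional axis argument controls the direction of the reduction. A minimal sketch, reusing the small matrix shape from earlier in the chapter:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(np.sum(matrix))           # 21, sum of every element
print(np.sum(matrix, axis=0))   # [5 7 9], one sum per column
print(np.sum(matrix, axis=1))   # [ 6 15], one sum per row
print(np.mean(matrix, axis=1))  # [2. 5.], one mean per row
```

A way to remember it: axis=0 collapses the rows and axis=1 collapses the columns.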

Pandas — Data Analysis

Pandas introduces two powerful data structures: Series (1D) and DataFrame (2D table).

Installing Pandas

pip install pandas

Series

A Series is a labeled 1D array.

import pandas as pd

scores = pd.Series([95, 88, 72, 91], index=["Asha", "Leo", "Mina", "Sam"])
print(scores)
print(scores["Asha"])        # 95
print(scores[scores > 85])   # filter

DataFrame

A DataFrame is a 2D table — like a spreadsheet.

data = {
    "Name": ["Asha", "Leo", "Mina", "Sam"],
    "Score": [95, 88, 72, 91],
    "Grade": ["A", "B", "C", "A"]
}

df = pd.DataFrame(data)
print(df)
print(df.shape)        # (4, 3)
print(df.dtypes)       # column types
print(df.describe())   # stats summary
print(df.head(2))      # first 2 rows
print(df.tail(2))      # last 2 rows

Selecting Data

# Select a column
print(df["Name"])
print(df[["Name", "Score"]])   # multiple columns

# Row selection
print(df.iloc[0])   # by integer position
print(df.loc[0])    # by label (same here)

# Conditional filtering
top = df[df["Score"] >= 90]
print(top)
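With the default 0, 1, 2, ... index, iloc and loc return the same row, which hides the difference between them. The sketch below uses a small DataFrame with name labels as its index (an illustrative setup, not the chapter's df) so the two diverge.

```python
import pandas as pd

df = pd.DataFrame(
    {"Score": [95, 88, 72]},
    index=["Asha", "Leo", "Mina"],  # labels instead of 0, 1, 2
)

print(df.iloc[0]["Score"])      # 95, first row by position
print(df.loc["Leo", "Score"])   # 88, row whose label is "Leo"

# Slices differ too: iloc excludes the end, loc includes it
print(len(df.iloc[0:2]))            # 2
print(len(df.loc["Asha":"Mina"]))   # 3
```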

Adding and Modifying Columns

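df["Passed"] = df["Score"] >= 60                  # new boolean column
df["Score_Boosted"] = df["Score"] + 5             # new column computed from an existing one
df = df.drop(columns=["Score_Boosted"])           # drop returns a new DataFrame
df = df.rename(columns={"Score": "Final Score"})  # "Score" is now called "Final Score"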

Handling Missing Data

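import numpy as np

# The column was renamed to "Final Score" in the previous section.
# NaN is a float, so convert the integer column to float before assigning it.
df["Final Score"] = df["Final Score"].astype(float)
df.loc[2, "Final Score"] = np.nan   # set a missing value
print(df.isnull())                  # boolean mask
print(df.isnull().sum())            # count missing per column
df_clean = df.dropna()              # drop rows with any NaN
df_filled = df.fillna(0)            # fill missing with 0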
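Filling with 0 is not always appropriate, since it can drag down averages. One common alternative, sketched below on a small self-contained DataFrame (names and values are illustrative), is to fill a numeric column with its own mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Leo", "Mina", "Sam"],
    "Score": [95.0, 88.0, np.nan, 91.0],
})

# mean() skips NaN by default, so this is the mean of 95, 88, and 91
mean_score = df["Score"].mean()

# fillna returns a new Series; assign it back to update the column
df["Score"] = df["Score"].fillna(mean_score)

print(df["Score"].isnull().sum())     # 0
print(round(df.loc[2, "Score"], 2))   # 91.33
```

Whether the mean, the median, or dropping the row is the right choice depends on the data; this is only one option.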

Reading and Writing Files

# CSV
df = pd.read_csv("students.csv")
df.to_csv("output.csv", index=False)

# Excel (reading .xlsx files requires the openpyxl package)
df = pd.read_excel("data.xlsx")

# JSON
df = pd.read_json("data.json")
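The file names above are placeholders, so those lines only run if the files exist. To try the CSV round trip without any files, the sketch below writes to an in-memory text buffer from the standard library and reads it back; the DataFrame contents are illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Leo"], "Score": [95, 88]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False omits the row labels
print(buffer.getvalue())
# Name,Score
# Asha,95
# Leo,88

# Rewind the buffer and read it back
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.equals(df))  # True
```

Without index=False, to_csv would also write the row labels, and reading the file back would produce an extra unnamed column.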

Grouping and Aggregation

# Group by Grade and compute mean score
summary = df.groupby("Grade")["Score"].mean()
print(summary)

# Multiple aggregations
summary2 = df.groupby("Grade").agg({"Score": ["mean", "max", "count"]})
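The dictionary form of agg produces two-level column names. As an alternative sketch, named aggregation gives flat, readable column names. The example rebuilds the chapter's student table in memory so it runs without a CSV file; the output column names (mean_score, max_score, students) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Leo", "Mina", "Sam"],
    "Score": [95, 88, 72, 91],
    "Grade": ["A", "B", "C", "A"],
})

# Named aggregation: each keyword becomes an output column,
# defined as (source column, aggregation function)
summary = df.groupby("Grade").agg(
    mean_score=("Score", "mean"),
    max_score=("Score", "max"),
    students=("Score", "count"),
)
print(summary)

print(summary.loc["A", "mean_score"])  # 93.0
```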

Sorting

df_sorted = df.sort_values("Score", ascending=False)
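A frequent follow-up to sorting is keeping only the top few rows. The sketch below shows two ways that should give the same result on this small illustrative table: sorting and taking the head, or calling nlargest.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Leo", "Mina", "Sam"],
    "Score": [95, 88, 72, 91],
})

# Sort descending, then keep the first two rows
top_sorted = df.sort_values("Score", ascending=False).head(2)

# nlargest does the same in one call
top_nlargest = df.nlargest(2, "Score")

print(top_sorted["Name"].tolist())    # ['Asha', 'Sam']
print(top_nlargest["Name"].tolist())  # ['Asha', 'Sam']
```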

Matplotlib — Data Visualization

Matplotlib is the foundational plotting library for Python.

Installing Matplotlib

pip install matplotlib

Line Plot

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 20, 15, 30, 25]

plt.plot(x, y, marker="o", color="blue", linestyle="--")
plt.title("My Line Chart")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")
plt.grid(True)
plt.savefig("chart.png")
plt.show()

Bar Chart

names = ["Asha", "Leo", "Mina"]
scores = [95, 88, 72]

plt.bar(names, scores, color=["green", "orange", "red"])
plt.title("Student Scores")
plt.ylabel("Score")
plt.show()

Scatter Plot

import numpy as np

x = np.random.rand(50)
y = np.random.rand(50)

plt.scatter(x, y, alpha=0.7, color="purple")
plt.title("Scatter Plot")
plt.show()

Histogram

data = np.random.randn(1000)
plt.hist(data, bins=30, color="teal", edgecolor="black")
plt.title("Distribution")
plt.show()

Subplots

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot([1, 2, 3], [10, 20, 15])
axes[0].set_title("Line")

axes[1].bar(["A", "B", "C"], [5, 10, 8])
axes[1].set_title("Bar")

plt.tight_layout()
plt.show()

Putting It Together — A Mini Analysis

import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv("sales.csv")

# Clean
df = df.dropna(subset=["revenue"])

# Analyze
monthly = df.groupby("month")["revenue"].sum()

# Visualize
monthly.plot(kind="bar", color="steelblue")
plt.title("Monthly Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue ($)")
plt.tight_layout()
plt.savefig("revenue.png")
plt.show()
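The analysis above depends on a sales.csv file that is not provided. The sketch below is a self-contained variant that builds a small stand-in table in memory. The values are invented purely for illustration, the output file name revenue_demo.png is hypothetical, and the Agg backend line is an assumption that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for sales.csv (values invented for illustration)
df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Mar"],
    "revenue": [100.0, 150.0, None, 200.0, 50.0],
})

# Clean: drop the row with no revenue
df = df.dropna(subset=["revenue"])

# Analyze: sort=False keeps months in order of first appearance
monthly = df.groupby("month", sort=False)["revenue"].sum()

# Visualize using an explicit figure and axes
fig, ax = plt.subplots()
monthly.plot(kind="bar", color="steelblue", ax=ax)
ax.set_title("Monthly Revenue")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue ($)")
fig.tight_layout()
fig.savefig("revenue_demo.png")
plt.close(fig)

print(monthly.to_dict())  # {'Jan': 250.0, 'Feb': 200.0, 'Mar': 50.0}
```

Note that groupby sorts group keys by default, so text month names would come out alphabetically (Feb, Jan, Mar) unless sort=False is passed or the months are stored in a sortable form.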

Common Mistakes

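Forgetting the imports: import numpy as np, import pandas as pd, and import matplotlib.pyplot as plt.
Modifying a filtered subset of a DataFrame without understanding copy vs. view; call .copy() on the subset if you intend to change it.
Looping over DataFrame rows with a for loop when a vectorized operation would do the same work.
Not calling plt.show() or plt.savefig() to see or save a plot; call plt.savefig() before plt.show(), because the figure may be cleared once the window is closed.
Ignoring SettingWithCopyWarning from Pandas.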
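Two of these mistakes, the copy vs. view confusion and row-by-row loops, can be avoided with the pattern sketched below. The DataFrame mirrors the chapter's student table, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Leo", "Mina", "Sam"],
    "Score": [95, 88, 72, 91],
})

# Take an explicit copy of the filtered rows before modifying them
top = df[df["Score"] >= 90].copy()
top["Score"] = top["Score"] + 5   # changes top only

print(df["Score"].tolist())   # [95, 88, 72, 91]
print(top["Score"].tolist())  # [100, 96]

# Vectorized column arithmetic instead of a row-by-row loop
df["Passed"] = df["Score"] >= 60
print(df["Passed"].all())  # True
```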

Mini Exercises

Create a NumPy array of 1–20 and select all values greater than 10.
Create a DataFrame from a dictionary of your choice and filter rows by a condition.
Read a CSV file with Pandas and print the 5 rows with the highest values in one column.
Plot a bar chart comparing at least 4 categories.
Combine NumPy and Matplotlib to plot a sine wave.

Review Questions

What is the key advantage of NumPy arrays over Python lists for math?
What is the difference between iloc and loc in Pandas?
How do you handle missing values in a Pandas DataFrame?
What is groupby() used for?
How do you save a Matplotlib figure to a file?

Reference Checklist

I can create and manipulate NumPy arrays
I can create DataFrames from dicts and CSVs
I can filter, sort, and group Pandas DataFrames
I can handle missing data with dropna() and fillna()
I can create line, bar, scatter, and histogram plots
I can save plots and build multi-panel figures with subplots