Introduction to Pandas

Pandas is one of the most widely used Python libraries for data manipulation and analysis. Its two core data structures, the Series and the DataFrame, make working with structured (tabular) data intuitive and efficient.

Why Pandas for Machine Learning?

1. Data Loading: Easily read data from CSV, Excel, JSON, SQL databases, and more
2. Data Cleaning: Handle missing values, duplicates, and data type conversions
3. Data Transformation: Filter, sort, group, and reshape data effortlessly
4. Feature Engineering: Create new features from existing data
5. Integration: Works seamlessly with NumPy, Scikit-learn, and visualization libraries
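
To make the five points above concrete, here is a minimal sketch that touches each one in turn. The CSV content, the column names, and the derived `salary_per_year_of_age` feature are all made up for illustration; a real pipeline would read from an actual file and choose cleaning rules that suit the data.

```python
import io

import pandas as pd

# Hypothetical CSV content, held in memory so the example needs no file on disk.
raw_csv = io.StringIO(
    "name,age,salary\n"
    "Alice,25,50000\n"
    "Bob,,60000\n"
    "Bob,,60000\n"
    "Charlie,35,70000\n"
)

# 1. Data loading: parse the CSV text into a DataFrame.
df = pd.read_csv(raw_csv)

# 2. Data cleaning: drop the duplicated row, then fill the missing age
#    with the median age and convert the column to integers.
df = df.drop_duplicates().reset_index(drop=True)
df["age"] = df["age"].fillna(df["age"].median()).astype(int)

# 3. Data transformation: keep salaries of at least 60000, highest first.
high_earners = df[df["salary"] >= 60000].sort_values("salary", ascending=False)

# 4. Feature engineering: derive a new column from existing ones
#    (an arbitrary illustrative feature, not a recommended one).
df["salary_per_year_of_age"] = df["salary"] / df["age"]

# 5. Integration: hand the numeric columns to NumPy as a plain array,
#    the form most Scikit-learn estimators accept.
features = df[["age", "salary"]].to_numpy()

print(df)
print(high_earners)
print(features.shape)
```

Filling the missing age with the median is only one possible choice; dropping the row or using a different imputation strategy may be more appropriate depending on the dataset.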

Installation

pip install pandas

Basic Import Convention

import pandas as pd
import numpy as np

# Create a simple DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})
print(df)
      name  age  salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
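
Once a DataFrame exists, a common next step is to check its size and column types before doing anything else. The sketch below rebuilds the same example table and inspects it with the `shape`, `columns`, and `dtypes` attributes; the exact dtype reported for the text column can differ between pandas versions, so no output is shown for it.

```python
import pandas as pd

# Same example table as above.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})

# (number of rows, number of columns)
print(df.shape)

# Column labels as a plain Python list
print(df.columns.tolist())

# Data type of each column; integer columns built from Python ints are int64
print(df.dtypes)
```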

Core Data Structures

# Series: 1D labeled array
s = pd.Series([1, 2, 3, 4, 5], name='numbers')
print(s)
0    1
1    2
2    3
3    4
4    5
Name: numbers, dtype: int64

# DataFrame: 2D labeled table (most commonly used)
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['x', 'y', 'z']
})
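
# Access a column as a Series
print(df['A'])  # Returns a Series
print(df.A)     # Attribute access: works only when the column name is a valid
                # Python identifier and does not clash with a DataFrame attribute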
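
The word "labeled" in both definitions refers to the index. The following sketch, using made-up values and a hypothetical column name `my col`, shows label-based versus position-based lookup on a Series, and the difference between selecting a column with single brackets (a Series) and with double brackets (a one-column DataFrame).

```python
import pandas as pd

# A Series with explicit string labels instead of the default 0..n-1 index.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='numbers')
print(s['b'])      # lookup by label
print(s.iloc[0])   # lookup by position

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['x', 'y', 'z']
})

col = df['A']      # single brackets: a Series
sub = df[['A']]    # double brackets: a DataFrame with one column
print(type(col).__name__, type(sub).__name__)

# A column name containing a space cannot be reached with attribute
# access, so bracket access is the general-purpose form.
df['my col'] = [7, 8, 9]
print(df['my col'])
```

Bracket access is usually the safer default in reusable code, since it behaves the same for every column name.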