Pandas Overview

What is Pandas?

Pandas is an open-source Python library designed for data manipulation and analysis. It provides high-performance, easy-to-use data structures like Series and DataFrame. Pandas simplifies handling and analyzing structured data, making it a powerful tool for data science.

Why Use Pandas?

Efficiently handle datasets that fit in memory, using fast vectorized operations.
Simplify data cleaning and preparation.
Perform complex data transformations with ease.
Provide built-in functionality for working with data formats like CSV, JSON, Excel, and SQL.
Support a wide range of analytical tools to extract insights from data.

What Can Pandas Do?

Load and save datasets in various formats (e.g., CSV, JSON, Excel).
Handle missing or inconsistent data.
Perform filtering, sorting, and grouping operations.
Support time-series analysis.
Merge and join datasets efficiently.
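The filtering, sorting, grouping, and merging operations listed above can be sketched on two small tables. The sales and managers tables, their column names, and their values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical sample data, made up for this sketch.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "amount": [100, 250, 175, 90],
})
managers = pd.DataFrame({
    "region": ["North", "South", "East"],
    "manager": ["Ana", "Ben", "Cleo"],
})

# Filtering: keep only the rows whose amount exceeds 95.
large = sales[sales["amount"] > 95]

# Sorting: order rows by amount, largest first.
ranked = sales.sort_values("amount", ascending=False)

# Grouping: total amount per region.
totals = sales.groupby("region")["amount"].sum()

# Merging: attach the manager for each region (left join on 'region').
merged = sales.merge(managers, on="region", how="left")

print(large)
print(ranked)
print(totals)
print(merged)
```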

What is a Series?

A Series in Pandas is a one-dimensional labeled array capable of holding any data type. It can be thought of as a column in a spreadsheet or a single data column in a database table.

Example:

import pandas as pd

data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)

Labels

Labels in Pandas are the indices or names associated with data elements in a Series or DataFrame. They allow for intuitive data access.

Example:

data = [10, 20, 30]
series = pd.Series(data, index=['a', 'b', 'c'])
print(series['b'])  # Access element with label 'b'

Create Labels

You can define custom labels using the index parameter when creating a Series or DataFrame.

Example:

data = [1, 2, 3]
series = pd.Series(data, index=['x', 'y', 'z'])
print(series)

What is a DataFrame?

A DataFrame is a two-dimensional, tabular data structure in Pandas with labeled rows and columns, similar to a spreadsheet or SQL table.

Example:

data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)

Locate Row

You can locate rows using the loc and iloc indexers. Both are used with square brackets rather than called like functions:

loc: Access rows by index label.
iloc: Access rows by integer position, counting from 0.

Example:

print(df.loc[0])   # Locate by label (the default index labels are 0, 1, ...)
print(df.iloc[1])  # Locate by integer position (second row)
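With the default index, labels and positions happen to coincide, so the difference between loc and iloc is easy to miss. The sketch below uses string row labels to separate the two. The labels 'r1' to 'r3' and the extra row are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical table with explicit string row labels.
people = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Cara"], "Age": [25, 30, 35]},
    index=["r1", "r2", "r3"],
)

by_label = people.loc["r2"]     # the row whose label is 'r2'
by_position = people.iloc[0]    # the first row, whatever its label is

# Label slices include the end label; position slices exclude the end position.
label_slice = people.loc["r1":"r3"]
position_slice = people.iloc[0:2]

print(by_label)
print(by_position)
print(label_slice)
print(position_slice)
```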

Read CSV Files

Pandas makes it easy to read data from CSV files using read_csv.

Example:

df = pd.read_csv('data.csv')
print(df)
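The call above assumes a file named data.csv exists in the working directory. To try read_csv without any file on disk, the same function also accepts a file-like object. The CSV text in this sketch is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content held in a string instead of a file.
csv_text = "Name,Age\nAlice,25\nBob,30\n"

# read_csv accepts file-like objects as well as paths.
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv)
print(df_csv.dtypes)  # Age is inferred as an integer column
```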

Read JSON Files

Pandas can also read JSON files using read_json.

Example:

df = pd.read_json('data.json')
print(df)
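As with CSV, read_json can be tried without a file by wrapping JSON text in a file-like object. The records below are invented, and the sketch assumes the JSON is a list of objects, one per row, which is what orient="records" describes.

```python
import io

import pandas as pd

# Hypothetical JSON content: a list of records, one object per row.
json_text = '[{"Name": "Anna", "Score": 88}, {"Name": "Tom", "Score": 92}]'

# Wrap the string in StringIO so it is treated as a file-like object.
df_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df_json)
```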

Dictionary as JSON

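A Python dictionary has the same key-value structure as a JSON object, so you can build a DataFrame directly from one. Each key becomes a column name and each list becomes that column's values.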

Example:

data = {'Name': ['Anna', 'Tom'], 'Score': [88, 92]}
df = pd.DataFrame(data)
print(df)

Pandas - Analyzing DataFrames

You can analyze DataFrames with methods like:

df.head(): View the first rows (five by default).
df.info(): Get an overview of the DataFrame: column names, non-null counts, and data types.
df.describe(): Get summary statistics for the numeric columns.

Example:

print(df.describe())
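A slightly larger table makes the three methods easier to compare. The names and scores in this sketch are made up for illustration.

```python
import pandas as pd

# Hypothetical scores table.
scores = pd.DataFrame({
    "Name": ["Anna", "Tom", "Lee", "Mia", "Raj"],
    "Score": [88, 92, 75, 60, 85],
})

first_two = scores.head(2)   # pass a number to override the default of 5 rows
scores.info()                # prints its report directly and returns None
summary = scores.describe()  # count, mean, std, min, quartiles, max

print(first_two)
print(summary)
```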

Pandas - Cleaning Data

Empty Cells

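Handle missing values with fillna, which replaces them, or dropna, which removes the rows that contain them. Pick one: after fillna has filled every gap, dropna has nothing left to remove.

df.fillna(0, inplace=True)  # Option 1: replace missing values with 0
df.dropna(inplace=True)     # Option 2: drop rows that contain missing values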
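Before choosing between filling and dropping, it often helps to count the missing values first. This sketch uses an invented table with one missing score and keeps the original untouched by assigning each result to a new name.

```python
import pandas as pd

# Hypothetical data with one missing score (None becomes NaN).
raw = pd.DataFrame({
    "Name": ["Anna", "Tom", "Lee"],
    "Score": [88.0, None, 92.0],
})

# Count missing values per column.
missing_per_column = raw.isna().sum()

# Fill only the Score column, leaving other columns alone.
filled = raw.fillna({"Score": 0})

# Alternatively, drop every row that has a missing value.
dropped = raw.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```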

Data in Wrong Format

Convert data types using pd.to_datetime or astype.

df['Date'] = pd.to_datetime(df['Date'])
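Real data often contains values that cannot be converted at all. One possible approach, sketched here with invented data, is to pass errors="coerce" so that unparseable dates become NaT (missing) instead of raising an error. The Count column illustrates astype for text that holds whole numbers.

```python
import pandas as pd

# Hypothetical records: dates and counts stored as text, one date invalid.
records = pd.DataFrame({
    "Date": ["2024-01-15", "2024-02-20", "not a date"],
    "Count": ["1", "2", "3"],
})

# Unparseable dates become NaT rather than raising an error.
records["Date"] = pd.to_datetime(records["Date"], errors="coerce")

# astype converts the whole column to the requested type.
records["Count"] = records["Count"].astype(int)

print(records)
print(records.dtypes)
```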

Wrong Data

Replace incorrect values with replace.

df['Age'] = df['Age'].replace(0, 25)  # Assign the result back; inplace=True on a selected column may not update df
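replace works well for specific known bad values. For values that are wrong because they break a rule, a boolean condition with loc is another option. The table and the cap of 120 in this sketch are hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical data: 0 and 130 are treated as data-entry errors.
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara"],
    "Age": [25, 0, 130],
})

# Swap a specific wrong value for a chosen replacement.
people["Age"] = people["Age"].replace(0, 25)

# Cap values above an assumed maximum, selecting rows with a condition.
people.loc[people["Age"] > 120, "Age"] = 120

print(people)
```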

Duplicates

Remove duplicate rows using drop_duplicates.

df.drop_duplicates(inplace=True)
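It can be useful to inspect duplicates before removing them, and to decide whether a row counts as a duplicate only when every column matches or when selected columns match. The entries table in this sketch is invented for illustration.

```python
import pandas as pd

# Hypothetical entries: row 2 repeats row 0 exactly, row 3 repeats only the name.
entries = pd.DataFrame({
    "Name": ["Anna", "Tom", "Anna", "Anna"],
    "Score": [88, 92, 88, 70],
})

# True for each row that repeats an earlier row in every column.
dup_mask = entries.duplicated()

# Remove rows that are identical in every column.
unique_rows = entries.drop_duplicates()

# Treat rows as duplicates when Name matches, keeping the first occurrence.
unique_names = entries.drop_duplicates(subset="Name", keep="first")

print(dup_mask)
print(unique_rows)
print(unique_names)
```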