Pandas Overview

What is Pandas?

Pandas is an open-source Python library for data manipulation and analysis. It provides two core data structures, Series and DataFrame, that are fast and easy to use. Pandas simplifies loading, cleaning, and analyzing structured (tabular) data, which makes it a standard tool in data science.

Why Use Pandas?

  • Efficiently handle large datasets that fit in memory.

  • Simplify data cleaning and preparation.

  • Perform complex data transformations with ease.

  • Provide built-in functionality for working with data formats like CSV, JSON, Excel, and SQL.

  • Support a wide range of analytical tools to extract insights from data.

What Can Pandas Do?

  • Load and save datasets in various formats (e.g., CSV, JSON, Excel).

  • Handle missing or inconsistent data.

  • Perform filtering, sorting, and grouping operations.

  • Support time-series analysis.

  • Merge and join datasets efficiently.
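
The filtering, sorting, grouping, and merging operations listed above can be sketched in a few lines. This is a minimal illustration only: the sales and managers tables, their column names, and their values are hypothetical and were invented for this example.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East'],
    'amount': [100, 250, 175, 90],
})
managers = pd.DataFrame({
    'region': ['North', 'South', 'East'],
    'manager': ['Ana', 'Ben', 'Cho'],
})

# Filtering: keep rows where amount is at least 100.
large = sales[sales['amount'] >= 100]

# Sorting: order rows by amount, largest first.
ranked = sales.sort_values('amount', ascending=False)

# Grouping: total amount per region.
totals = sales.groupby('region')['amount'].sum()

# Merging: attach each region's manager to the sales rows.
merged = sales.merge(managers, on='region', how='left')

print(large, ranked, totals, merged, sep='\n\n')
```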

What is a Series?

A Series in Pandas is a one-dimensional labeled array capable of holding any data type. It can be thought of as a column in a spreadsheet or a single data column in a database table.

Example:

import pandas as pd

data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)

Labels

Labels in Pandas are the indices or names associated with data elements in a Series or DataFrame. They allow for intuitive data access.

Example:

data = [10, 20, 30]
series = pd.Series(data, index=['a', 'b', 'c'])
print(series['b'])  # Access element with label 'b'

Create Labels

You can define custom labels using the index parameter when creating a Series or DataFrame.

Example:

data = [1, 2, 3]
series = pd.Series(data, index=['x', 'y', 'z'])
print(series)
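
The same index parameter works for a DataFrame. The sketch below assumes a hypothetical single-column table named df_labeled; the labels 'x', 'y', 'z' are reused from the Series example.

```python
import pandas as pd

# Hypothetical data; the index parameter sets the row labels.
data = {'Score': [88, 92, 79]}
df_labeled = pd.DataFrame(data, index=['x', 'y', 'z'])

print(df_labeled)
print(list(df_labeled.index))  # the custom row labels
```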

What is a DataFrame?

A DataFrame is a two-dimensional, tabular data structure in Pandas with labeled rows and columns, similar to a spreadsheet or SQL table.

Example:

data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)

Locate Row

You can locate rows using the loc and iloc indexers. Both use square brackets rather than function-call parentheses:

  • loc: Access rows by label.

  • iloc: Access rows by integer position (starting at 0).

Example:

print(df.loc[0])  # Locate by label (the default labels are 0, 1, ...)
print(df.iloc[1])  # Locate by integer position (second row)
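
With the default index, labels and positions coincide, so the difference between loc and iloc is easier to see with custom labels. The following is a small sketch; the people table and the labels 'r1' to 'r3' are hypothetical.

```python
import pandas as pd

# Hypothetical table with custom row labels.
people = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Cara'], 'Age': [25, 30, 35]},
    index=['r1', 'r2', 'r3'],
)

by_label = people.loc['r2']     # the row labeled 'r2'
by_position = people.iloc[0]    # the first row, whatever its label

# Label slices include the end label; position slices exclude the end.
label_slice = people.loc['r1':'r3']
position_slice = people.iloc[0:2]

print(by_label['Name'], by_position['Name'])
print(len(label_slice), len(position_slice))
```

Note that people.loc[0] would raise a KeyError here, because 0 is not one of the row labels.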

Read CSV Files

Pandas makes it easy to read data from CSV files using read_csv.

Example:

df = pd.read_csv('data.csv')
print(df)
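
To try read_csv without a file on disk, you can pass any file-like object. The sketch below uses io.StringIO with made-up CSV text standing in for the contents of 'data.csv'.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file.
csv_text = "Name,Age\nAlice,25\nBob,30\n"

# read_csv accepts a path or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv)
print(df_csv.dtypes)  # Age is parsed as an integer column
```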

Read JSON Files

Pandas can also read JSON files using read_json.

Example:

df = pd.read_json('data.json')
print(df)
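
As with CSV, read_json accepts a file-like object, which makes it easy to experiment. This sketch assumes the JSON is a list of records (one object per row); the text itself is invented for illustration.

```python
import io
import pandas as pd

# Made-up JSON text: a list of records, one object per row.
json_text = '[{"Name": "Anna", "Score": 88}, {"Name": "Tom", "Score": 92}]'

# orient='records' states the expected layout explicitly.
df_json = pd.read_json(io.StringIO(json_text), orient='records')

print(df_json)
```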

Dictionary as JSON

A Python dictionary has the same key-value structure as a JSON object, so you can create a DataFrame directly from one.

Example:

data = {'Name': ['Anna', 'Tom'], 'Score': [88, 92]}
df = pd.DataFrame(data)
print(df)
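
If the JSON arrives as a string rather than a file, one possible approach is to parse it with the standard json module and pass the resulting dictionary to DataFrame. The JSON text below is made up to mirror the dictionary above.

```python
import json
import pandas as pd

# Made-up JSON text with the same shape as the dictionary example.
json_text = '{"Name": ["Anna", "Tom"], "Score": [88, 92]}'

parsed = json.loads(json_text)      # a plain Python dict
df_from_json = pd.DataFrame(parsed)

print(df_from_json)
```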

Pandas - Analyzing DataFrames

You can analyze DataFrames with methods like:

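  • df.head(): View the first rows (five by default).

  • df.info(): Print an overview of the DataFrame: column names, non-null counts, and data types.

  • df.describe(): Get summary statistics (count, mean, standard deviation, min, quartiles, max) for numeric columns.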

Example:

print(df.describe())
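
The three methods can be tried together on a small table. The scores data below is hypothetical and only serves to make the output easy to check by hand.

```python
import pandas as pd

# Hypothetical data for illustration.
scores = pd.DataFrame({
    'Name': ['Anna', 'Tom', 'Lee', 'Mia'],
    'Score': [88, 92, 75, 85],
})

first_two = scores.head(2)   # first two rows instead of the default five
scores.info()                # prints directly; returns None
stats = scores.describe()    # statistics for the numeric column only

# describe() returns a DataFrame, so single statistics can be looked up.
mean_score = stats.loc['mean', 'Score']
print(first_two)
print(mean_score)
```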

Pandas - Cleaning Data

Empty Cells

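Handle missing values by either filling them with fillna or removing the affected rows with dropna. The two calls are alternatives; choose one.

df.fillna(0, inplace=True)  # Option 1: replace every missing value with 0
df.dropna(inplace=True)  # Option 2: drop rows that contain missing values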
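
The sketch below compares the two approaches side by side on a copy-friendly basis, without inplace, so the original data stays available. The raw table, its columns, and the fill values are hypothetical.

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
raw = pd.DataFrame({
    'Age': [25, None, 30],
    'City': ['Oslo', 'Rome', None],
})

# Count missing values per column before deciding what to do.
missing_per_column = raw.isna().sum()

# Alternative 1: fill each column with its own default value.
filled = raw.fillna({'Age': 0, 'City': 'Unknown'})

# Alternative 2: drop every row that has at least one missing value.
dropped = raw.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```

Both calls return new DataFrames here, so raw is left unchanged.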

Data in Wrong Format

Convert data types using pd.to_datetime or astype.

df['Date'] = pd.to_datetime(df['Date'])
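
A slightly fuller sketch of both conversions follows. The records table is hypothetical, and errors='coerce' is one possible way to deal with unparseable dates: it turns them into NaT (missing) instead of raising an error.

```python
import pandas as pd

# Hypothetical data: dates and counts stored as text.
records = pd.DataFrame({
    'Date': ['2024-01-05', '2024-02-10', 'not a date'],
    'Count': ['1', '2', '3'],
})

# Unparseable dates become NaT rather than raising an error.
records['Date'] = pd.to_datetime(records['Date'], errors='coerce')

# astype converts a whole column to another type.
records['Count'] = records['Count'].astype(int)

print(records)
print(records.dtypes)
```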

Wrong Data

Replace incorrect values with replace.

df['Age'] = df['Age'].replace(0, 25)  # Assign back; inplace on a selected column may not update df
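
When the wrong values are defined by a rule rather than a single known value, a boolean condition with loc is another option. In this sketch the people table and the upper limit of 120 are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data: 0 and 140 are treated as wrong ages.
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Cara'],
    'Age': [25, 0, 140],
})

# Replace one known bad value and assign the result back.
people['Age'] = people['Age'].replace(0, 25)

# Cap every age above an assumed limit of 120.
people.loc[people['Age'] > 120, 'Age'] = 120

print(people)
```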

Duplicates

Remove duplicate rows using drop_duplicates.

df.drop_duplicates(inplace=True)
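
Before removing duplicates it can help to see which rows are flagged. The following sketch uses a hypothetical rows table; duplicated marks repeats, and the subset and keep parameters control what counts as a duplicate and which copy survives.

```python
import pandas as pd

# Hypothetical data: the third row repeats the first.
rows = pd.DataFrame({
    'Name': ['Anna', 'Tom', 'Anna'],
    'Score': [88, 92, 88],
})

# True for each row that repeats an earlier row.
dupe_mask = rows.duplicated()

# Drop full-row duplicates, keeping the first occurrence.
unique_rows = rows.drop_duplicates()

# Compare on Name only and keep the last occurrence instead.
unique_names = rows.drop_duplicates(subset='Name', keep='last')

print(dupe_mask.tolist())
print(unique_rows)
print(unique_names)
```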