Pandas Overview
What is Pandas?
Pandas is an open-source Python library designed for data manipulation and analysis. It provides high-performance, easy-to-use data structures like Series and DataFrame. Pandas simplifies handling and analyzing structured data, making it a powerful tool for data science.
Why Use Pandas?
Efficiently handle large in-memory datasets with vectorized operations.
Simplify data cleaning and preparation.
Perform complex data transformations with ease.
Provide built-in functionality for reading and writing CSV, JSON, and Excel files and SQL databases.
Offer a wide range of analytical methods, such as aggregation and summary statistics, for extracting insights from data.
What Can Pandas Do?
Load and save datasets in various formats (e.g., CSV, JSON, Excel).
Handle missing or inconsistent data.
Perform filtering, sorting, and grouping operations.
Support time-series analysis.
Merge and join datasets efficiently.
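As a rough illustration of the operations listed above, the following sketch filters, sorts, groups, and merges two small tables. The table names, column names, and values are invented for the example.

```python
import pandas as pd

# Hypothetical sales data, invented for this example
sales = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'East'],
    'Units': [10, 5, 7, 3],
})
managers = pd.DataFrame({
    'Region': ['North', 'South', 'East'],
    'Manager': ['Ava', 'Ben', 'Cy'],
})

big_orders = sales[sales['Units'] > 4]            # filtering
by_units = sales.sort_values('Units')             # sorting
totals = sales.groupby('Region')['Units'].sum()   # grouping
merged = sales.merge(managers, on='Region')       # merging on a shared column

print(totals)
print(merged)
```

Each call returns a new object and leaves the original sales table unchanged.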
What is a Series?
A Series in Pandas is a one-dimensional labeled array capable of holding any data type. It can be thought of as a column in a spreadsheet or a single data column in a database table.
Example:
import pandas as pd
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)
Labels
Labels in Pandas are the indices or names associated with data elements in a Series or DataFrame. They allow for intuitive data access.
Example:
data = [10, 20, 30]
series = pd.Series(data, index=['a', 'b', 'c'])
print(series['b']) # Access element with label 'b'
Create Labels
You can define custom labels using the index parameter when creating a Series or DataFrame.
Example:
data = [1, 2, 3]
series = pd.Series(data, index=['x', 'y', 'z'])
print(series)
What is a DataFrame?
A DataFrame is a two-dimensional, tabular data structure in Pandas with labeled rows and columns, similar to a spreadsheet or SQL table.
Example:
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)
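To connect the two structures, here is a small sketch that reuses the same example data and shows that selecting a single column of a DataFrame yields a Series. The variable name ages is chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

ages = df['Age']              # selecting one column gives a Series
print(type(ages).__name__)
print(df.shape)               # (number of rows, number of columns)
print(list(df.columns))       # column labels
```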
Locate Row
You can locate rows using the loc and iloc indexers:
loc: Access rows by label.
iloc: Access rows by integer position.
Example:
print(df.loc[0]) # Locate by label
print(df.iloc[1]) # Locate by index position
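With the default index, labels and positions coincide, so the difference between the two is easy to miss. A minimal sketch with custom string labels (the labels 'r1' and 'r2' are invented for this example) makes it visible:

```python
import pandas as pd

# The row labels 'r1' and 'r2' are invented for this example
df = pd.DataFrame(
    {'Name': ['Alice', 'Bob'], 'Age': [25, 30]},
    index=['r1', 'r2'],
)

by_label = df.loc['r2']      # the row whose label is 'r2'
by_position = df.iloc[0]     # the first row, whatever its label is

print(by_label['Name'])
print(by_position['Name'])
```

Here df.loc[0] would raise a KeyError, because no row carries the label 0.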
Read CSV Files
Pandas makes it easy to read data from CSV files using read_csv.
Example:
df = pd.read_csv('data.csv')
print(df)
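The example above needs a data.csv file on disk. For a version that runs on its own, the sketch below reads CSV text from an in-memory buffer instead; the contents are invented, and io.StringIO simply stands in for a real file.

```python
import io
import pandas as pd

# In-memory text standing in for a file on disk; the contents are invented
csv_text = "Name,Age\nAlice,25\nBob,30\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df)

# to_csv writes the table back out; index=False omits the row labels
round_trip = df.to_csv(index=False)
print(round_trip)
```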
Read JSON Files
Pandas can also read JSON files using read_json.
Example:
df = pd.read_json('data.json')
print(df)
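As with CSV, a self-contained variant can read JSON text from memory. This sketch assumes the JSON is an array of records (one object per row); the values are invented, and orient='records' states that layout explicitly.

```python
import io
import pandas as pd

# A JSON array of records, standing in for the contents of a file
json_text = '[{"Name": "Anna", "Score": 88}, {"Name": "Tom", "Score": 92}]'

# Wrapping the string in StringIO lets read_json treat it like a file
df = pd.read_json(io.StringIO(json_text), orient='records')
print(df)
```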
Dictionary as JSON
A Python dictionary has the same key-value structure as a JSON object, so JSON-like data that is already in memory as a dictionary can be passed straight to pd.DataFrame.
Example:
data = {'Name': ['Anna', 'Tom'], 'Score': [88, 92]}
df = pd.DataFrame(data)
print(df)
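One way to see the link to JSON is to start from JSON text and parse it with the standard json module first. The sketch below uses the same invented values as the example above.

```python
import json
import pandas as pd

# JSON text with invented values; json.loads turns it into a Python dict
json_text = '{"Name": ["Anna", "Tom"], "Score": [88, 92]}'
data = json.loads(json_text)

# Each key becomes a column and each list becomes that column's values
df = pd.DataFrame(data)
print(df)
```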
Pandas - Analyzing DataFrames
You can analyze DataFrames with methods like:
df.head(): View the first few rows.
df.info(): Get an overview of the DataFrame.
df.describe(): Get summary statistics.
Example:
print(df.describe())
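A slightly fuller sketch of the three methods on a small table of invented scores follows. Note that info prints its report directly and returns None, so it is called without print.

```python
import pandas as pd

# Invented scores, just large enough to show the three methods
df = pd.DataFrame({
    'Name': ['Anna', 'Tom', 'Lee', 'Kim'],
    'Score': [88, 92, 75, 81],
})

print(df.head(2))       # first two rows
df.info()               # prints column names, non-null counts, and dtypes
stats = df.describe()   # count, mean, std, min, quartiles, max of numeric columns
print(stats)
```

By default describe summarizes only the numeric columns, so Name does not appear in stats.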
Pandas - Cleaning Data
Empty Cells
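Handle missing values using fillna, which replaces them, or dropna, which removes the rows that contain them.
df.fillna(0, inplace=True) # Replace every missing value with 0
df.dropna(inplace=True) # Or: drop rows that contain missing values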
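The two methods are alternatives rather than steps to run in sequence. The sketch below, on invented data, applies each one separately and returns new DataFrames instead of modifying df in place; filling with the column mean is just one possible choice.

```python
import numpy as np
import pandas as pd

# Invented data with one missing value in each column
df = pd.DataFrame({
    'Name': ['Anna', 'Tom', None],
    'Score': [88.0, np.nan, 75.0],
})

# Fill only the Score column, using its mean instead of a fixed 0
filled = df.fillna({'Score': df['Score'].mean()})

# Drop every row of the original that has a missing value in any column
dropped = df.dropna()

print(filled)
print(dropped)
```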
Data in Wrong Format
Convert data types using pd.to_datetime or astype.
df['Date'] = pd.to_datetime(df['Date'])
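Real data often contains entries that cannot be converted. The following sketch, on invented data, uses the errors='coerce' option of pd.to_datetime so that an unparseable date becomes NaT instead of raising an error, and uses astype to turn a text column into integers.

```python
import pandas as pd

# Invented data: one date is not parseable, and Score is stored as text
df = pd.DataFrame({
    'Date': ['2024-01-05', 'not a date', '2024-03-20'],
    'Score': ['88', '92', '75'],
})

# errors='coerce' turns unparseable entries into NaT instead of raising
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# astype converts a whole column to the given type
df['Score'] = df['Score'].astype(int)

print(df.dtypes)
```

The resulting NaT values count as missing, so they can then be handled with fillna or dropna.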
Wrong Data
Replace incorrect values with replace.
df['Age'] = df['Age'].replace(0, 25) # Assign back; inplace=True on a selected column may not update df in newer pandas versions
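Besides swapping one known value for another, wrong data can be corrected with a rule. The sketch below is one possible approach on invented data; treating 0 as an entry error and 120 as the upper limit are assumptions made for the example.

```python
import pandas as pd

# Invented data: an age of 0 or above 120 is treated as an entry error here
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Cara'], 'Age': [25, 0, 130]})

# Swap a known bad value for a replacement and assign the result back
df['Age'] = df['Age'].replace(0, 25)

# Rule-based fix: cap any age above the assumed limit of 120
df.loc[df['Age'] > 120, 'Age'] = 120

print(df)
```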
Duplicates
Remove duplicate rows using drop_duplicates.
df.drop_duplicates(inplace=True)
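It can help to inspect duplicates before removing them. This sketch, on invented data, uses duplicated to flag repeated rows and then shows drop_duplicates both on whole rows and on a single column through its subset and keep parameters.

```python
import pandas as pd

# Invented data: the third row repeats the second exactly
df = pd.DataFrame({
    'Name': ['Anna', 'Tom', 'Tom', 'Tom'],
    'Score': [88, 92, 92, 70],
})

print(df.duplicated())   # True for rows that repeat an earlier row

# Compare all columns: only the exact repeat is removed
deduped = df.drop_duplicates()

# Compare Name only and keep the last row for each name
one_per_name = df.drop_duplicates(subset='Name', keep='last')

print(deduped)
print(one_per_name)
```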