Data Analysis with Pandas in Python

What is Pandas?

Pandas is a core Python library for data science. Its two main data structures, Series (a one-dimensional labeled array) and DataFrame (a two-dimensional labeled table), make it efficient to load, manipulate, and analyze tabular data.
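
To make the two data structures concrete, here is a minimal sketch. The values and the names `revenue` and `df_demo` are made up for illustration only.

```python
import pandas as pd

# A Series is a one-dimensional labeled array (illustrative values).
revenue = pd.Series([120, 95, 143], index=["Jan", "Feb", "Mar"], name="revenue")

# A DataFrame is a table whose columns are Series sharing one index.
df_demo = pd.DataFrame(
    {
        "city": ["Hà Nội", "Đà Nẵng", "Hà Nội"],
        "revenue": [120, 95, 143],
    }
)

# Selecting a single column returns a Series.
city_col = df_demo["city"]
print(type(city_col).__name__)  # Series
print(revenue["Feb"])           # 95
print(df_demo.shape)            # (3, 2)
```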

Getting started with Pandas

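import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv("doanh_thu_2024.csv")
print(df.head())      # first 5 rows
df.info()             # prints dtypes and non-null counts itself; it returns None
print(df.describe())  # summary statistics for numeric columns

# Filter rows with boolean conditions
df_hanoi = df[df["thanh_pho"] == "Hà Nội"]
df_cao = df[df["doanh_thu"] > 100_000_000]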
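
The two filters above can also be combined into a single condition. The following is a small sketch on made-up in-memory data that reuses the column names `thanh_pho` and `doanh_thu`; the names `df_sample`, `mask`, `df_hanoi_high`, and `df_two_cities` are illustrative.

```python
import pandas as pd

# Illustrative data; only the column names come from the snippet above.
df_sample = pd.DataFrame(
    {
        "thanh_pho": ["Hà Nội", "Hà Nội", "Đà Nẵng", "TP. Hồ Chí Minh"],
        "doanh_thu": [150_000_000, 80_000_000, 120_000_000, 90_000_000],
    }
)

# Each comparison yields a boolean Series; combine them with & (and) or | (or).
# The parentheses are required because & binds tighter than == and >.
mask = (df_sample["thanh_pho"] == "Hà Nội") & (df_sample["doanh_thu"] > 100_000_000)
df_hanoi_high = df_sample[mask]

# isin() matches a column against several values at once.
df_two_cities = df_sample[df_sample["thanh_pho"].isin(["Hà Nội", "Đà Nẵng"])]

print(len(df_hanoi_high))  # 1
print(len(df_two_cities))  # 3
```

Python's `and` / `or` keywords do not work element-wise on Series, which is why the bitwise operators are used here.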

Common operations

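# Group by month and aggregate revenue
doanh_thu = df.groupby("thang")["doanh_thu"].agg(["sum", "mean", "count"])

# Handle missing values: assign the result back, because fillna(inplace=True)
# on a selected column is chained assignment and may not update df in newer pandas
df["gia"] = df["gia"].fillna(df["gia"].median())

# Merge DataFrames (df_orders and df_customers are assumed to be loaded already;
# the default is an inner join)
df_merged = pd.merge(df_orders, df_customers, on="customer_id")

# Create a new column
df["loi_nhuan"] = df["doanh_thu"] - df["chi_phi"]

# Export the result (writing .xlsx requires an Excel engine such as openpyxl)
df.to_excel("bao_cao.xlsx", index=False)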
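
The snippets above depend on files and DataFrames that are not shown here. The following self-contained sketch runs the same missing-value, group-by, and merge steps on small made-up tables. All values are invented for illustration, as are the `ten` column and the names `per_month`, `inner`, and `left`.

```python
import pandas as pd

# Illustrative orders table with one missing price.
df_orders = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 4],
        "thang": [1, 1, 2, 2],
        "gia": [100.0, None, 300.0, 200.0],
    }
)
# Illustrative customers table; customer 4 is deliberately absent.
df_customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "ten": ["An", "Bình", "Chi"]}
)

# Fill the missing price with the column median (median of 100, 300, 200 is 200).
df_orders["gia"] = df_orders["gia"].fillna(df_orders["gia"].median())

# Aggregate prices per month.
per_month = df_orders.groupby("thang")["gia"].agg(["sum", "mean", "count"])

# pd.merge defaults to an inner join, so the order from customer 4 is dropped.
inner = pd.merge(df_orders, df_customers, on="customer_id")

# how="left" keeps every order; unmatched rows get NaN in the customer columns.
left = pd.merge(df_orders, df_customers, on="customer_id", how="left")

print(per_month)
print(len(inner), len(left))  # 3 4
```

Comparing the row counts before and after a merge is a cheap way to notice rows that were silently dropped by the default inner join.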

Combined with Matplotlib and Seaborn for visualization, Pandas forms a complete data analysis toolkit.
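
As a rough sketch of how Pandas hands data to Matplotlib, the example below plots made-up monthly revenue as a bar chart. It assumes Matplotlib is installed; the values and the output filename `doanh_thu_theo_thang.png` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative monthly revenue.
monthly = pd.DataFrame({"thang": [1, 2, 3], "doanh_thu": [120, 95, 143]})

# DataFrame.plot delegates to Matplotlib and returns an Axes object.
ax = monthly.plot(kind="bar", x="thang", y="doanh_thu", legend=False)
ax.set_xlabel("Month")
ax.set_ylabel("Revenue")
ax.set_title("Revenue by month")

plt.tight_layout()
plt.savefig("doanh_thu_theo_thang.png")  # hypothetical output file
```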