Seaborn DataV Std.
Author: 韩佳明 Hirsun
In the author's view, seaborn can serve as the first choice for data visualization.
data:image/s3,"s3://crabby-images/31415/31415fcd7b300815f9a91ab0734b7d2e65f4d108" alt="1654680752027.png"
Example gallery: https://seaborn.pydata.org/examples/index.html
Preface
Built on pandas and matplotlib.
The x-axis and y-axis can be swapped.
data:image/s3,"s3://crabby-images/ceea2/ceea2ee373769896b50130bdd34f2b7f390fa3f0" alt="1643832171224.png"
Use default
# with dataframe
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("masculinity.csv")
sns.countplot(x="how_masculine",data=df)
plt.show()
# with list
import seaborn as sns
import matplotlib.pyplot as plt
height = [62, 64, 69, 75, 66, 68, 65, 71, 76, 73]
weight = [120, 136, 148, 175, 137, 165, 154, 172, 200, 187]
sns.scatterplot(x=height, y=weight)
plt.show()
sns.set() # also applies seaborn's default style to plain matplotlib plots
Why?
data:image/s3,"s3://crabby-images/d8d66/d8d661a642f6ca580b2a912457342a1c72b3bbbc" alt="1643911321290.png"
Scatter plot
import seaborn as sns
import matplotlib.pyplot as plt
height = [62, 64, 69, 75, 66, 68, 65, 71, 76, 73]
weight = [120, 136, 148, 175, 137, 165, 154, 172, 200, 187]
sns.scatterplot(x=height, y=weight)
plt.show()
data:image/s3,"s3://crabby-images/51e68/51e688e33ff6930e45ec30b60407ea75902e2159" alt="1654101155065.png"
count plot
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("masculinity.csv")
sns.countplot(x="how_masculine", data=df)
plt.show()
data:image/s3,"s3://crabby-images/3294b/3294b18d20a6d7290484dfcd43534cc908d258aa" alt="1654101252680.png"
kdeplot
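Estimates the density of a single variable.
kdeplot draws a smoother curve than a histogram.
# bw_adjust scales the smoothing bandwidth: values above 1 smooth more, values below 1 follow the data more closely
sns.kdeplot(data=tips, x="total_bill", bw_adjust=5)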
data:image/s3,"s3://crabby-images/a3951/a395138a883b4db7974ee4ccf12935e226248150" alt="1654677976562.png"
hue
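You can specify hue, which adds one more dimension. The column passed to hue is another categorical column with a limited number of distinct values (for example, product quality: good, medium, bad). In effect it splits x into several groups.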
import matplotlib.pyplot as plt
import seaborn as sns
hue_colors = {"Yes": "black",
"No":"red"}
# HTML hex color codes: grey and green
# hue_colors = {"Yes": "#808080",
# "No": "#00FF00"}
sns.scatterplot(x= "total_bill", y= "tip", data=tips, hue="smoker", palette=hue_colors, hue_order = ["No", "Yes"])
plt.show()
data:image/s3,"s3://crabby-images/e4c2a/e4c2af88db7519a36ea47ab74f0d6f2f8a0993a7" alt="1654101429643.png"
For example, given the following data:
| | decade | category | female_winner |
|---|---|---|---|
| 0 | 1900 | Chemistry | 0.000000 |
| 1 | 1900 | Literature | 0.100000 |
| ... | 1901 | Medicine | 0.000000 |
| ... | 1901 | Peace | 0.071429 |
| ... | 1902 | Physics | 0.076923 |
With hue:
from matplotlib.ticker import PercentFormatter
ax = sns.lineplot(x='decade', y='female_winner', hue='category', data=prop_female_winners)
ax.yaxis.set_major_formatter(PercentFormatter(1.0))
data:image/s3,"s3://crabby-images/360a7/360a749a630a9794e2eade17c6e678a3c7bbd692" alt="1654016180008.png"
Without hue:
data:image/s3,"s3://crabby-images/db0fb/db0fb695a4dcd98f081cc19b9a7dc81b2af1d1da" alt="1654016289619.png"
relplot
relplot() lets you create subplots in a single figure
data:image/s3,"s3://crabby-images/96b82/96b824dbe469b54f01eb5dc6603003ca13bd85ff" alt="1654237490869.png"
row & col
- col and row create subplots
- hue distinguishes the values of an additional column within a single plot, which is how it differs from col and row
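The difference between col/row and hue can be sketched with a small synthetic DataFrame. The column names and values below are made up for illustration (they only imitate the tips dataset), and the Agg backend is selected so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data that imitates the shape of the tips dataset
rng = np.random.default_rng(0)
n = 120
demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n),
    "smoker": rng.choice(["Yes", "No"], n),
    "time": rng.choice(["Lunch", "Dinner"], n),
    "sex": rng.choice(["Male", "Female"], n),
})
demo["tip"] = 0.15 * demo["total_bill"] + rng.normal(0, 1, n)

# col and row split the figure into a grid of subplots;
# hue only changes the color of the points inside each subplot
g = sns.relplot(data=demo, x="total_bill", y="tip", kind="scatter",
                col="smoker", row="time", hue="sex")

n_rows, n_cols = g.axes.shape  # one row per "time" level, one column per "smoker" level
print(n_rows, n_cols)
```

Here smoker and time each have two levels, so the grid has four subplots, while sex never adds a subplot.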
data:image/s3,"s3://crabby-images/3b49c/3b49c9329fa789e33297a876eba5369aa8f58bfd" alt="1654237680409.png"
size
import seaborn as sns
import matplotlib.pyplot as plt
sns.relplot( x= "total_bill", y= "tip", data=tips, kind= "scatter", size="size", hue="size")
plt.show()
data:image/s3,"s3://crabby-images/42404/42404de036f2060d612b24c7984191ff17fbf171" alt="1654238298996.png"
style
data:image/s3,"s3://crabby-images/0be35/0be352da12f6732d12eb5413dce388bac3f6470a" alt="1654238365181.png"
transparency
data:image/s3,"s3://crabby-images/b3acb/b3acb8f9162c38da430aa5c59e2eb44450cf8601" alt="1654238397371.png"
size
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
# Create scatter plot of horsepower vs. mpg
sns.relplot(x="horsepower", y="mpg",
data=mpg, kind="scatter",
size="cylinders",
hue = "cylinders")
# Show plot
plt.show()
data:image/s3,"s3://crabby-images/b6567/b65676c3655a8049c3288efbf1c8f7381e17d9c2" alt="1654237899596.png"
dashes and markers
data:image/s3,"s3://crabby-images/9dcbb/9dcbbbe002b3fc2a48a2575abb41ddafafaf984f" alt="1654238532711.png"
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
# Add markers and make each line have the same style
sns.relplot(x="model_year", y="horsepower",
data=mpg, kind="line",
ci=None, style="origin",
hue="origin",
dashes = False,
markers = True)
# Show plot
plt.show()
data:image/s3,"s3://crabby-images/767b8/767b85557a79dca003c439d2991b76129e0c48cb" alt="1654238118436.png"
line with relplot
data:image/s3,"s3://crabby-images/c7c22/c7c228e8886831e26301d9c4fe84ad3fabf83086" alt="1654238596451.png"
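- The shaded region is the confidence interval
- We are 95% confident that the mean is within this interval
- It indicates the uncertainty in our estimate
# Show the standard deviation instead of the confidence interval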
sns.relplot(x="hour", y="NO_2",data=air_df,kind="line",ci="sd")
# Turning off confidence interval
sns.relplot(x="hour", y="NO_2", data=air_df, kind="line", ci=None)
catplot
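Distinguish categorical values from quantitative values; time values come up later as well.
catplot creates categorical plots. Usually one axis holds a quantitative column and the other holds a categorical column.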
- Same advantages of relplot()
- Easily create subplots with col= and row=
countplot
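Only one axis needs to be specified; the other axis shows the number of observations in each category of x.
Of course, y can be specified instead.
sns.countplot(x="how_masculine", data=masculinity_data)
# equivalent to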
sns.catplot(x="how_masculine", data=masculinity_data, kind="count")
data:image/s3,"s3://crabby-images/8c9d8/8c9d8f6c85691696e1f6579aa7edf5a1dce3ffd8" alt="1654241933070.png"
barplot
sns.catplot(x="day", y="total_bill",data=tips,kind="bar")
# Use ci=None to remove the error bars
data:image/s3,"s3://crabby-images/4e30a/4e30aa005460c09ad7f1a90d594837ecc5c03590" alt="1654241988121.png"
- Lines (also called error bars) show 95% confidence intervals for the mean
- Shows uncertainty about our estimate
- Assumes our data is a random sample
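What a 95% confidence interval for the mean looks like can be sketched with a plain NumPy bootstrap. Seaborn also estimates these intervals by bootstrapping, but the code below is only an illustration on made-up data, not seaborn's own implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample, e.g. the bills recorded on one day
sample = rng.normal(loc=20.0, scale=5.0, size=200)

# Resample with replacement many times and record the mean of each resample
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])

# The middle 95% of the bootstrap means gives the interval drawn as the error bar
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(sample.mean(), ci_low, ci_high)
```

A larger sample gives a narrower interval, which is why the error bars shrink for categories with many observations.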
data:image/s3,"s3://crabby-images/9aa1d/9aa1d60fc9ce3c4c1c2ffa5a5d0f27d7fbf1e6c9" alt="1654243411452.png"
boxplot
sns.catplot(x="time",y="total_bill",data=tips,kind="box",order=["Dinner","Lunch"])
data:image/s3,"s3://crabby-images/589c0/589c0dac8652b91a8e0663674f6f87ebcc66d5b2" alt="1654322697562.png"
ref. https://zhuanlan.zhihu.com/p/147645727
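- The first quartile (Q1), also called the lower quartile, is the value at the 25% position when the sample is sorted in ascending order.
- The second quartile (Q2), also called the median, is the value at the 50% position when the sample is sorted in ascending order.
- The third quartile (Q3), also called the upper quartile, is the value at the 75% position when the sample is sorted in ascending order.
- The difference between the third and first quartiles is the interquartile range (IQR).
The components of a box plot are shown in the figure:
- The upper fence is the upper quartile plus 1.5 times the box length (IQR);
- The lower fence is the lower quartile minus 1.5 times the box length (IQR);
- Data above the upper fence or below the lower fence are called outliers.
- The top of the box is the upper quartile; the bottom of the box is the lower quartile;
- The box length is the upper quartile minus the lower quartile.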
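The quantities in the list above can be computed directly with NumPy. This is a minimal sketch on a made-up sample; it uses NumPy's default linear interpolation for percentiles.

```python
import numpy as np

# Hypothetical sample; 40 is an obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 40], dtype=float)

q1, q2, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                      # box length

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are drawn individually as outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]

# The whiskers end at the most extreme data points still inside the fences
whisker_low = values[values >= lower_fence].min()
whisker_high = values[values <= upper_fence].max()

print(q1, q2, q3, iqr, outliers, whisker_low, whisker_high)
```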
data:image/s3,"s3://crabby-images/4fa4d/4fa4d094e4ca8c4cd6f58049d9e02b8fd095fb57" alt="1654323351446.png"
- Parameter sym="": hides the outliers
- Parameter whis
- By default, the whiskers extend to 1.5 * the interquartile range
- Make them extend to 2.0 * IQR: whis=2.0
- Show the 5th and 95th percentiles: whis=[5, 95]
- Show min and max values: whis=[0, 100]
sns.catplot(x="time", y="total_bill", data=tips, kind="box", whis=[0, 100])
data:image/s3,"s3://crabby-images/cb232/cb232c292e8c6bb8d1d75917121ad0fa4db72238" alt="1654322864343.png"
pointplot
- Points show mean of quantitative variable
- Vertical lines show 95% confidence intervals
data:image/s3,"s3://crabby-images/306f9/306f9a2fa0aea74977e1b9bbc84081c503724d5a" alt="1654323543619.png"
data:image/s3,"s3://crabby-images/c2194/c21942b74c5a09482e5a9f3eac97a0d4be555ca1" alt="1654323593081.png"
Both show:
- Mean of quantitative variable
- 95% confidence intervals for the mean
Differences:
- Line plot has quantitative variable (usually time) on x-axis
- Point plot has categorical variable on x-axis
sns.catplot(x="age", y="masculinity_important", data=masculinity_data, hue="feel_masculine", kind="point")
data:image/s3,"s3://crabby-images/ab07e/ab07e43f5c81602e3d8d4dec84ef687d2f28520f" alt="1654323666507.png"
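- Parameters
- join=False: removes the lines that connect the points
- estimator=numpy.median: one x value may correspond to several y values, and the mean is used when this parameter is not set; setting it plots the median instead
- capsize=0.2: sets the width of the caps on the error bars
- ci=None: turns off the confidence intervals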
displot
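Comparable to matplotlib's hist.
displot combines rugplot(), kdeplot(), and a matplotlib-style histogram in one function. The older distplot() used in the code below has been deprecated since seaborn 0.11; displot() and histplot() replace it.
- Suited to showing how the values of one column are distributed over intervals
- For example, if the column holds temperatures, the plot bins them automatically, counts the observations in each bin, and shows the counts on the y-axis
- The values must be numeric or datetime (continuous values)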
# Display a Seaborn distplot
# distplot takes a single one-dimensional input, such as one DataFrame column
sns.distplot(df['Award_Amount'])
plt.show()
# Clear the distplot
plt.clf()
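The binning step described above (split a continuous column into intervals, then count the observations in each interval) can be sketched with numpy.histogram. The temperature values are made up for illustration.

```python
import numpy as np

# Hypothetical temperature readings
temps = np.array([12.0, 14.5, 15.0, 18.2, 19.9, 21.0, 23.5, 24.0, 27.8, 30.0])

# Split the range [12, 30] into 3 equal-width bins and count the values in each
counts, edges = np.histogram(temps, bins=3)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

The counts are what a histogram shows as bar heights on the y-axis.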
data:image/s3,"s3://crabby-images/289e3/289e3e66b760e14f59e3855950e441f66a37bf08" alt="1643832395576.png"
Common parameters
data:image/s3,"s3://crabby-images/66826/668263c8f6b4e6d79a0b8ce3e58ad0d712c16e67" alt="CleanShot 2022-02-03 at 04.08.54@2x.png"
The range of the independent variable can be limited, for example xlim=(0, 25000).
Regression Plots
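regplot and lmplot are almost identical, except that
- regplot has no aspect parameter, but the width of its x-axis adapts automatically
- regplot has no row parameter, so it cannot draw several plots in one call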
regplot
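- Given an x column and a y column of a DataFrame, plots the y value for each x
- One x and one y determine one or more points
- The index is ordered; x values are sorted automatically when plotted
- Used to show how a large number of relatively irregular values are distributed
The values may be numeric or datetime (continuous values), or categorical.
It draws a regression line fitted to the existing data.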
# Create an lmplot of premiums vs. insurance_losses
# Draws one dataset at a time
# The marker can be set with the parameter markers='+'
sns.lmplot(y = "premiums",x = 'insurance_losses', data = df)
# Display the second plot
plt.show()
data:image/s3,"s3://crabby-images/239b1/239b11a01cc90eb1f36b16f3e1d08cbd52ef2e25" alt="1643832742701.png"
# Plotting the age of Nobel Prize winners
sns.lmplot(y = "year", x = "age", data = nobel, lowess=True,
aspect=2, line_kws={'color' : 'black'})
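# lowess=True fits a locally weighted smooth curve instead of a straight line (requires statsmodels); no confidence band is drawn for this fit
# aspect is the width-to-height ratio of each facet, so a larger value gives a wider figure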
data:image/s3,"s3://crabby-images/ece58/ece581541fa043583cc62662481a2617a67d30d0" alt="1654018341077.png"
sns.lmplot(y = "age", x = "year", row='category', aspect=2, line_kws={'color' : 'black'}, data = nobel)
data:image/s3,"s3://crabby-images/083fa/083facdf2b281193320bd7fadf4e884e83d82fcc" alt="1654019231611.png"
order
sns.regplot(data=df, x='temp', y='total_rentals', order=2)
data:image/s3,"s3://crabby-images/f8603/f8603f71b0c4546cc5e7226eedf2f95b4cc4038a" alt="1643885588793.png"
x_jitter
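Seaborn also supports regression plots with categorical variables. It may be interesting to see how rentals change from month to month. In this example, the jitter parameter makes it easier to see the individual distribution of rental values for each month.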
sns.regplot(data=df, x='mnth', y='total_rentals', x_jitter=.1, order=2)
data:image/s3,"s3://crabby-images/05162/051623057899bdef3a0f62a277fa15b445634e69" alt="1643885785684.png"
x_estimator
sns.regplot(data=df, x='mnth', y='total_rentals', x_estimator=np.mean, order=2)
In some cases, an x_estimator can be useful for highlighting trends
data:image/s3,"s3://crabby-images/ccff8/ccff89bbf4cac7c83d8cc06ca428d6398b4e0dd2" alt="1643885914138.png"
Binning the data
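When a variable is continuous, it can help to divide it into discrete bins. In this case we can divide temperature into four bins, and Seaborn takes care of computing the bins and plotting the result. This is quicker than creating the bins with pandas or some other mechanism. The shortcut helps to get a quick read on continuous data such as temperature.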
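A minimal sketch of binning in a regression plot, assuming the x_bins parameter of regplot is what the figure below was drawn with. The data is synthetic; only the column names temp and total_rentals follow the examples in this section.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: rentals rise with (normalized) temperature
rng = np.random.default_rng(1)
temp = rng.uniform(0.0, 1.0, 300)
demo = pd.DataFrame({
    "temp": temp,
    "total_rentals": 1000 + 5000 * temp + rng.normal(0, 500, 300),
})

# x_bins=4 groups temp into four bins and plots one point estimate per bin;
# the regression line is still fitted to the original, unbinned data
ax = sns.regplot(data=demo, x="temp", y="total_rentals", x_bins=4)
```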
data:image/s3,"s3://crabby-images/81731/817316aa97f8f781222b272d7c3f7f56705451f1" alt="1643886054786.png"
disable reg line
data:image/s3,"s3://crabby-images/13d49/13d49378aceab35a34c203645a8c4700b21b5bf5" alt="1643886378007.png"
Evaluating regression
Evaluating regression with residplot()
Draws a residual plot.
sns.residplot(data=df, x='temp', y='total_rentals')
data:image/s3,"s3://crabby-images/0a43c/0a43cd9662764e204f6812d5e1f8284cf28f9517" alt="1643885488489.png"
sns.residplot(data=df, x='temp', y='total_rentals', order=2)
data:image/s3,"s3://crabby-images/8f49c/8f49cbdc552dedd24bd9cd441fd91075a5a741f6" alt="1643885536234.png"
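What a residual plot shows can be sketched with NumPy: fit a polynomial of the given order and plot what is left over. The data below is a made-up, noise-free quadratic, so the effect of order is easy to see; np.polyfit stands in here for whatever fitting routine residplot uses internally.

```python
import numpy as np

# Hypothetical noise-free quadratic relationship
x = np.linspace(0, 10, 21)
y = 2.0 * x**2 + 3.0 * x + 1.0

def residuals(x, y, order):
    """Fit a polynomial of the given order and return observed minus fitted values."""
    coeffs = np.polyfit(x, y, deg=order)
    return y - np.polyval(coeffs, x)

res_linear = residuals(x, y, order=1)      # a straight line leaves a curved pattern
res_quadratic = residuals(x, y, order=2)   # the right order leaves almost nothing

print(np.abs(res_linear).max(), np.abs(res_quadratic).max())
```

A visible pattern in the residuals suggests the model order is too low; residuals scattered around zero suggest the fit is adequate.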
lmplot
Very similar to regplot, but more capable.
# Create an lmplot of premiums vs. insurance_losses
sns.lmplot(y = "premiums",x = 'insurance_losses', data = df)
# Display the second plot
plt.show()
# Compared with regplot, the figure below only seems to lack the top and right borders
data:image/s3,"s3://crabby-images/f4797/f479774a424f51d97503f8cee8bdb0ef50e6f83d" alt="1643832817477.png"
Parameter hue
Organize data by colors (hue)
# Create a regression plot using hue
# Usually a single call is enough
sns.lmplot(data=df,
x="insurance_losses",
y="premiums",
hue="Region")
# Show the results
plt.show()
data:image/s3,"s3://crabby-images/5f65d/5f65d00a6b1c0ba41f1193bd52d0da2037a74008" alt="1643833052665.png"
Parameters col & row
sns.lmplot(x="quality",
y="alcohol",
data=df,
col="type")
data:image/s3,"s3://crabby-images/580e2/580e27c181f797ec70fb4ccec3d923f083a10e50" alt="1643833236520.png"
# Create a regression plot with multiple rows
sns.lmplot(data=df,
x="insurance_losses",
y="premiums",
row="Region")
# Show the plot
plt.show()
data:image/s3,"s3://crabby-images/ab769/ab769ee259093f321a30b6a3f038c98cc3c7c10a" alt="1643833288899.png"
Style
set_style()
for style in ['white','dark','whitegrid','darkgrid','ticks']:
    sns.set_style(style)
    sns.distplot(df['Tuition'])
    plt.show()
data:image/s3,"s3://crabby-images/4ae11/4ae1136086cc219a27225383cde39888db6593f0" alt="1643833567752.png"
Remove the spines
# Set the style to white
sns.set_style('white')
# Create a regression plot
sns.lmplot(data=df,
x='pop2010',
y='fmr_2')
sns.despine(left=True)
# Show the plot and clear the figure
plt.show()
plt.clf()
data:image/s3,"s3://crabby-images/2b6f5/2b6f5e98bff97077b37723b6d3af047c3a9c1424" alt="1643833763462.png"
Defining a color
use matplotlib to assign
Seaborn supports assigning colors to plots using matplotlib color codes
sns.set(color_codes=True)
sns.distplot(df['Tuition'], color='g')
use Palettes
for p in sns.palettes.SEABORN_PALETTES:
    sns.set_palette(p)
    sns.distplot(df['Tuition'])
data:image/s3,"s3://crabby-images/4faf5/4faf5956659452ece08079b752b0d79e6de8b9fb" alt="1643834183142.png"
Displaying Palettes
- Seaborn uses the set_palette() function to define a palette
- sns.color_palette() returns the current palette
- sns.palplot() function displays a palette
In general, the palette affects the colors of the several curves within one plot, not the colors across subplots.
for p in sns.palettes.SEABORN_PALETTES:
    sns.set_palette(p)
    sns.palplot(sns.color_palette())
    plt.show()
data:image/s3,"s3://crabby-images/ea1a2/ea1a25f7a82da313d881c78aa43c3d3cc502d5c9" alt="1643834497812.png"
Defining Custom Palettes
data:image/s3,"s3://crabby-images/77323/77323c5f2c05f9b9b00343696d8801f3cd2e69e2" alt="1643834621754.png"
data:image/s3,"s3://crabby-images/9a0c2/9a0c262d8d7d11085dcb101d18b78e459876827f" alt="1654324862208.png"
data:image/s3,"s3://crabby-images/c2c2a/c2c2a879c980190117ca79c790af0ca1524867ce" alt="1654324884580.png"
use in plot
# Create a violinplot with the husl palette
sns.violinplot(data=df,
x='Award_Amount',
y='Model Selected',
palette='husl')
plt.show()
plt.clf()
data:image/s3,"s3://crabby-images/8cdf3/8cdf39940571d387e30f109aa86d08ee492f40f5" alt="1643883543016.png"
set_context()
Smallest to largest: "paper" , "notebook" , "talk" , "poster"
sns.set_context("talk")
data:image/s3,"s3://crabby-images/e3883/e3883df91fcc858c4f94969eeb046237f43a5bfa" alt="1654325111624.png"
Plots of each observation
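- Used to show how the values in each category are distributed
- Specify y: a categorical column with a limited number of distinct values (apple, banana, pear)
- Each x value belonging to a category is drawn as a point on the plot.
- The values must be numeric or datetime (continuous values)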
stripplot
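Seaborn's stripplot() shows every observation in the dataset. In some cases it can be hard to see individual data points. We can use the jitter parameter to see more easily how the Average Covered Charges vary by DRG Definition (the diagnostic reimbursement code).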
sns.stripplot(data=df, y="DRG Definition",x="Average Covered Charges",jitter=True)
data:image/s3,"s3://crabby-images/83ee0/83ee0dc16829948d0c964cf75f3af42d57505070" alt="1643845492580.png"
swarmplot
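Very similar to stripplot.
We can use swarmplot() to draw a more elaborate visualization of all the data. This plot uses a more complex algorithm to place the observations so that they do not overlap. The drawback is that swarmplot() does not scale well to large datasets.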
sns.swarmplot(data=df, y="DRG Definition", x="Average Covered Charges")
data:image/s3,"s3://crabby-images/fe84b/fe84b8ff5ce41cd7dd9857ef2c0c8a5710ac0de6" alt="1643845915866.png"
Abstract representations
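You can specify hue, which adds one more dimension. The column passed to hue is another categorical column with a limited number of distinct values (for example, product quality: good, medium, bad). In effect it splits x into several groups.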
boxplot
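The next category of plots shows abstract representations of the data. boxplot() is the most common of these. It shows several measures related to the distribution of the data, including the median, the upper and lower quartiles, and the outliers.
- The values must be numeric or datetime (continuous values)
- Specify y: a categorical column with a limited number of distinct values (apple, banana, pear)
- The characteristics of the values in each category (for example, yield) are shown on the x-axis
- Used to show how a large number of relatively irregular values are distributed
Example: comparing the yields of different fruits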
data:image/s3,"s3://crabby-images/0b925/0b925bf09bbeb6d572d6dcdfe93ca2223ab112b8" alt="1643846287887.png"
violinplot
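Similar to the box plot.
violinplot() is a combination of a kernel density plot and a box plot, and it is suited to giving an alternative view of the distribution of the data. Because the plot uses a kernel density estimate, it does not show every data point. This is useful for displaying large datasets, but it can be computationally intensive to create.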
sns.violinplot(x='AGE', y='WTKG3', data=data, inner=None)
plt.show()
data:image/s3,"s3://crabby-images/941df/941df205dda716e2ddde7c8fe9f19321825017cf" alt="1654691413852.png"
lvplot
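sns.boxenplot(data=df, y="DRG Definition", x="Average Covered Charges")
The last plot in this group is lvplot(), short for letter value plot; it was renamed boxenplot() in seaborn 0.9, which is the name used in the call above. The API is the same as boxplot() and violinplot(), but it scales more effectively to large datasets. It is a hybrid of boxplot() and violinplot() that renders relatively quickly and is easy to interpret.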
data:image/s3,"s3://crabby-images/ae0e2/ae0e216796403d76b07b892342dd0073a3bc9344" alt="1643846529178.png"
Statistical estimates
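You can specify hue, which adds one more dimension. The column passed to hue is another categorical column with a limited number of distinct values (for example, product quality: good, medium, bad). In effect it splits x into several groups.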
barplot
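The last category of plots shows statistical estimates of the data. barplot() shows an estimate of a value together with its confidence interval. In this example we include the hue parameter described in chapter 1, which gives another useful way to look at this categorical data.
Given a y column, it shows the x value that corresponds to each y.
- The values must be numeric or datetime (continuous values)
- Specify y: a categorical column with a limited number of distinct values (apple, banana, pear); each category appears once on the axis
- The x value of each category is shown on the x-axis
- Suited to columns with a limited number of categories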
sns.barplot(data=df, y="DRG Definition",x="Average Covered Charges",hue="Region")
data:image/s3,"s3://crabby-images/d7cba/d7cba4938a326a9df7238e272ce66d1a872576a0" alt="1643846632070.png"
pointplot
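pointplot() is similar to barplot() in that it shows a summary measure and a confidence interval. pointplot() is very useful for observing how values change across categories.
Same as barplot, except that the points are connected by lines.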
data:image/s3,"s3://crabby-images/c6b21/c6b2118ff59dfcb8a063a9e0ffe3a3cba33dbe1d" alt="1643847510873.png"
# Create a pointplot and include the capsize in order to show caps on the error bars
sns.pointplot(data=df,
y='Award_Amount',
x='Model Selected',
capsize=.1) # capsize adds caps to the error bars
plt.show()
plt.clf()
data:image/s3,"s3://crabby-images/67708/67708555266cf0e451de61cfb1132c19f3565cd4" alt="1643884618865.png"
countplot
Similar to barplot, but only y is given: it counts how often each value occurs in the y column instead of showing an x value for each y.
sns.countplot(data=df, y="DRG_Code", hue="Region")
data:image/s3,"s3://crabby-images/cbb52/cbb52d30a4ed9773fedde13e9cfe48630fe812dc" alt="1643847725134.png"
Matrix Plots
grid format
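df_crosstab = pd.crosstab(df["mnth"], df["weekday"], values=df["total_rentals"],aggfunc='mean').round(0)
# values is the column that is aggregated for each (row, column) pair; aggfunc is needed because a pair may match more than one value
# round sets the number of decimal places to keep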
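How values and aggfunc work together can be sketched on a tiny made-up table. The column names follow the call above; the numbers are invented so that the means are easy to check by hand.

```python
import pandas as pd

# Hypothetical rentals: several rows share the same (mnth, weekday) pair
demo = pd.DataFrame({
    "mnth":          [1, 1, 1, 2, 2, 2],
    "weekday":       [0, 0, 1, 0, 1, 1],
    "total_rentals": [100, 200, 50, 300, 400, 600],
})

# Each cell is the mean of total_rentals over the rows with that month and weekday
table = pd.crosstab(demo["mnth"], demo["weekday"],
                    values=demo["total_rentals"], aggfunc="mean").round(0)
print(table)
```

Month 1 and weekday 0 match two rows (100 and 200), so that cell holds their mean, 150.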
data:image/s3,"s3://crabby-images/2ce40/2ce40fb77230c78eae8918eb781dee87c375465f" alt="1643901804516.png"
heatmap
sns.heatmap(df_crosstab, annot=True, fmt=".0f", cmap="YlGnBu", cbar=True, center=df_crosstab.loc[9, 6]) # fmt=".0f" because the crosstab holds float means; "d" only accepts integers
data:image/s3,"s3://crabby-images/08eca/08eca577425daa4e19d710ee4f49d52db8e2e660" alt="1643902031151.png"
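With seaborn's default colormap, lighter colors mean larger values; the direction depends on the cmap that is chosen. The center parameter can be used to re-center the colormap on a given value.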
correlation matrix
Pandas corr function calculates correlations between columns in a dataframe
df.corr()
sns.heatmap(df.corr())
data:image/s3,"s3://crabby-images/87d59/87d59d7d6bfb72efa60099900bb5adb9885e6f08" alt="1643902185784.png"
In the figure above, the lighter the color, the higher the correlation between the two columns.
Grid Plots
Specify an attribute column to split one plot into several plots, so that rows with the same attribute value are drawn in the same plot.
Tidy data
- Seaborn's grid plots require data in "tidy format"
- One observation per row of data
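Converting a wide table into tidy format can be sketched with pandas melt. The table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one column per year, so each row holds two observations
wide = pd.DataFrame({
    "school": ["A", "B"],
    "2019": [10, 20],
    "2020": [12, 25],
})

# Tidy format: one observation (school, year, enrollment) per row
tidy = wide.melt(id_vars="school", var_name="year", value_name="enrollment")
print(tidy)
```

After melting, year is an ordinary column, so it can be passed to col, row, or hue.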
data:image/s3,"s3://crabby-images/ee2f8/ee2f8b81f637b907c0c13ac53ae814d5fd2898a5" alt="1643902367017.png"
# g = sns.FacetGrid(df, col=xxx, row=xxx, col_order=xxx, row_order=xxx) defines how the figure is split into facets
# g.map(sns.xxx, "which_col_name") sets the plot type and the data column; order can also be passed here
g = sns.FacetGrid(df, col="HIGHDEG")
g.map(sns.boxplot, 'Tuition',order=['1', '2', '3', '4']) # the map step is required; boxplot receives only one column here
data:image/s3,"s3://crabby-images/ba274/ba2741b51644088351f366ed0e4ea21050fd6380" alt="1643904206013.png"
# Create FacetGrid with Degree_Type and specify the order of the rows using row_order
g2 = sns.FacetGrid(df,
row="Degree_Type",
row_order=['Graduate', 'Bachelors', 'Associates', 'Certificate'])
# Map a pointplot of SAT_AVG_ALL onto the grid
g2.map(sns.pointplot, 'SAT_AVG_ALL',) # no y variable is given
# Show the plot
plt.show()
plt.clf()
data:image/s3,"s3://crabby-images/586e9/586e9c4a2a18a65685c7289be8695a72eed0d892" alt="1643904374803.png"
factorplot()
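A quicker way to draw grid plots. factorplot() was renamed catplot() in seaborn 0.9, so the call below uses the newer name.
# Create a facetted pointplot of Average SAT_AVG_ALL scores facetted by Degree Type
sns.catplot(data=df,
x='SAT_AVG_ALL',
kind='point', # kind may also be 'strip', 'swarm', 'box', 'violin', 'boxen', 'bar', or 'count'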
row='Degree_Type',
row_order=['Graduate', 'Bachelors', 'Associates', 'Certificate'])
plt.show()
plt.clf()
data:image/s3,"s3://crabby-images/9adec/9adec61ea3f6eca71c31e7ca23671c673071b44b" alt="1643904604973.png"
lmplot
Passing parameters such as col and row to lmplot produces grid plots.
The example below uses three attribute dimensions.
# Create an lmplot that has a column for Ownership, a row for Degree_Type and hue based on the WOMENONLY column
sns.lmplot(data=df,
x='SAT_AVG_ALL',
y='Tuition',
col="Ownership",
col_order=inst_ord,
row='Degree_Type',
row_order=['Graduate', 'Bachelors'],
hue='WOMENONLY',
)
plt.show()
plt.clf()
data:image/s3,"s3://crabby-images/aeff5/aeff5a9b110a67ce7c4c21c07a61c76d9ce37736" alt="1643904870981.png"
Pair Plot
A pair of x and y locates one or more values.
PairGrid
g = sns.PairGrid(df, vars=["Fair_Mrkt_Rent", "Median_Income"])
g = g.map(plt.scatter)
data:image/s3,"s3://crabby-images/8e40f/8e40ffd0dc27d4c4db04223715116768a4d7723a" alt="1643906168843.png"
sns.pairplot(df, vars=["Fair_Mrkt_Rent","Median_Income"], kind='reg',diag_kind='hist')
# kind sets the off-diagonal plots; diag_kind sets the diagonal plots
# kind defaults to 'scatter' and diag_kind defaults to 'auto'
data:image/s3,"s3://crabby-images/4233e/4233ee06bcd569fb3d8e87f2f2a33a33d3d72e67" alt="1643906329961.png"
pairplot
sns.pairplot(df.query('BEDRMS < 3'),
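vars=["Fair_Mrkt_Rent","Median_Income", "UTILITY"], # every pairing of the 3 variables: 3 x 3 = 9 panels
hue='BEDRMS', palette='husl', # palette sets the color of each hue level
plot_kws={'alpha': 0.5}) # change the transparency
# If kind is not specified, the default (scatter) is used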
data:image/s3,"s3://crabby-images/aa56d/aa56d6ba9d5729855e4a5126deff4557a3591b03" alt="1643906671957.png"
Full customization: choosing the x and y variables
# Build a pairplot with different x and y variables
sns.pairplot(data=df,
x_vars=["fatal_collisions_speeding", "fatal_collisions_alc"], # x variables
y_vars=['premiums', 'insurance_losses'], # y variables
kind='scatter',
hue='Region',
palette='husl')
plt.show()
plt.clf()
data:image/s3,"s3://crabby-images/398fb/398fb32003138355ff0935e83eeaf89c89e4ea8c" alt="1643906866451.png"
Joint plot
data:image/s3,"s3://crabby-images/47ab8/47ab8eb34f40943cf7cebab2f6f75999a038f2f1" alt="1643909289758.png"
JointGrid
# Build a JointGrid comparing humidity and total_rentals
sns.set_style("whitegrid")
g = sns.JointGrid(x="hum", y="total_rentals",
data=df,
xlim=(0.1, 1.0))
# Draw regplot on the joint axes and distplot on the marginal axes; g.plot_joint(sns.xxx) can be used as well
g.plot(sns.regplot, sns.distplot)
plt.show()
plt.clf()
data:image/s3,"s3://crabby-images/8d37b/8d37b7122f120e26b8c788bd232c944cd3b00c28" alt="1643909850117.png"
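from scipy import stats
g = sns.JointGrid(data=df, x="Tuition", y="ADM_RATE_ALL") # set up the grid
# Add a kde to the joint axes
g = g.plot_joint(sns.kdeplot)
# Add filled kdes to the marginal axes
g = g.plot_marginals(sns.kdeplot, shade=True)
# Add an annotation (JointGrid.annotate is not available in recent seaborn versions)
g = g.annotate(stats.pearsonr)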
data:image/s3,"s3://crabby-images/a6279/a62795e9b3e68bd64a16e7301fec16c2311df525" alt="1643910205501.png"
jointplot
A quicker way to draw a joint plot.
The margins come with distribution plots.
g = (sns.jointplot(x="Tuition", y="ADM_RATE_ALL", kind='scatter', # set the kind
xlim=(0, 25000),
marginal_kws = dict(bins=15,rug=True), # style of the marginal distribution plots
data=df.query('UG < 2500 & Ownership == "Public"'))
.plot_joint(sns.kdeplot)) # overlay a kde on the joint axes
data:image/s3,"s3://crabby-images/5eb1a/5eb1a45b830c94d195723a560ee1a6b5b7d0e612" alt="1643910506977.png"
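FacetGrid and AxesSubplot objects
Seaborn's plotting functions create two different types of objects: FacetGrids and AxesSubplots. To find out which type you are working with, first assign the plot output to a variable.
A FacetGrid consists of one or more AxesSubplots, which is how it supports subplots.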
data:image/s3,"s3://crabby-images/7d678/7d6783be85696bdae3fb68def91af586b7771563" alt="1654332634772.png"
Adding a title to a FacetGrid
data:image/s3,"s3://crabby-images/8612b/8612b2535728a2401bc3c7bcbf5fc4bc66679bfa" alt="1654332690708.png"
Adding a title to an AxesSubplot
data:image/s3,"s3://crabby-images/a91cd/a91cd8874898ccb965817969b1c8be1846c7c2a3" alt="1654333078982.png"
For subplot titles, it is recommended to add them afterwards, as a post-processing step.
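The two object types and their titling calls can be sketched as follows. The DataFrame is made up, and the Agg backend is selected so the sketch runs without a display; the exact class name printed for the axes object depends on the matplotlib version.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data
demo = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 5, 8],
    "group": ["a", "a", "b", "b"],
})

# Figure-level function: returns a FacetGrid
g = sns.relplot(data=demo, x="x", y="y", col="group", kind="scatter")
print(type(g))
g.fig.suptitle("Figure-level title", y=1.03)   # title for the whole figure
g.set_titles("group = {col_name}")             # title of each subplot

# Axes-level function: returns a single matplotlib axes object
fig, ax = plt.subplots()
ax = sns.scatterplot(data=demo, x="x", y="y", ax=ax)
print(type(ax))
ax.set_title("Axes-level title")
```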
Adding axis labels
g.set(xlabel="New X Label", ylabel="New Y Label")
Rotating x-axis tick labels
plt.xticks(rotation=90)
data:image/s3,"s3://crabby-images/ae952/ae952c2b0b936bd940d702d42b5a137d15d9d3f0" alt="1654333181304.png"
Changing the axis scale
# Plot the y-axis on a log scale
plt.yscale('log')
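Summary
- Use displot to look at a distribution
- displot combines rugplot(), kdeplot(), and a matplotlib-style histogram in one function
- Use lmplot for regression analysis
- Use lvplot (boxenplot in seaborn 0.9 and later) and similar plots to check how the data is distributed
- Use factorplot (catplot in seaborn 0.9 and later) to compare plots of the data by attribute
- Finally, once you are familiar with the data, use pairplot and jointplot to present it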
data:image/s3,"s3://crabby-images/bee3b/bee3b9a30cb7c1b44caf15496718e7e133aa4787" alt="1643910643554.png"