Joining Data with pandas

Hirsun (HAN Jiaming)大约 6 分钟

Joining Data with pandas

作者：韩佳明

未经许可，禁止转载，已做版权存证

Tables = DataFrames
Merging = Joining

Inner join

指定检查列，将数据（或pair）相同的行进行连接并保留。

wards_census = wards.merge(census, on=['ward'])
print(wards_census.head(4))

Suffixes

用于给相同的列名称加上自定义后缀以区分。

Left Join

左侧数据全部保留，将右侧数据连接到左侧。

wards_census = wards.merge(census,how = "left", on=['ward'])

right join 则反之

Merging a table to itself

Merging a table to itself with left join

Outer join

将左右进行连接，并保留没有被连接的行。

自定义检查列

original_sequels = sequels.merge(sequels, left_on='sequel', right_on='id', suffixes=('_org','_seq'))

merge_ordered()

对比

注意默认的连接方法是outer join

pd.merge_ordered(appl, mcd, on='date', suffixes=('_aapl','_mcd'), fill_method='ffill') # 指定空数据是用前一个值

merge_asof()

Similar to a merge_ordered() left join
Similar features as merge_ordered()
非精准匹配
merge 的 on 的 column 必须是已经排好序的
direction 默认是 backward

pd.merge_asof(visa, ibm, on=['date_time'],suffixes=('_visa','_ibm'),direction='forward')
# 匹配右边中不大于左边且最接近左边

变体应用

semi-join

genres_tracks = genres.merge(top_tracks, on='gid')
top_genres = genres[genres['gid'].isin(genres_tracks['gid'])]
print(top_genres.head())

anti-join

genres_tracks = genres.merge(top_tracks, on='gid', how='left', indicator=True)
gid_list = genres_tracks.loc[genres_tracks['_merge'] == 'left_only','gid']
non_top_genres = genres[genres['gid'].isin(gid_list)]
print(non_top_genres.head())

Concatenate vertically

垂直拼接

pd.concat([inv_jan, inv_feb, inv_mar], ignore_index=True)

pd.concat([inv_jan, inv_feb, inv_mar],ignore_index=False,keys=['jan','feb','mar'])

sort

pd.concat([inv_jan, inv_feb], sort=True)

拼接保留列

指定参数 join，默认仅拼接共同名称的列，

pd.concat([inv_jan, inv_feb], join='inner')

使用 append 代替concat

inv_jan.append([inv_feb, inv_mar],ignore_index=True,sort=True)

有条件的连接

适用于横向 join和垂直的拼接

为什么要有条件

Why:

Real world data is often NOT clean

What to do:

Fix incorrect data
Drop duplicate rows

条件参数

Checks if merge is of specied type

'one_to_one'
'one_to_many'
'many_to_one'
'many_to_many'

.merge(validate=None)

即连接前先检查是否符合条件，符合条件才会连接，不符合就不连接。

Joining Data with pandas

# Joining Data with pandas

# Inner join

# Suffixes

# Left Join

# Merging a table to itself

# Outer join

# 自定义检查列

# merge_ordered()

# 对比

# merge_asof()

# 变体应用

# semi-join

# anti-join

# Concatenate vertically

# sort

# 拼接保留列

# 使用 append 代替concat

# 有条件的连接

# 为什么要有条件

# 条件参数

Joining Data with pandas

Inner join

Suffixes

Left Join

Merging a table to itself

Outer join

自定义检查列

merge_ordered()

对比

merge_asof()

变体应用

semi-join

anti-join

Concatenate vertically

sort

拼接保留列

使用 append 代替concat

有条件的连接

为什么要有条件

条件参数