Pandas

Python

Published: 2020-03-05

Updated: 2021-11-15


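Pandas Tutorial

Pandas is an extension library for the Python language, used for data analysis.

Pandas is an open-source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools.

The name Pandas derives from the terms "panel data" and "Python data analysis".

Pandas is a powerful toolset for analyzing structured data. It is built on NumPy, which provides high-performance array and matrix operations.

Pandas can import data from a variety of formats and sources, such as CSV, JSON, SQL, and Microsoft Excel.

Pandas supports operations such as merging, reshaping, and selecting data, as well as data cleaning and feature processing.

Pandas is widely used in academia, finance, statistics, and other data analysis fields.

Pandas Applications

The main data structures in Pandas are Series (one-dimensional data) and DataFrame (two-dimensional data). These two structures are sufficient for most typical use cases in finance, statistics, social science, engineering, and similar fields.

Data Structures

A Series is an object similar to a one-dimensional array. It consists of a set of data (of any NumPy data type) and an associated set of data labels (the index).

A DataFrame is a tabular data structure that contains an ordered set of columns, each of which can hold a different value type (numeric, string, Boolean). A DataFrame has both a row index and a column index, and can be thought of as a dict of Series that share one index.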
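The "dict of Series sharing one index" description can be checked directly. The following is a minimal sketch; the column names and values are illustrative assumptions, not part of any real data set.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative values)
calories = pd.Series([420, 380, 390], index=["day1", "day2", "day3"])
duration = pd.Series([50, 40, 45], index=["day1", "day2", "day3"])

# A DataFrame behaves like a dict of Series aligned on a common index
df = pd.DataFrame({"calories": calories, "duration": duration})

# Selecting one column gives back a Series with the same index as the frame
col = df["calories"]
print(type(col).__name__)
print(col.index.equals(df.index))
```

Selecting a column returns a Series, and its index is the row index of the DataFrame.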

Installing Pandas

Pandas requires Python. Before starting, we assume that you have already installed Python and pip.

Install pandas with pip:

pip install pandas

Once the installation succeeds, the pandas package can be imported:

import pandas

Example - check the pandas version

>>> import pandas
>>> pandas.__version__  # check the version
'1.1.5'

pandas is usually imported under the alias pd:

import pandas as pd

Example - check the pandas version

>>> import pandas as pd
>>> pd.__version__  # check the version
'1.1.5'

A simple pandas example:

Example

import pandas as pd

mydataset = {
 'sites': ["Google", "Runoob", "Wiki"],
 'number': [1, 2, 3]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

Running the code above produces:

    sites  number
0  Google       1
1  Runoob       2
2    Wiki       3

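Pandas Data Structures - Series

A Pandas Series is like a single column of a table. It resembles a one-dimensional array and can hold any data type.

A Series consists of an index and a column of values. The constructor is: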

pandas.Series( data, index, dtype, name, copy)

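Parameters:

data: the data (array-like, such as a list or an ndarray; a dict or a scalar is also accepted).
index: the index labels; if omitted, the index starts at 0.
dtype: the data type; inferred from the data by default.
name: the name of the Series.
copy: whether to copy the input data; defaults to False.

Creating a simple Series: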

import pandas as pd

a = [1, 2, 3]

myvar = pd.Series(a)

print(myvar)

Output:

0    1
1    2
2    3
dtype: int64

As the output shows, when no index is specified the index values start at 0, and we can read data by index value:

import pandas as pd

a = [1, 2, 3]

myvar = pd.Series(a)

print(myvar[1])

Output:

2

We can also specify the index values, as in the following example:

import pandas as pd

a = ["Google", "Runoob", "Wiki"]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

Output:

x    Google
y    Runoob
z      Wiki
dtype: object

Reading data by index value:

import pandas as pd

a = ["Google", "Runoob", "Wiki"]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar["y"])

Output:

Runoob

We can also create a Series from key/value pairs, such as a dict:

import pandas as pd

sites = {1: "Google", 2: "Runoob", 3: "Wiki"}

myvar = pd.Series(sites)

print(myvar)

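Output:

1    Google
2    Runoob
3      Wiki
dtype: object

As the output shows, the dict keys become the index values.

If only part of the dict is needed, pass the index values of the entries you want, as in the following example: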

import pandas as pd

sites = {1: "Google", 2: "Runoob", 3: "Wiki"}

myvar = pd.Series(sites, index = [1, 2])

print(myvar)

Output:

1    Google
2    Runoob
dtype: object

Setting the name parameter of a Series:

import pandas as pd

sites = {1: "Google", 2: "Runoob", 3: "Wiki"}

myvar = pd.Series(sites, index = [1, 2], name="RUNOOB-Series-TEST" )

print(myvar)

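Pandas Data Structures - DataFrame

A DataFrame is a tabular data structure that contains an ordered set of columns, each of which can hold a different value type (numeric, string, Boolean). A DataFrame has both a row index and a column index, and can be thought of as a dict of Series that share one index.

The DataFrame constructor is: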

pandas.DataFrame( data, index, columns, dtype, copy)

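Parameters:

data: the data (ndarray, Series, map, list, dict, and similar types).
index: the index values, also called row labels.
columns: the column labels; defaults to RangeIndex (0, 1, 2, …, n).
dtype: the data type.
copy: whether to copy the input data; defaults to False.

A Pandas DataFrame is a two-dimensional structure, similar to a two-dimensional array.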

import pandas as pd

data = [['Google',10],['Runoob',12],['Wiki',13]]

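df = pd.DataFrame(data, columns=['Site', 'Age'])
# Convert only the numeric column; passing dtype=float for the whole frame fails on the string column in recent pandas versions
df['Age'] = df['Age'].astype(float)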

print(df)

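Output:

     Site   Age
0  Google  10.0
1  Runoob  12.0
2    Wiki  13.0

The following example creates a DataFrame from a dict of lists (ndarrays work the same way). All of the lists must have the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.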

import pandas as pd

data = {'Site':['Google', 'Runoob', 'Wiki'], 'Age':[10, 12, 13]}

df = pd.DataFrame(data)

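print(df)

Output:

     Site  Age
0  Google   10
1  Runoob   12
2    Wiki   13

As the output shows, a DataFrame is a table made up of rows and columns.

A DataFrame can also be built from a list of dicts (key/value pairs), where the dict keys become the column names: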

import pandas as pd

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

df = pd.DataFrame(data)

print(df)

Output:

   a   b     c
0  1   2   NaN
1  5  10  20.0

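Entries with no corresponding data are NaN.

The loc attribute returns the data of a given row. If no index has been set, the first row has index 0, the second row has index 1, and so on: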

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# load the data into a DataFrame object
df = pd.DataFrame(data)

# return the first row
print(df.loc[0])
# return the second row
print(df.loc[1])

Output:

calories    420
duration     50
Name: 0, dtype: int64
calories    380
duration     40
Name: 1, dtype: int64

Note: the result is a Pandas Series.

Several rows can be returned at once with the [[ … ]] form, where … is the list of row index values separated by commas:

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# load the data into a DataFrame object
df = pd.DataFrame(data)

# return the first and second rows
print(df.loc[[0, 1]])

Output:

   calories  duration
0       420        50
1       380        40

Note: the result is a Pandas DataFrame.

We can specify the index values, as in the following example:

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)

Output:

      calories  duration
day1       420        50
day2       380        40
day3       390        45

The loc attribute returns the row that corresponds to a given index label:

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

# select by index label
print(df.loc["day2"])

Output:

calories    380
duration     40
Name: day2, dtype: int64
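loc selects by label, while the related iloc indexer selects by integer position. The sketch below reuses the calories/duration data from the example above to show that both reach the same row, and that a label slice includes its end label.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

by_label = df.loc["day2"]        # row selected by its index label
by_position = df.iloc[1]         # the same row selected by position
print(by_label.equals(by_position))

cell = df.loc["day2", "calories"]    # a single cell: row label, column label
label_slice = df.loc["day1":"day2"]  # label slices include both endpoints
print(cell)
print(label_slice)
```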

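Pandas CSV Files

CSV (Comma-Separated Values; sometimes called character-separated values, because the delimiter does not have to be a comma) files store tabular data (numbers and text) as plain text.

CSV is a general-purpose, relatively simple file format that is widely used in consumer, business, and scientific applications.

Pandas makes it easy to work with CSV files. This section uses nba.csv as the example file; you can download nba.csv or open it to view its contents.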

import pandas as pd

df = pd.read_csv('nba.csv')

print(df.to_string())

to_string() renders the entire DataFrame as a string. Without it, printing a large DataFrame shows only the first 5 rows and the last 5 rows, with the middle replaced by … .

import pandas as pd

df = pd.read_csv('nba.csv')

print(df)

Output:

              Name            Team  Number Position   Age Height  Weight            College     Salary
0    Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas  7730337.0
1      Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette  6796117.0
2     John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University        NaN
3      R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State  1148640.0
4    Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN  5000000.0
..             ...             ...     ...      ...   ...    ...     ...                ...        ...
453   Shelvin Mack       Utah Jazz     8.0       PG  26.0    6-3   203.0             Butler  2433333.0
454      Raul Neto       Utah Jazz    25.0       PG  24.0    6-1   179.0                NaN   900000.0
455   Tibor Pleiss       Utah Jazz    21.0        C  26.0    7-3   256.0                NaN  2900000.0
456    Jeff Withey       Utah Jazz    24.0        C  26.0    7-0   231.0             Kansas   947276.0
457            NaN             NaN     NaN      NaN   NaN    NaN     NaN                NaN        NaN

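We can also use the to_csv() method to save a DataFrame as a CSV file:

import pandas as pd

# three fields: name, site, age
nme = ["Google", "Runoob", "Taobao", "Wiki"]
st = ["www.google.com", "www.runoob.com", "www.taobao.com", "www.wikipedia.org"]
ag = [90, 40, 80, 98]

# build a dict (named data to avoid shadowing the built-in dict)
data = {'name': nme, 'site': st, 'age': ag}

df = pd.DataFrame(data)

# save the DataFrame
df.to_csv('site.csv')

After the code runs, site.csv contains the following (the unnamed first column is the row index):

,name,site,age
0,Google,www.google.com,90
1,Runoob,www.runoob.com,40
2,Taobao,www.taobao.com,80
3,Wiki,www.wikipedia.org,98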
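By default to_csv() also writes the row index as the first column. If that column is not wanted, index=False leaves it out. The sketch below writes to an in-memory buffer instead of a file so that it runs anywhere; the two-row table is an illustrative assumption.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Google", "Runoob"], "age": [90, 40]})

# Write to an in-memory text buffer; index=False omits the row index column
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm that the round trip preserves the data
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```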

Data Processing

head()

head(n) returns the first n rows. If n is omitted, it returns 5 rows.

import pandas as pd

df = pd.read_csv('nba.csv')

print(df.head())

Output:

            Name            Team  Number Position   Age Height  Weight            College     Salary
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas  7730337.0
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette  6796117.0
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University        NaN
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State  1148640.0
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN  5000000.0

import pandas as pd

df = pd.read_csv('nba.csv')

print(df.head(10))

Output:

            Name            Team  Number Position   Age Height  Weight            College      Salary
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas   7730337.0
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette   6796117.0
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University         NaN
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State   1148640.0
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN   5000000.0
5   Amir Johnson  Boston Celtics    90.0       PF  29.0    6-9   240.0                NaN  12000000.0
6  Jordan Mickey  Boston Celtics    55.0       PF  21.0    6-8   235.0                LSU   1170960.0
7   Kelly Olynyk  Boston Celtics    41.0        C  25.0    7-0   238.0            Gonzaga   2165160.0
8   Terry Rozier  Boston Celtics    12.0       PG  22.0    6-2   190.0         Louisville   1824360.0
9   Marcus Smart  Boston Celtics    36.0       PG  22.0    6-4   220.0     Oklahoma State   3431040.0

tail()

tail(n) returns the last n rows. If n is omitted, it returns 5 rows. Every field of an empty row is returned as NaN.

import pandas as pd

df = pd.read_csv('nba.csv')

print(df.tail())

Output:

             Name       Team  Number Position   Age Height  Weight College     Salary
453  Shelvin Mack  Utah Jazz     8.0       PG  26.0    6-3   203.0  Butler  2433333.0
454     Raul Neto  Utah Jazz    25.0       PG  24.0    6-1   179.0     NaN   900000.0
455  Tibor Pleiss  Utah Jazz    21.0        C  26.0    7-3   256.0     NaN  2900000.0
456   Jeff Withey  Utah Jazz    24.0        C  26.0    7-0   231.0  Kansas   947276.0
457           NaN        NaN     NaN      NaN   NaN    NaN     NaN     NaN        NaN

import pandas as pd

df = pd.read_csv('nba.csv')

print(df.tail(10))

Output:

               Name       Team  Number Position   Age Height  Weight   College      Salary
448  Gordon Hayward  Utah Jazz    20.0       SF  26.0    6-8   226.0    Butler  15409570.0
449     Rodney Hood  Utah Jazz     5.0       SG  23.0    6-8   206.0      Duke   1348440.0
450      Joe Ingles  Utah Jazz     2.0       SF  28.0    6-8   226.0       NaN   2050000.0
451   Chris Johnson  Utah Jazz    23.0       SF  26.0    6-6   206.0    Dayton    981348.0
452      Trey Lyles  Utah Jazz    41.0       PF  20.0   6-10   234.0  Kentucky   2239800.0
453    Shelvin Mack  Utah Jazz     8.0       PG  26.0    6-3   203.0    Butler   2433333.0
454       Raul Neto  Utah Jazz    25.0       PG  24.0    6-1   179.0       NaN    900000.0
455    Tibor Pleiss  Utah Jazz    21.0        C  26.0    7-3   256.0       NaN   2900000.0
456     Jeff Withey  Utah Jazz    24.0        C  26.0    7-0   231.0    Kansas    947276.0
457             NaN        NaN     NaN      NaN   NaN    NaN     NaN       NaN         NaN

info()

info() prints basic information about the table:

import pandas as pd

df = pd.read_csv('nba.csv')

df.info()  # info() prints its report directly and returns None

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457          # number of rows: 458, the first row is numbered 0
Data columns (total 9 columns):            # number of columns: 9
 #   Column    Non-Null Count  Dtype       # data type of each column
---  ------    --------------  -----
 0   Name      457 non-null    object
 1   Team      457 non-null    object
 2   Number    457 non-null    float64
 3   Position  457 non-null    object
 4   Age       457 non-null    float64
 5   Height    457 non-null    object
 6   Weight    457 non-null    float64
 7   College   373 non-null    object      # non-null: the number of non-missing values
 8   Salary    446 non-null    float64
dtypes: float64(4), object(5)              # summary of the data types

Non-null means the value is not missing. The report above shows 458 rows in total, and the College field has the most missing values.
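The missing-value counts that info() reports can also be computed directly, which is convenient when they are needed as numbers rather than as printed text. The sketch below uses a tiny made-up table in place of nba.csv; the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# A small stand-in for nba.csv with some missing values (illustrative data)
df = pd.DataFrame({
    "Name": ["A", "B", None],
    "College": ["Texas", None, None],
    "Salary": [1.0, np.nan, 3.0],
})

# isnull() marks missing cells with True; summing counts them per column
missing = df.isnull().sum()
print(missing.sort_values(ascending=False))

# Name of the column with the most missing values
print(missing.idxmax())
```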

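Pandas JSON

JSON (JavaScript Object Notation) is a syntax for storing and exchanging text data, similar to XML.

JSON is generally more compact than XML and easier to parse.

Pandas makes it easy to work with JSON data. This section uses sites.json as the example, with the following contents: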

[
   {
   "id": "A001",
   "name": "菜鸟教程",
   "url": "www.runoob.com",
   "likes": 61
   },
   {
   "id": "A002",
   "name": "Google",
   "url": "www.google.com",
   "likes": 124
   },
   {
   "id": "A003",
   "name": "淘宝",
   "url": "www.taobao.com",
   "likes": 45
   }
]

import pandas as pd

df = pd.read_json('sites.json')

print(df.to_string())

to_string() renders the entire DataFrame as a string. We can also work directly with JSON-style data held in a Python list of dicts:

import pandas as pd

data =[
    {
      "id": "A001",
      "name": "菜鸟教程",
      "url": "www.runoob.com",
      "likes": 61
    },
    {
      "id": "A002",
      "name": "Google",
      "url": "www.google.com",
      "likes": 124
    },
    {
      "id": "A003",
      "name": "淘宝",
      "url": "www.taobao.com",
      "likes": 45
    }
]
df = pd.DataFrame(data)

print(df)

Output:

     id    name             url  likes
0  A001    菜鸟教程  www.runoob.com     61
1  A002  Google  www.google.com    124
2  A003      淘宝  www.taobao.com     45

A JSON object has the same shape as a Python dict, so a Python dict can be converted directly into a DataFrame:

import pandas as pd

# JSON in dict form
s = {
    "col1": {"row1": 1, "row2": 2, "row3": 3},
    "col2": {"row1": "x", "row2": "y", "row3": "z"}
}

# convert the dict to a DataFrame
df = pd.DataFrame(s)
print(df)

Output:

      col1 col2
row1     1    x
row2     2    y
row3     3    z

Reading JSON data from a URL:

import pandas as pd

URL = 'https://static.runoob.com/download/sites.json'
df = pd.read_json(URL)
print(df)

Output:

     id    name             url  likes
0  A001    菜鸟教程  www.runoob.com     61
1  A002  Google  www.google.com    124
2  A003      淘宝  www.taobao.com     45
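read_json() also accepts a file-like object, which is useful when the JSON text is already in memory (for example, the body of an HTTP response). The sketch below wraps a JSON string in io.StringIO; the two records are illustrative assumptions modeled on sites.json.

```python
import io

import pandas as pd

# JSON text already held in memory (illustrative records)
json_text = """
[
  {"id": "A001", "name": "Google", "likes": 124},
  {"id": "A002", "name": "Runoob", "likes": 61}
]
"""

# Wrap the string in a file-like object before handing it to read_json
df = pd.read_json(io.StringIO(json_text))
print(df)
```

Wrapping the text in StringIO avoids passing a raw JSON string to read_json(), which recent pandas versions warn against.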

Nested JSON Data

Suppose we have a file of nested JSON data, nested_list.json:

{
    "school_name": "ABC primary school",
    "class": "Year 1",
    "students": [
    {
        "id": "A001",
        "name": "Tom",
        "math": 60,
        "physics": 66,
        "chemistry": 61
    },
    {
        "id": "A002",
        "name": "James",
        "math": 89,
        "physics": 76,
        "chemistry": 51
    },
    {
        "id": "A003",
        "name": "Jenny",
        "math": 79,
        "physics": 90,
        "chemistry": 78
    }]
}

The following code loads the complete contents:

import pandas as pd

df = pd.read_json('nested_list.json')

print(df)

Output:

          school_name   class                                           students
0  ABC primary school  Year 1  {'id': 'A001', 'name': 'Tom', 'math': 60, 'phy...
1  ABC primary school  Year 1  {'id': 'A002', 'name': 'James', 'math': 89, 'p...
2  ABC primary school  Year 1  {'id': 'A003', 'name': 'Jenny', 'math': 79, 'p...

The students column still holds unparsed dicts, so we need the json_normalize() function to fully expand the nested data:

import pandas as pd
import json

# load the data with Python's json module
with open('nested_list.json','r') as f:
    data = json.loads(f.read())

# flatten the data
df_nested_list = pd.json_normalize(data, record_path =['students'])
print(df_nested_list)

Output:

     id   name  math  physics  chemistry
0  A001    Tom    60       66         61
1  A002  James    89       76         51
2  A003  Jenny    79       90         78

data = json.loads(f.read()) loads the data with Python's json module.

json_normalize() is called with the record_path parameter set to ['students'], which expands the nested students list.

The result does not yet contain the school_name and class elements. To include them, pass these metadata fields through the meta parameter:

import pandas as pd
import json

# load the data with Python's json module
with open('nested_list.json','r') as f:
    data = json.loads(f.read())

# flatten the data
df_nested_list = pd.json_normalize(
    data,
    record_path =['students'],
    meta=['school_name', 'class']
)
print(df_nested_list)

Output:

     id   name  math  physics  chemistry         school_name   class
0  A001    Tom    60       66         61  ABC primary school  Year 1
1  A002  James    89       76         51  ABC primary school  Year 1
2  A003  Jenny    79       90         78  ABC primary school  Year 1

Next, let us read more complex JSON data that nests both lists and dicts. The data file nested_mix.json is as follows:

{
    "school_name": "local primary school",
    "class": "Year 1",
    "info": {
      "president": "John Kasich",
      "address": "ABC road, London, UK",
      "contacts": {
        "email": "admin@e.com",
        "tel": "123456789"
      }
    },
    "students": [
    {
        "id": "A001",
        "name": "Tom",
        "math": 60,
        "physics": 66,
        "chemistry": 61
    },
    {
        "id": "A002",
        "name": "James",
        "math": 89,
        "physics": 76,
        "chemistry": 51
    },
    {
        "id": "A003",
        "name": "Jenny",
        "math": 79,
        "physics": 90,
        "chemistry": 78
    }]
}

Converting nested_mix.json to a DataFrame:

import pandas as pd
import json

# load the data with Python's json module
with open('nested_mix.json','r') as f:
    data = json.loads(f.read())

df = pd.json_normalize(
    data,
    record_path =['students'],
    meta=[
        'class',
        ['info', 'president'],
        ['info', 'contacts', 'tel']
    ]
)

print(df)

Output:

     id   name  math  physics  chemistry   class info.president info.contacts.tel
0  A001    Tom    60       66         61  Year 1    John Kasich         123456789
1  A002  James    89       76         51  Year 1    John Kasich         123456789
2  A003  Jenny    79       90         78  Year 1    John Kasich         123456789

Reading One Field from Nested Data

The example file nested_deep.json is shown below. We only want to read the nested math field:

{
    "school_name": "local primary school",
    "class": "Year 1",
    "students": [
    {
        "id": "A001",
        "name": "Tom",
        "grade": {
            "math": 60,
            "physics": 66,
            "chemistry": 61
        }

    },
    {
        "id": "A002",
        "name": "James",
        "grade": {
            "math": 89,
            "physics": 76,
            "chemistry": 51
        }

    },
    {
        "id": "A003",
        "name": "Jenny",
        "grade": {
            "math": 79,
            "physics": 90,
            "chemistry": 78
        }
    }]
}

Here we use the glom module to handle the nesting. glom lets us use . to access the attributes of nested objects.

glom has to be installed before its first use:

pip3 install glom

import pandas as pd
from glom import glom

df = pd.read_json('nested_deep.json')

data = df['students'].apply(lambda row: glom(row, 'grade.math'))
print(data)

Output:

0    60
1    89
2    79
Name: students, dtype: int64
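The same field can also be reached without an extra dependency, because json_normalize() flattens nested dicts inside each record into dotted column names. This is an alternative sketch, not the approach used above; the data is an inline, shortened copy of nested_deep.json so that the example runs without the file.

```python
import pandas as pd

# Inline, shortened copy of nested_deep.json
data = {
    "school_name": "local primary school",
    "class": "Year 1",
    "students": [
        {"id": "A001", "name": "Tom", "grade": {"math": 60, "physics": 66}},
        {"id": "A002", "name": "James", "grade": {"math": 89, "physics": 76}},
        {"id": "A003", "name": "Jenny", "grade": {"math": 79, "physics": 90}},
    ],
}

# Nested dicts inside each student record become columns such as "grade.math"
df = pd.json_normalize(data, record_path=["students"])
print(df.columns.tolist())
print(df["grade.math"])
```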

Common Operations

I. Creating a data table

1. First import pandas. NumPy is usually needed as well, so import it too:

import numpy as np
import pandas as pd

2. Import a CSV or xlsx file:

df = pd.read_csv('name.csv', header=1)  # header=1 uses the second row of the file as the column names
df = pd.read_excel('name.xlsx')

3. Create a data table with pandas:

df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   "date": pd.date_range('20130102', periods=6),
                   "city": ['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
                   "age": [23, 44, 54, 32, 34, 32],
                   "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
                   "price": [1200, np.nan, 2133, 5433, np.nan, 4432]},
                  columns=['id', 'date', 'city', 'category', 'age', 'price'])

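II. Inspecting the data table

1. View the dimensions:

df.shape

2. Basic information about the table (dimensions, column names, data types, memory usage, and so on):

df.info()

3. Data type of every column:

df.dtypes

4. Data type of a single column:

df['city'].dtype

5. Check for missing values:

df.isnull()

6. Check for missing values in a single column:

df['price'].isnull()

7. View the unique values of a column:

df['city'].unique()

8. View the values of the table:

df.values

9. View the column names:

df.columns

10. View the first 10 and the last 10 rows:

df.head(10)  # head() returns 5 rows when called without an argument
df.tail(10)  # tail() returns 5 rows when called without an argument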

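III. Cleaning the data table

The methods below return new objects and leave df unchanged unless the result is assigned back.

1. Fill missing values with 0:

df.fillna(value=0)

2. Fill NA values with the mean of the price column:

df['price'].fillna(df['price'].mean())

3. Strip whitespace from the city field:

df['city'] = df['city'].map(str.strip)

4. Convert case:

df['city'] = df['city'].str.lower()

5. Change the data type:

df['price'].fillna(0).astype('int')  # astype('int') raises an error if the column still contains NaN

6. Rename a column:

df.rename(columns={'category': 'category-size'})

7. Drop duplicates, keeping the first occurrence:

df['city'].drop_duplicates()

8. Drop duplicates, keeping the last occurrence:

df['city'].drop_duplicates(keep='last')

9. Replace values:

df['city'].replace('sh', 'shanghai')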
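The cleaning steps can be chained into one short pipeline. The sketch below is one possible ordering, applied to the city and price columns of the sample table; keeping the original frame untouched and working on a copy is a choice made here, not something the steps above require.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Beijing ", "SH", " guangzhou ", "Shenzhen", "shanghai", "BEIJING "],
    "price": [1200, np.nan, 2133, 5433, np.nan, 4432],
})

cleaned = df.copy()  # work on a copy so the raw data stays available

# Strip whitespace, lower-case, then expand the abbreviation "sh"
cleaned["city"] = (
    cleaned["city"].str.strip().str.lower().replace("sh", "shanghai")
)

# Fill missing prices with the column mean, then convert to integers
cleaned["price"] = (
    cleaned["price"].fillna(cleaned["price"].mean()).astype("int")
)

print(cleaned)
```

Series.replace() matches whole values, so the existing "shanghai" entry is left as it is.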

IV. Preprocessing the data

df1=pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006,1007,1008], 
"gender":['male','female','male','female','male','female','male','female'],
"pay":['Y','N','Y','Y','N','Y','N','Y',],
"m-point":[10,12,20,40,40,40,30,20]})

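1. Merging data tables

1.1 merge

df_inner = pd.merge(df, df1, how='inner')  # matching rows only (intersection of the keys)
df_left = pd.merge(df, df1, how='left')    # keep every row of df
df_right = pd.merge(df, df1, how='right')  # keep every row of df1
df_outer = pd.merge(df, df1, how='outer')  # keep every row of both (union of the keys)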
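The effect of the how argument is easiest to see on two small tables whose keys only partly overlap. The sketch below uses made-up left and right tables; the indicator=True option adds a _merge column that records where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"id": [1001, 1002, 1003],
                     "city": ["beijing", "shanghai", "guangzhou"]})
right = pd.DataFrame({"id": [1002, 1003, 1004],
                      "gender": ["female", "male", "female"]})

inner = pd.merge(left, right, on="id", how="inner")    # ids present in both
left_join = pd.merge(left, right, on="id", how="left")  # every id from left
outer = pd.merge(left, right, on="id", how="outer", indicator=True)

print(len(inner), len(left_join), len(outer))
print(outer)
```

Rows that exist on only one side get NaN in the columns that come from the other table.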

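1.2 append

DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0. Use pd.concat() to stack the rows of two tables (df2 stands for any second DataFrame):

result = pd.concat([df1, df2])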

1.3 join

result = left.join(right, on='key')

1.4 concat

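pd.concat(objs, axis=0, join='outer', ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)

objs: a sequence or mapping of Series or DataFrame objects. If a dict is passed, its sorted keys are used as the keys argument unless keys is passed explicitly, in which case the corresponding values are selected. Any None objects are dropped silently unless all of them are None, in which case a ValueError is raised.
axis: {0, 1, …}, default 0. The axis to concatenate along.
join: {'inner', 'outer'}, default 'outer'. How to handle the indexes on the other axis: 'outer' takes their union, 'inner' their intersection.
ignore_index: bool, default False. If True, the index values on the concatenation axis are not used and the resulting axis is labeled 0, …, n-1. This is useful when the concatenation axis carries no meaningful index information. Index values on the other axes are still respected in the join.
join_axes: list of Index objects. Specific indexes to use for the other n-1 axes instead of the inner/outer set logic. This parameter was removed in pandas 1.0, so it no longer appears in the signature above.
keys: sequence, default None. Builds a hierarchical index with the passed keys as the outermost level. If multiple levels are passed, the keys should be tuples.
levels: list of sequences, default None. Specific levels (unique values) to use for building a MultiIndex. Otherwise they are inferred from the keys.
names: list, default None. Names for the levels of the resulting hierarchical index.
verify_integrity: bool, default False. Checks whether the new concatenated axis contains duplicates. This can be very expensive relative to the concatenation itself.
copy: bool, default True. If False, data is not copied unnecessarily.

Example: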

frames = [df1, df2, df3] 
result = pd.concat(frames)
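Two of the parameters, ignore_index and keys, control what happens to the row labels of the pieces. The sketch below uses two small made-up tables to show the default behavior next to both options.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
df2 = pd.DataFrame({"id": [3, 4], "value": ["c", "d"]})

# Default: each piece keeps its own row labels, so labels repeat
stacked = pd.concat([df1, df2])

# ignore_index=True relabels the rows 0..n-1
renumbered = pd.concat([df1, df2], ignore_index=True)

# keys adds an outer index level that records which piece a row came from
keyed = pd.concat([df1, df2], keys=["first", "second"])

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(keyed.loc["second"])
```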

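2. Set the index column:

df_inner.set_index('id')

3. Sort by the values of a column:

df_inner.sort_values(by=['age'])

4. Sort by the index:

df_inner.sort_index()

5. If the value in the price column is > 3000, the group column shows high, otherwise low:

df_inner['group'] = np.where(df_inner['price'] > 3000, 'high', 'low')

6. Flag the rows that satisfy several conditions at once:

df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price'] >= 4000), 'sign'] = 1

7. Split the values of the category field on "-" and build a data table that uses the index of df_inner, with the columns category_code and size (the first part gets a new name so that it does not clash with the existing category column in the next step):

split = pd.DataFrame((x.split('-') for x in df_inner['category']), index=df_inner.index, columns=['category_code', 'size'])

8. Merge the split table back into the original df_inner table:

df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)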
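The split-and-merge in steps 7 and 8 can also be written with the vectorized string method str.split(expand=True). The sketch below is an alternative formulation; the name category_code for the first part is an assumption chosen to avoid clashing with the existing category column.

```python
import pandas as pd

df = pd.DataFrame({"category": ["100-A", "100-B", "110-A"]})

# expand=True returns one column per piece instead of a column of lists
parts = df["category"].str.split("-", expand=True)
parts.columns = ["category_code", "size"]

# join aligns on the shared row index, like the index-based merge above
result = df.join(parts)
print(result)
```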

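V. Extracting data

The two main indexers are loc and iloc: loc extracts by label and iloc extracts by position. The older ix indexer, which mixed labels and positions, was deprecated in pandas 0.20 and removed in 1.0.

1. Extract a single row by index label:

df_inner.loc[3]

2. Extract a range of rows by position:

df_inner.iloc[0:5]

3. Reset the index:

df_inner.reset_index()

4. Set the date as the index:

df_inner = df_inner.set_index('date')

5. Extract all data up to and including the 4th:

df_inner.loc[:'2013-01-04']

6. Extract a region by position with iloc:

df_inner.iloc[:3, :2]  # the numbers around the colons are positions, not index labels; counting from 0, this is the first three rows and first two columns

7. Extract individual rows and columns by position with iloc:

df_inner.iloc[[0, 2, 5], [4, 5]]  # rows 0, 2, 5 and columns 4, 5

8. Combine label-based and position-based extraction (this replaces the removed ix indexer):

df_inner.loc[:'2013-01-03'].iloc[:, :4]  # rows up to and including 2013-01-03, first four columns

9. Test whether the values in the city column are beijing:

df_inner['city'].isin(['beijing'])

10. Test whether the city column contains beijing or shanghai, then extract the matching rows:

df_inner.loc[df_inner['city'].isin(['beijing', 'shanghai'])]

11. Extract the first three characters and build a data table:

pd.DataFrame(df_inner['category'].str[:3])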
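Label-based and position-based selection on a date index can be sketched as follows. The table is a made-up stand-in for df_inner with the date column already set as the index.

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004, 1005, 1006],
     "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
     "price": [1200, 3299, 2133, 5433, 3299, 4432]},
    index=pd.date_range("2013-01-02", periods=6),
)

# Label slice on the date index: the end date is included
up_to_4th = df.loc[:"2013-01-04"]

# Rows by label, then columns by position (what ix used to do in one call)
mixed = df.loc[:"2013-01-03"].iloc[:, :2]

# Purely positional selection of specific rows and columns
picked = df.iloc[[0, 2, 5], [1, 2]]

print(up_to_4th.shape, mixed.shape, picked.shape)
```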

VI. Filtering data

Use the AND, OR, and NOT conditions together with greater than, less than, and equal to filter the data, then count and sum the results.

1. Filter with AND:

df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']]

2. Filter with OR:

df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['age'])

3. Filter with NOT:

df_inner.loc[(df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['id'])

4. Count the filtered data by the city column:

df_inner.loc[(df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['id']).city.count()

5. Filter with the query function:

df_inner.query('city == ["beijing", "shanghai"]')

6. Sum the price of the filtered result:

df_inner.query('city == ["beijing", "shanghai"]').price.sum()
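The filters can be tried on a small self-contained table. The sketch below uses made-up data; each comparison is wrapped in parentheses because & and | bind more tightly than the comparison operators.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["beijing", "shanghai", "guangzhou", "beijing"],
    "age": [23, 44, 54, 32],
    "price": [1200, 3299, 2133, 4432],
})

both = df[(df["age"] > 25) & (df["city"] == "beijing")]    # AND
either = df[(df["age"] > 25) | (df["city"] == "beijing")]  # OR
negated = df[~(df["city"] == "beijing")]                   # NOT

# query() with a list on the right-hand side behaves like isin()
total = df.query('city == ["beijing", "shanghai"]')["price"].sum()

print(both.index.tolist(), len(either), negated.index.tolist(), total)
```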

VII. Summarizing data

The main functions are groupby and pivot_table.

1. Count all columns by group:

df_inner.groupby('city').count()

2. Count the id field by city:

df_inner.groupby('city')['id'].count()

3. Count by two fields:

df_inner.groupby(['city', 'size'])['id'].count()

4. Group by the city field and compute the row count, total, and mean of price:

df_inner.groupby('city')['price'].agg(['size', 'sum', 'mean'])
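groupby and pivot_table are mentioned together, but only groupby is demonstrated. The sketch below shows both on a made-up table: agg() with several statistics, and a pivot table that spreads one grouping field across the columns.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["beijing", "beijing", "shanghai", "shanghai"],
    "size": ["A", "B", "A", "A"],
    "price": [1200.0, 4432.0, 3000.0, 1000.0],
})

# Several statistics per city in one call
summary = df.groupby("city")["price"].agg(["count", "sum", "mean"])
print(summary)

# One row per city, one column per size, summed price in the cells
table = pd.pivot_table(df, index="city", columns="size",
                       values="price", aggfunc="sum")
print(table)
```

Combinations that do not occur in the data, such as shanghai with size B, appear as NaN in the pivot table.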

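VIII. Statistics

Sampling data and computing the standard deviation, covariance, and correlation coefficient.

1. Simple random sampling:

df_inner.sample(n=3)

2. Set the sampling weights manually:

weights = [0, 0, 0, 0, 0.5, 0.5]
df_inner.sample(n=2, weights=weights)

3. Sampling without replacement:

df_inner.sample(n=6, replace=False)

4. Sampling with replacement:

df_inner.sample(n=6, replace=True)

5. Descriptive statistics of the table:

df_inner.describe().round(2).T  # round sets the number of decimal places shown, T transposes the result

6. Standard deviation of a column:

df_inner['price'].std()

7. Covariance between two fields:

df_inner['price'].cov(df_inner['m-point'])

8. Covariance between all numeric fields of the table:

df_inner.select_dtypes('number').cov()  # restrict to numeric columns; text columns cannot be used here

9. Correlation between two fields:

df_inner['price'].corr(df_inner['m-point'])  # the coefficient lies between -1 and 1: near 1 is positive correlation, near -1 is negative correlation, 0 is no linear correlation

10. Correlation between all numeric fields of the table:

df_inner.select_dtypes('number').corr()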
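A quick sanity check for covariance and correlation is a pair of columns with an exact linear relationship. In the made-up table below m_point is always twice price, so the correlation coefficient is 1.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["a", "b", "c", "d"],
    "price": [1.0, 2.0, 3.0, 4.0],
    "m_point": [2.0, 4.0, 6.0, 8.0],   # exactly 2 * price
})

r = df["price"].corr(df["m_point"])    # Pearson correlation coefficient
cov = df["price"].cov(df["m_point"])   # sample covariance

# Keep only numeric columns before computing the full matrices
numeric = df.select_dtypes("number")
corr_matrix = numeric.corr()

print(r, cov)
print(corr_matrix)
```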

IX. Exporting data

The analyzed data can be written out in xlsx or csv format.

1. Write to Excel:

df_inner.to_excel('excel_to_python.xlsx', sheet_name='bluewhale_cc')

2. Write to CSV:

df_inner.to_csv('excel_to_python.csv')

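Summary of 100 Pandas Functions

Statistical summary functions

Function	Meaning
min()	Minimum
max()	Maximum
sum()	Sum
mean()	Mean
count()	Count (number of non-missing elements)
size	Count (number of all elements; a property on Series, a method on GroupBy objects)
median()	Median
var()	Variance
std()	Standard deviation
quantile()	Arbitrary quantile
cov()	Covariance
corr()	Correlation coefficient
skew()	Skewness
kurt()	Kurtosis
mode()	Mode
describe()	Descriptive statistics (returns several statistics at once)
groupby()	Grouping
aggregate()	Aggregation (custom statistic functions can be supplied)
argmin()	Position of the minimum
argmax()	Position of the maximum
any()	Logical OR over the elements
all()	Logical AND over the elements
value_counts()	Frequency counts
cumsum()	Cumulative sum
cumprod()	Cumulative product
pct_change()	Fractional change between each element and the previous one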

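Data cleaning functions

Function	Meaning
duplicated()	Test whether each element is a duplicate
drop_duplicates()	Remove duplicate values
hasnans	Test whether the Series contains any missing value (a property; returns True or False)
isnull()	Test whether each element is missing (returns Boolean values of the same length as the Series)
notnull()	Test whether each element is not missing (returns Boolean values of the same length as the Series)
dropna()	Remove missing values
fillna()	Fill missing values
ffill()	Forward fill (fill a missing value with the previous element)
bfill()	Backward fill (fill a missing value with the next element)
dtypes	Check the data types (a property)
astype()	Type conversion
pd.to_datetime()	Convert to datetime
factorize()	Encode values as integer codes
sample()	Sampling
where()	Conditional value replacement
replace()	Replace by value (regular expressions are supported with regex=True)
str.replace()	Replace within strings (regular expressions are supported with regex=True)
str.split()	Split strings on a delimiter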

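Data filtering functions

Function	Meaning
isin()	Membership test
between()	Range test
loc	Selection by label or Boolean condition (an indexer used with square brackets; also works on DataFrames)
iloc	Selection by position (an indexer used with square brackets; also works on DataFrames)
compress()	Conditional selection (removed in pandas 1.0; use Boolean indexing instead)
nlargest()	Find the n largest elements
nsmallest()	Find the n smallest elements
str.findall()	Substring search (regular expressions supported)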

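Plotting and element-wise functions

Function	Meaning
hist()	Draw a histogram
plot()	Draw other charts depending on the kind parameter (pie chart, line chart, box plot, and so on)
map()	Element mapping
apply()	Element-wise operation based on a custom function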

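Time series functions

Except for dt.day_name() and dt.isocalendar(), the entries below are properties and are written without parentheses.

Function	Meaning
dt.date	Extract the date
dt.time	Extract the time (hours, minutes, seconds)
dt.year	Extract the year
dt.month	Extract the month
dt.day	Extract the day
dt.hour	Extract the hour
dt.minute	Extract the minute
dt.second	Extract the second
dt.quarter	Extract the quarter
dt.weekday	Extract the day of the week as a number (Monday is 0)
dt.day_name()	Extract the day of the week as a string (replaces dt.weekday_name, which was removed in pandas 1.0)
dt.isocalendar().week	Extract the ISO week of the year (replaces dt.week, which was removed in pandas 2.0)
dt.dayofyear	Extract the day of the year
dt.daysinmonth	Number of days in the month
dt.is_month_start	Test whether the date is the first day of the month
dt.is_month_end	Test whether the date is the last day of the month
dt.is_quarter_start	Test whether the date is the first day of the quarter
dt.is_quarter_end	Test whether the date is the last day of the quarter
dt.is_year_start	Test whether the date is the first day of the year
dt.is_year_end	Test whether the date is the last day of the year
dt.is_leap_year	Test whether the date falls in a leap year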
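A few of the time series entries can be tried on a two-element Series. The dates below are arbitrary examples chosen to include a leap day and a year end.

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2020-02-29 13:45:10", "2021-12-31 08:00:00"]))

years = s.dt.year.tolist()             # plain property access, no parentheses
months = s.dt.month.tolist()
names = s.dt.day_name().tolist()       # day_name() is a method
weeks = s.dt.isocalendar().week.tolist()  # ISO week number
leap = s.dt.is_leap_year.tolist()
year_end = s.dt.is_year_end.tolist()

print(years, months, names, weeks, leap, year_end)
```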

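Other functions

Function	Meaning
append()	Append another Series (removed in pandas 2.0; use pd.concat() instead)
diff()	First-order difference
round()	Round the elements
sort_values()	Sort by value
sort_index()	Sort by index
to_dict()	Convert to a dict
tolist()	Convert to a list
unique()	Unique elements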

90 Pandas Examples

1. How to create a Series from a list or a dict

Creating a Series from a list

import pandas as pd

ser1 = pd.Series([1.5, 2.5, 3, 4.5, 5.0, 6])
print(ser1)

Output:

0    1.5
1    2.5
2    3.0
3    4.5
4    5.0
5    6.0
dtype: float64

Creating a Series with the name parameter

import pandas as pd

ser2 = pd.Series(["India", "Canada", "Germany"], name="Countries")
print(ser2)

Output:

0      India
1     Canada
2    Germany
Name: Countries, dtype: object

Creating a Series from a list built with the repetition shorthand

import pandas as pd

ser3 = pd.Series(["A"]*4)
print(ser3)

Output:

0    A
1    A
2    A
3    A
dtype: object

Creating a Series from a dict

import pandas as pd

ser4 = pd.Series({"India": "New Delhi",
                  "Japan": "Tokyo",
                  "UK": "London"})
print(ser4)

Output:

India    New Delhi
Japan        Tokyo
UK          London
dtype: object

2. How to create a Series with NumPy functions

import pandas as pd
import numpy as np

ser1 = pd.Series(np.linspace(1, 10, 5))
print(ser1)

ser2 = pd.Series(np.random.normal(size=5))
print(ser2)

Output:

0     1.00
1     3.25
2     5.50
3     7.75
4    10.00
dtype: float64
0   -1.694452
1   -1.570006
2    1.713794
3    0.338292
4    0.803511
dtype: float64

3. How to get the index and values of a Series

import pandas as pd
import numpy as np

ser1 = pd.Series({"India": "New Delhi",
                  "Japan": "Tokyo",
                  "UK": "London"})

print(ser1.values)
print(ser1.index)

print("\n")

ser2 = pd.Series(np.random.normal(size=5))
print(ser2.index)
print(ser2.values)

Output:

['New Delhi' 'Tokyo' 'London']
Index(['India', 'Japan', 'UK'], dtype='object')


RangeIndex(start=0, stop=5, step=1)
[ 0.66265478 -0.72222211  0.3608642   1.40955436  1.3096732 ]

4. How to specify the index when creating a Series

import pandas as pd

values = ["India", "Canada", "Australia",
          "Japan", "Germany", "France"]

code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]

ser1 = pd.Series(values, index=code)

print(ser1)

Output:

IND        India
CAN       Canada
AUS    Australia
JAP        Japan
GER      Germany
FRA       France
dtype: object

5. How to get the size and shape of a Series

import pandas as pd

values = ["India", "Canada", "Australia",
          "Japan", "Germany", "France"]

code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]

ser1 = pd.Series(values, index=code)

print(len(ser1))

print(ser1.shape)

print(ser1.size)

Output:

6
(6,)
6

6. How to get the first or last few rows of a Series

Head()

import pandas as pd

values = ["India", "Canada", "Australia",
          "Japan", "Germany", "France"]

code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]

ser1 = pd.Series(values, index=code)

print("-----Head()-----")
print(ser1.head())

print("\n\n-----Head(2)-----")
print(ser1.head(2))

Output:

-----Head()-----
IND        India
CAN       Canada
AUS    Australia
JAP        Japan
GER      Germany
dtype: object


-----Head(2)-----
IND     India
CAN    Canada
dtype: object

Tail()

import pandas as pd

values = ["India", "Canada", "Australia",
          "Japan", "Germany", "France"]

code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]

ser1 = pd.Series(values, index=code)

print("-----Tail()-----")
print(ser1.tail())

print("\n\n-----Tail(2)-----")
print(ser1.tail(2))

Output:

-----Tail()-----
CAN       Canada
AUS    Australia
JAP        Japan
GER      Germany
FRA       France
dtype: object


-----Tail(2)-----
GER    Germany
FRA     France
dtype: object

Take()

import pandas as pd

values = ["India", "Canada", "Australia",
          "Japan", "Germany", "France"]

code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]

ser1 = pd.Series(values, index=code)

print("-----Take()-----")
print(ser1.take([2, 4, 5]))

Output:

-----Take()-----
AUS    Australia
GER      Germany
FRA       France
dtype: object

7. Getting a subset of a Series by slicing

import pandas as pd

num = [0, 100, 200, 300, 400, 500, 600, 700, 800, 900]

idx = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']

series = pd.Series(num, index=idx)

print("\n [2:4] \n")
print(series[2:4])

print("\n [1:6:2] \n")
print(series[1:6:2])

print("\n [:6] \n")
print(series[:6])

print("\n [4:] \n")
print(series[4:])

print("\n [:4:2] \n")
print(series[:4:2])

print("\n [4::2] \n")
print(series[4::2])

print("\n [::-1] \n")
print(series[::-1])

Output:

 [2:4]

C    200
D    300
dtype: int64

 [1:6:2]

B    100
D    300
F    500
dtype: int64

 [:6]

A      0
B    100
C    200
D    300
E    400
F    500
dtype: int64

 [4:]

E    400
F    500
G    600
H    700
I    800
J    900
dtype: int64

 [:4:2]

A      0
C    200
dtype: int64

 [4::2]

E    400
G    600
I    800
dtype: int64

 [::-1]

J    900
I    800
H    700
G    600
F    500
E    400
D    300
C    200
B    100
A      0
dtype: int64
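The integer slices above are positional and exclude the end position. Slicing by label with loc behaves differently: the end label is included. The short sketch below contrasts the two on a smaller Series with the same kind of letter index.

```python
import pandas as pd

series = pd.Series([0, 100, 200, 300, 400], index=list("ABCDE"))

by_position = series.iloc[2:4]     # positions 2 and 3; position 4 is excluded
by_label = series.loc["C":"E"]     # labels C through E; E is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```

Writing iloc or loc explicitly makes it clear whether a slice is meant positionally or by label.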

8. How to create a DataFrame

import pandas as pd

employees = pd.DataFrame({
    'EmpCode': ['Emp001', 'Emp00'],
    'Name': ['John Doe', 'William Spark'],
    'Occupation': ['Chemist', 'Statistician'],
    'Date Of Join': ['2018-01-25', '2018-01-26'],
    'Age': [23, 24]})

print(employees)

Output:

  EmpCode           Name    Occupation Date Of Join  Age
0  Emp001       John Doe       Chemist   2018-01-25   23
1   Emp00  William Spark  Statistician   2018-01-26   24

9. How to set the index and column information of a DataFrame

import pandas as pd

employees = pd.DataFrame(
    data={'Name': ['John Doe', 'William Spark'],
          'Occupation': ['Chemist', 'Statistician'],
          'Date Of Join': ['2018-01-25', '2018-01-26'],
          'Age': [23, 24]},
    index=['Emp001', 'Emp002'],
    columns=['Name', 'Occupation', 'Date Of Join', 'Age'])

print(employees)

Output

                 Name    Occupation Date Of Join  Age
Emp001       John Doe       Chemist   2018-01-25   23
Emp002  William Spark  Statistician   2018-01-26   24

10. How to rename the columns of a DataFrame

import pandas as pd

employees = pd.DataFrame({
    'EmpCode': ['Emp001', 'Emp00'],
    'Name': ['John Doe', 'William Spark'],
    'Occupation': ['Chemist', 'Statistician'],
    'Date Of Join': ['2018-01-25', '2018-01-26'],
    'Age': [23, 24]})

employees.columns = ['EmpCode', 'EmpName', 'EmpOccupation', 'EmpDOJ', 'EmpAge']

print(employees)

Output:

  EmpCode        EmpName EmpOccupation      EmpDOJ  EmpAge
0  Emp001       John Doe       Chemist  2018-01-25      23
1   Emp00  William Spark  Statistician  2018-01-26      24

11. How to select or filter rows of a DataFrame based on the values in a column

import pandas as pd

employees = pd.DataFrame({
    'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
    'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
    'Occupation': ['Chemist', 'Statistician', 'Statistician',
                   'Statistician', 'Programmer'],
    'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
                     '2018-03-16'],
    'Age': [23, 24, 34, 29, 40]})

print("\nUse == operator\n")
print(employees.loc[employees['Age'] == 23])

print("\nUse < operator\n")
print(employees.loc[employees['Age'] < 30])

print("\nUse != operator\n")
print(employees.loc[employees['Occupation'] != 'Statistician'])

print("\nMultiple Conditions\n")
print(employees.loc[(employees['Occupation'] != 'Statistician') &
                    (employees['Name'] == 'John')])

Output:

Use == operator

  EmpCode  Name Occupation Date Of Join  Age
0  Emp001  John    Chemist   2018-01-25   23

Use < operator

  EmpCode   Name    Occupation Date Of Join  Age
0  Emp001   John       Chemist   2018-01-25   23
1  Emp002    Doe  Statistician   2018-01-26   24
3  Emp004  Spark  Statistician   2018-02-26   29

Use != operator

  EmpCode  Name  Occupation Date Of Join  Age
0  Emp001  John     Chemist   2018-01-25   23
4  Emp005  Mark  Programmer   2018-03-16   40

Multiple Conditions

  EmpCode  Name Occupation Date Of Join  Age
0  Emp001  John    Chemist   2018-01-25   23

12 Filtering multiple rows of a DataFrame with "isin"

import pandas as pd

employees = pd.DataFrame({
    'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
    'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
    'Occupation': ['Chemist', 'Statistician', 'Statistician',
                   'Statistician', 'Programmer'],
    'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
                     '2018-03-16'],
    'Age': [23, 24, 34, 29, 40]})

print("\nUse isin operator\n")
print(employees.loc[employees['Occupation'].isin(['Chemist','Programmer'])])

print("\nMultiple Conditions\n")
# & binds more tightly than |, so the extra parentheses only make the
# grouping explicit
print(employees.loc[(employees['Occupation'] == 'Chemist') |
                    ((employees['Name'] == 'John') &
                     (employees['Age'] < 30))])

Output:

Use isin operator

  EmpCode  Name  Occupation Date Of Join  Age
0  Emp001  John     Chemist   2018-01-25   23
4  Emp005  Mark  Programmer   2018-03-16   40

Multiple Conditions

  EmpCode  Name Occupation Date Of Join  Age
0  Emp001  John    Chemist   2018-01-25   23
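
To select the rows that are not in a list of values, the boolean mask returned by isin can be inverted with the ~ operator. A minimal sketch, using a shortened employees table and the illustrative names mask and others:

```python
import pandas as pd

employees = pd.DataFrame({
    'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
    'Occupation': ['Chemist', 'Statistician', 'Statistician',
                   'Statistician', 'Programmer']})

# True for rows whose Occupation is one of the listed values
mask = employees['Occupation'].isin(['Chemist', 'Programmer'])

# ~ flips the mask, keeping every row that is NOT in the list
others = employees.loc[~mask]

print(others['Name'].tolist())  # ['Doe', 'William', 'Spark']
```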

13 Iterating over the rows and columns of a DataFrame

import pandas as pd

employees = pd.DataFrame({
    'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
    'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
    'Occupation': ['Chemist', 'Statistician', 'Statistician',
                   'Statistician', 'Programmer'],
    'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
                     '2018-03-16'],
    'Age': [23, 24, 34, 29, 40]})

print("\n Example iterrows \n")
for index, row in employees.iterrows():
    print(row['Name'], "--", row['Age'])


print("\n Example itertuples \n")
for row in employees.itertuples(index=True, name='Pandas'):
    print(getattr(row, "Name"), "--", getattr(row, "Age"))

Output:

 Example iterrows

John -- 23
Doe -- 24
William -- 34
Spark -- 29
Mark -- 40

 Example itertuples

John -- 23
Doe -- 24
William -- 34
Spark -- 29
Mark -- 40

14 How to drop DataFrame columns by name or index

import pandas as pd

employees = pd.DataFrame({
    'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
    'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
    'Occupation': ['Chemist', 'Statistician', 'Statistician',
                   'Statistician', 'Programmer'],
    'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
                     '2018-03-16'],
    'Age': [23, 24, 34, 29, 40]})

print(employees)

print("\n Drop Column by Name \n")
employees.drop('Age', axis=1, inplace=True)
print(employees)

print("\n Drop Column by Index \n")
employees.drop(employees.columns[[0,1]], axis=1, inplace=True)
print(employees)

Output:

  EmpCode     Name    Occupation Date Of Join  Age
0  Emp001     John       Chemist   2018-01-25   23
1  Emp002      Doe  Statistician   2018-01-26   24
2  Emp003  William  Statistician   2018-01-26   34
3  Emp004    Spark  Statistician   2018-02-26   29
4  Emp005     Mark    Programmer   2018-03-16   40

 Drop Column by Name

  EmpCode     Name    Occupation Date Of Join
0  Emp001     John       Chemist   2018-01-25
1  Emp002      Doe  Statistician   2018-01-26
2  Emp003  William  Statistician   2018-01-26
3  Emp004    Spark  Statistician   2018-02-26
4  Emp005     Mark    Programmer   2018-03-16

 Drop Column by Index

     Occupation Date Of Join
0       Chemist   2018-01-25
1  Statistician   2018-01-26
2  Statistician   2018-01-26
3  Statistician   2018-02-26
4    Programmer   2018-03-16

15 Adding a new column to a DataFrame

import pandas as pd

employees = pd.DataFrame({
    'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
    'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
    'Occupation': ['Chemist', 'Statistician', 'Statistician',
                   'Statistician', 'Programmer'],
    'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
                     '2018-03-16'],
    'Age': [23, 24, 34, 29, 40]})

employees['City'] = ['London', 'Tokyo', 'Sydney', 'London', 'Toronto']

print(employees)

Output:

  EmpCode     Name    Occupation Date Of Join  Age     City
0  Emp001     John       Chemist   2018-01-25   23   London
1  Emp002      Doe  Statistician   2018-01-26   24    Tokyo
2  Emp003  William  Statistician   2018-01-26   34   Sydney
3  Emp004    Spark  Statistician   2018-02-26   29   London
4  Emp005     Mark    Programmer   2018-03-16   40  Toronto

16 How to get the list of column headers from a DataFrame

import pandas as pd

employees = pd.DataFrame({
    'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
    'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
    'Occupation': ['Chemist', 'Statistician', 'Statistician',
                   'Statistician', 'Programmer'],
    'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
                     '2018-03-16'],
    'Age': [23, 24, 34, 29, 40]})

print(list(employees))

print(list(employees.columns.values))

print(employees.columns.tolist())

Output:

['EmpCode', 'Name', 'Occupation', 'Date Of Join', 'Age']
['EmpCode', 'Name', 'Occupation', 'Date Of Join', 'Age']
['EmpCode', 'Name', 'Occupation', 'Date Of Join', 'Age']

17 How to generate a random DataFrame

import pandas as pd
import numpy as np

np.random.seed(5)

df_random = pd.DataFrame(np.random.randint(100, size=(10, 6)),
                         columns=list('ABCDEF'),
                         index=['Row-{}'.format(i) for i in range(10)])

print(df_random)

Output:

        A   B   C   D   E   F
Row-0  99  78  61  16  73   8
Row-1  62  27  30  80   7  76
Row-2  15  53  80  27  44  77
Row-3  75  65  47  30  84  86
Row-4  18   9  41  62   1  82
Row-5  16  78   5  58   0  80
Row-6   4  36  51  27  31   2
Row-7  68  38  83  19  18   7
Row-8  30  62  11  67  65  55
Row-9   3  91  78  27  29  33

18 How to select multiple columns of a DataFrame

import pandas as pd

employees = pd.DataFrame({
    'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
    'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
    'Occupation': ['Chemist', 'Statistician', 'Statistician',
                   'Statistician', 'Programmer'],
    'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
                     '2018-03-16'],
    'Age': [23, 24, 34, 29, 40]})

df = employees[['EmpCode', 'Age', 'Name']]
print(df)

Output:

  EmpCode  Age     Name
0  Emp001   23     John
1  Emp002   24      Doe
2  Emp003   34  William
3  Emp004   29    Spark
4  Emp005   40     Mark

19 How to convert a dictionary to a DataFrame

import pandas as pd

data = {'Age': [30, 20, 22, 40, 32, 28, 39],
        'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black', 'Red'],
        'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon',
                 'Beans'],
        'Height': [165, 70, 120, 80, 180, 172, 150],
        'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
        'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']}
print(data)

df = pd.DataFrame(data)

print(df)

Output:

{'Age': [30, 20, 22, 40, 32, 28, 39], 'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black', 'Red'], 'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'], 'Height': [165, 70, 120, 80, 180, 172, 150], 'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2], 'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']}
   Age  Color    Food  Height  Score State
0   30   Blue   Steak     165    4.6    NY
1   20  Green    Lamb      70    8.3    TX
2   22    Red   Mango     120    9.0    FL
3   40  White   Apple      80    3.3    AL
4   32   Gray  Cheese     180    1.8    AK
5   28  Black   Melon     172    9.5    TX
6   39    Red   Beans     150    2.2    TX

20 Slicing with loc

import pandas as pd

df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],
                   'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black',
                             'Red'],
                   'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese',
                            'Melon', 'Beans'],
                   'Height': [165, 70, 120, 80, 180, 172, 150],
                   'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])

print("\n -- Selecting a single row with .loc with a string -- \n")
print(df.loc['Penelope'])

print("\n -- Selecting multiple rows with .loc with a list of strings -- \n")
print(df.loc[['Cornelia', 'Jane', 'Dean']])

print("\n -- Selecting multiple rows with .loc with slice notation -- \n")
print(df.loc['Aaron':'Dean'])

Output:

 -- Selecting a single row with .loc with a string --

Age          40
Color     White
Food      Apple
Height       80
Score       3.3
State        AL
Name: Penelope, dtype: object

 -- Selecting multiple rows with .loc with a list of strings --

          Age Color    Food  Height  Score State
Cornelia   39   Red   Beans     150    2.2    TX
Jane       30  Blue   Steak     165    4.6    NY
Dean       32  Gray  Cheese     180    1.8    AK

 -- Selecting multiple rows with .loc with slice notation --

          Age  Color    Food  Height  Score State
Aaron      22    Red   Mango     120    9.0    FL
Penelope   40  White   Apple      80    3.3    AL
Dean       32   Gray  Cheese     180    1.8    AK

21 Checking whether a DataFrame is empty

import pandas as pd

df = pd.DataFrame()

if df.empty:
    print('DataFrame is empty!')

Output:

DataFrame is empty!
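
Note that empty only reports whether the DataFrame has no elements. As a small sketch (the names no_rows and only_nan are illustrative), a DataFrame with columns but no rows counts as empty, while one that holds only NaN values does not:

```python
import numpy as np
import pandas as pd

# Columns are defined, but there are no rows
no_rows = pd.DataFrame(columns=['A', 'B'])
print(no_rows.empty)   # True

# One row holding a missing value still counts as data
only_nan = pd.DataFrame({'A': [np.nan]})
print(only_nan.empty)  # False
```

To test for "nothing but missing values", something like only_nan.dropna().empty could be checked instead.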

22 Specifying the index and column names when creating a DataFrame

import pandas as pd

values = ["India", "Canada", "Australia",
          "Japan", "Germany", "France"]

code = ["IND", "CAN", "AUS", "JAP", "GER", "FRA"]

df = pd.DataFrame(values, index=code, columns=['Country'])

print(df)

Output:

       Country
IND      India
CAN     Canada
AUS  Australia
JAP      Japan
GER    Germany
FRA     France

23 Slicing with iloc

import pandas as pd

df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],
                   'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black',
                             'Red'],
                   'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese',
                            'Melon', 'Beans'],
                   'Height': [165, 70, 120, 80, 180, 172, 150],
                   'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])

print("\n -- Selecting a single row with .iloc with an integer -- \n")
print(df.iloc[4])

print("\n -- Selecting multiple rows with .iloc with a list of integers -- \n")
print(df.iloc[[2, -2]])

print("\n -- Selecting multiple rows with .iloc with slice notation -- \n")
print(df.iloc[:5:3])

Output:

 -- Selecting a single row with .iloc with an integer --

Age           32
Color       Gray
Food      Cheese
Height       180
Score        1.8
State         AK
Name: Dean, dtype: object

 -- Selecting multiple rows with .iloc with a list of integers --

           Age  Color   Food  Height  Score State
Aaron       22    Red  Mango     120    9.0    FL
Christina   28  Black  Melon     172    9.5    TX

 -- Selecting multiple rows with .iloc with slice notation --

          Age  Color   Food  Height  Score State
Jane       30   Blue  Steak     165    4.6    NY
Penelope   40  White  Apple      80    3.3    AL

24 The difference between iloc and loc

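loc selects rows (or columns) by their labels in the index.
iloc selects rows (or columns) by their positions in the index, so it takes integer positions rather than labels.
The loc indexer can also do boolean selection. For example, to find all rows where Age is less than 30 and return only the Color and Height columns, we can pass a boolean Series as the row selector. The same selection is possible with iloc, but iloc does not accept a boolean Series, so the Series must first be converted to a NumPy array.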

import pandas as pd

df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],
                   'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black',
                             'Red'],
                   'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese',
                            'Melon', 'Beans'],
                   'Height': [165, 70, 120, 80, 180, 172, 150],
                   'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])

print("\n -- loc -- \n")
print(df.loc[df['Age'] < 30, ['Color', 'Height']])

print("\n -- iloc -- \n")
print(df.iloc[(df['Age'] < 30).values, [1, 3]])

Output:

 -- loc --

           Color  Height
Nick       Green      70
Aaron        Red     120
Christina  Black     172

 -- iloc --

           Color  Height
Nick       Green      70
Aaron        Red     120
Christina  Black     172
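
The label-versus-position distinction is easiest to see when the index itself holds integers. The following is a minimal sketch with a made-up three-row table whose index labels are 1, 2, 3 (all variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Blue', 'Green', 'Red']}, index=[1, 2, 3])

by_label = df.loc[1, 'Color']     # label 1 is the first row
by_position = df.iloc[1, 0]       # position 1 is the second row

label_slice = df.loc[1:2]         # end label is included: labels 1 and 2
position_slice = df.iloc[1:2]     # end position is excluded: position 1 only

print(by_label, by_position)                  # Blue Green
print(len(label_slice), len(position_slice))  # 2 1
```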

25 Creating an empty DataFrame with a time index

import datetime
import pandas as pd

# The index starts today, so the dates in the output depend on when this runs
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date, periods=10, freq='D')

columns = ['A', 'B', 'C']

df = pd.DataFrame(index=index, columns=columns)
df = df.fillna(0)

print(df)

Output:

            A  B  C
2018-09-30  0  0  0
2018-10-01  0  0  0
2018-10-02  0  0  0
2018-10-03  0  0  0
2018-10-04  0  0  0
2018-10-05  0  0  0
2018-10-06  0  0  0
2018-10-07  0  0  0
2018-10-08  0  0  0
2018-10-09  0  0  0
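
If the goal is simply a table of zeros, one option is to pass the scalar 0 as the data, which skips the intermediate all-NaN DataFrame. A minimal sketch, assuming a fixed start date so that the result is reproducible:

```python
import pandas as pd

index = pd.date_range('2018-09-30', periods=10, freq='D')
columns = ['A', 'B', 'C']

# A scalar is broadcast to every cell of the given index and columns
df = pd.DataFrame(0, index=index, columns=columns)

print(df.shape)  # (10, 3)
```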

26 How to change the order of DataFrame columns

import pandas as pd

df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],
                   'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black',
                             'Red'],
                   'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese',
                            'Melon', 'Beans'],
                   'Height': [165, 70, 120, 80, 180, 172, 150],
                   'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])

print("\n -- Change order using columns -- \n")
new_order = [3, 2, 1, 4, 5, 0]
df = df[df.columns[new_order]]
print(df)

print("\n -- Change order using reindex -- \n")
df = df.reindex(['State', 'Color', 'Age', 'Food', 'Score', 'Height'], axis=1)
print(df)

Output:

 -- Change order using columns --

           Height    Food  Color  Score State  Age
Jane          165   Steak   Blue    4.6    NY   30
Nick           70    Lamb  Green    8.3    TX   20
Aaron         120   Mango    Red    9.0    FL   22
Penelope       80   Apple  White    3.3    AL   40
Dean          180  Cheese   Gray    1.8    AK   32
Christina     172   Melon  Black    9.5    TX   28
Cornelia      150   Beans    Red    2.2    TX   39

 -- Change order using reindex --

          State  Color  Age    Food  Score  Height
Jane         NY   Blue   30   Steak    4.6     165
Nick         TX  Green   20    Lamb    8.3      70
Aaron        FL    Red   22   Mango    9.0     120
Penelope     AL  White   40   Apple    3.3      80
Dean         AK   Gray   32  Cheese    1.8     180
Christina    TX  Black   28   Melon    9.5     172
Cornelia     TX    Red   39   Beans    2.2     150

27 Checking the data types of DataFrame columns

import pandas as pd

df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],
                   'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black',
                             'Red'],
                   'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese',
                            'Melon', 'Beans'],
                   'Height': [165, 70, 120, 80, 180, 172, 150],
                   'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])

print(df.dtypes)

Output:

Age         int64
Color      object
Food       object
Height      int64
Score     float64
State      object
dtype: object

28 Changing the data type of a specific DataFrame column

import pandas as pd

df = pd.DataFrame({'Age': [30, 20, 22, 40, 32, 28, 39],
                   'Color': ['Blue', 'Green', 'Red', 'White', 'Gray', 'Black',
                             'Red'],
                   'Food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese',
                            'Melon', 'Beans'],
                   'Height': [165, 70, 120, 80, 180, 172, 150],
                   'Score': [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])

print(df.dtypes)

df['Age'] = df['Age'].astype(str)

print(df.dtypes)

Output:

Age         int64
Color      object
Food       object
Height      int64
Score     float64
State      object
dtype: object
Age        object
Color      object
Food       object
Height      int64
Score     float64
State      object
dtype: object

29 How to convert a column's data type to DateTime

import pandas as pd

df = pd.DataFrame({'DateOFBirth': [1349720105, 1349806505, 1349892905,
                                   1349979305, 1350065705, 1349792905,
                                   1349730105],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])

print("\n----------------Before---------------\n")
print(df.dtypes)
print(df)

df['DateOFBirth'] = pd.to_datetime(df['DateOFBirth'], unit='s')

print("\n----------------After----------------\n")
print(df.dtypes)
print(df)

Output:

----------------Before---------------

DateOFBirth     int64
State          object
dtype: object
           DateOFBirth State
Jane        1349720105    NY
Nick        1349806505    TX
Aaron       1349892905    FL
Penelope    1349979305    AL
Dean        1350065705    AK
Christina   1349792905    TX
Cornelia    1349730105    TX

----------------After----------------

DateOFBirth    datetime64[ns]
State                  object
dtype: object
                  DateOFBirth State
Jane      2012-10-08 18:15:05    NY
Nick      2012-10-09 18:15:05    TX
Aaron     2012-10-10 18:15:05    FL
Penelope  2012-10-11 18:15:05    AL
Dean      2012-10-12 18:15:05    AK
Christina 2012-10-09 14:28:25    TX
Cornelia  2012-10-08 21:01:45    TX

30 Converting a DataFrame column from floats to ints

import pandas as pd

df = pd.DataFrame({'DailyExp': [75.7, 56.69, 55.69, 96.5, 84.9, 110.5,
                                58.9],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])

print("\n----------------Before---------------\n")
print(df.dtypes)
print(df)

df['DailyExp'] = df['DailyExp'].astype(int)

print("\n----------------After----------------\n")
print(df.dtypes)
print(df)

Output:

----------------Before---------------

DailyExp    float64
State        object
dtype: object
           DailyExp State
Jane          75.70    NY
Nick          56.69    TX
Aaron         55.69    FL
Penelope      96.50    AL
Dean          84.90    AK
Christina    110.50    TX
Cornelia      58.90    TX

----------------After----------------

DailyExp     int32
State       object
dtype: object
           DailyExp State
Jane             75    NY
Nick             56    TX
Aaron            55    FL
Penelope         96    AL
Dean             84    AK
Christina       110    TX
Cornelia         58    TX
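
As the output shows, astype(int) simply drops the fractional part (56.69 becomes 56). If rounding to the nearest integer is wanted instead, the values can be rounded before the cast. A minimal sketch with a few made-up expense values:

```python
import pandas as pd

expenses = pd.Series([75.7, 56.69, 84.9])

truncated = expenses.astype(int)        # drops the fractional part
rounded = expenses.round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())  # [75, 56, 84]
print(rounded.tolist())    # [76, 57, 85]
```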

31 How to convert a column of date strings to DateTime

import pandas as pd

df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
                                   '1986-06-01', '1983-06-04', '1990-03-07',
                                   '1999-07-09'],                   
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])

print("\n----------------Before---------------\n")
print(df.dtypes)

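# pd.to_datetime parses the strings; astype('datetime64') without a unit
# is rejected by newer pandas versions
df['DateOfBirth'] = pd.to_datetime(df['DateOfBirth'])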

print("\n----------------After----------------\n")
print(df.dtypes)

Output:

----------------Before---------------

DateOfBirth    object
State          object
dtype: object

----------------After----------------

DateOfBirth    datetime64[ns]
State                  object
dtype: object
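
Real-world date columns often contain values that cannot be parsed. As a hedged sketch (the sample strings are invented), pd.to_datetime can be told to turn such entries into NaT instead of raising an error:

```python
import pandas as pd

raw = pd.Series(['1986-11-11', 'not a date', '1999-07-09'])

# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

print(parsed.isna().tolist())  # [False, True, False]
```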

32 Concatenating two DataFrames

import pandas as pd

df1 = pd.DataFrame({'Age': [30, 20, 22, 40], 'Height': [165, 70, 120, 80],
                    'Score': [4.6, 8.3, 9.0, 3.3], 'State': ['NY', 'TX',
                                                             'FL', 'AL']},
                   index=['Jane', 'Nick', 'Aaron', 'Penelope'])

df2 = pd.DataFrame({'Age': [32, 28, 39], 'Color': ['Gray', 'Black', 'Red'],
                    'Food': ['Cheese', 'Melon', 'Beans'],
                    'Score': [1.8, 9.5, 2.2], 'State': ['AK', 'TX', 'TX']},
                   index=['Dean', 'Christina', 'Cornelia'])

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df3 = pd.concat([df1, df2], sort=True)

print(df3)

Output:

           Age  Color    Food  Height  Score State
Jane        30    NaN     NaN   165.0    4.6    NY
Nick        20    NaN     NaN    70.0    8.3    TX
Aaron       22    NaN     NaN   120.0    9.0    FL
Penelope    40    NaN     NaN    80.0    3.3    AL
Dean        32   Gray  Cheese     NaN    1.8    AK
Christina   28  Black   Melon     NaN    9.5    TX
Cornelia    39    Red   Beans     NaN    2.2    TX

33 Appending an extra row to the end of a DataFrame

import pandas as pd

employees = pd.DataFrame({
    'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
    'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
    'Occupation': ['Chemist', 'Statistician', 'Statistician',
                   'Statistician', 'Programmer'],
    'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
                     '2018-03-16'],
    'Age': [23, 24, 34, 29, 40]})

print("\n------------ BEFORE ----------------\n")
print(employees)

# The values must follow the column order of employees
employees.loc[len(employees)] = ['Emp006', 'Sunny', 'Programmer',
                                 '2018-01-25', 45]

print("\n------------ AFTER ----------------\n")
print(employees)

Output:

------------ BEFORE ----------------

  EmpCode     Name    Occupation Date Of Join  Age
0  Emp001     John       Chemist   2018-01-25   23
1  Emp002      Doe  Statistician   2018-01-26   24
2  Emp003  William  Statistician   2018-01-26   34
3  Emp004    Spark  Statistician   2018-02-26   29
4  Emp005     Mark    Programmer   2018-03-16   40

------------ AFTER ----------------

  EmpCode     Name    Occupation Date Of Join  Age
0  Emp001     John       Chemist   2018-01-25   23
1  Emp002      Doe  Statistician   2018-01-26   24
2  Emp003  William  Statistician   2018-01-26   34
3  Emp004    Spark  Statistician   2018-02-26   29
4  Emp005     Mark    Programmer   2018-03-16   40
5  Emp006    Sunny    Programmer   2018-01-25   45

34 Adding a new row at a specified index label

import pandas as pd

employees = pd.DataFrame(
    data={'Name': ['John Doe', 'William Spark'],
          'Occupation': ['Chemist', 'Statistician'],
          'Date Of Join': ['2018-01-25', '2018-01-26'],
          'Age': [23, 24]},
    index=['Emp001', 'Emp002'],
    columns=['Name', 'Occupation', 'Date Of Join', 'Age'])

print("\n------------ BEFORE ----------------\n")
print(employees)

employees.loc['Emp003'] = ['Sunny', 'Programmer', '2018-01-25', 45]

print("\n------------ AFTER ----------------\n")
print(employees)

Output:

------------ BEFORE ----------------

                 Name    Occupation Date Of Join  Age
Emp001       John Doe       Chemist   2018-01-25   23
Emp002  William Spark  Statistician   2018-01-26   24

------------ AFTER ----------------

                 Name    Occupation Date Of Join  Age
Emp001       John Doe       Chemist   2018-01-25   23
Emp002  William Spark  Statistician   2018-01-26   24
Emp003          Sunny    Programmer   2018-01-25   45

35 How to add rows with a for loop

import pandas as pd

cols = ['Zip']
lst = []
zip_code = 32100  # avoid shadowing the built-in zip()

for a in range(10):
    lst.append([zip_code])
    zip_code = zip_code + 1

df = pd.DataFrame(lst, columns=cols)

print(df)

Output:

     Zip
0  32100
1  32101
2  32102
3  32103
4  32104
5  32105
6  32106
7  32107
8  32108
9  32109

36 Adding a row at the top of a DataFrame

import pandas as pd

employees = pd.DataFrame({
    'EmpCode': ['Emp002', 'Emp003', 'Emp004'],
    'Name': ['John', 'Doe', 'William'],
    'Occupation': ['Chemist', 'Statistician', 'Statistician'],
    'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26'],
    'Age': [23, 24, 34]})

print("\n------------ BEFORE ----------------\n")
print(employees)

# New line, with keys in the same order as the columns of employees
line = pd.DataFrame({'EmpCode': 'Emp001', 'Name': 'Dean',
                     'Occupation': 'Chemist', 'Date Of Join': '2018-02-26',
                     'Age': 45}, index=[0])

# Concatenate the two DataFrames (.ix was removed in pandas 1.0)
employees = pd.concat([line, employees]).reset_index(drop=True)

print("\n------------ AFTER ----------------\n")
print(employees)

Output:

------------ BEFORE ----------------

  EmpCode     Name    Occupation Date Of Join  Age
0  Emp002     John       Chemist   2018-01-25   23
1  Emp003      Doe  Statistician   2018-01-26   24
2  Emp004  William  Statistician   2018-01-26   34

------------ AFTER ----------------

  EmpCode     Name    Occupation Date Of Join  Age
0  Emp001     Dean       Chemist   2018-02-26   45
1  Emp002     John       Chemist   2018-01-25   23
2  Emp003      Doe  Statistician   2018-01-26   24
3  Emp004  William  Statistician   2018-01-26   34

37 How to add rows to a DataFrame dynamically

import pandas as pd

df = pd.DataFrame(columns=['Name', 'Age'])

df.loc[1, 'Name'] = 'Rocky'
df.loc[1, 'Age'] = 23

df.loc[2, 'Name'] = 'Sunny'

print(df)

Output:

    Name  Age
1  Rocky   23
2  Sunny  NaN

38 Inserting a row at an arbitrary position

import pandas as pd

df = pd.DataFrame(columns=['Name', 'Age'])

df.loc[1, 'Name'] = 'Rocky'
df.loc[1, 'Age'] = 21

df.loc[2, 'Name'] = 'Sunny'
df.loc[2, 'Age'] = 22

df.loc[3, 'Name'] = 'Mark'
df.loc[3, 'Age'] = 25

df.loc[4, 'Name'] = 'Taylor'
df.loc[4, 'Age'] = 28

print("\n------------ BEFORE ----------------\n")
print(df)

line = pd.DataFrame({"Name": "Jack", "Age": 24}, index=[2.5])
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([df, line])
df = df.sort_index().reset_index(drop=True)

df = df.reindex(['Name', 'Age'], axis=1)
print("\n------------ AFTER ----------------\n")
print(df)

Output:

------------ BEFORE ----------------

     Name Age
1   Rocky  21
2   Sunny  22
3    Mark  25
4  Taylor  28

------------ AFTER ----------------

     Name Age
0   Rocky  21
1   Sunny  22
2    Jack  24
3    Mark  25
4  Taylor  28

39 Adding a row to a DataFrame using a timestamp index

import pandas as pd

df = pd.DataFrame(columns=['Name', 'Age'])

df.loc['2014-05-01 18:47:05', 'Name'] = 'Rocky'
df.loc['2014-05-01 18:47:05', 'Age'] = 21

df.loc['2014-05-02 18:47:05', 'Name'] = 'Sunny'
df.loc['2014-05-02 18:47:05', 'Age'] = 22

df.loc['2014-05-03 18:47:05', 'Name'] = 'Mark'
df.loc['2014-05-03 18:47:05', 'Age'] = 25

print("\n------------ BEFORE ----------------\n")
print(df)

line = pd.to_datetime("2014-05-01 18:50:05", format="%Y-%m-%d %H:%M:%S")
new_row = pd.DataFrame([['Bunny', 26]], columns=['Name', 'Age'], index=[line])
df = pd.concat([df, new_row], ignore_index=False)

print("\n------------ AFTER ----------------\n")
print(df)

Output:

------------ BEFORE ----------------

                      Name Age
2014-05-01 18:47:05  Rocky  21
2014-05-02 18:47:05  Sunny  22
2014-05-03 18:47:05   Mark  25

------------ AFTER ----------------

                      Name Age
2014-05-01 18:47:05  Rocky  21
2014-05-02 18:47:05  Sunny  22
2014-05-03 18:47:05   Mark  25
2014-05-01 18:50:05  Bunny  26

40 Filling missing values for rows with different columns

import pandas as pd

a = {'A': 10, 'B': 20}
b = {'B': 30, 'C': 40, 'D': 50}

df1 = pd.DataFrame(a, index=[0])
df2 = pd.DataFrame(b, index=[1])

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([df1, df2]).fillna(0)

print(df)

Output:

      A   B     C     D
0  10.0  20   0.0   0.0
1   0.0  30  40.0  50.0

41 append, concat, and combine_first examples

import pandas as pd

a = {'A': 10, 'B': 20}
b = {'B': 30, 'C': 40, 'D': 50}

df1 = pd.DataFrame(a, index=[0])
df2 = pd.DataFrame(b, index=[1])

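# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
# this part only runs on older versions (use pd.concat on newer ones)
d1 = pd.DataFrame()
d1 = d1.append(df1)
d1 = d1.append(df2).fillna(0)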
print("\n------------ append ----------------\n")
print(d1)

d2 = pd.concat([df1, df2]).fillna(0)
print("\n------------ concat ----------------\n")
print(d2)

d3 = pd.DataFrame()
d3 = d3.combine_first(df1).combine_first(df2).fillna(0)
print("\n------------ combine_first ----------------\n")
print(d3)

Output:

------------ append ----------------

      A   B     C     D
0  10.0  20   0.0   0.0
1   0.0  30  40.0  50.0

------------ concat ----------------

      A   B     C     D
0  10.0  20   0.0   0.0
1   0.0  30  40.0  50.0

------------ combine_first ----------------

      A     B     C     D
0  10.0  20.0   0.0   0.0
1   0.0  30.0  40.0  50.0

42 Getting the mean of rows and columns

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3'])

df['Mean Basket'] = df.mean(axis=1)
df.loc['Mean Fruit'] = df.mean()

print(df)

Output:

                Apple  Orange  Banana       Pear  Mean Basket
Basket1     10.000000    20.0    30.0  40.000000         25.0
Basket2      7.000000    14.0    21.0  28.000000         17.5
Basket3      5.000000     5.0     0.0   0.000000          2.5
Mean Fruit   7.333333    13.0    17.0  22.666667         15.0

43 Computing the sum of rows and columns

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3'])

df['Sum Basket'] = df.sum(axis=1)
df.loc['Sum Fruit'] = df.sum()

print(df)

Output:

           Apple  Orange  Banana  Pear  Sum Basket
Basket1       10      20      30    40         100
Basket2        7      14      21    28          70
Basket3        5       5       0     0          10
Sum Fruit     22      39      51    68         180

44 Concatenating two columns

import pandas as pd

df = pd.DataFrame(columns=['Name', 'Age'])

df.loc[1, 'Name'] = 'Rocky'
df.loc[1, 'Age'] = 21

df.loc[2, 'Name'] = 'Sunny'
df.loc[2, 'Age'] = 22

df.loc[3, 'Name'] = 'Mark'
df.loc[3, 'Age'] = 25

df.loc[4, 'Name'] = 'Taylor'
df.loc[4, 'Age'] = 28

print('\n------------ BEFORE ----------------\n')
print(df)

df['Employee'] = df['Name'].map(str) + ' - ' + df['Age'].map(str)
df = df.reindex(['Employee'], axis=1)

print('\n------------ AFTER ----------------\n')
print(df)

Output:

------------ BEFORE ----------------

     Name Age
1   Rocky  21
2   Sunny  22
3    Mark  25
4  Taylor  28

------------ AFTER ----------------

      Employee
1   Rocky - 21
2   Sunny - 22
3    Mark - 25
4  Taylor - 28

45 Filtering rows that contain a given string

import pandas as pd

df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
                                   '1986-06-01', '1983-06-04', '1990-03-07',
                                   '1999-07-09'],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])
print(df)

print("\n---- Filter with State contains TX ----\n")
df1 = df[df['State'].str.contains("TX")]

print(df1)

Output:

          DateOfBirth State
Jane       1986-11-11    NY
Nick       1999-05-12    TX
Aaron      1976-01-01    FL
Penelope   1986-06-01    AL
Dean       1983-06-04    AK
Christina  1990-03-07    TX
Cornelia   1999-07-09    TX

---- Filter with State contains TX ----

          DateOfBirth State
Nick       1999-05-12    TX
Christina  1990-03-07    TX
Cornelia   1999-07-09    TX
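
Note that `str.contains` treats its pattern as a regular expression by default and returns a missing value for missing inputs. The sketch below, using a small made-up Series (the values are illustrative, not from the example above), shows how `regex=False` and `na=False` change the resulting mask.

```python
import pandas as pd

states = pd.Series(['NY', 'TX', None, 'T.X'], name='State')

# Default behaviour: 'T.' is a regex, so '.' matches any character.
# na=False makes missing values count as "no match".
regex_mask = states.str.contains('T.', na=False)

# regex=False matches the two characters 'T.' literally.
literal_mask = states.str.contains('T.', regex=False, na=False)

print(states[regex_mask])
print(states[literal_mask])
```

Passing `na=False` is useful whenever the mask is used for row selection, because a mask containing missing values cannot be used as a boolean indexer.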

46 Filter rows whose index contains a given string

import pandas as pd

df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
                                   '1986-06-01', '1983-06-04', '1990-03-07',
                                   '1999-07-09'],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane',
                         'Christina', 'Cornelia'])
print(df)
print("\n---- Filter Index contains ane ----\n")
df.index = df.index.astype('str')
df1 = df[df.index.str.contains('ane')]

print(df1)

Output:

          DateOfBirth State
Jane       1986-11-11    NY
Pane       1999-05-12    TX
Aaron      1976-01-01    FL
Penelope   1986-06-01    AL
Frane      1983-06-04    AK
Christina  1990-03-07    TX
Cornelia   1999-07-09    TX

---- Filter Index contains ane ----

      DateOfBirth State
Jane   1986-11-11    NY
Pane   1999-05-12    TX
Frane  1983-06-04    AK

47 Filter rows containing specific string values with the AND operator

import pandas as pd

df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
                                   '1986-06-01', '1983-06-04', '1990-03-07',
                                   '1999-07-09'],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane',
                         'Christina', 'Cornelia'])
print(df)

print("\n---- Filter DataFrame using & ----\n")

df.index = df.index.astype('str')
df1 = df[df.index.str.contains('ane') & df['State'].str.contains("TX")]

print(df1)

Output:

          DateOfBirth State
Jane       1986-11-11    NY
Pane       1999-05-12    TX
Aaron      1976-01-01    FL
Penelope   1986-06-01    AL
Frane      1983-06-04    AK
Christina  1990-03-07    TX
Cornelia   1999-07-09    TX

---- Filter DataFrame using & ----

     DateOfBirth State
Pane  1999-05-12    TX

48 Find all rows containing a given string (OR operator)

import pandas as pd

df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
                                   '1986-06-01', '1983-06-04', '1990-03-07',
                                   '1999-07-09'],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane',
                         'Christina', 'Cornelia'])
print(df)

print("\n---- Filter DataFrame using | ----\n")

df.index = df.index.astype('str')
df1 = df[df.index.str.contains('ane') | df['State'].str.contains("TX")]

print(df1)

Output:

          DateOfBirth State
Jane       1986-11-11    NY
Pane       1999-05-12    TX
Aaron      1976-01-01    FL
Penelope   1986-06-01    AL
Frane      1983-06-04    AK
Christina  1990-03-07    TX
Cornelia   1999-07-09    TX

---- Filter DataFrame using | ----

          DateOfBirth State
Jane       1986-11-11    NY
Pane       1999-05-12    TX
Frane      1983-06-04    AK
Christina  1990-03-07    TX
Cornelia   1999-07-09    TX

49 Create another column based on whether a row's value contains a string

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
    'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
    'Occupation': ['Chemist', 'Accountant', 'Statistician',
                   'Statistician', 'Programmer'],
    'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
                     '2018-03-16'],
    'Age': [23, 24, 34, 29, 40]})

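# pd.np was removed from pandas; call NumPy directly through the np import.
# Nested np.where: the first matching condition decides the department.
df['Department'] = np.where(df.Occupation.str.contains("Chemist"), "Science",
                   np.where(df.Occupation.str.contains("Statistician"), "Economics",
                   np.where(df.Occupation.str.contains("Programmer"), "Computer", "General")))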

print(df)

Output:

  EmpCode     Name    Occupation Date Of Join  Age Department
0  Emp001     John       Chemist   2018-01-25   23    Science
1  Emp002      Doe    Accountant   2018-01-26   24    General
2  Emp003  William  Statistician   2018-01-26   34  Economics
3  Emp004    Spark  Statistician   2018-02-26   29  Economics
4  Emp005     Mark    Programmer   2018-03-16   40   Computer
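
When there are more than two or three branches, nested `where` calls become hard to read. One possible alternative is `np.select`, which takes parallel lists of conditions and results. The following is a minimal sketch on a reduced copy of the table; the names `conditions` and `choices` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
    'Occupation': ['Chemist', 'Accountant', 'Statistician',
                   'Statistician', 'Programmer']})

# Conditions are checked in order; the first one that is True wins.
conditions = [
    df['Occupation'].str.contains('Chemist'),
    df['Occupation'].str.contains('Statistician'),
    df['Occupation'].str.contains('Programmer'),
]
choices = ['Science', 'Economics', 'Computer']

# Rows that match no condition receive the default value.
df['Department'] = np.select(conditions, choices, default='General')

print(df)
```

Adding a new department then only requires appending one entry to each list.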

50 Count the rows in each pandas group

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0],
                   [6, 6, 6, 6], [8, 8, 8, 8], [5, 5, 0, 0]],
                  columns=['Apple', 'Orange', 'Rice', 'Oil'],
                  index=['Basket1', 'Basket2', 'Basket3',
                         'Basket4', 'Basket5', 'Basket6'])

print(df)
print("\n ----------------------------- \n")
print(df[['Apple', 'Orange', 'Rice', 'Oil']].
      groupby(['Apple']).agg(['mean', 'count']))

Output:

         Apple  Orange  Rice  Oil
Basket1     10      20    30   40
Basket2      7      14    21   28
Basket3      5       5     0    0
Basket4      6       6     6    6
Basket5      8       8     8    8
Basket6      5       5     0    0

 -----------------------------

      Orange       Rice        Oil
        mean count mean count mean count
Apple
5          5     2    0     2    0     2
6          6     1    6     1    6     1
7         14     1   21     1   28     1
8          8     1    8     1    8     1
10        20     1   30     1   40     1
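
If only the number of rows per group is needed, `groupby(...).size()` is a shorter route than aggregating every column. A minimal sketch on a two-column version of the table above:

```python
import pandas as pd

df = pd.DataFrame([[10, 20], [7, 14], [5, 5], [6, 6], [8, 8], [5, 5]],
                  columns=['Apple', 'Orange'],
                  index=['Basket1', 'Basket2', 'Basket3',
                         'Basket4', 'Basket5', 'Basket6'])

# size() returns one row count per group, regardless of missing values.
rows_per_group = df.groupby('Apple').size()

print(rows_per_group)
```

Unlike `count`, which counts non-missing values column by column, `size` returns a single number per group.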

51 Check whether a string is present in a DataFrame

import pandas as pd

df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
                                   '1986-06-01', '1983-06-04', '1990-03-07',
                                   '1999-07-09'],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane',
                         'Christina', 'Cornelia'])

if df['State'].str.contains('TX').any():
    print("TX is there")

Output:

TX is there

52 Get the unique values of a DataFrame column

import pandas as pd

df = pd.DataFrame({'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])

print(df)
print("\n----------------\n")

print(df["State"].unique())

Output:

          State
Jane         NY
Nick         TX
Aaron        FL
Penelope     AL
Dean         AK
Christina    TX
Cornelia     TX

----------------

['NY' 'TX' 'FL' 'AL' 'AK']

53 Count the distinct values of a DataFrame column

import pandas as pd

df = pd.DataFrame({'Age': [30, 20, 22, 40, 20, 30, 20, 25],
                    'Height': [165, 70, 120, 80, 162, 72, 124, 81],
                    'Score': [4.6, 8.3, 9.0, 3.3, 4, 8, 9, 3],
                    'State': ['NY', 'TX', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},
                   index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Jaane', 'Nicky', 'Armour', 'Ponting'])

print(df.Age.value_counts())

Output:

20    3
30    2
25    1
22    1
40    1
Name: Age, dtype: int64

54 Drop rows with duplicate index labels

import pandas as pd

df = pd.DataFrame({'Age': [30, 30, 22, 40, 20, 30, 20, 25],
                   'Height': [165, 165, 120, 80, 162, 72, 124, 81],
                   'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3],
                   'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},
                  index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky',
                         'Armour', 'Ponting'])

print("\n -------- Duplicate Rows ----------- \n")
print(df)

df1 = df.reset_index().drop_duplicates(subset='index',
                                       keep='first').set_index('index')

print("\n ------- Unique Rows ------------ \n")
print(df1)

Output:

 -------- Duplicate Rows -----------

          Age  Height  Score State
Jane       30     165    4.6    NY
Jane       30     165    4.6    NY
Aaron      22     120    9.0    FL
Penelope   40      80    3.3    AL
Jaane      20     162    4.0    NY
Nicky      30      72    8.0    TX
Armour     20     124    9.0    FL
Ponting    25      81    3.0    AL

 ------- Unique Rows ------------

          Age  Height  Score State
index
Jane       30     165    4.6    NY
Aaron      22     120    9.0    FL
Penelope   40      80    3.3    AL
Jaane      20     162    4.0    NY
Nicky      30      72    8.0    TX
Armour     20     124    9.0    FL
Ponting    25      81    3.0    AL
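
The same result can be reached without resetting the index by building a boolean mask with `Index.duplicated`. This is a sketch of an alternative on a shortened table, not a replacement for the approach above.

```python
import pandas as pd

df = pd.DataFrame({'Age': [30, 30, 22, 40],
                   'State': ['NY', 'NY', 'FL', 'AL']},
                  index=['Jane', 'Jane', 'Aaron', 'Penelope'])

# Index.duplicated marks every repeat of a label after its first occurrence.
is_repeat = df.index.duplicated(keep='first')

# Invert the mask to keep only the first row for each label.
df1 = df[~is_repeat]

print(df1)
```

Because the index is never reset, the result does not gain the extra `index` name that appears in the output above.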

55 Drop rows with duplicate values in certain columns

import pandas as pd

df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],
                   'Height': [120, 162, 120, 120, 120, 72, 120, 81],
                   'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3],
                   'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},
                  index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky',
                         'Armour', 'Ponting'])

print("\n -------- Duplicate Rows ----------- \n")
print(df)

df1 = df.reset_index().drop_duplicates(subset=['Age','Height'],
                                       keep='first').set_index('index')

print("\n ------- Unique Rows ------------ \n")
print(df1)

Output:

 -------- Duplicate Rows -----------

          Age  Height  Score State
Jane       30     120    4.6    NY
Jane       40     162    4.6    NY
Aaron      30     120    9.0    FL
Penelope   40     120    3.3    AL
Jaane      30     120    4.0    NY
Nicky      30      72    8.0    TX
Armour     20     120    9.0    FL
Ponting    25      81    3.0    AL

 ------- Unique Rows ------------

          Age  Height  Score State
index
Jane       30     120    4.6    NY
Jane       40     162    4.6    NY
Penelope   40     120    3.3    AL
Nicky      30      72    8.0    TX
Armour     20     120    9.0    FL
Ponting    25      81    3.0    AL

56 Get a value from a DataFrame cell

import pandas as pd

df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],
                   'Height': [120, 162, 120, 120, 120, 72, 120, 81],
                   'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3],
                   'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},
                  index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky',
                         'Armour', 'Ponting'])

print(df.loc['Nicky', 'Age'])

Output:

30

57 Get a scalar cell value using conditional indexing

import pandas as pd

df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],
                   'Height': [120, 162, 120, 120, 120, 72, 120, 81],
                   'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3],
                   'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},
                  index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky',
                         'Armour', 'Ponting'])

print("\nGet Height where Age is 20")
print(df.loc[df['Age'] == 20, 'Height'].values[0])

print("\nGet State where Age is 30")
print(df.loc[df['Age'] == 30, 'State'].values[0])

Output:

Get Height where Age is 20
120

Get State where Age is 30
NY

58 Set specific cell values in a DataFrame

import pandas as pd

df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],
                   'Height': [120, 162, 120, 120, 120, 72, 120, 81]},
                  index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky',
                         'Armour', 'Ponting'])
print("\n--------------Before------------\n")
print(df)

df.iat[0, 0] = 90
df.iat[0, 1] = 91
df.iat[1, 1] = 92
df.iat[2, 1] = 93
df.iat[7, 1] = 99

print("\n--------------After------------\n")
print(df)

Output:

--------------Before------------

          Age  Height
Jane       30     120
Jane       40     162
Aaron      30     120
Penelope   40     120
Jaane      30     120
Nicky      30      72
Armour     20     120
Ponting    25      81

--------------After------------

          Age  Height
Jane       90      91
Jane       40      92
Aaron      30      93
Penelope   40     120
Jaane      30     120
Nicky      30      72
Armour     20     120
Ponting    25      99

59 Get cell values from DataFrame rows

import pandas as pd

df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],
                   'Height': [120, 162, 120, 120, 120, 72, 120, 81]},
                  index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky',
                         'Armour', 'Ponting'])


print(df.loc[df.Age == 30,'Height'].tolist())

Output:

[120, 120, 120, 72]

60 Replace values in a DataFrame column using a dictionary

import pandas as pd

df = pd.DataFrame({'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])

print(df)

# Avoid naming the variable "dict", which would shadow the built-in type.
state_codes = {"NY": 1, "TX": 2, "FL": 3, "AL": 4, "AK": 5}
df1 = df.replace({"State": state_codes})

print("\n\n")
print(df1)

Output:

          State
Jane         NY
Nick         TX
Aaron        FL
Penelope     AL
Dean         AK
Christina    TX
Cornelia     TX



           State
Jane           1
Nick           2
Aaron          3
Penelope       4
Dean           5
Christina      2
Cornelia       2

61 Count the distinct values of one column grouped by another column

import pandas as pd

df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
                                   '1986-06-01', '1983-06-04', '1990-03-07',
                                   '1999-07-09'],                   
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean',
                         'Christina', 'Cornelia'])

print(df.groupby('State').DateOfBirth.nunique())

Output:

State
AK    1
AL    1
FL    1
NY    1
TX    3
Name: DateOfBirth, dtype: int64

62 Handle missing values in a DataFrame

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5,]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3'])

print("\n--------- DataFrame ---------\n")
print(df)

print("\n--------- Use of isnull() ---------\n")
print(df.isnull())

print("\n--------- Use of notnull() ---------\n")
print(df.notnull())

Output:

--------- DataFrame ---------

         Apple  Orange  Banana  Pear
Basket1     10    20.0    30.0  40.0
Basket2      7    14.0    21.0  28.0
Basket3      5     NaN     NaN   NaN

--------- Use of isnull() ---------

         Apple  Orange  Banana   Pear
Basket1  False   False   False  False
Basket2  False   False   False  False
Basket3  False    True    True   True

--------- Use of notnull() ---------

         Apple  Orange  Banana   Pear
Basket1   True    True    True   True
Basket2   True    True    True   True
Basket3   True   False   False  False

63 Drop rows containing any missing data

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5,]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3'])

print("\n--------- DataFrame ---------\n")
print(df)

print("\n--------- Use of dropna() ---------\n")
print(df.dropna())

Output:

--------- DataFrame ---------

         Apple  Orange  Banana  Pear
Basket1     10    20.0    30.0  40.0
Basket2      7    14.0    21.0  28.0
Basket3      5     NaN     NaN   NaN

--------- Use of dropna() ---------

         Apple  Orange  Banana  Pear
Basket1     10    20.0    30.0  40.0
Basket2      7    14.0    21.0  28.0

64 Drop columns with missing data from a DataFrame

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5,]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3'])

print("\n--------- DataFrame ---------\n")
print(df)

print("\n--------- Drop Columns ---------\n")
# axis must be passed by keyword in recent pandas versions.
print(df.dropna(axis=1))

Output:

--------- DataFrame ---------

         Apple  Orange  Banana  Pear
Basket1     10    20.0    30.0  40.0
Basket2      7    14.0    21.0  28.0
Basket3      5     NaN     NaN   NaN

--------- Drop Columns ---------

         Apple
Basket1     10
Basket2      7
Basket3      5
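
By default `dropna` removes a column as soon as it holds a single missing value. The sketch below shows two less aggressive options, `how='all'` and `thresh`, on a small made-up table (the data and variable names are illustrative).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Apple': [10, 7, 5],
                   'Orange': [20.0, 14.0, np.nan],
                   'Banana': [np.nan, np.nan, np.nan]},
                  index=['Basket1', 'Basket2', 'Basket3'])

# how='all' drops a column only when every value in it is missing.
all_missing_dropped = df.dropna(axis=1, how='all')

# thresh=3 keeps only columns with at least three non-missing values.
complete_enough = df.dropna(axis=1, thresh=3)

print(all_missing_dropped)
print(complete_enough)
```

Here `how='all'` keeps Apple and Orange, while `thresh=3` keeps only Apple.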

65 Sort index values in descending order

import pandas as pd

df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
                                   '1986-06-01', '1983-06-04', '1990-03-07',
                                   '1999-07-09'],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane',
                         'Christina', 'Cornelia'])

print(df.sort_index(ascending=False))

Output:

          DateOfBirth State
Penelope   1986-06-01    AL
Pane       1999-05-12    TX
Jane       1986-11-11    NY
Frane      1983-06-04    AK
Cornelia   1999-07-09    TX
Christina  1990-03-07    TX
Aaron      1976-01-01    FL

66 Sort columns in descending order

import pandas as pd

employees = pd.DataFrame({
    'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
    'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
    'Occupation': ['Chemist', 'Statistician', 'Statistician',
                   'Statistician', 'Programmer'],
    'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
                     '2018-03-16'],
    'Age': [23, 24, 34, 29, 40]})


print(employees.sort_index(axis=1, ascending=False))

Output:

     Occupation     Name EmpCode Date Of Join  Age
0       Chemist     John  Emp001   2018-01-25   23
1  Statistician      Doe  Emp002   2018-01-26   24
2  Statistician  William  Emp003   2018-01-26   34
3  Statistician    Spark  Emp004   2018-02-26   29
4    Programmer     Mark  Emp005   2018-03-16   40

67 Find the rank of elements in a DataFrame using the rank method

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [5, 5, 0, 0]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3'])

print("\n--------- DataFrame Values--------\n")
print(df)

print("\n--------- DataFrame Values by Rank--------\n")
print(df.rank())

Output:

--------- DataFrame Values--------

         Apple  Orange  Banana  Pear
Basket1     10      20      30    40
Basket2      7      14      21    28
Basket3      5       5       0     0

--------- DataFrame Values by Rank--------

         Apple  Orange  Banana  Pear
Basket1    3.0     3.0     3.0   3.0
Basket2    2.0     2.0     2.0   2.0
Basket3    1.0     1.0     1.0   1.0
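
The example above has no ties, so every rank is a whole number. The following sketch, on a made-up column with a tie, illustrates how the `method` and `ascending` parameters of `rank` affect the result.

```python
import pandas as pd

df = pd.DataFrame({'Apple': [10, 7, 7, 5]},
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4'])

# Default (method='average'): tied values share the mean of their ranks.
average_rank = df['Apple'].rank()

# method='min' gives tied values the lowest rank of the group;
# ascending=False ranks the largest value first.
competition_rank = df['Apple'].rank(method='min', ascending=False)

print(average_rank)
print(competition_rank)
```

With the default method the two baskets holding 7 apples both receive rank 2.5; with `method='min'` and descending order they both receive rank 2 and the next basket receives rank 4.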

68 Set an index on multiple columns

import pandas as pd

employees = pd.DataFrame({
    'EmpCode': ['Emp001', 'Emp002', 'Emp003', 'Emp004', 'Emp005'],
    'Name': ['John', 'Doe', 'William', 'Spark', 'Mark'],
    'Occupation': ['Chemist', 'Statistician', 'Statistician',
                   'Statistician', 'Programmer'],
    'Date Of Join': ['2018-01-25', '2018-01-26', '2018-01-26', '2018-02-26',
                     '2018-03-16'],
    'Age': [23, 24, 34, 29, 40]})

print("\n --------- Before Index ----------- \n")
print(employees)

print("\n --------- Multiple Indexing ----------- \n")
print(employees.set_index(['Occupation', 'Age']))

Output:

                 Date Of Join EmpCode     Name
Occupation   Age
Chemist      23    2018-01-25  Emp001     John
Statistician 24    2018-01-26  Emp002      Doe
             34    2018-01-26  Emp003  William
             29    2018-02-26  Emp004    Spark
Programmer   40    2018-03-16  Emp005     Mark

69 Define a period index and columns for a DataFrame

import pandas as pd

values = ["India", "Canada", "Australia",
          "Japan", "Germany", "France"]

pidx = pd.period_range('2015-01-01', periods=6)

df = pd.DataFrame(values, index=pidx, columns=['Country'])

print(df)

Output:

              Country
2015-01-01      India
2015-01-02     Canada
2015-01-03  Australia
2015-01-04      Japan
2015-01-05    Germany
2015-01-06     France

70 Import a CSV with a specific index column

import pandas as pd

df = pd.read_csv('test.csv', index_col="DateTime")
print(df)

Output:

             Wheat    Rice     Oil
DateTime
10/10/2016  10.500  12.500  16.500
10/11/2016  11.250  12.750  17.150
10/12/2016  10.000  13.150  15.500
10/13/2016  12.000  14.500  16.100
10/14/2016  13.000  14.825  15.600
10/15/2016  13.075  15.465  15.315
10/16/2016  13.650  16.105  15.030
10/17/2016  14.225  16.745  14.745
10/18/2016  14.800  17.385  14.460
10/19/2016  15.375  18.025  14.175
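
In the output above the index labels are plain strings. If date-based operations are needed, `read_csv` can be asked to parse the index as dates. The sketch below uses an in-memory stand-in for `test.csv`; its contents are an assumption reconstructed from the first rows of the output above.

```python
import io
import pandas as pd

# Stand-in for test.csv (assumed layout, rows copied from the output above).
csv_text = """DateTime,Wheat,Rice,Oil
10/10/2016,10.500,12.500,16.500
10/11/2016,11.250,12.750,17.150
10/12/2016,10.000,13.150,15.500
"""

# parse_dates=True asks read_csv to try parsing the index column as dates.
df = pd.read_csv(io.StringIO(csv_text), index_col='DateTime', parse_dates=True)

print(df.index)
```

With a parsed index, rows can then be selected by date, for example `df.loc['2016-10-11']`.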

71 Write a DataFrame to CSV

import pandas as pd

df = pd.DataFrame({'DateOfBirth': ['1986-11-11', '1999-05-12', '1976-01-01',
                                   '1986-06-01', '1983-06-04', '1990-03-07',
                                   '1999-07-09'],
                   'State': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']
                   },
                  index=['Jane', 'Pane', 'Aaron', 'Penelope', 'Frane',
                         'Christina', 'Cornelia'])

df.to_csv('test.csv', encoding='utf-8', index=True)

Output:

Check the local file (test.csv).

72 Read specific columns of a CSV file with Pandas

import pandas as pd

df = pd.read_csv("test.csv", usecols=['Wheat', 'Oil'])
print(df)

73 Get the list of CSV columns with Pandas

import pandas as pd

cols = list(pd.read_csv("test.csv", nrows=1))
print(cols)

Output:

['DateTime', 'Wheat', 'Rice', 'Oil']

74 Find the row with the maximum column value

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3'])

# .ix was removed from pandas; idxmax returns a label, so .loc is the match.
print(df.loc[df['Apple'].idxmax()])

Output:

Apple     55
Orange    15
Banana     8
Pear      12
Name: Basket3, dtype: int64

75 Select with complex conditions using the query method

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3'])

print(df)

print("\n ----------- Filter data using query method ------------- \n")
# query already returns the matching rows, so no .ix / .loc lookup is needed.
df1 = df.query('Apple > 50 & Orange <= 15 & Banana < 15 & Pear == 12')
print(df1)

Output:

         Apple  Orange  Banana  Pear
Basket1     10      20      30    40
Basket2      7      14      21    28
Basket3     55      15       8    12

 ----------- Filter data using query method -------------

         Apple  Orange  Banana  Pear
Basket3     55      15       8    12
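
Thresholds do not have to be hard-coded in the query string: prefixing a name with `@` refers to a Python variable in the surrounding scope. A minimal sketch, where `min_apples` is a hypothetical variable introduced for illustration:

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3'])

# Hypothetical threshold held in an ordinary Python variable.
min_apples = 8

# Inside the query string, @name refers to a local variable.
df1 = df.query('Apple > @min_apples and Pear < 50')

print(df1)
```

This keeps the condition readable while letting the threshold come from user input or configuration.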

76 Check whether a column exists in Pandas

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3'])

if 'Apple' in df.columns:
    print("Yes")
else:
    print("No")


if set(['Apple','Orange']).issubset(df.columns):
    print("Yes")
else:
    print("No")

77 Find the n-smallest and n-largest values of a specific column in a DataFrame

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

print("\n----------- nsmallest -----------\n")
print(df.nsmallest(2, ['Apple']))

print("\n----------- nlargest -----------\n")
print(df.nlargest(2, ['Apple']))

Output:

----------- nsmallest -----------

         Apple  Orange  Banana  Pear
Basket6      5       4       9     2
Basket2      7      14      21    28

----------- nlargest -----------

         Apple  Orange  Banana  Pear
Basket3     55      15       8    12
Basket4     15      14       1     8

78 Find the minimum and maximum of all columns in a DataFrame

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

print("\n----------- Minimum -----------\n")
print(df[['Apple', 'Orange', 'Banana', 'Pear']].min())

print("\n----------- Maximum -----------\n")
print(df[['Apple', 'Orange', 'Banana', 'Pear']].max())

Output:

----------- Minimum -----------

Apple     5
Orange    1
Banana    1
Pear      2
dtype: int64

----------- Maximum -----------

Apple     55
Orange    20
Banana    30
Pear      40
dtype: int64

79 Find the index positions of the minimum and maximum values in a DataFrame

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

print("\n----------- Minimum -----------\n")
print(df[['Apple', 'Orange', 'Banana', 'Pear']].idxmin())

print("\n----------- Maximum -----------\n")
print(df[['Apple', 'Orange', 'Banana', 'Pear']].idxmax())

Output:

----------- Minimum -----------

Apple     Basket6
Orange    Basket5
Banana    Basket4
Pear      Basket6
dtype: object

----------- Maximum -----------

Apple     Basket3
Orange    Basket1
Banana    Basket1
Pear      Basket1
dtype: object

80 Calculate the cumulative product and cumulative sum of DataFrame columns

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

print("\n----------- Cumulative Product -----------\n")
print(df[['Apple', 'Orange', 'Banana', 'Pear']].cumprod())

print("\n----------- Cumulative Sum -----------\n")
print(df[['Apple', 'Orange', 'Banana', 'Pear']].cumsum())

Output:

----------- Cumulative Product -----------

           Apple  Orange  Banana     Pear
Basket1       10      20      30       40
Basket2       70     280     630     1120
Basket3     3850    4200    5040    13440
Basket4    57750   58800    5040   107520
Basket5   404250   58800    5040   860160
Basket6  2021250  235200   45360  1720320

----------- Cumulative Sum -----------

         Apple  Orange  Banana  Pear
Basket1     10      20      30    40
Basket2     17      34      51    68
Basket3     72      49      59    80
Basket4     87      63      60    88
Basket5     94      64      61    96
Basket6     99      68      70    98

81 Summary statistics

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

print("\n----------- Describe DataFrame -----------\n")
print(df.describe())

print("\n----------- Describe Column -----------\n")
print(df[['Apple']].describe())

Output:

----------- Describe DataFrame -----------

           Apple     Orange     Banana       Pear
count   6.000000   6.000000   6.000000   6.000000
mean   16.500000  11.333333  11.666667  16.333333
std    19.180719   7.257180  11.587349  14.555640
min     5.000000   1.000000   1.000000   2.000000
25%     7.000000   6.500000   2.750000   8.000000
50%     8.500000  14.000000   8.500000  10.000000
75%    13.750000  14.750000  18.000000  24.000000
max    55.000000  20.000000  30.000000  40.000000

----------- Describe Column -----------

           Apple
count   6.000000
mean   16.500000
std    19.180719
min     5.000000
25%     7.000000
50%     8.500000
75%    13.750000
max    55.000000

82 Find the mean, median, and mode of a DataFrame

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

print("\n----------- Calculate Mean -----------\n")
print(df.mean())

print("\n----------- Calculate Median -----------\n")
print(df.median())

print("\n----------- Calculate Mode -----------\n")
print(df.mode())

Output:

----------- Calculate Mean -----------

Apple     16.500000
Orange    11.333333
Banana    11.666667
Pear      16.333333
dtype: float64

----------- Calculate Median -----------

Apple      8.5
Orange    14.0
Banana     8.5
Pear      10.0
dtype: float64

----------- Calculate Mode -----------

   Apple  Orange  Banana  Pear
0      7      14       1     8

83 Measure the variance and standard deviation of DataFrame columns

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

print("\n----------- Measure Variance -----------\n")
print(df.var())

print("\n----------- Standard Deviation -----------\n")
print(df.std())

Output:

----------- Measure Variance -----------

Apple     367.900000
Orange     52.666667
Banana    134.266667
Pear      211.866667
dtype: float64

----------- Standard Deviation -----------

Apple     19.180719
Orange     7.257180
Banana    11.587349
Pear      14.555640
dtype: float64

84 Calculate the covariance between DataFrame columns

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

print("\n----------- Calculating Covariance -----------\n")
print(df.cov())

print("\n----------- Between 2 columns -----------\n")
# Covariance of Apple vs Orange
print(df.Apple.cov(df.Orange))

Output:

----------- Calculating Covariance -----------

        Apple     Orange      Banana        Pear
Apple   367.9  47.600000  -40.200000  -35.000000
Orange   47.6  52.666667   54.333333   77.866667
Banana  -40.2  54.333333  134.266667  154.933333
Pear    -35.0  77.866667  154.933333  211.866667

----------- Between 2 columns -----------

47.60000000000001

85 Calculate the correlation between two DataFrame objects in Pandas

import pandas as pd

df1 = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

print("\n------ Calculating Correlation of one DataFrame Columns -----\n")
print(df1.corr())

df2 = pd.DataFrame([[52, 54, 58, 41], [14, 24, 51, 78], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 17, 18, 98], [15, 34, 29, 52]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

print("\n----- Calculating correlation between two DataFrame -------\n")
print(df2.corrwith(other=df1))

Output:

------ Calculating Correlation of one DataFrame Columns -----

           Apple    Orange    Banana      Pear
Apple   1.000000  0.341959 -0.180874 -0.125364
Orange  0.341959  1.000000  0.646122  0.737144
Banana -0.180874  0.646122  1.000000  0.918606
Pear   -0.125364  0.737144  0.918606  1.000000

----- Calculating correlation between two DataFrame -------

Apple     0.678775
Orange    0.354993
Banana    0.920872
Pear      0.076919
dtype: float64

86 Calculate the percentage change of each cell in a DataFrame column

import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

print("\n------ Percent change at each cell of a Column -----\n")
print(df[['Apple']].pct_change()[:3])

print("\n------ Percent change at each cell of a DataFrame -----\n")
print(df.pct_change()[:5])

Output:

------ Percent change at each cell of a Column -----

            Apple
Basket1       NaN
Basket2 -0.300000
Basket3  6.857143

------ Percent change at each cell of a DataFrame -----

            Apple    Orange    Banana      Pear
Basket1       NaN       NaN       NaN       NaN
Basket2 -0.300000 -0.300000 -0.300000 -0.300000
Basket3  6.857143  0.071429 -0.619048 -0.571429
Basket4 -0.727273 -0.066667 -0.875000 -0.333333
Basket5 -0.533333 -0.928571  0.000000  0.000000
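
The numbers above can be reproduced by hand: each cell is compared with the cell one row above it in the same column. A minimal sketch, assuming the same `df` as in this case (the name `manual` is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

pct = df.pct_change()

# (current - previous) / previous, where "previous" is the row above
previous = df.shift(1)
manual = (df - previous) / previous

# The first row has no previous row, so both results hold NaN there
print(np.allclose(pct, manual, equal_nan=True))
```

For example, Apple goes from 10 in Basket1 to 7 in Basket2, so the change is (7 - 10) / 10 = -0.3.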

87 Forward and backward filling missing values in DataFrame columns in Pandas

import pandas as pd

df = pd.DataFrame([[10, 30, 40], [], [15, 8, 12],
                   [15, 14, 1, 8], [7, 8], [5, 4, 1]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

print("\n------ DataFrame with NaN -----\n")
print(df)

print("\n------ DataFrame with Forward Filling -----\n")
print(df.ffill())

print("\n------ DataFrame with Backward Filling -----\n")
print(df.bfill())

Output:

------ DataFrame with NaN -----

         Apple  Orange  Banana  Pear
Basket1   10.0    30.0    40.0   NaN
Basket2    NaN     NaN     NaN   NaN
Basket3   15.0     8.0    12.0   NaN
Basket4   15.0    14.0     1.0   8.0
Basket5    7.0     8.0     NaN   NaN
Basket6    5.0     4.0     1.0   NaN

------ DataFrame with Forward Filling -----

         Apple  Orange  Banana  Pear
Basket1   10.0    30.0    40.0   NaN
Basket2   10.0    30.0    40.0   NaN
Basket3   15.0     8.0    12.0   NaN
Basket4   15.0    14.0     1.0   8.0
Basket5    7.0     8.0     1.0   8.0
Basket6    5.0     4.0     1.0   8.0

------ DataFrame with Backward Filling -----

         Apple  Orange  Banana  Pear
Basket1   10.0    30.0    40.0   8.0
Basket2   15.0     8.0    12.0   8.0
Basket3   15.0     8.0    12.0   8.0
Basket4   15.0    14.0     1.0   8.0
Basket5    7.0     8.0     1.0   NaN
Basket6    5.0     4.0     1.0   NaN
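
Neither fill on its own removes every NaN here: forward filling cannot fill the leading gaps in Pear, and backward filling cannot fill the trailing ones. A hedged sketch of two common variations follows: chaining both fills, and capping the fill length with `limit`. It writes the missing cells as explicit `np.nan` instead of relying on short rows, which is an assumption made for clarity rather than something the original example does.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([[10, 30, 40, nan], [nan, nan, nan, nan],
                   [15, 8, 12, nan], [15, 14, 1, 8],
                   [7, 8, nan, nan], [5, 4, 1, nan]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

# Forward fill first, then backward fill whatever is still missing
filled = df.ffill().bfill()
print(filled)

# limit=1 carries a value forward across at most one consecutive gap
limited = df.ffill(limit=1)
print(limited)
```

With `limit=1`, the Pear value 8.0 from Basket4 reaches Basket5 but not Basket6.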

88 Stacking in Pandas with a non-hierarchical index

import pandas as pd

df = pd.DataFrame([[10, 30, 40], [], [15, 8, 12],
                   [15, 14, 1, 8], [7, 8], [5, 4, 1]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

print("\n------ DataFrame-----\n")
print(df)

print("\n------ Stacking DataFrame -----\n")
print(df.stack(level=-1))

Output:

------ DataFrame-----

         Apple  Orange  Banana  Pear
Basket1   10.0    30.0    40.0   NaN
Basket2    NaN     NaN     NaN   NaN
Basket3   15.0     8.0    12.0   NaN
Basket4   15.0    14.0     1.0   8.0
Basket5    7.0     8.0     NaN   NaN
Basket6    5.0     4.0     1.0   NaN

------ Stacking DataFrame -----

Basket1  Apple     10.0
         Orange    30.0
         Banana    40.0
Basket3  Apple     15.0
         Orange     8.0
         Banana    12.0
Basket4  Apple     15.0
         Orange    14.0
         Banana     1.0
         Pear       8.0
Basket5  Apple      7.0
         Orange     8.0
Basket6  Apple      5.0
         Orange     4.0
         Banana     1.0
dtype: float64
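
Stacking moves the column labels into an inner index level, so the result is a Series with a two-level index. The sketch below shows the round trip back to a table with `unstack`. It calls `dropna()` explicitly because whether `stack` itself drops NaN cells depends on the pandas version; the explicit `np.nan` cells and the names `stacked` and `restored` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([[10, 30, 40, nan], [nan, nan, nan, nan],
                   [15, 8, 12, nan], [15, 14, 1, 8],
                   [7, 8, nan, nan], [5, 4, 1, nan]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

# Columns become the inner index level; drop the missing cells explicitly
stacked = df.stack().dropna()
print(stacked.index.nlevels)

# unstack pivots the inner level back into columns
restored = stacked.unstack()
print(restored)
```

Basket2 contained only NaN, so it has no entries in `stacked` and does not reappear in `restored`.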

89 Unstacking in Pandas with a hierarchical index

import pandas as pd

df = pd.DataFrame([[10, 30, 40], [], [15, 8, 12],
                   [15, 14, 1, 8], [7, 8], [5, 4, 1]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])

print("\n------ DataFrame-----\n")
print(df)

print("\n------ Unstacking DataFrame -----\n")
print(df.unstack(level=-1))

Output:

------ DataFrame-----

         Apple  Orange  Banana  Pear
Basket1   10.0    30.0    40.0   NaN
Basket2    NaN     NaN     NaN   NaN
Basket3   15.0     8.0    12.0   NaN
Basket4   15.0    14.0     1.0   8.0
Basket5    7.0     8.0     NaN   NaN
Basket6    5.0     4.0     1.0   NaN

------ Unstacking DataFrame -----

Apple   Basket1    10.0
        Basket2     NaN
        Basket3    15.0
        Basket4    15.0
        Basket5     7.0
        Basket6     5.0
Orange  Basket1    30.0
        Basket2     NaN
        Basket3     8.0
        Basket4    14.0
        Basket5     8.0
        Basket6     4.0
Banana  Basket1    40.0
        Basket2     NaN
        Basket3    12.0
        Basket4     1.0
        Basket5     NaN
        Basket6     1.0
Pear    Basket1     NaN
        Basket2     NaN
        Basket3     NaN
        Basket4     8.0
        Basket5     NaN
        Basket6     NaN
dtype: float64
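
The DataFrame above has a single-level index, which is why `unstack` returns a Series indexed by (column, row) pairs. For comparison, here is a small sketch with a genuinely hierarchical index, where `unstack` pivots one index level into columns. The data, the level names `basket` and `fruit`, and the variable names are made up for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['Basket1', 'Basket2'], ['Apple', 'Orange']],
    names=['basket', 'fruit'])
s = pd.Series([10, 30, 15, 8], index=index)

# level=-1 (the default) moves the innermost level, fruit, into columns
wide = s.unstack(level=-1)
print(wide)

# level=0 moves the outer level, basket, into columns instead
by_basket = s.unstack(level=0)
print(by_basket)
```

The `level` argument also accepts a level name, so `s.unstack('fruit')` is equivalent to the first call.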

90 Getting table data from an HTML page with Pandas

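import pandas as pd

# read_html returns a list of DataFrames, one per <table> found at the URL
tables = pd.read_html("url")
df = tables[0]  # first table, assuming the page contains at least one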

杰克成

https://jackhcc.github.io/posts/blog-python05.html

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit the source, 杰克成, when reposting!

