Parsing Dates
This means converting dates stored as strings (object dtype) into pandas or Python date types.
The key change is in the dtype: object / str -> datetime64[ns]
# modules we'll use
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
# read in our data
landslides = pd.read_csv("../input/landslide-events/catalog.csv")
# set seed for reproducibility
np.random.seed(0)
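# Direct conversion: this only works if every value in the column already uses the same format, "%m/%d/%y"; a single value such as 2001-01-01 makes it fail
# so make sure the format is consistent before converting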
# create a new column, date_parsed, with the parsed dates
landslides['date_parsed'] = pd.to_datetime(landslides['date'], format="%m/%d/%y")
landslides['date_parsed'].head()
0 2007-03-02
1 2007-03-22
2 2007-04-06
3 2007-04-14
4 2007-04-15
Name: date_parsed, dtype: datetime64[ns]
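The format requirement noted above can be seen on a small example. This is a minimal sketch, assuming a made-up three-value series (not the landslides data): strict parsing raises on the mismatched value, while `errors="coerce"` turns it into NaT so the offending rows can be inspected.

```python
import pandas as pd

# Made-up sample: the last value does not follow "%m/%d/%y"
dates = pd.Series(["3/2/07", "3/22/07", "2001-01-01"])

# Strict parsing fails as soon as one value does not match the format
try:
    pd.to_datetime(dates, format="%m/%d/%y")
    strict_failed = False
except ValueError:
    strict_failed = True

# errors="coerce" replaces unparseable values with NaT instead of raising
parsed = pd.to_datetime(dates, format="%m/%d/%y", errors="coerce")

# Rows that still need cleaning before a strict conversion would succeed
bad_rows = dates[parsed.isna()]
print(bad_rows)
```

Coercing is useful for locating bad values, but it silently drops information, so it is usually better to fix the listed rows than to leave them as NaT.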
Getting the day of the month from a date
# get the day of the month from the date_parsed column
day_of_month_landslides = landslides['date_parsed'].dt.day
day_of_month_landslides.head()
0 2.0
1 22.0
2 6.0
3 14.0
4 15.0
Name: date_parsed, dtype: float64
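The days above print as 2.0, 22.0 and so on rather than as integers. A likely reason is that the column contains missing dates: NaT has no day, so pandas returns NaN for it and falls back to a float dtype. A minimal sketch on a made-up series (an assumption, not the landslides data):

```python
import pandas as pd

# Made-up sample: the last value is missing, so it parses to NaT
raw = pd.Series(["3/2/07", "3/22/07", None])
parsed = pd.to_datetime(raw, format="%m/%d/%y")

# NaT becomes NaN here, which forces a float result
days = parsed.dt.day
missing_count = int(days.isna().sum())

# After dropping the missing values the days can be cast back to integers
clean_days = days.dropna().astype(int)
print(clean_days.tolist())
```

A quick sanity check on the result is that every value lies between 1 and 31; anything outside that range would suggest the day and month fields were swapped during parsing.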
Plotting to check
# remove na's
day_of_month_landslides = day_of_month_landslides.dropna()
# plot the day of the month (histplot replaces the deprecated distplot)
sns.histplot(day_of_month_landslides, kde=False, bins=31)
Character Encodings
Avoid UnicodeDecodeErrors when loading CSV files.
# bytes.decode turns bytes into str; str.encode with a chosen encoding (utf-8, etc.) turns str into bytes
# start with a string
before = "This is the euro symbol: €"
# check to see what datatype it is
type(before)
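The encode/decode round trip described in the comment can be sketched as follows. The variable names `after`, `restored` and `lossy` are illustrative only.

```python
# start with the same string as above
before = "This is the euro symbol: €"

# str -> bytes: the euro sign takes three bytes in UTF-8
after = before.encode("utf-8")

# bytes -> str: decoding with the same encoding restores the original text
restored = after.decode("utf-8")

# Encoding to ASCII cannot represent "€"; errors="replace" substitutes "?"
# and the original character is lost for good
lossy = before.encode("ascii", errors="replace").decode("ascii")

print(type(after), restored == before, lossy)
```

The lossy case is the reason to keep text in UTF-8 where possible: once a character has been replaced during encoding, it cannot be recovered by decoding.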
# read_csv assumes UTF-8 by default
# police_killings = pd.read_csv("../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv")
# detect the encoding from the first 10,000 bytes of the file
import charset_normalizer
with open("../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv", 'rb') as rowdata:
    result = charset_normalizer.detect(rowdata.read(10000))
print(result)
police_killings = pd.read_csv("../input/fatal-police-shootings-in-the-us/PoliceKillingsUS.csv", encoding='Windows-1252')
# Saving your files with UTF-8 encoding
police_killings.to_csv("my_file_utf8.csv")
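The whole workflow (a failed UTF-8 read, a retry with another encoding, then saving as UTF-8) can be tried without the Kaggle dataset. This is a sketch under stated assumptions: `read_csv_with_fallback` is a hypothetical helper, not a pandas function, and the sample file is made up and written to a temporary directory.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "Windows-1252")):
    """Hypothetical helper: try each encoding in order, return (frame, encoding)."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err
    raise last_error


# Write a made-up CSV in Windows-1252; "€" is the single byte 0x80 there,
# which is not valid UTF-8
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_cp1252.csv")
with open(sample_path, "w", encoding="Windows-1252", newline="") as f:
    f.write("name,price\ncoffee,€3\n")

# The UTF-8 attempt fails, so the helper falls back to Windows-1252
frame, used_encoding = read_csv_with_fallback(sample_path)

# to_csv writes UTF-8 by default, so the saved copy reads back without an encoding argument
utf8_path = os.path.join(tmp_dir, "sample_utf8.csv")
frame.to_csv(utf8_path, index=False)
round_trip = pd.read_csv(utf8_path)
print(used_encoding, round_trip.loc[0, "price"])
```

A fallback list is a convenience, not a detector: a single-byte encoding such as Windows-1252 accepts almost any byte sequence, so a wrong guess can produce garbled text without raising an error. Checking a few rows by eye after loading is still worthwhile.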