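Data processing and machine learning with Python: fetching data online and locally, parsing, preprocessing, training, prediction, cross-validation, and visualization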

Shared by LinuxBoy on 2019-03-27 08:03:04


http://blog.csdn.net/pipisorry/article/details/44833603

In [1]:

%matplotlib inline

Scraping data

A simple HTTP request

In [2]:

import requests

print(requests.get("http://example.com").text)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>
    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

Communicating with APIs

In [3]:

response = requests.get("https://www.googleapis.com/books/v1/volumes", params={"q":"machine learning"})
raw_data = response.json()
titles = [item['volumeInfo']['title'] for item in raw_data['items']]
titles

Out[3]:

[u'C4.5',
 u'Machine Learning',
 u'Machine Learning',
 u'Machine Learning',
 u'A First Course in Machine Learning',
 u'Machine Learning',
 u'Elements of Machine Learning',
 u'Introduction to Machine Learning',
 u'Pattern Recognition and Machine Learning',
 u'Machine Learning and Its Applications']
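
The list comprehension above assumes that the request succeeded and that every item carries a `volumeInfo` entry with a `title`. A slightly more defensive variant could look like the sketch below. The helper name `extract_titles` and the sample payload are illustrative assumptions, not part of the original notebook or of the Google Books API.

```python
import json


def extract_titles(raw_data):
    """Return the titles found in a Books-style response, skipping incomplete items."""
    titles = []
    for item in raw_data.get("items", []):
        title = item.get("volumeInfo", {}).get("title")
        if title is not None:
            titles.append(title)
    return titles


# Hypothetical payload shaped like the response used above (made up for illustration).
sample = json.loads("""
{"items": [{"volumeInfo": {"title": "Machine Learning"}},
           {"volumeInfo": {}},
           {"volumeInfo": {"title": "C4.5"}}]}
""")
print(extract_titles(sample))
```

Using `dict.get` with defaults means a response without `items`, or an item without a title, yields a shorter list instead of a `KeyError`.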

In [4]:

import lxml.html

page = lxml.html.parse("http://www.blocket.se/stockholm?q=apple")
# ^ This is probably illegal. Blocket, please don't sue me!
items_data = []
for el in page.getroot().find_class("item_row"):
    links = el.find_class("item_link")
    images = el.find_class("item_image")
    prices = el.find_class("list_price")
    if links and images and prices and prices[0].text:
        items_data.append({"name": links[0].text,
                           "image": images[0].attrib['src'],
                           "price": int(prices[0].text.split(":")[0].replace(" ", ""))})
items_data

Out[4]:

[{'image': 'http://cdn.blocket.com/static/2/lithumbs/98/9864322297.jpg',
  'name': 'Macbook laddare 60w',
  'price': 250},
 {'image': 'http://cdn.blocket.com/static/2/lithumbs/43/4338840758.jpg',
  'name': u'Apple iPhone 5S 16GB - Ol\xe5st - 12 m\xe5n garanti',
  'price': 3999},
 {'image': 'http://cdn.blocket.com/static/0/lithumbs/98/9838946223.jpg',
  'name': u'Ol\xe5st iPhone 5 64 GB med n\xe4stan nytt batteri',
  'price': 3000},
 {'image': 'http://cdn.blocket.com/static/1/lithumbs/79/7906971367.jpg',
  'name': u'Apple iPhone 5C 16GB - Ol\xe5st - 12 m\xe5n garanti',
  'price': 3099},
 {'image': 'http://cdn.blocket.com/static/0/lithumbs/79/7926951568.jpg',
  'name': u'HP Z620 Workstation - 1 \xe5rs garanti',
  'price': 12494},
 {'image': 'http://cdn.blocket.com/static/0/lithumbs/97/9798755036.jpg',
  'name': 'HP ProBook 6450b - Andrasortering',
  'price': 1699},
 {'image': 'http://cdn.blocket.com/static/1/lithumbs/98/9898462036.jpg',
  'name': 'Macbook pro 13 retina, 256 gb ssd',
  'price': 12000}]
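
The price expression in the scraping loop keeps the text before the first colon and removes the spaces used as thousands separators. Pulled out into a helper, it might look like the sketch below. The function name `parse_price` and the sample labels such as "3 999:-" are assumptions inferred from the parsing code, not values confirmed by the page.

```python
def parse_price(text):
    """Turn a price label such as '3 999:-' into the integer 3999."""
    # Keep what precedes the first colon, then drop the space separators.
    digits = text.split(":")[0].replace(" ", "")
    return int(digits)


print(parse_price("3 999:-"))
print(parse_price("250:-"))
```

If any other character survives the cleanup, `int()` raises a `ValueError`, so a real scraper would probably want to catch that and skip the row.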

Reading local data

In [5]:

import pandas

df = pandas.read_csv('sample.csv')

In [6]:

# Display the DataFrame
df

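Out[6]:

	Year	Make	Model	Description	Price
0	1997	Ford	E350	ac, abs, moon	3000
1	1999	Chevy	Venture "Extended Edition"	NaN	4900
2	1999	Chevy	Venture "Extended Edition, Very Large"	NaN	5000
3	1996	Jeep	Grand Cherokee	MUST SELL!\nair, moon roof, loaded	NaN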

In [7]:

# DataFrame's columns
df.columns

Out[7]:

Index([u'Year', u'Make', u'Model', u'Description', u'Price'], dtype='object')

In [8]:

# Values of a given column
df.Model

Out[8]:

0                                      E350
1                Venture "Extended Edition"
2    Venture "Extended Edition, Very Large"
3                            Grand Cherokee
Name: Model, dtype: object

Analyzing the dataframe

In [9]:

# Any missing values?
df['Price']

Out[9]:

0    3000
1    4900
2    5000
3     NaN
Name: Price, dtype: float64

In [10]:

df['Description']

Out[10]:

0                         ac, abs, moon
1                                   NaN
2                                   NaN
3    MUST SELL!\nair, moon roof, loaded
Name: Description, dtype: object

In [11]:

# Replace missing descriptions with a placeholder text
df['Description'] = df['Description'].fillna("No description is available.")
# Fill missing prices by a linear interpolation
df['Price'] = df['Price'].interpolate()

df

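Out[11]:

	Year	Make	Model	Description	Price
0	1997	Ford	E350	ac, abs, moon	3000
1	1999	Chevy	Venture "Extended Edition"	No description is available.	4900
2	1999	Chevy	Venture "Extended Edition, Very Large"	No description is available.	5000
3	1996	Jeep	Grand Cherokee	MUST SELL!\nair, moon roof, loaded	5000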
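
In the result above, the missing price in the last row ends up equal to the last known value (5000), because there is no later point to interpolate towards. The sketch below mimics that behaviour in plain Python to make the arithmetic visible. It is a simplified illustration, not pandas' implementation, and the function name `interpolate_linear` is made up for this example.

```python
def interpolate_linear(values):
    """Fill None gaps linearly; trailing gaps repeat the last known value."""
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    # Interior gaps: walk in equal steps between the two surrounding known values.
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    # Trailing gaps: nothing to interpolate towards, so carry the last value forward.
    if known:
        for i in range(known[-1] + 1, len(result)):
            result[i] = result[known[-1]]
    return result


print(interpolate_linear([3000.0, 4900.0, 5000.0, None]))
print(interpolate_linear([1.0, None, None, 4.0]))
```

Leading gaps are left as `None` in this sketch, since there is no earlier value to start from.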

Exploring data

In [12]:

import matplotlib.pyplot as plt

df = pandas.read_csv('sample2.csv')

df

Out[12]:

	Office	Year	Sales
0	Stockholm	2004	200
1	Stockholm	2005	250
2	Stockholm	2006	255
3	Stockholm	2007	260
4	Stockholm	2008	264
5	Stockholm	2009	274
6	Stockholm	2010	330
7	Stockholm	2011	364
8	New York	2004	432
9	New York	2005	469
10	New York	2006	480
11	New York	2007	438
12	New York	2008	330
13	New York	2009	280
14	New York	2010	299
15	New York	2011	230

In [13]:

# This table has 3 columns: Office, Year, Sales
print(df.columns)

# It's really easy to query data with Pandas:
print(df[(df['Office'] == 'Stockholm') & (df['Sales'] > 260)])

# It's also easy to do aggregations...
aggregated_sales = df.groupby('Year').sum()
print(aggregated_sales)

Index([u'Office', u'Year', u'Sales'], dtype='object')
      Office  Year  Sales
4  Stockholm  2008    264
5  Stockholm  2009    274
6  Stockholm  2010    330
7  Stockholm  2011    364
      Sales
Year       
2004    632
2005    719
2006    735
2007    698
2008    594
2009    554
2010    629
2011    594
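
To see what the aggregation computes, the same per-year totals can be accumulated by hand. This is only a plain-Python sketch of the idea behind `groupby('Year').sum()`, using the 2004 and 2005 rows from the table above; the variable names are illustrative.

```python
from collections import defaultdict

# (Office, Year, Sales) rows for 2004 and 2005, copied from the table above.
rows = [("Stockholm", 2004, 200), ("Stockholm", 2005, 250),
        ("New York", 2004, 432), ("New York", 2005, 469)]

totals = defaultdict(int)
for office, year, sales in rows:
    # Group by year and add up the sales of every office in that year.
    totals[year] += sales

print(dict(totals))
```

The two totals, 632 and 719, match the first two rows of the aggregated output.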

In [14]:

# ... and generate plots
%matplotlib inline
aggregated_sales.plot(kind='bar')

Out[14]:

<matplotlib.axes._subplots.AxesSubplot at 0x1089dcc10>

Machine learning

Feature extraction

In [15]:

from sklearn import feature_extraction

Extracting features from text

In [16]:

corpus = ['All the cats really are great.',
          'I like the cats but I still prefer the dogs.',
          'Dogs are the best.',
          'I like all the trains',
          ]

tfidf = feature_extraction.text.TfidfVectorizer()

print(tfidf.fit_transform(corpus).toarray())
print(tfidf.get_feature_names())

[[ 0.38761905  0.38761905  0.          0.          0.38761905  0.
   0.49164562  0.          0.          0.49164562  0.          0.25656108
   0.        ]
 [ 0.          0.          0.          0.4098205   0.32310719  0.32310719
   0.          0.32310719  0.4098205   0.          0.4098205   0.42772268
   0.        ]
 [ 0.          0.4970962   0.6305035   0.          0.          0.4970962
   0.          0.          0.          0.          0.          0.32902288
   0.        ]
 [ 0.4970962   0.          0.          0.          0.          0.          0.
   0.4970962   0.          0.          0.          0.32902288  0.6305035 ]]
[u'all', u'are', u'best', u'but', u'cats', u'dogs', u'great', u'like', u'prefer', u'really', u'still', u'the', u'trains']
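
The numbers above can be reproduced by hand, which makes it clearer what each weight means. The sketch below assumes the settings that, to my knowledge, are the `TfidfVectorizer` defaults: lowercase tokens of at least two word characters, raw term counts, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names are made up for this example.

```python
import math
import re

corpus = ['All the cats really are great.',
          'I like the cats but I still prefer the dogs.',
          'Dogs are the best.',
          'I like all the trains']


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters (so "I" is dropped).
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [tokenize(text) for text in corpus]
vocab = sorted(set(token for doc in docs for token in doc))
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + sum(t in doc for doc in docs))) + 1
       for t in vocab}


def tfidf_row(doc):
    # Term count times idf, then scale the row to unit Euclidean length.
    weights = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


matrix = [tfidf_row(doc) for doc in docs]
print(vocab)
print([round(w, 4) for w in matrix[2]])
```

A word such as "the", which occurs in every document, gets the smallest idf (exactly 1), which is why its weight is lower than that of rarer words like "best" in the third row.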

Dict vectorizer

In [17]:

import json


data = [json.loads("""{"weight": 194.0, "sex": "female", "student": true}"""),
        {"weight": 60., "sex": 'female', "student": True},
        {"weight": 80.1, "sex": 'male', "student": False},
        {"weight": 65.3, "sex": 'male', "student": True},
        {"weight": 58.5, "sex": 'female', "student": False}]

vectorizer = feature_extraction.DictVectorizer(sparse=False)

vectors = vectorizer.fit_transform(data)
print(vectors)
print(vectorizer.get_feature_names())

[[   1.     0.     1.   194. ]
 [   1.     0.     1.    60. ]
 [   0.     1.     0.    80.1]
 [   0.     1.     1.    65.3]
 [   1.     0.     0.    58.5]]
[u'sex=female', 'sex=male', u'student', u'weight']
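
The output shows the encoding rule: a string value becomes its own `key=value` column holding 0 or 1, while numbers and booleans keep a single column with their value. A minimal plain-Python sketch of that rule follows. It is a simplified illustration rather than the library's implementation, and the function name `vectorize` is an assumption.

```python
def vectorize(records):
    """One-hot encode string values and pass numbers/booleans through as floats."""
    names = set()
    for rec in records:
        for key, value in rec.items():
            # String values get one column per distinct "key=value" pair.
            names.add("%s=%s" % (key, value) if isinstance(value, str) else key)
    names = sorted(names)
    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for key, value in rec.items():
            if isinstance(value, str):
                row[names.index("%s=%s" % (key, value))] = 1.0
            else:
                row[names.index(key)] = float(value)
        rows.append(row)
    return names, rows


data = [{"weight": 60.0, "sex": "female", "student": True},
        {"weight": 80.1, "sex": "male", "student": False}]
names, rows = vectorize(data)
print(names)
print(rows)
```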

In [18]:

class A:
    def __init__(self, x):
        self.x = x
        self.blabla = 'test'
        
a = A(20)
a.__dict__

Out[18]:

{'blabla': 'test', 'x': 20}

Pre-processing

Scaling

In [19]:

from sklearn import preprocessing

data = [[10., 2345., 0., 2.],
        [3., -3490., 0.1, 1.99],
        [13., 3903., -0.2, 2.11]]

print(preprocessing.normalize(data))

[[  4.26435200e-03   9.99990544e-01   0.00000000e+00   8.52870400e-04]
 [  8.59598396e-04  -9.99999468e-01   2.86532799e-05   5.70200269e-04]
 [  3.33075223e-03   9.99994306e-01  -5.12423421e-05   5.40606709e-04]]
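
Note that `normalize` rescales each sample (row) to unit Euclidean length; it does not standardize each feature (column). That is why the second column, which has by far the largest values, dominates every row of the output. The same computation written out by hand, as a sketch with an illustrative function name:

```python
import math


def l2_normalize_rows(rows):
    """Divide every row by its Euclidean length so that each row has norm 1."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        # An all-zero row has no direction; leave it as zeros.
        normalized.append([v / norm if norm else 0.0 for v in row])
    return normalized


data = [[10., 2345., 0., 2.],
        [3., -3490., 0.1, 1.99],
        [13., 3903., -0.2, 2.11]]

for row in l2_normalize_rows(data):
    print(row)
```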

Dimensionality reduction

In [20]:

from sklearn import decomposition

data = [[0.3, 0.2, 0.4,  0.32],
        [0.3, 0.5, 1.0, 0.19],
        [0.3, -0.4, -0.8, 0.22]]

pca = decomposition.PCA()
print(pca.fit_transform(data))
print(pca.explained_variance_ratio_)

[[ -2.23442295e-01  -7.71447891e-02   8.06250485e-17]
 [ -8.94539226e-01   5.14200202e-02   8.06250485e-17]
 [  1.11798152e+00   2.57247689e-02   8.06250485e-17]]
[  9.95611223e-01   4.38877684e-03   9.24548594e-33]
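
The explained variance ratios can be checked with a few lines of NumPy: center the data, take a singular value decomposition, and compare the squared singular values. This is a sketch of the standard SVD formulation of PCA, not a claim about scikit-learn's internals, and the signs of the projected coordinates may differ from the output above.

```python
import numpy as np

data = np.array([[0.3, 0.2, 0.4, 0.32],
                 [0.3, 0.5, 1.0, 0.19],
                 [0.3, -0.4, -0.8, 0.22]])

# PCA works on centered data: subtract the mean of every column.
centered = data - data.mean(axis=0)

# Rows of Vt are the principal directions, s the singular values.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

scores = U * s                    # samples projected onto the principal directions
ratio = s ** 2 / np.sum(s ** 2)   # share of the total variance per component

print(scores)
print(ratio)
```

Almost all of the variance (about 99.6%) lies along the first component, which reflects the third column being exactly twice the second.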

Machine learning models

Classification (SVM)

In [21]:

from sklearn import datasets
from sklearn import svm

In [22]:

iris = datasets.load_iris()

X = iris.data[:, :2]
y = iris.target

# Training the model
clf = svm.SVC(kernel='rbf')
clf.fit(X, y)

# Doing predictions
new_data = [[4.85, 3.1], [5.61, 3.02]]
print(clf.predict(new_data))

[0 1]

Regression (linear regression)

In [23]:

import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

def f(x):
    return x + np.random.random() * 3.

X = np.arange(0, 5, 0.5)
X = X.reshape((len(X), 1))
y = [f(x) for x in X]  # a list, so it also works where map() returns an iterator

clf = linear_model.LinearRegression()
clf.fit(X, y)

Out[23]:

LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [24]:

new_X = np.arange(0.2, 5.2, 0.3)
new_X = new_X.reshape((len(new_X), 1))
new_y = clf.predict(new_X)

plt.scatter(X, y, color='g', label='Training data')

plt.plot(new_X, new_y, '.-', label='Predicted')
plt.legend()

Out[24]:

<matplotlib.legend.Legend at 0x10a38f290>
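
For a single input variable, the fitted line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the two means. The sketch below works through it on a small made-up data set; `fit_line` is an illustrative name, not a scikit-learn function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```

For the noisy data generated by `f` above, one would expect a slope near 1 and an intercept near 1.5, since the added noise is uniform on [0, 3) and therefore averages 1.5.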

Clustering (DBSCAN)

In [25]:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=200, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

In [26]:

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
db.labels_

Out[26]:

array([-1,  0,  2,  1,  1,  2, -1,  0,  0, -1, -1,  0,  0,  2, -1, -1,  2,
        0,  1,  0,  0,  2, -1, -1,  0, -1, -1,  1, -1,  2,  1, -1,  1, -1,
        1,  0,  1,  0,  0,  2,  2, -1,  2,  1,  0,  1,  0,  1,  2,  1,  1,
        2, -1,  2,  1, -1,  0,  0, -1,  1,  0,  0,  1,  2,  0, -1,  2,  1,
       -1,  0,  0,  1,  1,  0, -1,  2, -1,  1,  2,  2,  0,  2,  1,  0, -1,
        0,  2,  1, -1,  2,  0, -1,  1,  1,  2,  0,  2,  1,  2,  1,  2,  2,
       -1,  2,  0,  1,  0, -1,  2,  0,  1,  0,  0, -1,  1,  0,  2,  2,  0,
        1,  0, -1,  1,  0,  1,  1,  1, -1,  1,  2,  1, -1, -1,  0,  0,  2,
        1,  1, -1,  0,  1,  2,  1,  0,  0, -1,  2,  1,  1,  1,  2,  2,  0,
        0,  2, -1,  1,  0,  1,  1,  2,  1,  2,  1,  0, -1,  2,  0,  2,  1,
        2,  1,  0,  1,  2,  0,  1, -1,  2,  0,  0,  1,  1,  1, -1,  0,  1,
        0,  1,  2, -1, -1,  2,  1,  0,  0,  2, -1,  2,  0])

In [27]:

import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], c=db.labels_)

Out[27]:

<matplotlib.collections.PathCollection at 0x10a6bc110>
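
In the label array, non-negative integers identify clusters and -1 marks points that DBSCAN treats as noise. A quick way to summarize such labels is to count them, as in this sketch, which uses only the first ten labels from the output above to stay short:

```python
from collections import Counter

# First ten labels from the DBSCAN output above.
labels = [-1, 0, 2, 1, 1, 2, -1, 0, 0, -1]

counts = Counter(labels)
n_noise = counts.pop(-1, 0)   # points labelled -1 belong to no cluster
n_clusters = len(counts)      # remaining distinct labels are the clusters

print(n_clusters, n_noise, dict(counts))
```

With the real result, `db.labels_` can be passed to `Counter` in the same way.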

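Cross-validation

In [28]:

# Note: sklearn.cross_validation was removed in scikit-learn 0.20 (its functions now
# live in sklearn.model_selection), and the 'mean_squared_error' scorer was renamed
# 'neg_mean_squared_error'.
from sklearn import svm, cross_validation, datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

model = svm.SVC()
print(cross_validation.cross_val_score(model, X, y, scoring='precision'))
print(cross_validation.cross_val_score(model, X, y, scoring='mean_squared_error'))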

[ 0.98148148  0.96491228  0.98039216]
[-0.01960784 -0.03921569 -0.02083333]
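
Each call returns three scores, which suggests that the data was split into three folds: the model is trained on two of them and scored on the third, once per fold. The sketch below shows the index bookkeeping behind such a split. It is a deliberately simplified, unshuffled and unstratified version (scikit-learn stratifies the folds for classifiers), and `k_fold_indices` is an illustrative name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in k_fold_indices(10, 3):
    print(len(train), test)
```

Every sample lands in exactly one test fold, so each score is computed on data the model did not see during that round of training.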

Source: http://blog.csdn.net/pipisorry/article/details/44833603

Reference: Data-processing and machine learning with Python

http://nbviewer.ipython.org/github/halflings/python-data-workshop/blob/master/data-workshop-notebook.ipynb
