Practical Machine Learning


http://blog.csdn.net/pipisorry/article/details/46490177

Study notes for the Practical Machine Learning course

Practical Machine Learning

1.1 Prediction motivation

Course overview: About this course

This course covers the basic ideas behind machine learning/prediction:
·Study design - training vs. test sets
·Conceptual issues - out of sample error, ROC curves
·Practical implementation - the caret package

What this course depends on:
·The Data Scientist's Toolbox
·R Programming

What would be useful:
·Exploratory analysis
·Reporting Data and Reproducible Research
·Regression models


Where machine learning is used

Local governments -> pension payments
Google -> whether you will click on an ad
Amazon -> what movies you will watch
Insurance companies -> what your risk of death is
Johns Hopkins -> who will succeed in their programs


Recommended books and resources

The elements of statistical learning

Machine learning (more advanced material)

List of machine learning resources on Quora
List of machine learning resources from Science
Advanced notes from MIT open courseware
Advanced notes from CMU
Kaggle machine learning competitions



1.2 What is prediction

The central dogma of prediction

Example: predict for these dots whether they're red or blue (figure not reproduced here):


Choosing the right dataset and knowing what the specific question is are, again, paramount.


Potential problems

An example: the Google Flu Trends algorithm didn't account for the fact that the search terms people used would change over time. People might use different terms when searching, and that affected the algorithm's performance. In addition, the way those terms were actually used inside the algorithm wasn't well understood, so when the function of a particular search term changed, it caused problems.

Components of a predictor

question -> input data -> features -> algorithm -> parameters -> evaluation

Note: question: What are you trying to predict and what are you trying to predict it with?


An example of prediction: SPAM detection

question -> input data -> features -> algorithm -> parameters -> evaluation

Start with a general question

Can I automatically detect which emails are SPAM and which are not?

Make it concrete

Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?

Note: try to make the question as concrete as possible.

question -> input data -> features -> algorithm -> parameters -> evaluation

rss.acs.unt.edu/Rdoc/library/kernlab/html/spam.html

question -> input data -> features -> algorithm -> parameters -> evaluation

library(kernlab)   # kernlab ships the 'spam' dataset (4601 emails)
data(spam)         # word/character frequencies plus a spam vs. nonspam label
head(spam)         # look at the first few rows of the features
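
To see why the frequency of the word "your" is a useful feature, it helps to compare its distribution in spam and non-spam messages. A minimal sketch in R, reusing the data loaded above (the column names your and type come from the kernlab spam dataset; 0.5 is the candidate cutoff used below):

# Compare the distribution of the 'your' frequency in spam vs. non-spam emails
plot(density(spam$your[spam$type == "nonspam"]),
     col = "blue", main = "", xlab = "Frequency of 'your'")
lines(density(spam$your[spam$type == "spam"]), col = "red")
abline(v = 0.5, col = "black")   # a candidate cutoff C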
question -> input data -> features -> algorithm -> parameters -> evaluation

Our simple algorithm

  • Find a value C.
  • If the frequency of 'your' is greater than C, predict "spam"

Note: a cutoff of about 0.5 works well here: if the frequency of 'your' is above 0.5 we say the email is SPAM, and if it's below 0.5 we say it's HAM.
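
A minimal sketch of applying that cutoff and evaluating it in R, reusing the spam data loaded above (this measures in-sample accuracy only; a proper evaluation would use a separate test set):

prediction <- ifelse(spam$your > 0.5, "spam", "nonspam")   # apply the cutoff C = 0.5
table(prediction, spam$type) / length(spam$type)           # proportion of emails in each cell
mean(prediction == spam$type)                              # overall in-sample accuracy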


1.3 Relative importance of steps

{about the tradeoffs and the different components of building a machine learning algorithm}

Relative order of importance: question > data > features > algorithms

...

Creating features is an important component: if you don't compress the data in the right way, you might lose the relevant and valuable information.

And finally, in my experience the algorithm is often the least important part of building a machine learning system. It can be very important depending on the modality of the data you're using; for example, image data and voice data can require certain kinds of prediction algorithms that might not necessarily be needed for other types of data. But in general, the algorithm matters less than the question, the data, and the features.

An important point

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.--John Tukey 

In other words, an important part of knowing how to do prediction is knowing when to give up: when the data you have is simply not sufficient to answer the question you're trying to answer.

Garbage in = Garbage out

question -> input data -> features -> algorithm -> parameters -> evaluation

The key point to remember when choosing input data is garbage in, garbage out: if you collected bad data, or data that isn't very useful for making predictions, then no matter how good your machine learning algorithm is, you'll often get very bad results out.

Features matter!

question -> input data -> features -> algorithm -> parameters -> evaluation

Properties of good features

  • Lead to data compression
  • Retain relevant information
  • Are created based on expert application knowledge

Note: there's a debate in the community about whether it's better to create features automatically or to use expert domain knowledge. In general, expert domain knowledge can help quite a bit in many applications and should be consulted when building features for a machine learning algorithm.
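
As a toy illustration of "compress the data but retain the relevant information": a whole raw email can be reduced to a single hand-crafted number, the relative frequency of the word "your", which is the feature the spam example relies on. A sketch in R (the email text here is made up for illustration):

# Hypothetical illustration: turn one raw email into a single hand-crafted feature
email <- "Please verify your account to claim your free prize"
words <- strsplit(tolower(email), "[^a-z']+")[[1]]          # split into lowercase words
freq_your <- 100 * sum(words == "your") / length(words)     # % of words that are "your"
freq_your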

Common mistakes

  • Trying to automate feature selection
  • Not paying attention to data-specific quirks
  • Throwing away information unnecessarily

Note: a common mistake is trying to automate feature selection in a way that doesn't let you understand how those features are actually being used to make good predictions. Black-box predictions can be very useful and very accurate, but they can also change on a dime if we're not paying attention to how those features actually predict the outcome.
from: http://blog.csdn.net/pipisorry/article/details/46490177


Related content