Practical Machine Learning
http://blog.csdn.net/pipisorry/article/details/46490177
Notes on the Practical Machine Learning course
1.1 Prediction motivation
About this course
This course covers the basic ideas behind machine learning/prediction:
·Study design — training vs. test sets
·Conceptual issues — out-of-sample error, ROC curves
·Practical implementation — the caret package

What this course depends on:
·The Data Scientist's Toolbox
·R Programming

What would be useful:
·Exploratory Data Analysis
·Reporting Data and Reproducible Research
·Regression Models
Who uses machine learning, and for what
·Local governments -> pension payments
·Google -> whether you will click on an ad
·Amazon -> what movies you will watch
·Insurance companies -> what your risk of death is
·Johns Hopkins -> who will succeed in their programs
Recommended reading and resources
The elements of statistical learning
Machine learning (more advanced material)
List of machine learning resources on Quora
List of machine learning resources from Science
Advanced notes from MIT open courseware
Advanced notes from CMU
Kaggle machine learning competitions
1.2 What is prediction
The central dogma of prediction
Example task: predict for these dots whether they're red or blue. Choosing the right dataset and knowing what the specific question is are, again, paramount.
Potential problems
An example: the Google Flu Trends algorithm didn't account for the fact that the search terms people used would change over time. People might use different terms when searching, and that affected the algorithm's performance. Moreover, the way those terms were actually being used in the algorithm wasn't well understood, so when the function of a particular search term changed, it caused problems.
Components of a predictor
question -> input data -> features -> algorithm -> parameters -> evaluation
Note: question: What are you trying to predict and what are you trying to predict it with?
An example of prediction: SPAM detection
question -> input data -> features -> algorithm -> parameters -> evaluation
Start with a general question
Can I automatically detect emails that are SPAM and those that are not?
Make it concrete
Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?
Note: try to make the question as concrete as possible.
question -> input data -> features -> algorithm -> parameters -> evaluation
rss.acs.unt.edu/Rdoc/library/kernlab/html/spam.html
question -> input data -> features -> algorithm -> parameters -> evaluation
library(kernlab)  # provides the spam dataset: word/character frequencies from 4601 emails
data(spam)
head(spam)        # each column is a word/character frequency; 'type' is spam vs. nonspam
question -> input data -> features -> algorithm -> parameters -> evaluation
Our simple algorithm:
- Find a value C.
- If the frequency of 'your' > C, predict "spam".

Note: the best cutoff turns out to be about 0.5 — if the frequency of 'your' is above 0.5 we say it's SPAM, and if it's below 0.5 we say it's HAM.
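The cutoff rule above can be sketched directly on the kernlab spam data (a sketch, not the course's exact code; C = 0.5 as in the note):

```r
# One-feature cutoff rule: predict "spam" when the frequency of
# the word "your" in an email exceeds C = 0.5.
library(kernlab)
data(spam)

C <- 0.5
prediction <- ifelse(spam$your > C, "spam", "nonspam")

# Fraction of emails in each (predicted, true) cell
table(prediction, spam$type) / length(spam$type)

# Accuracy of the rule on the full dataset (roughly 0.75)
mean(prediction == spam$type)
```

Even this single feature classifies about three quarters of the emails correctly, which is why features matter so much.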
question -> input data -> features -> algorithm -> parameters -> evaluation
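For the evaluation step, the same rule can be scored with standard error measures (again a sketch; "sensitivity" and "specificity" are the usual names for these quantities, introduced here rather than taken from the slide):

```r
# Evaluate the C = 0.5 cutoff rule with accuracy, sensitivity,
# and specificity computed from the confusion matrix.
library(kernlab)
data(spam)

prediction <- ifelse(spam$your > 0.5, "spam", "nonspam")
tab <- table(predicted = prediction, truth = spam$type)

accuracy    <- sum(diag(tab)) / sum(tab)
sensitivity <- tab["spam", "spam"] / sum(tab[, "spam"])          # fraction of SPAM caught
specificity <- tab["nonspam", "nonspam"] / sum(tab[, "nonspam"]) # fraction of HAM kept
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```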
1.3 Relative importance of steps
{About the tradeoffs among the different components of building a machine learning algorithm}
Relative order of importance:question > data > features > algorithms
...
Creating features is an important component: if you don't compress the data in the right way, you might lose the relevant and valuable information. And finally, in my experience the algorithm is often the least important part of building a machine learning system. It can be very important for certain modalities — for example, image data and voice data can require specific kinds of prediction algorithms.
An important point
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.--John Tukey
In other words, an important part of knowing how to do prediction is knowing when to give up — when the data you have is simply not sufficient to answer the question you're asking.
Garbage in = Garbage out
question -> input data -> features -> algorithm -> parameters -> evaluation

The key point to remember when making these decisions is garbage in, garbage out. In other words, if you collected bad data, or data that isn't useful for making predictions, then no matter how good your machine learning algorithm is, you'll often get very bad results. Features matter!
question -> input data -> features -> algorithm -> parameters -> evaluation

Properties of good features
- Lead to data compression
- Retain relevant information
- Are created based on expert application knowledge
Note: there's a debate in the community about whether it's better to create features automatically or to use expert domain knowledge. In general, expert domain knowledge can help quite a bit in many applications, and so should be consulted when building features for a machine learning algorithm.
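As a toy illustration of feature creation (hypothetical code — the example email and variable names are mine, not the course's), the spam$your column used earlier is essentially a hand-built feature of this kind:

```r
# Hypothetical sketch: building a word-frequency feature from raw
# email text, like the 'your' frequency in the kernlab spam data.
email_text <- "Dear friend, claim your prize now! Your account awaits."

# Split on runs of non-letters and lower-case everything
words <- tolower(unlist(strsplit(email_text, "[^[:alpha:]]+")))
words <- words[words != ""]

# Percentage of words equal to "your" (the spam dataset stores
# word frequencies as percentages)
your_freq <- 100 * sum(words == "your") / length(words)
your_freq
```

This compresses a whole email down to one number while retaining information relevant to the SPAM/HAM question — the two properties of good features listed above.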
Common mistakes
- Trying to automate feature selection
- Not paying attention to data-specific quirks
- Throwing away information unnecessarily
from:http://blog.csdn.net/pipisorry/article/details/46490177