Databases - Chapter 3: Data Regression


Faculty of Computer Science & Engineering, Ho Chi Minh City University of Technology
Chapter 3: Data Regression
Graduate Program in Computer Science
Electronic lecture notes compiled by Dr. Võ Thị Ngọc Châu (chauvtn@cse.hcmut.edu.vn)
Semester 1, 2011-2012

References
[1] Jiawei Han, Micheline Kamber, “Data Mining: Concepts and Techniques”, Second Edition, Morgan Kaufmann Publishers, 2006.
[2] David Hand, Heikki Mannila, Padhraic Smyth, “Principles of Data Mining”, MIT Press, 2001.
[3] David L. Olson, Dursun Delen, “Advanced Data Mining Techniques”, Springer-Verlag, 2008.
[4] Graham J. Williams, Simeon J. Simoff, “Data Mining: Theory, Methodology, Techniques, and Applications”, Springer-Verlag, 2006.
[5] Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, Vipin Kumar, “Next Generation of Data Mining”, Taylor & Francis Group, LLC, 2009.
[6] Daniel T. Larose, “Data Mining Methods and Models”, John Wiley & Sons, Inc., 2006.
[7] Ian H. Witten, Eibe Frank, “Data Mining: Practical Machine Learning Tools and Techniques”, Second Edition, Elsevier Inc., 2005.
[8] Florent Masseglia, Pascal Poncelet, Maguelonne Teisseire, “Successes and New Directions in Data Mining”, IGI Global, 2008.
[9] Oded Maimon, Lior Rokach, “Data Mining and Knowledge Discovery Handbook”, Second Edition, Springer Science + Business Media, LLC, 2005, 2010.

Contents
Chapter 1: Overview of data mining
Chapter 2: Data preprocessing issues
Chapter 3: Data regression
Chapter 4: Data classification
Chapter 5: Data clustering
Chapter 6: Association rules
Chapter 7: Data mining and database technology
Chapter 8: Applications of data mining
Chapter 9: Research topics in data mining
Chapter 10: Review

Chapter 3: Data regression
3.1. Overview of regression
3.2. Linear regression
3.3. Nonlinear regression
3.4. Applications
3.5. Issues with regression
3.6. Summary

3.0. Situation 1
What will the price of STB stock be tomorrow?

3.0. Situation 2
[Figure: plot of y against x showing the line y = x + 1, with the points X1, Y1, and Y1' marked]
Which model describes the distribution of y with respect to x?

3.0. Situation 3
Market basket analysis → which associations exist among the items?

3.0. Situation 4
Survey of the factors that influence the trend of online advertising use in Vietnam:
• Perceived entertainment (+0.209)
• Information quality (+0.261)
• Perceived information quality (+0.199)
• Perceived irritation (-0.175)
• Perceived credibility
• Attitude toward privacy
• Interactivity (+0.373)
• Subjective norm (+0.254)
• Perceived behavioral control (+0.377)

3.0. Situations
Regression
• Predictive data mining: which of the situations?
• Descriptive data mining: which of the situations?

3.1. Overview of regression
Definition of regression
• J. Han et al. (2001, 2006): regression is a statistical technique that allows the prediction of continuous (numeric) values.
• Wikipedia (2009): regression (regression analysis) is a statistical technique for estimating the relationships among variables.
• R. D. Snee (1977): regression (regression analysis) is a statistical technique in the field of data analysis and empirical model building; the fitted regression model can be used for prediction, for control, or for learning about the mechanism that generated the data.
R. D. Snee, Validation of Regression Models: Methods and Examples, Technometrics, Vol. 19, No. 4 (Nov. 1977), pp. 415-428.

3.1. Overview of regression
Regression model: a model that describes the relationship between a set of predictor variables (independent variables) and one or more responses (dependent variables).
Regression equation: Y = f(X, β)
• X: the predictor (independent) variables
• Y: the responses (dependent variables)
• β: the regression coefficients
→ X is used to explain the variation in the responses Y.
→ Y describes the phenomenon of interest, the one to be explained.
→ The relationship between Y and X is expressed as a functional dependence of Y on X.
→ β describes the influence of X on Y.

3.1. Overview of regression
Classification
• Linear and nonlinear regression
• Simple (single-variable) and multiple regression
• Parametric, nonparametric, and semiparametric regression
• Symmetric and asymmetric regression

3.1. Overview of regression
Classification: linear and nonlinear regression
• Linear in parameters: Y is formed by a linear combination of the parameters.
• Nonlinear in parameters: Y is formed by a nonlinear combination of the parameters.
[Regression and Calibration.ppt]

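As a small illustration of the distinction (the two specific functions are assumptions chosen for this note, not taken from the slides): a quadratic in X is nonlinear in X but still linear in the parameters, whereas an exponential curve is nonlinear in the parameters because one of them sits in the exponent.

```latex
% Linear in parameters: Y is a linear combination of \beta_0, \beta_1, \beta_2
Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon

% Nonlinear in parameters: \beta_1 enters through the exponent
Y = \beta_0 \, e^{\beta_1 X} + \varepsilon
```
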
3.1. Overview of regression
Classification: simple (single-variable) and multiple regression
• Single: X = (X1), for example $\hat{y} = 26.89 + 4.06x$
• Multiple: X = (X1, X2, ..., Xk), for example $\hat{y} = 6.3972 + 20.4921x_1 + 0.2805x_2$
[Chapter 6 Regression and Correlation.ppt]

3.1. Overview of regression
Classification: parametric, nonparametric, and semiparametric regression
• Parametric: a regression model with a finite number of parameters
• Nonparametric: a regression model with an infinite number of parameters
• Semiparametric: a regression model with a finite number of parameters of interest

Types of (additive) model    Mathematical form
Parametric                   Y = β0 + β1*X
Nonparametric                Y = β0 + f(X)
Semiparametric               Y = β0 + β1*X1 + f(X2)

[Wikipedia] [GAM - nonparameteric regression technique.ppt]
P. Giudici, Applied Data Mining: Statistical Methods for Business and Industry, John Wiley & Sons Ltd, 2003.

3.1. Overview of regression
Classification: symmetric and asymmetric regression
• Symmetric: descriptive regression models (e.g. log-linear models). The objective of the analysis is descriptive: to describe the associative structure among the variables.
• Asymmetric: predictive regression models (e.g. linear regression models, logistic regression models). The variables are divided into two groups, response and explanatory, and the goal is to predict the responses on the basis of the explanatory variables.
→ Generalized linear models: symmetric vs. asymmetric
P. Giudici, Applied Data Mining: Statistical Methods for Business and Industry, John Wiley & Sons Ltd, 2003.

3.2. Linear regression
• Simple linear regression: regression line
• Multiple linear regression: regression plane

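3.2.1. Simple linear regression
For N observed objects, the simple linear regression model takes one of the following forms, where $\varepsilon_i$ holds the part of the variation of the response Y that is not explained by X:
• Straight-line form: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
• Parabolic form: $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$
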
3.2.1. Simple linear regression
• Y = β0 + β1*X → Y = 0.636 + 2.018*X
• The sign of β1 indicates the direction of the influence of X on Y.

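3.2.1. Simple linear regression
Estimating the parameter set β = (β0, β1) to obtain the simple linear regression model
• Residual: $e_i = y_i - \hat{y}_i$
• Sum of squared residuals, to be minimized: $\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
• Estimates of β: $\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}$, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
where
• $x_i$, $y_i$: values of x and y from the training data set
• $\bar{x}$, $\bar{y}$: mean values over the training data set
• $\hat{y}_i$: estimated value obtained with the parameter set β
Assumptions: the error term has constant variance and follows a normal distribution.
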
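The least-squares estimation described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names and the five training points are hypothetical, and the closed-form estimates used are the standard ordinary least squares ones for a straight line.

```python
def fit_simple_linear(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of (y_i - y_hat_i)^2 for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical training data (not from the lecture).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_linear(xs, ys)
ssr = sum_squared_residuals(xs, ys, b0, b1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSR = {ssr:.4f}")
```

Any other choice of (b0, b1) gives a sum of squared residuals at least as large as the one printed, which is what "minimizing" means here.
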
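3.2.2. Multiple linear regression
Multiple linear regression analyzes the relationship between a dependent (response) variable and two or more independent variables.
$y_i = b_0 + b_1 x_{i,1} + b_2 x_{i,2} + \dots + b_k x_{i,k}$, i = 1, ..., n
• n = number of observed objects
• k = number of independent variables (attributes/criteria/factors)
• Y = dependent variable
• X = independent variables
• b0 = value of Y when X = 0
• b1, ..., bk = values of the regression coefficients

3.2.2. Multiple linear regression
Estimated value of Y:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$
Estimate of the parameter set b:
$$b = (X^T X)^{-1} X^T Y$$
where
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,k} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,k} \end{pmatrix}, \quad
b = \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{pmatrix}$$
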
3.2.2. Multiple linear regression
Example: a sales manager of Tackey Toys needs to predict sales of Tackey products in selected market areas. He believes that advertising expenditures and the population in each market area can be used to predict sales. He gathered a sample of toy sales, advertising expenditures, and population, shown below. Find the multiple linear regression equation that best fits the data.
[Chapter 6 Regression and Correlation.ppt]

3.2.2. Multiple linear regression
Market area   Advertising expenditures x1    Population x2   Toy sales y
              (thousands of dollars)         (thousands)     (thousands of dollars)
A             1.0                            200             100
B             5.0                            700             300
C             8.0                            800             400
D             6.0                            400             200
E             3.0                            100             100
F             10.0                           600             400
[Chapter 6 Regression and Correlation.ppt]

3.2.2. Multiple linear regression
Fitted equation: $\hat{y} = 6.3972 + 20.4921x_1 + 0.2805x_2$
[Chapter 6 Regression and Correlation.ppt]

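The fitted equation can be reproduced from the Tackey Toys table by solving the normal equations $(X^T X) b = X^T Y$. The sketch below uses plain Python and a small Gaussian-elimination helper; both function names are hypothetical, and solving the normal equations directly is only one of several ways to compute the estimate (it is chosen here because it mirrors the formula on the slide).

```python
def solve_linear_system(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix [a | rhs], copied so the inputs are not modified.
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_linear(rows, ys):
    """Estimate b from (X^T X) b = X^T Y; X gets a leading column of ones."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Market areas A-F: (advertising expenditures x1, population x2), toy sales y.
rows = [(1.0, 200), (5.0, 700), (8.0, 800), (6.0, 400), (3.0, 100), (10.0, 600)]
sales = [100, 300, 400, 200, 100, 400]

b = fit_multiple_linear(rows, sales)
print("b0, b1, b2 =", [round(v, 4) for v in b])
```

The printed coefficients should agree with the slide's equation up to rounding.
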
3.3. Nonlinear regression
Y = f(X, β)
• Y is a nonlinear function of the combination of the parameters β.
• Examples: exponential functions, logarithmic functions, Gaussian functions, ...
• Finding the optimal parameter set β: optimization algorithms applied to the sum of squared residuals
  - Local optimization
  - Global optimization

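As one example of a local optimization scheme for the sum of squared residuals, the sketch below fits the exponential model y = a * exp(b * x) with Gauss-Newton iterations and step halving. The slides do not name a specific algorithm, so this choice is an assumption, as are the function name, the starting values, and the noise-free data generated from a = 2 and b = 0.5.

```python
import math


def fit_exponential(xs, ys, a, b, max_iter=50, tol=1e-12):
    """Fit y = a * exp(b * x) by Gauss-Newton with step halving."""

    def ssr(a_, b_):
        return sum((y - a_ * math.exp(b_ * x)) ** 2 for x, y in zip(xs, ys))

    current = ssr(a, b)
    for _ in range(max_iter):
        # Residuals and the two Jacobian columns df/da and df/db.
        r = [y - a * math.exp(b * x) for x, y in zip(xs, ys)]
        ja = [math.exp(b * x) for x in xs]
        jb = [a * x * math.exp(b * x) for x in xs]
        # Normal equations (J^T J) delta = J^T r, solved by Cramer's rule.
        a11 = sum(v * v for v in ja)
        a12 = sum(u * v for u, v in zip(ja, jb))
        a22 = sum(v * v for v in jb)
        g1 = sum(u * v for u, v in zip(ja, r))
        g2 = sum(u * v for u, v in zip(jb, r))
        det = a11 * a22 - a12 * a12
        if det == 0.0:
            break
        da = (g1 * a22 - a12 * g2) / det
        db = (a11 * g2 - a12 * g1) / det
        # Halve the step until the sum of squared residuals decreases.
        step = 1.0
        while step > 1e-8 and ssr(a + step * da, b + step * db) >= current:
            step /= 2.0
        if step <= 1e-8:
            break  # no improving step found: treat as converged
        a, b = a + step * da, b + step * db
        new = ssr(a, b)
        converged = current - new < tol
        current = new
        if converged:
            break
    return a, b, current


# Hypothetical noise-free data generated from a = 2, b = 0.5.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

a_hat, b_hat, final_ssr = fit_exponential(xs, ys, a=1.5, b=0.6)
print(f"a = {a_hat:.6f}, b = {b_hat:.6f}, SSR = {final_ssr:.3e}")
```

Being a local method, the result depends on the starting values; a poor start can end in a different local minimum, which is where global optimization methods come in.
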
3.4. Applications
• The data mining process
  - Data preprocessing stage
  - Data mining stage: descriptive data mining and predictive data mining
• Application areas: biology, agriculture, social issues, economy, business, ...
P. Giudici, Applied Data Mining: Statistical Methods for Business and Industry, John Wiley & Sons Ltd, 2003.

3.5. Issues with regression
• The assumptions that accompany the regression problem
• The amount of data to be processed
• Evaluation of the regression model
• Advanced techniques for regression:
  - Artificial Neural Network (ANN)
  - Support Vector Machine (SVM)

3.6. Summary
Regression
• A statistical technique applied to continuous attributes/features
• Has a long history of development
• Simple but very useful, and widely applied
• Shows the significant contribution of statistics to the field of data mining
• Forms of regression models: linear/nonlinear, simple/multiple, parametric/nonparametric/semiparametric, symmetric/asymmetric

Questions & Answers

Chapter 3: Data Regression, Part 2

Contents
• Generalized linear models → [2], section 11.3, pp. 384-390.
• Logistic regression → [2], section 10.7, pp. 354-355; [9], section 25.3, pp. 529-532.
• Generalized additive models → [2], section 11.5.1, pp. 393-395.
• Projection pursuit regression → [2], section 11.5.2, pp. 395-397.

Generalized linear models
Linear models: the response variable is decomposed into two parts
• a weighted sum of the predictor variables
• a random component: the ε(i) are assumed to be independently distributed as N(0, σ²)
The generalized linear model extends the ideas of linear models.

Generalized linear models
The linear model restated, and how the generalized linear model extends it:
(i) The Y(i) are independent random variables, with distribution N(µ(i), σ²).
    → Relaxed: the random variables are no longer required to follow a normal distribution.
(ii) The parameters enter the model in a linear way via the sum v(i) = ∑ a_j x_j(i).
(iii) The v(i) and µ(i) are linked by v(i) = µ(i).
    → Generalized: g(µ(i)) = v(i) relates the parameter of the distribution to the linear term v(i) = ∑ a_j x_j(i).

Generalized linear models
The generalized linear model has three main features:
(i) The Y(i), i = 1, ..., n, are independent random variables with the same exponential family distribution:
    $f(y; \theta, \phi) = \exp\left(\frac{y\theta - b(\theta)}{\alpha(\phi)} + c(y, \phi)\right)$
    The exponential family of distributions is an important family that includes the normal, the Poisson, the Bernoulli, and the binomial distributions. If φ is known, then θ is called the natural or canonical parameter. When, as is often the case, α(φ) = φ, φ is called the dispersion or scale parameter.
(ii) The predictor variables are combined in a form v(i) = ∑ a_j x_j(i) called the linear predictor, where the a_j are estimates of the α_j.
(iii) The mean µ(i) of the distribution for a given predictor vector is related to the linear combination in (ii) through the link function g(µ(i)) = v(i) = ∑ a_j x_j(i).

Generalized linear models
• Prediction from a generalized linear model requires the inversion of the relationship g(µ(i)) = ∑ a_j x_j(i).
• Parameter estimation: maximum likelihood solution. The nonlinearity means that an iterative scheme has to be adopted.
• A measure of the goodness of fit of a generalized linear model, analogous to the sum of squares used for linear regression: the deviance D(M) of a model
  - D(M) is twice the difference between the log likelihood of the largest model we are prepared to contemplate, M*, and the log likelihood of model M.
  - The sum of squares is the special case of the deviance when it is applied to linear models.

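To make the role of the link function concrete, the sketch below computes the predicted mean µ = g⁻¹(∑ a_j x_j) for three commonly used links. The dictionary, the function name, and the coefficient values are hypothetical; only the identity, log, and logit links themselves are standard.

```python
import math

# Each entry maps a link name to the pair (g, g_inverse).
LINKS = {
    "identity": (lambda mu: mu, lambda v: v),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda v: 1.0 / (1.0 + math.exp(-v))),
}


def predict_mean(coeffs, x, link):
    """Return mu = g^{-1}(sum_j a_j x_j) for one predictor vector x."""
    v = sum(a * xj for a, xj in zip(coeffs, x))  # linear predictor v(i)
    _, g_inverse = LINKS[link]
    return g_inverse(v)


# Hypothetical coefficients; x[0] = 1 plays the role of the intercept term.
coeffs = [-1.0, 0.5]
x = [1.0, 3.0]

for name in LINKS:
    print(name, predict_mean(coeffs, x, name))
```

The same linear predictor yields an unrestricted mean under the identity link, a positive mean under the log link, and a probability in (0, 1) under the logit link, which is why the link is chosen to match the response distribution.
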
Logistic regression
$\pi(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$
[Figure: logistic curves for (a) β > 0 and (b) β < 0]
• β > 0: π(x) increases as x increases.
• β < 0: π(x) decreases as x increases.
• β → 0: the curve tends to become a horizontal straight line.
• When β = 0, Y is independent of X.

Logistic regression
Logistic regression → logistic discriminant analysis
• Descriptive model
• A very powerful tool for classification problems in discriminant analysis
  → tends to have higher accuracy than Naïve Bayes when training data is plentiful
• Applied in many medical and clinical research studies
• Can be viewed as a neural network model without hidden nodes, with a logistic activation function and softmax output function
• The y_i are binary variables and thus not normally distributed.
• The distribution of y_i given x is assumed to follow a Bernoulli distribution: $p(y_i \mid x) = \pi(x)^{y_i} (1 - \pi(x))^{1 - y_i}$
  → the logit $\ln\frac{\pi(x)}{1 - \pi(x)}$ is a linear function of x

Logistic regression
Logistic regression → logistic discriminant analysis
Estimating the β's: maximum likelihood
$\pi(x) = p(y = 1 \mid x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}$
→ Find the smallest possible deviance between the observed and predicted values (somewhat like finding the best-fitting line), using calculus (specifically, derivatives).
→ The procedure runs through successive iterations, trying different solutions until it reaches the smallest possible deviance, that is, the best fit.
→ Once the best solution is found, it reports a final value for the deviance D, usually referred to as "negative two log likelihood" and treated as a chi-square value:
$D = -2 \ln\left(\frac{\text{likelihood of the reduced model}}{\text{likelihood of the full model}}\right)$
• Likelihood of the reduced model = likelihood of the predicted values (π(x))
• Likelihood of the full model = probabilities of the observed values (y = 1/0)

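The iterative maximum likelihood search and the resulting deviance can be sketched as follows for a single predictor. Newton-Raphson is used here as one common iterative scheme (the slide does not name the algorithm, so this is an assumption), the function names are hypothetical, and the ten observations are made up for illustration. For binary responses the full (saturated) model reproduces each observed value with probability 1, so D reduces to -2 times the log likelihood of the fitted model.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def log_likelihood(xs, ys, b0, b1):
    """Bernoulli log likelihood of the fitted probabilities pi(x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, max_iter=25, tol=1e-10):
    """Maximum likelihood by Newton-Raphson for one predictor."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        ps = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient of the log likelihood.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Negative Hessian (a 2x2 matrix), inverted by Cramer's rule.
        w = [p * (1.0 - p) for p in ps]
        h00 = sum(w)
        h01 = sum(wi * x for wi, x in zip(w, xs))
        h11 = sum(wi * x * x for wi, x in zip(w, xs))
        det = h00 * h11 - h01 * h01
        d0 = (g0 * h11 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical data: the classes overlap, so the estimates are finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
deviance = -2.0 * log_likelihood(xs, ys, b0, b1)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, deviance = {deviance:.4f}")
```

If the two classes were perfectly separable in x, the likelihood would have no finite maximizer and this plain Newton iteration would not settle; the overlapping data above avoids that case.
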
Logistic regression
[Table: parameter estimates for the five variables selected in the final model, with the corresponding Wald statistics]
• All five variables are significant at a significance level of 0.05.
• The variable Vdpflart indicates whether or not the price of the first purchase is paid in instalments; it is clearly estimated to be the variable most associated with the response variable.
P. Giudici, Applied Data Mining: Statistical Methods for Business and Industry, John Wiley & Sons Ltd, 2003, p. 166.

Generalized additive models
• An extension of the generalized linear model
• Replace the simple weighted sums of the predictor variables by weighted sums of transformed versions of the predictor variables:
  $g(\mu(i)) = \alpha + \sum_{j=1}^{p} f_j(x_j(i))$
• The right-hand side is sometimes termed the additive predictor.
• The relationships between the response variable and the predictor variables are estimated nonparametrically, which gives greater flexibility.
• When some of the functions are estimated from the data and some are determined by the researcher, the generalized additive model is sometimes called "semiparametric."

Generalized additive models
• The model retains the merits of linear and generalized linear models.
  - How g changes with any particular predictor variable does not depend on how the other predictor variables change, so interpretation is eased.
  - This comes at the cost of assuming that such an additive form does provide a good approximation to the "true" surface.
• The model can be readily generalized by including multiple predictor variables within individual f components of the sum, which relaxes the simple additive interpretation.
• The additive form also means that we can examine each smoothed predictor variable separately, to see how well it fits the data.

Generalized additive models
A GAM fitting algorithm: the backfitting algorithm, which estimates the functions f_j and the constant α ([9], pp. 218-219).
1. Initialize: $\alpha = \bar{y}$, $f_j = f_j^0$, j = 1, ..., p.
   Each predictor is given an initial functional relationship to the response, such as a linear one. The intercept is given an initial value of the mean of y.
2. Cycle: j = 1, ..., p, 1, ..., p, ...
   $f_j = S_j\left(y - \alpha - \sum_{k \neq j} f_k \,\middle|\, x_j\right)$
   A single predictor is selected. Fitted values are constructed using all of the other predictors. These fitted values are subtracted from the response. A smoother S_j is applied to the resulting "residuals," taken to be a function of the single excluded predictor. The smoother updates the function for that predictor. Each of the other predictors is, in turn, subjected to the same process.
3. Continue step 2 until the individual functions do not change.

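The three steps of backfitting can be sketched in plain Python for an additive model with an identity link. Several choices below are assumptions made for this illustration and are not prescribed by the slides: the smoother is a simple k-nearest-neighbour running mean, the initial functions are zero rather than linear, each f_j is centred after smoothing, and the data are synthetic.

```python
def running_mean_smoother(x, r, k=5):
    """Smooth r against x: average r over the k nearest neighbours in x."""
    n = len(x)
    smoothed = []
    for i in range(n):
        nearest = sorted(range(n), key=lambda j: abs(x[j] - x[i]))[:k]
        smoothed.append(sum(r[j] for j in nearest) / k)
    return smoothed


def backfit(columns, y, k=5, max_cycles=50, tol=1e-8):
    """Backfitting for y ~ alpha + sum_j f_j(x_j) with an identity link."""
    n, p = len(y), len(columns)
    alpha = sum(y) / n                   # step 1: intercept = mean of y
    f = [[0.0] * n for _ in range(p)]    # step 1: f_j^0 = 0 (an assumption)
    for _ in range(max_cycles):          # step 2: cycle over the predictors
        change = 0.0
        for j in range(p):
            # Partial residuals: remove alpha and every other f_m.
            partial = [y[i] - alpha - sum(f[m][i] for m in range(p) if m != j)
                       for i in range(n)]
            new_fj = running_mean_smoother(columns[j], partial, k)
            mean_fj = sum(new_fj) / n    # centre f_j; alpha keeps the level
            new_fj = [v - mean_fj for v in new_fj]
            change = max(change,
                         max(abs(a - b) for a, b in zip(new_fj, f[j])))
            f[j] = new_fj
        if change < tol:                 # step 3: stop once functions settle
            break
    return alpha, f


# Synthetic data: y is additive in x1 (linear) and x2 (quadratic).
n = 40
x1 = [i / (n - 1) for i in range(n)]
x2 = [((7 * i) % n) / (n - 1) for i in range(n)]
y = [2.0 * a + 4.0 * (b - 0.5) ** 2 for a, b in zip(x1, x2)]

alpha, f = backfit([x1, x2], y)
fitted = [alpha + f[0][i] + f[1][i] for i in range(n)]
ssr_fit = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ssr_null = sum((yi - alpha) ** 2 for yi in y)
print(f"SSR of additive fit = {ssr_fit:.4f}, SSR of mean-only model = {ssr_null:.4f}")
```

Plotting f[0] against x1 and f[1] against x2 would show each smoothed component on its own, which is the per-predictor inspection that the additive form makes possible.
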
Generalized additive models
These "adaptive" methods seem to be most useful
• when the data have a high signal-to-noise ratio,
• when the response function is highly nonlinear, and
• when the variability in the response function changes dramatically from location to location.
→ Experience to date suggests that data from the engineering and physical sciences are most likely to meet these criteria.
→ Data from the social sciences are likely to be far too noisy.

Generalized additive models
• Neural networks are a special case of the generalized additive linear models.
• Multilayer feedforward neural networks with one hidden layer:
  $f(x) = \sum_{j=1}^{m} \beta_j \, \sigma(w_j^T x + b_j)$
  where m is the number of processing units in the hidden layer.
• The family of functions that can be computed depends on the number of neurons in the hidden layer and on the activation function σ.
• Note that a standard multilayer feedforward network with a smooth activation function σ can approximate any continuous function on a compact set to any degree of accuracy if and only if σ is not a polynomial.

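A one-hidden-layer network of this kind is a sum of m transformed linear combinations of the inputs. The sketch below evaluates such a network; the parameter names, the logistic choice for σ, and all weight values are hypothetical and serve only to show the structure.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def one_hidden_layer(x, hidden_weights, hidden_biases, output_weights):
    """Return sum_j beta_j * sigma(w_j . x + b_j) over the m hidden units."""
    total = 0.0
    for w, b, beta in zip(hidden_weights, hidden_biases, output_weights):
        activation = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        total += beta * activation
    return total


# Hypothetical network with m = 2 hidden units and two inputs.
hidden_weights = [[1.0, -2.0], [0.5, 0.5]]
hidden_biases = [0.0, 0.0]
output_weights = [1.0, -3.0]

value_at_origin = one_hidden_layer([0.0, 0.0], hidden_weights,
                                   hidden_biases, output_weights)
print(value_at_origin)
```

Adding hidden units adds terms to the sum, which is how the number of neurons controls the family of functions the network can represent.
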
Projection pursuit regression
• The additive models essentially focus on individual variables (albeit transformed versions of these).
• The additive models can be extended so that each additive component involves several variables, but it is not clear how best to select such subsets.
• If the total number of available variables is large, then we may also be faced with a combinatorial explosion of possibilities.

Projection pursuit regression
The basic projection pursuit regression model:
$\hat{y} = \sum_{k} w_k f_k(\alpha_k^T x)$
• This is a linear combination of (potentially nonlinear) transformations of linear combinations of the raw variables.
• The f functions are not constrained (as in neural networks) to take a particular form, but are usually found by smoothing, as in generalized additive models.
• The term projection pursuit arises from the viewpoint that one is projecting X in direction α_k, and then seeking directions of projection that are optimal for some purpose, here optimal as components in a predictive model.
• The model is fitted using standard iterative procedures to estimate the parameters in the α_k vectors.

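The structure of the model (project, transform, then combine) can be sketched as below. This only evaluates a model whose pieces are already given: in actual projection pursuit regression the directions and the ridge functions are estimated from data, with the functions found by smoothing. The function name, directions, ridge functions, and weights are all hypothetical.

```python
import math


def ppr_predict(x, directions, ridge_functions, weights):
    """Evaluate sum_k w_k * f_k(alpha_k . x) for one input vector x."""
    total = 0.0
    for alpha, f, w in zip(directions, ridge_functions, weights):
        projection = sum(a * xi for a, xi in zip(alpha, x))  # project x onto alpha_k
        total += w * f(projection)
    return total


# Hypothetical model with two projection directions in a 2-D input space.
directions = [[1.0, 0.0], [0.6, 0.8]]
ridge_functions = [math.tanh, lambda z: z * z]
weights = [2.0, 0.5]

prediction = ppr_predict([0.0, 1.0], directions, ridge_functions, weights)
print(prediction)
```

Fixing every ridge function to the same sigmoid-shaped activation turns this expression into a one-hidden-layer neural network, which is the sense in which the two models are closely related.
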
Projection pursuit regression
• The projection pursuit regression model has obvious close similarities to the neural network model: it is a generalization of neural networks.
• Projection pursuit regression models can be proven to have the same ability to estimate arbitrary functions as neural networks, but they are not as widely used.
• Estimating their parameters can have advantages over the neural network situation.
• Projection pursuit regression may not be practical for data sets that are massive (large n) and high-dimensional (large p).
• The fitting process is rather complex from a computational viewpoint.

Summary
Regression
• Linear models
• Generalized linear models
• Logistic models
• Feedforward neural networks
• Back-propagation neural networks
• Generalized additive models
• Projection pursuit regression
→ Linearity to nonlinearity
→ Descriptive vs. predictive

Further reading
• Predictive modeling for regression: [2], chapter 11, pp. 367-398.
• Regression modeling, multiple regression and model building, logistic regression: [6], chapters 2-4, pp. 33-203.
• Data mining within a regression framework: [9], chapter 11, pp. 209-230.
• Statistical methods for data mining: [9], chapter 25, pp. 523-540.
• Validation of regression models: methods and examples. Ronald D. Snee, Technometrics, vol. 19, no. 4 (Nov. 1977), pp. 415-428.
• Choosing between logistic regression and discriminant analysis. S. James Press, Sandra Wilson, Journal of the American Statistical Association, vol. 73, no. 364 (Dec. 1978), pp. 699-705.
• Fitting curves to data using nonlinear regression: a practical and nonmathematical review. Harvey J. Motulsky, Lennart A. Ransnas, FASEB J., vol. 1 (1987), pp. 365-374.