SAS Enterprise Miner Decision Tree: How to Continue Editing an Interactive Tree
The Three Most Common Data Mining Software Tools
Robert Nisbet, ... Gary Miner, in Handbook of Statistical Analysis and Data Mining Applications, 2009
Layout of the SAS-Enterprise Miner Window
The Enterprise Miner window, as shown in Figure 10.4, contains the following interface components:
Figure 10.4. Layout of SAS-EM workspace window.
- Toolbar and toolbar shortcut buttons: The Enterprise Miner Toolbar is a graphic set of node icons that are organized by SEMMA categories. Above the toolbar is a collection of toolbar shortcut buttons that are commonly used to build process flow diagrams in the diagram workspace. Move the mouse pointer over any node or shortcut button to see the text name. Drag a node into the diagram workspace to use it. The toolbar icon remains in place and the node in the diagram workspace is ready to be connected and configured for use in your process flow diagram. Click on a shortcut button to use it.
- Project panel: Use the Project panel to manage and view data sources, diagrams, model packages, and project users.
- Properties panel: Use the Properties panel to view and edit the settings of data sources, diagrams, nodes, and model packages.
- Diagram workspace: Use the diagram workspace to build, edit, run, and save process flow diagrams. This is the place where you graphically build, order, sequence, and connect the nodes that you use to mine your data and generate reports.
- Property Help panel: The Property Help panel displays a short description of the property that you select in the Properties panel. Extended help can be found in the Help Topics selection from the Help main menu or from the Help button on many windows.
- Status bar: The Status bar is a single pane at the bottom of the window that indicates the execution status of a SAS-Enterprise Miner task.
Read full chapter: https://www.sciencedirect.com/science/article/pii/B9780123747655000103
Predicting Micro Lending Loan Defaults Using SAS® Text Miner
Richard Foley, in Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, 2012
About SAS® Text Miner
SAS® Text Miner is a plug-in for the SAS® Enterprise Miner environment, which provides a rich set of data mining tools that facilitate the prediction aspect of text mining. The integration of SAS® Text Miner within SAS® Enterprise Miner combines textual data with traditional data mining variables. The integration provides the ability to add text mining nodes into SAS® Enterprise Miner process flow diagrams. SAS® Text Miner encompasses the parsing and exploration aspects of text mining and prepares data for predictive mining and further exploration using other SAS® Enterprise Miner nodes. SAS® Text Miner supports various sources of textual data: local text files, text as observations in SAS® data sets or external databases, and files on the web.
As part of SAS® Enterprise Miner, SAS® Text Miner follows the concept of nodes, where each node provides a customizable task. SAS® Text Miner consists of four of these highly customizable nodes.
Read full chapter: https://www.sciencedirect.com/science/article/pii/B9780123869791000190
Summary
Robert Nisbet, ... Gary Miner, in Handbook of Statistical Analysis and Data Mining Applications, 2009
The Process Is More Important Than the Tool
To some degree, data mining tools lead you through the data mining process. SAS-Enterprise Miner organizes its top toolbar to present groups of operations performed in each of the major phases of Sample, Explore, Modify, Model, and Assess (SEMMA). This organization keeps the correct sequence of operations central in the mind of the data miner. STATISTICA Data Miner divides the modeling screen into four general phases of data mining: (1) data acquisition; (2) data cleaning, preparation, and transformation; (3) data analysis, modeling, classification, and forecasting; and (4) reports. This grouping of activities corresponds roughly to the five phases of the SEMMA process flow. The STATISTICA Data Miner approach constrains you to do the appropriate group of operations in sequence when building a model in the visual programming diagram of a workspace. The CRISP-DM process presented in Chapter 3 is tool-independent and industry- (or discipline-) independent. You can follow this process with any data mining tool. Whichever approach you take, it is wise to follow the Marine drill sergeant's appeal to "Get with the program!"
Many algorithms will perform similarly on the same data set, although one may be best. Your choice of algorithm will have much less impact on the quality of your model than the process steps you went through to get it. Focus on following the process correctly, and your model will be at least acceptable. Refinements can come later.
Read full chapter: https://www.sciencedirect.com/science/article/pii/B978012374765500022X
Three Common Text Mining Software Tools
In Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, 2012
Prerequisites for this Scenario
Before you can perform the tasks in this chapter, administrators at your site must have installed and configured all of the necessary components of SAS-TM 4.2. You must also do the following (Figure 6.13):
FIGURE 6.13. SAS-EM selection for starting a new project.
To create a project, do the following (Figure 6.14):
FIGURE 6.14. SAS-EM/TM "Create New Project" window is used for giving the project a unique name and selecting the SAS Server Directory pathway.
1. Open SAS® Enterprise Miner.
2. Click New Project in the SAS-EM window. The Select SAS® Server window opens.
3. Click New Project. The Specify Project Name and Server Directory page opens.
4. Type a name for the project, such as Abstract Data Patterns, in the Project Name dialog box.
5. In the SAS® Server Directory dialog box, type the path to the location on the server where you want to store data for your project. Alternatively, browse to a folder to use for your project.
6. Click Next. The Register the Project page opens.
7. Click Next. The New Project Information page opens.
8. Click Finish to create your project.
To create a data source, do the following (Figure 6.15):
FIGURE 6.15. SAS-EM/TM window for selecting or creating a "data source."
1. Right-click the Data Sources folder in the Project panel and select Create Data Source to open the Data Source Wizard window (Figure 6.16).
FIGURE 6.16. SAS-EM/TM Data Source Wizard window for selecting a metadata source.
2. Select SAS® Table in the Source drop-down menu of the Metadata Source window.
3. Click Next. The Select a SAS® Table window opens.
4. Click Browse.
5. Click the SAS® library named Sampsio. The Sampsio library folder contents are displayed in the Select a SAS® Table dialog box.
6. Select the Abstract table, and then click OK. The two-level name SAMPSIO.ABSTRACT is displayed in the Table box of the Select a SAS® Table page (Figure 6.17). Click Next.
FIGURE 6.17. SAS-EM/TM window for selecting a SAS Table.
7. The Table Information page opens. The Table Properties panel displays metadata for you to review. Click Next.
8. The Metadata Advisor Options page opens. Click Next.
9. The Column Metadata page opens.
10. The roles for both TEXT and TITLE should be set to Text by default. If this is not the case, change their roles to Text using the drop-down list. Click Next.
11. The Create Sample page opens. Click Next.
12. The Data Source Attributes page opens. Click Next.
13. The Summary page opens. Click Finish, and the ABSTRACT table is added to the Data Sources folder in the Project panel.
To create a diagram, complete the following steps:
1. Right-click the Diagram folder in the SAS-EM/TM Project panel and select Create Diagram, as illustrated in Figure 6.18 (the Project panel with the Diagram folder highlighted and Create Diagram selected).
FIGURE 6.18. The Create Diagram dialog selection.
2. Type Abstract Data in the Diagram Name box, as illustrated in Figure 6.19.
FIGURE 6.19. Create New Diagram dialog box.
3. Click OK. The empty Abstract Data diagram opens in the diagram workspace.
4. Choose all defaults by clicking Next in the Create Diagram Wizard.
Once finished, the Abstract Data diagram opens and is ready for your Text Miner diagram to be created. First, we need to add the data source: drag and drop the ABSTRACT data source from the Data Sources list into the diagram workspace.
Read full chapter: https://www.sciencedirect.com/science/article/pii/B9780123869791000062
Basic Algorithms for Data Mining
Robert Nisbet, ... Gary Miner, in Handbook of Statistical Analysis and Data Mining Applications, 2009
Association Rules
The goal of association rules is to detect relationships or associations between specific values of categorical variables in large data sets. This technique allows analysts and researchers to uncover hidden patterns in large data sets. The classic example of an early association analysis found that beer tended to be sold with diapers, pointing to the co-occurrence of watching Monday night football and caring for family concerns at the same time. Variants like the a priori algorithm use predefined threshold values for detection of associations (see Agrawal et al., 1993; Agrawal and Srikant, 1994; Han et al., 2001; see also Witten and Frank, 2000 ). This algorithm is provided by SAS Enterprise Miner, SPSS Clementine, and STATISTICA Data Miner.
How Association Rules Work. Assuming you have a record of each customer transaction at a large bookstore, you can perform an association analysis to determine which other book purchases are associated with the purchase of a given book. With this information in hand at the time of purchase, you could recommend to the customer a list of other books the customer may wish to purchase. Such an application of association analysis is called a recommender engine. Such recommender engines are used at many online retail sites (like Amazon.com).
Association algorithms can be used to analyze simple categorical variables, dichotomous variables, and/or multiple target variables. The algorithm will determine association rules without requiring you to specify the number of distinct categories present in the data or any prior knowledge regarding the maximum factorial degree or complexity of the important associations (except in the a priori variant). A form of cross-tabulation table can be constructed without the need to specify the number of variables or categories. Hence, this technique is especially well suited for the analysis of huge data sets.
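To make these measures concrete, the short sketch below counts support (the joint probability of body and head) and confidence (the conditional probability of the head given the body) for all two-word rules over a handful of made-up transactions. The transactions and the 5% thresholds are illustrative assumptions only; they are not the data behind Table 7.1.

```python
from itertools import permutations

# Hypothetical transactions (e.g., words co-occurring in a scene, or items in a basket).
transactions = [
    {"and", "that", "like"},
    {"and", "that", "like"},
    {"and", "you", "your"},
    {"will", "and", "PAROLLES"},
    {"that", "will"},
    {"and", "that", "like", "your"},
]

n = len(transactions)
min_support, min_confidence = 0.05, 0.05   # thresholds as in Table 7.1

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

items = set().union(*transactions)
rules = []
for body, head in permutations(items, 2):      # one-item body ==> one-item head
    s = support({body, head})                  # joint probability of body and head
    if s == 0:
        continue
    confidence = s / support({body})           # conditional probability P(head | body)
    if s >= min_support and confidence >= min_confidence:
        rules.append((body, head, s, confidence))

for body, head, s, c in sorted(rules, key=lambda r: -r[3]):
    print(f"{body} ==> {head}: support={s:.1%}, confidence={c:.1%}")
```

In a full association rules tool, the same counting is done for bodies and heads of up to the specified maximum sizes, and additional measures (such as the correlation column in Table 7.1) are reported alongside support and confidence.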
Table 7.1 shows an example of a tabular representation of results from an association rules algorithm.
TABLE 7.1. Word Correlations, with Their Support and Confidence Values. Support Is Expressed by the Joint Probability of the Body and Head Words Occurring Together; Confidence Is the Conditional Probability of the Head Given the Body.
Summary of association rules (Scene 1.sta): Min. support = 5.0%, Min. confidence = 5.0%, Min. correlation = 5.0%; Max. size of body = 10, Max. size of head = 10.

| Rule | Body | ==> | Head | Support (%) | Confidence (%) | Correlation (%) |
|---|---|---|---|---|---|---|
| 154 | and, that | ==> | like | 6.94444 | 83.3333 | 91.28709 |
| 126 | like | ==> | and, that | 6.94444 | 100.0000 | 91.28709 |
| 163 | and, PAROLLES | ==> | will | 5.55556 | 80.0000 | 73.02967 |
| 148 | will | ==> | and, PAROLLES | 5.55556 | 66.6667 | 73.02967 |
| 155 | and, you | ==> | your | 5.55556 | 80.0000 | 67.61234 |
| 122 | your | ==> | and, virginity | 5.55556 | 57.1429 | 67.61234 |
| 164 | and, virginity | ==> | your | 5.55556 | 80.0000 | 67.61234 |
| 121 | your | ==> | and, you | 5.55556 | 57.1429 | 67.61234 |
| 73 | that | ==> | like | 6.94444 | 41.6667 | 64.54972 |
| 75 | that | ==> | and, like | 6.94444 | 41.6667 | 64.54972 |
| 161 | and, like | ==> | that | 6.94444 | 100.0000 | 64.54972 |
Note that the rules in the results spreadsheet shown were sorted by the Correlation column.
Graphical representations of association rules are shown in Figures 7.3 and 7.4.
Figure 7.3. Link graph for words spoken in All's Well That Ends Well. The thickness of the line linking words is a measure of the strength of the association.
(Source: StatSoft Inc.)
Figure 7.4. Link graph showing the strength of association by the thickness of the line connecting the Body and Head words of some association rules.
(Source: StatSoft Inc.)
In Figure 7.4, the support values for the Body and Head portions of each association rule are indicated by the sizes and colors of each. The thickness of each line indicates the confidence value (conditional probability of Head given Body) for the respective association rule; the sizes and colors of the circles in the center, above the Implies label, indicate the joint support (for the co-occurrences) of the respective Body and Head components of the respective association rules.
Read full chapter: https://www.sciencedirect.com/science/article/pii/B9780123747655000073
Basic Algorithms for Data Mining: A Brief Overview
Robert Nisbet, Ph.D., ... Ken Yale, D.D.S., J.D., in Handbook of Statistical Analysis and Data Mining Applications (Second Edition), 2018
Basic Data Mining Algorithms
Association Rules
The goal of association rules is to detect relationships or associations between specific values of categorical variables in large data sets. This technique allows analysts and researchers to uncover hidden patterns in large data sets. The classic example of an early association analysis found that beer tended to be sold with diapers, pointing to the cooccurrence of watching Monday Night Football and caring for family concerns at the same time. Variants like the a priori algorithm use predefined threshold values for detection of associations (see Agrawal et al., 1993; Agrawal and Srikant, 1994; Han et al., 2001; see also Witten and Frank, 2000 ). This algorithm is provided by SAS Enterprise Miner, IBM SPSS Modeler, KNIME, and STATISTICA Data Miner.
How association rules work. Assuming you have a record of each customer transaction at a large book store, you can perform an association analysis to determine which other book purchases are associated with the purchase of a given book. With this information in hand at the time of purchase, you could recommend to the customer a list of other books the customer may wish to purchase. Such an application of association analysis is called a "recommender engine." Such recommender engines are used at many online retail sites (like https://www.amazon.com/).
Association algorithms can be used to analyze simple categorical variables, dichotomous variables, and/or multiple target variables. The algorithm will determine association rules without requiring the user to specify the number of distinct categories present in the data or any prior knowledge regarding the maximum factorial degree or complexity of the important associations (except in the a priori variant). A form of cross tabulation table can be constructed without the need to specify the number of variables or categories. Hence, this technique is especially well suited for the analysis of huge data sets.
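For readers who want to experiment outside of these commercial packages, the same support and confidence thresholds can be applied with an open-source a priori implementation. The sketch below assumes the mlxtend library and a made-up set of book-basket transactions; it is an illustration, not the procedure used by any of the tools named above.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Made-up market-basket transactions for illustration only.
transactions = [
    ["book_a", "book_b", "book_c"],
    ["book_a", "book_b"],
    ["book_b", "book_c"],
    ["book_a", "book_c"],
    ["book_a", "book_b", "book_c"],
]

# One-hot encode the transactions into a Boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets above the predefined support threshold (the a priori pruning step).
frequent = apriori(onehot, min_support=0.4, use_colnames=True)

# Rules above a confidence threshold, reported with support, confidence, and lift.
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```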
Table 7.1 shows an example of a tabular representation of results from a STATISTICA Data Miner association rules algorithm.
Table 7.1. Word Correlations, Provided With Their Support and Confidence Values
Support is expressed by the joint probability of the body and head words occurring together; confidence is the conditional probability of the head given the body (see Chapter 1 for more information on joint and conditional probabilities).
Note that the rules in the results spreadsheet shown were sorted by the correlation column.
Graphic representations of association rules are shown in Figs. 7.2 and 7.3.
Fig. 7.2. Link graph for words spoken in All's Well That Ends Well. The thickness of the line linking words is a measure of the strength of the association.
From the Statistica/StatSoft free on-line textbook: http://www.statsoft.com/Textbook.
Fig. 7.3. Link graph showing the strength of association by the thickness of the line connecting the "body" and "head" words of some association rules.
From the Statistica/StatSoft free on-line textbook: http://www.statsoft.com/Textbook.
In Fig. 7.3, the support values for the body and head portions of each association rule are indicated by the sizes and colors of each. The thickness of each line indicates the confidence value (conditional probability of head given body) for the respective association rule; the sizes and colors of the circles in the center, above the implies label, indicate the joint support (for the cooccurrences) of the respective body and head components of the respective association rules.
Read full chapter: https://www.sciencedirect.com/science/article/pii/B9780124166325000074
Classification
Robert Nisbet, ... Gary Miner, in Handbook of Statistical Analysis and Data Mining Applications, 2009
Analyzing Imbalanced Data Sets with Machine Learning Programs
Imbalanced Data Sets
Many problems in data mining involve the analysis of rare patterns of occurrence. For example, responses from a sales campaign can be very rare (typically, about 1%). One remedy is simply to adjust the classification threshold to account for the imbalanced data set. However, models built with many neural net and decision tree algorithms are very sensitive to imbalanced data sets. This imbalance between the rare category (customer response) and the common category (no response) can cause significant bias toward the common category in resulting models.
A neural net learns one case at a time. The error minimization routine (e.g., backpropagation, described in Chapter 7) adjusts the weights one case at a time. This adjustment process will be dominated by the most frequent class. If the most frequent class is "0" 99% of the time, the learning process will be 99% biased toward recognition of any data pattern as a "0." Balancing data sets is necessary to balance the bias in the learning process. Clementine (for example) provides the ability to change the classification threshold in the Expert Options, and it provides the ability to balance the learning bias in the Balance node. If you have Clementine, run the same data set through the net with the threshold set appropriately. Then run the data through a Balance node (easily generated by a Distribution node), using the default threshold. Compare the results.
Another way to accomplish this balance is to weight the input cases appropriately. If the neural net can accept weights and use them appropriately in the error minimization process, the results can be comparable to using balanced data sets. STATISTICA does this. A final way to balance the learning bias is to adjust the prior probabilities of the "0" and "1" categories. SAS-Enterprise Miner and STATISTICA Data Miner use this approach.
If your data mining tool doesn't have one of the preceding methods for removing the learning bias, you may have to physically balance the data by resampling. But resampling can be done two ways: by increasing the sample rate of the rare category (oversampling) or reducing the sample rate of the common category (undersampling). Undersampling the common category eliminates some of the common signal pattern. If the data set is not large, it is better to oversample the rare category. That approach retains all of the signal pattern of the common category and just duplicates the signal pattern of the rare category.
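The sketch below illustrates, on synthetic data, the remedies discussed above: case weighting, oversampling the rare category, and adjusting the classification threshold. The 1% response rate, the logistic regression model, and the chosen threshold are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Synthetic campaign data: ~1% responders (class 1), 99% non-responders (class 0).
X, y = make_classification(n_samples=20000, n_features=10, weights=[0.99, 0.01],
                           random_state=0)

# Remedy 1: weight the rare class more heavily during fitting (analogous to case weights).
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Remedy 2: oversample the rare class so the training set is roughly balanced.
X_rare, y_rare = X[y == 1], y[y == 1]
X_common, y_common = X[y == 0], y[y == 0]
X_over, y_over = resample(X_rare, y_rare, replace=True,
                          n_samples=len(y_common), random_state=0)
X_bal = np.vstack([X_common, X_over])
y_bal = np.concatenate([y_common, y_over])
oversampled = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Remedy 3: keep the model as-is but lower the classification threshold from 0.5.
plain = LogisticRegression(max_iter=1000).fit(X, y)
threshold = 0.05                       # chosen to reflect the rare base rate
preds = (plain.predict_proba(X)[:, 1] >= threshold).astype(int)
print("responders flagged at lowered threshold:", preds.sum())
```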
Decision Trees
Interactive trees and boosted trees were introduced in Chapter 7. For purposes of classification, some general information on decision trees is presented here to help you apply these algorithms successfully to classification problems.
A decision tree is a hierarchical group of relationships organized into a tree-like structure, starting with one variable (like the trunk of an oak tree) called the root node. This root node is split into two or more branches, representing separate classes of the root node (if it is categorical) or specific ranges along the scale of the node (if it is continuous). At each split, a question is "asked," which has an answer in terms of the classes or range of the variable being split. One question might be, "Is this case a male or a female?" Questions like this would be used to build a decision tree with binary splits. Decision trees can also be built with multiple splits. Questions asked at each split are defined in terms of some impurity measure, reflecting how uniform the resulting cases must be in the splits. Each branch is split further using the classes or ranges of other variables. At each split, the node that is split is referred to as the parent node, and the nodes into which it is split are called the child nodes. This process continues until some stopping rule is satisfied, such as a minimum number of cases remaining in the final node (terminal leaf node) along one splitting pathway. This process is called recursive partitioning. Figure 11.1 shows an example of a decision tree for classifying colored balls.
Figure 11.1. A simple binary decision tree structure.
In the tree structure shown in the figure, the root node is the first parent. Child nodes 1a and 1b are the children of the first split, and they become the parents of children formed by the second split (e.g., Child 1a-1). Child nodes 1a-1, 1a-2, 1b-1, and 1b-2 are terminal leaf nodes. For simplicity, the terminal nodes of the other child nodes are not shown. To understand how this tree was built, we have to consider the questions asked and define the impurity measure and the stopping rule used.
The questions here might be:
- Q1 = What 2 groups would have at least 4 white balls and 4 black balls?
- Q2 = What 2 groups would have at least 4 balls, among which 2 are white?
- Q3 = What 2 groups would have balls of only two colors?
The impurity measure for this example might be the rule that no more than one ball may be of a different color. The stopping rule might be that there be a minimum of four balls in a group. The potential split at Child node 1b-2 would not be permitted due to the stopping rule. You might be thinking that it is possible to create multiple splits in this example, and you are right! But the structure of this decision tree is constrained to binary splits.
Decision trees are well suited for classification (particularly binary classification), but they are also useful in difficult estimation problems, where their simple piecewise-constant response surface and lack of smoothness constraints make them very tolerant of outliers. Trees are also probably the easiest model form to interpret (so long as they are small). The primary problem with decision trees is that they require a data volume that increases exponentially as the depth of the tree increases. Therefore, large data sets are required to fit complex patterns in the data. The other major problem involves multiple splits: deciding where to split a continuous variable's range is very problematic. Much research has been done to propose various methods for making multiple splits.
There are many forms of decision tree algorithms included in data mining tool packages, the most common of which are CHAID and Classification and Regression Trees (C&RT or CART). These basic algorithms are described in Chapter 7. Newer tree-based algorithms composed of groups of trees (such as random forests and boosted trees) are described in greater detail in Chapter 8.
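All of these implementations follow the same recursive partitioning procedure described above. As a minimal illustration outside of the commercial packages, the scikit-learn sketch below grows a small binary tree on made-up data, with the impurity measure and a minimum-cases stopping rule made explicit; the data set and parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up binary classification data standing in for the colored-ball example.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

tree = DecisionTreeClassifier(
    criterion="gini",        # impurity measure used to evaluate candidate splits
    min_samples_leaf=4,      # stopping rule: at least 4 cases in each terminal leaf node
    max_depth=3,             # limit tree depth; splits are binary
    random_state=1,
).fit(X_train, y_train)

# Each internal node is a parent; the nodes it splits into are its children.
print(export_text(tree))
print("holdout accuracy:", tree.score(X_test, y_test))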
Classification and Regression Trees (C&RT)
For purposes of classification, we should review some of the basic information about C&RT to understand how to use this algorithm effectively. The C&RT algorithm, popularized by Breiman et al. (1984), grew out of the development of a method for identifying high-risk patients at the University of California, San Diego Medical Center. For that purpose, the basic design of C&RT was that of a binary decision tree. The C&RT algorithm is a form of decision tree that can be used for either classification or estimation problems (like regression). Predictor variables can be nominal (text strings), ordinal (ordered strings, like first, second), or continuous (like real numbers).
Some initial settings in most C&RT algorithms are common to most classification procedures. The first setting is the prior probability of the target variables (frequency of the classes). Often, this is done behind the scenes in a C&RT implementation. Another important setting (usually made manually) is the measure of "impurity" to use in evaluating candidate split points. By default, this setting is the Gini score, but other methods such as twoing are available in many data mining packages. The Gini score is based on the relative frequency of subranges in the predictor variables. Twoing divides the cases into the best two subclasses and calculates the variance of each subclass used to define the split point. Sometimes, options for case weights and frequency weights are provided by the algorithm. Missing values in one variable are commonly handled by substituting the value of a surrogate variable, one with splitting characteristics similar to those of the variable with the missing value. The algorithm continues making splits along the ranges of the predictor variables until some stopping function is satisfied, like the variance of a node declining below a threshold level. When the splitting stops along a given branch, the unsplit node is termed the terminal node. Terminal nodes are assigned the most frequent class among the cases in that node.
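As a quick illustration of how the Gini score rates candidate split points, here is a minimal sketch with hypothetical class counts (the twoing criterion is not shown):

```python
def gini(counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(left_counts, right_counts):
    """Impurity of a candidate split: child impurities weighted by child size."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

# Parent node with 40 cases of class A and 40 of class B.
print(gini([40, 40]))                      # 0.5, the maximum for two classes
# A candidate split producing two fairly pure children scores much lower.
print(weighted_gini([35, 5], [5, 35]))     # 0.21875
```

The split point chosen at each step is the one with the lowest weighted child impurity, that is, the largest decrease in impurity relative to the parent node.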
C&RT trees are pruned most commonly by cross-validation. In this case, cross-validation selects either the smallest subtree whose error rate is within 1 standard error unit of the tree with the lowest error rate, or the subtree with that lowest error rate. The quality of the final tree is determined either as the one with the lowest error rate, when compared to test data, or the one with a stable error rate for new data sets.
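A hedged sketch of this pruning procedure, using cost-complexity pruning and cross-validation with the one-standard-error selection rule described above, is shown below; the synthetic data and scikit-learn settings are assumptions, not the C&RT defaults of any particular package.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=2)

# Candidate pruning strengths from the cost-complexity pruning path of a full tree.
path = DecisionTreeClassifier(random_state=2).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]          # drop the alpha that prunes the tree to a stump

# Cross-validated error rate for each pruned subtree.
means, sems = [], []
for a in alphas:
    scores = cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=2), X, y, cv=5)
    errors = 1.0 - scores
    means.append(errors.mean())
    sems.append(errors.std(ddof=1) / np.sqrt(len(errors)))

means, sems = np.array(means), np.array(sems)
best = means.argmin()
# One-standard-error rule: the most heavily pruned subtree whose error is within
# one standard error of the lowest cross-validated error.
within_1se = np.where(means <= means[best] + sems[best])[0]
chosen_alpha = alphas[within_1se.max()]   # larger alpha => smaller subtree
print("lowest-error alpha:", alphas[best], "  1-SE alpha:", chosen_alpha)
```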
Example
The Adult data set from the UCI data mining data set archive (http://archive.ics.uci.edu/ml/) was used to create a C&RT decision tree for classification. The data set contains the following:
Target:

- Income: <= $50,000/yr.; > $50,000/yr. (categorical)

Predictors:

- Age: continuous
- Workclass: Private, Self-emp-not-inc, … (categorical)
- Final weight: weights assigned within a state to correct for differences in demographic structure by age, race, and sex
- Education: Bachelors, Some-college, … (categorical)
- Education-years: continuous
- Marital-status: Married, Single, Divorced, … (categorical)
- Occupation: Teller, Mechanic, … (categorical)
- Relationship: Wife, Husband, … (categorical)
- Race: White, Black, Other, … (categorical)
- Sex: Female, Male (categorical)
- Capital-gain: continuous
- Capital-loss: continuous
- Hours-per-week: continuous
- Native-country: nationality (categorical)
Figure 11.2 shows a decision tree for four of these variables:
Figure 11.2. Decision tree structure.
- Capital-gain
- Age
- Marital-status
- Education-years
From the decision tree shown in Figure 11.2, we can induce general rules to predict who is likely to have a high income:
1. Persons with IRS capital gains deductions are likely to have an income > $50,000/yr.
2. Married persons with no capital gains deductions and who have greater than 12 years of education (college work) are likely to have incomes > $50,000/yr.
3. All other persons are likely to have incomes <= $50,000/yr.
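Restated as code, these induced rules form a simple cascade of conditions. The function below is a hypothetical restatement for illustration; in particular, treating any positive capital gain as "having a capital gains deduction" is an assumption.

```python
def predict_income(capital_gain, married, education_years):
    """Apply the three induced rules to one person; returns the predicted income class."""
    if capital_gain > 0:                     # Rule 1: any capital gains deduction
        return ">50K"
    if married and education_years > 12:     # Rule 2: married with college-level education
        return ">50K"
    return "<=50K"                           # Rule 3: everyone else

print(predict_income(capital_gain=5000, married=False, education_years=10))  # >50K
print(predict_income(capital_gain=0, married=True, education_years=16))      # >50K
print(predict_income(capital_gain=0, married=False, education_years=16))     # <=50K
```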
We can drill down a little deeper to evaluate how accurate the predictions are by constructing a classification matrix (Table 11.1). For the sake of this analysis, the >$50,000/yr category is considered the "positive" prediction.
TABLE 11.1. The Classification Matrix for the Prediction of Incomes > $50,000/yr. Using Age, Marital Status, Capital Gains, and Years of Education
Classification matrix 1 (adult_train_data.sta). Dependent variable: TARGET. Options: Categorical response, Analysis sample.

| | Observed | Predicted <=50K | Predicted >50K | Row Total |
|---|---|---|---|---|
| Number | <=50K | 21929 | 2791 | 24720 |
| Column Percentage | | 87.15% | 37.72% | |
| Row Percentage | | 88.71% | 11.29% | |
| Total Percentage | | 67.35% | 8.57% | 75.92% |
| Number | >50K | 3232 | 4609 | 7841 |
| Column Percentage | | 12.85% | 62.28% | |
| Row Percentage | | 41.22% | 58.78% | |
| Total Percentage | | 9.93% | 14.15% | 24.08% |
| Count | All Groups | 25161 | 7400 | 32561 |
| Total Percent | | 77.27% | 22.73% | |
According to the model evaluation criteria presented in Chapter 6, with the > $50,000/yr. category treated as the positive class, the Sensitivity of the model is 4,609/(4,609 + 3,232) × 100 = 58.78%, and the Specificity of the model is 21,929/(21,929 + 2,791) × 100 = 88.71%. Remember, the Sensitivity of the model measures how well it predicts incomes > $50,000/yr., and the Specificity of the model measures how well it predicts incomes <= $50,000/yr. Ideally, the Sensitivity value of 58.78% should be much closer to the Specificity value of 88.71%. This means that the model predicts low incomes better than it does high incomes: over 41% of the truly high incomes were predicted as low. The model might still work well if we wanted only to identify persons with a particularly high probability of having a high income, because 62.28% of the cases predicted as high income really are high income, well above the 24.08% base rate. This model has only four predictor variables (Age, Capital Gains, Marital Status, and Years of Education). It appears that our short list of variables is too short to do a good job of predicting high incomes. Maybe we should add more variables.
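The sketch below recomputes these figures directly from the counts in Table 11.1, with > $50,000/yr. treated as the positive class:

```python
# Counts from Table 11.1 (rows = observed, columns = predicted), >50K treated as positive.
tn, fp = 21929, 2791   # observed <=50K predicted as <=50K / >50K
fn, tp = 3232, 4609    # observed >50K  predicted as <=50K / >50K

sensitivity = tp / (tp + fn)          # how well the model predicts incomes > $50,000/yr.
specificity = tn / (tn + fp)          # how well the model predicts incomes <= $50,000/yr.
precision = tp / (tp + fp)            # share of predicted high incomes that really are high
base_rate = (tp + fn) / (tp + fn + tn + fp)

print(f"sensitivity = {sensitivity:.2%}")   # 58.78%
print(f"specificity = {specificity:.2%}")   # 88.71%
print(f"precision   = {precision:.2%}")     # 62.28%
print(f"base rate   = {base_rate:.2%}")     # 24.08%
```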
Table 11.2 shows the classification matrix for all the potential predictor variables in the data set (except for the Final_Weight variable).
TABLE 11.2. The Classification Matrix for the Prediction of Incomes > $50,000/yr. Using All Potential Predictor Variables
Classification matrix 1 (adult_train_data.sta). Dependent variable: TARGET. Options: Categorical response, Analysis sample.

| | Observed | Predicted <=50K | Predicted >50K | Row Total |
|---|---|---|---|---|
| Number | <=50K | 22502 | 2218 | 24720 |
| Column Percentage | | 85.98% | 34.72% | |
| Row Percentage | | 91.03% | 8.97% | |
| Total Percentage | | 69.11% | 6.81% | 75.92% |
| Number | >50K | 3670 | 4171 | 7841 |
| Column Percentage | | 14.02% | 65.28% | |
| Row Percentage | | 46.81% | 53.19% | |
| Total Percentage | | 11.27% | 12.81% | 24.08% |
| Count | All Groups | 26172 | 6389 | 32561 |
| Total Percent | | 80.38% | 19.62% | |
For the model with all predictor variables, the Specificity of 91.03% is slightly higher than the 88.71% figure for the model with four variables, but the Sensitivity drops from 58.78% to 53.19%. Which model is better? The best rule of thumb to follow in this situation is Occam's Razor: the simplest model is the best. Therefore, we might select the model with four predictor variables.
The lift chart for the simplest model is shown in Figure 11.3.
Figure 11.3. Lift chart for the simple model of four predictor variables.
The values in the cumulative lift chart shown in the figure have been normalized around the expected value of 1.00, the lift obtained by selecting cases at random rather than by model score. This form of the lift chart suggests that the model predicts high incomes better than random selection in the first eight deciles (up to 80% of the cases).
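A sketch of how such a cumulative lift chart is computed from a model's predicted probabilities is shown below; the synthetic data (with roughly a 24% positive rate) and the simple tree model are stand-ins, not the Adult-data model above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data and model; any classifier with predicted probabilities would do.
X, y = make_classification(n_samples=5000, n_features=8, weights=[0.76, 0.24], random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)
model = DecisionTreeClassifier(max_depth=4, random_state=3).fit(X_train, y_train)

# Sort test cases from highest to lowest predicted probability of the positive class.
scores = model.predict_proba(X_test)[:, 1]
order = np.argsort(-scores)
y_sorted = y_test[order]

baseline = y_test.mean()                       # overall rate of the positive class
deciles = np.array_split(np.arange(len(y_sorted)), 10)
for i, idx in enumerate(deciles, start=1):
    top = y_sorted[: idx[-1] + 1]              # cumulative: first i deciles of ranked cases
    lift = top.mean() / baseline               # cumulative lift; 1.00 = random selection
    print(f"decile {i}: cumulative lift = {lift:.2f}")
```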
Read full chapter: https://www.sciencedirect.com/science/article/pii/B9780123747655000115
Source: https://www.sciencedirect.com/topics/mathematics/enterprise-miner