Scenario: You work for an insurance company that has many policy holders, and many agents who sell insurance to new customers every day. You have been asked to use historical data about past and current policy holders to build a decision tree that will be used by sales agents to determine the insurability of potential new clients. You will use two data sets to do this.
The Policy Holders data set contains information about current and past auto insurance customers, such as whether or not they have had a claim or ticket in the past 12 months or an accident in the past 36 months, how they pay for their policy, their gender and marital status, and the level of activity associated with their insurance account (this is Low, Moderate or High based on frequency of changes to the policy, frequency of late or partial payments, and other similar account activity). Note that the only variable in the Policy Holders data set that is not also in the Policy Buyers data set is the Insurance Category variable. This is the dependent variable that you will predict using a decision tree model.
For the Policy Holders, you have the benefit of hindsight, since your company did sell auto insurance policies to all of the people in this data set. Looking back on their activity as policy holders, each has been assigned one of four Insurance Category values: Insure-Best Terms, Insure-Risk Terms, Insure-High Premium, or Do Not Insure.
- The “Best Terms” customers are those who have paid their premiums and had no or few claims that have cost your company money. They are the lowest risk customers.
- The “Risk Terms” customers have been good for your company, but have had a few claims or incidents that have cost the company money. They are still a good risk for the company, but may have slightly higher premiums or lower coverage amounts in order to account for the higher risk to the company.
- The “High Premium” customers are those who have had a number of claims or other problems that have cost the company money (e.g., maybe they have not always paid their premiums on time or in full), but still have been worth insuring as long as they paid higher premiums than most of the other customers. They represent a higher risk for the company, and therefore must be sold policies at higher premiums and lower coverage.
- The “Do Not Insure” customers are those who have filed too many claims and/or claims that have cost more than what they have paid in premiums; or who have been unreliable in paying their premiums to the point where they cost the company more money than they pay in, and are therefore not a good risk for the company. They may have had their policies cancelled by the company due to excessive risk that the company cannot bear.
Complete the following steps:
- Download the PolicyHolders.csv and PolicyBuyers.csv files from Course Documents. In a Word document create a cover page for your Assignment, then provide evidence that you have imported both of these data sets into R with appropriate names.
- Use the rpart function in R to create a decision tree model for the Insurance Category dependent variable. Do not forget to load library(rpart). Provide evidence in Word that you have created the model.
- Using summary(<yourtreename>), identify the three most important independent variables used to predict Insurance Category. In Word, show evidence of the three top independent variables. Write a short explanation of your findings.
- Use Tools > Install Packages in the RStudio application menu to install the rpart.plot package. Once installed, load this package using library(rpart.plot). Then, use the prp function to visualize your decision tree. You may need to resize the Plots window in the lower right part of RStudio to make the tree large enough to read. In your prp function, include the following parameters: extra=4, faclen=0, varlen=0, cex=.75. The extra parameter displays the predicted probability of each Insurance Category (shown as proportions) in each node of your tree; faclen=0 causes the factor level names at the splits to be spelled out in full; varlen=0 causes the variable names and the predicted category labels to be spelled out in full; and cex sets the font size (you can experiment with this if you would like). In your Word document, include a screen capture of your visualized decision tree. Write a short explanation of how the values in each tree leaf would be interpreted.
- Make predictions for each of the policy buyers by applying your decision tree model to the Policy Buyers data set. When using the predict function in R, be sure to include the parameter type="class" so that you will generate an Insurance Category for each policy buyer. Using the Filter feature in RStudio, report the number of policy buyers that you predict will fall into each of the four categories. Be sure to label these clearly in your Word document. If you have done this step correctly, the numbers predicted for each category should total 473, which is the number of records in the Policy Buyers data set.
- Conduct research about how the insurance industry uses analytics to manage risk as they insure their customers. Write a brief summary of your research (1–2 paragraphs) discussing how the insurance industry uses analytics. Be sure to include discussion of both legal and ethical ramifications for the industry in their use of analytics. Cite your sources both in the text and in a references page.
Screenshots need to be included for each step in RStudio.
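As a rough illustration of how the R steps above fit together, the following sketch runs the same workflow on a small synthetic data set. Everything about the data is an assumption made for illustration: the column names (ClaimPast12Mo, AccidentPast36Mo, AccountActivity, InsuranceCategory), the toy labelling rule, and the 50-row buyers table are hypothetical and will not match the real PolicyHolders.csv and PolicyBuyers.csv files. It is not a model answer, only a sketch of the function calls involved.

```r
library(rpart)

set.seed(42)  # make the toy data reproducible

# Hypothetical stand-in for the policy holders data.
# With the real file you would instead use something like:
#   holders <- read.csv("PolicyHolders.csv", stringsAsFactors = TRUE)
n <- 400
holders <- data.frame(
  ClaimPast12Mo    = factor(sample(c("Yes", "No"), n, replace = TRUE)),
  AccidentPast36Mo = factor(sample(c("Yes", "No"), n, replace = TRUE)),
  AccountActivity  = factor(sample(c("Low", "Moderate", "High"), n, replace = TRUE))
)

# Toy labelling rule (an assumption, not the company's real rule):
# count the risk signals (0 to 3) and map the count to a category.
risk <- (holders$ClaimPast12Mo == "Yes") +
  (holders$AccidentPast36Mo == "Yes") +
  (holders$AccountActivity == "High")
categories <- c("Insure-Best Terms", "Insure-Risk Terms",
                "Insure-High Premium", "Do Not Insure")
holders$InsuranceCategory <- factor(categories[risk + 1], levels = categories)

# Hypothetical stand-in for the policy buyers data:
# the same predictor columns, but no InsuranceCategory column.
buyers <- holders[sample(n, 50), setdiff(names(holders), "InsuranceCategory")]

# Fit a classification tree for the dependent variable.
tree <- rpart(InsuranceCategory ~ ., data = holders, method = "class")

# summary(tree) prints a "Variable importance" section; the same
# values are stored on the fitted object, sorted from high to low.
top3 <- head(tree$variable.importance, 3)
print(top3)

# Predict one category per buyer and count the predictions per category.
predicted <- predict(tree, newdata = buyers, type = "class")
counts <- table(predicted)
print(counts)

# Draw the tree only if rpart.plot has been installed.
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::prp(tree, extra = 4, faclen = 0, varlen = 0, cex = 0.75)
}
```

The table of counts is only a cross-check on the totals obtained with the Filter feature: the counts should sum to the number of rows in the buyers data (50 in this toy example, 473 in the real Policy Buyers data set).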