An example of predictive modelling using Decision trees, Python - step by step
The aim is to see how accurately your system can predict the future. All around the globe, organizations and enterprises are building applications that read and understand the psyche of buyers, and thereby predict what they are likely to buy in the next 3-4 months. This is exactly how you get 'Customers who bought this also bought that' on Amazon or 'People similar to your profile' on LinkedIn. The idea is to feed your system more and more data. The more data you pour in to train your machine, the easier it generally gets for the machine to understand the pattern, and the more accurate the result tends to be.
A typical machine learning process consists of the following steps:
Import the data: This is where we feed the data into the system.
Clean the data: An essential part of the process. The data is cleaned beforehand to get rid of duplicate records, records with null values, etc. Each project has a different modus operandi for cleaning the data.
Split the data into training and test sets: Here we split the original dataset into two parts: one for training the model, and the other for testing it.
Create a model: Here we select which modelling algorithm best suits our data. Luckily, we have neural networks, decision trees, etc. to choose from.
Train the model: Here we fit the model on the training data, giving it both the input and the output values.
Make predictions: Here we ask the trained model for predictions and store them in variables for later perusal.
Evaluate and improve: We analyze whether the outcome is as per expectations. If not, we might need to change the model or fine-tune its parameters.
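The seven steps above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the rows are hypothetical and built in memory (not the survey data used later), the column names age, gender and genre are assumed, and the 25% test split and random_state=0 are arbitrary choices.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Import the data (hypothetical rows; a real project would read a file)
raw = pd.DataFrame({
    "age": [15, 15, 18, 20, 22, 30, 30, 31, 35, None, 40],
    "gender": [1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
    "genre": ["Cartoon", "Cartoon", "Cartoon", "Cartoon", "Cartoon",
              "Family Drama", "Family Drama", "Family Drama", "Family Drama",
              "Cartoon", "Family Drama"],
})

# 2. Clean the data: drop rows with null values and exact duplicates
clean = raw.dropna().drop_duplicates()

# 3. Split into training and test sets
X = clean.drop(columns=["genre"])
y = clean["genre"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 4. Create a model, 5. train it on the training set only
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# 6. Make predictions on the held-out rows and keep them in a variable
predictions = model.predict(X_test)

# 7. Evaluate: fraction of test rows predicted correctly
score = accuracy_score(y_test, predictions)
print(f"accuracy on the test set: {score:.2f}")
```

With so few rows the accuracy figure means little; the point is only the order of the steps and which scikit-learn call belongs to each.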
In this example, we are considering a dataset like this:
Look at the data above. It's a survey from Amazon Prime that asks for your details during subscription: your gender, your age, your movie preferences. Let us assume that it stores all the data in a simple CSV file of three columns: Age, Gender (1 for male, 0 for female), and Movie Genre. A male of 15 years likes Cartoon movies, while one who is 30 years old likes Family Drama. A girl who is 20 years old has been found to like Cartoon movies, whereas a woman around 30 likes Family Drama.
We would like to see what a man of 28 would like to see, and what a woman of 32 would love to watch.
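As a rough picture of such a file, here is a minimal sketch that parses a tiny CSV with pandas. The four rows mirror only the four cases described above and are not the real survey data; the header names age and gender are assumptions (only 'genre' appears in the code later on), and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content: one row per case described in the text.
# gender: 1 = male, 0 = female
csv_text = """age,gender,genre
15,1,Cartoon
30,1,Family Drama
20,0,Cartoon
30,0,Family Drama
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

Each row pairs two input values (age and gender) with one output value (genre), which is the shape a classifier expects.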
Here we go:
a. Launch Jupyter Notebook from the Anaconda prompt with the following command: jupyter notebook.
b. You can import the Pandas module with: import pandas as pd. For the above example, we will be using decision trees. For that, we import from scikit-learn: from sklearn.tree import DecisionTreeClassifier. Here DecisionTreeClassifier is the class that contains the algorithm.
c. Your CSV file should ideally reside in the same folder as your ipynb file, as otherwise you have to give the full path:
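import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Load the survey data into a DataFrame
df = pd.read_csv('movie-genres.csv')
# S: input columns (everything except 'genre'); o: the output column
S = df.drop(columns=['genre'])
o = df['genre']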
Here, movie-genres.csv is our file under inspection. df is the DataFrame obtained from reading the file. 'S' holds the source (input) columns, which we create by dropping the column 'genre', and 'o' is the output, which is just the 'genre' column.
Finally, we train the model:
model = DecisionTreeClassifier()
model.fit(S, o)
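Once the model is fitted, the question posed earlier (a man of 28, a woman of 32) can be put to it with predict. The sketch below is self-contained, so it rebuilds a hypothetical four-row dataset in memory instead of reading movie-genres.csv; the column names age and gender are assumptions, and the real file may well give different answers.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for movie-genres.csv (gender: 1 = male, 0 = female)
df = pd.DataFrame({
    "age": [15, 30, 20, 30],
    "gender": [1, 1, 0, 0],
    "genre": ["Cartoon", "Family Drama", "Cartoon", "Family Drama"],
})
S = df.drop(columns=["genre"])  # input columns
o = df["genre"]                 # output column

model = DecisionTreeClassifier()
model.fit(S, o)

# One row per person to predict for, using the same columns as S
queries = pd.DataFrame({"age": [28, 32], "gender": [1, 0]})
predictions = model.predict(queries)
print(predictions)
```

In these toy rows only age separates the two genres, so both queries fall on the same side of the tree as the 30-year-olds and come out as Family Drama.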