
Starting Machine Learning

26 April 2020

Documenting my journey of learning machine learning. Starting off with a mini project I found online. The goal is to train a program to identify the species of an Iris based on its morphological traits. Each flower belongs to one of three species: Iris-setosa, Iris-versicolor, and Iris-virginica.

Step 1: Make sure all required modules are installed and can be imported. Modules I use:

  • sklearn
  • scipy
  • pandas
  • matplotlib
  • numpy
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
# and so forth with the other modules

Step 2: Load Dataset.

from pandas import read_csv

# url should point at the Iris CSV file
names = ['sepal-length', 'sepal-width', \
'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

Step 3: Visualize Data:

from pandas.plotting import scatter_matrix
from matplotlib import pyplot

scatter_matrix(dataset)
pyplot.show()

Which produces a scatter-plot matrix of every pair of features.

Step 4: Split the Data Set into a training and test set.

from sklearn.model_selection import train_test_split

array = dataset.values
x = array[:, 0:4]  # the four measurement columns
y = array[:, 4]    # the class column
x_train, x_test, y_train, y_test = train_test_split(x, y, \
  test_size=0.20, random_state=1)
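To see what the 80/20 split does to the data, here is a minimal sketch on hypothetical dummy data (not the Iris set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten samples with two features each, and ten labels
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

# test_size=0.20 holds back 2 of the 10 samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=1)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```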

Step 5: Create different models

Gaussian Naive Bayes (NB) applies Bayes' theorem, combining the prior probability of each class with a Gaussian likelihood for each feature, to predict the most probable class.

from sklearn.naive_bayes import GaussianNB

models = []
models.append(('NB', GaussianNB()))
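As a minimal sketch of the idea on hypothetical toy data (not the Iris set), GaussianNB estimates a prior for each class and a per-feature Gaussian likelihood from the training data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Four one-dimensional samples in two well-separated classes
X = np.array([[0.0], [0.2], [2.0], [2.2]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB()
clf.fit(X, y)

# Two of four samples per class, so each prior is 0.5
print(clf.class_prior_)  # [0.5 0.5]

# A point near the first cluster is assigned class 0
pred = clf.predict([[0.1]])
print(pred)  # [0]
```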

Logistic Regression is conceptually similar to linear regression, but a logistic (sigmoid) model is fit to produce a categorical output.

from sklearn.linear_model import LogisticRegression

models.append(('LR', LogisticRegression(solver='liblinear', \
multi_class='ovr')))

Linear Discriminant Analysis (LDA) is similar to PCA, but instead of maximizing variance, it maximizes the separation between known categories for identification.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

models.append(('LDA', LinearDiscriminantAnalysis()))

K-nearest Neighbors (KNN) predicts the class of a sample from the classes of its k closest training neighbors.

from sklearn.neighbors import KNeighborsClassifier

models.append(('KNN', KNeighborsClassifier()))
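A toy sketch of the neighbor vote, on hypothetical data rather than the Iris set:

```python
from sklearn.neighbors import KNeighborsClassifier

# Six one-dimensional points in two obvious clusters
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = [0, 0, 0, 1, 1, 1]

# With k=3, each query is labelled by the majority class
# among its three nearest training points
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
pred = knn.predict([[0.15], [1.05]])
print(pred)  # [0 1]
```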

Decision Trees employ a tree of decision nodes, each testing one feature, to create a prediction.

from sklearn.tree import DecisionTreeClassifier

models.append(('CART', DecisionTreeClassifier()))

Support Vector Machines (SVM) find the boundary that separates the classes with the widest possible margin.

from sklearn.svm import SVC

models.append(('SVM', SVC(gamma='auto')))

Step 6: Test Models

from sklearn.model_selection import StratifiedKFold, cross_val_score

results = []
model_names = []  # separate from the column-name list above
for name, model in models:
	kfold = StratifiedKFold(n_splits=10, random_state=1,\
        shuffle=True)
	cv_results = cross_val_score(model, x_train, y_train,\
        cv=kfold, scoring='accuracy')
	results.append(cv_results)
	model_names.append(name)
	print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

Results:
NB: 0.951166 (0.052812)
LR: 0.955909 (0.044337)
LDA: 0.975641 (0.037246)
KNN: 0.950524 (0.040563)
CART: 0.958858 (0.053754)
SVM: 0.983333 (0.033333)
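Since SVM scored best in cross-validation, a natural follow-up is to refit it on the full training split and measure accuracy on the held-out test set, which is what the split in Step 4 was for. A self-contained sketch, assuming scikit-learn's bundled copy of the Iris data in place of the CSV loaded above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Same Iris measurements and species labels, from sklearn's bundled copy
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=1)

# Refit the best cross-validated model on all training data
model = SVC(gamma='auto')
model.fit(x_train, y_train)

# Accuracy on the 30 unseen test samples
predictions = model.predict(x_test)
print(accuracy_score(y_test, predictions))
```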