Random Forest (Python)

What is a random forest?

Random Forests can be applied to almost any prediction problem (even non-linear ones). It is a relatively recent machine learning technique (developed at Bell Labs in the 1990s) and can be used for almost anything. It belongs to a large family of machine learning algorithms: ensemble learning methods.

Ensemble learning

Ensemble learning solves a single prediction problem by building and combining several models. It works by generating multiple classifiers/models, each of which learns and makes predictions independently. These predictions are then combined into a single prediction, which is therefore better than the prediction of any individual classifier.
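To make the idea concrete, here is a minimal sketch (not part of the original article) of combining several independently trained classifiers by majority vote; the models and the toy dataset are chosen purely for illustration:

# Minimal sketch of ensemble learning: several models vote on a single prediction (illustrative only).
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)  # toy data, assumed for the example
models = [LogisticRegression(max_iter=1000), GaussianNB(), DecisionTreeClassifier(random_state=0)]
for m in models:
    m.fit(X, y)  # each model learns independently

def vote(x):
    # combine the individual predictions into a single prediction by majority vote
    preds = [m.predict([x])[0] for m in models]
    return Counter(preds).most_common(1)[0][0]

print(vote(X[0]))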

Process


When building each decision tree there are two things to pay attention to: sampling and full splitting.

First come two random sampling steps: random forest samples both the rows and the columns of the input data. Row sampling is done with replacement, so the sampled set may contain duplicate rows; if there are N input samples, N samples are drawn. As a result, each tree is trained on something other than the complete set of samples, which makes over-fitting less likely. Column sampling then selects m features out of the M available (with m << M).

After that, a decision tree is built by splitting the sampled data completely, until a leaf node cannot be split any further or all the samples in it belong to the same class. Many decision tree algorithms include an important pruning step, but it is not done here: because the two random sampling steps already provide enough randomness, over-fitting does not occur even without pruning.
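As a rough sketch of these two sampling steps (an illustration under assumed sizes, not the implementation given later in this article), each tree receives a bootstrap sample of the rows and a random subset of m of the M columns:

# Sketch of the two random sampling steps performed for each tree (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
N, M, m = 100, 20, 4                              # N samples, M features, m << M features per tree
X = rng.normal(size=(N, M))                       # stand-in training data

row_idx = rng.integers(0, N, size=N)              # row sampling with replacement: N draws, duplicates allowed
col_idx = rng.choice(M, size=m, replace=False)    # column sampling: m of the M features

X_tree = X[row_idx][:, col_idx]                   # the data this particular tree is grown on
print(X_tree.shape)                               # (100, 4)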

Each individual tree in the resulting random forest is very weak, but together they are very powerful. The random forest algorithm can be compared to this: each decision tree is an expert in a narrow field (because we choose m of the M features for each tree to learn from), so the forest contains many experts who are each proficient in a different area. A new problem (new input data) can then be looked at from different angles, and in the end the experts vote to produce the result.

Advantages

  • Performs well on many datasets

  • On many current datasets it has a clear advantage over other algorithms

  • It can handle very high-dimensional data (many features) without feature selection

  • After training, it can report which features are most important (see the sketch after this list)

  • While the forest is being built, an unbiased (out-of-bag) estimate of the generalization error is obtained

  • Training is fast

  • Interactions between features can be detected during training

  • Easy to parallelize

  • Relatively simple to implement
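For the points about feature importance and the generalization-error estimate, here is a small sketch of how both can be read from a fitted sklearn RandomForestClassifier (the dataset and parameters are assumptions made for illustration; oob_score_ is sklearn's out-of-bag estimate of the accuracy):

# Sketch: feature importances and the out-of-bag (OOB) error estimate (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

print(forest.feature_importances_)   # which features matter most
print(1.0 - forest.oob_score_)       # out-of-bag estimate of the generalization error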

Using Python's sklearn package to implement Titanic survival prediction on Kaggle

#coding:utf-8
import pandas as pd
import numpy as np
import csv as csv
from sklearn.ensemble import RandomForestClassifier

train_df = pd.read_csv(r'D:\PythonDDD\shuju files\tantanic\train.csv', header=0)  # row 0 is the header
train_df['Gender'] = train_df['Sex'].map({'female': 0, 'male': 1}).astype(int)  # encode Sex as 0/1; astype converts the column type
if len(train_df.Embarked[train_df.Embarked.isnull()]) > 0:  # fill missing embarkation ports
    train_df.loc[train_df.Embarked.isnull(), 'Embarked'] = train_df.Embarked.dropna().mode().values[0]  # replace nulls with the mode
Ports = list(enumerate(np.unique(train_df['Embarked'])))  # unique embarkation ports paired with an index
print(Ports)
Ports_dict = {name: i for i, name in Ports}  # dictionary in the form port : index
train_df.Embarked = train_df.Embarked.map(lambda x: Ports_dict[x]).astype(int)  # replace each port string by its index
median_age = train_df['Age'].dropna().median()  # median age
if len(train_df.Age[train_df.Age.isnull()]) > 0:
    train_df.loc[train_df.Age.isnull(), 'Age'] = median_age  # fill missing ages with the median
train_df = train_df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'PassengerId'], axis=1)  # remove unneeded columns

test_df = pd.read_csv(r'D:\PythonDDD\shuju files\tantanic\test.csv', header=0)  # load the test file into a dataframe
test_df['Gender'] = test_df['Sex'].map({'female': 0, 'male': 1}).astype(int)
if len(test_df.Embarked[test_df.Embarked.isnull()]) > 0:
    test_df.loc[test_df.Embarked.isnull(), 'Embarked'] = test_df.Embarked.dropna().mode().values[0]
test_df.Embarked = test_df.Embarked.map(lambda x: Ports_dict[x]).astype(int)
median_age = test_df['Age'].dropna().median()
if len(test_df.Age[test_df.Age.isnull()]) > 0:
    test_df.loc[test_df.Age.isnull(), 'Age'] = median_age
if len(test_df.Fare[test_df.Fare.isnull()]) > 0:
    median_fare = np.zeros(3)  # one median fare per passenger class
    for f in range(0, 3):  # loop over classes 1 to 3
        median_fare[f] = test_df[test_df.Pclass == f + 1]['Fare'].dropna().median()
    for f in range(0, 3):
        test_df.loc[(test_df.Fare.isnull()) & (test_df.Pclass == f + 1), 'Fare'] = median_fare[f]
ids = test_df['PassengerId'].values
test_df = test_df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'PassengerId'], axis=1)

train_data = train_df.values
test_data = test_df.values

print('Training...')
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data[0::, 1::], train_data[0::, 0])  # columns 1.. are the features, column 0 is the label (Survived)
print('Predicting...')
output = forest.predict(test_data).astype(int)

predictions_file = open(r'D:\PythonDDD\shuju files\tantanic\myfirstforest.csv', 'w', newline='')
open_file_object = csv.writer(predictions_file)
open_file_object.writerow(['PassengerId', 'Survived'])
open_file_object.writerows(zip(ids, output))
predictions_file.close()
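Before predicting on the test set, it can also be useful to estimate the model's accuracy on the training data. A possible addition (not part of the original script) using sklearn's cross-validation would look like this:

# Optional: 5-fold cross-validated accuracy on the training data (illustrative only).
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(forest, train_data[0::, 1::], train_data[0::, 0], cv=5)
print('CV accuracy: %.3f' % cv_scores.mean())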

Implementing the random forest principle from scratch in Python
#coding:utf-8
import csv
from random import seed
from random import randrange
from math import sqrt


def loadCSV(filename):  # load the data, storing it as a list of rows
    dataSet = []
    with open(filename, 'r') as file:
        csvReader = csv.reader(file)
        for line in csvReader:
            dataSet.append(line)
    return dataSet


# Convert every column except the label column to float
def column_to_float(dataSet):
    featLen = len(dataSet[0]) - 1
    for data in dataSet:
        for column in range(featLen):
            data[column] = float(data[column].strip())


# Randomly divide the dataset into n_folds blocks for cross-validation:
# one block serves as the test set and the remaining blocks form the training set
def spiltDataSet(dataSet, n_folds):
    fold_size = int(len(dataSet) / n_folds)
    dataSet_copy = list(dataSet)
    dataSet_spilt = []
    for i in range(n_folds):
        fold = []
        while len(fold) < fold_size:  # "while" keeps drawing until the fold is full; an "if" would only check once
            index = randrange(len(dataSet_copy))
            fold.append(dataSet_copy.pop(index))  # pop() removes the element at index and returns it
        dataSet_spilt.append(fold)
    return dataSet_spilt


# Construct a data subset (bootstrap sample)
def get_subsample(dataSet, ratio):
    subdataSet = []
    lenSubdata = round(len(dataSet) * ratio)  # size of the subsample
    while len(subdataSet) < lenSubdata:
        index = randrange(len(dataSet))
        subdataSet.append(dataSet[index])
    return subdataSet


# Split the dataset on a feature index and value
def data_spilt(dataSet, index, value):
    left = []
    right = []
    for row in dataSet:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right


# Calculate the split cost (Gini impurity)
def spilt_loss(left, right, class_values):
    loss = 0.0
    for class_value in class_values:
        left_size = len(left)
        if left_size != 0:  # prevent division by zero
            prop = [row[-1] for row in left].count(class_value) / float(left_size)
            loss += (prop * (1.0 - prop))
        right_size = len(right)
        if right_size != 0:
            prop = [row[-1] for row in right].count(class_value) / float(right_size)
            loss += (prop * (1.0 - prop))
    return loss


# Pick n_features features at random, and among them choose the best feature/value to split on
def get_best_spilt(dataSet, n_features):
    features = []
    class_values = list(set(row[-1] for row in dataSet))
    b_index, b_value, b_loss, b_left, b_right = 999, 999, 999, None, None
    while len(features) < n_features:
        index = randrange(len(dataSet[0]) - 1)
        if index not in features:
            features.append(index)
    for index in features:  # find the column index that gives the lowest split cost
        for row in dataSet:
            left, right = data_spilt(dataSet, index, row[index])  # split on this feature/value pair
            loss = spilt_loss(left, right, class_values)
            if loss < b_loss:  # keep the split with the minimum cost
                b_index, b_value, b_loss, b_left, b_right = index, row[index], loss, left, right
    return {'index': b_index, 'value': b_value, 'left': b_left, 'right': b_right}


# Determine the output label of a leaf (majority class)
def decide_label(data):
    output = [row[-1] for row in data]
    return max(set(output), key=output.count)


# Recursive splitting: keep building left and right children until a stopping condition is met
def sub_spilt(root, n_features, max_depth, min_size, depth):
    left = root['left']
    right = root['right']
    del (root['left'])
    del (root['right'])
    if not left or not right:
        root['left'] = root['right'] = decide_label(left + right)
        return
    if depth > max_depth:
        root['left'] = decide_label(left)
        root['right'] = decide_label(right)
        return
    if len(left) < min_size:
        root['left'] = decide_label(left)
    else:
        root['left'] = get_best_spilt(left, n_features)
        sub_spilt(root['left'], n_features, max_depth, min_size, depth + 1)
    if len(right) < min_size:
        root['right'] = decide_label(right)
    else:
        root['right'] = get_best_spilt(right, n_features)
        sub_spilt(root['right'], n_features, max_depth, min_size, depth + 1)


# Construct a decision tree
def build_tree(dataSet, n_features, max_depth, min_size):
    root = get_best_spilt(dataSet, n_features)
    sub_spilt(root, n_features, max_depth, min_size, 1)
    return root


# Predict the label of one row with one tree
def predict(tree, row):
    if row[tree['index']] < tree['value']:
        if isinstance(tree['left'], dict):
            return predict(tree['left'], row)
        else:
            return tree['left']
    else:
        if isinstance(tree['right'], dict):
            return predict(tree['right'], row)
        else:
            return tree['right']


# Combine the trees' predictions by majority vote
def bagging_predict(trees, row):
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions), key=predictions.count)


# Create the random forest
def random_forest(train, test, ratio, n_features, max_depth, min_size, n_trees):
    trees = []
    for i in range(n_trees):
        subTrain = get_subsample(train, ratio)  # draw a fresh subset of the training data for each tree
        tree = build_tree(subTrain, n_features, max_depth, min_size)
        trees.append(tree)
    predict_values = [bagging_predict(trees, row) for row in test]
    return predict_values


# Calculate the accuracy
def accuracy(predict_values, actual):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predict_values[i]:
            correct += 1
    return correct / float(len(actual))


if __name__ == '__main__':
    seed(1)
    dataSet = loadCSV('sonar-all-data.csv')
    column_to_float(dataSet)
    n_folds = 5
    max_depth = 15
    min_size = 1
    ratio = 1.0
    # n_features=sqrt(len(dataSet)-1)
    n_features = 15
    n_trees = 10
    folds = spiltDataSet(dataSet, n_folds)  # first split the dataset into folds
    scores = []
    for fold in folds:
        # folds[:] makes a copy; train_set = folds would only be a reference,
        # so removing the current fold would also change folds itself
        train_set = folds[:]
        train_set.remove(fold)  # the remaining folds form the training set
        train_set = sum(train_set, [])  # merge the fold lists into one train_set list
        test_set = []
        for row in fold:
            row_copy = list(row)
            row_copy[-1] = None  # hide the label from the model
            test_set.append(row_copy)
        actual = [row[-1] for row in fold]
        predict_values = random_forest(train_set, test_set, ratio, n_features, max_depth, min_size, n_trees)
        accur = accuracy(predict_values, actual)
        scores.append(accur)
    print('Trees is %d' % n_trees)
    print('scores:%s' % scores)
    print('mean score:%s' % (sum(scores) / float(len(scores))))
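One note on the parameters in the main block: the commented-out line sets n_features to the square root of the number of rows, while the usual heuristic for classification forests is to consider roughly the square root of the number of feature columns at each split. A possible variant (an assumption, not tuned for the sonar data) would be:

# Common heuristic: roughly sqrt(M) candidate features per split (illustrative only).
n_features = int(sqrt(len(dataSet[0]) - 1))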




Tags: random forest, Python, data analysis, machine learning, data mining
