import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)
gold_train = pd.read_csv("https://code.s3.yandex.net/datasets/gold_recovery_train.csv")
gold_train.head()
gold_train.info()
gold_test = pd.read_csv("https://code.s3.yandex.net/datasets/gold_recovery_test.csv")
gold_test.head()
gold_test.info()
gold_full = pd.read_csv("https://code.s3.yandex.net/datasets/gold_recovery_full.csv")
gold_full.head()
gold_full.info()
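The recovery calculation below implements the standard formula

$$\mathit{Recovery} = \frac{C \times (F - T)}{F \times (C - T)} \times 100\%$$

where $C$ is the share of gold in the concentrate after flotation (rougher.output.concentrate_au), $F$ is the share of gold in the feed (rougher.input.feed_au), and $T$ is the share of gold in the rougher tails (rougher.output.tail_au).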
C = gold_train["rougher.output.concentrate_au"]
F = gold_train["rougher.input.feed_au"]
T = gold_train["rougher.output.tail_au"]
gold_train["rougher.output.recovery.calculated"] = (C * (F - T)) / (F * (C - T)) * 100
# .mean() skips NaN rows, so the MAE is computed only where both values exist
mae = (gold_train["rougher.output.recovery.calculated"] - gold_train["rougher.output.recovery"]).abs().mean()
print("MAE:", mae)
gold_train.columns.difference(gold_test.columns)
gold_train = gold_train.ffill()
gold_test = gold_test.ffill()
gold_full = gold_full.ffill()
gold_full_merge = gold_full[[
    "date", "rougher.output.recovery", "final.output.recovery",
    "rougher.output.concentrate_au", "rougher.output.concentrate_ag",
    "rougher.output.concentrate_pb", "rougher.output.concentrate_sol",
    "final.output.concentrate_au", "final.output.concentrate_ag",
    "final.output.concentrate_pb", "final.output.concentrate_sol",
]]
gold_test = gold_test.merge(gold_full_merge, on="date", how="left")
gold_full_merge = gold_full_merge.drop(['date', 'rougher.output.recovery', 'final.output.recovery'], axis=1)
gold_train = gold_train.drop("date", axis=1)
gold_test = gold_test.drop("date", axis=1)
gold_full = gold_full.drop("date", axis=1)
Filled the NaN values in all three dataframes using ffill, which forward-fills each gap with the last valid observation (appropriate here because the rows are ordered by time). Merged the recovery targets and concentrate columns from the full dataframe into the test dataframe, which lacks them. Dropped the date column from all dataframes since it is not needed for model fitting.
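As a quick sanity check (an illustrative addition, not part of the original pipeline), the remaining NaN counts can be printed; ffill cannot fill a column's leading NaNs, since there is no earlier observation to copy:

# Count remaining NaNs per dataframe after the forward fill
for name, df in [("train", gold_train), ("test", gold_test), ("full", gold_full)]:
    print(name, "remaining NaNs:", df.isna().sum().sum())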
metal_au = gold_full[["rougher.input.feed_au", "rougher.output.concentrate_au", "primary_cleaner.output.concentrate_au", "final.output.concentrate_au"]]
metal_ag = gold_full[["rougher.input.feed_ag", "rougher.output.concentrate_ag", "primary_cleaner.output.concentrate_ag", "final.output.concentrate_ag"]]
metal_pb = gold_full[["rougher.input.feed_pb", "rougher.output.concentrate_pb", "primary_cleaner.output.concentrate_pb", "final.output.concentrate_pb"]]
fig, axes = plt.subplots(3, 1, figsize=(10, 20))
# sns.distplot is deprecated; sns.histplot is its modern replacement
for column in list(metal_au):
    sns.histplot(metal_au[column], ax=axes[0])
axes[0].set(title="Au", xlabel="Concentration %", ylabel="Amount")
for column in list(metal_ag):
    sns.histplot(metal_ag[column], ax=axes[1])
axes[1].set(title="Ag", xlabel="Concentration %", ylabel="Amount")
for column in list(metal_pb):
    sns.histplot(metal_pb[column], ax=axes[2])
axes[2].set(title="Pb", xlabel="Concentration %", ylabel="Amount")
fig.suptitle("Metal Concentrations at Purification Stages")
fig.legend(["rougher.input.feed", "rougher.output.concentrate", "primary_cleaner.output.concentrate", "final.output.concentrate"])
fig.show()
Plotted Au, Ag, and Pb metal concentrations at each purification stage.
The concentration of Au increases uniformly from stage to stage. The concentration of Ag rises and falls slightly across stages, resulting in a net decrease. The concentration of Pb increases slightly across stages, resulting in a net increase, with the final concentrate similar to the primary cleaner output concentrate.
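To back the visual reading with numbers, an illustrative check (not in the original analysis) of the mean concentration at each stage:

# Mean concentration per stage: Au should rise monotonically from feed to final concentrate
for name, metal in [("Au", metal_au), ("Ag", metal_ag), ("Pb", metal_pb)]:
    print(name)
    print(metal.mean().round(2))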
fig, axes = plt.subplots(2,1, figsize=(10, 15))
axes[0].hist(gold_train["primary_cleaner.input.feed_size"], density=True, alpha=0.5, bins=50)
axes[0].hist(gold_test["primary_cleaner.input.feed_size"], density=True, alpha=0.5, bins=50)
axes[0].set(title="Primary Cleaner Input Feed Size", xlabel="Amount", ylabel="Size")
axes[1].hist(gold_train["rougher.input.feed_size"], density=True, alpha=0.5, bins=50)
axes[1].hist(gold_test["rougher.input.feed_size"], density=True, alpha=0.5, bins=50)
axes[1].set(title="Rougher Input Feed Size", xlabel="Amount", ylabel="Size")
fig.suptitle("Feed Particle Size Distribution")
fig.legend(["Train Set", "Test Set"])
fig.show()
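For a quantitative complement to the visual comparison, a two-sample Kolmogorov-Smirnov test is one option; this is an illustrative sketch assuming scipy is available, not part of the original notebook:

from scipy.stats import ks_2samp

# A small KS statistic suggests the train and test feed-size distributions are similar
# (with samples this large, even tiny differences can yield small p-values)
for col in ["primary_cleaner.input.feed_size", "rougher.input.feed_size"]:
    stat, p = ks_2samp(gold_train[col].dropna(), gold_test[col].dropna())
    print(col, "KS statistic:", round(stat, 4), "p-value:", round(p, 4))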
def raw_feed(df):
return df["rougher.input.feed_au"] + df["rougher.input.feed_ag"] + df["rougher.input.feed_pb"] + df["rougher.input.feed_sol"]
def rougher_conc(df):
return df["rougher.output.concentrate_au"] + df["rougher.output.concentrate_ag"] + df["rougher.output.concentrate_pb"] + df["rougher.output.concentrate_sol"]
def final_conc(df):
return df["final.output.concentrate_au"] + df["final.output.concentrate_ag"] + df["final.output.concentrate_pb"] + df["final.output.concentrate_sol"]
gold_full["rougher.input.feed"] = raw_feed(gold_full)
gold_full["rougher.output.concentrate"] = rougher_conc(gold_full)
gold_full["final.output.concentrate"] = final_conc(gold_full)
total_conc = gold_full[["rougher.input.feed", "rougher.output.concentrate", "final.output.concentrate"]]
fig = plt.figure(figsize=(10, 6))
for column in list(total_conc):
    sns.histplot(total_conc[column])
plt.legend(list(total_conc))
plt.title("Total Concentration at Stages")
plt.xlabel("Concentration %")
plt.ylabel("Amount")
fig.show()
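The spike near zero in these histograms motivates the filtering applied below; as an illustrative addition, the number of rows at or below the threshold of 20 used there can be counted:

# Rows where the total concentration at any stage is abnormally low
low_total = (total_conc <= 20).any(axis=1)
print("Rows at or below 20:", low_total.sum())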
def final_smape(y, y_hat):
    # cross_validate passes predictions as a NumPy array, so convert both inputs
    # instead of relying on DataFrame-only .iloc indexing
    y = np.asarray(y)
    y_hat = np.asarray(y_hat)
    # Column 0: rougher.output.recovery
    rougher_num = np.abs(y[:, 0] - y_hat[:, 0])
    rougher_den = (np.abs(y[:, 0]) + np.abs(y_hat[:, 0])) / 2
    smape_rougher = np.mean(rougher_num / rougher_den) * 100
    # Column 1: final.output.recovery
    final_num = np.abs(y[:, 1] - y_hat[:, 1])
    final_den = (np.abs(y[:, 1]) + np.abs(y_hat[:, 1])) / 2
    smape_final = np.mean(final_num / final_den) * 100
    # Weighted combination: 25% rougher, 75% final
    return smape_rougher * 0.25 + smape_final * 0.75
smape_scorer = make_scorer(final_smape, greater_is_better=False)
The evaluation metric is defined by these formulas: $$\mathit{sMAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| y_i - \hat{y}_i \right|}{\left( \left| y_i \right| + \left|\hat{y}_i \right| \right) / 2} \times 100\%$$ $$\mathit{Final\ sMAPE} = 25\% \times \mathit{sMAPE}\left( \mathit{rougher} \right) + 75\% \times \mathit{sMAPE} \left( \mathit{final} \right)$$
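A tiny hand-checked example (an illustrative addition) confirming the scoring function matches the formulas:

# Rougher: |100 - 50| / ((100 + 50) / 2) * 100 = 66.67%; final: 0%
# Weighted score: 0.25 * 66.67 + 0.75 * 0 = 16.67 (roughly)
toy_y = np.array([[100.0, 100.0]])
toy_y_hat = np.array([[50.0, 100.0]])
print(final_smape(toy_y, toy_y_hat))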
gold_train["rougher.input.feed"] = raw_feed(gold_train)
gold_train["rougher.output.concentrate"] = rougher_conc(gold_train)
gold_train["final.output.concentrate"] = final_conc(gold_train)
gold_train = gold_train[(gold_train["rougher.input.feed"] > 20) & (gold_train["rougher.output.concentrate"] > 20) & (gold_train["final.output.concentrate"] > 20)]
gold_train = gold_train.drop(["rougher.input.feed", "rougher.output.concentrate", "final.output.concentrate"], axis=1)
gold_test["rougher.input.feed"] = raw_feed(gold_test)
gold_test["rougher.output.concentrate"] = rougher_conc(gold_test)
gold_test["final.output.concentrate"] = final_conc(gold_test)
gold_test = gold_test[(gold_test["rougher.input.feed"] > 20) & (gold_test["rougher.output.concentrate"] > 20) & (gold_test["final.output.concentrate"] > 20)]
gold_test = gold_test.drop(["rougher.input.feed", "rougher.output.concentrate", "final.output.concentrate"], axis=1)
gold_test = gold_test.drop(list(gold_full_merge.columns.values), axis=1)
gold_train = gold_train.loc[:, list(gold_test.columns)]
features_train = gold_train.drop(columns=["rougher.output.recovery", "final.output.recovery"])
target_train = gold_train[["rougher.output.recovery", "final.output.recovery"]]
feature_scaler = StandardScaler()
features_train = feature_scaler.fit_transform(features_train)
def smape_cv(model, features, target):
    # greater_is_better=False makes the scorer return negated values, so take the absolute value of the CV average
    return np.abs(np.average(cross_validate(model, features, target, scoring=smape_scorer)["test_score"]))
features_test = gold_test.drop(columns=["rougher.output.recovery", "final.output.recovery"])
target_test = gold_test[["rougher.output.recovery", "final.output.recovery"]]
print(target_test)
features_test = feature_scaler.transform(features_test)
state = np.random.RandomState(12345)
print("Decision Tree")
for depth in range(100, 501, 100):
model = DecisionTreeRegressor(max_depth=depth, random_state=state)
model.fit(features_train, target_train)
    # Store under a distinct name so the final_smape scoring function is not overwritten
    cv_smape = smape_cv(model, features_train, target_train)
    print("max_depth =", depth, ":", cv_smape)
print("Random Forest")
for estim in range(10, 51, 10):
model = RandomForestRegressor(n_estimators=estim, random_state=state)
model.fit(features_train, target_train)
    cv_smape = smape_cv(model, features_train, target_train)
    print("n_estimators =", estim, ":", cv_smape)
model = LinearRegression()
model.fit(features_train, target_train)
cv_smape = smape_cv(model, features_train, target_train)
print("Linear Regression", ":", cv_smape)
model = RandomForestRegressor(n_estimators=40, random_state=state)
model.fit(features_train, target_train)
# Evaluate on the held-out test set directly; cross-validating on the test set would refit the model on it
test_smape = final_smape(target_test, model.predict(features_test))
print("sMAPE :", test_smape)
constant_model_mean = target_train.mean()
# Index the means by label; positional indexing on a labeled Series is deprecated in pandas
constant_model_test = pd.DataFrame({
    "rougher.output.recovery": constant_model_mean["rougher.output.recovery"],
    "final.output.recovery": constant_model_mean["final.output.recovery"],
}, index=target_test.index)
constant_smape = final_smape(target_test, constant_model_test)
print("Constant model sMAPE:", constant_smape)
Dropped rows in the train and test sets where total concentration at any stage was at or below 20, removing abnormal values.
Trained and cross-validated a DecisionTreeRegressor at various depths.
Trained and cross-validated a RandomForestRegressor with various n_estimators.
Trained and cross-validated a LinearRegression model.
Trained the final model, a RandomForestRegressor with n_estimators=40, and evaluated it on the test set.
Determined the baseline sMAPE of a constant model predicting the mean.
The best model observed is the RandomForestRegressor with n_estimators=40, with a final sMAPE of 8%. The constant model predicting the mean scores 7.6%, which is very close and indicates that the machine learning model is not significantly better than the constant baseline. The constant model also outperforms every other model tested.
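As an aside, the same baseline can be reproduced with scikit-learn's DummyRegressor, which keeps the comparison inside the same evaluation code (an illustrative alternative, not part of the original notebook):

from sklearn.dummy import DummyRegressor

# strategy="mean" predicts the training-set mean of each target for every row,
# which is exactly the constant baseline constructed by hand above
dummy = DummyRegressor(strategy="mean")
dummy.fit(features_train, target_train)
print("DummyRegressor sMAPE:", final_smape(target_test, dummy.predict(features_test)))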