Integrated Machine Learning Project

Completed by Sonya Wach for Practicum.Yandex

The project consists of analyzing company data of various metal concentrations at different stages of extraction and purification from ore. A machine learning model is built to predict the amount of gold recovered from gold ore which will therefore improve efficiency and help the company to optimize the production and eliminate unprofitable parameters.

1. Prepare the data

1.1. Open the files and look into the data.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
In [2]:
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)
In [3]:
gold_train = pd.read_csv("https://code.s3.yandex.net/datasets/gold_recovery_train.csv")
gold_train.head()
Out[3]:
date final.output.concentrate_ag final.output.concentrate_pb final.output.concentrate_sol final.output.concentrate_au final.output.recovery final.output.tail_ag final.output.tail_pb final.output.tail_sol final.output.tail_au ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 2016-01-15 00:00:00 6.055403 9.889648 5.507324 42.192020 70.541216 10.411962 0.895447 16.904297 2.143149 ... 14.016835 -502.488007 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980
1 2016-01-15 01:00:00 6.029369 9.968944 5.257781 42.701629 69.266198 10.462676 0.927452 16.634514 2.224930 ... 13.992281 -505.503262 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184
2 2016-01-15 02:00:00 6.055926 10.213995 5.383759 42.657501 68.116445 10.507046 0.953716 16.208849 2.257889 ... 14.015015 -502.520901 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363
3 2016-01-15 03:00:00 6.047977 9.977019 4.858634 42.689819 68.347543 10.422762 0.883763 16.532835 2.146849 ... 14.036510 -500.857308 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129
4 2016-01-15 04:00:00 6.148599 10.142511 4.939416 42.774141 66.927016 10.360302 0.792826 16.525686 2.055292 ... 14.027298 -499.838632 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691

5 rows × 87 columns

In [4]:
gold_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16860 entries, 0 to 16859
Data columns (total 87 columns):
date                                                  16860 non-null object
final.output.concentrate_ag                           16788 non-null float64
final.output.concentrate_pb                           16788 non-null float64
final.output.concentrate_sol                          16490 non-null float64
final.output.concentrate_au                           16789 non-null float64
final.output.recovery                                 15339 non-null float64
final.output.tail_ag                                  16794 non-null float64
final.output.tail_pb                                  16677 non-null float64
final.output.tail_sol                                 16715 non-null float64
final.output.tail_au                                  16794 non-null float64
primary_cleaner.input.sulfate                         15553 non-null float64
primary_cleaner.input.depressant                      15598 non-null float64
primary_cleaner.input.feed_size                       16860 non-null float64
primary_cleaner.input.xanthate                        15875 non-null float64
primary_cleaner.output.concentrate_ag                 16778 non-null float64
primary_cleaner.output.concentrate_pb                 16502 non-null float64
primary_cleaner.output.concentrate_sol                16224 non-null float64
primary_cleaner.output.concentrate_au                 16778 non-null float64
primary_cleaner.output.tail_ag                        16777 non-null float64
primary_cleaner.output.tail_pb                        16761 non-null float64
primary_cleaner.output.tail_sol                       16579 non-null float64
primary_cleaner.output.tail_au                        16777 non-null float64
primary_cleaner.state.floatbank8_a_air                16820 non-null float64
primary_cleaner.state.floatbank8_a_level              16827 non-null float64
primary_cleaner.state.floatbank8_b_air                16820 non-null float64
primary_cleaner.state.floatbank8_b_level              16833 non-null float64
primary_cleaner.state.floatbank8_c_air                16822 non-null float64
primary_cleaner.state.floatbank8_c_level              16833 non-null float64
primary_cleaner.state.floatbank8_d_air                16821 non-null float64
primary_cleaner.state.floatbank8_d_level              16833 non-null float64
rougher.calculation.sulfate_to_au_concentrate         16833 non-null float64
rougher.calculation.floatbank10_sulfate_to_au_feed    16833 non-null float64
rougher.calculation.floatbank11_sulfate_to_au_feed    16833 non-null float64
rougher.calculation.au_pb_ratio                       15618 non-null float64
rougher.input.feed_ag                                 16778 non-null float64
rougher.input.feed_pb                                 16632 non-null float64
rougher.input.feed_rate                               16347 non-null float64
rougher.input.feed_size                               16443 non-null float64
rougher.input.feed_sol                                16568 non-null float64
rougher.input.feed_au                                 16777 non-null float64
rougher.input.floatbank10_sulfate                     15816 non-null float64
rougher.input.floatbank10_xanthate                    16514 non-null float64
rougher.input.floatbank11_sulfate                     16237 non-null float64
rougher.input.floatbank11_xanthate                    14956 non-null float64
rougher.output.concentrate_ag                         16778 non-null float64
rougher.output.concentrate_pb                         16778 non-null float64
rougher.output.concentrate_sol                        16698 non-null float64
rougher.output.concentrate_au                         16778 non-null float64
rougher.output.recovery                               14287 non-null float64
rougher.output.tail_ag                                14610 non-null float64
rougher.output.tail_pb                                16778 non-null float64
rougher.output.tail_sol                               14611 non-null float64
rougher.output.tail_au                                14611 non-null float64
rougher.state.floatbank10_a_air                       16807 non-null float64
rougher.state.floatbank10_a_level                     16807 non-null float64
rougher.state.floatbank10_b_air                       16807 non-null float64
rougher.state.floatbank10_b_level                     16807 non-null float64
rougher.state.floatbank10_c_air                       16807 non-null float64
rougher.state.floatbank10_c_level                     16814 non-null float64
rougher.state.floatbank10_d_air                       16802 non-null float64
rougher.state.floatbank10_d_level                     16809 non-null float64
rougher.state.floatbank10_e_air                       16257 non-null float64
rougher.state.floatbank10_e_level                     16809 non-null float64
rougher.state.floatbank10_f_air                       16802 non-null float64
rougher.state.floatbank10_f_level                     16802 non-null float64
secondary_cleaner.output.tail_ag                      16776 non-null float64
secondary_cleaner.output.tail_pb                      16764 non-null float64
secondary_cleaner.output.tail_sol                     14874 non-null float64
secondary_cleaner.output.tail_au                      16778 non-null float64
secondary_cleaner.state.floatbank2_a_air              16497 non-null float64
secondary_cleaner.state.floatbank2_a_level            16751 non-null float64
secondary_cleaner.state.floatbank2_b_air              16705 non-null float64
secondary_cleaner.state.floatbank2_b_level            16748 non-null float64
secondary_cleaner.state.floatbank3_a_air              16763 non-null float64
secondary_cleaner.state.floatbank3_a_level            16747 non-null float64
secondary_cleaner.state.floatbank3_b_air              16752 non-null float64
secondary_cleaner.state.floatbank3_b_level            16750 non-null float64
secondary_cleaner.state.floatbank4_a_air              16731 non-null float64
secondary_cleaner.state.floatbank4_a_level            16747 non-null float64
secondary_cleaner.state.floatbank4_b_air              16768 non-null float64
secondary_cleaner.state.floatbank4_b_level            16767 non-null float64
secondary_cleaner.state.floatbank5_a_air              16775 non-null float64
secondary_cleaner.state.floatbank5_a_level            16775 non-null float64
secondary_cleaner.state.floatbank5_b_air              16775 non-null float64
secondary_cleaner.state.floatbank5_b_level            16776 non-null float64
secondary_cleaner.state.floatbank6_a_air              16757 non-null float64
secondary_cleaner.state.floatbank6_a_level            16775 non-null float64
dtypes: float64(86), object(1)
memory usage: 11.2+ MB
In [5]:
gold_test = pd.read_csv("https://code.s3.yandex.net/datasets/gold_recovery_test.csv")
gold_test.head()
Out[5]:
date primary_cleaner.input.sulfate primary_cleaner.input.depressant primary_cleaner.input.feed_size primary_cleaner.input.xanthate primary_cleaner.state.floatbank8_a_air primary_cleaner.state.floatbank8_a_level primary_cleaner.state.floatbank8_b_air primary_cleaner.state.floatbank8_b_level primary_cleaner.state.floatbank8_c_air ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 2016-09-01 00:59:59 210.800909 14.993118 8.080000 1.005021 1398.981301 -500.225577 1399.144926 -499.919735 1400.102998 ... 12.023554 -497.795834 8.016656 -501.289139 7.946562 -432.317850 4.872511 -500.037437 26.705889 -499.709414
1 2016-09-01 01:59:59 215.392455 14.987471 8.080000 0.990469 1398.777912 -500.057435 1398.055362 -499.778182 1396.151033 ... 12.058140 -498.695773 8.130979 -499.634209 7.958270 -525.839648 4.878850 -500.162375 25.019940 -499.819438
2 2016-09-01 02:59:59 215.259946 12.884934 7.786667 0.996043 1398.493666 -500.868360 1398.860436 -499.764529 1398.075709 ... 11.962366 -498.767484 8.096893 -500.827423 8.071056 -500.801673 4.905125 -499.828510 24.994862 -500.622559
3 2016-09-01 03:59:59 215.336236 12.006805 7.640000 0.863514 1399.618111 -498.863574 1397.440120 -499.211024 1400.129303 ... 12.033091 -498.350935 8.074946 -499.474407 7.897085 -500.868509 4.931400 -499.963623 24.948919 -498.709987
4 2016-09-01 04:59:59 199.099327 10.682530 7.530000 0.805575 1401.268123 -500.808305 1398.128818 -499.504543 1402.172226 ... 12.025367 -500.786497 8.054678 -500.397500 8.107890 -509.526725 4.957674 -500.360026 25.003331 -500.856333

5 rows × 53 columns

In [6]:
gold_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5856 entries, 0 to 5855
Data columns (total 53 columns):
date                                          5856 non-null object
primary_cleaner.input.sulfate                 5554 non-null float64
primary_cleaner.input.depressant              5572 non-null float64
primary_cleaner.input.feed_size               5856 non-null float64
primary_cleaner.input.xanthate                5690 non-null float64
primary_cleaner.state.floatbank8_a_air        5840 non-null float64
primary_cleaner.state.floatbank8_a_level      5840 non-null float64
primary_cleaner.state.floatbank8_b_air        5840 non-null float64
primary_cleaner.state.floatbank8_b_level      5840 non-null float64
primary_cleaner.state.floatbank8_c_air        5840 non-null float64
primary_cleaner.state.floatbank8_c_level      5840 non-null float64
primary_cleaner.state.floatbank8_d_air        5840 non-null float64
primary_cleaner.state.floatbank8_d_level      5840 non-null float64
rougher.input.feed_ag                         5840 non-null float64
rougher.input.feed_pb                         5840 non-null float64
rougher.input.feed_rate                       5816 non-null float64
rougher.input.feed_size                       5834 non-null float64
rougher.input.feed_sol                        5789 non-null float64
rougher.input.feed_au                         5840 non-null float64
rougher.input.floatbank10_sulfate             5599 non-null float64
rougher.input.floatbank10_xanthate            5733 non-null float64
rougher.input.floatbank11_sulfate             5801 non-null float64
rougher.input.floatbank11_xanthate            5503 non-null float64
rougher.state.floatbank10_a_air               5839 non-null float64
rougher.state.floatbank10_a_level             5840 non-null float64
rougher.state.floatbank10_b_air               5839 non-null float64
rougher.state.floatbank10_b_level             5840 non-null float64
rougher.state.floatbank10_c_air               5839 non-null float64
rougher.state.floatbank10_c_level             5840 non-null float64
rougher.state.floatbank10_d_air               5839 non-null float64
rougher.state.floatbank10_d_level             5840 non-null float64
rougher.state.floatbank10_e_air               5839 non-null float64
rougher.state.floatbank10_e_level             5840 non-null float64
rougher.state.floatbank10_f_air               5839 non-null float64
rougher.state.floatbank10_f_level             5840 non-null float64
secondary_cleaner.state.floatbank2_a_air      5836 non-null float64
secondary_cleaner.state.floatbank2_a_level    5840 non-null float64
secondary_cleaner.state.floatbank2_b_air      5833 non-null float64
secondary_cleaner.state.floatbank2_b_level    5840 non-null float64
secondary_cleaner.state.floatbank3_a_air      5822 non-null float64
secondary_cleaner.state.floatbank3_a_level    5840 non-null float64
secondary_cleaner.state.floatbank3_b_air      5840 non-null float64
secondary_cleaner.state.floatbank3_b_level    5840 non-null float64
secondary_cleaner.state.floatbank4_a_air      5840 non-null float64
secondary_cleaner.state.floatbank4_a_level    5840 non-null float64
secondary_cleaner.state.floatbank4_b_air      5840 non-null float64
secondary_cleaner.state.floatbank4_b_level    5840 non-null float64
secondary_cleaner.state.floatbank5_a_air      5840 non-null float64
secondary_cleaner.state.floatbank5_a_level    5840 non-null float64
secondary_cleaner.state.floatbank5_b_air      5840 non-null float64
secondary_cleaner.state.floatbank5_b_level    5840 non-null float64
secondary_cleaner.state.floatbank6_a_air      5840 non-null float64
secondary_cleaner.state.floatbank6_a_level    5840 non-null float64
dtypes: float64(52), object(1)
memory usage: 2.4+ MB
In [7]:
gold_full = pd.read_csv("https://code.s3.yandex.net/datasets/gold_recovery_full.csv")
gold_full.head()
Out[7]:
date final.output.concentrate_ag final.output.concentrate_pb final.output.concentrate_sol final.output.concentrate_au final.output.recovery final.output.tail_ag final.output.tail_pb final.output.tail_sol final.output.tail_au ... secondary_cleaner.state.floatbank4_a_air secondary_cleaner.state.floatbank4_a_level secondary_cleaner.state.floatbank4_b_air secondary_cleaner.state.floatbank4_b_level secondary_cleaner.state.floatbank5_a_air secondary_cleaner.state.floatbank5_a_level secondary_cleaner.state.floatbank5_b_air secondary_cleaner.state.floatbank5_b_level secondary_cleaner.state.floatbank6_a_air secondary_cleaner.state.floatbank6_a_level
0 2016-01-15 00:00:00 6.055403 9.889648 5.507324 42.192020 70.541216 10.411962 0.895447 16.904297 2.143149 ... 14.016835 -502.488007 12.099931 -504.715942 9.925633 -498.310211 8.079666 -500.470978 14.151341 -605.841980
1 2016-01-15 01:00:00 6.029369 9.968944 5.257781 42.701629 69.266198 10.462676 0.927452 16.634514 2.224930 ... 13.992281 -505.503262 11.950531 -501.331529 10.039245 -500.169983 7.984757 -500.582168 13.998353 -599.787184
2 2016-01-15 02:00:00 6.055926 10.213995 5.383759 42.657501 68.116445 10.507046 0.953716 16.208849 2.257889 ... 14.015015 -502.520901 11.912783 -501.133383 10.070913 -500.129135 8.013877 -500.517572 14.028663 -601.427363
3 2016-01-15 03:00:00 6.047977 9.977019 4.858634 42.689819 68.347543 10.422762 0.883763 16.532835 2.146849 ... 14.036510 -500.857308 11.999550 -501.193686 9.970366 -499.201640 7.977324 -500.255908 14.005551 -599.996129
4 2016-01-15 04:00:00 6.148599 10.142511 4.939416 42.774141 66.927016 10.360302 0.792826 16.525686 2.055292 ... 14.027298 -499.838632 11.953070 -501.053894 9.925709 -501.686727 7.894242 -500.356035 13.996647 -601.496691

5 rows × 87 columns

In [8]:
gold_full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22716 entries, 0 to 22715
Data columns (total 87 columns):
date                                                  22716 non-null object
final.output.concentrate_ag                           22627 non-null float64
final.output.concentrate_pb                           22629 non-null float64
final.output.concentrate_sol                          22331 non-null float64
final.output.concentrate_au                           22630 non-null float64
final.output.recovery                                 20753 non-null float64
final.output.tail_ag                                  22633 non-null float64
final.output.tail_pb                                  22516 non-null float64
final.output.tail_sol                                 22445 non-null float64
final.output.tail_au                                  22635 non-null float64
primary_cleaner.input.sulfate                         21107 non-null float64
primary_cleaner.input.depressant                      21170 non-null float64
primary_cleaner.input.feed_size                       22716 non-null float64
primary_cleaner.input.xanthate                        21565 non-null float64
primary_cleaner.output.concentrate_ag                 22618 non-null float64
primary_cleaner.output.concentrate_pb                 22268 non-null float64
primary_cleaner.output.concentrate_sol                21918 non-null float64
primary_cleaner.output.concentrate_au                 22618 non-null float64
primary_cleaner.output.tail_ag                        22614 non-null float64
primary_cleaner.output.tail_pb                        22594 non-null float64
primary_cleaner.output.tail_sol                       22365 non-null float64
primary_cleaner.output.tail_au                        22617 non-null float64
primary_cleaner.state.floatbank8_a_air                22660 non-null float64
primary_cleaner.state.floatbank8_a_level              22667 non-null float64
primary_cleaner.state.floatbank8_b_air                22660 non-null float64
primary_cleaner.state.floatbank8_b_level              22673 non-null float64
primary_cleaner.state.floatbank8_c_air                22662 non-null float64
primary_cleaner.state.floatbank8_c_level              22673 non-null float64
primary_cleaner.state.floatbank8_d_air                22661 non-null float64
primary_cleaner.state.floatbank8_d_level              22673 non-null float64
rougher.calculation.sulfate_to_au_concentrate         22672 non-null float64
rougher.calculation.floatbank10_sulfate_to_au_feed    22672 non-null float64
rougher.calculation.floatbank11_sulfate_to_au_feed    22672 non-null float64
rougher.calculation.au_pb_ratio                       21089 non-null float64
rougher.input.feed_ag                                 22618 non-null float64
rougher.input.feed_pb                                 22472 non-null float64
rougher.input.feed_rate                               22163 non-null float64
rougher.input.feed_size                               22277 non-null float64
rougher.input.feed_sol                                22357 non-null float64
rougher.input.feed_au                                 22617 non-null float64
rougher.input.floatbank10_sulfate                     21415 non-null float64
rougher.input.floatbank10_xanthate                    22247 non-null float64
rougher.input.floatbank11_sulfate                     22038 non-null float64
rougher.input.floatbank11_xanthate                    20459 non-null float64
rougher.output.concentrate_ag                         22618 non-null float64
rougher.output.concentrate_pb                         22618 non-null float64
rougher.output.concentrate_sol                        22526 non-null float64
rougher.output.concentrate_au                         22618 non-null float64
rougher.output.recovery                               19597 non-null float64
rougher.output.tail_ag                                19979 non-null float64
rougher.output.tail_pb                                22618 non-null float64
rougher.output.tail_sol                               19980 non-null float64
rougher.output.tail_au                                19980 non-null float64
rougher.state.floatbank10_a_air                       22646 non-null float64
rougher.state.floatbank10_a_level                     22647 non-null float64
rougher.state.floatbank10_b_air                       22646 non-null float64
rougher.state.floatbank10_b_level                     22647 non-null float64
rougher.state.floatbank10_c_air                       22646 non-null float64
rougher.state.floatbank10_c_level                     22654 non-null float64
rougher.state.floatbank10_d_air                       22641 non-null float64
rougher.state.floatbank10_d_level                     22649 non-null float64
rougher.state.floatbank10_e_air                       22096 non-null float64
rougher.state.floatbank10_e_level                     22649 non-null float64
rougher.state.floatbank10_f_air                       22641 non-null float64
rougher.state.floatbank10_f_level                     22642 non-null float64
secondary_cleaner.output.tail_ag                      22616 non-null float64
secondary_cleaner.output.tail_pb                      22600 non-null float64
secondary_cleaner.output.tail_sol                     20501 non-null float64
secondary_cleaner.output.tail_au                      22618 non-null float64
secondary_cleaner.state.floatbank2_a_air              22333 non-null float64
secondary_cleaner.state.floatbank2_a_level            22591 non-null float64
secondary_cleaner.state.floatbank2_b_air              22538 non-null float64
secondary_cleaner.state.floatbank2_b_level            22588 non-null float64
secondary_cleaner.state.floatbank3_a_air              22585 non-null float64
secondary_cleaner.state.floatbank3_a_level            22587 non-null float64
secondary_cleaner.state.floatbank3_b_air              22592 non-null float64
secondary_cleaner.state.floatbank3_b_level            22590 non-null float64
secondary_cleaner.state.floatbank4_a_air              22571 non-null float64
secondary_cleaner.state.floatbank4_a_level            22587 non-null float64
secondary_cleaner.state.floatbank4_b_air              22608 non-null float64
secondary_cleaner.state.floatbank4_b_level            22607 non-null float64
secondary_cleaner.state.floatbank5_a_air              22615 non-null float64
secondary_cleaner.state.floatbank5_a_level            22615 non-null float64
secondary_cleaner.state.floatbank5_b_air              22615 non-null float64
secondary_cleaner.state.floatbank5_b_level            22616 non-null float64
secondary_cleaner.state.floatbank6_a_air              22597 non-null float64
secondary_cleaner.state.floatbank6_a_level            22615 non-null float64
dtypes: float64(86), object(1)
memory usage: 15.1+ MB

Steps

Imported the required libraries.
Saved the contents of the CSVs to dataframes.
Retrieved the information about the dataframes.

Conclusion

The data consists of information about metal concentrations at various stages.

1.2. Check that recovery is calculated correctly. Using the training set, calculate recovery for the rougher.output.recovery feature. Find the MAE between your calculations and the feature values. Provide findings.

In [9]:
C = gold_train["rougher.output.concentrate_au"]
F = gold_train["rougher.input.feed_au"]
T = gold_train["rougher.output.tail_au"]
In [10]:
gold_train["rougher.output.recovery.calculated"] = (C * (F - T)) / (F * (C - T)) * 100
In [11]:
mae = (gold_train["rougher.output.recovery.calculated"] - gold_train["rougher.output.recovery"]).abs().sum() / len(gold_train)
print("MAE", ":", mae)
MAE : 8.00350954615662e-15

Steps

Calculated recovery for rougher.output.recovery feature.
Found MAE of recovery calculated with feature values.

Conclusion

The mean absolute error is 8e-15 which is very small, indicating that the feature values in the data are very accurate.

1.3. Analyze the features not available in the test set. What are these parameters? What is their type

In [12]:
gold_train.columns.difference(gold_test.columns)
Out[12]:
Index(['final.output.concentrate_ag', 'final.output.concentrate_au',
       'final.output.concentrate_pb', 'final.output.concentrate_sol',
       'final.output.recovery', 'final.output.tail_ag', 'final.output.tail_au',
       'final.output.tail_pb', 'final.output.tail_sol',
       'primary_cleaner.output.concentrate_ag',
       'primary_cleaner.output.concentrate_au',
       'primary_cleaner.output.concentrate_pb',
       'primary_cleaner.output.concentrate_sol',
       'primary_cleaner.output.tail_ag', 'primary_cleaner.output.tail_au',
       'primary_cleaner.output.tail_pb', 'primary_cleaner.output.tail_sol',
       'rougher.calculation.au_pb_ratio',
       'rougher.calculation.floatbank10_sulfate_to_au_feed',
       'rougher.calculation.floatbank11_sulfate_to_au_feed',
       'rougher.calculation.sulfate_to_au_concentrate',
       'rougher.output.concentrate_ag', 'rougher.output.concentrate_au',
       'rougher.output.concentrate_pb', 'rougher.output.concentrate_sol',
       'rougher.output.recovery', 'rougher.output.recovery.calculated',
       'rougher.output.tail_ag', 'rougher.output.tail_au',
       'rougher.output.tail_pb', 'rougher.output.tail_sol',
       'secondary_cleaner.output.tail_ag', 'secondary_cleaner.output.tail_au',
       'secondary_cleaner.output.tail_pb',
       'secondary_cleaner.output.tail_sol'],
      dtype='object')

Steps

Printed the columns which are present in the train dataframe but not in the test dataframe.

Conclusion

The parameters are output values at different stages for various metals. The output values are not necessary for the machine learning model training and testing.

1.4. Perform data preprocessing.

In [13]:
gold_train = gold_train.ffill()
gold_test = gold_test.ffill()
gold_full = gold_full.ffill()
In [14]:
gold_full_merge = gold_full[["date", "rougher.output.recovery", "final.output.recovery", "rougher.output.concentrate_au", "rougher.output.concentrate_ag", "rougher.output.concentrate_pb", "rougher.output.concentrate_sol", "final.output.concentrate_au", "final.output.concentrate_ag", "final.output.concentrate_pb", "final.output.concentrate_sol"]]
In [15]:
gold_test = gold_test.merge(gold_full_merge, on="date", how="left")
gold_full_merge = gold_full_merge.drop(['date', 'rougher.output.recovery', 'final.output.recovery'], axis=1)
In [16]:
gold_train = gold_train.drop("date", axis=1)
gold_test = gold_test.drop("date", axis=1)
gold_full = gold_full.drop("date", axis=1)

Steps

Filled the NaN values of the dataframes using ffill.
Merged necessary data in the full dataframe to the test dataframe.
Dropped unecessary date column in dataframes.

Conclusion

Filled the NaNs using ffill as it forward fills the NaNs with the last valid observation. Added required info to test dataframe from full dataframe and removed date columns which are not needed for the model fitting.

2. Analyze the data

2.1. Take note of how the concentrations of metals (Au, Ag, Pb) change depending on the purification stage.

In [17]:
metal_au = gold_full[["rougher.input.feed_au", "rougher.output.concentrate_au", "primary_cleaner.output.concentrate_au", "final.output.concentrate_au"]]
metal_ag = gold_full[["rougher.input.feed_ag", "rougher.output.concentrate_ag", "primary_cleaner.output.concentrate_ag", "final.output.concentrate_ag"]]
metal_pb = gold_full[["rougher.input.feed_pb", "rougher.output.concentrate_pb", "primary_cleaner.output.concentrate_pb", "final.output.concentrate_pb"]]
In [18]:
fig, axes = plt.subplots(3, 1, figsize=(10, 20))
for column in list(metal_au):
    sns.distplot(metal_au[column], ax=axes[0], kde=False)
axes[0].set(title="Au", xlabel="Concentration %", ylabel="Amount")
for column in list(metal_ag):
    sns.distplot(metal_ag[column], ax=axes[1], kde=False)
axes[1].set(title="Ag", xlabel="Concentration %", ylabel="Amount")
for column in list(metal_pb):
    sns.distplot(metal_pb[column], ax=axes[2], kde=False)
axes[2].set(title="Pb", xlabel="Concentration %", ylabel="Amount")
fig.suptitle("Metal Concentraions at Purification Stages")
fig.legend(["rougher.input.feed", "rougher.output.concentrate", "primary.cleaner.output concentrate", "final.output.concentrate",])
fig.show()

Steps

Plotted Au, Ag, and Pb metal concentrations at purification stage.

Conclusion

The concentration of Au increases unifomrly throughout the purification stage. The concentraion of Ag increases and decreases slightly throughout the stage resulting in a net decrease. The concnetration of Pb increases slightly throughout the stage resulting in a net increase similar to the primary cleaner output concentrate.

2.2. Compare the feed particle size distributions in the training set and in the test set. If the distributions vary significantly, the model evaluation will be incorrect.

In [19]:
fig, axes = plt.subplots(2,1, figsize=(10, 15))
axes[0].hist(gold_train["primary_cleaner.input.feed_size"], density=True, alpha=0.5, bins=50)
axes[0].hist(gold_test["primary_cleaner.input.feed_size"], density=True, alpha=0.5, bins=50)
axes[0].set(title="Primary Cleaner Input Feed Size", xlabel="Amount", ylabel="Size")
axes[1].hist(gold_train["rougher.input.feed_size"], density=True, alpha=0.5, bins=50)
axes[1].hist(gold_test["rougher.input.feed_size"], density=True, alpha=0.5, bins=50)
axes[1].set(title="Rougher Input Feed Size", xlabel="Amount", ylabel="Size")
fig.suptitle("Feed Particle Size Distribution")
fig.legend(["Train Set", "Test Set"])
fig.show()

Steps

Plotted feed particle size distribution in train and test set.

Conclusion

The particle size distribution shows that the particles sizes do not vary siginificantly in the train and test set. Therefore the model evaluation may be correct.

2.3. Consider the total concentrations of all substances at different stages: raw feed, rougher concentrate, and final concentrate. Do you notice any abnormal values in the total distribution? If you do, is it worth removing such values from both samples? Describe the findings and eliminate anomalies.

In [20]:
def raw_feed(df):
    return df["rougher.input.feed_au"] + df["rougher.input.feed_ag"] + df["rougher.input.feed_pb"] + df["rougher.input.feed_sol"]
In [21]:
def rougher_conc(df):
    return df["rougher.output.concentrate_au"] + df["rougher.output.concentrate_ag"] + df["rougher.output.concentrate_pb"] + df["rougher.output.concentrate_sol"]
In [22]:
def final_conc(df):
    return df["final.output.concentrate_au"] + df["final.output.concentrate_ag"] + df["final.output.concentrate_pb"] + df["final.output.concentrate_sol"]
In [23]:
gold_full["rougher.input.feed"] = raw_feed(gold_full)
gold_full["rougher.output.concentrate"] = rougher_conc(gold_full)
gold_full["final.output.concentrate"] = final_conc(gold_full)
total_conc = gold_full[["rougher.input.feed", "rougher.output.concentrate", "final.output.concentrate"]]
In [24]:
fig = plt.figure(figsize=(10, 6))
for column in list(total_conc):
    sns.distplot(total_conc[column], kde=False)
plt.legend(list(total_conc))
plt.title("Total Concentration at Stages")
plt.xlabel("Concentration %")
plt.ylabel("Amount")
fig.show()

Steps

Plotted total metal concentation distribution at various stages.

Conclusion

The concentration distributions at various stages all show abnormal values at 0%. Therefore it is worth removing these values from both samples (< 20%) to ensure accuracy in the model.

3. Build the model

3.1. Write a function to calculate the final sMAPE value.

In [43]:
def final_smape(y, y_hat):
    y_rougher = y.iloc[:, 0]
    y_hat_rougher = y_hat.iloc[:, 0]

    rougher_num = np.abs(y_rougher - y_hat_rougher)
    rougher_den = (np.abs(y_rougher) + np.abs(y_hat_rougher)) / 2
    smape_rougher = np.mean(rougher_num / rougher_den) * 100

    y_final = y.iloc[:, 1]
    y_hat_final = y_hat.iloc[:, 1]

    final_num = np.abs(y_final - y_hat_final)
    final_den = (np.abs(y_final) + np.abs(y_hat_final)) / 2
    smape_final = np.mean(final_num / final_den) * 100

    final_smape = smape_rougher * 0.25 + smape_final * 0.75
    return final_smape
In [26]:
smape_scorer = make_scorer(final_smape, greater_is_better=False)

3.2. Train different models. Evaluate them using cross-validation. Pick the best model and test it using the test sample. Provide findings.

Use these formulas for evaluation metrics: $$\mathit{sMAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| y_i - \hat{y}_i \right|}{\left( \left| y_i \right| + \left|\hat{y}_i \right| \right) / 2} \times 100\%$$ $$\mathit{Final\ sMAPE} = 25\% \times \mathit{sMAPE}\left( \mathit{rougher} \right) + 75\% \times \mathit{sMAPE} \left( \mathit{final} \right)$$

In [27]:
gold_train["rougher.input.feed"] = raw_feed(gold_train)
gold_train["rougher.output.concentrate"] = rougher_conc(gold_train)
gold_train["final.output.concentrate"] = final_conc(gold_train)
In [28]:
gold_train = gold_train[(gold_train["rougher.input.feed"] > 20) & (gold_train["rougher.output.concentrate"] > 20) & (gold_train["final.output.concentrate"] > 20)]
gold_train = gold_train.drop(["rougher.input.feed", "rougher.output.concentrate", "final.output.concentrate"], axis=1)
In [29]:
gold_test["rougher.input.feed"] = raw_feed(gold_test)
gold_test["rougher.output.concentrate"] = rougher_conc(gold_test)
gold_test["final.output.concentrate"] = final_conc(gold_test)
In [30]:
gold_test = gold_test[(gold_test["rougher.input.feed"] > 20) & (gold_test["rougher.output.concentrate"] > 20) & (gold_test["final.output.concentrate"] > 20)]
gold_test = gold_test.drop(["rougher.input.feed", "rougher.output.concentrate", "final.output.concentrate"], axis=1)
gold_test = gold_test.drop(list(gold_full_merge.columns.values), axis=1)
In [31]:
gold_train = gold_train.loc[:, list(gold_test.columns)]
In [32]:
features_train = gold_train.drop(columns=["rougher.output.recovery", "final.output.recovery"], axis=1)
target_train = gold_train[["rougher.output.recovery", "final.output.recovery"]]
In [33]:
feature_scaler = StandardScaler()
features_train = feature_scaler.fit_transform(features_train)
In [34]:
def smape_cv(model, features, target):
    return np.abs(np.average(cross_validate(model, features, target, scoring=smape_scorer)["test_score"]))
In [35]:
features_test = gold_test.drop(columns=["rougher.output.recovery", "final.output.recovery"], axis=1)
target_test = gold_test[["rougher.output.recovery", "final.output.recovery"]]
print(target_test)
features_test = feature_scaler.transform(features_test)
      rougher.output.recovery  final.output.recovery
0                   89.993421              70.273583
1                   88.089657              68.910432
2                   88.412756              68.143213
3                   87.360133              67.776393
4                   83.236367              61.467078
...                       ...                    ...
5851                95.172585              68.919891
5852                94.575036              68.440582
5853                93.018138              67.092759
5854                92.599042              68.061186
5855                91.177695              71.699976

[5244 rows x 2 columns]
In [36]:
state = np.random.RandomState(12345)
In [37]:
print("Decision Tree")
for depth in range(100, 501, 100):
    model = DecisionTreeRegressor(max_depth=depth, random_state=state)
    model.fit(features_train, target_train)
    final_smape = smape_cv(model, features_train, target_train)
    print("max_depth =", depth, ":", final_smape)
Decision Tree
max_depth = 100 : 16.33061630002678
max_depth = 200 : 15.160084168943747
max_depth = 300 : 15.239415582105158
max_depth = 400 : 16.74440426231997
max_depth = 500 : 16.366282098704215
In [38]:
print("Random Forest")
for estim in range(10, 51, 10):
    model = RandomForestRegressor(n_estimators=estim, random_state=state)
    model.fit(features_train, target_train)
    final_smape = smape_cv(model, features_train, target_train)
    print("n_estimators =", estim, ":", final_smape)
Random Forest
n_estimators = 10 : 11.731968152201986
n_estimators = 20 : 11.185989592288081
n_estimators = 30 : 10.65543643402293
n_estimators = 40 : 10.60491121643704
n_estimators = 50 : 11.114542644552488
In [39]:
model = LinearRegression()
model.fit(features_train, target_train)
final_smape = smape_cv(model, features_train, target_train)
print("Logistic Regression", ":", final_smape)
Logistic Regression : 14.19268741473465
In [40]:
model = RandomForestRegressor(n_estimators=40, random_state=state)
model.fit(features_train, target_train)
final_smape = smape_cv(model, features_test, target_test)
print("sMAPE :", final_smape)
sMAPE : 8.115869506726808
In [41]:
constant_model_mean = target_train.mean()
constant_model_test = pd.DataFrame(index=range(len(target_test)),columns=["rougher.output.recovery", "final.output.recovery"])
constant_model_test["rougher.output.recovery"] = constant_model_mean[0]
constant_model_test["final.output.recovery"] = constant_model_mean[1]
In [44]:
constant_smape = final_smape(constant_model_test, target_test)
print("Constant model sMAPE:", constant_smape)
Constant model sMAPE: 7.620660008047318

Steps

Dropped data in train and test set <= 20% concentration to remove abnormal values.
Trained and fit the model using DecisionTreeRegressor at various depths.
Trained and fit the model using RandomForestRegressor with various n_estimators.
Trained and fit the model using LinearRegression.
Trained and fit the final model using RandomForestRegressor with 40 n_estimators.
Determined the constant model prediciting the mean.

Conclusion

The best best model observed is using the RandomForestRegressor at 40 n_setimators. The final sMAPE found is 8%. The constant model prediciting the mean is 7.6% which is very close and indicates that the machine learning model is not significantly better than the constant model prediciting the mean. The constant model prediciting the mean is also better than any of the other models tested.