Air Pollution in Indian Cities and Comparison of MLR, ANN and CART Models for Predicting PM10 Concentrations in Guwahati, India

Indian cities are increasingly susceptible to PM10-induced health hazards, creating concern for the country's policymakers. Air pollution is engulfing comparatively smaller cities as the rapid pace of urbanization and economic development shows no sign of slowing. A review of air pollution in 28 Indian cities, including tier-I, II, and III cities, found that they grossly violated both the WHO (World Health Organisation) and NAAQS (National Ambient Air Quality Standards of India) limits for acceptable daily average PM10 (particulate matter less than 10 µm in aerodynamic diameter) concentrations by a wide margin. Predicting city-level PM10 concentrations in advance and initiating prior actions accordingly is an acceptable way to protect city dwellers from PM10-induced health hazards. The predictive ability of three models, linear Multiple Linear Regression (MLR), nonlinear Multi-Layer Perceptron class of Artificial Neural Network (MLP ANN), and nonlinear Classification and Regression Tree (CART), for one-day-ahead PM10 concentration forecasting in the tier-II city of Guwahati was tested with 2016–2018 daily average observed climate data, PM10, and gaseous pollutants. The results show that the nonlinear MLP algorithm with feedforward backpropagation network topologies of the ANN class gives the best prediction compared with the linear MLR and nonlinear CART models. Therefore, the ANN (MLP) approach may be useful for deriving a predictive understanding of the one-day-ahead PM10 concentration level and may thus provide policymakers with a tool for initiating in situ measures to curb air pollution and improve public health.


INTRODUCTION
Over the past years, airborne particulate matter (PM) concentrations in Indian cities have been rising and have become a matter of concern for policymakers in India. Improving air quality is not easy for a country like India, as its policymakers cannot forgo the objective of faster economic development needed to sustain its vast population. Different sources are continuously pouring pollutants into the city air; notable amongst them are the burning of fuels, industrial establishments, infrastructure-related construction, power plants both tion (WHO) and National Ambient Air Quality Standards (NAAQS) of India, respectively. Kolkata, a tier-I city* in India, even clocked a PM10 concentration as high as 445±21 μg m⁻³ during the wintertime (Das et al., 2015). Annual PM10 concentrations in New Delhi were reported to be 222±14 μg m⁻³, while an earlier study reported summer and winter mean concentrations of 95.1±22.2 μg m⁻³ and 182±32.5 μg m⁻³, respectively (Tiwari et al., 2014; Singh et al., 2011). Bengaluru city also registered a high annual mean PM10 concentration of 349.8±205.8 μg m⁻³ during the year 2015 (Guttikunda et al., 2019). In comparison, lower concentrations have been reported for Hyderabad and Mumbai, where mean PM10 concentrations for the period 2005-2012 were 174.4±86.6 μg m⁻³ and 54.4±25.2 μg m⁻³, respectively (Dholakia et al., 2014).
The tier-II cities are not lagging far behind India's tier-I cities in terms of PM10 pollution. Raipur had a mean PM10 concentration of 387.29±76.9 μg m⁻³ during October 2008 to September 2009, while another city, Kanpur, recorded a mean PM10 concentration of 277±117.6 μg m⁻³ during October 2002 to February 2003 (Deshmukh et al., 2011; Sharma and Maloo, 2005). Amongst the tier-III cities, the reported mean PM10 concentrations of some specific cities like Jharia and Sonipat were also on the higher side, with 333.7±17.9 μg m⁻³ and 213.7±151.5 μg m⁻³ during the period March 2011 to February 2012 and 03 December to 06 December 2016, respectively (Ravindra et al., 2019; Roy et al., 2019).
One option for Indian policymakers to mitigate critical PM concentrations in the cities, and the associated health effects, may therefore be to correctly predict the concentrations at least one to two days in advance and accordingly initiate prior actions such as the planned regulation of traffic. However, predicting air quality is not a straightforward job because of the complex interactions of different nonlinear parameters (Hooyberghs et al., 2005). Shahraiyni and Sodoudi (2016) reviewed 36 research studies executed in different cities of the world in the quest for prediction accuracy in forecasting PM10. Of these studies, 50% employed a multi-layer perceptron (MLP) with Feedforward Backpropagation Network (FFBN) topologies, a class of Artificial Neural Network (ANN) model. Around 28% (10 studies) depended on the widely used Multiple Linear Regression (MLR) technique for PM10 forecasting in urban areas. Three studies (about 8%) used the Radial Basis Function (RBF) network of the ANN class to forecast city-level PM10. The other five studies (14%) depended on other techniques like PNN (Pruned Neural Networks), LL (Lazy Learning), an MLP and MLR combination, the Elman class of Recurrent Neural Networks (RNN), and PCRA (Principal Component Regression Analysis). The ANN technique appears to provide useful results when dealing with the nonlinear independent variables involved in environmental pollution prediction. Hence, more practitioners resort to data-driven approaches of the ANN modeling type as alternatives to traditional deterministic or nonlinear models (Cabaneros et al., 2019; Jiang et al., 2017). Pollution researchers in China and elsewhere have used ANN techniques extensively to forecast airborne PM concentrations in the past. MLR with stepwise inclusion of input variables has been the most used tool for temporal prediction of PM2.5 and PM10 in different urban areas of India. MLR has its limitation in terms of the linear representation of nonlinear systems. However, researchers have shown only a limited preference for using different data-driven predictive techniques for PM forecasting in the Indian context and for comparatively judging their performances (Table S2).
Against the above background, this paper's primary objective is to assess the predictive ability of three contemporary statistical techniques, namely MLR, ANN, and CART (Classification and Regression Tree) analyses, for one-day-ahead PM10 concentration prediction in an Indian city. The best-performing technique will be a useful tool for city authorities and air quality managers for initiating in situ measures to curb pollution. Unlike previous modeling efforts (Table S2), this is the first instance of applying CART analysis as a statistical procedure for the prediction of PM10 in a comparative setup for an Indian city. In the recent past, Gocheva-Ilieva and Stoimenova (2018) employed CART in predicting PM10 for the city of Pleven in Bulgaria and claimed very accurate model performance. The CART technique, as a method for analysis and forecasting of PM10, has also been reported to perform better than MLR (Slini et al., 2006).

LOCATION OF THE STUDY
The model development for forecasting PM10 was attempted in the north-eastern Indian tier-II city of Guwahati, the capital city of the state of Assam, India. For the last 10-12 years, Guwahati has been recognized as one of India's most rapidly growing cities. Rapid urbanization and its contribution to air pollution have made smaller Indian cities like Guwahati vulnerable too. Vehicular growth (both light and heavy vehicles) in the city was notable in the past decade, with a reported sharp rise of about 87%. A recent study conducted in Guwahati, which computed the Hazard Quotient (HQ) based on NAAQS and WHO standards, indicated a high degree of health risk for the city dwellers (Dutta and Jinsart, 2020). There is black carbon pollution in the city air due to rapid urbanization and poor environmental quality control (Barman and Gokhale, 2019). Guwahati has a humid subtropical climate. The four major seasons of the city are winter (December to February), spring (March to May), summer (June to August), and autumn (September to November), with differing meteorological conditions. Guwahati has six ambient air monitoring stations, set up under the National Air Quality Monitoring Programme (NAMP), to measure key pollutants (Pant et al., 2019). Only one of the NAMP stations can measure PM2.5, while the newly developed CAAQM (Continuous Ambient Air Quality Monitoring) station started functioning only in mid-2019. The locations of the six NAMP stations and their monitoring types are shown in Fig. 1 and Table S3, respectively.

*Indian government classification of cities based on their population as tier-I, tier-II, and tier-III.

1 Data Treatment
A few missing values were observed in the daily average concentration data for PM10, CO, NO2, and SO2 in the 2016-2018 time series. As the observed values vary significantly, those few days were removed from the data set instead of being filled in by linear interpolation. The modified data set contained 1092 observations. The climate data (1096 data points) had no missing values but were adjusted for parity with the pollutant data by removing the corresponding days.
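The data treatment described above (dropping the days with a missing pollutant value and removing the same days from the climate record) can be sketched as follows. This is a minimal illustration, not the procedure actually run in the study; the function name `drop_incomplete_days`, the dictionary layout, and all numbers are hypothetical.

```python
from datetime import date


def drop_incomplete_days(pollutants, climate):
    """Keep only the days on which every pollutant value is present.

    pollutants: dict mapping date -> dict of pollutant readings (None = missing)
    climate:    dict mapping date -> dict of climate readings (assumed complete)
    Returns two dicts restricted to the same set of complete days.
    """
    complete_days = [
        day for day, readings in sorted(pollutants.items())
        if all(value is not None for value in readings.values())
    ]
    clean_pollutants = {day: pollutants[day] for day in complete_days}
    # Remove the matching climate rows so that both tables stay aligned.
    clean_climate = {day: climate[day] for day in complete_days if day in climate}
    return clean_pollutants, clean_climate


# Hypothetical three-day example; the numbers are illustrative, not observed data.
pollutants = {
    date(2016, 1, 1): {"PM10": 120.0, "CO": 0.9, "NO2": 15.0, "SO2": 7.0},
    date(2016, 1, 2): {"PM10": None, "CO": 1.1, "NO2": 16.0, "SO2": 7.5},
    date(2016, 1, 3): {"PM10": 98.0, "CO": 0.8, "NO2": 14.0, "SO2": 6.8},
}
climate = {
    date(2016, 1, 1): {"AT": 17.0, "RH": 80.0, "RF": 0.0, "WS": 1.2},
    date(2016, 1, 2): {"AT": 16.5, "RH": 82.0, "RF": 0.0, "WS": 1.0},
    date(2016, 1, 3): {"AT": 18.0, "RH": 78.0, "RF": 0.2, "WS": 1.4},
}
clean_pollutants, clean_climate = drop_incomplete_days(pollutants, climate)
```

Applied to the full record, the same filter would reduce the 1096 daily rows to the 1092 rows reported above.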

2 Descriptive Statistics and Analysis of Time Series
Descriptive statistics of the climate data, PM10, and gaseous pollutants for the period 2016-2018 (1092 data points) and a time series analysis were also worked out for air quality monitoring station 6 to understand the characteristics and correlation of the different variables over the study period. Station 6 was found to be representative of the six air quality monitoring stations of the city for reasons such as the completeness of its data sets and its reflection of the common land-use patterns of the city. Multiple time series charts were produced with time on the horizontal axis and PM10 concentrations, climate variables, and gaseous variables (AT, RH, RF, WS, SO2, CO, NO2) on the vertical axes.

3 Predictive Models Development and Validation
We used MLR analysis, the MLP class of ANN, and CART to forecast one-day-ahead PM10 concentrations for all six air quality monitoring stations of Guwahati city.
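All three models share the same supervised setup: the predictors observed on one day are paired with the PM10 value observed on the following day. The sketch below shows one way such pairs could be built. The function name `make_next_day_pairs` is hypothetical, and skipping days whose successor is absent is an assumption, since the paper does not state how the gaps left by removed days were handled.

```python
from datetime import date, timedelta


def make_next_day_pairs(records):
    """Pair each day's predictors with the PM10 value of the next calendar day.

    records: dict mapping date -> dict of predictor values, including "PM10".
    Days whose successor is missing from the records are skipped (assumption).
    """
    pairs = []
    for day in sorted(records):
        next_day = day + timedelta(days=1)
        if next_day in records:
            pairs.append((records[day], records[next_day]["PM10"]))
    return pairs


# Hypothetical records with a gap on 3 January; values are illustrative only.
records = {
    date(2016, 1, 1): {"PM10": 120.0, "RH": 80.0},
    date(2016, 1, 2): {"PM10": 110.0, "RH": 82.0},
    date(2016, 1, 4): {"PM10": 98.0, "RH": 78.0},
}
pairs = make_next_day_pairs(records)
```

Here only 1 January yields a pair, because 2 January and 4 January have no recorded successor.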

1 Multiple Linear Regression (MLR)
In MLR analysis, a mathematical model was built to forecast the dependent variable, i.e., next-day PM10, based on independent variables comprising climate variables and gaseous pollutants. In MLR, the coefficient of determination (R²) indicates the overall capability of the model to explain the variance in the data. The regression model was composed following equation 1 (Abdullah et al., 2019; Vlachogianni et al., 2011):

$$Y = \beta_0 + \sum_{i=1}^{n} \beta_i X_i + \varepsilon \tag{1}$$

where Y is the dependent variable, β_i are the regression coefficients, X_i are the independent variables, and ε is a stochastic error associated with the regression. This relationship was used in this study to develop a mathematical equation model to predict the next-day PM10 concentrations of the six ambient air monitoring stations of Guwahati with input variables like meteorological parameters, PM10, and gaseous pollutants. MLR assumes that the residuals are normally distributed with zero mean and constant variance and are uncorrelated. The stepwise multiple linear regression procedure was used here to derive the mathematical equation (Abdullah et al., 2019). The variance inflation factor (VIF) was used in this study to evaluate the effect of multicollinearity on the variance of the estimated regression coefficients. The equation for VIF (equation 2) is as follows:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2} \tag{2}$$

where R_i² is the coefficient of determination obtained by regressing the independent variable X_i on the remaining independent variables.
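To make the regression and the multicollinearity check concrete, the sketch below fits an ordinary least-squares model of the form Y = β0 + Σ βi Xi and computes the VIF in its conventional form, VIF_i = 1 / (1 − R_i²), where R_i² comes from regressing predictor i on the other predictors. It is a plain least-squares fit, not the stepwise SPSS procedure used in the study; all function names and the four-row data set are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(x_rows, y):
    """Return [b0, b1, ..., bn] minimising the squared error of y = b0 + sum(bi * xi)."""
    design = [[1.0] + list(row) for row in x_rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def r_squared(x_rows, y):
    """Coefficient of determination of the least-squares fit of y on x_rows."""
    beta = fit_ols(x_rows, y)
    fitted = [beta[0] + sum(b * x for b, x in zip(beta[1:], row)) for row in x_rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(x_rows, i):
    """VIF of predictor i: regress it on the remaining predictors."""
    target = [row[i] for row in x_rows]
    others = [[v for j, v in enumerate(row) if j != i] for row in x_rows]
    return 1.0 / (1.0 - r_squared(others, target))


# Hypothetical data: two uncorrelated predictors and an exactly linear response.
x_rows = [[1.0, 1.0], [2.0, -1.0], [3.0, -1.0], [4.0, 1.0]]
y = [2.0 + 3.0 * x1 + 0.5 * x2 for x1, x2 in x_rows]
beta = fit_ols(x_rows, y)
vif_first = vif(x_rows, 0)
```

Because the two toy predictors are uncorrelated, the VIF comes out at its minimum value of 1; strongly collinear predictors would push it toward the threshold of 10 that the study uses as a warning level.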

2 Multi-Layer Perceptron (MLP) Model
ANN is a robust data modeling technique capable of handling nonlinear relationships between variables and is hence suitable for the prediction of PM10, which requires exploring the complex relationship between particulate matter, meteorological variables, and gaseous pollutants present in the atmosphere (Feng et al., 2015). We used MLP in this study to create predictive models for each of the six ambient monitoring stations of Guwahati using nonlinear combinations of the input variables (meteorological parameters, PM10, PM2.5, and gaseous pollutants) to predict the next-day PM10 concentrations. MLP forms a network of functionally interconnected neurons, also known as perceptrons (Vemuri, 1988). ANN scores over MLR because of its ability to predict the dependent variable of a built-up model more accurately (Gardner and Dorling, 1998). MLP has a simple structure consisting of three layers: the input layer, the hidden layer, and the output layer. One hidden layer was considered in our study, as it was suggested to be sufficient to achieve the optimum model capacity (Bishop, 1995). The number of neurons, or nodes, in the input layer was equal to the number of input variables introduced in the model. The relevant input variables, i.e., observed meteorological parameters, PM10, and gaseous pollutants, are fed as signals to the input layer of the model and then passed on to the hidden layer. The neurons do the computations to detect features of the input variables and introduce them to the input layer with requisite weights. The weights are assigned to input variables based on their relative importance. The hidden layer performs the critical function of nonlinearly transforming the inputs that enter the network through a predefined activation function. Each neuron in the hidden layer sums up the incoming information, including the bias. The bias provides a trainable constant value to every neuron in addition to its normal inputs. The mathematical formulation of the MLP model is shown in equation 3:

$$Y = F\left(\sum_{j=1}^{m} W_{kj}\, F\left(\sum_{i=1}^{n} W_{ji} X_i + B_j\right) + B_k\right) \tag{3}$$

where Y = output, F = transfer function, W_kj = weights between the hidden and output layers, W_ji = weights between the input and hidden layers, X_i = input variables, m = number of neurons in the hidden layer, n = number of neurons in the input layer, B_j = bias values of the neurons in the hidden layer, and B_k = bias values of the neurons in the output layer. Fig. 2 depicts the basic structure of the MLP framework.
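The symbol definitions above describe a forward pass through a single hidden layer: each hidden neuron applies the transfer function to a weighted sum of the inputs plus its bias, and the output neuron does the same with the hidden outputs. A minimal sketch follows. The function name `mlp_forward`, the choice of a hyperbolic tangent hidden layer with a linear output, and all weights are assumptions for illustration; the trained weights of the study are not reproduced here.

```python
import math


def mlp_forward(x, w_ji, b_j, w_kj, b_k, hidden_fn=math.tanh, output_fn=lambda z: z):
    """Forward pass of a single-hidden-layer MLP with one output neuron.

    x:    list of n inputs X_i
    w_ji: m rows of n weights (input layer -> hidden layer)
    b_j:  m hidden-layer biases
    w_kj: m weights (hidden layer -> output neuron)
    b_k:  output bias
    """
    hidden = [
        hidden_fn(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(w_ji, b_j)
    ]
    return output_fn(sum(w * h for w, h in zip(w_kj, hidden)) + b_k)


# Hypothetical network with n = 2 inputs and m = 2 hidden neurons.
y_zero = mlp_forward([0.5, -1.0], [[1.0, 0.5], [0.0, 0.0]], [0.0, 0.0], [2.0, 3.0], 0.1)
y_one = mlp_forward([1.0, 0.0], [[1.0, 0.0]], [0.0], [2.0], 0.1)
```

Separate `hidden_fn` and `output_fn` arguments are used because the study compares several hidden/output transfer function pairs (for example sigmoid/linear and hyperbolic tangent/linear); passing the same function for both gives the single-F form.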

3 Classification and Regression Trees (CART)
CART is a non-parametric regression technique that can be employed for the prediction of a dependent variable when the distribution of the independent variables is not known. Typically, therefore, the CART method tries to ascertain the distribution pattern of the outcome (dependent) variable using the independent variables through their linear or nonlinear relationship with the outcome variable. CART builds up a decision tree through a hierarchy of binary decisions. Each binary decision involves splitting the target variable into two alternative and mutually exclusive branches (groups), depending on the value of the explanatory variable that leads to the largest possible reduction in the post-split variation of the target variable. Splitting stops when no additional gain can be achieved by further splitting (Mckenney and Pedlar, 2003; Moisen and Frescino, 2002). Predictive CART models were built in this study for each of the ambient air quality monitoring stations, with observed independent predictor variables like meteorological parameters, PM10, and gaseous pollutants of the respective stations, to predict the respective dependent variable, i.e., the next-day PM10 concentration.
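The binary splitting step can be illustrated with a small search over candidate thresholds of one explanatory variable, choosing the threshold that leaves the smallest summed squared error of the target in the two branches. This is a sketch of the general regression-tree idea under the usual squared-error criterion, which is an assumption here; it is not the SPSS Modeler implementation, and the function names and data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations of the values from their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, sse_after) for the binary split x <= threshold that
    minimises the summed squared error of y in the two child nodes.

    The threshold is None when no split improves on the unsplit node.
    """
    best = (None, sse(y))
    levels = sorted(set(x))
    for lo, hi in zip(levels, levels[1:]):
        threshold = (lo + hi) / 2.0  # midpoint between adjacent observed values
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical relative humidity (%) and next-day PM10 values, illustrative only.
rh = [60.0, 65.0, 70.0, 85.0, 90.0, 95.0]
pm10_next = [150.0, 160.0, 155.0, 80.0, 90.0, 85.0]
threshold, sse_after = best_split(rh, pm10_next)
```

A full tree repeats this search over every predictor at every node and stops when no split yields a further reduction.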

4 Model Validation
The MLR, MLP, and CART models were validated by computing the net absolute error (NAE), mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), index of agreement (IA), and coefficient of determination (R²) (Grzesiak and Zaborski, 2012; Jinsart et al., 2010; Willmott et al., 2009). Table 2 provides the performance indicators for model validation. SPSS 25 was used for the MLR and MLP computations, while SPSS Modeler 18 was used for CART.

The mean values and standard deviations of the meteorological parameters, PM10, and gaseous pollutants of the respective air quality monitoring stations of the city are provided in Table 3. High variability was observed in the daily average PM10 level during 2016-2018. Across the six air quality monitoring stations, the maximum and minimum mean PM10 concentrations were 133.32 μg m⁻³ and 51.41 μg m⁻³, respectively. The highest daily average PM10 recorded was 259.39 μg m⁻³, while the lowest was 40.67 μg m⁻³ during the period 2016-2018. The average RH level of the city was on the higher side, while wind speed was on the lower side. The time-series data reveal a maximum temperature of 34°C recorded during the summer season and a minimum of 14°C during the winter season. Guwahati received rainfall from the southwest monsoon, and the highest rainfall occurred from June to August.

1.2 Correlation of PM10 Concentration, Climate Variables, and Gaseous Variables
In Fig. 3(A)-3(G), the time series of the observed meteorological parameters, PM10, and gaseous pollutants are reported for air quality monitoring station 6 of Guwahati city. It can be observed from Fig. 3(A) that the site is characterized by relatively high humidity throughout the year. The time series considered in this study shows that the concentration of PM10 has maintained a largely negative correlation with relative humidity. The PM10 concentration of the city shows an annual cycle with high concentrations during winter (December to February), possibly due to a lower planetary boundary layer height, and higher concentrations seem to continue into March-April as well, i.e., beyond winter.
Another peculiarity of the site is that CO, NO2, and SO2 have a positive correlation with PM10 concentrations, suggesting a common source for these compounds, but the correlation with SO2 is stronger, as shown in Fig. 3(B)-3(D) and Table S4. Fig. 3(E) indicates a negative correlation of PM10 with temperature, while Fig. 3(F) and 3(G) indicate a very mild positive correlation with rainfall and wind speed, respectively.

2 Multiple Linear Regression Model for PM10 Forecasting
The summary of the MLR models developed for all six ambient air quality monitoring stations located at Guwahati is given in Table 4. The Variance Inflation Factor (VIF) values for the independent variables of all six MLR models are in order, as they are below 10, showing that the models have no multicollinearity issues. The Durbin-Watson (D-W) statistics were in the range of 2.103-2.239, i.e., close to 2, indicating that autocorrelation of the residuals is not a concern for the models. The residual (error) is critical in judging the robustness of the model, as linear regression is sensitive to outlier effects. Fig. 4(A)-4(F) shows the histogram plots, which indicate that the residuals are normally distributed with zero mean and constant variance. Fig. S1(A)-S1(F) show the observations and predictions of the MLR models in scatter plots.
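The Durbin-Watson statistic reported above is the ratio of the summed squared differences between successive residuals to the summed squared residuals; it ranges from 0 to 4, and values near 2 point to little first-order autocorrelation. A minimal sketch with hypothetical residual series (not the study's residuals):

```python
def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2) for a residual series e_1..e_T."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: perfectly alternating signs versus a constant offset.
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
dw_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])
```

The alternating series gives a value above 2 (negative autocorrelation) and the constant series gives 0 (strong positive autocorrelation), which brackets the 2.103-2.239 range found for the six MLR models.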

3 Multi-Layer Perceptron Model
The normalized input variables PM10, RF, T, RH, WS, NO2, SO2, and CO of the respective air monitoring stations were fed into the six different ANN models using the data normalization facility of the ANN module of the SPSS software. For ANN training, 70% of the data set was used, and the remaining 30% was used for testing. The training data set is propagated in the forward phase through the hidden layer and comes out through the output layer. The error, i.e., the difference between the output values and the actual target values, is propagated back toward the hidden layer until the errors are reduced in successive cycles (Ul-Saufie et al., 2013). In the process, the neural network learnt and changed its weights during the forward and backward phases. In this study, we engaged different combinations of transfer functions, namely sigmoid/hyperbolic tangent, sigmoid/linear, sigmoid/sigmoid, and hyperbolic tangent/linear, for each of the six monitoring stations to compare and pick the optimum R² values, as shown in Table 5. The network structure, transfer functions of each of the models, and performance indicators (IA, R², NAE, MAE, MSE, and RMSE) can be seen in Table 5. The optimum R² values (0.651 for station S1, 0.637 for station S2, 0.688 for station S3, 0.636 for station S4, 0.641 for station S5, and 0.693 for station S6) are marked in bold in Table 5. The respective values of the performance indicators IA, NAE, MAE, MSE, and RMSE for each of the six monitoring stations, against the optimum R² values, can also be seen in Table 5. Fig. S2(A)-S2(F) show the observations and predictions of the ANN models in scatter plots.
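The two preparation steps mentioned above, normalizing the inputs and partitioning the records 70/30 into training and testing sets, could look like the sketch below. The exact normalization method and partitioning scheme applied by SPSS are not stated in the text, so min-max scaling to [0, 1] and a seeded random shuffle are assumptions, and both function names are hypothetical.

```python
import random


def min_max_normalize(values):
    """Rescale values linearly to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def split_70_30(rows, seed=0):
    """Randomly assign about 70% of the rows to training and the rest to testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Row indices standing in for the 1092 daily records of one station.
rows = list(range(1092))
train_rows, test_rows = split_70_30(rows)
scaled = min_max_normalize([50.0, 100.0, 150.0])
```

With 1092 records this yields 764 training rows and 328 testing rows; a fixed seed keeps the partition reproducible when different transfer function pairs are compared.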

4 Predictive CART Model
Using CART analysis, several decision trees were developed based on different combinations of observed meteorological parameters, PM10, and gaseous pollutants for the three years 2016-2018. As is typical in machine learning, 70% of the total data points of the respective independent and dependent variables were used as the training set and 30% as the test set. The optimum model for each of the six air quality monitoring stations of Guwahati was the one with the least relative error, given by equation 4 below.

$$\text{Relative error of CART} = \frac{S(K)}{S(O)} \tag{4}$$

where S(K) is the sum of the squared residuals at the terminal nodes and S(O) is the sum of squared errors of the dependent variable around its mean in the root node. The predictive CART models and performance indicators (R², IA, NAE, MAE, MSE, and RMSE) are given in Table 6. Fig. S3(A)-S3(F) show the decision trees of the CART models.
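Assuming the relative error is the ratio S(K)/S(O) of the two quantities defined above (the equation itself did not survive in the source, so this ratio form is an assumption), it can be computed from the target values grouped by terminal node. The function names and the toy tree are hypothetical.

```python
def sum_squared_deviation(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def relative_error(terminal_nodes):
    """S(K) / S(O) for a regression tree.

    terminal_nodes: list of lists, each holding the target values that fall
    into one terminal node. S(K) sums the squared residuals within the
    terminal nodes; S(O) is the squared error around the mean at the root.
    """
    all_values = [v for node in terminal_nodes for v in node]
    s_o = sum_squared_deviation(all_values)
    s_k = sum(sum_squared_deviation(node) for node in terminal_nodes)
    return s_k / s_o


# Hypothetical tree with two terminal nodes; values are illustrative only.
re_two_nodes = relative_error([[150.0, 160.0, 155.0], [80.0, 90.0, 85.0]])
re_root_only = relative_error([[150.0, 160.0, 155.0, 80.0, 90.0, 85.0]])
```

An unsplit tree has a relative error of 1, and the value falls toward 0 as the terminal nodes become more homogeneous, which is why the tree with the least relative error is preferred.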

5 Model Comparison
All six performance indicators were used to compare the one-day-ahead PM10 prediction performance of the three methods, i.e., MLR, ANN (MLP), and CART, to isolate the best model, as shown in Table 7. NAE, MAE, MSE, and RMSE were used to find the error of the model, where a value closer to 0 indicates a better model. The other two performance indicators, namely IA and R², were used to check the accuracy of the model result, where higher accuracy is given by a value closer to 1. The values of the performance indicators provide specific information regarding predictive performance (Singh et al., 2013). An RMSE-based comparison between models is preferable when the objective is to avoid large prediction errors, because squaring gives large errors more weight. On the other hand, MAE casts light on the average magnitude of the errors without considering their direction. The advantage of the linear score of MAE lies in the fact that all individual differences between predictions and corresponding observed values are given equal weight in the average. However, amongst all six performance indicators, R² can be regarded as the single most important measure in deciding the prediction accuracy (Yoo et al., 2018).
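The six indicators discussed above can be computed from paired observed and predicted values as sketched below. The exact formulas are those of Table 2, which is not reproduced in this text, so the definitions used here are assumptions based on common usage: NAE as the summed absolute error divided by the summed observations, IA as Willmott's index of agreement, MSE with a divisor of n, and R² as the squared Pearson correlation. The function name and the data are hypothetical.

```python
import math


def performance_indicators(observed, predicted):
    """Return NAE, MAE, MSE, RMSE, IA and R2 for paired observations and predictions."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    errors = [p - o for o, p in zip(observed, predicted)]
    abs_error_sum = sum(abs(e) for e in errors)
    sq_error_sum = sum(e ** 2 for e in errors)
    mse = sq_error_sum / n
    # Willmott's index of agreement: 1 means perfect agreement.
    potential_error = sum(
        (abs(p - mean_o) + abs(o - mean_o)) ** 2 for o, p in zip(observed, predicted)
    )
    # Squared Pearson correlation between observations and predictions.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "NAE": abs_error_sum / sum(observed),
        "MAE": abs_error_sum / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "IA": 1.0 - sq_error_sum / potential_error,
        "R2": cov ** 2 / (var_o * var_p),
    }


# Hypothetical observed and predicted PM10 values (μg m-3), illustrative only.
observed = [100.0, 120.0, 140.0, 160.0]
predicted = [110.0, 110.0, 150.0, 150.0]
scores = performance_indicators(observed, predicted)
perfect = performance_indicators(observed, observed)
```

In this toy case every error has the same magnitude, so MAE and RMSE coincide; with a few large misses RMSE would exceed MAE, which is the sensitivity to large errors noted in the text.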
In this study, the prediction of one-day-ahead PM10 for all six air quality monitoring stations displayed relatively good fits through the use of MLP methods (R² = 0.64-0.69; IA = 0.88-0.95) and the smallest errors (NAE = 0.15-0.22; MAE = 16-26.11; MSE = 497.86-1408.45; and RMSE = 22.31-37.53). Hence, the results obtained from the ANN (MLP) models were more suitable for Guwahati than those of the constructed MLR and CART models.

CONCLUSIONS AND RECOMMENDATIONS
The comprehensive and comparative review of the PM10 concentration status of 28 Indian cities of different categories (tier-I, tier-II, and tier-III cities) and the alarming levels of PM10 concentrations thereof indicate the urgent need The tier-II city Guwahati recorded high variability in the observed PM10 level due to rapid urbanization. The highest daily average PM10 recorded was 259.39 μg m⁻³, while the lowest was 40.67 μg m⁻³ during the period 2016-2018. The mean PM10 concentration for the city of 133.22 μg m⁻³, as found in this study, violated the WHO standard by a factor of 6.66 and the NAAQS by a factor of 2.22, against factors of 4.54 and 1.51, respectively, during 2013-2014 (Table 1, Table S1). The average daily NO2, CO, and SO2 concentrations of Guwahati were found to be correlated with PM10 concentrations during 2016-2018, suggesting a common source for these compounds.
In different cities of the world, different predictive modeling techniques have been used to predict PM10 in advance. However, MLR with stepwise inclusion of input variables was found to be the most widely used tool for temporal prediction of PM10 in different urban areas of India, and it has mostly been applied in the bigger cities of the country (Table S2). This study found that the next day's PM10 concentrations in the tier-II city of Guwahati can be better forecasted using the nonlinear MLP algorithm with FFBN topologies of the ANN class than with the linear MLR and nonlinear CART models. These three models were critically assessed through a comparative evaluation of performance indicators, keeping in mind that the end goal is to choose the best-fitted model for accurately forecasting PM10 at the city level. The results of the study reveal that, for all six air quality monitoring stations of Guwahati, the one-day-ahead PM10 prediction ability was relatively better using MLP methods (R² = 0.64-0.69; IA = 0.88-0.95), with the smallest errors (NAE = 0.15-0.22; MAE = 16-26.11; MSE = 497.86-1408.45; and RMSE = 22.31-37.53), in comparison to its closest performer, the MLR method (R² = 0.61-0.68; IA = 0.87-0.90; NAE = 0.20-0.23; MAE = 21.37-26.26; MSE = 859.23-1461.48; and RMSE = 29.31-38.23). It is interesting to note that CART, as a predictive method for one-day-ahead PM10, was close to MLR but did not match it in terms of the model evaluation indicators, with R² = 0.52-0.63; IA = 0.84-0.88; NAE = 0.21-0.26; MAE = 21.97-27.35; MSE = 922.12-1700.95; and RMSE = 30.37-41.24. Relatively low R² values are quite common in the case of time-series-dependent nonlinear atmospheric variables with their known confounding effects.
An attempt was made to further validate the predictive performance of the MLP model against the observed PM10 data of the (STN6_193) NAMP monitoring station of Guwahati for a period (January-March 2019) beyond that of the data used in developing the MLP model. The PM10 concentrations predicted by the MLP model were matched with the actual data of the same period. Fig. S4 shows that the MLP model performed well for the post-study period as well, and the model performance indicators (R² = 0.69, IA = 0.89, NAE = 0.07, MAE = 13.02, MSE = 287.95, and RMSE = 16.97) were also in line with the original model (Table S5).
In the backdrop of CPCB's acknowledgment that comparatively smaller tier-II cities are also facing severe air pollution, city authorities are contemplating several steps for curtailing air pollution and the resulting health hazards. We recommend that local authorities use the nonlinear MLP (ANN) algorithm with FFBN topologies for forecasting PM10 concentrations in smaller Indian cities like Guwahati as well, to avoid PM-induced health hazards to a great extent. 'Predict pollution and defeat concentration' could be another approach to fighting the air pollution menace, in addition to the odd-even rule, which a few Indian cities are presently enforcing to rein in air pollution through the curtailment of vehicular pollution. The advance prediction approach seems to be more applicable to Guwahati city, as this study found that the PM10 concentration build-up had a positive correlation with gaseous pollutants and is hence likely to have a common source, i.e., vehicular pollution. Moreover, with this model, the local SPCB authorities can caution city dwellers about impending dangerous levels of PM10, so that they can reduce their outdoor activities on those days and thereby avoid exposure to unhealthy levels of air quality.

Fig. 2. The architecture of the MLP network.

Fig. S2. Observations and predictions of the ANN models in scatter plots.
Fig. S3. Decision tree structures from CART for stations 1 to 6.

Table 2. Performance indicators for model validation.

Table 4. Summary of the Multiple Linear Regression (MLR) models for PM10 forecasting.

Table 5. Predictive MLP models with network structure, transfer functions and performance indicators.

Table 6. Predictive CART models with input variables PM10, RF, T, RH, WS, NO2, SO2, CO and performance indicators.

Table 7. PM10 prediction model performance statistics: NAE, MAE, MSE, RMSE, IA, and R² between measured and estimated values for six air quality monitoring stations.

Therefore, it is high time for the initiation of some requisite actions for diminishing or preventing the build-up of high ambient PM10 concentration levels in the cities. One way out is abatement action through short-term traffic reduction in cities based on PM10 concentration levels predicted in advance. This entails correct prediction of the city-level PM10 concentrations at least one or two days in advance by the local air quality managers, through analysis of data routinely gathered by city authorities and predictive modeling thereof.

Table S2. Different data-driven predictive techniques used for PM forecasting in the Indian context.