DataCareer Blog

For many machine learning problems with a large number of features or a low number of observations, a linear model tends to overfit and variable selection is tricky. Models that use shrinkage, such as Lasso and Ridge, can improve prediction accuracy because they reduce the estimation variance while still providing an interpretable final model. In this tutorial, we will examine Ridge and Lasso regression, compare them to classical linear regression, and apply them to a dataset in Python.

Ridge and Lasso build on the linear model, but their fundamental peculiarity is regularization: the loss function is modified so that it depends not only on the sum of the squared residuals but also on the size of the regression coefficients. One of the main problems in building such models is the correct selection of the regularization parameter. Compared to linear regression, Ridge and Lasso models are more resistant to outliers and to the spread of the data. Overall, their main purpose is to prevent overfitting.

The main difference between Ridge regression and Lasso is how they assign a penalty term to the coefficients. We will explore this with our example, so let's start.
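To make the penalty terms concrete, these are the two objective functions, written in the notation scikit-learn uses, where alpha is the regularization strength (scikit-learn's Lasso additionally divides the squared-error term by 2n, which rescales alpha but does not change the idea):

$$\hat{\beta}^{\,ridge} = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2 + \alpha\,\lVert \beta \rVert_2^2$$

$$\hat{\beta}^{\,lasso} = \arg\min_{\beta}\ \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 + \alpha\,\lVert \beta \rVert_1$$

The squared L2 penalty shrinks all coefficients smoothly towards zero, while the L1 penalty can set some of them exactly to zero; this difference is exactly what we will see in the coefficient plots later on.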
We will work with the Diamonds dataset, which is freely available online: http://vincentarelbundock.github.io/Rdatasets/datasets.html . It contains the prices and other attributes of almost 54,000 diamonds. We will predict the price using the available attributes and compare the results for Ridge, Lasso, and OLS.

In [1]:
# Import libraries
import numpy as np
import pandas as pd

# Upload the dataset
diamonds = pd.read_csv('diamonds.csv')
diamonds.head()

Out[1]:
   Unnamed: 0  carat      cut color clarity  depth  table  price     x     y     z
0           1   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1           2   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2           3   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3           4   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4           5   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75

In [2]:
# Drop the index column
diamonds = diamonds.drop(['Unnamed: 0'], axis=1)
diamonds.head()

Out[2]:
   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75

In [3]:
# Print unique values of the text features
print(diamonds.cut.unique())
print(diamonds.clarity.unique())
print(diamonds.color.unique())

['Ideal' 'Premium' 'Good' 'Very Good' 'Fair']
['SI2' 'SI1' 'VS1' 'VS2' 'VVS2' 'VVS1' 'I1' 'IF']
['E' 'I' 'J' 'H' 'F' 'G' 'D']

As you can see, each of these text features takes only a small, finite set of values, so we can encode the categorical variables as numbers.

In [4]:
# Import the label encoder
from sklearn.preprocessing import LabelEncoder

categorical_features = ['cut', 'color', 'clarity']
le = LabelEncoder()

# Convert the categorical variables to numerical codes
for i in range(3):
    new = le.fit_transform(diamonds[categorical_features[i]])
    diamonds[categorical_features[i]] = new

diamonds.head()

Out[4]:
   carat  cut  color  clarity  depth  table  price     x     y     z
0   0.23    2      1        3   61.5   55.0    326  3.95  3.98  2.43
1   0.21    3      1        2   59.8   61.0    326  3.89  3.84  2.31
2   0.23    1      1        4   56.9   65.0    327  4.05  4.07  2.31
3   0.29    3      5        5   62.4   58.0    334  4.20  4.23  2.63
4   0.31    1      6        3   63.3   58.0    335  4.34  4.35  2.75

Before building the models, let's first scale the data. Lasso and Ridge put constraints on the size of the coefficients associated with each variable, and the penalty depends on the magnitude of each variable. It is therefore necessary to center and scale, i.e. standardize, the variables.

In [5]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Create the feature and target matrices
X = diamonds[['carat', 'depth', 'table', 'x', 'y', 'z', 'clarity', 'cut', 'color']]
y = diamonds[['price']]

# Scale the data
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

Now we can build the Lasso and Ridge models. For now, we will train them on the whole dataset and look at the R-squared scores and the model coefficients. Note that we are not setting alpha explicitly, so the default value of 1 is used.

In [6]:
# Import the linear models
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Create the Lasso and Ridge objects
lasso = linear_model.Lasso()
ridge = linear_model.Ridge()

# Fit the models
lasso.fit(X, y)
ridge.fit(X, y)

# Print scores, MSE, and coefficients
print("lasso score:", lasso.score(X, y))
print("ridge score:", ridge.score(X, y))
print("lasso MSE:", mean_squared_error(y, lasso.predict(X)))
print("ridge MSE:", mean_squared_error(y, ridge.predict(X)))
print("lasso coef:", lasso.coef_)
print("ridge coef:", ridge.coef_)

lasso score: 0.8850606039595762
ridge score: 0.8850713120355513
lasso MSE: 1829298.919415987
ridge MSE: 1829128.4968064611
lasso coef: [ 5159.45245224  -217.84225841  -207.20956411 -1250.0126333     16.16031486
    -0.           496.17780105    72.11296318  -451.28351376]
ridge coef: [[ 5.20114712e+03 -2.20844296e+02 -2.08496831e+02 -1.32579812e+03
    5.36297456e+01 -1.67310953e+00  4.96434236e+02  7.26648505e+01
   -4.53187286e+02]]

These two models give very similar results. Next, we split the data into training and test sets, build Ridge and Lasso models, and choose the regularization parameter with the help of GridSearchCV. For that, we have to define the set of candidate parameters; the models with the highest R-squared score give us the best parameters.

In [7]:
# Make the necessary imports, split the data into training and test sets, and choose a set of parameters
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings("ignore")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)
parameters = {'alpha': np.concatenate((np.arange(0.1, 2, 0.1),
                                       np.arange(2, 5, 0.5),
                                       np.arange(5, 25, 1)))}

linear = linear_model.LinearRegression()
lasso = linear_model.Lasso()
ridge = linear_model.Ridge()
gridlasso = GridSearchCV(lasso, parameters, scoring='r2')
gridridge = GridSearchCV(ridge, parameters, scoring='r2')

# Fit the models and print the best parameters, R-squared scores, MSE, and coefficients
gridlasso.fit(X_train, y_train)
gridridge.fit(X_train, y_train)
linear.fit(X_train, y_train)
print("ridge best parameters:", gridridge.best_params_)
print("lasso best parameters:", gridlasso.best_params_)
print("ridge score:", gridridge.score(X_test, y_test))
print("lasso score:", gridlasso.score(X_test, y_test))
print("linear score:", linear.score(X_test, y_test))
print("ridge MSE:", mean_squared_error(y_test, gridridge.predict(X_test)))
print("lasso MSE:", mean_squared_error(y_test, gridlasso.predict(X_test)))
print("linear MSE:", mean_squared_error(y_test, linear.predict(X_test)))
print("ridge best estimator coef:", gridridge.best_estimator_.coef_)
print("lasso best estimator coef:", gridlasso.best_estimator_.coef_)
print("linear coef:", linear.coef_)

ridge best parameters: {'alpha': 24.0}
lasso best parameters: {'alpha': 1.5000000000000002}
ridge score: 0.885943268935384
lasso score: 0.8863841649966073
linear score: 0.8859249267960946
ridge MSE: 1812127.0709091045
lasso MSE: 1805122.1385342914
linear MSE: 1812418.4898094584
ridge best estimator coef: [[ 5077.4918518   -196.88661067  -208.02757232 -1267.11393653   208.75255168
    -91.36220706   502.04325405    74.65191115  -457.7374841 ]]
lasso best estimator coef: [ 5093.28644126  -207.62814092  -206.99498254 -1234.01364376    67.46833024
    -0.           501.11520439    73.73625478  -457.08145762]
linear coef: [[ 5155.92874335  -208.70209498  -208.16287626 -1439.0942139    243.82503796
    -28.79983655   501.31962765    73.93030707  -459.94636759]]

Our scores rise a little, but with these values of alpha the difference is small.
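If you want to see how the cross-validated R-squared behaves across the whole alpha grid rather than only at the optimum, a fitted GridSearchCV object exposes this through its standard cv_results_ attribute. A minimal sketch for the Lasso search, reusing the objects defined above:

# Summarize the grid search: one row per alpha with its mean cross-validated R-squared
cv_summary = pd.DataFrame({
    'alpha': gridlasso.cv_results_['param_alpha'],
    'mean_cv_r2': gridlasso.cv_results_['mean_test_score'],
})
print(cv_summary.sort_values('mean_cv_r2', ascending=False).head())

The same works for gridridge, and it is a quick way to spot when the selected value sits at the edge of the grid (as the ridge optimum of 24 does here, since the grid stops at 24), in which case extending the grid can be worthwhile.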
Let's build coefficient plots to see how the value of alpha influences the coefficients of both models.

In [9]:
# Import the library for visualization
import matplotlib.pyplot as plt

coefsLasso = []
coefsRidge = []

# Build Ridge and Lasso for 200 values of alpha and store the coefficients
alphasLasso = np.arange(0, 20, 0.1)
alphasRidge = np.arange(0, 200, 1)
for i in range(200):
    lasso = linear_model.Lasso(alpha=alphasLasso[i])
    lasso.fit(X_train, y_train)
    coefsLasso.append(lasso.coef_)

    ridge = linear_model.Ridge(alpha=alphasRidge[i])
    ridge.fit(X_train, y_train)
    coefsRidge.append(ridge.coef_[0])

# Build the Lasso and Ridge coefficient plots
plt.figure(figsize=(16, 7))
plt.subplot(121)
plt.plot(alphasLasso, coefsLasso)
plt.title('Lasso coefficients')
plt.xlabel('alpha')
plt.ylabel('coefs')

plt.subplot(122)
plt.plot(alphasRidge, coefsRidge)
plt.title('Ridge coefficients')
plt.xlabel('alpha')
plt.ylabel('coefs')
plt.show()

As a result, you can see that when we raise alpha in Ridge regression, the magnitude of the coefficients decreases but never reaches zero. The same increase in Lasso has less influence on the large coefficients, but it shrinks the small ones all the way to zero. Lasso can therefore also be used to determine which features are important: it keeps the features that influence the target variable and discards the rest. Ridge regression, by contrast, penalizes all features uniformly, which reduces model complexity and mitigates multicollinearity.

Now, it's your turn!
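If you want a concrete starting point, here is a minimal sketch (reusing the gridlasso object and the feature list defined above) that turns the tuned Lasso fit into an explicit feature selection:

# Keep only the features to which the tuned Lasso assigned a non-zero coefficient
feature_names = ['carat', 'depth', 'table', 'x', 'y', 'z', 'clarity', 'cut', 'color']
lasso_coefs = gridlasso.best_estimator_.coef_
selected = [name for name, coef in zip(feature_names, lasso_coefs) if coef != 0]
print("Features kept by Lasso:", selected)

With the coefficients printed earlier, the only feature dropped this way is 'z', whose coefficient the tuned Lasso shrank to zero.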
A common and very challenging problem in machine learning is overfitting, and it comes in many different forms. Overfitting occurs when a model captures too much of the noise in the training data, which leads to poor prediction accuracy on new data. One way to avoid overfitting is regularization. In this tutorial, we will examine Ridge regression and Lasso, which extend classical linear regression. Earlier we showed how to work with Ridge and Lasso in Python; this time we will build and train our models in R using the caret package.

Like classical linear regression, Ridge and Lasso fit a linear model, but their fundamental peculiarity is regularization: the loss function is modified so that it depends not only on the sum of the squared residuals but also on the size of the regression coefficients. One of the main problems in building such models is the correct selection of the regularization parameter. Compared to linear regression, Ridge and Lasso models are more resistant to outliers and to the spread of the data. Overall, their main purpose is to prevent overfitting. The main difference between Ridge regression and Lasso is how they assign a penalty to the coefficients. We will explore this with our example, so let's start.

We will work with the Diamonds dataset, which you can download from http://vincentarelbundock.github.io/Rdatasets/datasets.html . It contains the prices and other attributes of almost 54,000 diamonds. We will predict the price of diamonds using the other attributes and compare the results for Ridge, Lasso, and OLS.

# Upload the dataset
diamonds <- read.csv('diamonds.csv', header = TRUE, sep = ',')
head(diamonds)

##   X carat       cut color clarity depth table price    x    y    z
## 1 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

# Drop the index column
diamonds <- diamonds[ , -which(names(diamonds) == 'X')]
head(diamonds)

##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

# Print the levels of the text features
print(levels(diamonds$cut))
## [1] "Fair"      "Good"      "Ideal"     "Premium"   "Very Good"
print(levels(diamonds$clarity))
## [1] "I1"   "IF"   "SI1"  "SI2"  "VS1"  "VS2"  "VVS1" "VVS2"
print(levels(diamonds$color))
## [1] "D" "E" "F" "G" "H" "I" "J"

As you can see, each of these factors has a finite number of levels, so we can convert the categorical variables to numerical ones.

# Convert the factor columns to integer codes
diamonds$cut <- as.integer(diamonds$cut)
diamonds$color <- as.integer(diamonds$color)
diamonds$clarity <- as.integer(diamonds$clarity)
head(diamonds)

##   carat cut color clarity depth table price    x    y    z
## 1  0.23   3     2       4  61.5    55   326 3.95 3.98 2.43
## 2  0.21   4     2       3  59.8    61   326 3.89 3.84 2.31
## 3  0.23   2     2       5  56.9    65   327 4.05 4.07 2.31
## 4  0.29   4     6       6  62.4    58   334 4.20 4.23 2.63
## 5  0.31   2     7       4  63.3    58   335 4.34 4.35 2.75
## 6  0.24   5     7       8  62.8    57   336 3.94 3.96 2.48

Before building the models, let's first scale the data. Standardizing puts the features on a comparable scale (after scaling, each feature has mean 0 and standard deviation 1), so a feature with a larger magnitude does not dominate the penalty simply because of its units. To scale a feature, we subtract its mean from every value and divide by its standard deviation.
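In formula form, standardizing replaces every value x of a feature by its z-score (the textbook definition, nothing specific to caret):

$$z = \frac{x - \bar{x}}{s}$$

where $\bar{x}$ is the feature's mean and $s$ its standard deviation.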
For scaling the data and training the models we'll use the caret package. The caret package (short for Classification And REgression Training) is a wrapper around a number of other packages and provides a unified interface for data preparation, model training, and metric evaluation.

# Load the packages used below (caret for training, dplyr for select() and %>%)
library(caret)
library(dplyr)

# Create the feature and target matrices
X <- diamonds %>% select(carat, depth, table, x, y, z, clarity, cut, color)
y <- diamonds$price

# Scale the data
preprocessParams <- preProcess(X, method = c("center", "scale"))
X <- predict(preprocessParams, X)

Now we can build Lasso and Ridge. We'll split the data into a train and a test set, but for now we won't tune the regularization parameter lambda; it is simply set to 1. The 'glmnet' method in caret has an alpha argument that determines which type of model is fit: if alpha = 0 a ridge regression model is fit, and if alpha = 1 a lasso model is fit. Here we'll use caret as a wrapper for glmnet.

# Split the data into two parts based on the outcome: 75% and 25%
index <- createDataPartition(y, p = 0.75, list = FALSE)
X_train <- X[ index, ]
X_test <- X[-index, ]
y_train <- y[index]
y_test <- y[-index]

# Create and fit the Lasso and Ridge objects
lasso <- train(y = y_train,
               x = X_train,
               method = 'glmnet',
               tuneGrid = expand.grid(alpha = 1, lambda = 1))

ridge <- train(y = y_train,
               x = X_train,
               method = 'glmnet',
               tuneGrid = expand.grid(alpha = 0, lambda = 1))

# Make the predictions
predictions_lasso <- lasso %>% predict(X_test)
predictions_ridge <- ridge %>% predict(X_test)

# Print the R-squared scores
data.frame(
  Ridge_R2 = R2(predictions_ridge, y_test),
  Lasso_R2 = R2(predictions_lasso, y_test)
)

##    Ridge_R2  Lasso_R2
## 1 0.8584854 0.8813328

# Print the RMSE
data.frame(
  Ridge_RMSE = RMSE(predictions_ridge, y_test),
  Lasso_RMSE = RMSE(predictions_lasso, y_test)
)

##   Ridge_RMSE Lasso_RMSE
## 1   1505.114   1371.852

# Print the coefficients
data.frame(
  as.data.frame.matrix(coef(lasso$finalModel, lasso$bestTune$lambda)),
  as.data.frame.matrix(coef(ridge$finalModel, ridge$bestTune$lambda))
) %>%
  rename(Lasso_coef = X1, Ridge_coef = X1.1)

##              Lasso_coef Ridge_coef
## (Intercept)  3934.27842 3934.23903
## carat        5109.77597 2194.54696
## depth        -210.68494  -92.16124
## table        -207.38199 -156.89671
## x           -1178.40159  709.68312
## y               0.00000  430.43936
## z               0.00000  423.90948
## clarity       499.77400  465.56366
## cut            74.61968   83.61722
## color        -448.67545 -317.53366

These two models give very similar results. Now let's choose the regularization parameter with the help of tuneGrid; the models with the highest R-squared score will give us the best parameters.
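For reference, glmnet (which caret calls under the hood for method = 'glmnet') fits the elastic-net objective, of which ridge and lasso are the two extreme cases; this is what the alpha and lambda arguments control:

$$\min_{\beta_0,\,\beta}\ \frac{1}{2n}\sum_{i=1}^{n}\big(y_i - \beta_0 - x_i^{\top}\beta\big)^2 + \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]$$

With alpha fixed at 0 (ridge) or 1 (lasso), tuning reduces to choosing lambda, which is what the grid below does.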
parameters <- c(seq(0.1, 2, by = 0.1),
                seq(2, 5, 0.5),
                seq(5, 25, 1))

lasso <- train(y = y_train,
               x = X_train,
               method = 'glmnet',
               tuneGrid = expand.grid(alpha = 1, lambda = parameters),
               metric = "Rsquared")

ridge <- train(y = y_train,
               x = X_train,
               method = 'glmnet',
               tuneGrid = expand.grid(alpha = 0, lambda = parameters),
               metric = "Rsquared")

linear <- train(y = y_train,
                x = X_train,
                method = 'lm',
                metric = "Rsquared")

print(paste0('Lasso best parameters: ', lasso$finalModel$lambdaOpt))
## [1] "Lasso best parameters: 2.5"
print(paste0('Ridge best parameters: ', ridge$finalModel$lambdaOpt))
## [1] "Ridge best parameters: 25"

predictions_lasso <- lasso %>% predict(X_test)
predictions_ridge <- ridge %>% predict(X_test)
predictions_lin <- linear %>% predict(X_test)

data.frame(
  Ridge_R2 = R2(predictions_ridge, y_test),
  Lasso_R2 = R2(predictions_lasso, y_test),
  Linear_R2 = R2(predictions_lin, y_test)
)

##    Ridge_R2  Lasso_R2 Linear_R2
## 1 0.8584854 0.8813317 0.8813157

data.frame(
  Ridge_RMSE = RMSE(predictions_ridge, y_test),
  Lasso_RMSE = RMSE(predictions_lasso, y_test),
  Linear_RMSE = RMSE(predictions_lin, y_test)
)

##   Ridge_RMSE Lasso_RMSE Linear_RMSE
## 1   1505.114   1371.852    1505.114

print('Best estimator coefficients')
## [1] "Best estimator coefficients"

data.frame(
  ridge = as.data.frame.matrix(coef(ridge$finalModel, ridge$finalModel$lambdaOpt)),
  lasso = as.data.frame.matrix(coef(lasso$finalModel, lasso$finalModel$lambdaOpt)),
  linear = linear$finalModel$coefficients
) %>%
  rename(ridge = X1, lasso = X1.1)

##                  ridge       lasso      linear
## (Intercept) 3934.23903  3934.27808  3934.27526
## carat       2194.54696  5103.53988  5238.57024
## depth        -92.16124  -210.18689  -222.97958
## table       -156.89671  -207.17217  -210.64667
## x            709.68312 -1172.30125 -1348.94469
## y            430.43936     0.00000    23.65170
## z            423.90948     0.00000    22.01107
## clarity      465.56366   499.73923   500.13593
## cut           83.61722    74.53004    75.81861
## color       -317.53366  -448.40340  -453.92366

Our scores rise a little, but with these values of lambda there is not much difference. Let's build coefficient plots to see how the value of lambda influences the coefficients of both models. We will use the glmnet function to train the models and then the plot() function, which produces a coefficient profile plot of the coefficient paths for a fitted "glmnet" object. The xvar argument of plot() defines what is on the x-axis and can take three values: "norm", "lambda", or "dev", where "norm" is the L1-norm of the coefficients, "lambda" the log-lambda sequence, and "dev" the percent deviance explained. We'll set it to "lambda". To train glmnet directly, we need to convert our X_train object to a matrix.

# Set the lambda sequences
paramLasso <- seq(0, 1000, 10)
paramRidge <- seq(0, 1000, 10)

# Convert X_train to a matrix for use with the glmnet function
library(glmnet)
X_train_m <- as.matrix(X_train)

# Build Ridge and Lasso over the chosen lambda sequences
rridge <- glmnet(
  x = X_train_m,
  y = y_train,
  alpha = 0,          # Ridge
  lambda = paramRidge
)

llaso <- glmnet(
  x = X_train_m,
  y = y_train,
  alpha = 1,          # Lasso
  lambda = paramLasso
)

# Plot the coefficient paths against log(lambda)
plot(rridge, xvar = 'lambda')
plot(llaso, xvar = 'lambda')

## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis

As a result, you can see that when we raise lambda in Ridge regression, the magnitude of the coefficients decreases but never reaches zero. The same increase in Lasso has less influence on the large coefficients, but the small coefficients are reduced to zero.
We can conclude from the plot that the "carat" feature has the most impact on the target variable. Other important features are "clarity" and "color", while features like "z" and "cut" have barely any impact. Lasso can therefore also be used to determine which features are important to us and have strong predictive power. Ridge regression gives uniform penalties to all the features and in that way reduces model complexity and mitigates multicollinearity. Now, it's your turn!
Today, data science specialists are among the most sought-after professionals in the labor market. By finding significant insights in huge amounts of information, they help companies and organizations optimize the way they work. The field of data science is developing rapidly and the demand for talent is changing. We use job offerings to analyze the current demand for talent. After a first analysis in 2017, this article provides an update and more details on job openings, roles, required skills, locations and employers. We collected data over the last month from Indeed through their API. Indeed aggregates job openings from various websites and therefore provides a good starting point for an analysis.

Data science job offers in Switzerland: a first look

We collect job openings for the search queries Data Analyst, Data Scientist, Machine Learning and Big Data. At the time of writing, there were 458 vacancies in the field of data science listed on Indeed for Switzerland. Most likely, not all jobs are captured by Indeed. Nevertheless, the dataset offers clear and structured information that allows us to analyze the market.

To get a better idea of the different job types offered in the market, we analyze the required level of experience, the required skills, and the geographical distribution within the country. The bar plot shows the different job titles. Senior Data Scientist and Data Scientist are the most common job titles in this area. However, many jobs are listed with a job-specific title rather than a general one.

Based on the job title, we group jobs by the required experience level. The plots point to high demand for experienced talent: more than 60% of job openings require mid-level specialists, while around a quarter of the offers look for seniors. Interns and juniors make up less than 10% of the total number of vacancies.

Zurich is the Swiss data science center

In a second step, we analyze the regional distribution of jobs. Not surprisingly, Zurich offers the most positions, followed by Basel, Geneva (Genf), and Lausanne. The mix of seniority levels looks similar across cities: everywhere there is high demand for people with some years of on-the-job experience. The majority of job offers for interns and juniors are found in Zurich. Both senior and lead data scientists are in high demand in Zurich and Basel, followed by Geneva and Lausanne.

Strong demand for Python skills

Let's review the demand for specific skills, measured by the number of times each skill is mentioned. Python is clearly leading among the programming languages with more than 400 mentions, followed by R, Java, Scala, JavaScript, and C/C++. AWS, Spark, and Azure are the top three big data technologies, while Tableau and SAS head the visualization and analysis tools.

Academia, Tech, Pharma and Finance hire data skills

Our dataset captures about 260 companies in Switzerland offering jobs in the field of data science. Most of them belong to the IT, FMCG and pharmaceutical industries, as well as the financial and digital sectors. We extracted the top companies with the most job offerings; the size of the company name represents the number of job offerings. Google, Siemens, and EPAM offer many data science jobs in the IT and digital areas. Roche and Novartis are large pharmaceutical firms. Credit Suisse is a well-known financial group. There are also hiring agencies such as Myitjob or Oxygen Digital Recruitment.
Johnson & Johnson is a giant FMCG company. EPFL, the University of Basel, the University of Bern and other educational institutions are also among the top employers of data talent. myitjobs was captured in the analysis as well, but it is just a job website and was misclassified by Indeed.

Summing up our analysis, we see substantial demand for individuals with experience, while fewer opportunities exist to acquire that experience through entry-level positions or internships. Strong Python skills seem to be a must-have these days in the data science world, and Python's dominance has grown compared to two years ago, when R or Matlab were mentioned more often.