The poor rabbit chased by Python and Anaconda :p

Interview Preparation Questions

Some interview questions

What’s the difference between boosting and bagging?

Bagging

  • attempts to reduce the chance of overfitting complex models
  • bagging (short for bootstrap aggregating) creates subsets by sampling from the training data with replacement
  • trains a large number of “strong” learners in parallel, one per bootstrap sample
  • aims to reduce the overfitting of individual decision trees (e.g., CART)
  • uses complex base models and tries to “smooth out” their predictions by averaging or voting
  • however, plain bagging can still be dominated by a few strong features; random forests address this by also sampling features at each split

Boosting

  • attempts to improve the predictive flexibility of simple models
  • builds a strong learner from many weak learners trained in sequence
  • each new learner focuses on the mistakes of the trees learned so far
  • multiple variants, including AdaBoost, gradient boosting (e.g., XGBoost, LightGBM), etc.
  • uses simple base models and tries to “boost” their aggregate complexity (see the sketch after this list)
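
A minimal sketch of the contrast, using sklearn’s BaggingClassifier and GradientBoostingClassifier on a synthetic dataset; the hyperparameters are illustrative, not tuned, and the `estimator` argument assumes sklearn ≥ 1.2 (older versions call it `base_estimator`):

```python
# A contrast sketch: bagging trains strong learners in parallel on
# bootstrap samples; boosting trains weak learners in sequence.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Bagging: deep ("strong") trees, each fit on its own bootstrap sample;
# predictions are averaged to smooth out individual-tree variance.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=None),  # complex base model
    n_estimators=100,
    random_state=42,
)

# Boosting: shallow ("weak") trees added one at a time, each correcting
# the errors of the ensemble built so far.
boosting = GradientBoostingClassifier(
    max_depth=2,           # simple base model
    n_estimators=100,
    learning_rate=0.1,
    random_state=42,
)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```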

What is regularization and how can it be achieved?

  • regularization is introduced to reduce overfitting by constraining model complexity
  • it adds a penalty term on the model weights to the loss function: L1 (lasso) drives some coefficients exactly to zero, effectively reducing the number of features, while L2 (ridge) shrinks all coefficients toward zero (see the sketch below)
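
A minimal sketch of L1 vs. L2 regularization with sklearn’s Lasso and Ridge on synthetic data; the alpha values are arbitrary illustrations of penalty strength:

```python
# L1 vs. L2 penalties on a synthetic regression problem where only 5 of
# 20 features are informative; alpha sets the penalty strength.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2: shrinks all weights toward zero
lasso = Lasso(alpha=5.0).fit(X, y)    # L1: drives weak weights exactly to zero

for name, model in [("OLS", ols), ("ridge", ridge), ("lasso", lasso)]:
    print(f"{name}: {np.sum(model.coef_ != 0)} non-zero coefficients")
```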

What are precision and recall?

  • both are read from the confusion matrix
  • precision = TP / (TP + FP): the fraction of samples predicted positive that are truly positive
  • recall = TP / (TP + FN): the fraction of actual positives that the model finds (see the sketch below)
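
A minimal sketch with made-up label vectors, computing both metrics by hand from the confusion matrix and cross-checking against sklearn:

```python
# Precision and recall from a confusion matrix, on toy labels.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", tp / (tp + fp))   # manual: TP / (TP + FP)
print("recall:   ", tp / (tp + fn))   # manual: TP / (TP + FN)
print("sklearn:  ", precision_score(y_true, y_pred),
      recall_score(y_true, y_pred))
```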

What is the difference between exploratory data analysis and feature engineering?

  • EDA explores the data to understand its distributions, trends, and relationships before modeling
  • FE transforms or combines raw features to build more informative, reliable inputs for the model (see the sketch below)
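
A minimal sketch of the two activities on a hypothetical purchase table; the column names are assumptions for illustration:

```python
# EDA vs. FE on a tiny made-up table.
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 250.0, 40.0, 5.0],
    "quantity": [2, 1, 3, 10],
    "signup_date": pd.to_datetime(["2021-01-03", "2020-06-20",
                                   "2021-03-15", "2019-11-02"]),
})

# EDA: summarize and inspect the raw data (normally also histograms,
# scatter plots, correlation heatmaps, etc.).
print(df.describe())
print(df[["price", "quantity"]].corr())

# FE: derive new, hopefully more predictive features from raw columns.
df["total_spend"] = df["price"] * df["quantity"]
df["account_age_days"] = (pd.Timestamp("2021-06-01") - df["signup_date"]).dt.days
```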

How to Interpret P-values and Coefficients in Regression Analysis

  • coefficient: its sign tells you whether there is a positive or negative relationship between each independent variable and the dependent variable; its magnitude tells you how strong that relationship is
  • P-value: whether the relationship you observe in your sample is likely to exist in the larger population
  • the p-value for each independent variable tests the null hypothesis that the variable has no correlation with the dependent variable
  • a predictor with a low p-value is likely a meaningful addition to your model, because changes in the predictor’s value are related to changes in the response variable (see the sketch below)
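
A minimal sketch with statsmodels on synthetic data, where x1 truly drives y and x2 is pure noise, so the fitted summary should show a near-zero p-value for x1 and a large one for x2:

```python
# OLS on synthetic data: y depends on x1 but not x2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)              # unrelated to y by construction
y = 3.0 * x1 + rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2]))
result = sm.OLS(y, X).fit()

# The coefficient signs give the direction of each relationship; each
# p-value tests H0: "this coefficient is zero in the population".
print(result.summary())
```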

What steps would you take when trying to predict if someone is going to purchase a product?

Identify the problem:

  • a binary classification problem: will the person buy (yes) or not (no)

    Dataset and feature engineering:

  • historical purchase log of the same person
  • person profile: gender, education, position, location, etc.
  • product profile: category, historical sales, etc.
  • some feature engineering (e.g., recency, frequency, and spend aggregates)
  • pay attention to the time-series nature of the data: use only past data to predict the future, to avoid leakage

    Machine learning:

  • binary classification problem
  • simple problem with few categorical features: logistic regression with regularization
  • try naive Bayes as a quick baseline
  • many sparse features / a large dataset / many categorical features: LightGBM or XGBoost (see the sketch after this list)
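
A minimal sketch of the workflow above; the file name, column names, and time-based split are assumptions about the purchase log’s schema:

```python
# End-to-end sketch: time-aware split, preprocessing for numeric and
# categorical columns, and a regularized logistic regression.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("purchase_log.csv")   # hypothetical purchase log
df = df.sort_values("date")            # order by time: train on the past
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

num_cols = ["past_purchases", "days_since_last_visit"]   # assumed columns
cat_cols = ["gender", "location", "product_category"]    # assumed columns

model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),  # L2-regularized by default
])
model.fit(train[num_cols + cat_cols], train["purchased"])
preds = model.predict(test[num_cols + cat_cols])
print("F1:", f1_score(test["purchased"], preds))
```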

What would you do if you had an imbalanced dataset

  • an imbalanced dataset is one whose class distribution is highly skewed
  • for a binary classification problem, choose the F1 score instead of accuracy, and check the confusion matrix
  • if labels/features are skewed away from a normal distribution, transform them (e.g., take the log) before applying ML algorithms
  • common techniques (see the sketch after this list):
    • up-sample: randomly duplicate observations from the minority class to reinforce its signal
    • down-sample: randomly remove observations from the majority class to prevent its signal from dominating the learning algorithm
    • use penalized learning algorithms that increase the cost of classification mistakes on the minority class
    • use tree-based algorithms, which tend to cope better with imbalance
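
A minimal sketch of up-sampling with sklearn’s resample utility on a made-up 8:2 dataset; the penalized alternative is noted in a comment:

```python
# Up-sampling the minority class by duplicating its rows with replacement.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(10),
                   "label": [0] * 8 + [1] * 2})   # 8:2 imbalance

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced["label"].value_counts())

# Penalized alternative: keep the data as-is and weight mistakes on the
# minority class more heavily, e.g. LogisticRegression(class_weight="balanced").
```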

How to handle outliers?

  • do EDA and plot the data; check whether the outliers are plausible or clearly erroneous
  • if erroneous, drop or correct them
  • if plausible, try ways to handle them (see the sketch after this list):
    • robust scaling or binning, e.g., RobustScaler in sklearn
    • capping them or imputing with the mean/median, etc.
    • trusting them and using a robust machine learning algorithm
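
A minimal sketch of the first two options on made-up data: RobustScaler (median/IQR scaling) and percentile capping:

```python
# RobustScaler vs. percentile capping on a column with one huge outlier.
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[48.0], [50.0], [52.0], [49.0], [51.0], [500.0]])

# RobustScaler centers on the median and scales by the IQR, so the
# outlier no longer dominates the feature's scale.
print(RobustScaler().fit_transform(x).ravel())

# Capping (winsorizing): clip values beyond the 1st/99th percentiles.
lo, hi = np.percentile(x, [1, 99])
print(np.clip(x, lo, hi).ravel())
```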