Some interview questions
What’s the difference between boosting and bagging?
Bagging
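- attempts to reduce the chance of overfitting with complex models.
- Bagging, short for bootstrap aggregating, draws subsets from the training sample with replacement, trains one model per subset, and averages (or votes over) their predictions.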
- It trains a large number of “strong” learners in parallel
- aims to reduce the overfitting (high variance) of CART decision trees
- uses complex base models and tries to “smooth out” their predictions
- However, strong features tend to dominate the splits in every tree, so the trees stay correlated; random forests address this by also sampling features
Boosting
- attempts to improve the predictive flexibility of simple models
- builds a strong learner from a sequence of weak learners
- each new learner is trained to correct the errors of the learners already built
- several variants, including AdaBoost and gradient boosting
- uses simple base models and tries to “boost” their aggregate complexity
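The contrast can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: `fit_stump`, `bagging`, and `boosting` are hypothetical helper names, and to keep it short both ensembles use the same one-split regression stump, whereas in practice bagging usually wraps deeper trees. The boosting part is the squared-loss form of gradient boosting (each stump is fit to the current residuals).

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump; returns (threshold, left_mean, right_mean)."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so that the right-hand side is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x values identical: predict the mean everywhere
        m = sum(ys) / len(ys)
        return (xs[0], m, m)
    return best[1:]

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def bagging(xs, ys, n_models=25, seed=0):
    """Independent models on bootstrap samples; predictions are averaged."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(predict_stump(m, x) for m in models) / len(models)

def boosting(xs, ys, n_models=25, lr=0.5):
    """Sequential models; each stump is fit to the residuals left so far."""
    base = sum(ys) / len(ys)
    residuals = [y - base for y in ys]
    models = []
    for _ in range(n_models):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - lr * predict_stump(stump, x) for x, r in zip(xs, residuals)]
    return lambda x: base + lr * sum(predict_stump(m, x) for m in models)

xs = list(range(20))
ys = [x * x for x in xs]
bagged, boosted = bagging(xs, ys), boosting(xs, ys)
print(bagged(5), boosted(5))
```

The bagging loop could run in parallel because the models never see each other, while the boosting loop is inherently sequential because each stump depends on the residuals of the previous ones.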
What is regularization and how can it be achieved?
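- regularization constrains model complexity to reduce overfitting
- it is usually achieved by adding a penalty on the size of the coefficients to the loss: an L1 penalty (lasso) can shrink coefficients to exactly zero and so reduces the number of features used, while an L2 penalty (ridge) shrinks them toward zero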
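The effect of a penalty term is easiest to see in one dimension, where both common penalties have a closed form. The sketch below assumes a single feature, no intercept, and the objectives written in the comments; `ridge_1d` and `lasso_1d` are illustrative names, not library functions.

```python
def ridge_1d(xs, ys, lam):
    """Minimise 0.5 * sum((y - w*x)^2) + 0.5 * lam * w^2  (L2 penalty)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_1d(xs, ys, lam):
    """Minimise 0.5 * sum((y - w*x)^2) + lam * |w|  (L1 penalty)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Soft-thresholding: the coefficient is exactly zero once lam >= |sxy|.
    shrunk = max(abs(sxy) - lam, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx

xs, ys = [1, 2, 3], [2, 4, 6]      # true slope is 2
for lam in (0, 14, 28):
    print(lam, ridge_1d(xs, ys, lam), lasso_1d(xs, ys, lam))
```

With a growing `lam`, the ridge coefficient keeps shrinking but stays non-zero, while the lasso coefficient reaches exactly zero, which is why L1 penalties act as feature selection.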
What are precision and recall?
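- both are computed from the confusion matrix
- precision: TP / (TP + FP), the share of predicted positives that are actually positive
- recall: TP / (TP + FN), the share of actual positives that are predicted positive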
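Both metrics can be computed directly from the confusion-matrix counts. A minimal sketch in plain Python, assuming binary labels with `1` as the positive class (the function name is illustrative):

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when nothing is predicted/labelled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
print(precision_recall(y_true, y_pred))  # TP=2, FP=1, FN=2
```

Here precision is 2/3 (two of the three predicted positives are right) and recall is 1/2 (two of the four actual positives are found).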
What is the difference between exploratory data analysis and feature engineering?
- EDA explores the data (distributions, trends, relationships between variables) to understand it before modeling
- FE modifies or combines features to build more reliable features for the model.
How to interpret p-values and coefficients in regression analysis?
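- coefficient: its sign shows whether the relationship between each independent variable and the dependent variable is positive or negative, and its size is the expected change in the dependent variable for a one-unit change in that variable with the others held fixed
- p-value: indicates whether the relationships that you observe in your sample are likely to also exist in the larger population
- The p-value for each independent variable tests the null hypothesis that the variable has no relationship with the dependent variable (its coefficient is zero)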
- a predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor’s value are related to changes in the response variable.
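To make the null hypothesis concrete, here is a small sketch for a single predictor. Note the swap: regression software normally reports a p-value from a t-test, whereas this example uses a permutation test, chosen only because it needs nothing beyond the standard library. The function names and toy data are assumptions for illustration.

```python
import random

def slope(xs, ys):
    """Least-squares coefficient of y on x (simple regression with intercept)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Share of label shuffles whose |slope| is at least the observed |slope|.

    Shuffling y breaks any link with x, which simulates the null hypothesis
    that the predictor has no relationship with the response.
    """
    rng = random.Random(seed)
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(slope(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

xs = list(range(20))
related = [2 * x + 1 for x in xs]   # y depends on x
unrelated = [1, -1] * 10            # y alternates regardless of x
print(slope(xs, related), permutation_p_value(xs, related))
print(slope(xs, unrelated), permutation_p_value(xs, unrelated))
```

The first predictor gets a coefficient of about 2 with a very small p-value; the second gets a coefficient near zero with a large p-value, so there is no evidence that it belongs in the model.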
What steps would you take when trying to predict if someone is going to purchase a product?
Identify the problem:
- classification problem, yes or no.
Dataset and feature engineering:
- historical purchase log of the same person
- person profile: gender, education, position, location, etc.
- product profile: categories, historical sales, etc.
- feature engineering on top of the raw fields above
- pay attention to time ordering: use past data to predict newer data, so nothing from the future leaks into training
Machine learning:
- binary classification problem
- simple problem with few categorical features: logistic regression with regularization
- try naive Bayes as a quick baseline.
- many sparse features / large dataset / many categorical features: LightGBM or XGBoost
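The "use past data to predict new" point can be sketched as a time-based split, with features computed from the training window only. The purchase log, its field layout, and the helper names below are hypothetical, made up for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical purchase log: (user_id, purchase_date).
log = [
    ("u1", date(2023, 1, 5)),
    ("u1", date(2023, 2, 9)),
    ("u2", date(2023, 2, 20)),
    ("u1", date(2023, 3, 15)),
    ("u3", date(2023, 3, 18)),
]

def time_split(rows, cutoff):
    """Rows strictly before the cutoff go to train, the rest to test."""
    train = [r for r in rows if r[1] < cutoff]
    test = [r for r in rows if r[1] >= cutoff]
    return train, test

def past_purchase_counts(train_rows):
    """Feature: purchases per user, counted on the training window only."""
    return Counter(user for user, _ in train_rows)

train, test = time_split(log, cutoff=date(2023, 3, 1))
counts = past_purchase_counts(train)
print(len(train), len(test), counts["u1"], counts["u3"])
```

A random split would let March purchases inform features used to predict February ones; splitting on a cutoff date keeps the evaluation closer to how the model would be used.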
What would you do if you had an imbalanced dataset?
- an imbalanced dataset has a highly skewed class distribution: one class is far rarer than the other.
- for a binary classification problem, use the F1 score instead of accuracy, and check the confusion matrix
- if labels or features are strongly skewed relative to a normal distribution, transform them (e.g. log) before applying ML algorithms
- common techniques:
- up-sample: randomly duplicating observations from the minority class in order to reinforce its signal
- down-sample: randomly removing observations from the majority class to prevent its signal from dominating the learning algorithm
- use penalized learning algorithms that increase the cost of classification mistakes on the minority class
- use tree-based algorithms, which often handle imbalanced classes better
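Two of the techniques above, up-sampling and penalizing mistakes on the minority class, can be sketched with the standard library. The helper names are illustrative; the weight formula `n_samples / (n_classes * class_count)` is the heuristic that sklearn documents for `class_weight="balanced"`.

```python
import random
from collections import Counter

def upsample_minority(rows, labels, seed=0):
    """Duplicate random minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = rng.choices(pool, k=target - n)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

def balanced_class_weights(labels):
    """Per-class misclassification weights; rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

rows = list(range(10))
labels = [0] * 8 + [1] * 2
new_rows, new_labels = upsample_minority(rows, labels)
print(Counter(new_labels))
print(balanced_class_weights(labels))
```

Resampling should be applied to the training split only, after the train/test split, so that duplicated rows cannot appear on both sides of the evaluation.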
How to handle outliers?
- do EDA and plot the data to check whether the outliers are plausible values or obvious errors
- if they are obvious errors, drop or fix them.
- if they are plausible, try ways to handle them
- robust scaling or binning, e.g. RobustScaler in sklearn.
- replace them with the mean, median, etc.
- trust them and use a robust machine learning algorithm
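A common way to flag candidate outliers during EDA is the 1.5 × IQR rule, and robust scaling uses the same statistics (median and interquartile range, which is also what sklearn's RobustScaler uses by default). The sketch below is a standard-library approximation with illustrative function names; the exact quartile values depend on the interpolation method, so results can differ slightly from other libraries.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def robust_scale(values):
    """Centre on the median and divide by the IQR, so outliers barely move the scale."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    med = statistics.median(values)
    return [(v - med) / (q3 - q1) for v in values]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(iqr_outliers(data))
print(robust_scale(data))
```

Flagged values still need the manual check described above: the rule only says a point is unusual, not whether it is an error.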