Mensur Dlakic, 3 hours ago


The answer depends on two considerations: 1) are your normalizations internally consistent within each technology, even if they are not consistent with each other? 2) are you doing classification or regression in your ML pipeline? I will assume classification.

When using tree-based techniques, differences in scale between columns (here, different technologies) are unimportant. These methods can use categorical and continuous data simultaneously, and those are by definition not on the same scale. As a general rule, split thresholds are chosen separately for each feature during learning, so scale is not a problem. Still, if you are doing classification, it may help to reduce the cardinality (number of unique values per feature/column) by discretizing the data (also called binning). There are many ways of doing this: uniform range width, uniform number of elements per range, etc. I like the minimum description length principle (MDLP), which is entropy-based and easy to understand, but it is slow. It is easy to find more information by Googling, and there are several implementations on GitHub.
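To make the two simplest binning schemes concrete, here is a minimal sketch in plain Python. The function names and the `expression` values are made up for illustration and do not come from any particular library or dataset. It covers only uniform range width and uniform number of elements per range, not MDLP.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Interior cut points that split [min, max] into n_bins equal-width ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Interior cut points chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def assign_bins(values, edges):
    """Map each value to a bin index; a value equal to a cut point goes to the bin on its right."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical, strongly skewed measurements for one feature (made-up numbers).
expression = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 50.0, 100.0, 150.0, 200.0]

width_bins = assign_bins(expression, equal_width_edges(expression, 4))
freq_bins = assign_bins(expression, equal_frequency_edges(expression, 4))

print("equal width:    ", width_bins)
print("equal frequency:", freq_bins)
```

On skewed data like this, equal-width binning puts the nine smallest values into a single bin, while equal-frequency binning puts three values into each of the four bins. Which one works better will depend on your data, so it is worth trying both before moving to a supervised method such as MDLP.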
