Python/pandas エンコードされたone-hotデータをデコードする方法

f:id:kaeken:20180201203956p:plain

機械学習の学習用データでよく使われるone-hotエンコーディングされたデータがあります。

f:id:kaeken:20180201202817p:plain

one-hotエンコーディング処理は、さまざまなライブラリで実装されています。

sklearn.preprocessing.OneHotEncoder — scikit-learn 0.19.1 documentation http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

ただ、one-hotデコーディング、つまり、もとのデータに戻す処理が見つかりませんでした。

f:id:kaeken:20180201202951p:plain

そこで、以下のようにpandasのデータフレームを使って、one-hotデコーディングする処理を作成してみました。

def onehot_decoder(df):
    colname_list = []
    for index, row in df.iterrows():#各行を取得
        for k,v in enumerate(row):#各列の値を取得
            if int(v) == 1:#ワンホットになっている列のカラム名を取得
                colname_list.append(df.columns[k])
                break

    #デコード済みリストを新規列として追加
    df_add = pd.DataFrame(colname_list,columns=['decoded'])
    return df_add

res = onehot_decoder(df_sample)

もう少し簡潔に書けるとは思いますが、わかりやすさのため、このままにしておきます。

前処理方法は、データによって多数のパターンがあり、 scikit-learnのpreprocessingモジュールでさまざまなものが作られています。

API Reference — scikit-learn 0.19.1 documentation http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

preprocessing.Binarizer([threshold, copy])   Binarize data (set feature values to 0 or 1) according to a threshold
preprocessing.FunctionTransformer([func, …])  Constructs a transformer from an arbitrary callable.
preprocessing.Imputer([missing_values, …])    Imputation transformer for completing missing values.
preprocessing.KernelCenterer    Center a kernel matrix
preprocessing.LabelBinarizer([neg_label, …])  Binarize labels in a one-vs-all fashion
preprocessing.LabelEncoder  Encode labels with value between 0 and n_classes-1.
preprocessing.MultiLabelBinarizer([classes, …])   Transform between iterable of iterables and a multilabel format
preprocessing.MaxAbsScaler([copy])  Scale each feature by its maximum absolute value.
preprocessing.MinMaxScaler([feature_range, copy])   Transforms features by scaling each feature to a given range.
preprocessing.Normalizer([norm, copy])  Normalize samples individually to unit norm.
preprocessing.OneHotEncoder([n_values, …])    Encode categorical integer features using a one-hot aka one-of-K scheme.
preprocessing.PolynomialFeatures([degree, …]) Generate polynomial and interaction features.
preprocessing.QuantileTransformer([…])    Transform features using quantiles information.
preprocessing.RobustScaler([with_centering, …])   Scale features using statistics that are robust to outliers.
preprocessing.StandardScaler([copy, …])   Standardize features by removing the mean and scaling to unit variance
preprocessing.add_dummy_feature(X[, value]) Augment dataset with an additional dummy feature.
preprocessing.binarize(X[, threshold, copy])    Boolean thresholding of array-like or scipy.sparse matrix
preprocessing.label_binarize(y, classes[, …]) Binarize labels in a one-vs-all fashion
preprocessing.maxabs_scale(X[, axis, copy]) Scale each feature to the [-1, 1] range without breaking the sparsity.
preprocessing.minmax_scale(X[, …])    Transforms features by scaling each feature to a given range.
preprocessing.normalize(X[, norm, axis, …])   Scale input vectors individually to unit norm (vector length).
preprocessing.quantile_transform(X[, axis, …])    Transform features using quantiles information.
preprocessing.robust_scale(X[, axis, …])  Standardize a dataset along any axis
preprocessing.scale(X[, axis, with_mean, …])  Standardize a dataset along any axis

上記も後日紹介していきます。

kaeken(嘉永島健司)のTech探究ブログ

主に情報科学/情報技術全般に関する知見をポストします。（最近は、特にData Science、機械学習、深層学習、統計学、Python、数学、ビッグデータ）

Python/pandas エンコードされたone-hotデータをデコードする方法