During the data cleaning stage of a data science project, you may run into this situation:

A single column holds multiple features, each feature has multiple values, and everything is stacked together in that one column.

Processing this kind of data can be a challenge, and it is worth spending some time on.

Here I show an example of such data and how to process it.

Data

`with pd.option_context('display.max_colwidth', 220): crowd.iloc[0:1000:200,].to_frame()`

	Crowd
0	area:7572
200	age:217,601,79,202,837,942,638,394,347,731,739,393,829,366,400,844,787,229,333,361,741,608,479,433,1,728,753,522,988,819,340
400	age:217,601,79,202,837,942,638,394,347,731,739,393,829,366,400,844,787,229,333,361,741,608,479,433,1,728,753,522,988,819,340
600	age:217,601,202,837,942,638,287,5,394,347,731,739,393,366,400,844,787,517,229,333,361,741,608,714,479,433,1,728,753,522,988,819,340,972|area:12780|status:13,9
800	area:1741

This is a column called “Crowd” from a dataset. It describes the features of the targeted users. Different features are separated by ‘|’, and the numbers after ‘feature_name:’ are the values that feature may take.

For example, for the first record, the crowd is users whose area is 7572.

For the record at index 600, the crowd is users whose age is 217, or 601, or…, whose area is 12780, and whose status is either 13 or 9.
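To make the format concrete, here is a minimal sketch that parses one record into a dict using only plain string methods. The helper name `parse_crowd` and the shortened example record are made up for illustration; they are not part of the original dataset or code.

```python
def parse_crowd(record):
    """Turn 'name:v1,v2|other:v3' into {'name': ['v1', 'v2'], 'other': ['v3']}."""
    parsed = {}
    for field in record.split("|"):          # features are separated by '|'
        name, _, values = field.partition(":")  # 'age:217,601' -> 'age', '217,601'
        parsed[name] = values.split(",")     # values are separated by ','
    return parsed


# Shortened, illustrative record in the same format as the "Crowd" column.
example = "age:217,601|area:12780|status:13,9"
print(parse_crowd(example))
```

Each feature name maps to the list of values it may take, which is the structure the rest of the post encodes.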

Next I will show how to encode this feature.

How many features in total?

First, we need to know how many features are used to describe the “crowd”. We can get this by:

- getting the records that contain multiple features
- for each record, finding the feature names and extending a list with them
- using ‘set(list)’ to reduce the list to the unique feature names

Code below:

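import re

import numpy as np
import pandas as pd

# keep only the records that contain more than one feature
multifield = crowd[crowd.str.contains(r'\|')]
matches = []
for line in multifield:
    # collect every 'feature_name:' token in the record
    matches.extend(re.findall(r'\w+:', line))
# de-duplicate once, after the loop, and strip the trailing ':'
crowd_attr = [name[:-1] for name in set(matches)]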

And we get:

['education', 'consuptionAbility', 'work', 'os', 'age', 'area', 'status', 'gender', 'behavior', 'connectionType']

Great! Now we know this column contains one or more features from this list.
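The same list can probably be obtained without an explicit loop. The sketch below is an alternative, not the method used above: it scans every record (not only the multi-feature ones) with `Series.str.findall` and flattens the result with `Series.explode`. The small `crowd` Series and the name `crowd_attr_alt` are stand-ins invented for the example.

```python
import pandas as pd

# Illustrative stand-in for the real "Crowd" column.
crowd = pd.Series([
    "area:7572",
    "age:217,601,79",
    "age:217,601|area:12780|status:13,9",
])

# findall returns one list of feature names per record;
# explode flattens those lists into a single Series of names.
names = crowd.str.findall(r"(\w+):").explode().dropna().unique()

# Sorting gives a stable order, which a plain set does not guarantee.
crowd_attr_alt = sorted(names)
print(crowd_attr_alt)
```

Scanning all records also picks up any feature that only ever appears alone in a record.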

Next, we split this column into 10 columns, one for each feature.

Split the column into feature columns

To split the column, we need to:

- iterate over each row of the data
- for each row, iterate over each feature name:
  - if the feature name has values in that row, extract them and add them to a dict
- convert the list of dicts to a dataframe

# expand the crowd to multiple columns
rows_list = []
for row in crowd:
    dict1 = {}
    for attr in crowd_attr:
        # capture the comma-separated numbers that follow 'attr:';
        # (?:,[0-9]+)* is a repeated group, not a character class
        attrpat = r'\b' + re.escape(attr) + r':([0-9]+(?:,[0-9]+)*)'
        match = re.search(attrpat, row)
        # NaN marks a feature that is absent from this record
        dict1[attr] = match.group(1) if match else np.nan
    rows_list.append(dict1)
df_crowd = pd.DataFrame(rows_list)

The result will look like this.

	age	area	behavior	connectionType	consuptionAbility	education	gender	os	status	work
0	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
12	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
13	NaN	6410	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14	NaN	6410	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
15	NaN	12045	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
16	NaN	12045	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
17	NaN	6833	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
18	NaN	12045	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
19	NaN	12045	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
20	217,601,79,202,837,942,638,394,347,731,739,393...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
21	217,601,79,202,837,942,638,394,347,731,739,393...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
22	217,601,79,202,837,942,638,394,347,731,739,393...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
23	217,601,79,202,837,942,638,394,347,731,739,393...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
24	217,601,79,202,837,942,638,394,347,731,739,393...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
25	217,601,79,202,837,942,638,394,347,731,739,393...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
26	217,601,79,202,837,942,638,394,347,731,739,393...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
27	217,601,79,202,837,942,638,394,347,731,739,393...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
28	217,601,79,202,837,942,638,394,347,731,739,393...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
29	217,601,79,202,837,942,638,394,347,731,739,393...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...
70	217,601,79,202,837,942,638,394,347,731,739,393...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
71	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
72	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
73	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
74	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
75	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
76	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
77	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
78	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
79	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
80	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
81	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
82	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
83	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
84	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
85	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
86	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
87	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
88	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
89	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
90	217,601,202,837,942,638,287,5,394,347,731,739,...	2880	NaN	NaN	NaN	NaN	NaN	NaN	13,9	NaN
91	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
92	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
93	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
94	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
95	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
96	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
97	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
98	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
99	NaN	7572	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

100 rows × 10 columns
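For larger data, the row-by-row loop could be replaced with a vectorized version. The following is a sketch under assumptions, not the code that produced the table above: it uses `Series.str.extractall` with named groups and then reshapes the matches into one column per feature. The small `crowd` Series, the group names `name` and `vals`, and the variable `wide` are invented for the example. It also assumes a feature appears at most once per record; a repeated feature would make `unstack` raise.

```python
import pandas as pd

# Illustrative stand-in for the real "Crowd" column.
crowd = pd.Series([
    "area:7572",
    "age:217,601,79",
    "age:217,601|area:12780|status:13,9",
])

# One row per (record, feature) match, with the feature name and its values.
pattern = r"(?P<name>\w+):(?P<vals>[0-9]+(?:,[0-9]+)*)"
pairs = crowd.str.extractall(pattern).droplevel("match")

# Move the feature name into the index, then pivot it out into columns.
wide = (
    pairs.set_index("name", append=True)["vals"]
         .unstack("name")
         .reindex(crowd.index)  # keep records that had no match as all-NaN rows
)
print(wide)
```

Features missing from a record come out as NaN, matching the layout of the table above.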

One-hot encode each feature column

The challenge here is that, unlike a ready-to-use Kaggle dataset, each feature column may contain multiple values separated by commas. In this case we need `str.get_dummies(sep=',')`.
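A tiny example shows what `str.get_dummies` does with comma-separated values. The three-row `age` Series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Two records with values and one record where the feature is absent.
age = pd.Series(["217,601", "601", np.nan], name="age")

# One indicator column per distinct value found across all rows.
dummies = age.str.get_dummies(sep=",")
print(dummies)
```

The first row is flagged in both the `217` and `601` columns, the second only in `601`, and the NaN row gets zeros everywhere.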

Here we take the “age” feature as an example.

def onehot_encode_features(df_crowd):
    encoded = []
    for attr in df_crowd.columns:
        # one 0/1 column per distinct value; NaN rows become all zeros
        p = df_crowd[attr].str.get_dummies(sep=",")
        # prefix with the feature name (not a hard-coded 'age') so names stay unique
        p.columns = [attr + value for value in p.columns]
        encoded.append(p)
    # concatenate once instead of growing a dataframe inside the loop
    return pd.concat(encoded, axis=1)

df_crowd_encode = onehot_encode_features(df_crowd)

The result is:

`df_crowd_encode.iloc[0:100:20]`

	age1	...
0	0	...
20	1	...
40	1	...
60	0	...
80	0	...

5 rows × 995 columns

Sweet, these multi-value columns have been successfully converted to one-hot features.

Next: sparse features?

Most of the converted features are sparse. Whether and how to use them can be tricky; probably material for another post?
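As one possible starting point (an assumption on my part, not something evaluated on this dataset), pandas can store mostly-zero indicator columns with a sparse dtype, and the `sparse` accessor reports how dense they really are. The small `dense` frame below is invented for illustration.

```python
import pandas as pd

# Made-up one-hot columns: mostly zeros, like the encoded features.
dense = pd.DataFrame({
    "age1": [0, 1, 1, 0, 0],
    "age5": [0, 0, 0, 0, 1],
    "area7572": [1, 0, 0, 0, 0],
})

# Store only the non-zero entries; zeros become the implicit fill value.
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))

# Fraction of entries that are actually stored (non-fill values).
print(sparse.sparse.density)
```

A low density suggests that a sparse representation, or dropping very rare values, may be worth considering before modeling.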
