Feature selection in unsupervised learning problems

Bechir Trabelsi
Jul 1, 2021 · 6 min read


Feature selection is a crucial part of any machine learning project: the wrong choice of features can lead to worse results, and many techniques and methods have been developed to find the optimal set of features.

In the case of supervised learning, this task is relatively easy thanks to pre-existing tools, such as the feature importances embedded in many scikit-learn models (e.g. random forests, linear regression, logistic regression, etc.). The existence of a directly computable loss function is also a big help, since we can always run an exhaustive search with a fixed model and compare the resulting loss scores.

Nonetheless, the real difficulty arises in unsupervised learning, where such techniques are far less intuitive. In this article, I'll showcase a technique I used in a recent project to perform feature selection in an unsupervised setting.

The data is confidential, so I will not be able to share it; nevertheless, I can share the techniques I used, as they are open source and available online.

The first thing one must get familiar with is PCA or Principal Component Analysis.

The principal components of a collection of points in real coordinate space are a sequence of n unit vectors, where the i’th vector is the direction of a line that best fits the data while being orthogonal to the first i-1 vectors. These directions form an orthonormal basis. Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data. (see https://en.wikipedia.org/wiki/Principal_component_analysis )

What we want to do is plot the explained variance of each component (see the article mentioned above):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA on the full feature set (df is the original feature DataFrame)
pca = PCA()
pca.fit(df)
df_pca_all = pca.transform(df)

# The explained variance of each component is given by the eigenvalues
# of the covariance matrix
eigenvalues = pca.explained_variance_
plt.bar(np.arange(0, df.shape[1], 1), eigenvalues)
plt.plot(eigenvalues, "r")
plt.plot(eigenvalues, "ro")
plt.show()
Explained Variance of each component

So for example, if this were our data, we could say that the explained variance is mostly carried by the first two components, and thus we can project our data onto these two axes, which form our new basis.

Let us project the data and see what happens.
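The plotting code for this projection isn't shown in the original snippet; here is a minimal sketch of how it can be done, assuming df is the same feature DataFrame as above (the name df_pca matches the variable used in the outlier-detection code below):

# Project the data onto the first two principal components and plot it
pca_2d = PCA(n_components=2)
df_pca = pca_2d.fit_transform(df)

plt.scatter(df_pca[:, 0], df_pca[:, 1], s=40)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()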

Scatter plot of the projected data on the first and second components

Now, we can obviously see some outlier points here and there. To deal with them, we will apply an Isolation Forest, which automatically detects anomalies, i.e. outliers.

from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
clf = IsolationForest(n_estimators=1000, random_state=rng)
# df_pca is the new projected data
clf.fit(df_pca)
IF_labels = clf.predict(df_pca)  # 1 for inliers, -1 for outliers
plt.scatter(df_pca[:, 0], df_pca[:, 1], c=IF_labels, s=40, cmap='viridis')
plt.show()
Detected outliers (in purple) using Isolation Forest
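The cleanup step itself isn't shown explicitly in the article's snippet; one simple way to drop the detected points, assuming the df_pca and IF_labels arrays from the code above (the name df_pca_clean is my own), is:

# predict() returns 1 for inliers and -1 for outliers,
# so keeping only the inliers leaves us with cleaned-up data
df_pca_clean = df_pca[IF_labels == 1]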

Now, with our data all cleaned up, we can discuss the elephant in the room: Feature selection in unsupervised learning.

Our main tool will be Principal Feature Analysis (PFA). To explain it, we'll borrow some paragraphs from this paper: http://www.ifp.illinois.edu/~qitian/e_paper/icip02/icip02.pdf

Let X be an n-dimensional random vector with zero mean, let C be the covariance matrix of X, and let O be a matrix whose columns are the orthonormal eigenvectors of C:

C = O Σ O^T

where Σ is the diagonal matrix of the eigenvalues of C, σ1 ≥ σ2 ≥ … ≥ σn. Let Oq denote the first q columns of O and let V1, V2, …, Vn ∈ R^q denote the rows of the matrix Oq.

The projection of the i'th feature of the vector X onto the lower-dimensional space is represented by the vector Vi; in particular, the weights of the i'th feature on each axis of the subspace correspond to the q elements of Vi. The crucial point is that features that are highly correlated or have high mutual information will have close absolute-value weight vectors |Vi|. At the two extremes, two independent variables have maximally separated weight vectors, while two fully correlated variables have identical weight vectors. [8]

Finding the best subset consists of using the structure of the Vi to:
1. identify the subsets of features that are highly correlated, and
2. choose one feature from each subset.
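To see this property in action, here is a small toy example (synthetic data, not the project data) showing that two strongly correlated features end up with nearly identical weight vectors, while an independent feature does not:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x1 = rng.normal(size=1000)
x2 = x1 + 0.01 * rng.normal(size=1000)   # x2 is almost perfectly correlated with x1
x3 = rng.normal(size=1000)               # x3 is independent of x1 and x2
X_toy = np.column_stack([x1, x2, x3])

pca_toy = PCA(n_components=2).fit(X_toy)
V = pca_toy.components_.T   # row i is the weight vector V_i of feature i
print(np.abs(V))            # |V_1| and |V_2| are nearly identical, |V_3| is clearly separated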

The algorithm can be summarized in the following five steps:
• Step 1: Compute the sample covariance matrix (or correlation matrix).
• Step 2: Compute the principal components and eigenvalues of the covariance (or correlation) matrix.
• Step 3: Choose the subspace dimension q and construct the matrix Oq from O. The ratio between the sum of the first q eigenvalues and the sum of all eigenvalues gives the retained data variability.
• Step 4: Cluster the vectors |V1|, |V2|, …, |Vn| ∈ R^q into p ≥ q clusters using the K-Means algorithm. Choosing p greater than q is usually necessary if the same variability as PCA is desired.
• Step 5: For each cluster, identify the vector Vi closest to the mean of the cluster and choose the corresponding feature, xi, as a principal feature. This yields the choice of p features.
The complexity of the algorithm is of the order of performing PCA, since the K-Means algorithm is applied to only n q-dimensional vectors.
In our case, we applied PFA in a loop with each model, but to do so we first need a metric to measure performance; our choice was the silhouette score.

The Silhouette Coefficient is calculated using the mean intra-cluster distance a and the mean nearest-cluster distance b for each sample.

The Silhouette Coefficient for a sample is (b − a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of.
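In practice, scikit-learn's silhouette_score averages the per-sample coefficients for us. Here is a quick toy illustration on synthetic blobs (not the project data):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated blobs should give a score close to 1
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels_demo = KMeans(n_clusters=3, random_state=42).fit_predict(X_demo)
print(silhouette_score(X_demo, labels_demo))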

Feature and model selection algorithm:

In this section we will showcase the algorithm we created to obtain the best model-feature combination.

Let S : R^p × R^(p×q) → R denote the silhouette score function, where p is the number of observations and q is the number of kept features.

Algorithm for PFA

Here is how all that translates into lines of code:

from collections import defaultdict
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

class PFA(object):
    def __init__(self, n_features, q=None):
        self.q = q
        self.n_features = n_features

    def fit(self, X):
        if not self.q:
            self.q = X.shape[1]
        X = np.array(X)
        sc = StandardScaler()
        # X = sc.fit_transform(X)
        # The covariance matrix calculation is embedded in PCA
        pca = PCA(n_components=self.q).fit(X)
        # Rows of A_q are the weight vectors V_i of each feature
        A_q = pca.components_.T
        # Cluster the weight vectors and keep the feature closest to each centroid
        kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_
        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances([A_q[i, :]], [cluster_centers[c, :]])[0][0]
            dists[c].append((i, dist))
        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]
# Run an exhaustive search to find the best features to keep.
# Helper: return the label whose count equals `val` (used to find the majority label).
from collections import Counter

def get_key(val, y):
    for key, value in Counter(y).items():
        if val == value:
            return key
    return "key doesn't exist"
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.cluster import AgglomerativeClustering, Birch, MiniBatchKMeans
from sklearn.metrics import silhouette_score

# n is the number of features you have
n = 50
for k in range(2, n, 1):
    print('########## K = ', k, ' #######')
    pfa = PFA(n_features=k)
    pfa.fit(df)
    # Get the matrix of selected features
    x = pfa.features_

    # Remove outliers with an Isolation Forest
    rng = np.random.RandomState(42)
    clf = IsolationForest(n_estimators=1000, random_state=rng)
    clf.fit(x)
    IF_labels = clf.predict(x)
    # Keep only the majority (inlier) label
    m = get_key(max(Counter(IF_labels).values()), IF_labels)
    x = pd.DataFrame(x)
    x['outliers'] = IF_labels
    x = x[x['outliers'] == m]
    x = x.drop(columns='outliers')
    x = np.array(x)

    # Project the cleaned, selected features with PCA
    pca = PCA()
    pca.fit(x)
    x = pca.transform(x)

    # Agglomerative clustering: fit model and predict clusters
    model_agg_clustering = AgglomerativeClustering(n_clusters=3)
    yhat_agg_clustering = model_agg_clustering.fit_predict(x)
    print('agglomerative')

    # BIRCH: fit the model and assign a cluster to each example
    model_birch = Birch(threshold=0.01, n_clusters=3)
    model_birch.fit(x)
    yhat_birch = model_birch.predict(x)

    # Mini-batch K-Means: fit the model and assign a cluster to each example
    model_mb_km = MiniBatchKMeans(n_clusters=3)
    model_mb_km.fit(x)
    print('Mini Batch Kmeans')
    yhat_mb_km = model_mb_km.predict(x)

    # Compare the three clusterings with the silhouette score
    label_list = [yhat_agg_clustering, yhat_birch, yhat_mb_km]
    models = ['AgglomerativeClustering', 'BIRCH', 'MiniBatch_KMeans']
    for idx, labels in enumerate(label_list):
        silhouette_avg = silhouette_score(x, labels)
        print("The average silhouette_score for", models[idx], " is :", silhouette_avg)

As you can see, we performed an exhaustive search to determine the best model-feature combination in terms of silhouette score.
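The loop above only prints the scores; if you want to pick the winner programmatically, one hypothetical addition (the results dictionary is my own, not part of the original code) is to record each score and take the argmax:

# Inside the loop, after computing silhouette_avg, record it with e.g.
#     results[(k, models[idx])] = silhouette_avg
results = {}
# ... run the loop above so that `results` gets filled ...
if results:
    best_k, best_model = max(results, key=results.get)
    print("Best combination:", best_model, "with", best_k, "features,",
          "silhouette =", results[(best_k, best_model)])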

I hope you found this short article useful for your projects. I take no credit for the algorithms mentioned above; they exist thanks to the hard work of the scientists who invented them.
