Abstract
Data depth is a statistical method which models data distribution in terms of center-outward ranking rather than density or linear ranking. While it has attracted a lot of academic interest, its applications are hampered by the lack of a method that is both robust and efficient. This paper introduces halfspace mass, a significantly improved version of halfspace depth. Halfspace mass is, as far as we know, the only data depth method which is both robust and efficient. We also reveal four theoretical properties of halfspace mass: (i) its resultant mass distribution is concave regardless of the underlying density distribution, (ii) it has a unique maximum point, which can be regarded as the median, (iii) the median is maximally robust, and (iv) its estimation extends to a higher-dimensional space in which the convex hull of the dataset occupies zero volume. We demonstrate the power of halfspace mass through its applications in two tasks. In anomaly detection, being a maximally robust location estimator leads directly to a robust anomaly detector that yields better detection accuracy than halfspace depth; and it runs orders of magnitude faster than \(L_2\) depth, an existing maximally robust location estimator. In clustering, the halfspace mass version of K-means overcomes three weaknesses of K-means.
Introduction
“Most important for the selection of a depth statistic in applications are the questions of computability and, depending on the data situation, robustness.” (Karl Mosler 2013)
Data depth (Liu et al. 1999) is a statistical method which models data distribution in terms of center-outward ranking rather than density or linear ranking. In 1975, Tukey (1975) proposed a way to define the multivariate median in a data cloud, known as halfspace depth or Tukey depth. Since then it has been extensively studied. Donoho and Gasko (1992) have revealed the breakdown point of the Tukey median; Zuo and Serfling (2000) have compared it to various competitors, and Dutta et al. (2011) have investigated the properties of halfspace depth. Meanwhile, the concept of data depth has been adopted for multivariate statistical analysis since it provides a nonparametric approach that does not rely on the assumption of normality (Liu et al. 1999).
Despite its popularity, the following characteristics of halfspace depth have hampered its applications. As demonstrated by a simple example in Fig. 1, the “deepest point”, or halfspace median, is not guaranteed to be unique. A set of discrete data points has a layered depth distribution, which is not concave. Moreover, halfspace depth is not a maximally robust depth method, i.e., its distribution is easily disturbed by outliers. While a maximally robust method exists, e.g., \(L_2\) depth (Mosler 2013), it is computationally expensive. As far as we know, no current data depth method is both computationally efficient and robust.
We introduce halfspace mass, a significantly improved version of halfspace depth, which is both efficient and maximally robust. We reveal four theoretical properties of halfspace mass:

(i)
It is concave in a user-defined region that covers the source density distribution or the data cloud. An example is shown in Fig. 1.

(ii)
It has a unique maximum point, which can be regarded as a multidimensional median.

(iii)
Its median, which has a breakdown point equal to \(\frac{1}{2}\), is maximally robust.

(iv)
It extends the information carried in a dataset to a higher-dimensional space in which the dataset has a zero-volume convex hull.
The key contributions of this paper are the formal definition of halfspace mass and the uncovering of its theoretical properties, backed up with proofs. To demonstrate its applicability to real-life problems, halfspace mass is applied to two tasks: anomaly detection and clustering. We provide a comparison with two existing data depth methods: halfspace depth (Tukey 1975) and \(L_2\) depth (Mosler 2013). Based on halfspace mass, we create a clustering algorithm reminiscent of the K-means algorithm (Jain 2010).
Our empirical evaluations show that halfspace mass has the following advantages compared to its contenders:

Its maximal robustness leads directly to better performance in anomaly detection than halfspace depth.

Compared to the existing maximally robust \(L_2\) depth, it runs orders of magnitude faster.

Compared to the distance-based K-means clustering method, the halfspace-mass-based version overcomes three weaknesses of K-means (Tan et al. 2014), enabling it to find clusters of varying densities and sizes, as well as clusters in the presence of noise.
The rest of the paper is organized as follows. Section 2 introduces the formal definitions of halfspace mass as well as the proposed implementation. Sections 3 and 4 provide its theoretical properties and proofs, respectively. Section 5 discusses the relationship between halfspace mass and other data depth methods. Section 6 describes applications of halfspace mass in anomaly detection and clustering. Section 7 reports the empirical evaluations. Section 8 discusses its relation to mass estimation and Sect. 9 concludes the paper.
Halfspace mass
Definitions
The proposed halfspace mass is formally defined in this section. The key notations are provided in Table 1.
Let \(F(\mathbf{x})\) be a probability density on \(\mathbf{x} \in \mathfrak {R}^d\), \(d \ge 1\); \(R \subset \mathfrak {R}^d\) be a convex and closed region covering the domain of F; and H be a closed halfspace formed by separating \(\mathfrak {R}^d\) with a hyperplane that intersects R. Note that the probability mass of H computed with respect to F is \(0 \le P_F(H)=P_F(H \cap R) \le 1\).
Definition 1
Halfspace mass (HM) of a point \(\mathbf{x} \in \mathfrak {R}^d\) with respect to F is defined as:
$$HM(\mathbf{x} \mid F) := E_{{\mathbb {H}}(\mathbf{x})}\left[ P_F(H) \right] ,$$
where \({{\mathcal {H}}}(\mathbf{x}) := \{H:\mathbf{x} \in H\}\) is the set of all closed halfspaces H which contain the query point \(\mathbf{x}\), and \({{\mathbb {H}}(\mathbf{x})} \subset {{\mathcal {H}}(\mathbf{x})}\).
The definition of halfspace mass can be conceptualized as the expectation of the probability mass of a randomly selected halfspace H, defined for R and containing the query point \(\mathbf{x}\), given that every halfspace is equally likely. This definition bears a certain similarity to that of halfspace depth (Tukey 1975). While halfspace depth takes the minimum probability mass of a random halfspace containing the query point \(\mathbf{x}\) as the depth value (see its definition in Table 2 in Sect. 5), halfspace mass takes its expectation. This key difference gives halfspace mass more desirable properties, which will be discussed in Sects. 3 and 4.
In practice, an i.i.d. sample D is usually given instead of the source density distribution F. The sample version of \(HM(\mathbf{x} \mid F)\) is obtained by replacing F with D as follows.
Definition 2
Halfspace mass (HM) of a point \(\mathbf{x} \in \mathfrak {R}^d\) with respect to a given dataset D is defined as:
$$HM(\mathbf{x} \mid D) := E_{{\mathbb {H}}(\mathbf{x})}\left[ P_D(H) \right] ,$$
where \(P_D(H)\) is the empirical probability measure of H with respect to D, i.e., the proportion of data points in D that lie in H. Note that \(0 \le P_D(H) \le 1\).
\(HM(\mathbf{x} \mid D)\) can be estimated by sampling t halfspaces from \({\mathcal {H}}(\mathbf{x})\) for each query point \(\mathbf{x}\). By selecting \({\mathbb {H}}(\mathbf{x}) \subset {\mathcal {H}}(\mathbf{x})\) of size \(|{\mathbb {H}}(\mathbf{x})| = t\), this estimator is defined as:
$$\widehat{HM}(\mathbf{x} \mid D) = \frac{1}{t} \sum _{i=1}^{t} P_D(H_i), \quad \quad (1)$$
where \(H_i\) are elements of \({\mathbb {H}}(\mathbf{x})\).
We also propose a computation-friendly version to estimate \(HM(\mathbf{x} \mid D)\). Instead of using the whole dataset D to calculate \(P_D(H_i)\) in (1), a small subsample \({{\mathcal {D}}}_i \subset D\) of size \(|{{\mathcal {D}}}_i| = \psi \ll |D|\) is randomly selected from D without replacement for \(i = 1,\ldots ,t\). Let \(R_i\) be a convex region covering \({{\mathcal {D}}}_i\), and \(H_i(\mathbf{x})\) be a randomly selected halfspace containing \(\mathbf{x}\) and intersecting \(R_i\), for \(i = 1,\ldots ,t\).
Definition 3
A computation-friendly estimator for \(HM(\mathbf{x} \mid D)\) is defined as:
$$\widetilde{HM}(\mathbf{x} \mid D) = \frac{1}{t} \sum _{i=1}^{t} P_{{{\mathcal {D}}}_i}(H_i(\mathbf{x})) = \frac{1}{t} \sum _{i=1}^{t} \frac{1}{\psi } \sum _{j} I(\mathbf{X}_j \in H_i(\mathbf{x})),$$
where \(I(\cdot )\) is an indicator function and \(\mathbf{X}_j\) is a point in \({{\mathcal {D}}}_i\).
Implementation
In general, halfspace mass is a concave function in R, as will be shown in Sects. 3 and 4; therefore it provides a distinct center-outward ordering in the region R, while concavity outside of R is not guaranteed.
When concavity needs to be guaranteed in a region larger than the convex hull of D, a larger R is desirable. To this end, we propose a projection-based algorithm to estimate \(HM(\mathbf{x} \mid D)\) in which the region R or \(R_i\) is determined by a size parameter \(\lambda \): the ratio of the diameters of R and the convex hull of D along every direction. The value of \(\lambda \) must be at least 1. When \(\lambda = 1\), R or \(R_i\) is the convex hull of D or \({{\mathcal {D}}}_i\); the larger \(\lambda \) is, the further R or \(R_i\) expands beyond the convex hull of D or \({{\mathcal {D}}}_i\).
Algorithm 1 is the training procedure of \(\widetilde{HM}(\cdot \mid D)\). The halfspaces are implemented as follows: a random subsample \(\mathcal{D}_i\) is projected onto a random direction \(\ell \) in \(\mathfrak {R}^d\), t times. For each projection, a split point s is randomly selected within a range adjusted by \(\lambda \); then the numbers of points that fall on either side of s are recorded.
Algorithm 2 is the testing procedure once \(\widetilde{HM}(\mathbf{x})\) is ready. Given a query point \(\mathbf{x}\), it is projected onto each of the t directions, and the proportion of training points that fall on the same side as \(\mathbf{x}\) is averaged and output as the estimated halfspace mass for \(\mathbf{x}\).
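The two procedures can be sketched as follows in Python. This is our own minimal illustration, not the paper's reference implementation: the function names are ours, and the split range is widened by \(\lambda \) around the midpoint of the projected range, which is one reading of the description above.

```python
import random

def train_hm(D, t=1000, psi=None, lam=1.0, seed=0):
    """Algorithm 1 (sketch): for each of t trials, project a random
    subsample onto a random direction, pick a random split point, and
    record the fraction of subsample points below the split."""
    rng = random.Random(seed)
    d = len(D[0])
    psi = psi or len(D)
    model = []
    for _ in range(t):
        direction = [rng.gauss(0, 1) for _ in range(d)]   # random direction in R^d
        sample = rng.sample(D, min(psi, len(D)))
        proj = [sum(a * b for a, b in zip(x, direction)) for x in sample]
        lo, hi = min(proj), max(proj)
        mid, half = (lo + hi) / 2, lam * (hi - lo) / 2
        s = rng.uniform(mid - half, mid + half)            # split point within R_i
        m_left = sum(p < s for p in proj) / len(proj)
        model.append((direction, s, m_left))
    return model

def score_hm(model, x):
    """Algorithm 2 (sketch): average, over all t halfspaces, the mass
    on the side of the split that contains x."""
    total = 0.0
    for direction, s, m_left in model:
        p = sum(a * b for a, b in zip(x, direction))
        total += m_left if p < s else 1.0 - m_left
    return total / len(model)
```

As expected from concavity, a query point near the center of the data cloud receives a higher score than one outside the convex hull.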
Parameter setting
Here we provide a general guide for setting the parameters. The parameter t affects the accuracy of the estimation: the larger t is, the more accurate the estimation. For high-dimensional datasets, or datasets which are significantly more elongated in some directions than others, t should be set to a large value in order to gather sufficient information from all directions.
When the computation-friendly version \(\widetilde{HM}(\mathbf{x} \mid D)\) is used, it is worth pointing out that \(R_i\) can be significantly smaller than R, especially when the subsample size \(\psi \ll |D|\). Thus a small \(\psi \) produces a more concentrated distribution than a large \(\psi \), as shown in Fig. 2. This is a case where \(\lambda > 1\) could be used for some applications. Another effect of a small \(\psi \) value when \(\lambda = 1\) is that it limits the range of \(\widetilde{HM}(\mathbf{x} \mid D)\) values. Note that by Definition 3, when \(\lambda = 1\), \(\frac{1}{\psi } \le P_{{{\mathcal {D}}}_i}(H_i(\mathbf{x})) \le \frac{\psi - 1}{\psi }\), thus \(\frac{1}{\psi } \le \widetilde{HM}(\mathbf{x} \mid D) \le \frac{\psi - 1}{\psi }\).
For the rest of this paper, we use Algorithms 1 and 2 to estimate halfspace mass. The parameter \(\lambda \) is set to 1 by default unless mentioned otherwise.
Properties of halfspace mass
We list four theoretical properties of halfspace mass in this section: concavity in region R, a unique median, a median breakdown point equal to \(\frac{1}{2}\), and extension across dimensions. Proofs of the lemma and theorems stated in this section can be found in Sect. 4.
Concavity
Lemma 1
\(HM(x \mid F)\) under Definition 1 is a concave function for any finite F and any finite R in the univariate real space \(\mathfrak {R}\).
Using this lemma, we can obtain the following theorem on the concavity of the multidimensional halfspace mass distribution.
Theorem 1
\(HM(\mathbf{x} \mid F)\) under Definition 1 is a concave function for any finite F in any finite, convex and closed \(R \subset \mathfrak {R}^d\).
Similarly, \(HM(\mathbf{x} \mid D)\) is also concave in the convex region R covering D.
Unique median
Based on Theorem 1, a unique location in R which has the maximum halfspace mass value is guaranteed, as stated in the following theorem:
Theorem 2
The “center” of a given density F based on halfspace mass, \(\mathbf{x}^* := \mathop {{{\mathrm{arg\,max}}}}\nolimits _{\mathbf{x}} HM(\mathbf{x} \mid F)\), is a unique location in R, given that F covers more than a straight line in \(\mathfrak {R}^d\).
Breakdown point
For a given dataset D of size n and a location estimator T, the breakdown point \(\epsilon (T, D)\) is defined, as in Donoho and Gasko (1992), as the minimum proportion of strategically chosen contaminating points required to render the estimated location arbitrarily far away from the original estimation:
$$\epsilon (T, D) := \min _{m} \left\{ \frac{m}{n+m} : \sup _{Q^{(m)}} \left\| T(D \cup Q^{(m)}) - T(D) \right\| = \infty \right\} ,$$
where \(Q^{(m)}\) is a set of contaminating data points of size m.
We define a location estimator based on halfspace mass as follows: \(T(D) := \mathop {{{\mathrm{arg\,max}}}}\nolimits _{\mathbf{x}} HM(\mathbf{x} \mid D)\). It is a maximally robust estimator with properties given in the following theorem:
Theorem 3
The breakdown point of T satisfies \(\epsilon (T, D) > \frac{n-1}{2n-1} \rightarrow \frac{1}{2}\) as \(n \rightarrow \infty \).
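The theorem can be illustrated numerically in one dimension. The sketch below is our own construction, not from the paper: 10 genuine points are contaminated with 9 strategically placed outliers, and the maximiser of the univariate halfspace mass stays near the genuine data while the sample mean, whose breakdown point is 0, is dragged far away.

```python
import random

def hm_1d(x, data, lo, hi, n_splits=1000, seed=0):
    """Univariate halfspace mass of query x over region R = [lo, hi]
    (lambda = 1): average, over random split points s, the fraction of
    the data on x's side of s."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_splits):
        s = rng.uniform(lo, hi)
        if x < s:
            total += sum(v < s for v in data) / len(data)
        else:
            total += sum(v >= s for v in data) / len(data)
    return total / n_splits

# 10 genuine points near 0, contaminated by 9 outliers placed at 1000
D = [i / 10 for i in range(10)]
Q = [1000.0] * 9
full = D + Q
lo, hi = min(full), max(full)
grid = [lo + (hi - lo) * i / 200 for i in range(201)]
hm_median = max(grid, key=lambda g: hm_1d(g, full, lo, hi))  # stays near 0
mean = sum(full) / len(full)                                  # dragged to ~474
```

Because the same splits (same seed) are reused for every grid point, the comparison across the grid is deterministic.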
Extension across dimension
Dutta et al. (2011) reveal that, for a dataset of size n in a \(d > n\) dimensional space, since the d-dimensional volume of the convex hull of the dataset is zero, halfspace depth behaves anomalously, having zero measure almost everywhere in \(\mathfrak {R}^d\). In such cases, halfspace depth does not carry any useful statistical information.
In contrast, the definition of halfspace mass enables it not only to rank locations outside the convex hull of the training dataset in the lower-dimensional space where this convex hull has positive volume, but also to extend the ranking of locations to a higher-dimensional space where the convex hull has zero volume.
As demonstrated in Fig. 3, the training data points are located on a straight line, so the volume of their convex hull in \(\mathfrak {R}^2\) is zero. This renders halfspace depth zero almost everywhere, unless the query point lies on the line segment. In contrast, halfspace mass is able to rank almost every location in \(\mathfrak {R}^2\) by its closeness to the center of the dataset. This ability of halfspace mass to extend the information carried in a dataset to a higher-dimensional space could be very useful in high-dimensional problems, especially when the sample size is limited.
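This behaviour can be reproduced with a projection-based mass estimator. In the sketch below (our own illustration, with our own function names), all training points lie on a line in \(\mathfrak {R}^2\), so their convex hull has zero volume; yet an off-line query point near the line's centre still receives a higher mass than a far-away one.

```python
import random

def train_mass(D, t=2000, seed=0):
    """Projection-based halfspace-mass estimator (lambda = 1):
    random directions with random split points over the projected range."""
    rng = random.Random(seed)
    model = []
    for _ in range(t):
        u = [rng.gauss(0, 1), rng.gauss(0, 1)]
        proj = [x[0] * u[0] + x[1] * u[1] for x in D]
        lo, hi = min(proj), max(proj)
        if hi == lo:                     # degenerate direction: skip it
            continue
        s = rng.uniform(lo, hi)
        model.append((u, s, sum(p < s for p in proj) / len(proj)))
    return model

def mass(model, x):
    """Average, over all stored halfspaces, the mass on x's side."""
    if not model:
        return 0.0
    total = 0.0
    for u, s, m_left in model:
        p = x[0] * u[0] + x[1] * u[1]
        total += m_left if p < s else 1.0 - m_left
    return total / len(model)

# all training points lie on a straight line: zero-volume convex hull in R^2
line = [(i / 10, i / 10) for i in range(11)]
m = train_mass(line, t=2000, seed=2)
```

Halfspace depth, by contrast, would assign zero to both query points below, since neither lies on the line segment.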
Proofs
This section provides the proofs for the lemma and theorems given in the last section. The proofs for Lemma 1, Theorems 1, 2 and 3 are presented in the following four subsections.
Proof of Lemma 1
Given \(R=[r_l, r_u]\), \({{\mathcal {H}}}(x)\) is the set of all halfspaces containing x formed by splitting \(\mathfrak {R}\) at any point \(s \in R\). Then, \(HM(x \mid F)\) is represented as follows.
where \(\varDelta s = (r_u-r_l)/|{{\mathbb {H}}(x)}|\); m and \(m_x \) denote \(|{{\mathbb {H}}(x)}|\) and the number of \(H \in {{\mathbb {H}}(x)} \) whose splitting point s is \({<}\,x\), respectively. Since \(HM(x \mid F)\) is a doubly integrated function of the finite F(x), it is twice differentiable.
where \(C_R = \int _{r_l}^s F(y)dy + \int _s^{r_u} F(y)dy\). Since the second derivative of \(HM(x \mid F)\) is non-positive, \(HM(x \mid F)\) is concave.
Proof of Theorem 1
Let \({{\mathcal {H}}}_\ell (\mathbf{x}) \subset {\mathcal {H}}(\mathbf{x})\) be the set of all halfspaces in \({\mathcal {H}}(\mathbf{x})\) whose splitting hyperplanes are perpendicular to direction \(\ell \) in \(\mathfrak {R}^d\). Let \({{\mathcal {L}}}\) be the set of all directions \(\ell \in \mathfrak {R}^d\). Define
$$HM(\mathbf{x} \mid F, \ell ) := E_{{\mathbb {H}}_\ell (\mathbf{x})}\left[ P_F(H) \right] ,$$
where \({\mathbb {H}}_\ell (\mathbf{x})\) is a subset of \({{\mathcal {H}}}_\ell (\mathbf{x})\). From Definition 1, \(HM(\mathbf{x} \mid F)\) can be decomposed as
$$HM(\mathbf{x} \mid F) = \sum _{\ell \in {\mathbb {L}}} P_\ell \, HM(\mathbf{x} \mid F, \ell ),$$
where \(P_\ell := P(H \in {{\mathbb {H}}}(\mathbf{x}) \text{ s.t. } H \in {{\mathbb {H}}}_\ell (\mathbf{x}))\) is the probability that a random halfspace H from \({{\mathbb {H}}}(\mathbf{x})\) belongs to the set \({{\mathbb {H}}}_\ell (\mathbf{x})\), and \(\mathbb L \subset {\mathcal {L}}\) is the set of all directions \(\ell \) corresponding to \({{\mathbb {H}}}(\mathbf{x})\).
\(HM(\mathbf{x} \mid F,\ell )\) is equivalent to the univariate mass distribution on \(\ell \) where F is projected onto \(\ell \). Accordingly, from Lemma 1, for all \(\mathbf{x} \in R\), it is concave in the direction of \(\ell \) and constant in the directions perpendicular to \(\ell \). Thus, \(HM(\mathbf{x} \mid F,\ell )\) is concave in R. Since a sum of concave functions is also concave, \(HM(\mathbf{x} \mid F)\) is concave in R.
Proof of Theorem 2
Here we prove Theorem 2 by contradiction.
Suppose there exists more than one location in R that has the maximum halfspace mass value, say \(\mathbf{x}_1\) and \(\mathbf{x}_2\). Let \(\mathbf{x}^{\ell }\) denote the projection of \(\mathbf{x}\) on a line along direction \(\ell \) in \(\mathfrak {R}^d\), and \(F^{\ell }\) denote the projection of density F on \(\ell \). Let \(L = \{\mathbf{x}_1 + c (\mathbf{x}_2 - \mathbf{x}_1) \mid c \in (0,1) \}\) denote the segment that connects \(\mathbf{x}_1\) and \(\mathbf{x}_2\), and \(L^{\ell } = \{\mathbf{x}^{\ell }_1 + c (\mathbf{x}^{\ell }_2 - \mathbf{x}^{\ell }_1) \mid c \in (0,1) \}\) denote the projection of L. Concavity and the upper bound given by the maximum value lead to the following:
The one-dimensional halfspace mass of F projected on \(\ell \) is also concave in the projection of R, thus
Since \(HM(\mathbf{x} \mid F) = E_{{\mathcal {L}}} [HM( \mathbf{x}^{\ell } \mid F^{\ell })], \forall \mathbf{x}\), combining (4) and (5) we have
Equation (6) shows that \(HM(\mathbf{x}^{\ell } \mid F^{\ell })\) is linear for all \(\mathbf{x}^{\ell } \in L^{\ell }\); thus whenever \(HM(\mathbf{x}^{\ell } \mid F^{\ell })\) is twice differentiable, by (3) we have
where \(r_u - r_l\) is the length of the projection of R on \(\ell \).
But since F covers more than a straight line, there always exist an \(\ell \) and an \(\mathbf{x}\) such that \(\mathbf{x}^{\ell } \in L^{\ell }\) and \(F^{\ell }(\mathbf{x}^{\ell })>0\), which contradicts (7). Therefore, there is a unique location with the maximum halfspace mass value in R.
Proof of Theorem 3
Suppose that, for a dataset D of size n, a contaminating set Q of size \(n-1\) is strategically chosen. Let U denote the convex hull of D, and \(U^{\ell }\) denote its projection segment on a line along direction \(\ell \), assuming U has a finite volume in \(\mathfrak {R}^d\).
For any \(\ell \), the median of the projection of \( D \cup Q \) on \(\ell \) lies within \(U^{\ell }\): if it lay outside of \(U^{\ell }\), then at least n out of the \(2n-1\) points would be on one side of it, which contradicts the definition of the median. Since Ting et al. (2013) have shown that the univariate mass is maximised at its median, the maximum value of \(HM(\mathbf{x}^{\ell } \mid D^{\ell } \cup Q^{\ell })\) occurs in the segment \(U^{\ell }\) for all \(\ell \).
For a given query point \(\mathbf{x}\), let \({{\mathcal {L}}}_{\mathbf{x}}^- = \{\ell : \mathbf{x}^{\ell } \notin U^{\ell } \}\) denote the set of directions in \(\mathfrak {R}^d\) on which the projection of \(\mathbf{x}\) lies outside of the projection of the convex hull of D, and \({{\mathcal {L}}}_{\mathbf{x}}^+ = \{\ell : \mathbf{x}^{\ell } \in U^{\ell } \}\) denote the rest of the directions.
For any \(\ell \in {{\mathcal {L}}}^-_{\mathbf{x}}\), the one-dimensional mass \(HM(\mathbf{x}^{\ell } \mid D^{\ell } \cup Q^{\ell })\) increases as \(\mathbf{x}^{\ell }\) moves a small enough distance towards \(U^{\ell }\), since it is a concave function whose maximum value occurs somewhere in the segment \(U^{\ell }\).
Let \({{\mathcal {H}}}_{{{\mathcal {L}}}_{\mathbf{x}}^-}(\mathbf{x}) \subset {\mathcal {H}}(\mathbf{x})\) be the set of all halfspaces in \({\mathcal {H}}(\mathbf{x})\) whose splitting hyperplanes are perpendicular to directions \(\ell \in {{\mathcal {L}}}_{\mathbf{x}}^-\) in \(\mathfrak {R}^d\), and let \({{\mathcal {H}}}_{{{\mathcal {L}}}_{\mathbf{x}}^+}(\mathbf{x})\) be defined in the same way. By Definition 1, \(HM(\mathbf{x} \mid D \cup Q)\) can be decomposed into the sum of two parts as follows:
$$HM(\mathbf{x} \mid D \cup Q) = P_{{{\mathcal {L}}}^-_{\mathbf{x}}} E_{{{\mathcal {L}}}^-_{\mathbf{x}}}\left[ HM(\mathbf{x}^{\ell } \mid D^{\ell } \cup Q^{\ell }) \right] + P_{{{\mathcal {L}}}^+_{\mathbf{x}}} E_{{{\mathcal {L}}}^+_{\mathbf{x}}}\left[ HM(\mathbf{x}^{\ell } \mid D^{\ell } \cup Q^{\ell }) \right] ,$$
where \(P_{{{\mathcal {L}}}^-_{\mathbf{x}}} := P(H \in {{\mathcal {H}}}(\mathbf{x}) \text{ s.t. } H \in {\mathcal {H}}_{{{\mathcal {L}}}^-_{\mathbf{x}}}(\mathbf{x}))\) is the probability of a random halfspace H from \({\mathcal {H}}(\mathbf{x})\) belonging to \( {\mathcal {H}}_{{{\mathcal {L}}}^-_{\mathbf{x}}}(\mathbf{x})\); and \(P_{{{\mathcal {L}}}^+_{\mathbf{x}}}\) is defined similarly.
Note that as the distance between \(\mathbf{x}\) and U goes to infinity, for a random direction \(\ell \) in \(\mathfrak {R}^d\), \(P(\ell \in {{\mathcal {L}}}^-_{\mathbf{x}}) \rightarrow 1\) and \(P(\ell \in {{\mathcal {L}}}^+_{\mathbf{x}}) \rightarrow 0\); hence \(P_{{{\mathcal {L}}}^-_{\mathbf{x}}} \rightarrow 1\) and \(P_{{{\mathcal {L}}}^+_{\mathbf{x}}} \rightarrow 0\). A demonstration is shown in Fig. 4.
The location estimator T(D) is within U, the convex hull of D. If the distance between \(T(D \cup Q)\) and T(D) is infinite, then the distance between \(T(D \cup Q)\) and U is also infinite. Suppose \(\mathbf{x}^* = T(D \cup Q)\) is infinitely far away from U; then the solid angle of U over \(\mathbf{x}^*\) is 0, so almost surely \(\ell \in {{\mathcal {L}}}^-_{\mathbf{x}^*}, \forall \ell \in \mathfrak {R}^d\), and \(HM(\mathbf{x}^* \mid D \cup Q) = E_{ {{\mathcal {L}}}^-_{\mathbf{x}^*}}[HM({\mathbf {x^*}}^{\ell } \mid D^{\ell } \cup Q^{\ell })]\). Any movement of finite length from \(\mathbf{x}^*\) towards U increases the one-dimensional mass values \(HM(\mathbf{x}^{\ell } \mid D^{\ell } \cup Q^{\ell })\), \(\forall \ell \in {{\mathcal {L}}}^-_{\mathbf{x}} \), and thus increases the mass value \(HM(\mathbf{x} \mid D \cup Q)\), which contradicts the assumption that \(HM(\mathbf{x}^* \mid D \cup Q)\) is the maximum. Therefore \(T(D \cup Q)\) can only be finitely far away from T(D) for a contaminating dataset Q of size \(n-1\).
By the same reasoning, any contaminating dataset Q of any size between 1 and \(n-1\), combined with a dataset D of size n, can only cause a finite shift of the location estimator T. Therefore \(\epsilon (T, D) > \frac{n-1}{2n-1}\).
Relation to other data depth methods
Data depth models data distribution in terms of center-outward ranking rather than density or linear ranking, and it is a means to define the multivariate median. Two example data depth definitions and their associated median definitions are given in Tables 2 and 3, respectively. Halfspace depth and \(L_2\) depth are chosen because the former employs the same halfspaces as halfspace mass, while the latter is another maximally robust method. The definition of halfspace mass is also provided for comparison.
It is interesting to note the similarity between halfspace mass and halfspace depth: both are based on the probability mass of halfspaces. The main difference is between taking the expectation and taking the minimum over the probability mass of halfspaces. This leads to the improvement in breakdown point and the uniqueness of the median shown in Table 3.
\(L_2\) depth and halfspace mass share the same four properties: concavity, a unique median which is maximally robust, and a distribution that extends across dimensions in which the convex hull has zero volume. The key difference is the core mechanism: one employs halfspaces and the other uses distances. Computing without distance calculations leads directly to the advantage of halfspace mass in time complexity, as shown in Table 3.
Implementation. We implement halfspace depth using a technique similar to that used for \(\widehat{HM}(\mathbf{x} \mid D)\). In the same context given in Definition 2, an estimator of halfspace depth is defined as follows:
$$\widehat{HD}(\mathbf{x} \mid D) = \min _{i \in \{1,\ldots ,t\}} P_D(H_i),$$
We generate t halfspaces, which cover \(\mathbf{x}\) and intersect the convex hull of the given dataset, to find the one which gives the minimum probability mass. The implementation is similar to Algorithms 1 and 2. The differences are: in training \(\widehat{HD}(\mathbf{x} \mid D)\), \(\psi \) must equal \(|D|\) and it is most efficient to set \(\lambda = 1\); in the testing phase, \(\widehat{HD}(\mathbf{x})\) takes the minimum probability mass over the halfspaces, instead of averaging.
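For reference, a sampled halfspace-depth estimator can be sketched in Python as below. This is our own simplification, not the paper's implementation: the splitting hyperplane is placed through the query point itself, which is a common shortcut for approximating Tukey depth by random projections.

```python
import random

def hd_estimate(x, D, t=500, seed=0):
    """Approximate halfspace depth: the minimum, over t random
    directions, of the fraction of D whose projection is at least
    that of the query point x."""
    rng = random.Random(seed)
    d = len(D[0])
    best = 1.0
    for _ in range(t):
        u = [rng.gauss(0, 1) for _ in range(d)]
        px = sum(a * b for a, b in zip(x, u))
        frac = sum(
            sum(a * b for a, b in zip(p, u)) >= px for p in D
        ) / len(D)
        best = min(best, frac)
    return best
```

A point far outside the convex hull gets depth 0, a fringe point gets the lowest positive value, and a central point gets the highest value, which matches the layered behaviour discussed earlier.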
The implementation of \(L_2\) depth is straightforward: given a query point \(\mathbf{x}\), compute the sum of Euclidean distances to all points in D. The output of \(L_2D(\mathbf{x} \mid D)\) is computed as specified in Table 2.
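Assuming \(L_2\) depth takes the common form \(L_2D(\mathbf{x} \mid D) = \big( 1 + \frac{1}{|D|}\sum _{\mathbf{X} \in D} \Vert \mathbf{x} - \mathbf{X} \Vert \big)^{-1}\) (one standard formulation; the exact entry in Table 2 should be checked), a minimal sketch is:

```python
import math

def l2_depth(x, D):
    """L2 depth of query x w.r.t. dataset D, assuming the form
    (1 + mean Euclidean distance from x to the points of D)^(-1)."""
    mean_dist = sum(math.dist(x, p) for p in D) / len(D)
    return 1.0 / (1.0 + mean_dist)
```

The quadratic cost discussed in Sect. 7 comes from scoring all n points, each of which requires n distance computations.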
Applications of halfspace mass
The following two subsections demonstrate the application of halfspace mass to two tasks: anomaly detection and clustering.
Anomaly detection
The application of halfspace mass to anomaly detection is straightforward, since the distribution of halfspace mass is concave with center-outward ranking. Once every point in the given dataset has been scored, the points can be sorted; those close to the outer fringe of the distribution, i.e., having low scores, are more likely to be anomalies.
The above property is the same for halfspace depth and \(L_2\) depth. Thus, all three methods can be directly applied to anomaly detection.
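Whatever the scoring model, the detection step itself is then just a sort by ascending score; a generic sketch (function names are ours):

```python
def rank_anomalies(points, score_fn, top_k):
    """Return the indices of the top_k lowest-scored points: under a
    center-outward ranking, low scores mark likely anomalies."""
    order = sorted(range(len(points)), key=lambda i: score_fn(points[i]))
    return order[:top_k]
```

Any of the three scorers (HM, HD or \(L_2D\)) can be plugged in as `score_fn`.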
Clustering
We provide a simple clustering algorithm utilizing halfspace mass, designed in a fashion similar to the K-means clustering algorithm.
Let \(\mathbf{X}_i \in D, i = 1,...,n\) denote data points in dataset D and \(Y_i \in \{1,...,K\}\) denote the cluster labels, where K is the number of clusters. Let \(G_k := \{ \mathbf{X}_i \in D : Y_i = k \}\), where \(k \in \{1,...,K\}\), denote the points in the kth group.
The K-mass clustering procedure is given in Algorithm 3. The procedure begins with an initialization that randomly splits the dataset into K equal-size groups. Each iteration consists of two steps. First, the data in each group are used to generate a mass distribution \(\widetilde{HM}\). Second, each point \(\mathbf{X}_i\) in the dataset is regrouped based on the mass distributions: \(\widetilde{HM}\) for each group produces a mass value for \(\mathbf{X}_i\), and \(\mathbf{X}_i\) is assigned to the group which gives the maximum mass value. We normalise the mass values by the global minimum mass value to give small groups a better chance to survive the process. These two steps are iterated until the group labels remain unchanged, between two subsequent iterations, for at least a proportion p of the points in the dataset.
The K-means clustering algorithm (Jain 2010) is provided in Algorithm 4 for comparison. The K-mass and K-means algorithms share the same algorithmic structure; they differ only in the action taken in each of the two steps of the iteration process.
Note that when K-means is considered as an EM (Expectation-Maximisation) algorithm (Kroese and Chan 2014), it implements the expectation step in line 3 and the minimisation step in lines 4–6 of Algorithm 4. Similarly, K-mass implements the expectation step in line 3 and the maximisation step in lines 4–6 of Algorithm 3.
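A compact Python sketch of the K-mass iteration follows. It is our own simplification: the per-group mass model is the projection-based estimator described in Sect. 2, and both the normalisation by the global minimum mass value and the proportion-p stopping rule are omitted (we simply stop when labels no longer change or an iteration cap is reached).

```python
import random

def train_mass(D, t=200, seed=0):
    """Per-group halfspace-mass model: random direction plus random split."""
    rng = random.Random(seed)
    d = len(D[0])
    model = []
    for _ in range(t):
        u = [rng.gauss(0, 1) for _ in range(d)]
        proj = [sum(a * b for a, b in zip(p, u)) for p in D]
        lo, hi = min(proj), max(proj)
        if hi == lo:                       # degenerate projection: skip
            continue
        s = rng.uniform(lo, hi)
        model.append((u, s, sum(p < s for p in proj) / len(proj)))
    return model

def mass(model, x):
    """Average mass on x's side over the stored halfspaces."""
    if not model:
        return 0.0
    total = 0.0
    for u, s, m_left in model:
        p = sum(a * b for a, b in zip(x, u))
        total += m_left if p < s else 1.0 - m_left
    return total / len(model)

def k_mass(D, K=2, iters=10, seed=0):
    """K-mass sketch: random equal-size init, then alternate between
    fitting a mass model per group and reassigning each point to the
    group that gives it the highest mass."""
    rng = random.Random(seed)
    labels = [i % K for i in range(len(D))]
    rng.shuffle(labels)
    for it in range(iters):
        groups = [[D[i] for i in range(len(D)) if labels[i] == k]
                  for k in range(K)]
        models = [train_mass(g, seed=seed + it * K + k) if g else []
                  for k, g in enumerate(groups)]
        new = [max(range(K), key=lambda k: mass(models[k], x)) for x in D]
        if new == labels:                  # converged
            break
        labels = new
    return labels
```

Replacing the mass model with the group centroid, and the arg-max with the nearest-centroid rule, recovers K-means exactly, which makes the shared EM structure noted above explicit.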
Empirical evaluations
In this section, we conduct experiments to investigate the advantages of utilizing halfspace mass in anomaly detection and clustering, first with artificial datasets and then with real datasets. In both cases, robustness is the key determinant of the advantage of halfspace mass over its contenders.
To simplify notation, we hereafter use HM and \(HM^*\) to denote the sample version (\(\psi = |D|\)) and the computation-friendly version (\(\psi \ll |D|\)) of halfspace mass, respectively; HD and \(L_2D\) denote halfspace depth and \(L_2\) depth, respectively.
Anomaly detection
In this section, halfspace mass, halfspace depth and \(L_2\) depth are used for anomaly detection. That is, given a dataset, HM is constructed as described in Algorithms 1 and 2, while HD and \(L_2D\) are constructed as described in Sect. 5. Then each of the models is used to score each point in the dataset. In all cases, points with low mass/depth scores are more likely to be anomalies. The final ranking of the points is based on the scores produced by each model.
The area under the ROC curve (AUC) is used to measure the detection accuracy of an anomaly detector. \(AUC=1\) indicates that the anomaly detector ranks all anomalies in front of normal points; \(AUC=0.5\) indicates that the detector is a random ranker. Visualizations are used to show the impact of robustness. When comparing AUC values in the second experiment, a t-test at the \(5\,\%\) significance level is conducted based on the AUC values of multiple runs.
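For a scorer that ranks low scores as anomalous, AUC can be computed directly as a pairwise-comparison statistic; a small sketch:

```python
def auc_from_scores(anomaly_scores, normal_scores):
    """AUC of a detector that ranks LOW scores as anomalous: the
    fraction of (anomaly, normal) pairs in which the anomaly gets
    the lower score, with ties counting one half."""
    wins = 0.0
    for a in anomaly_scores:
        for n in normal_scores:
            if a < n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomaly_scores) * len(normal_scores))
```

When all anomalies score below all normal points the statistic is 1; identical scores for every pair give 0.5, matching the random-ranker baseline described above.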
The t parameter for both HM and HD is set to 5000 in the experiments, which is sufficiently large since further increases of t yield no noticeable AUC improvement. \(L_2\) depth requires no parameter setting.
Anomaly detection with artificial data
Here we show the importance of the robustness of an anomaly detector in identifying anomalies. An artificial dataset with two clusters of data points is generated for the experiment. As shown in Fig. 5, the dataset consists of a cluster of sparse normal points, along with a few local anomalies, on the left and a dense cluster of anomalies on the right. Center-outward ranking scores are calculated using HM, HD and \(L_2D\).
The AUC results, presented in the first row of Fig. 5, show that both HM and \(L_2D\) performed much better than HD. In this example, all three methods failed to detect some local anomalies, but HD also failed to detect the anomaly cluster on the right, while the other two methods separated the anomaly cluster from the normal points perfectly.
The second row of the plots in Fig. 5 shows the contour maps of mass/depth values when normal points contaminated with noise were used to train the anomaly detectors; and the third row of the plots shows the contour maps when normal data points only were used to train the anomaly detectors.
The contrast between the second and third rows of plots is a testament to the impact of robustness. Being maximally robust, the contour maps of HM and \(L_2D\) remain centered inside the normal cluster. In contrast, the contour map of HD is significantly stretched towards the anomaly cluster. This resulted in many clustered anomalies (on the right) being scored with depth values as high as those of many normal points, which impaired its ability to detect anomalies. Anomalies are contamination of the distribution of normal points; an anomaly detector which is not robust to contamination often produces poor ranking outcomes when detecting anomalies. This example shows the impact contamination has on an anomaly detector which is not robust.
Anomaly detection with benchmark datasets
Here we evaluate the performance of HM, \(HM^*\), HD and \(L_2D\) in anomaly detection using nine benchmark datasets (Lichman 2013). AUC values and runtime results are shown in Table 4. The figures are the average of 10 runs, except for \(L_2D\), which is a deterministic method. Boldface figures in the HM, \(HM^*\) and \(L_2D\) columns indicate that the differences are significant compared to HD, while boldface figures in the HD column indicate that the differences are significant compared to all of the other methods.
In comparison with HD, both HM and \(HM^*\) have 7 wins and 2 losses, which is evidence that halfspace mass performed better than HD on most datasets.
Note that HM and \(L_2D\) have similar AUC results. This is not surprising since both have the same four properties shown in Table 3.
\(HM^*\) using \(\psi =10\) performed comparably with HM on seven of the nine datasets. This suggests that the performance of \(HM^*\) could be further improved by tuning \(\psi \).
The major disadvantage of \(L_2D\) is its computational cost. \(L_2D\) ran orders of magnitude slower than the other methods on all datasets, except the smallest dataset with only 64 points. This is because \(L_2D\) not only has \(O(n^2)\) time complexity, but also involves distance calculations. Freedom from distance calculation is an important feature of halfspace mass, which makes it much more efficient.
Note that HD performed poorly on all three high-dimensional datasets. Our investigation suggests that as the number of dimensions increases, an increasing percentage of points appear at the outer fringe of the convex hull covering the dataset. Because HD assigns the same lowest depth value to all these points, they cannot be meaningfully ranked. This is why the AUC results of HD on these three datasets are close to 0.5, equivalent to random ranking. In a nutshell, HD is more prone to the curse of dimensionality than HM or \(L_2D\).
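This fringe effect is easy to observe in simulation. The sketch below (our own illustration, not the paper's code) approximates halfspace depth by taking, over random projection directions, the minimum fraction of points on a point's smaller side; in high dimensions, a far larger share of points collapse to the minimum possible depth \(1/n\):

```python
import numpy as np

def approx_halfspace_depth(X, n_dirs=2000, seed=1):
    """Approximate Tukey (halfspace) depth of every point in X:
    for each random direction, a point's depth is at most the fraction
    of points on its smaller closed side; take the min over directions."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    depth = np.full(n, 1.0)
    for _ in range(n_dirs):
        u = rng.normal(size=d)
        p = X @ u
        r = np.argsort(np.argsort(p))        # rank of each projection, 0..n-1
        side = np.minimum(r + 1, n - r) / n  # fraction on the smaller side
        depth = np.minimum(depth, side)
    return depth

rng = np.random.default_rng(0)
n = 200
low = approx_halfspace_depth(rng.normal(size=(n, 2)))    # 2 dimensions
high = approx_halfspace_depth(rng.normal(size=(n, 50)))  # 50 dimensions
# Fraction of points stuck at the minimum possible depth 1/n:
frac_low = np.mean(low <= 1.0 / n)
frac_high = np.mean(high <= 1.0 / n)
```

Points tied at depth \(1/n\) are exactly the ones HD cannot rank, which matches the near-random AUC reported above.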
HD outperformed the three other methods on the smtp and covertype datasets. A visualization of the smtp dataset revealed that all anomalous points are located at one corner of the data space, close to one normal cluster, as shown in Fig. 6. Being at the corner, these anomalies were assigned by HD the same lowest score as all points at the outer fringe, while HM or \(L_2D\) would assign them higher scores since they are closer to the center than other fringe points. Had the anomalies been located between the two clusters, at the same distance from the cluster, HD would have regarded them as normal points. In other words, HD is better able to detect the anomalies in this dataset simply because of the special positions in which they are placed (Footnote 1).
The runtime shown in Table 4 is the sum of training time and testing time. Because the efficiency of the computation-friendly version affects the training process only, Table 5 is provided to show the training and testing times of HM and \(HM^*\) separately. With a small subsample size \(\psi = 10\), \(HM^*\) runs at least two orders of magnitude faster than HM in the training phase on large datasets. Note that in Table 5, the testing time of \(HM^*\) is noticeably longer than that of HM for most datasets, although they are theoretically expected to be equal since the amount of computation is exactly the same. Our investigation reveals that this is due to a computational issue in Matlab (Footnote 2).
In summary, halfspace mass is the best anomaly detector among the three methods: it has significantly better detection accuracy than HD and runs orders of magnitude faster than \(L_2D\).
Clustering
This section reports the empirical evaluation of K-mass in comparison with K-means. The first experiment examines three scenarios in which K-means is known to have difficulty finding all clusters, i.e., clusters of different sizes, clusters of different densities, and the presence of noise. The second experiment evaluates the clustering performance using eight real datasets (Lichman 2013; Franti et al. 2006) (Footnote 3).
In every trial on a dataset, K-mass or K-means is executed for 40 runs and we report the best clustering result. Clustering performance is measured in terms of F-measure, and visualizations of the clustering results are presented where possible, i.e., for two-dimensional datasets.
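For concreteness, a minimal sketch of one common variant of the clustering F-measure (not necessarily the exact variant used in the paper): each true class is matched with the cluster that maximises its F1 score, and the per-class scores are averaged weighted by class size.

```python
import numpy as np

def clustering_f_measure(labels_true, labels_pred):
    """Overall F-measure of a clustering: each true class takes the best
    F1 over all clusters; per-class scores are weighted by class size."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    total = 0.0
    for c in np.unique(labels_true):
        in_class = labels_true == c
        best = 0.0
        for k in np.unique(labels_pred):
            in_cluster = labels_pred == k
            tp = np.sum(in_class & in_cluster)
            if tp == 0:
                continue
            precision = tp / in_cluster.sum()
            recall = tp / in_class.sum()
            best = max(best, 2 * precision * recall / (precision + recall))
        total += in_class.sum() / n * best
    return total

# A perfect clustering scores 1 regardless of how cluster ids are permuted.
perfect = clustering_f_measure([0, 0, 1, 1], [1, 1, 0, 0])
```

Because cluster labels are matched to classes by score rather than by id, the measure is invariant to label permutation.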
K-mass employs \(HM^*\) with \(\psi =5\) and \(t=2000\) as defaults in all experiments; it uses \(\lambda = 3\) in the first experiment and \(\lambda = 1.6\) in the second. Recall that \(\lambda \) controls the size of the convex hull covering the dataset. Because the subsample size is \(\psi =5\), the convex hull must be enlarged (using \(\lambda > 1\)) in order to cover points which lie outside it. For the stopping criterion p, both K-mass and K-means use \(p=1\) in the first experiment, and search for the best result with \(p=0.98\) and 1 in the second.
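The role of \(\lambda\) can be sketched in one dimension (a sketch consistent with the description above; the function and variable names are our own): after projecting the \(\psi\)-point subsample onto a random direction, the split point is drawn from the projected range enlarged by the factor \(\lambda\) about its midpoint, so points outside the subsample's range can still fall on either side of a split.

```python
import numpy as np

def split_point(projected_sample, lam, rng):
    """Draw one split point from the projected subsample's range,
    enlarged by a factor `lam` about its midpoint. With lam > 1,
    points outside the subsample's convex hull can still be split."""
    lo, hi = projected_sample.min(), projected_sample.max()
    mid, half = (lo + hi) / 2.0, (hi - lo) / 2.0
    return rng.uniform(mid - lam * half, mid + lam * half)

rng = np.random.default_rng(0)
proj = rng.normal(size=5)                 # psi = 5 projected points
s = split_point(proj, lam=3.0, rng=rng)   # lambda = 3 as in experiment 1
```

With \(\lambda = 1\) the split always falls inside the subsample's range, which would leave outside points unseparable; \(\lambda > 1\) trades a wider split range for coverage.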
Clustering with artificial data
Figures 7, 8 and 9 show the clustering results of K-mass and K-means on three artificial datasets, representing scenarios with clusters of different densities, clusters of different sizes, and the presence of noise, respectively.
In scenario 1, as shown in Fig. 7, the dataset consists of two sparse clusters and two significantly denser clusters. K-mass easily converged to the globally optimal result, but K-means converged to a locally optimal result which wrongly assigned some points. While K-means could converge to the globally optimal result given an ideal initialization, this is unlikely because the sparse and dense clusters have different data sizes.
In scenario 2, the four clusters are of equal density but of different data sizes, as shown in Fig. 8. K-mass separated the four clusters well, but K-means failed to converge to the global optimum because of its tendency to split halfway between group centers.
Scenario 3 demonstrates the importance of robustness in clustering. The dataset consists of four clusters of equal size and density, with noise scattered around the four clusters. Figure 9 shows that K-mass, despite having an F-measure \({<}1\) because the noise points were assigned to the nearest clusters, separated the four clusters perfectly, while K-means wrongly assigned many points of the four clusters. This is because K-means is not robust against outliers, so its group centers are easily influenced by noise.
In summary, K-mass perfectly separated the four clusters in all three scenarios, while K-means failed to do so.
Clustering with real datasets
Table 6 lists the data characteristics as well as the best results of K-mass and K-means in terms of F-measure. K-mass outperforms K-means with 6 wins, 1 draw and 1 loss. K-mass runs slower than K-means because it must train K models at each iteration, and K-mass is expected to need more iterations than K-means in general.
Discussion
Mass estimation (Ting et al. 2013) was recently proposed as an alternative to density estimation in data modeling. It has significant advantages over density estimation in efficiency and/or efficacy in various data mining tasks such as anomaly detection, clustering, classification and information retrieval (Ting et al. 2013). Despite this success, the formal definition of mass is univariate only and its theoretical analysis is limited to two properties: (i) its mass distribution is concave, and (ii) its maximum mass point is equivalent to median (Ting et al. 2013).
Halfspace mass can be viewed as a generalisation of univariate mass estimation to multi-dimensional spaces, and it has four properties rather than the two revealed previously. One-dimensional mass estimation is defined as a weighted probability mass (see the details in the Appendix). In one-dimensional space, halfspace splits reduce to binary splits, and halfspace mass reduces to the weighted probability mass defined in Ting et al. (2013).
The two additional properties of halfspace mass, i.e., maximal robustness and extension across dimensions, are important in understanding the behaviour of any algorithm designed based on halfspace mass, as we have shown in the empirical evaluation section.
The proof of concavity in Lemma 1 uses the same idea as the concavity proof presented by Ting et al. (2013). The other ideas in this paper are new.
Ting et al. (2013) also gave a definition of higher-level mass estimation, which can be viewed as a localised version of level-1 mass estimation. We have limited our exposition to level-1 mass estimation in this paper so as to allow a direct comparison with data depth and its properties. As a result, it is limited to modeling data with a unimodal distribution having a unique maximum as the median. On datasets with multimodal distributions, HM will be outperformed by existing density-based anomaly detectors. We believe that HM can be extended to higher-level mass estimation, as shown in the one-dimensional case (Ting et al. 2013), which could be regarded as a localized data depth method (Agostinelli and Romanazzi 2011). We will explore higher-level mass estimation using halfspace mass in the near future.
The successful application of halfspace mass in K-mass implies that other data depth methods may also be applicable in K-mass. Our investigation reveals that because halfspace depth can only provide estimations within the convex hull of a given dataset (i.e., it lacks the fourth property stated in Sect. 3.4), it cannot be applied in K-mass. A K-mass version using \(L_2\) depth exhibits a better convergence property than K-mass; however, its performance is in general worse than both K-mass and K-means (Footnote 4). Another drawback of \(L_2\) depth is that it is very costly to compute on large datasets.
Despite all the advantages of K-mass over K-means shown in this paper, a caveat is in order: we do not have a proof that K-mass always converges, as K-means does.
Conclusions
This paper makes three key contributions:
First, we propose the first formal definition of halfspace mass, a significantly improved version of halfspace data depth; to the best of our knowledge, it is the only data depth method which is both robust and efficient.
Second, we reveal four theoretical properties of halfspace mass: (i) halfspace mass is concave in a convex region; (ii) it has a unique median; (iii) the median is maximally robust; and (iv) its estimation extends to a higher-dimensional space in which the training data occupies a zero-volume convex hull.
Third, we demonstrate applications of halfspace mass in two tasks: anomaly detection and clustering. In anomaly detection, it outperforms the popular halfspace depth because it is more robust and able to extend across dimensions; and it runs orders of magnitude faster than \(L_2\) data depth. In clustering, we introduce K-mass by using halfspace mass, instead of a distance function, in the expectation and maximisation steps of K-means. We show that K-mass overcomes three weaknesses of K-means. The maximal-robustness property of halfspace mass contributes directly to these outcomes in both tasks.
Notes
 1.
We suspect that the result on the covertype dataset is due to a similar reason, but we could not visualize it because of its dimensionality.
 2.
When a fixed-size vector is compared to a scalar in Matlab, the runtime of the comparison is not constant: it varies significantly with the value of the scalar. The closer the scalar is to the median of the numbers in the vector, the longer the comparison takes. Because \(HM^*\) uses a small subsample for projection, the split points \(s_i\) in Algorithm 1 are selected within a narrower range than if the whole dataset were used. Thus \(s_i\) lies near the median of the whole dataset more often in \(HM^*\) than in HM. As a result, the comparisons take significantly longer in \(HM^*\) than in HM in the testing stage. However, this effect is dampened in high-dimensional datasets because the high dimensionality makes the range after projection much longer, even for a small subsample. This irregularity will not occur if another programming language is used.
 3.
 4.
The best F-measures out of 40 runs using \(L_2\) depth in clustering on the eight datasets are: 0.947 (iris), 0.905 (seeds), 0.626 (column), 0.595 (banknote), 0.939 (breast), 1 (dim), 0.896 (wdbc), 0.943 (wine).
References
Agostinelli, C., & Romanazzi, M. (2011). Local depth. Journal of Statistical Planning and Inference, 141(2), 817–830.
Aloupis, G. (2006). Geometric measures of data depth. DIMACS Series in Discrete Math and Theoretical Computer Science, 72, 147–158.
Donoho, D. L., & Gasko, M. (1992). Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Annals of Statistics, 20(4), 1803–1827.
Dutta, S., Ghosh, A. K., & Chaudhuri, P. (2011). Some intriguing properties of Tukey’s halfspace depth. Bernoulli, 17(4), 1420–1434.
Franti, P., Virmajoki, O., & Hautamaki, V. (2006). Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1875–1881.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651–666.
Kroese, D. P., & Chan, J. C. C. (2014). Statistical modeling and computation. New York: Springer.
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science
Liu, R. Y., Parelius, J. M., & Singh, K. (1999). Multivariate analysis by data depth: descriptive statistics, graphics and inference. The Annals of Statistics, 27(3), 783–840.
Lopuhaa, H. P., & Rousseeuw, P. J. (1991). Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Annals of Statistics, 19(1), 229–248. doi:10.1214/aos/1176347978.
Mosler, K. (2013). Depth statistics. In C. Becker, R. Fried, & S. Kuhnt (Eds.), Robustness and complex data structures. Festschrift in honour of Ursula Gather (pp. 17–34). Berlin: Springer.
Tan, P.-N., Steinbach, M., & Kumar, V. (2014). Introduction to data mining (2nd ed.). Pearson Education Ltd.
Ting, K. M., Zhou, G.-T., Liu, F., & Tan, J. S. C. (2010). Mass estimation and its applications. In Proceedings of KDD'10: The 16th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 989–998).
Ting, K. M., Zhou, G.-T., Liu, F., & Tan, J. S. C. (2013). Mass estimation. Machine Learning, 90(1), 127–160.
Tukey, J. W. (1975). Mathematics and picturing data. In Proceedings of the 1975 international congress of mathematics (Vol. 2, pp. 523–531).
Zuo, Y., & Serfling, R. (2000). General notion of statistical depth function. Annals of Statistics, 28, 461–482.
Acknowledgments
This project is partially supported by a grant from the U.S. Air Force Research Laboratory, under agreement # FA23861314043, awarded to Kai Ming Ting. It is also partially supported by JSPS KAKENHI Grant Number 25240036, awarded to Takashi Washio. Bo Chen and Gholamreza Haffari are grateful to National ICT Australia (NICTA) for their generous funding, as part of the Machine Learning Collaborative Research Projects. Bo Chen is also supported by a scholarship from the Faculty of Information Technology, Monash University.
Additional information
Editors: João Gama, Indrė Žliobaitė, Alípio M. Jorge, and Concha Bielza.
Appendix: One-dimensional mass
This appendix reiterates one-dimensional mass estimation, as presented by Ting et al. (2010), for ease of comparison with the halfspace mass introduced in this paper.
Let \(x_1 < x_2 < \cdots < x_{n-1} < x_n\) be points on the real line, \(x_i \in \mathcal {R}\) and \(n>1\). Let \(s_i\) be the binary split between \(x_i\) and \(x_{i+1}\), yielding two non-empty regions with masses \(m_i^L\) and \(m_i^R\).
Definition 4
Mass base function: \(m_i(x)\), as a result of \(s_i\), is defined as
\[ m_i(x) = \begin{cases} m_i^L = i & \text{if } x < s_i \\ m_i^R = n - i & \text{if } x > s_i \end{cases} \]
Note that \(m_i^L = n - m_i^R = i\).
Definition 5
Mass distribution: \(mass(x_a)\) for a point \(x_a \in \{x_1, x_2, \ldots , x_{n-1}, x_n\}\) is defined as a summation of the mass base functions \(m_i(x)\), weighted by \(p(s_i)\), over the \(n-1\) splits, where \(p(s_i)\) is the probability of selecting \(s_i\):
\[ mass(x_a) = \sum _{i=1}^{n-1} m_i(x_a)\, p(s_i) = \sum _{i=a}^{n-1} i\, p(s_i) + \sum _{i=1}^{a-1} (n-i)\, p(s_i). \]
Note that \(\sum _{i=q}^{r} f(i)\) is defined to be 0 when \(r < q\), for any function \(f\). On the real line, \(p(s_i)\) can be estimated as \(p(s_i) = (x_{i+1} - x_i)/(x_n - x_1) > 0\), as a result of random selection of splits under a uniform distribution.
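These two definitions can be computed directly. The sketch below (our own illustration; variable names are ours) assigns each split the mass of the half containing the point, weighted by \(p(s_i)\), and exhibits the concavity and median properties stated for one-dimensional mass:

```python
import numpy as np

def one_dim_mass(x):
    """One-dimensional mass for each sorted point x_a:
    mass(x_a) = sum_i m_i(x_a) p(s_i), where the split s_i lies between
    x_i and x_{i+1}, m_i(x_a) is the mass of the half containing x_a
    (i on the left, n - i on the right), and
    p(s_i) = (x_{i+1} - x_i) / (x_n - x_1)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    p = np.diff(x) / (x[-1] - x[0])        # p(s_i) for i = 1..n-1
    mass = np.empty(n)
    for a in range(n):                     # x_a, 0-indexed
        i = np.arange(1, n)                # split index i = 1..n-1
        m = np.where(i >= a + 1, i, n - i) # i if x_a < s_i, else n - i
        mass[a] = np.sum(m * p)
    return mass

mass = one_dim_mass(np.arange(1, 8))  # 7 equally spaced points
# For this symmetric sample the distribution is concave with its
# maximum at the middle point, i.e., the median.
```

For the seven equally spaced points, the resulting masses rise monotonically from the extremes (3.5) to the median (5), illustrating properties (i) and (ii) of Sect. 1.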
Chen, B., Ting, K. M., Washio, T., et al. Half-space mass: a maximally robust and efficient data depth method. Mach Learn 100, 677–699 (2015). https://doi.org/10.1007/s10994-015-5524-x
Keywords
 Halfspace mass
 Mass estimation
 Data depth
 Robustness