IoT Analytics
Evaluating Techniques for WiFi Location
In this review, I work on a new project related to WiFi fingerprinting. It is used to navigate clients in shopping malls: it plays a role similar to GPS, but it works indoors to determine the location of the user.
In this study, I review and evaluate the performance of three different ML models for predicting the user's indoor location.
The dataset I used for implementing the models is UJIIndoorLoc from the UCI repository [1].
Importing data
First of all, we need to import the data and review the dataset documentation.
import pandas as pd

train = pd.read_csv("../input/UjiIndoorLoc/TrainingData.csv")
test = pd.read_csv("../input/UjiIndoorLoc/ValidationData.csv")
There are two partitions of this dataset: train and test. The training partition has 19,937 instances and 520 WAP features, without any missing values. I also checked the data type of each feature: all of them are numeric, and none is categorical.
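A quick sanity check along these lines (a minimal sketch, not the full analysis):
print(train.shape)                     # (instances, attributes)
print(train.isnull().sum().sum())      # total missing values, expected to be 0
print(train.dtypes.value_counts())     # all columns should be numeric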
Feature selection
The dataset is quite large, so to keep the analysis quick I took a subsample from the training batch. On top of that, we can drop the features with insignificant variance. For this, it is important to check the values of the variables, which I did by counting the unique values of each feature. Why is this important? Because we are taking the variance into account, and a meaningful p-value requires the values to follow a proper distribution.
I removed the features with zero variance, which reduced the number of attributes.
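A minimal sketch of these two steps (subsampling and dropping zero-variance features), using scikit-learn's VarianceThreshold and an arbitrary sample size of 2,000 rows:
from sklearn.feature_selection import VarianceThreshold

# Subsample the training batch to speed up the analysis (2,000 rows is an arbitrary choice)
sample = train.sample(n=2000, random_state=42)

# Keep only the WAP signal-strength columns and drop those with zero variance
wap_cols = [c for c in sample.columns if c.startswith('WAP')]
selector = VarianceThreshold(threshold=0.0)
selector.fit(sample[wap_cols])
kept_cols = [c for c, keep in zip(wap_cols, selector.get_support()) if keep]
print(len(wap_cols), '->', len(kept_cols), 'features after removing zero-variance columns')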
Note that in the code you can find Spearman's correlation, which I applied to pairs of variables sequentially; the pairs with the highest positive and the most negative values have the strongest relationships.
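Continuing from the subsample above, a sketch of how the strongest positive and negative pairs can be found with pandas:
import numpy as np

# Spearman's rank correlation over the retained features
corr = sample[kept_cols].corr(method='spearman')

# Mask the diagonal, then look at the strongest positive and negative pairs
pairs = corr.where(~np.eye(len(corr), dtype=bool)).stack()
print(pairs.idxmax(), pairs.max())   # strongest positive relationship
print(pairs.idxmin(), pairs.min())   # strongest negative relationship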
Define the independent and response variables
We need to define the input and response variables based on the problem we are trying to solve. I want to predict the client's location, so I need a target variable related to that location. My idea is to create a new location feature based on longitude and latitude.
Create a single value from Latitude and Longitude
To represent the location as a single value, I used the Haversine formula [2], which returns the great-circle distance between two points.
sin(lat/2)**2 + cos(lat) * sin(lng/2)**2
So I took into account the Earth's radius in kilometers and calculated the response variable as each client's distance from the reference point at latitude and longitude zero.
from math import radians, sin, cos, asin, sqrt

def haversine(lat, lng):
    # Great-circle distance (km) from (lat, lng) to the reference point (0, 0)
    r = 6371                               # Earth's radius in kilometers
    lat, lng = map(radians, [lat, lng])
    a = sin(lat/2)**2 + cos(lat) * sin(lng/2)**2
    return 2 * r * asin(sqrt(a))

dist = []
for i in range(train.shape[0]):
    dist.append(haversine(train.LATITUDE[i], train.LONGITUDE[i]))
dist = pd.Series(dist)
train['distance'] = dist.values
I applied this method to both the train and the test sets, over all samples [3].
Now that we have the response variable, it is time to drop the unneeded columns and define the input and output variables with their target labels.
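A minimal sketch of this step, assuming the documented UJIIndoorLoc column names (the WAP signal columns and FLOOR) and the distance feature created above; the variable names are illustrative:
# Inputs: only the WAP signal-strength columns
wap_cols = [c for c in train.columns if c.startswith('WAP')]
X_train = train[wap_cols].values

# Targets: the engineered distance for regression, the FLOOR label for classification
Y_dist = train['distance'].values
Y_train = train['FLOOR'].values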
Modeling
Regression
Here, the new feature (distance) is continuous, so the natural choice is a regression model. For this, I used Support Vector Regression, trained on this input and output, to predict the customer's location in the shopping mall.
Note that this study only includes sample code; to access the main code, please refer here.
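As a minimal sketch of the regression step, assuming the X_train and Y_dist arrays defined earlier (the RBF kernel is the scikit-learn default, not necessarily the setting of the main code):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Scale the features to unit variance, then fit an SVR on the distance target
svr = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
svr.fit(X_train, Y_dist)
print(svr.score(X_train, Y_dist))   # R^2 on the training data, as a quick sanity check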
Classification
For the classification, I considered the FLOOR where the customer is located.
For the classification, I used the following models and trained them over k-fold cross-validated splits:
- C5.0
- SVC
- KNN
from sklearn.svm import SVC
SVC(gamma='auto')
Note that during the training, I normalized all the features by scaling them to unit variance.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pred_svc = np.zeros_like(Y_train)   # out-of-fold predictions
score_svc = []
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(X_train, Y_train):
    x_train, x_test = X_train[train_index], X_train[test_index]
    y_train, y_test = Y_train[train_index], Y_train[test_index]
    clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))  # unit-variance scaling + SVC
    clf.fit(x_train, y_train)
    pred_svc[test_index] = clf.predict(x_test)
    score_svc.append(clf.score(x_test, y_test))
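The same cross-validation loop can be reused for the other two models; the sketch below swaps in scikit-learn's KNeighborsClassifier and, as a stand-in for C5.0 (which scikit-learn does not provide), a DecisionTreeClassifier with the entropy criterion:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Same pipeline pattern as above, with the estimator swapped out
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
tree = make_pipeline(StandardScaler(), DecisionTreeClassifier(criterion='entropy'))
# score_knn and score_tree are then collected fold by fold, exactly like score_svc above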
Model evaluation
For the model evaluation of the classification task, suitable metrics are the confusion matrix, accuracy, and Cohen's kappa [4], which I implemented. The model with the highest performance can then be selected to predict the unseen values in the test dataset.
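A short sketch of these metrics, computed from the out-of-fold predictions pred_svc collected above:
from sklearn.metrics import confusion_matrix, accuracy_score, cohen_kappa_score

print(confusion_matrix(Y_train, pred_svc))    # per-floor confusion matrix
print(accuracy_score(Y_train, pred_svc))      # overall accuracy
print(cohen_kappa_score(Y_train, pred_svc))   # Cohen's kappa [4]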
pd.DataFrame([[np.mean(score_svc), np.mean(score_knn), np.mean(score_tree)]],
             index=["accuracy"], columns=['SVC', 'KNN', 'C5.0'])
References
- 1- Torres-Sospedra, Joaquín, et al. “UJIIndoorLoc: A new multi-building and multi-floor database for WLAN fingerprint-based indoor localization problems.” 2014 international conference on indoor positioning and indoor navigation (IPIN). IEEE, 2014.
- 2- Haversine formula
- 3- Related code
- 4-Artstein, Ron, and Massimo Poesio. “Inter-coder agreement for computational linguistics.” Computational linguistics 34.4 (2008): 555–596.