Data processing and classification
Once the files were aligned, preliminary classification testing was conducted on the dataset. An SVM model with a radial basis function kernel was used across all tests. We note that we also experimented with deep neural networks trained on the raw signal but obtained better results with the hand-crafted features. This might be due to limited training data, or because network architectures need to be more carefully tailored to this data modality. The dataset includes the raw signals to support further research on deep neural networks for this task.
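For illustration, a minimal scikit-learn sketch of this baseline classifier is given below; the feature and label arrays are placeholder names, and the exact preprocessing and SVM hyperparameters used in our tests are not reproduced here.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# features: (n_keypresses, n_features) array of hand-crafted EMG features
# labels:   (n_keypresses,) array of key identities ('a'..'z', 'space')
features = np.load("features.npy")   # hypothetical file name
labels = np.load("labels.npy")       # hypothetical file name

# RBF-kernel SVM evaluated with 4-fold cross-validation, as in the text
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, features, labels, cv=4)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")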
First, different feature sets were tested for their classification accuracy. Common time-domain EMG feature sets from studies by Phinyomark et al., Hudgins et al. and Du et al.15,16,17 were compared across all 19 participants. In addition, ten common EMG features (Root Mean Square (RMS), Logarithmic Variance (LOGVAR), Waveform Length (WL), Willison Amplitude (WAMP), Slope Sign Changes (SSC), Zero Crossings (ZC), and Autoregressive Coefficients 1-4 (AR1, AR2, AR3, AR4)) were identified from the current literature, and an exhaustive search of all feature combinations was performed for a single participant. The twelve top feature sets from this sweep were then run across all 19 participants and the results were compared to the standard feature sets. For this comparison, classification was performed using k-fold cross-validation with 4 folds and 5 samples per letter. Feature comparison was performed with a 0.2 s window length; the results are shown in Fig. 3. The standard feature sets from the literature performed worse than the feature sets found through the feature sweep. The best-performing feature set from this comparison was (RMS, LOGVAR, WL, WAMP, ZC, AR1, AR2), with a classification accuracy of 87.4 ± 2.5% averaged across the two test days for all 19 participants.
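The sketch below shows plausible implementations of the time-domain features in this winning set; the WAMP and ZC thresholds and the autoregressive estimation method are assumptions for illustration, not the exact values used in our analysis.

import numpy as np

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def logvar(x):
    return np.log(np.var(x))

def waveform_length(x):
    return np.sum(np.abs(np.diff(x)))

def wamp(x, thr=0.02):
    # Willison amplitude: count of consecutive-sample differences above a threshold
    return np.sum(np.abs(np.diff(x)) > thr)

def zero_crossings(x, thr=0.0):
    # sign changes whose amplitude difference exceeds a (assumed) noise threshold
    return np.sum((x[:-1] * x[1:] < 0) & (np.abs(x[:-1] - x[1:]) > thr))

def ar_coeffs(x, order=4):
    # Autoregressive coefficients via least squares (one simple estimator;
    # the estimation method is not specified in the text)
    X = np.column_stack([x[i:len(x) - order + i] for i in range(order)])
    y = x[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def window_features(window):
    """window: (n_samples, n_channels) EMG segment -> flat feature vector."""
    feats = []
    for ch in window.T:
        feats.extend([rms(ch), logvar(ch), waveform_length(ch),
                      wamp(ch), zero_crossings(ch), *ar_coeffs(ch)[:2]])
    return np.array(feats)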

The effect of window size for each keypress segment was also explored. Window sizes of 0.1 to 0.8 s, in increments of 0.1 s, were tested for classification accuracy and standard deviation. As in the feature analysis, k-fold cross-validation with 4 folds was used for this comparison. We note that an average typist generally types 40-60 words per minute, which translates on average to 180+ keypresses per minute; a timing window of less than 0.3 s should therefore be used for real-time application. A 0.2 s window limits a typist to 300 characters per minute, or roughly 60-80 words per minute. Figure 4 provides the results for different window lengths, which show that classification accuracy increases with window length, though with diminishing returns. There is thus a trade-off between real-time usability of the algorithm and window length; a window length of 0.2 s is recommended for the baseline feature set to best balance classification accuracy against the time needed per keypress.
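A sketch of this window-length sweep is shown below, reusing the window_features function and clf pipeline from the earlier sketches; the sampling rate and the centering of the window on the keypress timestamp are assumptions.

import numpy as np
from sklearn.model_selection import cross_val_score

FS = 2000  # sampling rate in Hz (assumed; use the dataset's actual rate)

def extract_windows(emg, keypress_samples, win_s):
    """emg: (n_samples, n_channels); keypress_samples: keypress sample indices."""
    half = int(win_s * FS / 2)
    return np.stack([emg[k - half:k + half] for k in keypress_samples])

# emg, keypress_samples and labels are placeholder arrays for one participant
for win_s in np.arange(0.1, 0.9, 0.1):
    windows = extract_windows(emg, keypress_samples, win_s)
    X = np.stack([window_features(w) for w in windows])
    scores = cross_val_score(clf, X, labels, cv=4)
    print(f"{win_s:.1f} s window: {scores.mean():.3f} +/- {scores.std():.3f}")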

Mean classification accuracy across different window lengths with associated standard deviation.
Because the spacebar trials were extracted from the synchronization presses used when collecting the letter trials, precise movement instructions were not provided for the spacebar. The movements associated with this class are therefore expected to be more variable. Indeed, when we exclude the spacebar class, the accuracy on the resulting 26-class problem using the 0.2 s window and the feature set (RMS, WAMP, AR1, AR2, AR3) increases to 90.2 ± 2.1%.
Inter-session classification was also evaluated. For this analysis, we computed the classification accuracy when training on data from day one and testing on data from day two, and vice versa. The average of these two results was then taken as the overall inter-session classification accuracy per participant. The feature set and window size used for this evaluation were those obtained during the intra-session optimization. Inter-session comparisons per participant resulted in poorer performance than the intra-session evaluation (Fig. 5). The highest single-participant classification accuracy was 24.26 ± 0.53%, observed for P1. The average classification accuracy across all participants for the inter-session comparison was 13.66 ± 1.71%.
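The sketch below illustrates this two-way evaluation; the day-one and day-two feature matrices and label arrays are placeholder names built with the feature pipeline above.

def inter_session_accuracy(clf, X_day1, y_day1, X_day2, y_day2):
    # Train on one day, test on the other, then average the two directions
    acc_1_to_2 = clf.fit(X_day1, y_day1).score(X_day2, y_day2)
    acc_2_to_1 = clf.fit(X_day2, y_day2).score(X_day1, y_day1)
    return (acc_1_to_2 + acc_2_to_1) / 2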

Classification accuracy for training and testing on datasets from two separate days per participant.
A leave-one-participant-out (LOPO) approach was used to assess the transferability of classification models between participants. Once again, the feature set and window size for this evaluation were those obtained during the intra-session optimization. As in the inter-session experiment, poor performance was observed for the LOPO models (Fig. 6). Here, the highest classification accuracy was 24.81%, found for P14, with an average classification accuracy of 15.24 ± 5.08%. This poor result is not surprising, as generalizing across sessions and individuals is a more challenging task. We also note the similarity between the inter-session and LOPO accuracies; with the current features, it appears that different sessions are as distinct as different participants.
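This evaluation can be sketched with scikit-learn's LeaveOneGroupOut splitter, using participant IDs as the grouping key; the array names below are placeholders.

from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# X: features pooled over all participants, y: key labels,
# groups: participant ID for each sample
lopo_scores = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut())
print(f"LOPO accuracy: {lopo_scores.mean():.3f} +/- {lopo_scores.std():.3f}")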

Classification accuracy of the LOPO classification model.
Federated Learning Experiments
We further explored how data can be shared across users via federated learning (FL). In federated learning, all users train a joint model without aggregating their data, in order to maintain privacy. They do, however, send gradients, which can reveal some information about their datasets. We also experimented with personalized federated learning, where each user has a unique personalized classifier instead of one joint classifier for all users.
As federated learning gives access to more data, we tried a slightly more complex model, a multilayer perceptron (MLP) over the extracted features. The FL experiments were conducted using the calculated feature data (RMS, LOGVAR, WL, WAMP, ZC, AR1, AR2). Each 96-entry input vector corresponds to a single window: 6 calculated features for each of the 16 channels.
A few MLP architectures were tested. The results published here were achieved using an MLP that contains 4 Dense Blocks of sizes 96 → 192 → 192 → 192 → 48, followed by a Linear Layer 48 → 26. The first 3 Dense Blocks contain a Linear Layer followed by a ReLU activation and Dropout with probability 0.5; the 4th Dense Block uses an ELU activation and no Dropout.
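A PyTorch sketch of this architecture is shown below; it follows the description above, but training details such as the optimizer and learning rate are not reproduced here.

import torch.nn as nn

def dense_block(in_dim, out_dim, activation, dropout):
    layers = [nn.Linear(in_dim, out_dim), activation]
    if dropout > 0:
        layers.append(nn.Dropout(dropout))
    return nn.Sequential(*layers)

# 4 Dense Blocks (96 -> 192 -> 192 -> 192 -> 48) followed by a 48 -> 26 classifier
mlp = nn.Sequential(
    dense_block(96, 192, nn.ReLU(), 0.5),
    dense_block(192, 192, nn.ReLU(), 0.5),
    dense_block(192, 192, nn.ReLU(), 0.5),
    dense_block(192, 48, nn.ELU(), 0.0),
    nn.Linear(48, 26),
)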
In each FL round, 10 clients (out of 19) were sampled to receive the current global model. Each sampled client trained for about 20 local epochs before sending back its gradients.
We experimented with one baseline federated learning algorithm, FedAvg18, and two personalized federated learning algorithms, FedPer19 and pFedGP20. In FedAvg, each round the server sends the model to several clients; the clients compute gradients on their data, and the server aggregates these gradients to update the model. In FedPer, the entire network except the last layer is shared and trained using FedAvg, while each client's last layer is unique and trained on its personal dataset. In pFedGP, we use a Gaussian process with deep kernel learning, where the kernel is shared between clients and learned with FedAvg, while each client predicts using its own Gaussian process conditioned on its personal dataset.
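A minimal sketch of one FedAvg round over the MLP above is given below; the optimizer, learning rate and client data loaders are assumptions, and the server here averages the locally trained weights, which is equivalent to averaging the clients' accumulated updates.

import copy, random
import torch

def local_train(model, loader, epochs=20, lr=1e-3):
    # Train a local copy of the global model on one client's data
    model = copy.deepcopy(model)
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # assumed optimizer
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict()

def fedavg_round(global_model, client_loaders, n_sampled=10):
    # Sample clients, train locally, then average their weights on the server
    sampled = random.sample(client_loaders, n_sampled)
    states = [local_train(global_model, loader) for loader in sampled]
    new_state = {k: torch.stack([s[k] for s in states]).mean(0)
                 for k in states[0]}
    global_model.load_state_dict(new_state)
    return global_model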
The results of the federated learning experiments are detailed in Table 1. We first notice that, despite the higher model capacity and access to more data, the intra-session accuracy is lower than that of the SVM trained on a single user. This again shows that transferring knowledge between users and sessions is a challenging problem. We hope that careful neural network design for this unique problem will address this issue. We also note that the simple FedAvg model shows the best performance on the inter-session challenge, although the results are too low to be of practical use.
Implications of Validation Results for Open Challenges
The typing dataset presented here provides a valuable resource for improving myoelectric control strategies on a challenging problem with clear applications in human-machine interaction, going beyond common hand gesture recognition scenarios. We showed that classical machine learning methods perform relatively well on the intra-session problem for this task: a simple model on a standard feature set achieves 87.4 ± 2.5% accuracy. However, there is still significant room for improvement with more complex models that can cope with the limited amount of data.
The relatively low classification accuracy observed when training and testing across the two sessions, as well as on new participants, provides interesting insight into the transferability of the learned classifier. Further work is warranted to explore improvements in both the inter-participant and inter-session classification scenarios. It would be particularly meaningful to improve inter-session classification performance, which would remove the need for a lengthy calibration step each time the myoelectric interface is used; this is especially important if the myoelectric interface is to be used in any commercial product. Inter-sessional variability of myoelectric classifiers is a well-known issue that is not specific to the present dataset. Palermo et al. found that, across 10 subjects, classification accuracy dropped by an average of 27.03% when moving from intra-session to inter-session training and testing21. There is evidence that if training data is collected over many sessions, the accuracy drop may be reduced22. Training with the limb in many different positions can also improve inter-session accuracy23. Neither of these solutions is ideal, as increasing the number of training sessions reduces the ease of implementation of the system, while multi-positional training may be difficult for someone who is impaired. Recently, feature disentanglement methods have been used to find feature sets that are more robust to cross-session classification24. Most intriguingly, it has been demonstrated that large-scale training on thousands of individuals can also lead to generalizable classifiers25. Further work to optimize algorithms for cross-session EMG may provide sEMG control systems that are more viable in operation for fine-grained tasks.
This dataset provides a comprehensive and diverse set of sEMG recordings for fine-grained classification via the typing problem. A few related datasets have been published. One article contains keypress data from a single individual12; that work included 32 characters in a dataset obtained by transcribing a recording of a conversation. Our dataset improves upon that work by containing data from 19 participants as opposed to one. Another dataset includes sEMG typing data from 37 participants, with a slightly different focus emphasizing password security13. Other datasets include key presses but are more focused on detecting a typing task rather than identifying individual keypresses26.
Despite the increased complexity of our dataset, there are still some limitations, in particular the use of individual keypresses rather than natural typing tasks. The well-controlled movements included in our dataset provide a valuable resource for characterizing the expected performance achievable on this task. Nonetheless, the impact of the variability introduced during more natural typing will be an important avenue for future investigation. We also note the additional movement variability in the spacebar class, discussed above.