September 8, 2025
Project code is on GitHub
I entered a Kaggle competition where participants had to identify body-focused repetitive behaviors (BFRBs). The idea is to make it easier to detect, monitor, and inform treatment of BFRB disorders.
BFRBs are behaviors that are performed repeatedly and cause harm to the body, like hair pulling, skin picking, and teeth grinding. They can constitute a psychiatric disorder on their own, and they also commonly co-occur with anxiety-related conditions like generalized anxiety disorder, obsessive-compulsive disorder, and social anxiety. BFRBs are common: in one study, 12% of a large undergraduate population had a BFRB disorder (it was hard to find comprehensive prevalence studies).1 2
I had a few goals for the project:
Since this is a time series classification problem, I started by reading a few papers on common approaches to applying deep learning to time series. Wang et al.3 trained a multilayer perceptron, a fully convolutional network, and a ResNet (a deeper convolutional net with shortcut connections) on a number of datasets; the convolutional networks performed a bit better than the multilayer perceptron. I also read about LITE4, which extends a basic convolutional network with three techniques: 1) channels with hand-designed, fixed kernels for common features; 2) multiplexing convolutions, that is, learning a few parallel convolution layers with different kernel sizes; and 3) separable convolution layers, which are similar to regular convolutions but computationally cheaper, and aren't used in this project.5
I decided to use a basic convolutional network, since that seems to be the standard approach. And since multiplexing convolutions are used in both LITE and a related model, Inception, I incorporated them as well.
A more complete description of the architecture (along with data preprocessing steps, etc) is available in the documentation on GitHub.
I paid attention to regularization and optimization methods used in the literature, and also heavily consulted the Deep Learning book.
In the literature, batch norm was always used after convolution layers. The Deep Learning book notes (section 11.2) that batch norm can be particularly useful for improving optimization of convolutional networks. It also recommends applying batch norm after linear layers, prior to ReLU. The intuitive reason is that if we instead apply batch norm between a ReLU layer and the following linear layer, the input vector is more non-Gaussian (since it came out of a ReLU) and thus harder to normalize.
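As a concrete illustration of that ordering (a minimal PyTorch sketch with placeholder channel counts, not the exact layers from my model):

```python
import torch.nn as nn

# Batch norm sits between the affine layer and the nonlinearity, so it
# normalizes the pre-activations rather than the rectified outputs of a
# previous ReLU. Channel and unit counts here are placeholders.
conv_block = nn.Sequential(
    nn.Conv1d(in_channels=7, out_channels=32, kernel_size=5, padding=2),
    nn.BatchNorm1d(32),
    nn.ReLU(),
)

dense_block = nn.Sequential(
    nn.Linear(32, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
)
```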
For the optimizer I chose Adam, which combines regular stochastic gradient descent with adaptive learning rates (Deep Learning, section 8.5) and momentum (section 8.3.2). I later switched to AdamW, which also incorporates weight decay regularization, since I had models that were overfitting. In all training runs I used early stopping based on validation score, using the metric provided by the competition.
When using AdamW, it should be possible to tune learning rate and weight decay separately, i.e. to find the optimal combination by searching one axis at a time rather than searching over the entire joint space. However, fixing the learning rate and tuning decay, then doing the reverse, is not optimal in PyTorch. This post by Fabian Schaipp has a great overview, and the takeaway is that in PyTorch you should probably tune the product of learning rate and decay while holding their ratio constant, then tune the ratio using the best product.
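A sketch of that reparametrization, where `adamw_from_product_and_ratio` is a hypothetical helper and the particular values are only for illustration:

```python
import math

import torch
import torch.nn as nn

def adamw_from_product_and_ratio(params, product, ratio):
    """Build AdamW from the reparametrized pair (lr * weight_decay, lr / weight_decay).

    Solving product = lr * wd and ratio = lr / wd gives
    lr = sqrt(product * ratio) and wd = sqrt(product / ratio).
    """
    lr = math.sqrt(product * ratio)
    weight_decay = math.sqrt(product / ratio)
    return torch.optim.AdamW(params, lr=lr, weight_decay=weight_decay)

model = nn.Linear(10, 2)  # stand-in model
# Stage 1: hold the ratio fixed and sweep the product; stage 2: hold the
# best product fixed and sweep the ratio.
optimizer = adamw_from_product_and_ratio(model.parameters(), product=1e-5, ratio=100.0)
```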
After I had designed my very first, basic model, I felt something was wrong with it. The training loss was much higher than I expected, even for a simple model. I had calculated the expected cross-entropy loss from naively predicting equal probability for each class, and my model wasn't performing well relative to that baseline.
My first step was to train it on a dataset with only a single example. Now it was clear something was wrong: on some runs the loss dropped to roughly 0, while on other runs it "got stuck" at a high loss. After visualizing the distribution of gradients and weights for each parameter in these problematic runs, it was clear that the gradients were about 0 and the weights were very small.
This did not make sense. With a single training example, there is only one correct class. If we look at the bias terms of the final output layer, the one corresponding to the correct class should have a nonzero gradient, since increasing that bias clearly decreases loss.
The reason for this behavior was that I had mistakenly added a ReLU layer after the final linear one. In that case the gradient depends on whether the linear neuron's output is positive or not. If it is positive, the ReLU has no effect and the gradient of the bias term is nonzero as expected. But if it is negative, the following ReLU layer clamps it to 0. Small nudges to the bias in either direction then have no impact on the ReLU output, or on the loss, so the gradient is zero. The extra ReLU layer had broken optimization.
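A minimal reproduction of the bug, with a made-up two-class head and input rather than my actual model:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8)   # a single made-up training example
y = torch.tensor([1])   # its (only) correct class

head = nn.Linear(8, 2)
with torch.no_grad():
    head.weight.zero_()
    head.bias.fill_(-1.0)  # force both pre-ReLU logits to be negative

broken = nn.Sequential(head, nn.ReLU())  # the mistaken extra ReLU

loss = nn.functional.cross_entropy(broken(x), y)
loss.backward()

# Because the ReLU clamps the negative logits to 0, nudging the biases has no
# effect on the output or the loss, so their gradients are exactly zero.
print(head.bias.grad)  # tensor([0., 0.])
```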
In early iterations, I tried model specifications roughly like this (again, see the documentation on GitHub as well as the full training code for details):
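For illustration, a simplified stand-in (with placeholder channel counts, kernel sizes, and input shapes, not the exact configuration) looks something like this:

```python
import torch
import torch.nn as nn

class MultiplexedConvNet(nn.Module):
    """Simplified stand-in for the early models: parallel convolutions with
    different kernel sizes over the sensor channels, plus a small dense
    branch for the demographic features."""

    def __init__(self, sensor_channels=7, seq_len=128, demo_features=5, n_classes=18):
        super().__init__()
        # Multiplexed convolutions: the same input goes through a few
        # parallel branches with different kernel sizes.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(sensor_channels, 16, kernel_size=k, padding=k // 2),
                nn.BatchNorm1d(16),
                nn.ReLU(),
            )
            for k in (3, 7, 15)
        ])
        self.demo = nn.Sequential(nn.Linear(demo_features, 8), nn.ReLU())
        # Early versions flattened every time step into the final layer.
        self.head = nn.Linear(3 * 16 * seq_len + 8, n_classes)

    def forward(self, series, demographics):
        # series: (batch, sensor_channels, time); demographics: (batch, demo_features)
        feats = [branch(series).flatten(start_dim=1) for branch in self.branches]
        feats.append(self.demo(demographics))
        return self.head(torch.cat(feats, dim=-1))
```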
I tried tuning the learning rate, as well as the precise number of channels and kernel sizes. The competition metric is the mean of two F1 scores: a binary F1 score for classifying BFRB versus non-BFRB gestures, and a multiclass F1 score for classifying the individual BFRB gestures, with all non-BFRB gestures collapsed into a single class.
Different configurations had a noticeable impact on validation performance (combined F1 ranged from about 0.6 to 0.7 in my first runs) and on training time. But in all cases there was a huge discrepancy between training loss and validation loss, so I tried more aggressive regularization.
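The competition supplies its own implementation of this metric; the following is just a rough sketch of the idea using scikit-learn, where `bfrb_gestures` stands in for the set of target gesture labels:

```python
from sklearn.metrics import f1_score

def combined_f1(y_true, y_pred, bfrb_gestures):
    """Rough sketch of the competition metric: the mean of a binary F1
    (BFRB vs. non-BFRB) and a macro F1 over individual gestures, with
    every non-BFRB gesture collapsed into a single class."""
    binary_true = [int(label in bfrb_gestures) for label in y_true]
    binary_pred = [int(label in bfrb_gestures) for label in y_pred]
    binary = f1_score(binary_true, binary_pred, zero_division=0)

    collapse = lambda label: label if label in bfrb_gestures else "non_target"
    macro = f1_score(
        [collapse(label) for label in y_true],
        [collapse(label) for label in y_pred],
        average="macro",
        zero_division=0,
    )
    return (binary + macro) / 2
```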
The first thing I tried was switching from Adam to AdamW, and adding weight decay. Again, the amount of weight decay had a notable impact on performance and a very dramatic impact on training time, but training loss still dropped much faster than validation loss.
Simplifying the model was much more successful. I simplified by reducing the number of hidden units in the demographics branch of the model, and by adding a global average pooling layer prior to the final fully connected layer. Now the smallest configurations showed roughly matching training and validation loss (for example, training with only 4 demographic hidden units, kernel size 3, and 10 convolution channels). Slightly larger configurations hit a sweet spot where the model reached much better training loss, somewhat better validation loss, and the highest combined F1 test scores.
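To give a sense of why the pooling layer helps so much, here is a comparison of the two heads with placeholder sizes; the parameter count is what shrinks:

```python
import torch.nn as nn

channels, seq_len, n_classes = 48, 128, 18  # placeholder sizes

# Before: flatten every time step into the final linear layer.
flat_head = nn.Sequential(nn.Flatten(), nn.Linear(channels * seq_len, n_classes))

# After: global average pooling over time, so the final layer sees one value per channel.
gap_head = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(channels, n_classes))

print(sum(p.numel() for p in flat_head.parameters()))  # 110,610
print(sum(p.numel() for p in gap_head.parameters()))   # 882
```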
My final competition score is 0.71. This is still far from the 0.82 required for bronze and puts me around the 33rd percentile. That's to be expected: I started with a very basic approach and spent most of my time figuring out the fundamentals. I had to work through some data processing issues, get organized for local development and remote training with different datasets and models, learn how to debug training, speed up slow steps, figure out new APIs, etc. (see appendices). I left a lot of directions unexplored.
The basic tradeoff between deep learning and "classical" machine learning algorithms is that deep learning models are typically flexible enough to train on raw, noisy data, but are harder to optimize and require more data to generalize well. The classical algorithms usually do not fit noisy data as well and require a lot more manual transformation of the data, informed by domain expertise. The training set for this competition contains about 8,000 sequences, and clearly that was enough to train a decent-sized neural net, but my intuition is that data cleaning and transformation could go a long way, and would be more impactful than additional tuning.
The best place to start could be missing values. There is a small amount of missing data in the dataset due to sensor failure. Also, the time-of-flight sensor uses the value -1 to encode "nothing in range," which could make learning more difficult than if that were encoded as a separate channel. In research with wearable sensors there is also a set of commonly applied transformations, such as applying high- or low-pass filters, combining data along the axes into a single magnitude and adding that as an additional feature, accounting for gravity, etc.
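As a sketch of the kind of transformations I have in mind, assuming a pandas DataFrame with accelerometer columns named `acc_x`, `acc_y`, `acc_z` and time-of-flight columns prefixed with `tof_` (the column names are assumptions, not necessarily the competition's exact schema):

```python
import numpy as np
import pandas as pd

def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative preprocessing: an acceleration magnitude feature plus an
    explicit 'nothing in range' mask channel for each time-of-flight column."""
    out = df.copy()

    # Combine the accelerometer axes into a single magnitude channel.
    out["acc_mag"] = np.sqrt(out["acc_x"] ** 2 + out["acc_y"] ** 2 + out["acc_z"] ** 2)

    # Replace the -1 sentinel with NaN and add a separate 0/1 mask channel, so
    # the model doesn't have to learn that -1 means "nothing in range."
    for col in [c for c in out.columns if c.startswith("tof_")]:
        out[f"{col}_out_of_range"] = (out[col] == -1).astype(np.float32)
        out[col] = out[col].replace(-1, np.nan)

    return out
```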
One final reflection: I set out to make decisions in a principled way, as much as possible, rather than throwing everything at the problem and seeing what sticks. This is hard! At any point in a process like this, there are many directions to try next, and it is never certain which one is most important. All I can do, I guess, is read more, design better tests and experiments, and document well.
Build a complete version before improving the parts. My initial attempts at the problem were capable of fully processing, training, and making predictions using subsets of the training dataset, but I waited far too long before making a real submission to the competition. Unfortunately the documentation of the train and test datasets is wrong: it says "train only" next to most fields that are not available during testing, but doesn't specify that for `behavior`. And `behavior` is a useful field because it marks the parts of the time series where the participant is getting into position, preparing to do the gesture, and actually doing the gesture. If I had made a complete submission early on, I would have realized this and made significant changes.
Experiment enough to get a feel for model capacity. The most helpful thing I did was dramatically constrain the model until its generalization error got close to zero. I was surprised at how much I had to reduce model capacity to make this happen, and how relatively ineffective other regularization techniques were. Once I found this boundary, it was possible to tune the model more effectively with mild increases in capacity until it reached a happy medium between high generalization error and high training error. (All this happened a little too late, because at first I couldn't even measure generalization error clearly: my early training runs used a validation set and reported a score, but I used the competition metric for validation and early stopping and did not log the validation loss, so training and validation numbers were on different scales and not directly comparable.)
Pay attention to software design. A machine learning project is a small app, and design and performance make a big difference. Some of the data processing parts of the project were fairly slow and felt clunky to use, and next time I'd spend more time thoughtfully designing them. I did make an intentional choice to write a simple Python library and not a bunch of Kaggle notebooks, and this felt much more efficient. That said, notebooks are perfect when the goal is to document a specific process or run, or write something up, especially when visualization is involved.
A perfect F1 score of... zero? I noticed that when I fit a model to a single training example for debugging, I'd sometimes get a binary F1 score of 0 (binary meaning the score for BFRB versus non-BFRB classification). Remember that the denominator of precision is the number of predicted positives and the denominator of recall is the number of actually positive cases. Hence, if the only example is negative and the model predicts negative, both denominators are zero and the F1 score is undefined. The particular implementation used by the competition imputes zero when the score is undefined, so the zero score is actually expected behavior.
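A quick way to see this behavior with scikit-learn (which may or may not be what the competition code uses under the hood):

```python
from sklearn.metrics import f1_score

# A single example whose true label is non-BFRB (0) and is predicted non-BFRB (0):
# there are no predicted positives and no actual positives, so precision and
# recall are both 0/0; zero_division=0 imputes 0 instead of raising a warning.
print(f1_score([0], [0], zero_division=0))  # 0.0
```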
The missing Kaggle notebook. I used the Kaggle API to manage training on a GPU and submission to the competition. This turned out to be a pretty clunky experience. At one point, I had set the title of my Kaggle dataset and the slug in `dataset-metadata.json`, and the slug didn't follow the usual rules for converting from a title. There was a warning that "unexpected behavior may occur," which turned out to mean the whole dataset was hidden from my Kaggle profile. It would be nicer if the slug wasn't configurable at all, or at least if the CLI refused to upload with an invalid slug, instead of warning something will break before diving right in and breaking it.
Padding math. The PyTorch `padding='same'` parameter does not work in a convolution layer with dilation. Here is the math:
To calculate the output size for arbitrary settings:
$$O = \left\lfloor \frac{I + 2P - D(K-1) - 1}{S} + 1 \right\rfloor$$
where $I$ is the input length, $O$ is the output length, $P$ is the padding, $D$ is the dilation, $K$ is the kernel size, and $S$ is the stride.
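Setting $S = 1$ and requiring $O = I$ recovers the padding needed to reproduce "same" behavior by hand for a dilated convolution:

$$I = I + 2P - D(K-1) \quad\Longrightarrow\quad P = \frac{D(K-1)}{2}$$

For example, a kernel of size 3 with dilation 2 needs padding of 2 on each side; this only comes out to a whole number when $D(K-1)$ is even.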
Memory profiling. At one point I tried a large enough model that my laptop didn't have enough capacity to train it. I ran out of memory, but when I used `tracemalloc` to confirm the problem, it gave me confusing results. The problem is that PyTorch's C++ code does not use `pymalloc` to allocate memory and hence it is not tracked by `tracemalloc`, as described in this StackOverflow answer.
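One workaround (a sketch, assuming the third-party `psutil` package) is to watch the whole process's resident memory rather than Python-level allocations:

```python
import os

import psutil

process = psutil.Process(os.getpid())

def rss_mb() -> float:
    """Resident set size of the whole process in MB, which includes memory
    allocated by PyTorch's C++ code that tracemalloc can't see."""
    return process.memory_info().rss / 1e6

before = rss_mb()
# ... build the model or run a training step here ...
print(f"RSS grew by {rss_mb() - before:.1f} MB")
```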
https://www.sciencedirect.com/science/article/abs/pii/S0165178118308734 - paper on prevalence ↩
https://www.sciencedirect.com/science/article/abs/pii/S0022395624006903 - paper on comorbidities ↩
https://arxiv.org/pdf/1611.06455 - time series class baseline ↩
https://arxiv.org/pdf/2409.02869 - LITETime ↩
The basic idea is to decompose a convolution that has $C_i$ input channels, $C_o$ output channels, and kernel size $K$ into two operations. The first (called a depthwise convolution) has $C_i$ output channels, each tied to exactly one input channel, with kernel size $K$. The second (called a pointwise convolution) has kernel size 1 and $C_o$ output channels. In the paper they do the math to show that this approach requires fewer multiplications and fewer parameters to learn (except when there are only 1 or 2 output channels). ↩
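For reference, the weight counts (ignoring biases) that come out of this decomposition are:

$$\underbrace{K\,C_i\,C_o}_{\text{regular}} \quad \text{vs.} \quad \underbrace{K\,C_i}_{\text{depthwise}} + \underbrace{C_i\,C_o}_{\text{pointwise}}$$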