
Meet the MachineHack champions who fell in love with the “Predict The Movie Genre” hackathon

MachineHack has successfully concluded its “Predict The Movie Genre” hackathon. With over 600 registrations and 60 active participants, we present the top two competitors and the approaches that helped them solve the problem.

# 1: Amul Patil

Amul started his career as a financial analyst, where he was introduced to some basic inferential statistics used in business setups. Intrigued by the potential of data and its impact on decision making, he chose to follow data science as a career path. He has worked for several organizations, developing his skills throughout his career.


“MachineHack is a great platform for learners: it keeps publishing new case studies and lets participants compete while learning along the way,” he said of MachineHack.

Approach to solving the problem

Amul explains his approach as follows:

  1. Data cleansing and creation of an initial baseline model using logistic regression
  2. Hyperparameter tuning of different algorithms and use of the TF-IDF Vectorizer
  3. An ensemble of three components (Naive Bayes, Logistic Regression and SVD) with XGBoost as the final stage for the submitted probabilities:
    1. The arithmetic mean of the Naive Bayes probabilities (over the cross-validation models)
    2. The arithmetic mean of the Logistic Regression probabilities (over the cross-validation models)
    3. Features created from SVD components
    4. An XGBoost meta-learner applied to the three outputs above
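The stacking setup above can be sketched as follows. This is a minimal illustration on synthetic genre data, not Amul's actual code: the corpus and vocabulary are invented, and sklearn's GradientBoostingClassifier stands in for XGBoost to keep the sketch dependency-light.

```python
# Sketch: TF-IDF features, out-of-fold probabilities from Naive Bayes and
# Logistic Regression, SVD components, and a boosted meta-learner on top.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
vocab = {0: ["love", "heart", "kiss"], 1: ["gun", "chase", "explosion"]}
texts, labels = [], []
for genre, words in vocab.items():          # two toy "genres"
    for _ in range(30):
        texts.append(" ".join(rng.choice(words, size=8)))
        labels.append(genre)
y = np.array(labels)

X = TfidfVectorizer().fit_transform(texts)

# Out-of-fold class probabilities from the two base models.
nb_probs = cross_val_predict(MultinomialNB(), X, y, cv=5, method="predict_proba")
lr_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

# Low-rank SVD components of the TF-IDF matrix as extra features.
svd_feats = TruncatedSVD(n_components=5, random_state=0).fit_transform(X)

# Meta-learner on the stacked outputs (GradientBoosting as an XGBoost stand-in).
meta_X = np.hstack([nb_probs, lr_probs, svd_feats])
meta = GradientBoostingClassifier(random_state=0).fit(meta_X, y)
print(meta.score(meta_X, y))
```

The key point is that the base-model probabilities are produced out-of-fold, so the meta-learner never sees predictions made on data the base models were trained on.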

I tried some of the latest techniques, such as ULMFiT (semi-supervised learning) and text synthesis followed by BERT. Both approaches took a long time to train on the available resources (e.g., Colab) and seemed overkill for the task at hand, so I stuck with the setup above, which resulted in a better score on the private leaderboard.

Get the full code here.

# 2: Sairam Chitreddy

Sairam obtained his B.Tech in Engineering Physics from IIT Delhi and joined FIITJEE in 2015 as Physics faculty for IIT JEE. His passion for learning and problem solving led him to machine learning, and, without any formal coding training, he enrolled in various MOOCs to familiarize himself with this demanding field. He uses his free time to explore deep learning and natural language processing, spends most of it building projects and participating in hackathons, and is actively seeking an entry into the field of data science.

“It was my second competition on MachineHack, after the flight ticket price prediction one. Although I did not do well in that competition, reading the winner’s solution afterwards improved my perspective on how to approach the problem, and I also learned some interesting techniques. During this hackathon, not only my application skills but also my understanding of the underlying concepts improved. I also felt that the learning does not stop once the competition is over, because I realized there were still many other ideas to try, implement and improve performance with. Another aspect of MachineHack that I like is the variety of competitions.” – Sairam shared his opinion on MachineHack.

Approach to solving the problem

He briefly explains his approach as follows.

I used Google Colab to work with GPUs. The major problem I encountered was “CUDA: out of memory” errors. Here are the steps I followed to resolve this issue:

1) Processing one script at a time instead of a batch of scripts

2) Using distilled models

3) Removing unused variables at each step

4) Switching off gradient calculations (even in evaluation mode, the model was still tracking gradients, so torch.no_grad() should be used)
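The memory-saving pattern described in the steps above can be sketched in PyTorch as follows. This is a minimal illustration, not Sairam's code: a tiny linear layer stands in for a distilled transformer, and random tensors stand in for tokenized scripts.

```python
# Sketch: one script at a time, gradients switched off with torch.no_grad(),
# and intermediate references dropped so memory can be reclaimed.
import torch

model = torch.nn.Linear(16, 4)
model.eval()  # eval mode alone does NOT disable gradient tracking

scripts = [torch.randn(10, 16) for _ in range(3)]  # stand-in token embeddings

encodings = []
with torch.no_grad():                     # no autograd graph is built
    for script in scripts:                # one script at a time, not a batch
        out = model(script).mean(dim=0)   # mean-pool over the sequence
        encodings.append(out)
        del out                           # drop the reference at each step
# torch.cuda.empty_cache() would additionally release cached GPU memory

print(len(encodings), encodings[0].shape)
```

Because inference runs inside `torch.no_grad()`, the resulting encodings carry no autograd history, which is where most of the memory would otherwise go.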

When encoding, I used sequences of the maximum possible length instead of individual sentences, since one of the strengths of transformer models is the self-attention mechanism: longer sequences give better contextual word embeddings.

Then, for each script, I took the average of the sequence encodings and applied a Logistic Regression to them to establish a baseline, which ultimately turned out to be the winning solution.
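That baseline can be sketched as below. Random class-shifted vectors stand in for the transformer sequence encodings, and the genre labels are synthetic; only the mean-pool-then-Logistic-Regression shape of the pipeline reflects the approach described above.

```python
# Sketch: average each script's sequence encodings into one vector per
# script, then fit a Logistic Regression on the pooled vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_scripts, dim = 40, 32
genres = rng.integers(0, 2, size=n_scripts)            # synthetic labels
pooled = np.stack([
    # each script: a variable number of encodings, shifted by its genre
    (rng.normal(size=(rng.integers(5, 15), dim)) + g).mean(axis=0)
    for g in genres
])

clf = LogisticRegression(max_iter=1000).fit(pooled, genres)
print(clf.score(pooled, genres))
```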


My leaderboard score was much worse than my validation scores. After referring to some Kaggle discussions, I realized that stratified K-fold cross-validation gives a better estimate of the model’s performance. And indeed, my final score was almost within one standard deviation of my 5-fold CV score.
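Stratified K-fold validation as mentioned above can be sketched with sklearn. The data here is synthetic and deliberately imbalanced; the point is that each fold preserves the class proportions, so the fold scores (and their standard deviation) give a steadier performance estimate.

```python
# Sketch: stratified 5-fold CV on imbalanced synthetic labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
y = np.array([0] * 80 + [1] * 20)            # imbalanced labels
X = rng.normal(size=(100, 8)) + y[:, None]   # class-shifted features

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)
print(scores.mean(), scores.std())  # mean ± std across folds
```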

Apart from the above approach, I tried treating each individual encoding as a separate example, with its script’s genre as the label, and fitting a neural network on that, but this failed due to too much noise.

Other ideas to explore that I haven’t been able to try:

1) Use more powerful models instead of a simple logistic regression

2) Address the class imbalance with oversampling (failing to do so could be the reason for the poor leaderboard scores)
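The oversampling idea above can be sketched with sklearn's `resample`; imblearn's `RandomOverSampler` is the usual shortcut. Data and class counts here are synthetic.

```python
# Sketch: upsample the rare genre (with replacement) to match the majority.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)  # rare genre

X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, n_samples=90, random_state=0)

X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_bal))  # balanced class counts
```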

My most important takeaway from this competition: trust your K-fold cross-validation, as the split used to calculate public leaderboard scores may be skewed.

Get the full code here.

Check out the new hackathons here.
