Rishiraj Acharya

Google Summer of Code 2022
Kaggle examples for TensorFlow Decision Forests

Contributor: Rishiraj Acharya, Mentor: Josh Gordon

📝 Project Summary

Libraries like XGBoost and LightGBM dominated most Kaggle competitions on tabular data and achieved state-of-the-art results on a variety of datasets. The TensorFlow Decision Forests was also a very powerful new library with potential to give highly accurate predictions. However due to the lack of proper awareness and availability of detailed examples in the Kaggle community, it was not being used to its true potential. I solved this with a series of notebooks that demonstrated the usage of TensorFlow Decision Forests for a variety of datasets. I created beginner-friendly baseline examples and also dived into fine-tuned advanced examples that achieved some of the best scores in tabular competitions. As additional contribution, I’m working on a project that auto-trains, auto-tunes and auto-serves the best TF-DF model directly from CSV files.

🛠️ Contributions

To demonstrate getting started and automated hyper-parameter tuning in TF-DF I worked on a real Kaggle competition with the Tabular Playground Series - Feb 2021 dataset. It is a tabular dataset with 300,000 rows and 26 columns in training (93.66 MiB .CSV training dataset + 58.85 MiB .CSV test set) that is suitable for training algorithms to solve regression problems. The feature columns, cat0 - cat9 are categorical, and the feature columns cont0 - cont13 are continuous. I chose a regression task as the official tutorial already had examples for classification tasks.

[x] This tutorial will guide you to get started with TF-DF model training, evaluation, exploratory data analysis and submission in Kaggle competitions:
https://www.kaggle.com/code/rishirajacharya/gsoc-2022-tf-df-baseline-eda
[x] This tutorial will guide you to explore the various tunable hyper-parameters in TF-DF and briefly explain how each one affects the model quality:
https://www.kaggle.com/code/rishirajacharya/gsoc-2022-tf-df-exploring-hyper-parameters
[x] This tutorial will guide you to automatically tune hyper-parameters of a GradientBoostedTrees model to perform a regression task using TF-DF Tuner:
https://www.kaggle.com/code/rishirajacharya/gsoc-2022-tf-df-automatic-tuning-eda
[x] This tutorial will guide you to automatically tune hyper-parameters of a GradientBoostedTrees model to perform a regression task using Keras Tuner:
https://www.kaggle.com/code/rishirajacharya/gsoc-2022-tf-df-keras-tuner-eda
[x] These tutorials will guide you to blend the predictions from multiple tuned TF-DF models to achieve higher scores in the leaderboard:
https://www.kaggle.com/code/rishirajacharya/gsoc-2022-tf-df-blending
https://www.kaggle.com/code/rishirajacharya/gsoc-2022-tf-df-tuning-blending
[x] This tutorial will guide you to stack TF-DF models. It is an ensemble ML algorithm that uses meta-learning and is often the final step in tabular competitions:
https://www.kaggle.com/code/rishirajacharya/gsoc-2022-tf-df-stacking
[x] This tutorial will guide you to work with texts natively in TF-DF as well as use pretrained embeddings from TF-Hub to boost scores in Kaggle competitions:
https://www.kaggle.com/code/rishirajacharya/gsoc-2022-tf-df-text-tf-hub-embedding

🔗 Merged PRs:

[x] Fixed bug in dataframe to tf dataset conversion - https://github.com/tensorflow/decision-forests/pull/53
[x] Fix broken links in documentation - https://github.com/tensorflow/decision-forests/pull/85

🚀 Future Contributions

Working on a project that auto-trains TF-DF models with k-fold Cross Validation directly from CSV files, auto-tunes them using Keras Tuner and auto-serves the best model using FastAPI: https://github.com/tensorflow/decision-forests/issues/123

🙏 Acknowledgements

I loved the way GSoC went for me this year all thanks to the helpful, friendly, safe, and smooth environment created for me by my mentor Josh Gordon. I got adequate space to think of my own and implement ideas without being pressurized. At the same time, I was being offered all kinds of guidance and assistance for staying focused on my goals. GSoC has made me realize that this is what makes me happy and it's what I want to do all my life. Hence I'll soon be applying for a job at Google. I would also like to thank Sayak Paul, GDE in ML, for introducing me to the TF-DF library which helped me become one of its earliest adopters.

📚 References

This site is open source. Improve this page.