Makoto Yui

Senior Principal Engineer

Arm Treasure Data

Biography

Makoto YUI is a Senior Principal Engineer at Treasure Data, working on ML-as-a-Service. As the tech lead, he is currently leading the productization of Treasure AutoML. His work has also been productized as the heart of CDP predictive scoring and Content Affinity Engine.

Aside from that, he is leading the development of Apache Hivemall, an open source library for scalable machine learning on Apache Hive and Spark. Hivemall was selected as one of the top picks in big data tools in IDG InfoWorld’s Bossie Awards 2014.

He received a Ph.D. in computer science from Nara Institute of Science and Technology (NAIST) in 2009. He won the best student award among those who graduated from NAIST in 2009. He also won the MITOH Youth Super Creator Award in the government program for young engineers in 2003.

Interests

  • Machine Learning
  • Recommendation Systems
  • Data Engineering
  • Database Systems

Education

  • Ph.D. Computer Science, 2009

    Nara Institute of Science and Technology Graduate School of Information Science (NAIST), Japan.

  • M.E. Computer Science, 2006

    Nara Institute of Science and Technology Graduate School of Information Science (NAIST), Japan.

  • B.E. Computer Science, 2003

    Shibaura Institute of Technology.

Experience

Senior Principal Engineer

Treasure Data

Apr 2023 – Present Tokyo, Japan

Working with multi-regional teams and leading AutoML productization as a technical lead. Designed, built, and operate an auto-scaling container platform for the AutoML service, lead its team, and implemented various machine learning functions running on top of it.

Promoted to Senior Principal Engineer after successfully leading the ML team as Tech Lead in productizing Treasure AutoML.

Principal Engineer

Arm Treasure Data

Aug 2018 – Mar 2023 Tokyo, Japan

Led the development of a large-scale machine learning service (ML-as-a-Service) in the Treasure Data cloud service. My work has been productized as the heart of CDP predictive scoring (predictive models with automated feature engineering) and Content Affinity Engine (data augmentation of user interests using NLP, ontology, and the Wikipedia corpus).

Also, I worked on an Airflow-like workflow management system (open-sourced as Digdag) and on the design and implementation of its next-generation Docker runtime using AWS ECS cluster auto scaling. A number of customers are using it for running and managing machine learning workflows. Some example ML workflows can be seen in this repository.

Apache Hivemall PPMC member

The Apache Software Foundation

Sep 2016 – Sep 2022
Led the development of Apache Hivemall at the Apache Incubator as the original creator (slide).

Research Engineer

Treasure Data (Acquired by Arm)

Apr 2015 – Aug 2018 Tokyo, Japan
  • Worked as the company’s first machine learning engineer. Grew the machine learning team and led machine learning applications at the company. Productized Apache Hivemall in the Treasure Data cloud service. Aside from development, as a startup member I did anything I could to contribute to the company’s growth in its early stage, including ML consulting for customers, presentations in sales meetings, and talks at external events.

  • Internship management: I initiated and organized the Summer Internship program from 2015 to 2018 and mentored a number of students. We implemented various ranking measures and anomaly detection algorithms. Another student worked on Field-aware Factorization Machines and online kernel SVM. We successfully completed the internship program, and 5 master’s students joined Treasure Data in the past 3 years. These accomplishments are what I’m proud of.

  • Consulting: Aside from development tasks, I consulted on 20 to 30 customer-facing machine learning projects, from Fortune 500 companies to foreign startups (Indonesia, Taipei, Israel). The industry segments I consulted for include Telecom, Insurance, Automobile, Ad-tech, EC, Media Agency, Internet-related Service, Real-estate, Gaming, and Online Publishers. My consulting work for Subaru was featured in the CEO’s talk at SoftBank World. This consulting gave me precious and unique experience applying ML to diverse domain problems, such as dealing with overfitting (data leakage), feedback loops, and pitfalls of optimizing for evaluation measures.

  • External talks: I gave talks at various conferences (such as ApacheCon, Hadoop Conference, and the annual event of the Japan DataScientist Society) and gave demos at research conferences (RecSys'18).

Visiting Researcher

The University of Edinburgh

Sep 2011 – Nov 2011 Edinburgh, UK
As a visiting researcher from AIST, I worked with Paolo Basala and Prof. Malcolm Atkinson in the Data Intensive Research (DIR) group. Designed a distributed streaming data processing system for scientific workflows on the EDIM1 data-intensive machine (an energy-efficient PC cluster).

Senior Researcher

National Institute of Advanced Industrial Science and Technology (AIST)

Apr 2010 – Mar 2015 Tsukuba, Japan
Worked on distributed and parallel data processing and large-scale machine learning in the data science research group. Designed and managed a 50-node Hadoop cluster for managing scientific workflows. Promoted to Senior Researcher in my third year at AIST.

Visiting Postdoc

Centrum Wiskunde & Informatica (CWI)

Oct 2009 – Mar 2010 Amsterdam, Netherlands
Worked with Peter Boncz and Prof. Martin Kersten in the INS1 database research group. Designed and implemented a parallel database system on top of shared-nothing MonetDB servers.

Visiting Researcher

Waseda University

Apr 2009 – Mar 2010 Tokyo, Japan
While holding my JSPS fellowship, I worked with Prof. Yamana.

System Engineer

NEC Informatec Systems, Ltd.

Apr 2004 – Mar 2006 Tokyo, Japan
Designed and implemented RM4GS (Reliable Messaging for Grid Services) as a reference implementation of OASIS WSRM which provides reliable messaging facilities for Web Services.

Skills

Machine Learning

Implemented various Machine Learning algorithms

Java

Fluent; main programming language. 10+ years of experience

Python

Tool for Scripting

Linux

Long-term user. Uses Ubuntu/Debian/Red Hat/Amazon Linux

AWS

Experienced in AWS tech stacks

Used ECS/Fargate in production. Experienced in container security issues.

Databases

Experienced in Postgres/MySQL. Ph.D in Database Systems :-)

Hadoop

Hadoop/Spark/Hive master

Terraform

Tool for DevOps

Accomplishments

Bossie Awards 2014: The best open source big data tools

Hivemall was selected as one of InfoWorld’s top picks in big data tools.

IPSJ Yamashita SIG Research Award 2009

Awarded for a research paper on non-blocking database buffer management.

JSPS Research Fellow (PD)

Received a government-sponsored Postdoctoral Fellowship.

Best Student Award

I was selected as the best student among those who graduated from NAIST in 2009.

JSPS Research Fellow (DC2)

Received a government-sponsored Fellowship for Ph.D. students.

MITOH Youth Super Creator Award

Won the Super Creator Award in the government program for young engineers. The MITOH Program aims to discover and develop outstanding human resources called Super Creators. Specifically, these are persons possessing creative ideas and skills for achieving software innovation and who can put these ideas and skills to use. Super Creators discovered through this program implemented by IPA are expected to play active roles as world-class IT human resources that help support Japan’s IT industry during the next generation.

IPSJ Outstanding Paper Award 2003

Our paper on an XML database system built on top of PostgreSQL was selected as an outstanding paper.

Projects

BTree4j

Btree4j is a from-scratch implementation of a disk-based Prefix B+-tree written in pure Java. It’s pretty fast, and 100k ops/sec is expected even on a laptop.

Apache Hivemall (Incubating)

Apache Hivemall is a scalable machine learning library that runs on Apache Hive and Spark. Apache Hivemall offers a variety of functionalities including regression, classification, recommendation, anomaly detection, k-nearest neighbor, and feature engineering. Won IDG’s InfoWorld Bossie Awards 2014 for the best open source big data tools.

XBird

XBird is a light-weight XQuery processor and database system written in Java. Light-weight here means reasonably fast and embeddable. Implemented XQuery, a fully functional programming language for XML, using JavaCC. It passes about 91% of the minimal conformance tests of the XQuery Test Suite.

XpSQL

Developed a multi-functional XML database environment using PostgreSQL as a graduate student project. Extended PostgreSQL functionality using the Server Programming Interface. Received the Super Creator Award from IPA’s MITOH Youth program, a government program to educate young engineers.

Talks

Fireside chat at Indeed

Gave an invited talk about my OSS experience.

ApacheCon North America 2018

Gave a talk titled Introduction to Apache Hivemall v0.5.0: Machine Learning on Hive/Spark. Joint talk with Takeshi Yamamuro from NTT.

Recommendation 101 using Hivemall

Gave a talk titled Recommendation 101 using Hivemall for Treasure Data customers.