ML on a large dataset with Spark MLlib

Day 1 / Track 4 / RU

What a Java programmer needs to understand and be able to do in a typical Big Data + ML project:

  • How to choose features;
  • How to encode features;
  • How to scale features;
  • How to clean data and fill in missing values;
  • How to evaluate the quality of clustering and binary classification;
  • What to do if your classification problem is suddenly not binary;
  • How to do cross-validation.

And all of this in Java + Spark!
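Since the talk is Java-first, the idea behind the scaling step can be sketched in plain Java. This is an illustrative, self-contained sketch of z-score standardization, the same transformation Spark MLlib's StandardScaler applies, but distributed over a DataFrame column; the class name and sample data here are hypothetical:

```java
import java.util.Arrays;

// Plain-Java sketch of z-score feature scaling (what StandardScaler does at scale).
public class ScalingSketch {

    // Standardize a feature column: subtract the mean, divide by the standard deviation.
    static double[] standardize(double[] feature) {
        double mean = Arrays.stream(feature).average().orElse(0.0);
        double variance = Arrays.stream(feature)
                .map(x -> (x - mean) * (x - mean))
                .average().orElse(0.0);
        double std = Math.sqrt(variance);
        double[] scaled = new double[feature.length];
        for (int i = 0; i < feature.length; i++) {
            // Guard against a constant column (zero variance).
            scaled[i] = std == 0.0 ? 0.0 : (feature[i] - mean) / std;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] ages = {20.0, 30.0, 40.0}; // hypothetical raw feature values
        System.out.println(Arrays.toString(standardize(ages)));
    }
}
```

After scaling, the column has zero mean and unit variance, which keeps features with large raw ranges from dominating distance-based algorithms such as k-means.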

Besides that, we'll talk about pitfalls you can run into while using MLlib, implementation details of some popular algorithms, a few jabs at open-source rivals, and the peculiarities of integrating MLlib into existing applications.

Alexey Zinoviev

Like Charon from the Greek myths, Alexey helps people get from one side to the other, the sides in his case being Java and Big Data. In simpler words, he is a trainer at EPAM Systems. He has been working with Hadoop/Spark and other Big Data projects since 2012, forking such projects and sending pull requests since 2014, and giving talks since 2015. His favourite areas are text data and large graphs.