Lots of people know how to use Hadoop/Spark for generating reports, but lately an increasing number of clients are asking for real-time processing of big data streams (as opposed to merely saving them to S3 or Cassandra for future reports that take hours to generate). They often need to compute aggregated values “on the fly” over a short time window and to filter the stream to reduce the load on subsequent processing stages. It’s common to see projects where the team organizes a “data lake” on Amazon and simply drops all the incoming events into Kafka. Is Spark able to handle all the streams coming from Kafka? If so, at what cost, and what can be used to help it? Don’t expect an introduction to Spark and RDDs or some blah-blah about Big Data. One case, one solution, a bit of theory, editing the configs, writing the code.