
2016-06-04

Strata + Hadoop World 2016, London

Some notes about the workshops and talks I attended.
  • Hadoop application architectures: Fraud detection by Jonathan Seidman (Cloudera), Mark Grover (Cloudera), Gwen Shapira (Confluent) & Ted Malaska (Cloudera): The use case for this tutorial was network anomaly detection based on NetFlow data. The system aimed for a real-time dashboard, a near-real-time component updating profiles and a long-term storage layer enabling analytics. They went through the system architecture and discussed several technical options for each component. Based on their assessments concerning maturity, production readiness, community support, the number of committers and proven commercial success, the final set of products was Kafka, Flume, Spark Streaming (the recently released Kafka Streams was also mentioned), HDFS, HBase (or Kudu), MapReduce, Impala and Spark ... at some point one has to cut the rope and make a decision, as new products are popping up nearly every day. The demo application and the slides are publicly available.
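    A minimal sketch of what the Spark Streaming leg of such a pipeline could look like, assuming a Kafka topic "netflows", a local broker and a comma-separated NetFlow line format (all hypothetical):

      # Count flows per source IP in 5-second micro-batches from Kafka.
      from pyspark import SparkContext
      from pyspark.streaming import StreamingContext
      from pyspark.streaming.kafka import KafkaUtils

      sc = SparkContext(appName="FraudDetection")
      ssc = StreamingContext(sc, 5)  # 5-second batches

      # Direct (receiver-less) Kafka stream, available since Spark 1.3.
      flows = KafkaUtils.createDirectStream(
          ssc, ["netflows"], {"metadata.broker.list": "localhost:9092"})

      counts = (flows.map(lambda kv: kv[1])                      # drop the Kafka key
                     .map(lambda line: (line.split(",")[0], 1))  # key by source IP
                     .reduceByKey(lambda a, b: a + b))
      counts.pprint()  # a real system would update profiles in HBase instead

      ssc.start()
      ssc.awaitTermination()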
  • An Introduction to time series with Team Apache by Patrick McFadin (DataStax): This tutorial mainly dealt with Cassandra and introduced features like consistent hashing, replication, multi-datacenter support and its write path (1. write to the commit log and the memtable; 2. flush the memtable to an SSTable on disk; 3. perform compaction). The sample application uses Kafka for collecting data, Spark for processing and Cassandra as the storage layer, utilizing the Spark Cassandra Connector. These slides seem to be a slightly older version, but the main content matches.
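    A minimal sketch of querying a Cassandra time-series table from Spark via the connector (keyspace, table and column names are hypothetical; the connector jar must be on the classpath, e.g. via --packages):

      from pyspark import SparkContext
      from pyspark.sql import SQLContext

      sc = SparkContext(appName="TimeSeries")
      sqlContext = SQLContext(sc)

      # The connector exposes Cassandra tables as Spark DataFrames.
      readings = (sqlContext.read
                  .format("org.apache.spark.sql.cassandra")
                  .options(keyspace="sensors", table="readings")
                  .load())

      # Average value per sensor across the whole table.
      readings.groupBy("sensor_id").avg("value").show()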
  • Hadoop's storage gap: Resolving transactional access/analytic performance trade-offs with Apache Kudu (incubating) by Todd Lipcon (Cloudera): A wrap-up of the current status of Kudu. These slides seem to be an older version, but the benchmark results shown there are still valid.
  • TensorFlow: Machine learning for everyone by Sherry Moore (Google): There has been quite a hype around TensorFlow since its initial release a couple of months ago. This talk introduced the motivation and the basic concepts plus the ease of use, including some small code examples.
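    The basic concept is a symbolic dataflow graph that is built first and then executed in a session; a minimal sketch in the graph-era Python API:

      import tensorflow as tf

      x = tf.placeholder(tf.float32, shape=[None])  # input fed at run time
      w = tf.Variable(2.0)
      b = tf.Variable(1.0)
      y = w * x + b  # a node in the graph, not a computed value yet

      with tf.Session() as sess:
          sess.run(tf.global_variables_initializer())
          print(sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]}))  # [3. 5. 7.]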
  • Why is my Hadoop job slow? by Bikas Saha (Hortonworks Inc): The talk started with the current monitoring capabilities of Ambari plus the Grafana integration for building custom metrics dashboards. A lot of Hadoop services provide audit logs. The YARN timeline server and Ambari Log Search (based on Solr) help with event correlation. He closed with tracing and analysis, showing Zeppelin for ad-hoc analysis.
  • Anomaly detection in telecom with Spark by Ted Dunning (MapR Technologies): This was about establishing a session between a caller and a telecom tower and how to deal with problems like signal interference and differing data distribution patterns. The demo application is publicly available. These slides are from a colleague of Ted's, but they cover the same topics.
  • Beyond shuffling: Tips and tricks for scaling Spark jobs by Holden Karau (IBM): She covered several topics like reusing RDDs, avoiding groupByKey() and testing Spark applications. Two of her GitHub repos are of particular interest: one with code examples from a book of hers, and one for enabling the implementation of tests.
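    A small sketch of two of those tips (the sample data is made up):

      from pyspark import SparkContext, StorageLevel

      sc = SparkContext(appName="ShufflingTips")
      pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

      # Reuse: persist an RDD that is referenced by more than one action,
      # otherwise each action recomputes the whole lineage.
      pairs.persist(StorageLevel.MEMORY_ONLY)

      # Prefer reduceByKey over groupByKey: it combines values on the map
      # side, so far less data crosses the network during the shuffle.
      counts = pairs.reduceByKey(lambda a, b: a + b)
      # pairs.groupByKey().mapValues(sum) yields the same result at a
      # much higher shuffle cost.
      print(counts.collect())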
  • Floating elephants: Developing data wrangling systems on Docker by Chad Metcalf (Docker) & Seshadri Mahalingam (Trifacta): The aim was to provide "Hadoop as a service" for developers, i.e. scalability and performance are currently not an explicit goal, but interface completeness and support for several Hadoop distributions are. Therefore, they bundle all jars of a particular Hadoop distribution version into a single "bundle jar" and use Docker as a runtime environment. The next step was to split the different Hadoop services into separate containers and use Docker Networking to tie them together. In the future, they want to scale the Hadoop services and scale out across several Docker hosts. The current state is available here.
  • Simplifying Hadoop with RecordService, a secure and unified data access path for compute frameworks by Alex Leblang (Cloudera): An overview of RecordService, which provides a unified, fine-grained security service for Hadoop. This video is a recording from an earlier Strata conference.
  • Reactive Streams: Linking reactive applications to Spark Streaming by Luc Bourlier (Lightbend): Live demo of a random number generator with a configurable number of values generated per second, running on several Raspberry Pis, with Spark Streaming calculating some statistics. He showed impressive graphs concerning the handling of back pressure when too many numbers are generated per second - Spark 1.5 acts more intelligently than Spark 1.4 here. The key facts also appear in these slides and in this video from another conference.
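    In Spark 1.5 the back-pressure handling is enabled via configuration; a minimal sketch (the rate cap value is arbitrary):

      from pyspark import SparkConf, SparkContext
      from pyspark.streaming import StreamingContext

      conf = (SparkConf().setAppName("BackPressureDemo")
              # Let Spark 1.5+ adapt the ingestion rate to the processing speed.
              .set("spark.streaming.backpressure.enabled", "true")
              # Optional safety net: cap per receiver, in records per second.
              .set("spark.streaming.receiver.maxRate", "10000"))

      sc = SparkContext(conf=conf)
      ssc = StreamingContext(sc, 1)  # 1-second batches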
  • Applications of natural language understanding: Tools and technologies by Alyona Medelyan (Entopix): Natural language understanding (NLU) - as opposed to natural language processing (NLP) - tries to analyze the context, e.g. to discover relations within text or to perform sentiment analysis. There are a lot of open source libraries for machine learning, deep learning and NLU to choose from - the priority should be to decide on the algorithm rather than on a particular library, but domain knowledge is more important than the algorithm. Commercial libraries and APIs often do not provide added value, but they offer support and possibly a better solution for a specific business area like healthcare.
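    For illustration, a tiny sentiment analysis example with one of those open source libraries (NLTK's VADER analyzer; the sentence is made up):

      # Requires a one-time nltk.download('vader_lexicon') beforehand.
      from nltk.sentiment.vader import SentimentIntensityAnalyzer

      analyzer = SentimentIntensityAnalyzer()
      scores = analyzer.polarity_scores("Great support, but the API is awful.")
      print(scores)  # e.g. {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}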
  • Insight at the speed of thought: Visualizing and exploring data at scale by Nicholas Turner (Incited): They started a project to get rid of static reports that were generated overnight and covered only 3% of the data. They wanted immediate availability, 100% data coverage and self-service (no static pre-generated reports). They now have a Cloudera Hadoop cluster using Flume for data ingestion, streaming the current day into Avro files and running batch jobs overnight that move the data from the Avro files into Parquet files partitioned by day. For data wrangling and preparation they use Trifacta, including anomaly detection, cleansing and joining of data. Visualization and analytics are provided by an in-house developed search tool and by Zoomdata based on Impala queries. They use data science for fraud detection (using a Solr index updated in real time), leading to machine learning to improve fraud detection and to validate the fraud decisions. The case study can be found here.
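    A minimal sketch of such a nightly Avro-to-Parquet job, assuming the spark-avro package, hypothetical HDFS paths and a day column in the data:

      from pyspark import SparkContext
      from pyspark.sql import SQLContext

      sc = SparkContext(appName="AvroToParquet")
      sqlContext = SQLContext(sc)

      events = (sqlContext.read
                .format("com.databricks.spark.avro")
                .load("hdfs:///staging/events/current"))

      # Rewrite the day's events as Parquet, partitioned by day, so that
      # Impala can prune partitions at query time.
      (events.write
             .partitionBy("day")
             .mode("append")
             .parquet("hdfs:///warehouse/events"))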

2015-09-18

Distributed Matters 2015

Distributed Matters 2015 in Berlin - Friday was workshop day. I chose to attend Data applications on Hadoop, live from the trenches, held by Arnaud Cogoluègnes, and A Hands-on Introduction to Ansible by Victor Volle.
Arnaud's workshop was well prepared, supplying a VM with everything needed. It was a beginner-level workshop with a good amount of exercises using Java. Unfortunately, we ran out of time to do all exercises. Arnaud covered the following topics:
Victor's workshop consisted of an introduction to Ansible, accompanied by exercises performing installation and configuration tasks. Again, we ran out of time to complete all exercises. He covered the following topics:
  • Motivation for automation and configuration management
  • Ansible core concepts (incl. the very good documentation)
  • Idempotency and target state detection
  • Playbooks with some modules by example
  • Modules and reuse of code
  • Ansible's still-limited multi-stage/multi-environment support when you want to group nodes along several dimensions (e.g. by role, such as web server, and by stage, such as test environment)
Saturday was conference day with a bunch of talks in two parallel tracks. Here are my notes from the talks I've attended:
  • The keynote was held by Kyle Kingsbury about Jepsen, a tool to simulate network partitions for different distributed databases, schedulers etc. Network partitions are a kind of worst-case scenario in distributed systems, and Kyle went through some NoSQL database products, how they cope with network partitions and whether they keep their promise of data consistency.
  • Clojure at Braintree: Real-time Data Pipeline with Kafka by Joe Nash: Braintree is a payment processor using real-time data pipelines. Their data warehouse is Amazon Redshift, and Joe described their journey to introduce Kafka and Clojure for feeding data into Elasticsearch for transaction search. Kafka features they found useful (see the sketch after the list):
    - Implements the pub/sub pattern.
    - Topics are spread across partitions.
    - The retention period is configurable.
    - The message stream is strongly ordered.
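    A minimal pub/sub sketch with the kafka-python client (topic name, broker address and message content are hypothetical):

      from kafka import KafkaConsumer, KafkaProducer

      producer = KafkaProducer(bootstrap_servers="localhost:9092")
      producer.send("transactions", b'{"id": 1, "amount": 42.0}')
      producer.flush()

      # Messages within a partition arrive strongly ordered; retention is
      # handled by the broker according to the configured period.
      consumer = KafkaConsumer("transactions",
                               bootstrap_servers="localhost:9092",
                               auto_offset_reset="earliest")
      for message in consumer:
          print(message.partition, message.offset, message.value)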
  • Upgrade your database: without losing your data, your perf or your mind by Charity Majors: What can go wrong? A lot! Data loss, queries slowing down, broken replication, ... Risk assessment is important, e.g. concerning the type of data you risk losing or the existence of a rollback path. Charity presented why MongoDB has the highest upgrade risk in the context she works in - mainly because of the type of data they store in it and the fact that MongoDB is quite young and immature, with a lot of changes between releases. Some learnings she presented:
    - One change at a time.
    - In an ideal world, run tests with production traffic and production data for each change. They capture and replay each API request so that it hits both the production database and a shadow database (which might already be upgraded), and compare the results (see the sketch after this list).
    - Don't trust vendors, run your own benchmarks.
    - Know your workloads.
    - Measure and look for outliers, e.g. by tracking the 99th percentile (p99).
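    A sketch of the capture-and-replay idea; the two query functions are hypothetical stand-ins for real database clients:

      def query_production(request):
          return {"status": "ok", "rows": 3}  # stand-in for the real call

      def query_shadow(request):
          return {"status": "ok", "rows": 3}  # stand-in for the real call

      def replay(request):
          # Hit both databases with the same captured request and log any
          # divergence; clients only ever see production's answer.
          prod = query_production(request)
          shadow = query_shadow(request)
          if prod != shadow:
              print("divergence for %r: %r vs %r" % (request, prod, shadow))
          return prod

      replay({"op": "find", "collection": "payments"})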
  • Stream-based text analytics with Spark and Elasticsearch by Stefan Siprell and Hendrik Saly: They explained CRISP (Cross Industry Standard Process for Data Mining) using different domains. Tools mentioned were R (with some small code examples concerning data cleansing and clustering), Spark (e.g. for stream processing of tweets and pushing data to Elasticsearch, incl. some code examples) and Elasticsearch (for interactive data exploration using Kibana). Some machine learning fundamentals, like supervised learning, e.g. with support vector machines, were covered.
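    A minimal sketch of the push-to-Elasticsearch leg using the official Python client (index name, host and the sample tweet are hypothetical):

      from elasticsearch import Elasticsearch

      es = Elasticsearch(["localhost:9200"])
      tweets = [{"user": "alice", "text": "spark and elasticsearch play nicely"}]

      # Index each tweet; Kibana can then explore the "tweets" index.
      for i, tweet in enumerate(tweets):
          es.index(index="tweets", doc_type="tweet", id=i, body=tweet)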
  • Running database containers using Marathon and Flocker by Kai Davenport: He talked about scheduling containers across many hosts. Mesos manages resources in a cluster; it uses a master/slave architecture and stores its state in ZooKeeper. Marathon is a framework on top of Mesos and manages containers in a cluster. Flocker orchestrates storage across a cluster of containers and provides a REST API; it requires an agent on each host and relies on a central control service. Kai closed his talk with a storage failover live demo.
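    A minimal sketch of launching a database container through Marathon's REST API (host and app definition are hypothetical):

      import json
      import requests

      app = {
          "id": "/redis",
          "cpus": 0.5,
          "mem": 256,
          "instances": 1,
          "container": {
              "type": "DOCKER",
              "docker": {"image": "redis:3.0", "network": "BRIDGE"},
          },
      }

      # POST an app definition; Marathon schedules it on the Mesos cluster.
      resp = requests.post("http://marathon.example.com:8080/v2/apps",
                           data=json.dumps(app),
                           headers={"Content-Type": "application/json"})
      print(resp.status_code)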
  • Containers! Containers! Containers! And Now? by Michael Hausenblas: No slides! Just a terminal session and a web browser. Michael led us through the motivation for using containers and a cluster management layer for containers - Marathon again. He went through this blog post in live-demo fashion.
  • Microservices with Netflix OSS and Spring Cloud by Arnaud Cogoluègnes: Spring Cloud wraps the tools from Netflix OSS to enable them for Spring projects. Arnaud showed some Java code examples to demonstrate how easily this can be used.
  • Conflict Resolution with Guns by Mark Nadal: This talk was about problems of distributed systems like split brain, building a quorum and the CAP theorem. Gun DB, a distributed cache, tries to mitigate those problems by using local caching and conflict resolution algorithms. It challenges the need for strongly consistent systems.
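    As a generic illustration of timestamp-based conflict resolution (a simplified last-write-wins merge, not Gun DB's actual algorithm):

      def merge(local, remote):
          """Merge two replicas, each a dict of key -> (timestamp, value)."""
          merged = dict(local)
          for key, (ts, value) in remote.items():
              # Keep whichever write carries the newer timestamp.
              if key not in merged or ts > merged[key][0]:
                  merged[key] = (ts, value)
          return merged

      a = {"name": (1, "Alice"), "city": (5, "Berlin")}
      b = {"name": (3, "Alicia")}
      print(merge(a, b))  # {'name': (3, 'Alicia'), 'city': (5, 'Berlin')}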
  • Disque: a detailed overview of the distributed implementation by Salvatore Sanfilippo, the guy behind Redis: This final talk was about Disque, a distributed message queue. Disque has been forked from Redis and uses the same protocol. Salvatore provided some technical insights into the API and details like ID generation, delivery guarantees, replication schemes and safe restarts (persisting the messages before shutdown and loading them again after start). He concluded with a short live demo.
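    Since Disque speaks the Redis protocol, a plain Redis client can issue its commands; a minimal enqueue/dequeue sketch (7711 is Disque's default port; queue name and payload are made up):

      import redis

      disque = redis.StrictRedis(host="localhost", port=7711)

      # ADDJOB <queue> <body> <ms-timeout>: returns the generated job ID.
      job_id = disque.execute_command("ADDJOB", "myqueue", "hello", 0)
      print(job_id)

      # GETJOB FROM <queue>: blocks until a job is delivered.
      for queue, jid, body in disque.execute_command("GETJOB", "FROM", "myqueue"):
          disque.execute_command("ACKJOB", jid)  # acknowledge the delivery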