
2016-06-04

Strata + Hadoop World 2016, London

Some notes about the workshops and talks I attended.
  • Hadoop application architectures: Fraud detection by Jonathan Seidman (Cloudera), Mark Grover (Cloudera), Gwen Shapira (Confluent) & Ted Malaska (Cloudera): The use case for this tutorial was network anomaly detection based on NetFlow data. The system aimed for a real-time dashboard, a near-real-time component updating profiles and a long-term storage layer enabling analytics. They went through the system architecture and discussed several technical options for each component. Based on their assessments of maturity, production readiness, community support, the number of committers and proven commercial success, the final set of products was Kafka, Flume, Spark Streaming (the recently released Kafka Streams was also mentioned), HDFS, HBase (or Kudu), MapReduce, Impala and Spark ... at some point one has to cut the rope and make a decision, as new products are popping up nearly every day. The demo application and the slides are publicly available; a minimal Spark Streaming sketch of the near-real-time part follows this list.
  • An Introduction to time series with Team Apache by Patrick McFadin (DataStax): This tutorial mainly dealt with Cassandra and introduced features like consistent hashing, replication, multi-datacenter support and the write path (1. write to commit log and memtable; 2. flush the memtable to an SSTable on disk; 3. compact SSTables in the background). The sample application uses Kafka for collecting data, Spark for processing and Cassandra as storage by utilizing the Spark Cassandra Connector (a small sketch follows this list). These slides seem to be a slightly older version, but the main content matches.
  • Hadoop's storage gap: Resolving transactional access/analytic performance trade-offs with Apache Kudu (incubating) by Todd Lipcon (Cloudera): A wrap-up of the current status of Kudu. These slides seem to be an older version, but the benchmark results shown there are still valid.
  • TensorFlow: Machine learning for everyone by Sherry Moore (Google): There has been quite a hype around TensorFlow since its initial release a couple of months ago. This talk introduced the motivation and the basic concepts and demonstrated the ease of use with some small code examples.
  • Why is my Hadoop job slow? by Bikas Saha (Hortonworks Inc): The talk started with the current monitoring capabilities of Ambari plus the Grafana integration for building custom metrics dashboards. A lot of Hadoop services provide audit logs, and the YARN timeline server and Ambari Log Search (based on Solr) help with event correlation. He closed with tracing and analysis, showing Zeppelin for ad-hoc analysis.
  • Anomaly detection in telecom with Spark by Ted Dunning (MapR Technologies): This was about establishing a session between a caller and a telecom tower and how to deal with problems like signal interference and different data distribution patterns. The demo application is publicly available. These slides are from a colleague of Ted, but cover the same topics.
  • Beyond shuffling: Tips and tricks for scaling Spark jobs by Holden Karau (IBM): She covered several topics like reusing RDDs, avoiding groupByKey() (see the short example after this list) and testing Spark applications. Two of her GitHub repos are of particular interest: one with code examples from her book, and one that helps with writing tests for Spark applications.
  • Floating elephants: Developing data wrangling systems on Docker by Chad Metcalf (Docker) & Seshadri Mahalingam (Trifacta): The aim was to provide "Hadoop as a service" for developers, i.e. scalability and performance are currently not an explicit goal, but interface completeness and support for several Hadoop distributions are. Therefore, they bundle all jars of a particular Hadoop distribution version into a single "bundle jar" and use Docker as the runtime environment. The next step was to split the different Hadoop services into separate containers and use Docker Networking to tie them together. In the future, they want to scale the Hadoop services and scale out across several Docker hosts. The current state is available here.
  • Simplifying Hadoop with RecordService, a secure and unified data access path for compute frameworks by Alex Leblang (Cloudera): Overview of RecordService, which provides a unified, fine-grained security service for Hadoop. This video is a recording from an earlier Strata conference.
  • Reactive Streams: Linking reactive applications to Spark Streaming by Luc Bourlier (Lightbend): Live demo of a random number generator with a configurable generation rate, running on several Raspberry Pis, with Spark Streaming calculating some statistics. He showed impressive graphs on how back pressure is handled if too many numbers are generated per second - Spark 1.5 acts more intelligently than Spark 1.4 here. The key facts also appear in these slides and in this video from another conference.
  • Applications of natural language understanding: Tools and technologies by Alyona Medelyan (Entopix): Natural language understanding (NLU) - as opposed to natural language processing (NLP) - tries to analyze the context, e.g. to discover relations within text or to perform sentiment analysis. There are a lot of open source libraries for machine learning, deep learning and NLU to choose from - the priority should be to decide on the algorithm rather than on a particular library, and domain knowledge is even more important than the algorithm. Commercial libraries and APIs often do not provide added value, but they offer support and possibly a better solution for a specific business area like healthcare.
  • Insight at the speed of thought: Visualizing and exploring data at scale by Nicholas Turner (Incited): They started a project to get rid of static reports that were generated over night and covered only 3% of the data. They wanted immediate availability, 100% data coverage and self-service (no static pre-generated reports). They now have a Cloudera Hadoop cluster that uses Flume for data ingestion, streams the current day into Avro files and runs batch jobs over night that move the data from the Avro files into Parquet files partitioned by day. For data wrangling and preparation they use Trifacta, including anomaly detection, cleansing and joining of data. Visualization and analytics are provided by an in-house developed search tool and Zoomdata, based on Impala queries. They use data science for fraud detection (using a Solr index updated in real time), which leads to machine learning to improve fraud detection and to validate the fraud decisions. The case study can be found here.
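
A minimal sketch of how the near-real-time part of such a fraud/anomaly pipeline could look with Spark Streaming's direct Kafka API (Spark 1.x). The broker address, topic name, record layout and the threshold rule are all made up for illustration, not taken from the tutorial:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetflowAnomalies {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("netflow-anomalies")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Direct stream from a hypothetical "netflow" topic (Spark 1.x / Kafka 0.8 API).
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc,
      Map("metadata.broker.list" -> "broker-1:9092"),
      Set("netflow"))

    // Assume each record value is a CSV line "srcIp,dstIp,bytes": count flows per
    // source IP per batch and flag sources exceeding a made-up threshold.
    val flowsPerSource = stream
      .map { case (_, line) => line.split(",") }
      .filter(_.length >= 3)
      .map(fields => (fields(0), 1L))
      .reduceByKey(_ + _)

    flowsPerSource
      .filter { case (_, count) => count > 10000 }
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```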
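For the time-series tutorial, here is a minimal sketch of reading and writing Cassandra data from Spark via the Spark Cassandra Connector. The keyspace, table and column names are invented for illustration; the point is that a partition key like (station_id, day) keeps one day of a station's time series together on a single node:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object TemperatureStats {
  def main(args: Array[String]): Unit = {
    // Contact point, keyspace and table layout are purely illustrative.
    val conf = new SparkConf()
      .setAppName("temperature-stats")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read one day of measurements for one station.
    val readings = sc.cassandraTable[(String, String, Double)]("demo", "temperature")
      .select("station_id", "day", "value")
      .where("station_id = ? AND day = ?", "ST-001", "2016-06-01")

    // Aggregate to a daily maximum per station and write it back to Cassandra.
    val dailyMax = readings
      .map { case (station, day, value) => ((station, day), value) }
      .reduceByKey((a, b) => math.max(a, b))
      .map { case ((station, day), max) => (station, day, max) }

    dailyMax.saveToCassandra("demo", "temperature_daily_max",
      SomeColumns("station_id", "day", "max_value"))

    sc.stop()
  }
}
```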
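And as a small illustration of the groupByKey() advice from the "Beyond shuffling" talk: both variants below yield the same word counts, but reduceByKey() combines values per partition before the shuffle, while groupByKey() ships every single value across the network. The example data is of course made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountTips {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("word-count-tips").setMaster("local[*]"))

    val words = sc.parallelize(Seq("spark", "kafka", "spark", "hadoop", "spark"))
      .map(word => (word, 1))

    // Anti-pattern: groupByKey() shuffles every single (word, 1) pair across
    // the network before anything is summed up.
    val countsViaGroup = words.groupByKey().mapValues(_.sum)

    // Preferred: reduceByKey() sums within each partition first (map-side
    // combine), so much less data is shuffled.
    val countsViaReduce = words.reduceByKey(_ + _)

    // Both produce the same result; only the amount of shuffled data differs.
    println(countsViaGroup.collect().toMap == countsViaReduce.collect().toMap)
    sc.stop()
  }
}
```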

2012-05-30

NoSQL Matters 2012

A new conference about NoSQL took place in Cologne this year in May. It is called NoSQL Matters, consisted of three parallel tracks and lasted two days. I attended the following talks.
On Tuesday:
  • Scalable NoSQL - Past, Present and Future by Doug Judd: This keynote started by showing the changes in hardware, software development, IT applications, research and corporate leadership over the last decades, comparing the pre-internet era with the current internet era. Moving on to NoSQL databases, three categories for the different database products were introduced: auto-sharding, distributed hashing and data consistency. The talk ended with an outlook on present and future evolutions, focusing on disk drive technology, networking and application trends.
  • NoSQL - A Technology for Real Time Enterprise Applications? by Dirk Bartels: This was about big data processing in general - not going into much depth, but showing lots of market analysts' comparisons and mentioning basic theories like the CAP theorem. Also, some differences in the requirements for NoSQL databases between old-style enterprises and web enterprises were shown. He finished with application examples from their customers.
  • Designing for Concurrency with Riak by Mathias Meyer: This talk, held by the author of the Riak Handbook, dealt with data consistency and concurrent writes. It started with per-document changelogs and vector clocks and then introduced more Riak-specific features like secondary indexes, as well as distributed data structures in general, such as g-counters (a small sketch follows this list).
  • Hypertable - The Storage Infrastructure behind one of the World's Largest Email Services by Doug Judd: Hypertable was built following Google's BigTable architecture and focuses on horizontal scalability. It uses sparse tables and column families.
  • From Tables to Graph. Recommendation Systems, a Graph Database Use Case Analysis by Pere Urbón-Bayes: This talk gave a short introduction to recommendation systems (detecting similarities, measuring the "distance" between items based on their properties etc.), their relation to graph processing and why graph databases may help. The following graph databases, graph processing frameworks and APIs were briefly introduced: Neo4j, OrientDB, Apache Giraph, Signal/Collect, Blueprints API.
  • Welcome to Redis 2.6 by Salvatore Sanfilippo: The main author of Redis gave an overview of the new features introduced in release 2.6: scripting (uses Lua, scripts are atomic and are run within the server), more bit operations, millisecond key expiration, increments by floating-point numbers, serialization of values (dump and restore), AppendOnlyFile improvements, improvements for small sets, hashes etc.
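As a small illustration of the g-counters mentioned in the Riak talk, here is a sketch of a grow-only counter in Scala. The replica ids and the API are made up; the point is that each replica only increments its own slot and merging takes the per-replica maximum, so replicas can be merged in any order and still converge to the same value:

```scala
// Minimal grow-only counter (g-counter) sketch; not taken from any library.
final case class GCounter(counts: Map[String, Long] = Map.empty) {
  // Increment the slot owned by the given replica.
  def increment(replicaId: String, delta: Long = 1): GCounter =
    copy(counts = counts.updated(replicaId, counts.getOrElse(replicaId, 0L) + delta))

  // The counter's value is the sum over all replica slots.
  def value: Long = counts.values.sum

  // Merge two replicas' states by taking the element-wise maximum.
  def merge(other: GCounter): GCounter =
    GCounter((counts.keySet ++ other.counts.keySet).map { id =>
      id -> math.max(counts.getOrElse(id, 0L), other.counts.getOrElse(id, 0L))
    }.toMap)
}

object GCounterDemo extends App {
  val a = GCounter().increment("node-a").increment("node-a")
  val b = GCounter().increment("node-b")
  // Both merge orders converge to the same value (3).
  println(a.merge(b).value == b.merge(a).value)
}
```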
On Wednesday:
  • NoSQL Adoption - What's the Next Step? by Luca Garulli: This was an entertaining keynote about database history (starting with stone tablets and papyrus ...) and the three rules of NoSQL - one of them being: "If you only have a hammer, everything looks like a nail." ;-). He also presented a criteria catalogue for choosing the right database product and showed some future developments and risks for NoSQL databases.
  • NoSQL - Not Only a Fairy Tale by Timo Derstappen and Sebastian Cohnen: This talk showed the evolution of an ad server's persistence layer (from Amazon S3 to CouchDB and back to Amazon S3 with Redis as a caching layer) and the lessons learned. In their scenario CouchDB did not scale as needed due to replication and compaction overhead - CouchDB's strengths like multi-master replication, MVCC and append-only storage were not needed. They found Redis' performance impressive.
  • The No-Marketing Bullshit Introduction to Couchbase Server 2.0 by Jan Lehnardt: This talk was about Couchbase, which offers auto-sharding by introducing so-called vBuckets (which reminds me of consistent hashing), automatic failover, a Memcached-compatible API and SDKs for lots of programming languages. Release 2.0 also offers incremental MapReduce, replication across datacenters etc. He also presented a live demo.
  • Apache Cassandra: Real-World Scalability, Today by Jonathan Ellis: This talk started with an overview of Cassandra's high availability features (no single point of failure, multi-master and multi-datacenter awareness etc.). Then he went into more detail about partitioning and replication based on consistent hashing (taking just the partition key, i.e. the first part of the primary key, into account - a small sketch follows this list) and performance features like the log-structured storage engine, row-level isolation and built-in compression. After that he presented lots of examples of Cassandra use cases.
  • NoNoSQL@Google by Olaf Bachmann: This talk introduced how Google's Ads Traffic Quality Team makes heavy use of big data analysis. Basically, data is stored as Protocol Buffers. There is heavy use of MapReduce, but it is hard to write, maintain and debug. You can counteract this by using Sawzall, but that makes MapReduce inflexible. So the next try was Dremel, but that has limitations concerning intermediate and output data size. After that SqlMR (SQL on top of MapReduce) was given a try, but it lacks interactivity and is hard to debug. So, at Google SQL is still the data analysis language of choice, although there are different dialects of it.
  • Theoretical Aspects of Distributed Systems, Playfully Illustrated by Pavlo Baron: This entertaining talk illustrated several problems and possible solutions in distributed systems with audience interaction: time synchronization, vector clocks, re-hashing, consistent hashing, gossip architecture, hinted handoff, quorums, master election, failure detection, partition tolerance etc.
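To illustrate the consistent hashing that came up both in the Cassandra talk and in Pavlo Baron's session, here is a small sketch of a hash ring in Scala. Node names, the number of virtual nodes and the choice of MD5 are arbitrary choices for illustration; real systems like Cassandra or Riak use their own partitioners:

```scala
import java.security.MessageDigest
import scala.collection.immutable.TreeMap

// Nodes and keys are hashed onto the same ring; a key is owned by the first
// node at or after its ring position, wrapping around at the end.
class HashRing(nodes: Seq[String], vnodesPerNode: Int = 8) {
  // Fold the first 8 bytes of an MD5 digest into a Long ring position.
  private def hash(s: String): Long = {
    val digest = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"))
    digest.take(8).foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xffL))
  }

  // Ring positions -> owning node, using several virtual nodes per physical node.
  private val ring: TreeMap[Long, String] = TreeMap(
    (for (n <- nodes; v <- 0 until vnodesPerNode) yield hash(s"$n#$v") -> n): _*)

  // The node responsible for a key: first ring position >= hash(key), else wrap.
  def nodeFor(key: String): String = {
    val it = ring.iteratorFrom(hash(key))
    if (it.hasNext) it.next()._2 else ring.head._2
  }
}

object HashRingDemo extends App {
  val ring = new HashRing(Seq("cass-1", "cass-2", "cass-3"))
  Seq("user:42", "user:43", "user:44").foreach(k => println(s"$k -> ${ring.nodeFor(k)}"))
}
```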
The baristas have a break ...