2016-06-28

Baakenhafen, HafenCity

At Petersenkai
View from the Baakenhafen bridge
HafenCity Universität U-Bahn station

ContainerDays

I skipped the workshop day and just joined the second day of ContainerDays 2016. My notes from the sessions I attended:
  • From Borg To Kubernetes: The History And Future Of Container Orchestration by Mandy Waite (Google): Mandy explained Google's demand for scheduling different workloads (monitoring, production tasks, batch runs) with resource isolation concerning CPU, memory and I/O. Many lessons learned from internal container management systems have been baked into Kubernetes (k8s). Some core features of k8s were presented (e.g. deployments update a service, which is made of pods, that in turn group closely related containers), concluding with an outlook on some upcoming features (e.g. improved monitoring).
  • Shaping Applications for Docker, CoreOS, Kubernetes and Co by Thomas Fricke (Endocode AG) (here's another version of the slides from a different conference and a video of his talk at that conference): Key takeaway: When designing your application, separate "cattle" (i.e. systems you don't explicitly care about and that can be scaled massively and replaced easily) and "pets" (i.e. systems you care about), separate stateless layers (this is your "cattle") and persistence layers (these are your "pets"). Thomas also introduced the key differences between virtualization and containers and showed the existing ecosystem in the container world plus a case study with the lessons learned.
  • Linuxkernel features building my $CONTAINER by Erkan Yanar: Erkan showed some details of the base technologies for containers that have been available for years now (e.g. chroot, capabilities, cgroups, namespaces).
  • The NoOps Movement by Marco Hutzsch (Otto): To me, what Marco named NoOps is actually DevOps done right or pushed further. It's about automation, infrastructure as code, full-stack responsibility incl. developers being on call, serverless architecture etc. He argued for getting rid of dedicated operations teams and instead building delivery teams of engineers who are able to develop software and are interested in running what they've built. Reminds me of what we had at my former employer, and I can confirm that you can really gain speed with this approach.
  • Rancher Docker - From zero to hero by Michael Vogeler (Nexinto GmbH): Actually, I've recently played around a bit with Rancher in the AWS cloud. It offers nice features like resource management, cross-host networking, service discovery, monitoring and support for several scheduling engines like Kubernetes or Swarm. And it's set up quite easily - so, worth a second look ...
  • Plan B: Service to Service Authentication with Oauth by Henning Jacobs (Zalando): Each team has its own virtual data center (i.e. separate AWS cloud accounts) with its own domain. If one service wants to talk to another service, it has to contact the ELB bound to the domain of the other service. To authorize that communication they've chosen OAuth 2.0 with credentials stored on S3 (protected by IAM policies) to acquire auth tokens. The tokens are JSON Web Tokens, which are self-contained for validation. The public keys are stored in an OAuth provider service, token revocations are handled by another service. All components are open source, docs are here.
  • A lightning talk by Florian Leibert (Mesosphere) providing an overview of the history and the architecture of Mesos and DC/OS including a live demo.
  • Efficient monitoring in modern environments by Tobias Schmidt (Soundcloud): Soundcloud created Prometheus. You should collect metrics on each and every layer of your systems (hosts, containers, applications) - detailed per component and not just aggregated. Aggregations are handy for alerting and dashboards. Four golden signals to monitor: Latency, traffic (e.g. requests per second), errors and saturation/utilization. Don't rely on medians, use percentiles instead (90%, 95%, 99%, 99.9% etc.). At Soundcloud every developer is on call. Use symptom-based alerting instead of cause-based alerting, i.e. monitoring business processes instead of isolated technical components adds value - the four golden signals mentioned before can provide good guidance. Only alert when human interaction is required - this reduces noise and prevents alert fatigue. Use alert grouping and service silencing e.g. for maintenance, avoid static thresholds in favor of relative thresholds e.g. based on historic data, use a ticketing system and treat warnings like new features. Provide concise playbooks/runbooks to enable fast problem mitigation. Practice outages! Recommended read: My Philosophy on Alerting.
  • A web shop in containers - Building the microservice platform for otto.de by Florian Sellmayr (ThoughtWorks) & Felix Bechstein (Otto): Florian and Felix presented an overview of the architecture behind otto.de using microservices running on Docker containers controlled by Marathon. With this architecture they gained freedom and responsibility for the developers, the ability to deploy new services without interaction with operations, increased standardization, better scalability and faster delivery. Problems they faced: Poor adoption of Mesos in various frameworks, service discovery, poor support for multi-tenancy and authentication, resource and container isolation. To mitigate some of these problems they came up with separated clusters and refactored them a lot over time (automation helped here a lot). Secrets are managed by Vault.
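
To illustrate the point about medians versus percentiles from the monitoring talk above, here is a minimal sketch in plain Python. It is not Prometheus code and not taken from the talk: the latency values are made up, and `percentile` is a hypothetical helper that uses the simple nearest-rank method.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up request latencies in milliseconds: 95 fast requests, 5 very slow ones.
latencies_ms = [20] * 95 + [2000] * 5

median_ms = statistics.median(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"median={median_ms} ms, p95={p95_ms} ms, p99={p99_ms} ms")
```

In this made-up data set the median and even the 95th percentile look perfectly healthy, while the 99th percentile exposes the slow requests that 5% of the users actually experience.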

2016-06-27

Day off, and off to the Elbe again!

Mural of the FrauenFreiluftGalerie
  • Dinner at the Kombüse.
  • Sundowner (albeit without the sun or its setting) at the Tower Bar.
Probably Germany's northernmost vineyard

2016-06-04

Strata + Hadoop World 2016, London

Some notes about the workshops and talks I attended.
  • Hadoop application architectures: Fraud detection by Jonathan Seidman (Cloudera), Mark Grover (Cloudera), Gwen Shapira (Confluent) & Ted Malaska (Cloudera): The use case for this tutorial was network anomaly detection based on NetFlow data. The system aimed for a real time dashboard, a near real time component updating profiles and a long-term storage layer enabling analytics. They went through the system architecture and discussed several technical options for each component. Based on their assessments concerning maturity, production readiness, community support, the number of committers and proven commercial success the final set of products was Kafka, Flume, Spark Streaming (the recently released Kafka Streams was also mentioned), HDFS, HBase (or Kudu), MapReduce, Impala and Spark ... at some point one has to cut the rope and make a decision, as there are new products popping up currently nearly every day. The demo application and the slides are publicly available.
  • An Introduction to time series with Team Apache by Patrick McFadin (DataStax): This tutorial mainly dealt with Cassandra and introduced Cassandra's features like consistent hashing, replication, multi-datacenter support and its write path (1. write to commit log and memtable; 2. write to SSTable on disk; 3. perform compaction). The sample application uses Kafka for collecting data, Spark for processing and Cassandra as storage entity by utilizing the Spark Cassandra Connector. These slides seem to be a slightly older version, but the main content matches.
  • Hadoop's storage gap: Resolving transactional access/analytic performance trade-offs with Apache Kudu (incubating) by Todd Lipcon (Cloudera): A wrap-up of the current status of Kudu. These slides seem to be an older version, but the benchmark results shown there are still valid.
  • TensorFlow: Machine learning for everyone by Sherry Moore (Google): There was quite a hype around TensorFlow since its initial release a couple of months ago. This talk introduced the motivation and the basic concepts plus the ease of use including some small code examples.
  • Why is my Hadoop job slow? by Bikas Saha (Hortonworks Inc): That talk started with the current monitoring capabilities of Ambari plus Grafana integration for building custom metrics dashboards. A lot of Hadoop services provide audit logs. The YARN timeline server and Ambari Log Search (based on Solr) help with event correlation. He closed with tracing and analysis by showing Zeppelin for ad-hoc analysis.
  • Anomaly detection in telecom with Spark by Ted Dunning (MapR Technologies): This was about establishing a session between a caller and a telecom tower and how to deal with problems like signal interference and different data distribution patterns. The demo application is publicly available. These slides are from a colleague of Ted, but covering the same topics.
  • Beyond shuffling: Tips and tricks for scaling Spark jobs by Holden Karau (IBM): She covered several topics like reusing RDDs, avoiding groupByKey() and testing Spark applications. Two of her GitHub repos are of particular interest: One with code examples from one of her books, and one for enabling the implementation of tests.
  • Floating elephants: Developing data wrangling systems on Docker by Chad Metcalf (Docker) & Seshadri Mahalingam (Trifacta): The aim was to provide "Hadoop as a service" for developers, i.e. scalability and performance are currently not an explicit goal, but interface completeness and support for several Hadoop distributions. Therefore, they bundle all jars of a particular Hadoop distribution version into a single "bundle jar" and use Docker as a runtime environment. The next step was to split the different Hadoop services into separate containers and use Docker Networking to tie them together. In the future, they want to scale the Hadoop services and scale out on several Docker hosts. The current state is available here.
  • Simplifying Hadoop with RecordService, a secure and unified data access path for compute frameworks by Alex Leblang (Cloudera): Overview of RecordService that provides a unified fine-grained security service for Hadoop. This video is a recording from an earlier Strata conference. 
  • Reactive Streams: Linking reactive applications to Spark Streaming by Luc Bourlier (Lightbend): Live demo of a random number generator with a configurable rate of numbers generated per second, running on several Raspberry Pis and calculating some statistics using Spark Streaming. He showed impressive graphs concerning the handling of back pressure when too many numbers are generated per second - Spark 1.5 acts more intelligently than Spark 1.4 here. The key facts also appear in these slides and in this video from another conference.
  • Applications of natural language understanding: Tools and technologies by Alyona Medelyan (Entopix): Natural language understanding (NLU) - as opposed to natural language processing (NLP) - tries to analyze the context, e.g. to discover relations within text or for sentiment analysis. There are a lot of open source libraries for machine learning, deep learning and NLU to choose from - priority should be to decide on the algorithm rather than on a particular library, but domain knowledge is more important than the algorithm. Commercial libraries and APIs often do not provide added value, but they offer support and possibly a better solution for a specific business area like healthcare.
  • Insight at the speed of thought: Visualizing and exploring data at scale by Nicholas Turner (Incited): They started a project to get rid of static reports that were generated overnight covering only 3% of the data. They wanted immediate availability, 100% data coverage and self-service (no static pre-generated reports). They now have a Cloudera Hadoop cluster using Flume for data ingestion, streaming the current day into Avro files and running batch jobs overnight that move the data from the Avro files into Parquet files that are partitioned by day. For data wrangling and preparation they use Trifacta including anomaly detection, cleansing and joining of data. Visualization and analytics are provided by a search tool developed in-house and Zoomdata based on Impala queries. They use data science for fraud detection (using a Solr index updated in real time) leading to machine learning to improve fraud detection and to validate the decisions concerning frauds. The case study can be found here.
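
Since consistent hashing came up in the Cassandra tutorial above, here is a minimal sketch of the general idea in plain Python. It is only an illustration under my own assumptions and not Cassandra's implementation: the `HashRing` class, the node names and the MD5-based token function are all hypothetical.

```python
import bisect
import hashlib


def _token(key):
    """Map a string to a position on the ring (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with several virtual nodes (tokens) per node."""

    def __init__(self, nodes=(), vnodes=16):
        self.vnodes = vnodes
        self._tokens = []  # sorted ring positions
        self._owners = {}  # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node owns `vnodes` positions to spread its load over the ring.
        for i in range(self.vnodes):
            token = _token(f"{node}#{i}")
            self._owners[token] = node
            bisect.insort(self._tokens, token)

    def remove_node(self, node):
        self._tokens = [t for t in self._tokens if self._owners[t] != node]
        self._owners = {t: n for t, n in self._owners.items() if n != node}

    def node_for(self, key):
        # A key belongs to the first token at or after its own position,
        # wrapping around to the start of the ring at the end.
        if not self._tokens:
            raise LookupError("ring is empty")
        idx = bisect.bisect_left(self._tokens, _token(key)) % len(self._tokens)
        return self._owners[self._tokens[idx]]


# Made-up keys and node names, just to show what happens when a node joins.
keys = [f"sensor-{i}" for i in range(200)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved to the new node")
```

The interesting property is that adding a node only moves keys onto that new node; all other keys stay where they were, which is what keeps rebalancing cheap compared to a plain `hash(key) % number_of_nodes` scheme.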

Rule, Britannia! Britannia rule the waves!

On business at a conference in London ... i.e. time for sightseeing only in the evenings, far outside the city center as well, and typical British weather on top of that :-)
In front of the conference venue with a view of the Docklands
Traffic light jumble in front of the Tower of London - in the background the Shangri-La Hotel in the low-hanging clouds
On Tower Bridge
Piccadilly Circus
In the Emirates Air Line above the Thames
View across the Docklands to City Airport