For its third edition, the whole Adaltas team is gathering in Morzine for a full week, with two days dedicated to technology on the 15th and 16th of September 2022.
The speakers choose one of the three available formats:
- Presentation: from 20 minutes to 1 hour
- Demonstration: from 45 minutes to 2 hours
- Training: from 1 hour to 2 hours
Program
Once a talk has been given, its supporting materials and an article covering it will be published on the Adaltas website. Here is the calendar and the list of topics covered during this week.
Thursday, September 15th, 2022
- 9:30 Kubernetes Networking Lab
- 10:45 Running Kafka clusters on Kubernetes with Strimzi
- 12:00 Expose your containers and virtual machines with a public IP
- 14:30 DuckDB introduction
- 15:30 Using LXD with Terraform for Local Development Environments
- 16:30 Comparison of frameworks for data quality validation
- 17:15 A quick look at Apache Arrow
Friday, September 16th, 2022
- 9:30 Introduction to SingleStoreDB, the database for transactional and analytical workloads
- 10:45 Introduction to Apache Iceberg, the open table format
- 12:00 Introduction to Apache Kyuubi
- 14:30 Vector Databases, Milvus Overview
- 15:30 tdp-server, REST service management for TDP clusters
- 16:30 Ballista, a Rust-based distributed query engine
- 17:15 Data protection around the world
Abstracts
Kubernetes Networking Lab
- Speaker: Paul-Adrien CORDONNIER
- Duration: 1h15
- Format: talk + demo
- Schedule: Thursday, September 15th, 2022 at 9:30
The goal of this lab is to give anyone an introduction to the world of Kubernetes network communications. We will try to cover most of the concepts at a high level and practice them in a sandbox environment.
At the end of the session, we should all know the purpose of each element in the networking stack and how it is used. The lab should also serve as a reminder when confusion inevitably occurs during your Kubernetes journey.
Here are the covered concepts:
- Low-level basic networking (CNI)
- Kubernetes networking API (Services)
- DNS
- Exposing Kubernetes applications outside the cluster (LoadBalancer, Ingress, Gateways)
- Service Mesh
Running Kafka clusters on Kubernetes with Strimzi
- Speaker: Leo SCHOUKROUN
- Duration: 1h15
- Format: talk + demo
- Schedule: Thursday, September 15th, 2022 at 10:45
Kubernetes is not the first platform that comes to mind to run Apache Kafka clusters.
We will go through the basics of Strimzi, a Kafka operator for Kubernetes curated by Red Hat. A special focus will be placed on the storage problem, which is often a pain point on bare metal Kubernetes clusters.
We will also compare Strimzi with other Kafka operators by presenting their pros and cons.
The presentation will end with a demonstration of various use cases for Strimzi.
Expose your containers and virtual machines with a public IP
- Speaker: David WORMS
- Duration: 1h
- Format: discussion + demo
- Schedule: Thursday, September 15th, 2022 at 12:00
Virtual machines and containers are commonly exposed to the Internet with port forwarding. In such a case, the public IP is shared with the host machine. While this works well in many situations, it is sometimes necessary to associate the guest machine with its own distinct public IP, for example to host your own email server, to gain access to an internal network, or to expose Kubernetes services.
The general idea is to route the traffic from a public IP or a CIDR subnet to a guest machine running inside a host machine. Said differently, the connectivity exposes containers and virtual machines with a static public address.
It works seamlessly with any hypervisor, including VMware ESXi, Citrix Xen Server, OpenStack, and Proxmox. The covered procedure uses LXD in cluster mode.
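The routing idea can be sketched with Python's standard ipaddress module: a routed CIDR subnet is split into usable addresses, and each guest receives one static public address. This is only an illustration of the addressing side, not the LXD procedure covered in the session. The subnet (a range reserved for documentation) and the guest names are hypothetical.

```python
import ipaddress


def assign_public_ips(cidr, guests):
    """Map each guest name to one usable address of the routed subnet."""
    network = ipaddress.ip_network(cidr)
    # hosts() skips the network and broadcast addresses of the subnet.
    hosts = list(network.hosts())
    if len(guests) > len(hosts):
        raise ValueError(f"{cidr} has only {len(hosts)} usable addresses")
    return {guest: str(ip) for guest, ip in zip(guests, hosts)}


# 203.0.113.0/24 is reserved for documentation (RFC 5737); the guest names
# are made up for the example.
assignments = assign_public_ips("203.0.113.0/29", ["mail", "k8s-ingress"])
for guest, ip in assignments.items():
    print(f"{guest} -> {ip}")
```

Each guest keeps its own address regardless of the host it runs on, which is what distinguishes this setup from port forwarding on a shared host IP.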
DuckDB introduction
- Speaker: Stephan BAUM
- Duration: 1h
- Format: presentation + demo
- Schedule: Thursday, September 15th, 2022 at 14:30
DuckDB is an embedded columnar-vectorized OLAP DBMS using SQL queries.
We will present the architecture and specificities of the DuckDB DBMS, why it was created, and how it achieves its performance by describing the ART indexing process. We will also explain in which cases DuckDB should or should not be used.
Finally, a demo will illustrate the basic usage of DuckDB in a Python notebook and how it relates to Pandas and Apache Arrow.
Using LXD with Terraform for Local Development Environments
- Speaker: Gauthier LEONARD
- Duration: 1h
- Format: talk + demo
- Schedule: Thursday, September 15th, 2022 at 15:30
LXD is a modern, secure and powerful system container and virtual machine manager. LXD offers significant advantages over other standard virtualization tools (especially Vagrant):
- Unified interface for managing containers, VMs, and networks
- Super fast provisioning thanks to system containers
- Live resizing of containers/VMs
- Running both locally and on multi-host clusters (therefore usable both for development and production)
Yet the LXD API, the LXC CLI, and cloud-init are quite hard to grasp for new users and do not allow easy versioning of environment configurations.
The LXD Terraform provider is an elegant solution to do infra-as-code on top of LXD. In the demo, we will see how to migrate from Vagrant+VirtualBox to Terraform+LXD for local development environments.
Comparison of frameworks for data quality validation
Data quality is an important issue that a lot of companies have not yet addressed effectively.
Even when tests are carried out, they are executed manually on a subset of tables. Recently, I took part in setting up an automated pipeline. Based on the requirements and the technical stack, I proposed several libraries that could be used for the purpose and a PoC with the chosen one.
I would like to share my experience on the subject, describe the currently most popular frameworks for data validation, and present their pros and cons. In particular, these frameworks are:
- Deequ
- Great Expectations
- Delta Live Tables (DLT)
- Soda
A quick look at Apache Arrow
- Speaker: Albert Konrad
- Duration: 45min
- Format: talk + demo
- Schedule: Thursday, September 15th, 2022 at 17:15
Is it a software development platform? Is it an in-memory data storage format? Or is it just a file format? No, it's Apache Arrow.
We will take a very brief look at what Apache Arrow is, what problem(s) it solves, and discuss how it appeals to Data Engineers. In a quick demo, we will also test whether Apache Arrow delivers on its promise.
Introduction to SingleStoreDB, the database for transactional and analytical workloads
- Speaker: Sergei Kudinov
- Duration: 1h15
- Format: presentation
- Schedule: Friday, September 16th, 2022 at 9:30
SingleStoreDB unifies transactions and analytics in a single engine to drive low-latency access to large datasets. With its patented Universal Storage, SingleStore allows operational and analytical workloads to be processed using a single table type. Built for developers and architects, SingleStoreDB is based on a distributed SQL architecture, delivering 10-100 millisecond performance on complex queries.
The presentation will cover the architecture and optimisation techniques through which SingleStore gains performance.
Introduction to Apache Iceberg, the open table format
- Speaker: Yanis Bariteau
- Duration: 1h15
- Format: presentation + demo
- Schedule: Friday, September 16th, 2022 at 10:45
Iceberg is currently used by organizations including Netflix, Apple, Adobe, LinkedIn, Expedia, Stripe, and others as the open standard for large analytic tables in the cloud.
It is a table format for analytical datasets that can interface with a variety of compute engines. It has many capabilities that enable data professionals to handle large data successfully, even up to tens of petabytes in size, along with high-performance queries on data at rest.
Introduction to Apache Kyuubi
- Speaker: Guillaume Holdorf
- Duration: 45min
- Format: presentation
- Schedule: Friday, September 16th, 2022 at 12:00
Apache Kyuubi democratizes access to your data storage solution by allowing SQL requests from any ODBC/JDBC client. The Kyuubi servers allow you to serve a large number of requests in a distributed manner and ensure HA, high performance, and secure access to your data.
In this presentation, we will see the different features of Apache Kyuubi and what they make possible.
Vector Databases, Milvus Overview
- Speaker: Tobias Chavarria
- Duration: 45min
- Format: presentation + demo
- Schedule: Friday, September 16th, 2022 at 14:30
Milvus is an open source vector database, built for scalable similarity search. It is part of the LF AI & Data Foundation.
Milvus provides capabilities like CRUD operations, metadata filtering, and horizontal scaling, and is:
- Highly available
- Highly scalable
- Cloud-native
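To make "similarity search" concrete, here is a minimal brute-force sketch in plain Python. It is not the Milvus API and does not reflect how Milvus indexes data: it simply ranks a few hypothetical embedding vectors by cosine similarity to a query vector, which is the operation a vector database is built to scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Hypothetical 3-dimensional embeddings, invented for the example.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
nearest = top_k([1.0, 0.0, 0.0], embeddings)
print(nearest)
```

The exhaustive scan above compares the query with every stored vector, so its cost grows linearly with the collection size. Avoiding that cost on large collections is the reason dedicated vector databases exist.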
tdp-server, REST service management for TDP clusters
- Speaker: Guillaume BOUTRY
- Duration: 1h
- Format: talk + demo
- Schedule: Friday, September 16th, 2022 at 15:15
tdp-server is the web service exposing REST APIs over tdp-lib core functionalities while adding multi-user capabilities, security, and more contextual information to deployments.
As a reminder, tdp-lib core functionalities are task scheduling (through a DAG definition) and variable versioning (through git repositories).
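As an illustration of task scheduling through a DAG, the sketch below uses Python's standard graphlib module to derive an execution order in which every task runs after its dependencies. It does not use tdp-lib itself, and the task names and dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dag = {
    "zookeeper_install": set(),
    "zookeeper_start": {"zookeeper_install"},
    "hdfs_install": set(),
    "hdfs_start": {"hdfs_install", "zookeeper_start"},
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Several valid orders may exist for the same graph. The only guarantee is that a task never appears before one of its dependencies.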
With tdp-server, you will be able to manage services and components as resources, using the different endpoints to read or modify their configuration (with GET, PUT (replaces), PATCH (modifies existing)). You cannot add services/components using POST or delete them using DELETE. Determining which services/components are available is done by tdp-lib using its discovery functionalities.
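The difference between the PUT and PATCH semantics can be sketched with a small in-memory model. This is not tdp-server code: the function names, the service name, and the configuration keys are hypothetical, and the model only mirrors the rules stated here (replace versus modify, and no creation of unknown resources).

```python
def put(resources, name, new_config):
    """PUT semantics: replace the whole configuration of an existing resource."""
    if name not in resources:
        # Resources are discovered, not created by clients.
        raise KeyError(name)
    resources[name] = dict(new_config)
    return resources[name]


def patch(resources, name, changes):
    """PATCH semantics: merge the changes into the existing configuration."""
    if name not in resources:
        raise KeyError(name)
    resources[name].update(changes)
    return resources[name]


# Hypothetical service and configuration keys.
services = {"hdfs": {"replication": 3, "port": 8020}}

patch(services, "hdfs", {"replication": 2})
patched = dict(services["hdfs"])   # "port" is preserved

put(services, "hdfs", {"replication": 1})
replaced = dict(services["hdfs"])  # "port" is gone, the config was replaced

print(patched, replaced)
```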
Then, the most important feature is deploy: with deploy, you will be able to perform actions on the cluster. It is a simple endpoint, taking three parameters: targets, sources, and filter.
Ballista, a Rust-based distributed query engine
- Speaker: Gonzalo Etse
- Duration: 45min
- Format: presentation
- Schedule: Friday, September 16th, 2022 at 16:15
Ballista is a distributed compute engine built with Rust, leveraging Apache Arrow, Arrow Flight and DataFusion. Its modern architecture allows other programming languages, like Python, C++, and Java, to work without serialization issues.
Apache Arrow allows for in-memory use, while Flight further enables efficient data transfer between processes. In addition, DataFusion, alongside technologies such as Google Protocol Buffers, enables fast and efficient use of memory across applications.
Ballista is still under development and is being implemented on top of DataFusion. While still in its early stages, the architecture provides excellent memory efficiency, and memory usage can be 5x to 10x lower than Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of distributed compute.
Data protection around the world
- Speaker: Paul Farault
- Duration: 45min
- Format: talk
- Schedule: Friday, September 16th, 2022 at 17:00
Data protection is a fundamental subject for companies. Not only for personal data (of customers, users or employees), but also for the data of the company itself.
Both of these are discussed, starting from the Alstom case (confronted with the FCPA and the DOJ in 2014) and moving on to the fundamental rules regarding the protection of personal data imposed by the GDPR.
This presentation marks the first step in a series about data protection. Future episodes will address technical responses to these issues.