Databricks

Software : Cloud Computing : Data & Analytics

San Francisco, California, United States

VC-H

With origins in academia and the open source community, Databricks was founded in 2013 by the original creators of Apache Spark™, Delta Lake and MLflow. As the world’s first and only lakehouse platform in the cloud, Databricks combines the best of data warehouses and data lakes to offer an open and unified platform for data and AI.

Assembly Line

How Corning Built End-to-end ML on Databricks Lakehouse Platform

📅 Date:

✍️ Author: Denis Kamotsky

🔖 Topics: MLOps, Quality Assurance, Data Lakehouse

🏢 Organizations: Corning, Databricks, AWS


Specifically for quality inspection, we take high-resolution images to look for irregularities in the cells, which can be predictive of leaks and defective parts. The challenge, however, is the prevalence of false positives due to the debris in the manufacturing environment showing up in pictures.

To address this, we manually brush and blow the filters before imaging. We discovered that by notifying operators of which specific parts to clean, we could significantly reduce the total time required for the process, and machine learning came in handy. We used ML to predict whether a filter is clean or dirty based on low-resolution images taken while the operator is setting up the filter inside the imaging device. Based on the prediction, the operator would get the signal to clean the part or not, thus reducing false positives on the final high-res images, helping us move faster through the production process and providing high-quality filters.

Read more at Databricks Blog

Maersk embraces edge computing to revolutionize supply chain

📅 Date:

✍️ Author: Paula Rooney

🔖 Topics: IIoT, 5G

🏢 Organizations: Maersk, Microsoft, Databricks


Gavin Laybourne, global CIO of Maersk’s APM Terminals business, is embracing cutting-edge technologies to accelerate and fortify the global supply chain, working with technology giants to implement edge computing, private 5G networks, and thousands of IoT devices at its terminals to elevate the efficiency, quality, and visibility of the container ships Maersk uses to transport cargo across the oceans.

“Two to three years ago, we put everything on the cloud, but what we’re doing now is different,” Laybourne says. “The cloud, for me, is not the North Star. We must have the edge. We need real-time instruction sets for machines [container handling equipment at container terminals in ports] and then we’ll use cloud technologies where the data is not time-sensitive.”

Laybourne’s IT team is working with Microsoft to move cloud data to the edge, where containers are removed from ships by automated cranes and transferred to predefined locations in the port. To date, Laybourne and his team have migrated about 40% of APM Terminals’ cloud data to the edge, with a target of 80% by the end of 2023 at all operated terminals. Maersk has also been working with AI pioneer Databricks to develop algorithms to make its IoT devices and automated processes smarter. The company’s data scientists have built machine learning models in-house to improve safety and identify cargo. Data scientists will someday up the ante with advanced models to make all processes autonomous.

Read more at CIO

Solution Accelerator: Multi-factory Overall Equipment Effectiveness (OEE) and KPI Monitoring

📅 Date:

✍️ Authors: Jeffery Annor, Tarik Boukherissa, Bala Amavasai

🔖 Topics: Manufacturing Analytics

🏢 Organizations: Databricks


The Databricks Lakehouse provides an end-to-end data engineering, serving, ETL, and machine learning platform that enables organizations to accelerate their analytics workloads by automating the complexity of building and maintaining analytics pipelines through open architecture and formats. This makes it easier to connect high-velocity Industrial IoT data, ingested through standard protocols and services like MQTT, Kafka, Event Hubs, or Kinesis, to external datasets such as ERP systems, allowing manufacturers to converge their IT/OT data infrastructure for advanced analytics.

Using a Delta Live Tables pipeline, we leverage the medallion architecture to ingest data from multiple sensors in a semi-structured format (JSON) into our bronze layer, where data is replicated in its native format. The silver layer transformations parse the key fields that must be extracted from the sensor data and structured for subsequent analysis, and ingest the preprocessed workforce data from ERP systems needed to complete the analysis. Finally, the gold layer aggregates sensor data using Structured Streaming stateful aggregations, calculates OT metrics such as OEE and TA (technical availability), and combines the aggregated metrics with workforce data based on shifts, allowing for IT-OT convergence.
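The bronze, silver, and gold flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the accelerator's Delta Live Tables code: the field names, the sample messages, and the `workforce` lookup are all hypothetical, and OEE is computed with the commonly used definition availability × performance × quality.

```python
import json
from collections import defaultdict

# Bronze: hypothetical sensor messages kept exactly as received (JSON strings).
bronze = [
    '{"machine": "press-1", "shift": "A", "planned_min": 240, "run_min": 200, '
    '"ideal_units_per_min": 2.5, "units": 450, "good_units": 430}',
    '{"machine": "press-1", "shift": "A", "planned_min": 240, "run_min": 200, '
    '"ideal_units_per_min": 2.5, "units": 450, "good_units": 425}',
]

# Hypothetical preprocessed workforce data from an ERP system, keyed by shift.
workforce = {"A": {"operators": 4}}


def to_silver(raw_message):
    """Parse one bronze message and keep only the fields needed downstream."""
    rec = json.loads(raw_message)
    return {
        "machine": rec["machine"],
        "shift": rec["shift"],
        "planned_min": rec["planned_min"],
        "run_min": rec["run_min"],
        # Units the machine would produce at its ideal rate while running.
        "ideal_units": rec["run_min"] * rec["ideal_units_per_min"],
        "units": rec["units"],
        "good_units": rec["good_units"],
    }


def to_gold(silver_rows, workforce):
    """Aggregate per machine and shift, compute OEE, and join workforce data."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in silver_rows:
        key = (row["machine"], row["shift"])
        for field in ("planned_min", "run_min", "ideal_units", "units", "good_units"):
            totals[key][field] += row[field]

    gold = {}
    for (machine, shift), t in totals.items():
        availability = t["run_min"] / t["planned_min"]
        performance = t["units"] / t["ideal_units"]
        quality = t["good_units"] / t["units"]
        gold[(machine, shift)] = {
            "oee": availability * performance * quality,
            "operators": workforce[shift]["operators"],
        }
    return gold


silver = [to_silver(message) for message in bronze]
gold = to_gold(silver, workforce)
print(gold)
```

With the sample data, availability is 400/480, performance is 900/1000, and quality is 855/900, so the OEE for press-1 on shift A comes out at roughly 0.71. In a streaming pipeline the same aggregation would run incrementally over time windows rather than over an in-memory list.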

Read more at Databricks Blog

Part Level Demand Forecasting at Scale

📅 Date:

✍️ Authors: Max Kohler, Pawarit Laosunthara, Bryan Smith, Bala Amavasai

🔖 Topics: Demand Planning, Production Planning, Forecasting

🏢 Organizations: Databricks


The challenges of demand forecasting include ensuring the right granularity, timeliness, and fidelity of forecasts. Due to limitations in computing capability and the lack of know-how, forecasting is often performed at an aggregated level, reducing fidelity.

In this blog, we demonstrate how our Solution Accelerator for Part Level Demand Forecasting helps your organization forecast at the part level, rather than at the aggregate level, using the Databricks Lakehouse Platform. Part-level demand forecasting is especially important in discrete manufacturing, where manufacturers are at the mercy of their supply chain. This is because the constituent parts of a discretely manufactured product (e.g. cars) depend on components provided by third-party original equipment manufacturers (OEMs). The goal is to map the forecasted demand values for each SKU to quantities of the raw materials (the input of the production line) that are needed to produce the associated finished product (the output of the production line).
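The mapping from forecasted SKU demand to raw material quantities can be illustrated with a small sketch. It assumes a single-level bill of materials; the SKU names, material names, and quantities are hypothetical, and a multi-level product structure would need a traversal of the full parts hierarchy instead.

```python
from collections import defaultdict

# Hypothetical forecasted demand (units) per finished-product SKU.
forecast = {"SKU-A": 100, "SKU-B": 40}

# Hypothetical single-level bill of materials: raw material quantity per unit of SKU.
bill_of_materials = {
    "SKU-A": {"steel_kg": 2.0, "bolt": 8},
    "SKU-B": {"steel_kg": 3.5, "bolt": 12, "gasket": 1},
}


def raw_material_demand(forecast, bill_of_materials):
    """Sum, per raw material, the quantity needed to build the forecasted units."""
    demand = defaultdict(float)
    for sku, units in forecast.items():
        for material, qty_per_unit in bill_of_materials[sku].items():
            demand[material] += units * qty_per_unit
    return dict(demand)


material_demand = raw_material_demand(forecast, bill_of_materials)
print(material_demand)
```

Here both SKUs draw on the same steel and bolts, so their requirements add up across products, which is why the forecast has to be made per part rather than per aggregate.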

Read more at Databricks Blog

Using MLflow to deploy Graph Neural Networks for Monitoring Supply Chain Risk

📅 Date:

🏢 Organizations: Databricks


We live in an ever more interconnected world, and nowhere is this more evident than in modern supply chains. Due to the global macroeconomic environment and globalisation, modern supply chains have become intricately linked and woven together. Companies worldwide rely on one another to keep their production lines flowing and to act ethically (e.g., complying with laws such as the Modern Slavery Act). From a modelling perspective, the procurement relationships between firms form an intricate, dynamic, and complex network spanning the globe.

GNNs are a framework for defining deep learning algorithms over graph-structured data. For this blog, we will utilise a specific GNN architecture called GraphSAGE. This algorithm does not require all nodes to be present during training, is able to generalise to new nodes efficiently, and can scale to billions of nodes. Earlier methods in the literature were transductive, meaning that the algorithms learned embeddings for nodes. This was useful for static graphs, but the algorithms had to be re-run after graph updates such as new nodes. Unlike those methods, GraphSAGE is an inductive framework which learns how to aggregate information from neighbouring nodes; i.e., it learns functions for generating embeddings, rather than learning embeddings directly. Therefore, GraphSAGE lets us seamlessly integrate new supply chain relationships retrieved from upstream processes without triggering costly retraining routines.
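The inductive idea can be made concrete with a minimal sketch of one GraphSAGE-style layer using a mean aggregator. It is a simplification under loud assumptions: the weights are hand-picked rather than learned, there is no neighbour sampling, normalisation, or layer stacking, and the firm names and features are hypothetical. The point is only that an embedding comes from applying a function to a node's features and its neighbours' features, so a node unseen at training time can be embedded without retraining.

```python
def mean_vector(vectors, dim):
    """Element-wise mean of a list of vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def sage_layer(node, features, neighbours, weights):
    """One mean-aggregator layer: ReLU(W . concat(self, mean(neighbours)))."""
    self_vec = features[node]
    neighbour_vec = mean_vector(
        [features[n] for n in neighbours.get(node, [])], len(self_vec)
    )
    combined = self_vec + neighbour_vec  # list concatenation
    return [max(0.0, sum(w * x for w, x in zip(row, combined))) for row in weights]


# Hypothetical weights standing in for parameters learned during training (2 x 4).
weights = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]

# Hypothetical firms already in the supply chain graph, with 2-dimensional features.
features = {"supplier_a": [1.0, 0.0], "supplier_b": [0.0, 1.0]}
neighbours = {"supplier_a": ["supplier_b"], "supplier_b": ["supplier_a"]}

# A new firm arrives from an upstream process: add its features and edges only.
features["new_firm"] = [1.0, 1.0]
neighbours["new_firm"] = ["supplier_a", "supplier_b"]

# The same layer function embeds the unseen node; the weights are untouched.
new_embedding = sage_layer("new_firm", features, neighbours, weights)
print(new_embedding)
```

A transductive method would instead store one embedding per known node and would have nothing to look up for `new_firm` until it was re-run.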

Read more at Ajmal Aziz on Medium

Optimizing Order Picking to Increase Omnichannel Profitability with Databricks

📅 Date:

✍️ Authors: Peyman Mohajerian, Bryan Smith

🔖 Topics: BOPIS, Operations Research

🏢 Organizations: Databricks


The core challenge most retailers are facing today is not how to deliver goods to customers in a timely manner, but how to do so while retaining profitability. It is estimated that margins are reduced by 3 to 8 percentage points on each order placed online for rapid fulfillment. The cost of sending a worker to store shelves to pick the items for each order is the primary culprit, and with the cost of labor only rising (and customers expressing little interest in paying a premium for what are increasingly seen as baseline services), retailers are feeling squeezed.

But by parallelizing the work, the days or even weeks often spent evaluating an approach can be reduced to hours or even minutes. The key is to identify discrete, independent units of work within the larger evaluation set and then to leverage technology to distribute these across a large computational infrastructure. In the picking optimization explored above, each order represents such a unit of work, as the sequencing of the items in one order has no impact on the sequencing of any others. At the extreme end of things, we might execute optimizations on all 3.3 million orders simultaneously to perform our work incredibly quickly.
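Because each order is an independent unit of work, the pattern can be sketched with a worker pool that maps one function over all orders. Everything specific here is an assumption for illustration: the one-dimensional shelf positions, the entrance position, and the greedy nearest-neighbour heuristic stand in for whatever optimization is being evaluated, and a local thread pool stands in for a distributed cluster.

```python
from concurrent.futures import ThreadPoolExecutor

ENTRANCE = 20  # hypothetical starting position of the picker along a single aisle


def pick_sequence(order):
    """Greedy nearest-neighbour sequencing of one order's shelf positions."""
    remaining = list(order["shelf_positions"])
    position, route = ENTRANCE, []
    while remaining:
        # Always walk to the closest shelf that has not been visited yet.
        nearest = min(remaining, key=lambda shelf: abs(shelf - position))
        remaining.remove(nearest)
        route.append(nearest)
        position = nearest
    return order["order_id"], route


# Hypothetical orders; no order's route depends on any other order.
orders = [
    {"order_id": 1, "shelf_positions": [40, 5, 22]},
    {"order_id": 2, "shelf_positions": [9, 3, 30]},
]

# Distribute the independent units of work across the pool's workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    routes = dict(pool.map(pick_sequence, orders))

print(routes)
```

The thread pool only shows the shape of the approach. For CPU-bound optimization at the scale of millions of orders, the same per-order function would be handed to a process pool or a distributed engine so that the work actually runs in parallel.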

Read more at Databricks Blog

Virtualitics’ integration with Databricks sorts out what’s under the surface of your data lake

📅 Date:

🏢 Organizations: Virtualitics, Databricks


Databricks users can benefit from Virtualitics’ multi-user interface because it can enable hundreds more people across the business to get value from complex datasets, instead of a small team of expert data scientists. Analysts and citizen data scientists can do self-serve data exploration, querying large datasets by typing in a question and following AI-guided exploration instead of writing lines of code. Business decision makers get their hands on AI-generated insights that can help them take smart, predictive actions.

Read more at Virtualitics Blog

How to Build Scalable Data and AI Industrial IoT Solutions in Manufacturing

📅 Date:

✍️ Authors: Bala Amavasai, Vamsi Krishna Bhupasamudram, Ashwin Voorakkara

🏢 Organizations: Databricks, Tredence


Unlike traditional data architectures, which are IT-based, in manufacturing there is an intersection between hardware and software that requires an OT (operational technology) architecture. OT has to contend with processes and physical machinery. Each component and aspect of this architecture is designed to address a specific need or challenge when dealing with industrial operations.

The Databricks Lakehouse Platform is ideally suited to manage large amounts of streaming data. Because it is built on the foundation of Delta Lake, you can work with the large quantities of data streams delivered in small chunks from multiple sensors and devices, while providing ACID compliance and eliminating job failures compared to traditional warehouse architectures. The Lakehouse platform is designed to scale with large data volumes. Manufacturing produces multiple data types, both semi-structured (JSON, XML, MQTT, etc.) and unstructured (video, audio, PDF, etc.), all of which the platform pattern fully supports. By merging all these data types onto one platform, only one version of the truth exists, leading to more accurate outcomes.

Read more at Databricks Blog