Artificial intelligence (AI), one of the most transformative technologies in recent memory, is becoming increasingly pervasive. One of its key uses is making sense of the data being gathered from all sources. “Data is like money,” says Fermin Serna, chief security officer at Databricks, who adds that companies are looking at how to extract business value from their data.
However, deriving value from that data can be challenging. “[For instance, organisations might not know which platform] to trust [with their data], or if they should use a single platform or multiple platforms for different use cases,” he tells DigitalEdge. This is where Databricks, a data and AI company, comes in: it brings data warehouses and data lakes together under a single lakehouse architecture. A data lakehouse is a system that can turn raw data into usable, meaningful data.
Patrick Wendell, Databricks’ co-founder and vice president of engineering, highlights that the company is “particularly good” at incorporating its clients’ enterprise data. Very few companies currently offer both a data platform and a machine learning (ML) platform. “Most vendors either just have the parts that are good at managing data but they’re not really deep on machine learning, or they’re a great machine learning company but they don’t really have a good data platform. What we’ve found is that you really need both — that’s how you build ML applications, enrich those models with your data, and continuously retrain those models as your data gets updated,” he says.
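As a concrete illustration of the loop Wendell describes (a minimal sketch, not Databricks’ actual tooling), the snippet below retrains a model whenever a fresh snapshot of enterprise data lands. The file name, schema and model choice are all hypothetical:

```python
# Minimal "retrain as your data gets updated" loop (illustrative only).
# The CSV stands in for a governed enterprise table; its schema is hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def load_latest_snapshot() -> pd.DataFrame:
    # In practice this would read from a data platform (e.g. a lakehouse table).
    return pd.read_csv("customer_features.csv")  # hypothetical export

def retrain() -> LogisticRegression:
    df = load_latest_snapshot()
    X, y = df.drop(columns=["label"]), df["label"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
    return model

model = retrain()  # rerun on a schedule, or whenever the snapshot is refreshed
```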
Concerns around LLMs
There is heightened interest in large language models (LLMs) today, thanks in part to ChatGPT’s popularity. LLMs are a class of AI models trained on massive troves of data to produce human-like responses to natural language queries. As the basis of generative AI applications like ChatGPT, LLMs can generate and classify text, answer questions conversationally, translate text between languages, and more.
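Those capabilities are typically consumed through simple programmatic interfaces. A minimal sketch using the open-source Hugging Face transformers library; the model choices here are illustrative and unrelated to any vendor in this article:

```python
# Two of the LLM capabilities described above, via off-the-shelf pipelines.
# Requires: pip install transformers torch
from transformers import pipeline

# Text generation with a small open model (illustrative choice).
generator = pipeline("text-generation", model="gpt2")
print(generator("A data lakehouse is", max_new_tokens=40)[0]["generated_text"])

# Text classification uses the same interface with a different task name.
classifier = pipeline("sentiment-analysis")
print(classifier("The onboarding experience was painless."))
```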
Wendell notes that one of the biggest challenges in implementing and deploying LLMs is ensuring such models are safe. “This means things like saying inappropriate things to customers, and so forth. I think the minimum bar is [to ensure those LLMs aren’t] violating the reputation of a brand if they’re putting them directly in the face of their customers,” he says.
The next set of concerns is ensuring those LLMs are accurate and able to understand the company’s data. When asked about bias in LLMs, Wendell notes that it is “early days” for how LLMs are being aligned. “What companies need to do is to take their business data and teach the models how to understand those data,” he explains, adding that this can be done in several ways: taking a general model and tweaking it to suit the company’s needs, or building a model entirely from scratch using the company’s proprietary data, which is its key asset.
“The challenge these companies have is, how can they take advantage of that data while protecting it? Because they don’t want to give that data away [since] it’s their secret sauce,” he muses.
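Neither quote spells out a recipe, but the first option Wendell mentions, tweaking a general model, commonly means fine-tuning an open model inside the company’s own environment so the “secret sauce” never leaves it. A compressed sketch using the open-source Hugging Face libraries; the base model, data file and hyperparameters are illustrative placeholders:

```python
# Fine-tuning a general-purpose model on proprietary text (sketch only).
# "company_docs.txt" and every hyperparameter below are hypothetical.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative base model
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

data = load_dataset("text", data_files={"train": "company_docs.txt"})
tokenized = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tuned-model", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the tuned checkpoint stays inside the company's environment
```

Because both the base model and the tuned checkpoint stay in-house, the proprietary data is never handed to an external model provider.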
Privacy matters
Since AI runs on data, its usage raises concerns around privacy and security. According to Serna, keeping things simple is the best way to address those concerns. “We want one single platform for data and AI use cases,” he says. “If you trust 10 different platforms, then your exposure to data breaches and the complexity to govern expands 10 times.”
In that sense, opting for a lakehouse like the one Databricks offers makes more sense than platforms such as data warehouses, which require a copy of the customer’s data. “Whenever you give a copy of the data to someone else, you lose control. You need to trust their policies, cybersecurity efforts, etc,” says Serna.
“The beauty of a data lakehouse, [or] the way that Databricks implements it, is that we use your data wherever [it resides]. We don’t copy the data and move it to our cloud accounts. We process the data wherever it is… [which] makes the cybersecurity team’s lives super easy,” he adds.
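In Spark-based platforms, processing data “wherever it is” typically means the query engine reads files directly from the customer’s own cloud storage rather than ingesting a copy. A hedged PySpark sketch; the bucket path is a placeholder and credential and governance configuration is omitted:

```python
# Reading data in place from customer-owned cloud storage (sketch only).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-place-analytics").getOrCreate()

# No copy is ingested: the engine reads the files where they already live.
trips = spark.read.parquet("s3://customer-owned-bucket/trips/")  # placeholder path
trips.groupBy("vehicle_id").count().show()
```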
Having all the data in a single place is a strategy Serna believes in. The chief security officer likens the security process to an onion, where there are multiple layers of security perimeters. “If you have one onion, you keep peeling and peeling, and at the core of the onion is the data. But if you have 10 onions, your attention will go to 10 different places and you may not do such a good job.”
Simplicity also seems to be a growing trend, according to Serna. “[Our Asia Pacific customers are not just looking for] security, but also simplicity. They don’t want to talk to 10 different people; they want to consolidate, consider costs as well as security and governance.”
Data privacy is a big focus at Databricks due to the nature of its business. “People use Databricks to manage the repository of all their data, [including] very critical and sensitive information. We spend probably 40% of our resources on some flavours of data security, data protection, access control, and so forth,” says Wendell.
As AI and ML progress, Wendell notes that many companies are worried that their data will be used by bigger organisations to train their models. “Enterprises need to be very careful about making sure the way they consume these models [will] not expose their own data to other potential long-term competitors for training,” he says.
Meanwhile, the most important thing, he says, is that LLMs give companies the flexibility to train their models to take on a personality that represents their brand.
One of Databricks’ clients is GetGo, a Singapore car-sharing service. According to GetGo’s chief technology officer Malik Badaruddin and its head of data science and engineering Maximilian Yap, Databricks’ tools are key to helping the company understand the data it collects, which ranges from booking transactions and app usage to fuel levels and driver behaviour.
However, the pair note that before a company begins using AI, it must ensure its data quality meets a certain standard, otherwise the results will not be as effective. “Machine learning is about statistical learning. You need enough data to build something. The question is how much is enough,” notes Yap.
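One rough but common way to answer “how much is enough” is a learning curve: if validation scores are still climbing as the training set grows, more data would likely help. A toy sketch on synthetic data:

```python
# Learning-curve check on synthetic data (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
sizes, _, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# If accuracy is still rising at the largest size, more data would likely help.
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} samples -> validation accuracy {score:.3f}")
```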
Today, using AI to sift through its data has helped GetGo save time on issues such as fuel theft. Previously, the company manually tracked monthly fuel transactions and cross-referenced them with the usage of its cars; AI has shortened the process by flagging suspicious users, many of whom have since been weeded out of the system, says Malik.
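GetGo’s actual model is not described, but the manual process it replaced, cross-referencing fuel transactions against car usage, maps naturally onto a simple rule-based screen. A purely hypothetical pandas sketch, with invented column names, consumption rate and threshold:

```python
# Hypothetical fuel-theft screen: flag bookings where fuel purchased far
# exceeds what the distance driven could plausibly consume.
# Column names, consumption rate and tolerance are all invented.
import pandas as pd

bookings = pd.read_csv("bookings.csv")  # hypothetical export
LITRES_PER_KM = 0.10                    # assumed average consumption
TOLERANCE = 1.5                         # allow 50% headroom before flagging

expected = bookings["distance_km"] * LITRES_PER_KM
suspicious = bookings[bookings["fuel_purchased_litres"] > expected * TOLERANCE]
print(suspicious[["user_id", "distance_km", "fuel_purchased_litres"]])
```

In practice, an ML model would replace the fixed threshold, but the inputs are the same ones GetGo describes: transactions cross-referenced with car usage.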
Global Malaysian energy group Petroliam Nasional Bhd (Petronas) is another organisation that has benefitted from Databricks’ solutions. At a breakout session at Databricks’ Data and AI Summit 2023 in June, a Petronas spokesperson shared that the Fortune Global 500 company uses data to create new value and sustain business growth, in a bid to stay ahead and remain relevant as a disruptor.
On Aug 23, Petronas’ digital arm, Petronas Digital, signed a memorandum of understanding (MOU) with Databricks to drive data and AI initiatives. Under the MOU, both parties plan to build generative AI models specific to Petronas’ needs, deploy the company’s ESG analytics use cases, train and develop talent, and provide access to Databricks’ cutting-edge technologies and latest data science tools through the Databricks University Alliance. Databricks will also help Petronas “monetise data assets and AI models to a large, open ecosystem of data consumers, all from a single platform”.