Using analytics and machine learning (ML) to better understand your customers has become an everyday practice in any data-driven enterprise. The good news is that organizations have large data volumes at their disposal to build and train ML models. The not-so-good news is that data science and engineering teams face a series of challenges that prevent them from being as productive as possible. These challenges include complex data access methods, the need to migrate and/or reformat data before analysis can begin, and the adoption of different operating models. But the challenges don’t stop there.
One of the biggest challenges is that data engineers, analytic users, and data scientists come to their jobs with completely different points of view on data, which means they have different goals and use different tools (as shown in Figure 1). Data engineers take data and build pipelines to create the connection points for the other two personas. They prefer tools based on open-source technology to innovate faster while reducing lock-in to proprietary technology stacks. Analytic users live in a SQL-based world, so they prefer tools like PrestoSQL, along with Apache Spark on Kubernetes as an agnostic platform that can deploy any application or framework into any environment or infrastructure. Data scientists also build data pipelines, but they approach the work in two different ways: senior data scientists prefer Jupyter notebooks, PyTorch, and Apache Spark, while citizen data scientists prefer pre-integrated solution stacks.
Another big challenge is where the data resides. There are data centers, clouds (existing or future), and edge locations – all of which have their own infrastructure and services with specific access paths that can disrupt both productivity and established application and persona access patterns.
How do you begin to solve these challenges? Without getting into specific implementations, let’s agree on a few key principles.
- A solution should provide a unified platform that increases productivity through a simple and secure data experience. This doesn’t mean that your data has to move to a single location. But it does mean that each persona can use a self-service app store to download the libraries, pre-configured templates, or certified ISV solutions they want, with single-click download and deployment.
- Automate everything end-to-end including provisioning of tools/libraries/frameworks so teams can get to work quickly.
- Simplify data access through a converged file and object system, known as a data fabric, that abstracts the underlying infrastructure to reduce complexity. The data fabric should support files, objects, streams, and databases, ingesting and transforming the data into a single, persistent data store.
- Have an open-source foundation that allows data science teams to pick up and drop their work onto any infrastructure: on premises, cloud, or edge.
HPE Ezmeral delivers a secure, unified analytics platform that is optimized for on-premises, edge, and cloud deployments to deliver frictionless access to data. The integrated app store (Figure 2) enables one-click download and deployment of opinionated stacks and certified ISV solutions, or allows you to build or bring your own open-source tools and stacks, all supported by HPE 24×7.
HPE’s integrated data fabric enables direct access across hybrid/multicloud environments through both open-source and standard interfaces. Accessing data using the native S3 API, NFS, HDFS, POSIX, or CSI reduces the need to change existing access methods for applications or users. HPE’s data fabric abstracts the underlying infrastructure – this means you can access data on bare metal, cloud, on-premises, or edge to reduce complexity and create a bridge that allows traditional and modern applications and processes to securely access the same datasets on the same system.
The app store experience boosts the data science and engineering team’s productivity by deploying native best-of-breed open-source tools, libraries, and frameworks out-of-the box, such as Apache Spark 3.x Operator on Kubernetes, Delta Lake, Hive, or Thrift. If users are using older versions of Apache Spark, HPE Ezmeral can accommodate multiple versions running concurrently. If you prefer a different toolset, utilize the built-in app workbench to build or bring your own open-source stacks.
HPE Ezmeral Unified Analytics addresses the key pain points of data analytics. It delivers high performance, cost efficiency, and a secure and unified data experience to connect to data wherever it exists. The open-source foundation means you can move to a modern analytics platform without refactoring or moving data, and without accumulating additional technical debt.
Read the new solution brief, Modernize Data Analytics, to learn more. Or visit HPE Ezmeral software.
Original post: https://www.cio.com/article/189472/how-to-simplify-your-approach-to-data-analytics.html