Dataframe-style coding with analytical SQL databases
Fig. 5: If there are insufficient computer resources, the notebook kernel stops (Jupyter Notebook example). © H. Erb
As shown in the previous notebook examples, when working with a notebook client, data analysts can execute their own SQL queries directly against a powerful database - just like with any other SQL editor - and thus analyze or modify very large data sets. For interactive data analysis with Python, Pandas was already mentioned in the previous section. Pandas is one of the most widely used open-source Python libraries and has become the de facto standard for data analysts and data scientists thanks to its intuitive data structures and extensive APIs. Pandas uses in-memory processing and is ideal for small to medium-sized data sets (low gigabyte range). However, Pandas' ability to process large amounts of data is limited by out-of-memory errors. Fig. 5 shows such a situation, which arose during a simulation with a Jupyter notebook (Python kernel).
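A minimal sketch of this workflow in a notebook cell, assuming a SQLAlchemy-compatible connection to the analytical database (the connection string, table, and column names are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string - replace with the analytical database in use
engine = create_engine("postgresql://analyst:secret@dwh-host:5432/sales")

# The entire result set is materialized in the notebook's memory as a DataFrame.
# For very large tables this is exactly where out-of-memory errors (Fig. 5) occur.
df = pd.read_sql(
    "SELECT customer_id, revenue FROM orders WHERE order_year = 2023",
    engine,
)

# Typical in-memory Pandas work: aggregation per customer
revenue_per_customer = df.groupby("customer_id")["revenue"].sum()
print(revenue_per_customer.head())
```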
The motivation for using high-performance server backends has not fundamentally changed: the processing times for data transformations, aggregations, customer segmentation, etc. take too long or are not possible at all due to the high data volume. While an analytical database is usually available for processing SQL-based jobs, an alternative must be found for Python/Pandas that offers, for example, a comparable user experience for working with data frames. There are a number of alternatives to Pandas: one is Apache Spark, another could be the data warehouse database in use, provided it has a corresponding client API for Python.
In the field of file-based data lakes, Apache Spark was and still is a compute engine for distributed computing that is part of so-called "big data platforms". Typical Spark use cases include the preparation or analysis of large amounts of data, in which the dataset to be processed is distributed across all computer nodes of a Spark compute cluster and processed in parallel. PySpark provides the Python interface to Apache Spark [10]. PySpark is well suited to processing large amounts of data, but learning the new PySpark syntax and refactoring code from Pandas to PySpark can be a bit tedious. However, a solution is in sight: with Spark version 3.2 and higher, the Pandas API is available for Spark, enabling distributed processing with Spark while using the familiar Pandas syntax [11].
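A brief illustration of the pandas-on-Spark API, assuming Spark 3.2 or later (the file path and column names are hypothetical):

```python
import pyspark.pandas as ps

# Requires Spark >= 3.2. The DataFrame is partitioned across the Spark cluster
# instead of being held in a single Python process.
psdf = ps.read_parquet("/data/orders_2023.parquet")

# Familiar Pandas syntax, executed as distributed Spark jobs under the hood
revenue_per_customer = (
    psdf.groupby("customer_id")["revenue"]
        .sum()
        .sort_values(ascending=False)
)
print(revenue_per_customer.head())
```

The code mirrors the Pandas example above almost line for line, which is the main appeal: existing Pandas know-how carries over, while the heavy lifting is pushed to the Spark cluster.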