Python Data Science Handbook by Jake VanderPlas: PDF download
For many researchers, Python is a first-class tool mainly because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the Python Data Science Handbook do you get them all: IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related tools. Working scientists and data crunchers familiar with reading and writing Python code will find it a comprehensive reference for tackling day-to-day data science tasks.

The chapters ahead cover Introduction to NumPy, Data Manipulation with Pandas, Visualization with Matplotlib, and Machine Learning. This is a book about doing data science with Python, which immediately begs the question: what is data science?

In my mind, these critiques miss something important. Data science, despite its hype-laden veneer, is perhaps the best label we have for the cross-disciplinary set of skills that are becoming increasingly important in many applications across industry and academia. With this in mind, I would encourage you to think of data science not as a new domain of knowledge to learn, but as a new set of skills that you can apply within your current area of expertise.

Who Is This Book For? Why Python? Python has emerged over the last couple of decades as a first-class tool for scientific computing tasks, including the analysis and visualization of large datasets.

This may have come as a surprise to early proponents of the Python language: the language itself was not specifically designed with data analysis or scientific computing in mind.

If you are looking for a guide to the Python language itself, I would suggest the sister project to this book, A Whirlwind Tour of the Python Language. Python 2 Versus Python 3: This book uses the syntax of Python 3, which contains language enhancements that are not compatible with the 2.x series of the language.

Though Python 3 was slow to be adopted at first, stable releases of the most important tools in the data science ecosystem have for some years been fully compatible with both Python 2 and 3, and so this book will use the newer Python 3 syntax. However, the vast majority of code snippets in this book will also work without modification in Python 2: in cases where a Py2-incompatible syntax is used, I will make every effort to note it explicitly. Outline of This Book: Each chapter of this book focuses on a particular package or tool that contributes a fundamental piece of the Python data science story.

IPython and Jupyter (Chapter 1): these packages provide the computational environment in which many Python-using data scientists work. NumPy (Chapter 2): this library provides the ndarray object for efficient storage and manipulation of dense data arrays in Python.

Matplotlib (Chapter 4): this library provides capabilities for a flexible range of data visualizations in Python. The PyData world is certainly much larger than these five packages, and is growing every day.

With this in mind, I make every attempt through these pages to provide references to other interesting efforts, projects, and packages that are pushing the boundaries of what can be done in Python.

Using Code Examples: Supplemental material (code examples, figures, etc.) is available for download. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. For example, writing a program that uses several chunks of code from this book does not require permission. Answering a question by citing this book and quoting example code does not require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. Copyright Jake VanderPlas. Installation Considerations: Installing Python and the suite of libraries that enable scientific computing is straightforward. This section will outline some of the considerations to keep in mind when setting up your computer. Though there are various ways to install Python, the one I would suggest for use in data science is the Anaconda distribution, which works similarly whether you use Windows, Linux, or Mac OS X.

Because of the size of this bundle, expect the installation to consume several gigabytes of disk space. Any of the packages included with Anaconda can also be installed manually on top of Miniconda; for this reason I suggest starting with Miniconda.

Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. Constant width bold Shows commands or other text that should be typed literally by the user.

My answer sometimes surprises people: my preferred environment is IPython plus a text editor (in my case, Emacs or Atom, depending on my mood). The IPython notebook is actually a special case of the broader Jupyter notebook structure, which encompasses notebooks for Julia, R, and other programming languages. IPython is about using Python effectively for interactive scientific and data-intensive computing.

This chapter will start by stepping through some of the IPython features that are useful to the practice of data science, focusing especially on the syntax it offers beyond the standard features of Python.

Finally, we will touch on some of the features of the notebook that make it useful in understanding data and sharing results. The bulk of the material in this chapter is relevant to both, and the examples will switch between them depending on what is most convenient. In the few sections that are relevant to just one or the other, I will explicitly state that fact. Before we start, some words on how to launch the IPython shell and IPython notebook. Launching the IPython Shell This chapter, like most of this book, is not designed to be absorbed passively.

Once you do this, you should see a prompt reporting the IPython version, followed by an In [1]: input prompt. Launching the Jupyter Notebook: The Jupyter notebook is a browser-based graphical interface to the IPython shell, and builds on it a rich set of dynamic display capabilities. Furthermore, these documents can be saved in a way that lets other people open them and execute the code on their own systems. Upon issuing the command, your default browser should automatically open and navigate to the listed local URL; the exact address will depend on your system.

Help and Documentation in IPython If you read no other section in this chapter, read this one: I find the tools discussed here to be the most transformative contributions of IPython to my daily workflow. While web searches still play a role in answering complicated questions, an amazing amount of information can be found through IPython alone. What arguments and options does it have?

What attributes or methods does this object have? Accessing Documentation with ?: The Python language and its data science ecosystem are built with the user in mind, and one big part of that is access to documentation.

Python has a built-in help function that can access this information and print the results. Depending on your interpreter, this information may be displayed as inline text, or in some separate pop-up window.

Because finding help on an object is so common and useful, IPython introduces the ? character as a shorthand for accessing this documentation and other relevant information. This quick access to documentation via docstrings is one reason you should get in the habit of always adding such inline documentation to the code you write! Accessing Source Code with ??: IPython also provides a shortcut to an object's source code with the double question mark. Sometimes the ?? suffix displays only the docstring; this is generally because the object in question is not implemented in Python but in C or another compiled extension language. If this is the case, the ?? suffix gives the same output as the ? suffix. Tab completion of object contents: Every Python object has various attributes and methods associated with it. Like with the help function discussed before, Python has a built-in dir function that returns a list of these, but the tab-completion interface is much easier to use in practice.
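In plain Python (outside IPython), the same information is reachable through help, the __doc__ attribute, and dir; a minimal sketch:

```python
# Plain-Python counterparts of IPython's ? and Tab-completion helpers.
def square(a):
    """Return the square of a."""
    return a ** 2

# The ? suffix reads the same docstring that lives in __doc__
# (help(square) would pretty-print it).
print(square.__doc__)

# dir() returns the attribute list that tab completion draws from;
# here we mimic the "string method containing 'find'" search.
print([name for name in dir(str) if "find" in name])
```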

For example, to see the attributes of a list L, type L. and press the Tab key: matching attributes are listed, typing more characters narrows the list, and a unique match is completed instantly. Similarly, suppose we are looking for a string method that contains the word find somewhere in its name; wildcard matching with the * character (for example, str.*find*?) will list every name that matches.

Also, while some of these shortcuts do work in the browser-based notebook, this section is primarily about shortcuts in the IPython shell.

The most immediately useful of these are the commands to delete entire lines of text.

Backspace: delete previous character in line
Ctrl-d: delete next character in line
Ctrl-k: cut text from cursor to end of line
Ctrl-u: cut text from beginning of line to cursor
Ctrl-y: yank (i.e., paste) text that was previously cut

The most straightforward way to access your command history is with the up and down arrow keys to step through it, but other options exist as well:

Ctrl-p (or the up arrow key): access previous command in history
Ctrl-n (or the down arrow key): access next command in history
Ctrl-r: reverse-search through command history

The reverse-search can be particularly useful.

Recall that in the previous section we defined a function called square. At any point, you can add more characters to refine the search, or press Ctrl-r again to search further for another command that matches the query.

That is, if you type def and then press Ctrl-p, it would find the most recent command (if any) in your history that begins with the characters def. While some of the shortcuts discussed here may seem a bit tedious at first, they quickly become automatic with practice.

Once you develop that muscle memory, I suspect you will even find yourself wishing they were available in other contexts. These magic commands are designed to succinctly solve various common problems in standard data analysis.

A common case is that you find some example code on a website and want to paste it into your interpreter. Rather than running this code in a new window, it can be convenient to run it within your IPython session; the %paste and %cpaste magic functions are designed for exactly this. Help on magic functions: like normal Python functions, IPython magic functions have docstrings, and appending ? (for example, %paste?) displays this documentation. Documentation for other functions can be accessed similarly.

The latter may be surprising, but makes sense if you consider that print is a function that returns None; for brevity, any command that returns None is not added to Out. Where this can be useful is if you want to interact with past results. In this case, using these previous results probably is not necessary, but it can become very handy if you execute a very expensive computation and want to reuse the result!

The easiest way to suppress the output of a command is to add a semicolon to the end of the line; the result is then computed silently, and the output is neither displayed nor stored in Out. For more information, I suggest exploring these using the ? help functionality. The magic happens with the exclamation point: anything appearing after ! on a line will be executed not by the Python kernel, but by the system command line.

The shell is a way to interact textually with your computer. Someone unfamiliar with the shell might ask why you would bother with this, when you can accomplish many results by simply clicking on icons and menus. A shell user might reply with another question: why hunt icons and click menus when you can accomplish things much more easily by typing? While it might sound like a typical tech preference impasse, when moving beyond basic tasks it quickly becomes clear that the shell offers much more control of advanced tasks, though admittedly the learning curve can intimidate the average computer user.

Here we're moving the file myproject. Note that with just a few commands (pwd, ls, cd, mkdir, and cp) you can do many of the most common file operations. For example, the ls, pwd, and echo commands can be run from IPython by prefixing them with !. Values returned from the shell come back as a special IPython type called SList, which looks and acts a lot like a Python list, but has additional functionality, such as the grep and fields methods and the s, n, and p properties that allow you to search, filter, and display the results in convenient ways.
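The ! syntax is IPython-only; in a standalone Python script, the closest standard-library analog is the subprocess module. A sketch (assuming a POSIX-style echo command is available):

```python
import subprocess

# Capture a command's output as a list of lines, similar to
# contents = !ls in IPython (which returns an SList).
result = subprocess.run(["echo", "hello from the shell"],
                        capture_output=True, text=True)
lines = result.stdout.splitlines()
print(lines)
```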

The exception-reporting mode can be controlled with the %xmode magic function, which takes a single argument, the mode; the three possibilities are Plain, Context, and Verbose. The default is Context, and gives output like that just shown. So why not use the Verbose mode all the time? As code gets complicated, this kind of traceback can get extremely long. Depending on the context, sometimes the brevity of Default mode is easier to work with. When reading a traceback is not enough, the standard Python debugger, pdb, lets the user step through the code line by line in order to see what might be causing a more difficult error. The IPython-enhanced version of this is ipdb, the IPython debugger.

Refer to the online documentation of these two utilities to learn more. The ipdb prompt lets you explore the current state of the stack, explore the available variables, and even run Python commands!

Early in developing your algorithm, it can be counterproductive to worry about such things. IPython provides access to a wide array of functionality for this kind of timing and profiling of code. It also is a good choice for longer-running commands, when short, system-related delays are unlikely to affect the result. For example, it prevents cleanup of unused Python objects known as garbage collection that might otherwise affect the timing.
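IPython's %timeit magic is built on the standard library's timeit module, which can be used directly in any script; a minimal sketch:

```python
import timeit

# Run the statement many times and report the total elapsed time,
# which is roughly what %timeit does (with extra repetitions and
# statistics layered on top).
t = timeit.timeit("sum(range(100))", number=10_000)
print(f"10,000 runs took {t:.4f} s")
```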

From here, we could start thinking about what changes we might make to improve the performance in the algorithm. At this point, we may be able to use this information to modify aspects of the script and make it perform better for our desired use case. This is on top of the background memory usage from the Python interpreter itself. The front page features some example notebooks that you can browse to see what other folks are using IPython for!

It includes everything from short examples and tutorials to full-blown courses and books composed in the notebook format! Video tutorials Searching the Internet, you will find many video-recorded tutorials on IPython. As you go through the examples here and elsewhere, you can use it to familiarize yourself with all the tools that IPython has to offer.

For example, images (particularly digital images) can be thought of as simply two-dimensional arrays of numbers representing pixel brightness across the area. Sound clips can be thought of as one-dimensional arrays of intensity versus time.

No matter what the data are, the first step in making them analyzable will be to transform them into arrays of numbers. For this reason, efficient storage and manipulation of numerical arrays is absolutely fundamental to the process of doing data science.

This chapter will cover NumPy in detail. NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you. Once you do, you can import NumPy and double-check the version by inspecting numpy.__version__. For example, to display all the contents of the numpy namespace, you can type np. followed by the Tab key.
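Assuming NumPy is installed, the import convention used throughout the book looks like this:

```python
import numpy as np  # the standard alias used by convention

# __version__ is the installed version string, e.g. "1.26.4".
print(np.__version__)
```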

Understanding Data Types in Python Effective data-driven science and computation requires understanding how data is stored and manipulated. This section outlines and contrasts how arrays of data are handled in the Python language itself, and how NumPy improves on this.

Users of Python are often drawn in by its ease of use, one piece of which is dynamic typing. Understanding how this works is an important piece of learning to analyze data efficiently and effectively with Python. But what this type flexibility also points to is the fact that Python variables are more than just their value; they also contain extra information about the type of the value.

This means that every Python object is simply a cleverly disguised C structure, which contains not only its value, but other information as well. Looking through the CPython source code, we find that the integer type definition is effectively a C structure whose fields hold a reference count, a pointer to the type, the size of the data, and the actual integer value. Notice the difference here: a C integer is essentially a label for a position in memory whose bytes encode an integer value, while a Python integer is a pointer to a position in memory containing all the Python object information along with the integer bytes. This extra information in the Python integer structure is what allows Python to be coded so freely and dynamically. All this additional information in Python types comes at a cost, however, which becomes especially apparent in structures that combine many of these objects.
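You can observe this overhead directly with sys.getsizeof; the exact numbers vary by CPython version and platform, so treat them as illustrative:

```python
import sys

# A bare C int needs 4 or 8 bytes; a Python int also carries a
# reference count, a type pointer, and a size field.
print(sys.getsizeof(0))       # a few dozen bytes, even for zero
print(sys.getsizeof(2**100))  # larger values grow the digit storage
```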

The standard mutable multielement container in Python is the list. In the special case that all variables are of the same type, much of this information is redundant: it can be much more efficient to store data in a fixed-type array. At the implementation level, a fixed-type NumPy-style array contains a single pointer to one contiguous block of data; the Python list, on the other hand, contains a pointer to a block of pointers, each of which in turn points to a full Python object like the Python integer we saw earlier.

Again, the advantage of the list is flexibility: because each list element is a full structure containing both data and type information, the list can be filled with data of any desired type.

The difference between C and Python lists. Fixed-Type Arrays in Python: Python offers several different options for storing data in efficient, fixed-type data buffers. The built-in array module can be used to create dense arrays of a uniform type.
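A quick sketch of the built-in array module: the type code fixes the element type for the whole buffer.

```python
import array

# 'i' declares C-style ints: one type header for the whole buffer
# instead of a full Python object per element.
L = array.array('i', range(10))
print(L.typecode, list(L))
```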

Much more useful, however, is the ndarray object of the NumPy package. If types do not match, NumPy will upcast if possible (here, integers are upcast to floating point).
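For example (a sketch of the upcasting behavior, assuming NumPy is available):

```python
import numpy as np

# Mixing a float with integers upcasts the whole array to float.
a = np.array([3.14, 4, 2, 3])
print(a.dtype)  # float64
```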

Here are several examples: you can create a length-10 integer array filled with zeros with np.zeros(10, dtype=int), and analogous routines (np.ones, np.full, np.arange, np.linspace, and others) cover the other common cases. Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages. The standard NumPy data types are listed in a table in the book; note that when constructing an array, you can specify them using a string: np.zeros(10, dtype='int16'). While the types of operations shown here may seem a bit dry and pedantic, they comprise the building blocks of many other examples used throughout the book.
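A sketch of the most common array-creation routines mentioned above:

```python
import numpy as np

zeros = np.zeros(10, dtype=int)      # length-10 integer array of zeros
ones = np.ones((3, 5), dtype=float)  # 3x5 floating-point array of ones
full = np.full((3, 5), 3.14)         # 3x5 array filled with 3.14
seq = np.arange(0, 20, 2)            # linear sequence: 0, 2, ..., 18
lin = np.linspace(0, 1, 5)           # 5 values evenly spaced in [0, 1]
print(seq)
print(lin)
```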

Get to know them well! This means, for example, that if you attempt to insert a floating-point value into an integer array, the value will be silently truncated. When the slice step is negative, the defaults for start and stop are swapped. For example, with x2 = array([[12, 5, 2, 4], [ 7, 6, 8, 8], [ 1, 6, 7, 7]]): x2[:2, :3] (two rows, three columns) gives array([[12, 5, 2], [ 7, 6, 8]]); x2[:3, ::2] (all rows, every other column) gives array([[12, 2], [ 7, 8], [ 1, 7]]); finally, subarray dimensions can even be reversed together: x2[::-1, ::-1] gives array([[ 7, 7, 6, 1], [ 8, 8, 6, 7], [ 4, 2, 5, 12]]). Accessing array rows and columns.

One commonly needed routine is accessing single rows or columns of an array. This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices will be copies, while in NumPy arrays they are views into the same underlying data.
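A sketch of row/column access and the view semantics just described, using the same example array as the text:

```python
import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])

print(x2[:, 0])   # first column of x2
print(x2[0, :])   # first row of x2

# NumPy slices are views: modifying the subarray modifies the original.
sub = x2[:2, :2]
sub[0, 0] = 99
print(x2[0, 0])   # 99: the change propagated to x2

# An explicit .copy() decouples the data.
sub_copy = x2[:2, :2].copy()
sub_copy[0, 0] = 12
print(x2[0, 0])   # still 99
```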

Creating copies of arrays: despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray, which is most easily done with the copy method. Reshaping of arrays: another useful type of operation is reshaping, and the most flexible way of doing this is with the reshape method. Where possible, the reshape method will use a no-copy view of the initial array, but with noncontiguous memory buffers this is not always the case. Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or column matrix.
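A sketch of the reshape method and the row/column conversion via np.newaxis:

```python
import numpy as np

x = np.arange(1, 10)
grid = x.reshape((3, 3))  # 1-D -> 3x3 (a no-copy view where possible)
row = x[np.newaxis, :]    # shape (1, 9): a row matrix
col = x[:, np.newaxis]    # shape (9, 1): a column matrix
print(grid.shape, row.shape, col.shape)
```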

Array Concatenation and Splitting: All of the preceding routines worked on single arrays. Concatenation, or joining of two arrays in NumPy, is primarily accomplished through the routines np.concatenate, np.vstack, and np.hstack.
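A sketch of the three concatenation routines:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
print(np.concatenate([x, y]))         # [1 2 3 4 5 6]

grid = np.array([[1, 2], [3, 4]])
print(np.vstack([grid, grid]).shape)  # (4, 2): stacked vertically
print(np.hstack([grid, grid]).shape)  # (2, 4): stacked horizontally
```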

Splitting of arrays: The opposite of concatenation is splitting, which is implemented by the functions np.split, np.hsplit, and np.vsplit; for each of these, we can pass a list of indices giving the split points. Computation on NumPy Arrays: Universal Functions. Up until now, we have been discussing some of the basic nuts and bolts of NumPy; in the next few sections, we will dive into the reasons that NumPy is so important in the Python data science world.

Computation on NumPy arrays can be very fast, or it can be very slow; the key to making it fast is to use vectorized operations, implemented through NumPy's universal functions (ufuncs). This section motivates the need for ufuncs and then introduces many of the most common and useful arithmetic ufuncs available in the NumPy package. Python's sluggishness for repeated small operations is in part due to the dynamic, interpreted nature of the language: the fact that types are flexible means that sequences of operations cannot be compiled down to efficient machine code as in languages like C and Fortran. Various projects attempt to address this weakness; each has its strengths and weaknesses, but it is safe to say that none has yet surpassed the reach and popularity of the standard CPython engine.

A straightforward approach might be to write a Python loop that computes the reciprocal of each element in turn. But if we measure the execution time of this code for a large input, we see that this operation is very slow, perhaps surprisingly so! It turns out that the bottleneck here is not the operations themselves, but the type-checking and function dispatches that CPython must do at each cycle of the loop. Introducing UFuncs: For many types of operations, NumPy provides a convenient interface into just this kind of statically typed, compiled routine.
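A sketch of the comparison the text describes (the function name and test values here are mine):

```python
import numpy as np

def compute_reciprocals(values):
    # Explicit loop: type checking and function dispatch happen at
    # every iteration, which is what makes this approach slow.
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

values = np.array([1.0, 2.0, 4.0, 5.0])
# The vectorized ufunc expression computes the same result, but the
# loop runs in compiled code instead of the Python interpreter.
print(compute_reciprocals(values))
print(1.0 / values)
```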

This is known as a vectorized operation. You can accomplish this by simply performing an operation on the array, which will then be applied to each element.

This vectorized approach is designed to push the loop into the compiled layer that underlies NumPy, leading to much faster execution. Ufuncs are extremely flexible: before we saw an operation between a scalar and an array, but we can also operate between two arrays, as in np.arange(5) / np.arange(1, 6).

Any time you see such a loop in a Python script, you should consider whether it can be replaced with a vectorized expression. Each of the arithmetic operators on arrays is simply a wrapper around a corresponding NumPy ufunc; for example, the + operator is a wrapper for the np.add function. A look through the NumPy documentation reveals a lot of interesting functionality. Another excellent source for more specialized and obscure ufuncs is the submodule scipy.special.

If you want to compute some obscure mathematical function on your data, chances are it is implemented in scipy.special. Advanced Ufunc Features: Many NumPy users make use of ufuncs without ever learning their full set of features. Specifying output: for large calculations, it is sometimes useful to be able to specify the array where the result of the calculation will be stored; this can be done with the out argument of the ufunc. Aggregates: for binary ufuncs, there are some interesting aggregates that can be computed directly from the object.

A reduce repeatedly applies a given operation to the elements of an array until only a single result remains.

Outer products: Finally, any ufunc can compute the output of all pairs of two different inputs using the outer method. Another extremely useful feature of ufuncs is the ability to operate between arrays of different sizes and shapes, a set of operations known as broadcasting. Summing the Values in an Array: As a quick example, consider computing the sum of all values in an array. Multidimensional aggregates: One common type of aggregation operation is an aggregate along a row or column.
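A sketch of reduce, accumulate, and outer on the add and multiply ufuncs:

```python
import numpy as np

x = np.arange(1, 6)             # [1 2 3 4 5]
print(np.add.reduce(x))         # 15: repeated addition to one value
print(np.multiply.reduce(x))    # 120: the product of all elements
print(np.add.accumulate(x))     # [ 1  3  6 10 15]: running totals
print(np.multiply.outer(x, x))  # 5x5 multiplication table
```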

Similarly, we can find the maximum value within each row with M.max(axis=1). The axis keyword specifies the dimension of the array that will be collapsed, rather than the dimension that will be returned. Some of the NaN-safe functions (np.nansum, np.nanmax, and friends) were not added until relatively recent NumPy releases. A table in the book provides a list of useful aggregation functions available in NumPy. We may also wish to compute quantiles, for example the 25th percentile with np.percentile. Broadcasting is simply a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) on arrays of different sizes.
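A sketch of axis-wise aggregation and the NaN-safe variants:

```python
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(M.sum())        # 21.0: aggregate over the whole array
print(M.min(axis=0))  # [1. 2. 3.]: collapse the rows (per column)
print(M.max(axis=1))  # [3. 6.]: collapse the columns (per row)

# NaN-safe variants ignore missing values instead of propagating them.
v = np.array([1.0, np.nan, 3.0])
print(np.nansum(v), np.nanmax(v))  # 4.0 3.0
print(np.percentile(M, 25))        # 25th percentile of all six values
```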

We can similarly extend this to arrays of higher dimension. While these examples are relatively easy to understand, more complicated cases can involve broadcasting of both arrays. In the book, the geometry of these examples is shown in a visualization of NumPy broadcasting, in which light boxes represent the broadcasted values: this extra memory is not actually allocated in the course of the operation, but it can be useful conceptually to imagine that it is.

Consider an operation between a two-dimensional array and a one-dimensional array whose shape matches neither axis exactly. How does this affect the calculation? You could imagine making the arrays compatible by stretching the smaller array on the other side, but this is not how the broadcasting rules work! That sort of flexibility might be useful in some cases, but it would lead to potential areas of ambiguity. Centering an array: In the previous section, we saw that ufuncs allow a NumPy user to remove the need to explicitly write slow Python loops. Broadcasting extends this ability. Imagine you have an array of 10 observations, each of which consists of 3 values.
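The centering recipe just described can be sketched as follows (random data stands in for real observations):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((10, 3))  # 10 observations with 3 values each

Xmean = X.mean(axis=0)   # per-feature mean, shape (3,)
X_centered = X - Xmean   # broadcasting: (10, 3) minus (3,)

# Each column of the centered array now has (near-)zero mean.
print(X_centered.mean(axis=0))
```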

Plotting a two-dimensional function: One place that broadcasting is very useful is in displaying images based on two-dimensional functions. In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks. Example: Counting Rainy Days. Imagine you have a series of data that represents the amount of precipitation each day for a year in a given city. What is the average precipitation on those rainy days?

How many days were there with more than half an inch of rain? Digging into the data One approach to this would be to answer these questions by hand: loop through the data, incrementing a counter each time we see values in some desired range.

The result of these comparison operators is always an array with a Boolean data type. Working with Boolean Arrays: Given a Boolean array, there are a host of useful operations you can do, such as counting True entries with np.count_nonzero. Another way to get at this information is to use np.sum; in this case, False is interpreted as 0 and True is interpreted as 1.

For example, np.all(x < 8, axis=1) asks whether all values in each row are less than 8. A word of caution: Python has built-in sum, any, and all functions; these have a different syntax than the NumPy versions, and in particular will fail or produce unintended results when used on multidimensional arrays.

Be sure that you are using np.sum, np.any, and np.all for these examples! But what if we want to know about all days with rain less than four inches and greater than one inch? We can address this sort of compound question by combining comparisons with the bitwise & operator inside np.sum. Combining masking with aggregations lets us compute results such as the number of days without rain, via np.sum(inches == 0). A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves.
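A sketch of counting and masking with Boolean arrays; the rainfall numbers here are made up (the book uses real precipitation data):

```python
import numpy as np

# Hypothetical daily rainfall in inches for one week.
inches = np.array([0.0, 0.1, 0.0, 0.6, 1.2, 0.0, 0.3])

rainy = inches > 0                            # Boolean mask of rainy days
print(np.sum(rainy))                          # 4 days with any rain
print(np.sum((inches > 0.5) & (inches < 1)))  # 1 day in (0.5, 1)
print(inches[rainy].mean())                   # average on rainy days only
```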

We are then free to operate on these values as we wish. When would you use the keywords and and or versus the operators & and |? In Python, all nonzero integers will evaluate as True, and the keywords gauge the truth of entire objects, while & and | operate within each object. For Boolean NumPy arrays, the latter is nearly always the desired operation.

Fancy Indexing: In the previous sections, we saw how to access and modify portions of arrays using simple indices (e.g., arr[0]), slices (e.g., arr[:5]), and Boolean masks (e.g., arr[arr > 0]). Fancy indexing means passing arrays of indices, and index arrays can be combined with slices and with np.newaxis; for example, row[:, np.newaxis] pairs a column of row indices with an array of column indices. Modifying Values with Fancy Indexing: Just as fancy indexing can be used to access parts of an array, it can also be used to modify parts of an array.
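A sketch of modifying values through fancy indexing, including the repeated-index subtlety and the unbuffered np.add.at alternative:

```python
import numpy as np

x = np.zeros(10)
x[[0, 0]] = [4, 6]  # repeated index: the last assignment wins
print(x[0])         # 6.0

i = [2, 3, 3, 4, 4, 4]
x[i] += 1           # buffered: each indexed slot is incremented once
print(x[3], x[4])   # 1.0 1.0, not 2.0 and 3.0

np.add.at(x, i, 1)  # unbuffered: repeats accumulate per occurrence
print(x[3], x[4])   # 3.0 4.0
```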

The result, of course, is that x[0] contains the value 6: with repeated indices in an assignment, the last value assigned wins. Why is a repeated operation like x[i] += 1 not accumulative, then? Conceptually, it is evaluated as x[i] = x[i] + 1: the values are read once, incremented, and assigned back, so each repeated index is incremented only once rather than once per occurrence; for accumulating behavior, the ufunc at method can be used instead.

Here we make no such claim! This book contains proven steps and strategies on how to understand and completely learn the Python language. But like other tools and languages, the key to learning it lies not just in reading the book but in getting hands-on experience by practicing the code given in the examples, as well as practicing on your own.

We have given many examples; feel free to change them as you like. The more you practice, the more you will learn. Think of new ideas that can enhance your code and make it more useful. For example, if you have been given an example that takes two numbers and finds their sum, try to write code for the same function with three numbers or more.

Another important thing readers should know is that this book is aimed at beginners, but it also accommodates intermediate and more advanced programmers. I assume three types of readers would be interested in this book. First, those who have no knowledge of programming, for whom this is their first venture into the world of software development.

Second, those who know programming and want to increase their software development skills by learning a new language, Python. And third, those who understand Python, have worked with it, and are interested in the book simply to increase their knowledge of the language. Some of the topics that we are going to explore in this guidebook when it comes to advanced Python for data analysis include String Operations, Functions and Loops, and What is Predictive Analytics, with a lot of hands-on examples.

Here it is highly recommended that readers search for other online resources for further clarification. This book covers the most popular Python 3 frameworks for both local and distributed processing, on-premise and cloud-based.

Along the way, you will be introduced to many popular open-source frameworks, like SciPy, scikit-learn, Numba, Apache Spark, etc.

The book is structured around examples, so you will grasp core concepts via case studies and Python 3 code. As data science projects get continuously larger and more complex, software engineering knowledge and experience are crucial to producing evolvable solutions. You'll see how to create maintainable software for data science and how to document data engineering practices.

This book is a good starting point for people who want to gain practical skills to perform data science. All the code will be available in the form of IPython notebooks and Python 3 programs, which allow you to reproduce all analyses from the book and customize them for your own purpose.

Practical Data Science with Python will empower you to analyze data, formulate proper questions, and produce actionable insights, three core stages in most data science endeavors. Launch executable versions of these notebooks using Google Colab, launch a live notebook server with Binder, or buy the printed book through O'Reilly Media. The book was written and tested with Python 3.
