AWS Machine Learning Foundations Part One

Posted by Jason Feng on May 23, 2020

This is part one of my notes on the Udacity course AWS Machine Learning Foundations.

Simple, readable and concise code

1. Use meaningful names

  • Be descriptive and imply type - E.g. for booleans, you can prefix with is_ or has_ to make it clear it is a condition. You can also use parts of speech to imply types, like verbs for functions and nouns for variables.
  • Be consistent but clearly differentiate - E.g. age_list and age are easier to tell apart than ages and age.
  • Avoid abbreviations and especially single letters - (Exception: counters and common math variables.) Decide when to make these exceptions based on the audience for your code. If you work with other data scientists, certain variables may be common knowledge. If you work with full stack engineers, you may need more descriptive names in these cases as well.
  • Long names != descriptive names - You should be descriptive, but only with relevant information. E.g. good function names describe what they do well without including details about implementation or highly specific uses.
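As a small sketch of the naming guidelines above, the snippet below contrasts vague names with descriptive ones. The data and every name in it (student_test_scores, calculate_mean, is_passing, the passing mark of 60) are made up for illustration and do not come from the course.

```python
# Hypothetical example data; the scores are invented for illustration.
student_test_scores = [88, 92, 79, 97]

# Vague version: single letters and abbreviations hide the intent.
# s = [88, 92, 79, 97]
# m = sum(s) / len(s)
# p = m >= 60


# Descriptive version: a verb phrase for the function, nouns for the
# variables, and an is_ prefix for the boolean.
def calculate_mean(values):
    return sum(values) / len(values)


mean_test_score = calculate_mean(student_test_scores)
is_passing = mean_test_score >= 60  # 60 is an assumed passing mark
```

Note that calculate_mean keeps the generic parameter name values, because the function is not specific to test scores.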

2. Use whitespace properly

  • Organize your code with consistent indentation - the standard is to use 4 spaces for each indent. You can make this a default in your text editor.
  • Separate sections with blank lines to keep your code well organized and readable.
  • Try to limit your lines to around 79 characters, which is the guideline given in the PEP 8 style guide. In many good text editors, there is a setting to display a subtle line that indicates where the 79 character limit is.

3. DRY (Don’t Repeat Yourself)

Modularization allows you to reuse parts of your code. Generalize and consolidate repeated code in functions or loops.
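One way to picture consolidating repeated code is shown below: the same counting logic, which could have been copied once per person, is moved into a single function and applied in a loop. The ratings data and the count_above helper are hypothetical.

```python
# Hypothetical ratings data, invented for illustration.
ratings = {"alice": [3, 5, 4], "bob": [2, 2, 5], "carol": [5, 5, 5]}

# Repetitive version: the same expression copied for every person.
# alice_high = sum(1 for r in ratings["alice"] if r > 3)
# bob_high = sum(1 for r in ratings["bob"] if r > 3)
# carol_high = sum(1 for r in ratings["carol"] if r > 3)


def count_above(values, threshold):
    """Return how many values are strictly greater than threshold."""
    return sum(1 for value in values if value > threshold)


# The repeated logic now lives in one function, applied in one loop.
high_rating_counts = {
    name: count_above(scores, 3) for name, scores in ratings.items()
}
```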

4. Abstract out logic to improve readability

Abstracting code into a function not only makes it less repetitive, but also improves readability through descriptive function names. It is possible to over-engineer this and end up with far too many modules, so use your judgement.

5. Minimize the number of entities (functions, classes, modules, etc.)

There are tradeoffs to having function calls instead of inline logic. If you break your code into an unnecessary number of functions and modules, you’ll have to jump around everywhere to view the implementation details of something that may be too small to be worth it. Creating more modules doesn’t necessarily result in effective modularization.

6. Functions should do one thing

Each function you write should be focused on doing one thing. If a function is doing multiple things, it becomes more difficult to generalize and reuse. Generally, if there’s an “and” in your function name, consider refactoring.
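A rough sketch of this idea: a hypothetical parse_and_average_ages function is split into two functions that each do one thing and can be reused separately. All names and data here are assumptions made for illustration.

```python
# Before (commented out): one function that parses AND averages.
# def parse_and_average_ages(raw_ages):
#     ages = [int(a) for a in raw_ages if a.strip()]
#     return sum(ages) / len(ages)


def parse_ages(raw_ages):
    """Convert age strings to ints, skipping blank entries."""
    return [int(raw_age) for raw_age in raw_ages if raw_age.strip()]


def calculate_mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


raw_ages = ["34", " ", "27", "41"]  # hypothetical input
ages = parse_ages(raw_ages)
mean_age = calculate_mean(ages)
```

After the split, calculate_mean can be reused for any list of numbers, not just ages.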

7. Arbitrary variable names can be more effective in certain functions

In general-purpose functions, which are not tied to one specific kind of data, arbitrary variable names can actually make the code more readable.
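For instance, in a general math helper like the hypothetical one below, short conventional names such as x and y arguably read better than domain-specific ones, because the function works on any pair of points.

```python
import math


def euclidean_distance(x, y):
    """Return the Euclidean distance between two equal-length sequences."""
    # x and y are arbitrary names; the function is not tied to any
    # particular kind of data, so more specific names would mislead.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


distance = euclidean_distance([0, 0], [3, 4])
```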

8. Try to use no more than three arguments per function

This is not a hard rule, and there are times when it is more appropriate to use many parameters. But in many cases, it’s more effective to use fewer arguments. Remember we are modularizing to simplify our code and make it more efficient to work with. If your function has a lot of parameters, you may want to rethink how you are splitting this up.

9. Documentation

  • Additional text or illustrated information that comes with or is embedded in the code of software.
  • Helpful for clarifying complex parts of code, making your code easier to navigate, and quickly conveying how and why different components of your program are used.
  • Several types of documentation can be added at different levels of your program:
    • In-line Comments - line level
    • Docstrings - module and function level
    • Project Documentation - project level
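The first two levels can be sketched in a single function. The population_density function below is an illustrative example, and the Args/Returns layout is one common docstring convention rather than the only acceptable one.

```python
def population_density(population, land_area):
    """Calculate the population density of an area.

    Args:
        population (int): The number of people living in the area.
        land_area (float): The size of the area, in square kilometers.

    Returns:
        float: The number of people per square kilometer.
    """
    # In-line comment: explain why, not what. Here there is little to
    # explain, so the docstring carries most of the documentation.
    return population / land_area


density = population_density(100, 10)
```

Project-level documentation, such as a README file, lives outside the code and is not shown here.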

Refactoring means restructuring code to improve its internal structure without changing its external functionality. The more you refactor, the easier it will be to structure and write good code the first time.

Testing And Data Science

  • Problems that could occur in data science aren’t always easily detectable; you might have values being encoded incorrectly, features being used inappropriately, or unexpected data breaking assumptions.
  • To catch these errors, you have to check for the quality and accuracy of your analysis in addition to the quality of your code. Proper testing is necessary to avoid unexpected surprises and have confidence in your results.
  • TEST DRIVEN DEVELOPMENT: a development process where you write tests for tasks before you even write the code to implement those tasks.
  • UNIT TEST: a type of test that covers a “unit” of code, usually a single function, independently from the rest of the program.

Unit Tests

We want to test our functions in a way that is repeatable and automated. Ideally, we’d run a test program that runs all our unit tests and cleanly lets us know which ones failed and which ones succeeded. Fortunately, there are great tools available in Python that we can use to create effective unit tests!

Unit Test Advantages and Disadvantages

The advantage of unit tests is that they are isolated from the rest of your program, and thus, no dependencies are involved. They don’t require access to databases, APIs, or other external sources of information. However, passing unit tests isn’t always enough to prove that our program is working successfully. To show that all the parts of our program work with each other properly, communicating and transferring data between them correctly, we use integration tests. In this lesson, we’ll focus on unit tests; however, when you start building larger programs, you will want to use integration tests as well.

You can read about integration testing and how integration tests relate to unit tests here. That article contains other very useful links as well.

Unit Testing Tools

To install pytest, run pip install -U pytest in your terminal. You can see more information on getting started here.

  • Create a test file starting with test_
  • Define unit test functions that start with test_ inside the test file
  • Enter pytest into your terminal in the directory of your test file and it will detect these tests for you! The test_ prefix is the default - if you wish to change it, you can learn how in the pytest configuration documentation.

In the test output, periods represent successful unit tests and F’s represent failed unit tests. Since all you see is which test functions failed, it’s wise to have only one assert statement per test. Otherwise, you wouldn’t know exactly how many checks failed, or which ones.

A failed assert statement won’t stop the remaining tests from running, but a syntax error will stop the test run.
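A minimal sketch of such a test file follows. The file name test_nearest_square.py and the nearest_square function are hypothetical. In a real project the function would normally live in its own module and be imported; it is defined inline here only to keep the example self-contained.

```python
# test_nearest_square.py (hypothetical file name, starting with test_)


def nearest_square(limit):
    """Return the largest perfect square less than or equal to limit."""
    root = 0
    while (root + 1) ** 2 <= limit:
        root += 1
    return root ** 2


# Each test function starts with test_ and holds a single assert,
# so a failure points at exactly one case.
def test_nearest_square_5():
    assert nearest_square(5) == 4


def test_nearest_square_9():
    assert nearest_square(9) == 9


def test_nearest_square_negative():
    assert nearest_square(-12) == 0
```

Running pytest in the directory containing this file should discover all three test functions by their test_ prefix.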

Test Driven Development and Data Science

Test driven development means writing tests before you write the code that’s being tested. Your test will fail at first, and you’ll know you’ve finished implementing a task when it passes. Tests can check for all the different scenarios and edge cases you can think of, before you even start to write your function. This way, when you do start implementing your function, you can run the tests to get immediate feedback on whether it works in all the ways you can think of, as you tweak your function. When refactoring or adding to your code, tests help you rest assured that the rest of your code didn’t break while you were making those changes. Tests also help ensure that your function’s behavior is repeatable, regardless of external parameters such as hardware and time. Test driven development for data science is relatively new and has a lot of experimentation and breakthroughs appearing, which you can learn more about in the resources below.
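The workflow can be sketched as below, assuming a hypothetical is_valid_email function. The tests are written first and describe the expected behavior; the implementation that follows is deliberately simplistic and only does enough to make those tests pass.

```python
# Step 1: write the tests first. Run on their own, they fail until
# is_valid_email exists and behaves as described.
def test_accepts_address_with_at_and_dot():
    assert is_valid_email("jane@example.com")


def test_rejects_missing_at_symbol():
    assert not is_valid_email("jane.example.com")


def test_rejects_missing_dot_in_domain():
    assert not is_valid_email("jane@example")


# Step 2: write just enough code to make the tests pass.
def is_valid_email(email):
    """Simplistic check for illustration; not a real email validator."""
    if email.count("@") != 1:
        return False
    domain = email.split("@")[1]
    return "." in domain
```

As new edge cases come to mind, each one becomes another test before the function is changed.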

Code Review

Questions to Ask Yourself When Conducting a Code Review

First, let’s look over some of the questions we may ask ourselves while reviewing code. These are simply from the concepts we’ve covered in these last two lessons!

Is the code clean and modular?

  • Can I understand the code easily?
  • Does it use meaningful names and whitespace?
  • Is there duplicated code?
  • Can you provide another layer of abstraction?
  • Is each function and module necessary?
  • Is each function or module too long?

Is the code efficient?

  • Are there loops or other steps we can vectorize?
  • Can we use better data structures to optimize any steps?
  • Can we shorten the number of calculations needed for any steps?
  • Can we use generators or multiprocessing to optimize any steps?
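As one illustration of the data-structure question, the sketch below finds the items common to two collections, first with list membership checks and then with a set intersection. The book titles are invented, and any speedup in practice depends on the size of the data.

```python
# Hypothetical data, invented for illustration.
recent_books = ["book_a", "book_b", "book_c", "book_d"]
coding_books = ["book_b", "book_d", "book_e"]

# List version: each "in" check scans coding_books from the start,
# so the work grows with len(recent_books) * len(coding_books).
common_books_slow = [book for book in recent_books if book in coding_books]

# Set version: membership checks on a set take roughly constant time
# on average, so the intersection scales much better on large inputs.
common_books_fast = set(recent_books).intersection(coding_books)
```

Both versions find the same titles; the set version gives up ordering in exchange for faster lookups.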

Is documentation effective?

  • Are in-line comments concise and meaningful?
  • Is there complex code that’s missing documentation?
  • Do functions use effective docstrings?
  • Is the necessary project documentation provided?

Is the code well tested?

  • Does the code have high test coverage?
  • Do tests check for interesting cases?
  • Are the tests readable?
  • Can the tests be made more efficient?

Is the logging effective?

  • Are log messages clear, concise, and professional?
  • Do they include all relevant and useful information?
  • Do they use the appropriate logging level?
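A brief sketch of what these questions look for, using Python’s standard logging module. The logger name, the drop_missing_rows function, and the message wording are assumptions made for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger("data_cleaning")  # hypothetical logger name


def drop_missing_rows(rows):
    """Return only the rows that contain no None values."""
    cleaned_rows = [row for row in rows if None not in row]
    dropped_count = len(rows) - len(cleaned_rows)

    # DEBUG: detail that is only useful while diagnosing a problem.
    logger.debug("Checked %d rows for missing values", len(rows))

    # WARNING: something unexpected happened, but the program continues.
    # The message states what happened and includes the relevant counts.
    if dropped_count:
        logger.warning(
            "Dropped %d of %d rows with missing values",
            dropped_count,
            len(rows),
        )
    return cleaned_rows


cleaned = drop_missing_rows([(1, 2), (None, 3), (4, 5)])
```

With the level set to INFO, the debug message is suppressed and only the warning is emitted.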

Reference