Sunday, 1 September 2019

Teaching an old Pythonista new Gopher tricks

I recently got a new job where I need to write a lot of Golang, so I needed to learn it.
I figured that you don't really learn a language unless you try to write code that actually does something useful. However, at a recent Golang meetup someone who had come to a similar conclusion had written a full emulator of the Gameboy in Go, so I also figured I wanted to do something that was not quite so complex or low level ... i.e. something that could hopefully be done in a week.

So I decided to take the plunge by creating an open source package that does the same job as a Python one I released many years ago called django-csvimport: a simple add-on for the Django ORM that caters for loading data into models from CSV files, with the option to generate the model code from scratch for a CSV file by checking the data fields and determining the data type for each column.

Also, doing a task where I had already solved the problems in another language would mean I could just focus on how Golang might approach the problem, not the problem itself. So this post is about the practical differences between writing a Python and a Golang solution. As such it compares the languages as tools for a certain job, which I hope is complementary to the many posts that compare the languages themselves. Suffice it to say, they differ in many ways ... most significantly in static vs. dynamic typing ... whilst being most similar in regarding readable, consistent, simple syntax as paramount, where other languages have different priorities. Hence for both, auto-formatting code is good practice, with Go's built-in gofmt (run via go fmt) doing the job of Python's black or yapf.

So firstly, Django is one of the leading full web frameworks for Python, so what is the equivalent for Go? Gorilla, Gin, Buffalo etc. ... there are plenty of frameworks, but which is the leading one with an ORM? I tried out a couple, but from reading around it became apparent that if you choose to develop a web app in Go, then the majority of devs don't use a framework at all! So already the differences between the languages were becoming apparent. Reasons? If you choose Go for creating a web app then performance may be a significant requirement, and even micro frameworks can be slower than raw code. Go is a recent language and as such has lots of web related features built into the core already ... templating, etc. and even imports are URL based, so a web framework in Go gives you less than it does in Python.
So instead I checked out Go ORMs and decided to write an extension package for Gorm as one of the leading Go ORMs.

So, ditching the web framework / UI integration features of django-csvimport as an unnecessary extra, the problem consists of just two parts: creating ORM model definitions that create relational database tables, and parsing the CSV files to import the data into those tables.

From this high level spec, the core functional components of the tool that we want to rebuild in Golang are:


  1. CLI to take arguments specifying source files and actions to perform
  2. An ORM to manage vendor independent database schema creation and population
  3. Utility to inspect data sources and determine data types
  4. Template tool to create ORM models (metaprogramming)
  5. CSV parser to read in CSV files - ideally capable of handling various formats and poor or inconsistent formatting - i.e. real CSV files!

For all of these we would hope that language level packages are available to do the heavy lifting. Then the package can just knit them together into a CSV to relational database import utility.

So stepping through these and rating Go vs Python...

CLI framework (draw)

As a minimum, our task requires a command line utility to point to the CSV data files to be imported.
Django comes with a CLI framework in the form of management commands. For our Go CSV import, gormcsv, we just have the ORM, so we could roll our own CLI handling, but in this case that is probably not a great idea since, like Python, Go has a dominant CLI framework: Cobra, which equates to Python's Click. It integrates with the Viper config framework, which is like Python's core configparser lib with extras. Within the gormcsv module I created the CLI command Go files as a cmd package via Cobra's autogenerate feature, and used them to wrap the importcsv.go and inspectcsv.go source files in the importcsv package that do the real work.
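
To give a flavour of the wiring involved, here is a minimal sketch of a two-command CLI. It is not the gormcsv code: it uses the standard library flag package instead of Cobra so that it runs without dependencies, and the run function and the -delimiter flag are made up for illustration.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// run dispatches to a subcommand and returns a description of what it
// would do. The -delimiter flag is hypothetical, not a gormcsv option.
func run(args []string) (string, error) {
	if len(args) < 1 {
		return "", errors.New("usage: <importcsv|inspectcsv> [flags] file.csv")
	}
	fs := flag.NewFlagSet(args[0], flag.ContinueOnError)
	delim := fs.String("delimiter", ",", "field delimiter")
	if err := fs.Parse(args[1:]); err != nil {
		return "", err
	}
	if fs.NArg() != 1 {
		return "", errors.New("expected exactly one CSV file")
	}
	switch args[0] {
	case "importcsv":
		return fmt.Sprintf("import %s (delimiter %q)", fs.Arg(0), *delim), nil
	case "inspectcsv":
		return fmt.Sprintf("inspect %s (delimiter %q)", fs.Arg(0), *delim), nil
	default:
		return "", fmt.Errorf("unknown command %q", args[0])
	}
}

func main() {
	// Fixed example arguments; a real tool would pass os.Args[1:].
	msg, err := run([]string{"inspectcsv", "-delimiter", ";", "data.csv"})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(msg)
}
```

With Cobra each case of the switch would instead be its own generated command file, which is roughly what the cmd package described above contains.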

ORM (draw)

Any language's leading ORMs should cope with the database management and data population tasks, and GORM is functionally similar in its capabilities to the Django ORM.


Data source introspection tool (Python win)

Messytables is a mature package designed for the task of scraping in data from various heterogeneous third party sources - possibly of poor quality. As such it is one of the many utilities created around Python's well established role in the data analytics realm. Go has no such tool. There is no third party package to cater for inspecting, type checking and cleaning up data sources :-(
So we have to make our own much simpler data inspector that will hopefully cope OK with the most common data types if they are reasonably consistently formatted.
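
As an illustration of how simple such an inspector can be, here is a sketch that guesses a Go type for one column from sample values using only strconv and time. It is not the gormcsv inspector: the inferType name, the order in which types are tried and the single ISO date layout are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// inferType guesses a Go type name for one CSV column from sample values.
// A type is kept only if every non-empty sample parses as that type.
func inferType(samples []string) string {
	isInt, isFloat, isBool, isDate := true, true, true, true
	seen := false
	for _, s := range samples {
		if s == "" {
			continue // treat empty cells as missing, not as evidence
		}
		seen = true
		if _, err := strconv.ParseInt(s, 10, 64); err != nil {
			isInt = false
		}
		if _, err := strconv.ParseFloat(s, 64); err != nil {
			isFloat = false
		}
		if s != "true" && s != "false" {
			isBool = false
		}
		// Assumed layout: only YYYY-MM-DD dates are recognised.
		if _, err := time.Parse("2006-01-02", s); err != nil {
			isDate = false
		}
	}
	switch {
	case !seen:
		return "string"
	case isInt:
		return "int64"
	case isFloat:
		return "float64"
	case isBool:
		return "bool"
	case isDate:
		return "time.Time"
	default:
		return "string"
	}
}

func main() {
	fmt.Println(inferType([]string{"1", "42", ""}))
	fmt.Println(inferType([]string{"1", "2.5"}))
	fmt.Println(inferType([]string{"2019-09-01", "2019-08-31"}))
	fmt.Println(inferType([]string{"3", "three"}))
}
```

Anything that does not parse consistently falls back to string, which is the "reasonably consistently formatted" caveat in practice: one stray value is enough to downgrade a whole column.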

Templating tool for creating models (Go win)

For GORM and Django the ORM models are implemented directly as classes in the language rather than using an intermediate DSL or XML etc. So to create models based on introspecting source data, metaprogramming must be used to generate code.
Templates are available in the core of Go. Also, given that it is statically typed and (at the time of writing) has no generics, for some problems that generics would solve the best alternative is metaprogramming. Hence templated generation of Go code is a normal Go pattern. So arguably this is better supported (in the core) in Go than in Python. For Python, code generation is rarely needed, and my original django-csvimport implementation just uses string construction and doesn't even employ one of Python's many add-on template packages, e.g. Django or Jinja2 templates (hmm, needs a rewrite!).
Note that both languages have fully functional reflection / introspection libraries in the core.
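
A minimal sketch of that pattern, using only the core text/template and go/format packages, might look like the following. The Column and Model types, the template text and the renderModel function are hypothetical, made up for the example rather than taken from gormcsv, and the generated struct carries no GORM tags.

```go
package main

import (
	"bytes"
	"fmt"
	"go/format"
	"text/template"
)

// Column and Model are hypothetical shapes for what a data inspector
// might hand to the code generator.
type Column struct {
	Name   string // Go field name
	Type   string // Go type name
	Header string // original CSV column header
}

type Model struct {
	Name    string
	Columns []Column
}

const modelTmpl = `package models

// {{.Name}} is generated from the header row of a CSV file.
type {{.Name}} struct {
{{- range .Columns}}
	{{.Name}} {{.Type}} // column "{{.Header}}"
{{- end}}
}
`

var tmpl = template.Must(template.New("model").Parse(modelTmpl))

// renderModel executes the template and runs the result through gofmt's
// library form, which also rejects output that is not valid Go syntax.
func renderModel(m Model) ([]byte, error) {
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, m); err != nil {
		return nil, err
	}
	return format.Source(buf.Bytes())
}

func main() {
	src, err := renderModel(Model{
		Name: "Country",
		Columns: []Column{
			{Name: "Code", Type: "string", Header: "code"},
			{Name: "Population", Type: "int64", Header: "population"},
		},
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Print(string(src))
}
```

Passing the output through format.Source is a cheap sanity check: if a CSV header turns into something that is not a legal Go identifier, the generator fails there instead of writing a file that won't compile.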

CSV Parser (Python win)

Most important to this application is the quality of the CSV parser. This is where Go is sadly completely let down. Its CSV parser is frankly inadequate and can only cope with CSV that is strictly formatted according to RFC 4180.

To quote from Python's csv parser library ...

CSV format was used for many years prior to attempts to describe the format in a standardized way in RFC 4180. The lack of a well-defined standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources.

TBH Python 3's CSV parser is itself significantly more strict about format than the old Python 2 one, so certain CSV files that Python 2 happily dealt with cannot be parsed - largely due to the switch to Unicode resulting in more critical failures related to character encoding. However, the Go parser is a whole other level of strict, and realistically it can probably handle less than 10% of the real world CSV source files out there that you might want to scrape data from into a database, whilst Python 3's can probably cope with over 80%.

I also investigated third party Go libraries that cater for parsing a more realistic range of CSV formatting, but found none that did so.
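
To make the strictness concrete, here is a small sketch with made-up sample data: one row has a stray quote inside an unquoted field and another row is a field short. With default settings encoding/csv rejects the input. The Reader does expose a couple of switches, LazyQuotes and FieldsPerRecord, that relax these two particular checks; the sketch only shows what they do and makes no claim about how far they get you with real files. The parse helper is hypothetical.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parse reads all records from data. When lenient is true it sets two
// Reader options that relax the default checks.
func parse(data string, lenient bool) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	if lenient {
		r.LazyQuotes = true    // tolerate a stray quote inside a field
		r.FieldsPerRecord = -1 // allow rows with differing field counts
	}
	return r.ReadAll()
}

func main() {
	// Made-up data: a stray quote on row 2, a missing field on row 3.
	messy := "id,name,note\n1,5\" disk,ok\n2,widget\n"

	if _, err := parse(messy, false); err != nil {
		fmt.Println("default reader:", err)
	}

	records, err := parse(messy, true)
	if err != nil {
		fmt.Println("lenient reader:", err)
		return
	}
	for _, rec := range records {
		fmt.Printf("%d fields: %q\n", len(rec), rec)
	}
}
```

Neither option helps with mixed delimiters, broken encodings or unbalanced quoted fields, which is where the bulk of real world mess tends to be.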

Conclusion

So in conclusion, Python may not be a Gopher Snake, but for this task it does rather eat Go for breakfast. There is no ready-made third party Go package to deal with ingesting unknown or badly formatted data like Python's aptly named messytables. Golang may sometimes be used for writing performant concurrent data processing in data science ... but it isn't used for the scraping and cleaning of data sources part of the job! However, this is a minor issue compared to the major blocker of not having an existing library that can import real world (i.e. sloppily formatted) CSV files.

So I have written my Go package for pushing CSV files to databases, gormcsv, and thanks to Go's great concurrency features it could certainly beat django-csvimport hands down in speed terms where big data quantities of CSV sources need ingesting. But I have yet to release it, because with such poor compatibility with real CSV files there doesn't seem to be much point. However, I will hopefully persist in finishing things off, probably with a less performant workaround that pre-cleans CSV files into strict RFC 4180 prior to parsing, since implementing my own CSV parser from scratch for Go would likely break my original goal of coming up with an open source project in the language that would take no longer than a week!
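
One possible shape for such a pre-clean pass, sketched with the standard library only: read the file with the relaxed Reader settings, pad short rows to the header width, and write everything back out through csv.Writer, which quotes fields and doubles embedded quotes as needed. The normalize function is hypothetical and is not what gormcsv does; it only handles the two kinds of sloppiness shown, and rows longer than the header are passed through unchanged.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strings"
)

// normalize re-reads loosely formatted CSV and re-writes it via csv.Writer,
// padding short rows with empty fields to match the header width.
func normalize(in string) (string, error) {
	r := csv.NewReader(strings.NewReader(in))
	r.LazyQuotes = true
	r.FieldsPerRecord = -1
	rows, err := r.ReadAll()
	if err != nil {
		return "", err
	}
	if len(rows) == 0 {
		return "", nil
	}
	width := len(rows[0])

	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	w.UseCRLF = true // RFC 4180 specifies CRLF line endings
	for _, row := range rows {
		for len(row) < width {
			row = append(row, "")
		}
		if err := w.Write(row); err != nil {
			return "", err
		}
	}
	w.Flush()
	return buf.String(), w.Error()
}

func main() {
	// Made-up data: a stray quote on row 2, a missing field on row 3.
	out, err := normalize("id,name,note\n1,5\" disk,ok\n2,widget\n")
	if err != nil {
		fmt.Println("normalize failed:", err)
		return
	}
	fmt.Printf("%q\n", out)
}
```

Reading everything into memory with ReadAll is the "less performant" part; a streaming version would call Read and Write row by row instead.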

Oh and what do I think of Go? Well I like it, I most like the concept of classes just being data structs with bags of composed methods loosely coupled to them. I least like the error handling unseparated from normal code flow ... since it can lead to poor readability of code due to the excessive error boilerplate stuck within the program flow. It is my new favourite (statically typed) language ... but it hasn't replaced Python as my overall favourite.