Python – semifluid.com

2019 VSS DNA

Steven A. Cholewiak — Sat, 27 Apr 2019 16:00:00 +0000

It’s been a while since I last posted one, but here’s a new Vision Sciences Society force-directed diagram of co-authorships (see past graphs here: 2014, 2015, & 2016). This year has 1293 abstracts for analysis. The graph was generated in Python using NetworkX, with authors and abstracts as nodes and edges corresponding to authorship. Individuals who are authors on more than one abstract will have edges connecting to those abstracts.

Orange dots are abstracts, light blue dots correspond to individuals who are first authors on an abstract, and dark blue dots correspond to the other author(s). You can view an interactive version here.
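The node-and-edge layout described above can be sketched in a few lines. This is only an illustration, not the code behind the graph: the abstract titles and author names are placeholders, and plain Python sets and dicts stand in for NetworkX so that the example runs without dependencies (with NetworkX, each edge would be an `add_edge` call on a `Graph`).

```python
# Hypothetical input: each abstract is (title, [authors in byline order]).
abstracts = [
    ("Abstract 1", ["Author A", "Author B"]),
    ("Abstract 2", ["Author B", "Author C"]),
]


def build_authorship_graph(abstracts):
    """Return (edges, roles) for a graph of abstract nodes and author nodes."""
    edges = set()   # (author, abstract) pairs, one per authorship
    roles = {}      # node name -> "abstract", "first_author", or "other_author"
    for title, authors in abstracts:
        roles[title] = "abstract"
        for position, author in enumerate(authors):
            edges.add((author, title))
            if position == 0:
                # A single first authorship is enough to mark the node.
                roles[author] = "first_author"
            else:
                # Never downgrade someone already marked as a first author.
                roles.setdefault(author, "other_author")
    return edges, roles


edges, roles = build_authorship_graph(abstracts)
```

Here "Author B" appears on both abstracts, so that node gets two edges and links the two abstracts into one chain.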

2015 VSS DNA

Steven A. Cholewiak — Mon, 23 Mar 2015 19:54:41 +0000

Another year, another Vision Sciences Society force-directed diagram of co-authorships (see last year’s 2014 VSS DNA). This year, we have 1419 abstracts being analyzed. The graph was generated in Python using NetworkX, with authors and abstracts as nodes and edges corresponding to authorship. Individuals who are authors on more than one abstract will have edges connecting to those abstracts.

Orange dots are abstracts, light blue dots correspond to individuals who are first authors, and dark blue dots correspond to the other author(s). This visualisation should not be interpreted as sets of in-groups/out-groups. It ignores past/future VSS co-authorships, casual collaborations, professional collaborations outside of VSS, and likely has inaccuracies due to the way authors’ names are analysed (see after the break for more). I am intrigued by the “scholarly social network” and this visualization is just one piece of a very incomplete puzzle.

There are often inconsistencies in author names (e.g., “Steven Cholewiak” vs. “Steven A. Cholewiak” vs. “Stëvèn Chólëwìäk”), so I use difflib’s SequenceMatcher to calculate a similarity ratio for each pair of names, and names that are very similar (a ratio of 0.9 or higher) are assumed to be the same person. That is admittedly a very naïve method of dealing with naming inconsistencies (e.g., is “John Smith” the same person as “John Q. Smith” or “John H. Smith”?), but I’d love to see a better alternative.
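A minimal sketch of that matching rule, assuming the comparison is made on the raw name strings. The function names are made up for illustration; only `SequenceMatcher` and the 0.9 threshold come from the post.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9  # cut-off quoted in the post


def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two name strings."""
    return SequenceMatcher(None, a, b).ratio()


def same_person(a, b, threshold=SIMILARITY_THRESHOLD):
    """Treat two names as one author when their ratio reaches the threshold."""
    return name_similarity(a, b) >= threshold
```

With this rule, “Steven Cholewiak” and “Steven A. Cholewiak” clear the threshold and are merged. So are “John Q. Smith” and “John H. Smith”, which differ by a single character, and that is exactly the kind of false merge that makes the method naïve.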

You can view an interactive force-directed d3.js version here. The code for the graph and force-directed diagram generation is available on GitHub here. The notebooks can also be viewed using nbviewer.ipython.org:

Calories as a function of alcohol in popular beers

Steven A. Cholewiak — Sun, 14 Dec 2014 15:23:36 +0000

In the USA, a standard drink is defined as including 0.6 fluid ounce (18 mL or 14 g) of ethanol (see Alcohol equivalence), meaning that a “standard” 12 oz beer has about 5% ABV. However, beers vary quite a bit in their alcohol content as well as their caloric content, so it seems reasonable to ask: If I have a beer with a given ABV, approximately how many calories does it have?

While browsing the web, I found a table listing the calories in a number of beers and thought it would be interesting to visualize using Python and plot.ly. It is a simple visualization, but one I find neat. Without further ado:

Each blue point on the plot is a beer from the beer100.com domestic and international tables; feel free to explore the plot with your mouse. As you can see, unsurprisingly, as a beer’s alcohol content increases, so does the number of calories. Fitting a linear regression to the data, we see that a linear trend fits quite well: $latex f(x) = 28.2x + 8.25$, where $latex x$ is the beer’s ABV (in percent). This means that if a beer has an alcohol content of 5%, we can expect it to have approximately 150 calories (149.25 as predicted by the fit). However, there is quite a bit of variability between different beers of the same ABV. For example, Bud Ice Light and Kronenbourg Imported Dark Beer (whose label is a bit ambiguous, but I am assuming may be Kronenbourg 1664 Brune) are both 5% ABV, but have 115 and 163 calories per 12 oz, respectively.
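The fitted line is easy to evaluate directly. The sketch below uses the slope and intercept quoted above; the function name is just for illustration.

```python
SLOPE = 28.2       # calories per percentage point of ABV, from the fit above
INTERCEPT = 8.25   # calories at 0% ABV, from the fit above


def predicted_calories(abv_percent):
    """Calories per 12 oz predicted by the linear fit for a given ABV (in percent)."""
    return SLOPE * abv_percent + INTERCEPT


# The two 5% ABV beers mentioned above fall on either side of the prediction.
residual_bud_ice_light = 115 - predicted_calories(5)
residual_kronenbourg = 163 - predicted_calories(5)
```

The residuals put Bud Ice Light well below the line and the Kronenbourg above it, which is the spread the paragraph describes.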

In addition to the data points, I’ve also included a line illustrating the calories for pure ethanol as a function of ABV (assuming it is mixed with water to dilute it). This could be considered the “alcohol purity line” for empty calories (i.e., this would be the closest to a neutral spirit). If you compare light to non-light beers (done using a simple if “Light” is in name), you can see that the light beers are shifted closer to the pure ethanol line:
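One way to compute such a pure ethanol line is sketched below. The constants are assumptions, since the post does not say which values were used: about 7 kcal per gram of ethanol, a density of about 0.789 g/mL, and 29.5735 mL per US fluid ounce.

```python
ML_PER_US_FL_OZ = 29.5735          # assumed conversion factor
ETHANOL_DENSITY_G_PER_ML = 0.789   # assumed density near room temperature
ETHANOL_KCAL_PER_G = 7.0           # assumed energy content of ethanol


def ethanol_calories(abv_percent, serving_oz=12.0):
    """Calories from ethanol alone in a serving of the given ABV (in percent)."""
    ethanol_ml = serving_oz * ML_PER_US_FL_OZ * abv_percent / 100.0
    ethanol_g = ethanol_ml * ETHANOL_DENSITY_G_PER_ML
    return ethanol_g * ETHANOL_KCAL_PER_G
```

Under these assumptions a 12 oz serving at 5% ABV contains about 14 g of ethanol, or roughly 98 calories, which matches the standard-drink definition quoted at the top of the post.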

This simple string comparison misses a number of light beers (like Miller Genuine Draft 64 and Budweiser Select 55 which are also closest to the “alcohol purity line”), but captures the general trend. However, note that the more (in my humble opinion) flavorful and interesting beers lie above the original linear fit line.
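The light/non-light split amounts to a substring test. This sketch uses only beer names mentioned in the post; the function name is made up.

```python
def is_light(name):
    """Naive classifier: a beer counts as light only if 'Light' is in its name."""
    return "Light" in name


names = ["Bud Ice Light", "Miller Genuine Draft 64", "Budweiser Select 55"]
flags = {name: is_light(name) for name in names}
```

Only “Bud Ice Light” is flagged, while the two low-calorie beers without “Light” in their names slip through.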

Finally, I wanted to quickly compare the beer100.com data to brewer-supplied information. Unfortunately, most brewers avoid disclosing their nutritional facts; however, Anheuser-Busch and MillerCoors are relatively transparent, providing some facts about their beers and malt beverages. After normalizing the data to a 12oz serving size, we can see that, like the beer100.com data, there is quite a bit of variability.
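The normalization is a simple rescaling, sketched here on the assumption that calories scale linearly with serving size. The numbers in the example are made up and do not describe any real beer.

```python
def calories_per_12oz(calories, serving_oz):
    """Rescale a calorie count from its stated serving size to 12 oz."""
    if serving_oz <= 0:
        raise ValueError("serving size must be positive")
    return calories * 12.0 / serving_oz


# Hypothetical label: 200 calories in a 16 oz serving.
normalized = calories_per_12oz(200, 16)
```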

2 Degrees of Academic Separation using Google Scholar v1

Steven A. Cholewiak — Thu, 19 Jun 2014 09:15:40 +0000

Another post, another neat force-directed graph. This one illustrates the interconnections between professors and students who have been co-authors on some of my papers and presentations, as scraped from Google Scholar citations. It could be described as the first version of a rough illustration of my 2 degrees of separation in academia.

The dark orange circle in the center is myself, light blue circles are papers/presentations, light orange circles are co-authors, and dark-blue circles are co-authors of my co-authors (i.e., have not necessarily directly worked with me on a project).

Unfortunately, as of today, not all of my co-authors have Google Scholar pages, so there are a number of co-authors whose connections and branches are under-represented. In addition, Google Scholar does not necessarily accumulate all of a given author’s papers/presentations and often makes mistakes misattributing papers to profiles. So, the veracity of the information represented here should be taken with a grain of salt unless I find a better service for generating these networks.


As with the VSS DNA graph I made before the Vision Sciences Society Annual Meeting this past May, I used Python, NetworkX, and D3.js. In addition, I took advantage of another Python module, GoogleScholar, to screen-scrape information from the Google Scholar profiles.

Starting with my Google Scholar citation profile, I loop through the individual entries and extract the titles and co-authors of each entry. The names and titles are connected as nodes using NetworkX. I then had a list of co-authors:

Ari Weinstein
Benjamin Kunsberg
Bernard D Adelstein
Bina Pastakia
Chia-Chien Wu
Chris L Baker
David S Ebert
E Daniel Hirleman
Flip Phillips
Gaurav Kharkwal
Hong Z Tan
Jacob Feldman
Joshua B Tenenbaum
Julia E. Mazzarella
Kevin Sanik
Kristina Denisova
Kwangtaek Kim
Manish Singh
Matthew B Kocsis
Melissa M Kibbe
Paul Ringstad
Peter C Pantelis
Roger W. Cholewiak
Roland W Fleming
Ryan M Traylor
Steven W Zucker
Sung-Ho Kim
Tim Gerstner

To create the connections, I searched for the co-authors’ names on Google Scholar (the profiles that were used are linked above) and did the same thing, extracting the titles and (co-?)co-authors’ names. This allowed me to produce a network diagram illustrating individuals who have been my co-authors, along with co-authors of those co-authors. Many of my co-authors did not have profiles when I generated this first version, and there were a few with technical problems (e.g., one profile was populated with a large number of papers from a different person who shares my co-author’s name, and pruning these problematic entries would have been labor intensive). Still, it is a neat illustration worth sharing.
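The two-step expansion can be sketched as follows. This is not the code used for the graph: the scraping step is replaced by a hypothetical dictionary of already-extracted entries, all names and titles are placeholders, and plain sets and dicts stand in for NetworkX.

```python
# Hypothetical scraped data: profile owner -> list of (title, [authors]).
publications_by_author = {
    "Me": [("Paper 1", ["Me", "Coauthor X"])],
    "Coauthor X": [("Paper 2", ["Coauthor X", "Third Person"])],
}


def two_degree_network(me, publications_by_author):
    """Return (edges, roles) covering my papers and my co-authors' papers."""
    edges = set()
    roles = {me: "self"}

    # First degree: my own entries and the people listed on them.
    for title, authors in publications_by_author.get(me, []):
        roles[title] = "paper"
        for author in authors:
            edges.add((author, title))
            roles.setdefault(author, "coauthor")

    # Second degree: repeat for each co-author who has a profile.
    first_degree = [name for name, role in roles.items() if role == "coauthor"]
    for coauthor in first_degree:
        for title, authors in publications_by_author.get(coauthor, []):
            roles.setdefault(title, "paper")
            for author in authors:
                edges.add((author, title))
                roles.setdefault(author, "coauthor_of_coauthor")
    return edges, roles


edges, roles = two_degree_network("Me", publications_by_author)
```

Co-authors without a profile simply contribute no second-degree entries, which is why their branches end up under-represented.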

I am not currently including the code on this page because it is quite messy and “non-pythonic”, but I’m happy to share it if there is interest. In addition, since this image was produced with D3.js, there is an interactive version of the graph available. I chose not to include it because it can be quite computationally taxing with the large number of nodes and connections and therefore not the best for directly including on the blog.

UPDATE June 20, 2014: I removed the co-author labels from the lead image because I don’t want to give the false impression that specific co-authors are better connected than others. Since this visualization is dependent on a 3rd party scraping service, it is problematic to draw any conclusions about “connectedness” from this representation.

VSS 2014 “DNA” v1

Steven A. Cholewiak — Sat, 03 May 2014 18:20:29 +0000

Here’s an illustration I pulled together using Python, NetworkX, and D3.js to illustrate the interconnections between abstracts that will be presented at the Vision Sciences Society 2014 annual meeting in approximately 2 weeks. Orange dots represent abstracts, Light Blue dots represent authors with at least one first authorship, and Dark Blue dots represent other authors (second through last).

As you can see, there are large numbers of abstracts that have few shared authors. Those abstracts that share authors often join together to create “chains” of students, advisors, and colleagues.

This is a first version, hastily pulled together, so there are a few problems. The nodes are assigned to authors by name, which can be a problem for authors sharing the same name (which creates more connections than appropriate for a given node) or who have inconsistent reporting of their name (for example, omitting the middle initial or alternate spelling, which can create another erroneous node). I am thinking of addressing the duplicate node issue by using a string similarity metric (e.g., Levenshtein distance) to find strings that contain similar names to combine the connections, but this could be an issue if the names are truly different people. Alternatively, I could incorporate the authors’ affiliations, but this carries similar issues (e.g., I report my affiliation as “University of Giessen” while colleagues report it as “Justus-Liebig-Universität Gießen”).
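For reference, the Levenshtein distance mentioned above can be computed with a standard dynamic-programming routine. This is a generic textbook sketch, not code from the project, and how large a distance should count as “the same name” is left open.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

A dropped middle initial is cheap under this metric: “Steven Cholewiak” and “Steven A. Cholewiak” are three insertions apart. The two spellings of the Giessen affiliation are far apart, so affiliations would need a different kind of matching.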

Although there are lingering issues, it is still an interesting illustration of the connections between the different abstracts being presented at VSS 2014.

Here’s the code on GitHub: visvssrelationships

Batch Handbrake video file conversion with Python

Steven A. Cholewiak — Fri, 11 Apr 2014 07:01:15 +0000

I needed a quick little piece of code that would recursively iterate through a folder and its subfolders and convert all of the video files to H.264, so I took advantage of the Handbrake command line interface (CLI) and Python 2.7.x to do the work for me. This code snippet is not long or elaborate, but does the job, so hopefully it will be helpful to others.
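The general approach can be sketched as below. This is not the original script: it is written for Python 3 rather than 2.7.x, the function names are invented, and the `HandBrakeCLI` executable name and flags (`-i`, `-o`, `--preset`, `--two-pass`, `--turbo`) are assumptions based on the CLI options of that era, so check them against your installed version. The extension list and encoding options mirror the ones described in this post.

```python
import os

# Extensions the post says are converted (compared case-insensitively).
VIDEO_EXTENSIONS = {".avi", ".divx", ".flv", ".m4v", ".mkv",
                    ".mov", ".mpg", ".mpeg", ".wmv"}


def find_videos(root):
    """Yield paths of video files under root, walking subfolders recursively."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            extension = os.path.splitext(filename)[1].lower()
            if extension in VIDEO_EXTENSIONS:
                yield os.path.join(dirpath, filename)


def build_command(source):
    """Build an assumed HandBrakeCLI call that writes an MP4 next to the source."""
    target = os.path.splitext(source)[0] + ".mp4"
    return ["HandBrakeCLI", "-i", source, "-o", target,
            "--preset", "Normal", "--two-pass", "--turbo"]


# To actually convert, each command could be passed to subprocess.run:
#     for path in find_videos("/path/to/videos"):
#         subprocess.run(build_command(path), check=True)
```

Building the command as a list rather than one string avoids quoting problems with spaces in file names.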

Note that the Handbrake CLI options are defined in runstr. As-is, the script will convert videos with AVI, DIVX, FLV, M4V, MKV, MOV, MPG, MPEG, and WMV extensions to H.264 MP4s with the following options:

“Normal” preset
Two-pass encoding
Turbo first pass, which “significantly boost[s] the speed of the first pass – with minimal effect on quality”