============ Using Polars ============ When dealing with dataframes in Python the go to library has invariably been Pandas. It is pretty ubiquitous, but when dealing with larger datasets it starts to show its limitations. Firstly it is based on Numpy and for small to medium datasets it works well enough, but on larger datasets it starts to slow down and may not be able to load the data without breaking into chunks. This is where Polars outshines Pandas, it is based on a combination of running on Rust, using Arrow storage and ability to use a lazy method when required, rather than the eager method used by Pandas. Along with a reworking of the methods used, lazy working helps to speed up how quickly the result is processed. It does mean that the whole approach may need revising, although there exist aids to assist conversion from Pandas. One problem is that being fairly new some methods have been superceded so the internet sites date fairly quickly - obviously testing is vital. Using Polars for EDA ==================== Exploratory Data Analysis (EDA) are the initial steps used to query and understand a dataframe. To tie this to a real life case we shall use the Starbucks location dataframe, once the dataframe is mapped out we can use Altair to plot the locations worldwide, as a country, state and city. First make sure we can use Polars. Install Polars ============== We can install using pip or conda:: $ pip install polars or:: $ conda install polars Older Machines -------------- Older computers may have a runtime warning:: RuntimeWarning: Missing required CPU features. The following required CPU features were not detected: avx2, fma, bmi1, bmi2, lzcnt, movbe Continuing to use this version of Polars on this processor will likely result in a crash. Install `polars[rtcompat]` instead of `polars` to run Polars with better compatibility. Hint: If you are on an Apple ARM machine (e.g. M1) this is likely due to running Python under Rosetta. It is recommended to install a native version of Python that does not run under Rosetta x86-64 emulation. They require the following additional installation:: $ pip install polars[rtcompat] or:: $ conda install polars[rtcompat] Lazy versus Eager Loading ========================= Polars was recently delivered when most of the computers were multicore so it was designed to work in parallel rather than sequentially. With smaller databases it is in order to run eager loading, but for larger databases the lazy method waits to process aggregation and selection before loading the relevant parts of the database. This means that we have two approaches in polars, we can read a dataset if it is small or scan the dataset if large:: >> df = pl.read_csv('sales.csv') # Eager (immediate) >> df = pl.scan_csv('sales.csv') # Lazy (deferred) In contrast pandas is restricted to *read_csv*, the eager method. The expression syntax allows the query optimizer to rationalise the code and exclude unnecessary data. Where before we may have split the query into several units it is better to chain all the query together for polars. To see the result after using the expressions use *collect()* this materialises the LazyFrame into a DataFrame. .. hint:: LazyFrame versus DataFrame - A LazyFrame processes instructions (select, filter and/or aggregate), - A DataFrame shows or writes data. Working with Polars and Altair ------------------------------ When working in polars any python GUI can be used - the code is shown in the following as a block with two chevrons at the line start to show a python input. Just under the input the output is shown. .. sidebar:: The Normal Convention is Three Chevrons If three chevrons are used sphinx will load the python input and show the result, however the dataframe is too big and cannot be processed directly in sphinx. When visualising the data, using Altair, switch to Jupyter from the python GUI, the normal python GUIs cannot always show the interactive plots. The code will be shown in a block without any chevrons for the Python input. Directly following the code block sphinx will show the resulting Altair plot. The Starbucks dataframe is normally too large to work unaided in Altair, so before visualising Polars was used to restrict the size before plotting. Data Cleaning ============= When encountering a new dataframe, after the preliminary data has been gathered and assessed, properties of the dataframe are gathered that directly affect data cleansing. The importance of the various properties can only be found out when we know how the dataframe will be used. On larger dataframes visualisation may help to highlight problem areas that need fixing or methods to work around the problem. Load the Starbucks locations dataframe. There are several sources, we will use one of the Github sources. .. sidebar:: Loading from Github If you have a new source in github, to import first navigate to the folder which contains the data, then click on the *raw* data button and copy the URL address, which will show ``https:// raw.github~`` within the browser address bar. In this case :: https://raw.githubusercontent.com/mmcloughlin/starbucks/refs/heads/master/locations.json https://raw.githubusercontent.com/nsujay/Analysis-of-Starbucks-Worldwide-Stores/refs/heads/master/Starbucks_store_locations.csv https://raw.githubusercontent.com/chrismeller/StarbucksLocations/refs/heads/master/stores.csv https://raw.githubusercontent.com/sunny2309/datasets/refs/heads/master/starbucks_store_locations.csv Use the last one listed which seems to be a clone from kaggles store locations. This can be loaded as an external file which is useful when working with jupyter, also it loads in sphinx:: >> import polars as pl >> q0 = ( pl.scan_csv('https://raw.githubusercontent.com/sunny2309/datasets/refs/heads/master/starbucks_store_locations.csv') ) >> df0 = q0.collect() >> df0 shape: (25_600, 13) ┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐ │ Brand ┆ Store ┆ Store ┆ Ownership ┆ … ┆ Phone ┆ Timezone ┆ Longitude ┆ Latitude │ │ --- ┆ Number ┆ Name ┆ Type ┆ ┆ Number ┆ --- ┆ --- ┆ --- │ │ str ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ str ┆ f64 ┆ f64 │ │ ┆ str ┆ str ┆ str ┆ ┆ str ┆ ┆ ┆ │ ╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡ │ Starbucks ┆ 47370-257 ┆ Meritxell ┆ Licensed ┆ … ┆ 376818720 ┆ GMT+1:00 ┆ 1.53 ┆ 42.51 │ │ ┆ 954 ┆ , 96 ┆ ┆ ┆ ┆ Europe/An ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ dorra ┆ ┆ │ │ Starbucks ┆ 22331-212 ┆ Ajman ┆ Licensed ┆ … ┆ null ┆ GMT+04:00 ┆ 55.47 ┆ 25.42 │ │ ┆ 325 ┆ Drive ┆ ┆ ┆ ┆ Asia/Duba ┆ ┆ │ │ ┆ ┆ Thru ┆ ┆ ┆ ┆ i ┆ ┆ │ │ Starbucks ┆ 47089-256 ┆ Dana Mall ┆ Licensed ┆ … ┆ null ┆ GMT+04:00 ┆ 55.47 ┆ 25.39 │ │ ┆ 771 ┆ ┆ ┆ ┆ ┆ Asia/Duba ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ i ┆ ┆ │ │ Starbucks ┆ 22126-218 ┆ Twofour ┆ Licensed ┆ … ┆ null ┆ GMT+04:00 ┆ 54.38 ┆ 24.48 │ │ ┆ 024 ┆ 54 ┆ ┆ ┆ ┆ Asia/Duba ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ i ┆ ┆ │ │ Starbucks ┆ 17127-178 ┆ Al Ain ┆ Licensed ┆ … ┆ null ┆ GMT+04:00 ┆ 54.54 ┆ 24.51 │ │ ┆ 586 ┆ Tower ┆ ┆ ┆ ┆ Asia/Duba ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ i ┆ ┆ │ │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │ │ Starbucks ┆ 21401-212 ┆ Rex ┆ Licensed ┆ … ┆ 08 3824 ┆ GMT+00000 ┆ 106.7 ┆ 10.78 │ │ ┆ 072 ┆ ┆ ┆ ┆ 4668 ┆ 0 Asia/Sa ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ igon ┆ ┆ │ │ Starbucks ┆ 24010-226 ┆ Panorama ┆ Licensed ┆ … ┆ 08 5413 ┆ GMT+00000 ┆ 106.71 ┆ 10.72 │ │ ┆ 985 ┆ ┆ ┆ ┆ 8292 ┆ 0 Asia/Sa ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ igon ┆ ┆ │ │ Starbucks ┆ 47608-253 ┆ Rosebank ┆ Licensed ┆ … ┆ 278735001 ┆ GMT+00000 ┆ 28.04 ┆ -26.15 │ │ ┆ 804 ┆ Mall ┆ ┆ ┆ 59 ┆ 0 Africa/ ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ Johannesb ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ urg ┆ ┆ │ │ Starbucks ┆ 47640-253 ┆ Menlyn ┆ Licensed ┆ … ┆ null ┆ GMT+00000 ┆ 28.28 ┆ -25.79 │ │ ┆ 809 ┆ Maine ┆ ┆ ┆ ┆ 0 Africa/ ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ Johannesb ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ urg ┆ ┆ │ │ Starbucks ┆ 47609-253 ┆ Mall of ┆ Licensed ┆ … ┆ 278735002 ┆ GMT+00000 ┆ 28.11 ┆ -26.02 │ │ ┆ 286 ┆ Africa ┆ ┆ ┆ 15 ┆ 0 Africa/ ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ Johannesb ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ urg ┆ ┆ │ └───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘ This shows we have a dataframe 25600 rows by 13 columns. Not all the columns were shown, within LazyFrame we cannot use one of the polars *attributes* to obtain a more complete picture, first run *q.collect()* then the relevant *attribute* since we are showing data. .. sidebar:: Lazy and Eager Output *q0* is the output of LazyFrame after it has loaded the csv file and *df0* is the output after this has been collected into a dataframe. If there was any selection, filtering or aggregation then the LazyFrame output would become *q* and the dataframe *df*. The scan was enclosed in brackets to allow further data questions:: >> df0.glimpse() Rows: 25600 Columns: 13 $ Brand 'Starbucks', 'Starbucks', 'Starbucks', 'Starbucks', 'Starbucks', 'Starbucks', 'Starbucks', 'Starbucks', 'Starbucks', 'Starbucks' $ Store Number '47370-257954', '22331-212325', '47089-256771', '22126-218024', '17127-178586', '17688-182164', '18182-182165', '23359-229184', '30781-99022', '20423-205465' $ Store Name 'Meritxell, 96', 'Ajman Drive Thru', 'Dana Mall', 'Twofour 54', 'Al Ain Tower', 'Dalma Mall, Ground Floor', 'Dalma Mall, Level 1', 'Debenhams Yas Mall', 'Khalidiya Street', 'Eastern Mangroves' $ Ownership Type 'Licensed', 'Licensed', 'Licensed', 'Licensed', 'Licensed', 'Licensed', 'Licensed', 'Licensed', 'Licensed', 'Licensed' $ Street Address 'Av. Meritxell, 96', '1 Street 69, Al Jarf', 'Sheikh Khalifa Bin Zayed St.', 'Al Salam Street', 'Khaldiya Area, Abu Dhabi Island', 'Dalma Mall, Mussafah', 'Dalma Mall, Mussafah', 'Yas Island', 'Khalidiya St.', 'Al Salam Street, The Mangroves' $ City 'Andorra la Vella', 'Ajman', 'Ajman', 'Abu Dhabi', 'Abu Dhabi', 'Abu Dhabi', 'Abu Dhabi', 'Abu Dhabi', 'Abu Dhabi', 'Abu Dhabi' $ State/Province '7', 'AJ', 'AJ', 'AZ', 'AZ', 'AZ', 'AZ', 'AZ', 'AZ', 'AZ' $ Country 'AD', 'AE', 'AE', 'AE', 'AE', 'AE', 'AE', 'AE', 'AE', 'AE' $ Postcode 'AD500', null, null, null, null, null, null, null, null, null $ Phone Number '376818720', null, null, null, null, null, null, null, '26670052', null $ Timezone 'GMT+1:00 Europe/Andorra', 'GMT+04:00 Asia/Dubai', 'GMT+04:00 Asia/Dubai', 'GMT+04:00 Asia/Dubai', 'GMT+04:00 Asia/Dubai', 'GMT+04:00 Asia/Dubai', 'GMT+04:00 Asia/Dubai', 'GMT+04:00 Asia/Dubai', 'GMT+04:00 Asia/Muscat', 'GMT+04:00 Asia/Dubai' $ Longitude 1.53, 55.47, 55.47, 54.38, 54.54, 54.49, 54.49, 54.61, 55.69, 54.38 $ Latitude 42.51, 25.42, 25.39, 24.48, 24.51, 24.4, 24.4, 24.46, 24.19, 24.48 *glimpse* had to be made using the dataframe *df0*, it will not work with the lazyframe *q0*. Let's find the number of empty data entries in the first column *Brand*:: >> q = ( pl.scan_csv('https://raw.githubusercontent.com/sunny2309/datasets/refs/heads/master/starbucks_store_locations.csv').select(pl.col('Brand').is_null().sum()) ) >> df = q.collect() >> df shape: (1, 1) ┌───────┐ │ Brand │ │ --- │ │ u64 │ ╞═══════╡ │ 0 │ └───────┘ But wait we can do better - let's use the LazyFrame more sensibly:: >> q = q0.select(pl.col('Brand').is_null().sum()) >> df = q.collect() >> df shape: (1, 1) ┌───────┐ │ Brand │ │ --- │ │ u64 │ ╞═══════╡ │ 0 │ └───────┘ With such a wide dataframe it is better to use a method that shows the nulls over all the columns:: >> df0.null_count() shape: (1, 13) ┌───────┬────────┬────────────┬───────────┬───┬──────────────┬──────────┬───────────┬──────────┐ │ Brand ┆ Store ┆ Store Name ┆ Ownership ┆ … ┆ Phone Number ┆ Timezone ┆ Longitude ┆ Latitude │ │ --- ┆ Number ┆ --- ┆ Type ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ │ u64 ┆ --- ┆ u64 ┆ --- ┆ ┆ u64 ┆ u64 ┆ u64 ┆ u64 │ │ ┆ u64 ┆ ┆ u64 ┆ ┆ ┆ ┆ ┆ │ ╞═══════╪════════╪════════════╪═══════════╪═══╪══════════════╪══════════╪═══════════╪══════════╡ │ 0 ┆ 0 ┆ 0 ┆ 0 ┆ … ┆ 6861 ┆ 0 ┆ 1 ┆ 1 │ └───────┴────────┴────────────┴───────────┴───┴──────────────┴──────────┴───────────┴──────────┘ Once again the output was truncated and some of the columns are missing:: >> for col in q0.collect(): print(f'{col.name} - {col.is_null().sum()}') Brand - 0 Store Number - 0 Store Name - 0 Ownership Type - 0 Street Address - 2 City - 14 State/Province - 0 Country - 0 Postcode - 1521 Phone Number - 6861 Timezone - 0 Longitude - 1 Latitude - 1 It looks like almost a full house apart from *Postcode* and *Phone Nomber*. Since we are interested in the locations the two columns *Longitude* and *Latitude* need to be inspected - are we talking of a single row or are the missing values in two separate rows. Which column(s) hold unique values for every entry - the *Store Number* is a likely candidate, *Store Name* and *Street Address* less likely. .. sidebar:: Adding Python Code Beware of using python code such as this in normal queries, it will prevent polars optimising. Better if we had used the inbuilt polars methods:: >> q = ( df0.select(pl.all().is_null().sum()) .unpivot(value_name="missing") .filter(pl.col("missing") > 0) ) >> missing = q.collect() >> print("Missing columns:") >> print(missing) Missing columns: shape: (6, 2) ┌────────────────┬─────────┐ │ variable ┆ missing │ │ --- ┆ --- │ │ str ┆ u64 │ ╞════════════════╪═════════╡ │ Street Address ┆ 2 │ │ City ┆ 14 │ │ Postcode ┆ 1521 │ │ Phone Number ┆ 6861 │ │ Longitude ┆ 1 │ │ Latitude ┆ 1 │ └────────────────┴─────────┘ Check on *Store Name* to see how many unique values we have:: >> q = q0.select("Store Name").unique().count() >> unique_types = q.collect() >> print(unique_types) shape: (1, 1) ┌────────────┐ │ Store Name │ │ --- │ │ u64 │ ╞════════════╡ │ 25364 │ └────────────┘ There must be quite a few duplicates, 236. Also check on *Store Number* as opposed to *Store Name*:: >> q = q0.select("Store Number").unique().count() >> q.collect() shape: (1, 1) ┌──────────────┐ │ Store Number │ │ --- │ │ u64 │ ╞══════════════╡ │ 25599 │ └──────────────┘ That shows we have one duplicate as there were no null values. Are there any duplicated rows:: >> df = q0.collect() >> (df .filter(df.is_duplicated()) ) shape: (0, 13) ┌───────┬────────┬────────────┬───────────┬───┬──────────────┬──────────┬───────────┬──────────┐ │ Brand ┆ Store ┆ Store Name ┆ Ownership ┆ … ┆ Phone Number ┆ Timezone ┆ Longitude ┆ Latitude │ │ --- ┆ Number ┆ --- ┆ Type ┆ ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ --- ┆ str ┆ --- ┆ ┆ str ┆ str ┆ f64 ┆ f64 │ │ ┆ str ┆ ┆ str ┆ ┆ ┆ ┆ ┆ │ ╞═══════╪════════╪════════════╪═══════════╪═══╪══════════════╪══════════╪═══════════╪══════════╡ └───────┴────────┴────────────┴───────────┴───┴──────────────┴──────────┴───────────┴──────────┘ No row duplicates, one less problem. .. sidebar:: Using df We had to use the Dataframe *df* as it is referred within the method, Let's see what is happening in the rows where *Longitude* and *Latitude* are null:: >> q = q0.filter(pl.col("Latitude").is_null()) >> q.collect() shape: (1, 13) ┌───────────┬────────────┬────────────┬────────────┬───┬────────┬───────────┬───────────┬──────────┐ │ Brand ┆ Store ┆ Store Name ┆ Ownership ┆ … ┆ Phone ┆ Timezone ┆ Longitude ┆ Latitude │ │ --- ┆ Number ┆ --- ┆ Type ┆ ┆ Number ┆ --- ┆ --- ┆ --- │ │ str ┆ --- ┆ str ┆ --- ┆ ┆ --- ┆ str ┆ f64 ┆ f64 │ │ ┆ str ┆ ┆ str ┆ ┆ str ┆ ┆ ┆ │ ╞═══════════╪════════════╪════════════╪════════════╪═══╪════════╪═══════════╪═══════════╪══════════╡ │ Starbucks ┆ 19773-1609 ┆ Yoido IFC ┆ Joint ┆ … ┆ null ┆ GMT+09:00 ┆ null ┆ null │ │ ┆ 73 ┆ Mall - 1F ┆ Venture ┆ ┆ ┆ Asia/Seou ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ l ┆ ┆ │ └───────────┴────────────┴────────────┴────────────┴───┴────────┴───────────┴───────────┴──────────┘ Both *Longitude* and *Latitude* are null in this row. We may be lucky and one of the duplicated *Store Number* might just be this row- In this instance the square brackets are necessary:: >> (df0 ['Store Number'] .value_counts(sort=True) ) shape: (25_599, 2) ┌──────────────┬───────┐ │ Store Number ┆ count │ │ --- ┆ --- │ │ str ┆ u64 │ ╞══════════════╪═══════╡ │ 19773-160973 ┆ 2 │ │ 47370-257954 ┆ 1 │ │ 22331-212325 ┆ 1 │ │ 47089-256771 ┆ 1 │ │ 22126-218024 ┆ 1 │ │ … ┆ … │ │ 21401-212072 ┆ 1 │ │ 24010-226985 ┆ 1 │ │ 47608-253804 ┆ 1 │ │ 47640-253809 ┆ 1 │ │ 47609-253286 ┆ 1 │ └──────────────┴───────┘ .. sidebar:: Polars and Square Brackets Polars has been dropping the use of square brackets wherever possible - and using normal brackets instead. It appears that our Seoul store is the only one duplicated:: >> q = q0.filter((pl.col("Store Number") == "19773-160973")) >> df = q.collect() >> df shape: (2, 13) ┌───────────┬────────────┬────────────┬────────────┬───┬────────┬───────────┬───────────┬──────────┐ │ Brand ┆ Store ┆ Store Name ┆ Ownership ┆ … ┆ Phone ┆ Timezone ┆ Longitude ┆ Latitude │ │ --- ┆ Number ┆ --- ┆ Type ┆ ┆ Number ┆ --- ┆ --- ┆ --- │ │ str ┆ --- ┆ str ┆ --- ┆ ┆ --- ┆ str ┆ f64 ┆ f64 │ │ ┆ str ┆ ┆ str ┆ ┆ str ┆ ┆ ┆ │ ╞═══════════╪════════════╪════════════╪════════════╪═══╪════════╪═══════════╪═══════════╪══════════╡ │ Starbucks ┆ 19773-1609 ┆ Yoido IFC ┆ Joint ┆ … ┆ null ┆ GMT+09:00 ┆ null ┆ null │ │ ┆ 73 ┆ Mall - 1F ┆ Venture ┆ ┆ ┆ Asia/Seou ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ l ┆ ┆ │ │ Starbucks ┆ 19773-1609 ┆ Yoido IFC ┆ Joint ┆ … ┆ null ┆ GMT+09:00 ┆ 126.92 ┆ 37.53 │ │ ┆ 73 ┆ Mall - 1F ┆ Venture ┆ ┆ ┆ Asia/Seou ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ l ┆ ┆ │ └───────────┴────────────┴────────────┴────────────┴───┴────────┴───────────┴───────────┴──────────┘ Scroll to the end of the output, the second entry has *Longitude* and *Latitude* values which seem to be correct for the Yoido IFC Mall, so we can delete the row with both empty coordinates:: >> df2 = df0.filter(pl.col("Longitude").is_not_null()) >> print("DataFrame after dropping rows with nulls in 'Longitude':\n", df2) DataFrame after dropping rows with nulls in 'Longitude': shape: (25_599, 13) ┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐ │ Brand ┆ Store ┆ Store ┆ Ownership ┆ … ┆ Phone ┆ Timezone ┆ Longitude ┆ Latitude │ │ --- ┆ Number ┆ Name ┆ Type ┆ ┆ Number ┆ --- ┆ --- ┆ --- │ │ str ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ str ┆ f64 ┆ f64 │ │ ┆ str ┆ str ┆ str ┆ ┆ str ┆ ┆ ┆ │ ╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡ │ Starbucks ┆ 47370-257 ┆ Meritxell ┆ Licensed ┆ … ┆ 376818720 ┆ GMT+1:00 ┆ 1.53 ┆ 42.51 │ │ ┆ 954 ┆ , 96 ┆ ┆ ┆ ┆ Europe/An ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ dorra ┆ ┆ │ │ Starbucks ┆ 22331-212 ┆ Ajman ┆ Licensed ┆ … ┆ null ┆ GMT+04:00 ┆ 55.47 ┆ 25.42 │ │ ┆ 325 ┆ Drive ┆ ┆ ┆ ┆ Asia/Duba ┆ ┆ │ │ ┆ ┆ Thru ┆ ┆ ┆ ┆ i ┆ ┆ │ │ Starbucks ┆ 47089-256 ┆ Dana Mall ┆ Licensed ┆ … ┆ null ┆ GMT+04:00 ┆ 55.47 ┆ 25.39 │ │ ┆ 771 ┆ ┆ ┆ ┆ ┆ Asia/Duba ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ i ┆ ┆ │ │ Starbucks ┆ 22126-218 ┆ Twofour ┆ Licensed ┆ … ┆ null ┆ GMT+04:00 ┆ 54.38 ┆ 24.48 │ │ ┆ 024 ┆ 54 ┆ ┆ ┆ ┆ Asia/Duba ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ i ┆ ┆ │ │ Starbucks ┆ 17127-178 ┆ Al Ain ┆ Licensed ┆ … ┆ null ┆ GMT+04:00 ┆ 54.54 ┆ 24.51 │ │ ┆ 586 ┆ Tower ┆ ┆ ┆ ┆ Asia/Duba ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ i ┆ ┆ │ │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │ │ Starbucks ┆ 21401-212 ┆ Rex ┆ Licensed ┆ … ┆ 08 3824 ┆ GMT+00000 ┆ 106.7 ┆ 10.78 │ │ ┆ 072 ┆ ┆ ┆ ┆ 4668 ┆ 0 Asia/Sa ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ igon ┆ ┆ │ │ Starbucks ┆ 24010-226 ┆ Panorama ┆ Licensed ┆ … ┆ 08 5413 ┆ GMT+00000 ┆ 106.71 ┆ 10.72 │ │ ┆ 985 ┆ ┆ ┆ ┆ 8292 ┆ 0 Asia/Sa ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ igon ┆ ┆ │ │ Starbucks ┆ 47608-253 ┆ Rosebank ┆ Licensed ┆ … ┆ 278735001 ┆ GMT+00000 ┆ 28.04 ┆ -26.15 │ │ ┆ 804 ┆ Mall ┆ ┆ ┆ 59 ┆ 0 Africa/ ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ Johannesb ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ urg ┆ ┆ │ │ Starbucks ┆ 47640-253 ┆ Menlyn ┆ Licensed ┆ … ┆ null ┆ GMT+00000 ┆ 28.28 ┆ -25.79 │ │ ┆ 809 ┆ Maine ┆ ┆ ┆ ┆ 0 Africa/ ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ Johannesb ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ urg ┆ ┆ │ │ Starbucks ┆ 47609-253 ┆ Mall of ┆ Licensed ┆ … ┆ 278735002 ┆ GMT+00000 ┆ 28.11 ┆ -26.02 │ │ ┆ 286 ┆ Africa ┆ ┆ ┆ 15 ┆ 0 Africa/ ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ Johannesb ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ urg ┆ ┆ │ └───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘ We have one less row, recheck with the Store Number and df2 dataframe:: >> df2.filter( pl.col("Store Number") == "19773-160973") shape: (1, 13) ┌───────────┬────────────┬────────────┬────────────┬───┬────────┬───────────┬───────────┬──────────┐ │ Brand ┆ Store ┆ Store Name ┆ Ownership ┆ … ┆ Phone ┆ Timezone ┆ Longitude ┆ Latitude │ │ --- ┆ Number ┆ --- ┆ Type ┆ ┆ Number ┆ --- ┆ --- ┆ --- │ │ str ┆ --- ┆ str ┆ --- ┆ ┆ --- ┆ str ┆ f64 ┆ f64 │ │ ┆ str ┆ ┆ str ┆ ┆ str ┆ ┆ ┆ │ ╞═══════════╪════════════╪════════════╪════════════╪═══╪════════╪═══════════╪═══════════╪══════════╡ │ Starbucks ┆ 19773-1609 ┆ Yoido IFC ┆ Joint ┆ … ┆ null ┆ GMT+09:00 ┆ 126.92 ┆ 37.53 │ │ ┆ 73 ┆ Mall - 1F ┆ Venture ┆ ┆ ┆ Asia/Seou ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ l ┆ ┆ │ └───────────┴────────────┴────────────┴────────────┴───┴────────┴───────────┴───────────┴──────────┘ The *Longitude* and *Latitude* show their coordinates. That's a lot of scripts for just one line of cleansing. With such a large dataframe it is unlikely to be able to automate without losing a lot of data. There are many combinations, each requiring a different solution. So if our Seoul data had been a single entry we would have loaded the coordinates - but there were two entries, so we had to find which one had coordinates. If the empty cells had been split then we would have to load additional information to be able to retain the rows. The majority of the dataframe is still unknown. This is where visualisation can help. *State/Province* and *Country* both seem complete, and we know that *Longitude* and *Latitude* are now complete. Visualisation ============= We can create a dataframe in Polars then use it in Altair. By adding a tooltip we can identify points on the chart - which in the following cases will be maps. It is important to reduce the number of points we are using, either by aggregation or selecting one country, state or city at a time. Let's first see how many different occurrences of the different columns there are:: >> (df2 ['State/Province'] .value_counts(sort=True) ) shape: (338, 2) ┌────────────────┬───────┐ │ State/Province ┆ count │ │ --- ┆ --- │ │ str ┆ u64 │ ╞════════════════╪═══════╡ │ CA ┆ 2821 │ │ TX ┆ 1042 │ │ ENG ┆ 787 │ │ WA ┆ 757 │ │ 11 ┆ 705 │ │ … ┆ … │ │ POS ┆ 1 │ │ SFO ┆ 1 │ │ CYQ ┆ 1 │ │ TTT ┆ 1 │ │ TXQ ┆ 1 │ └────────────────┴───────┘ There are 338 distinct values for *State/Province*, California (CA) seems to be the leader. Now let's see what *Country* shows:: >> (df2 ['Country'] .value_counts(sort=True) ) shape: (73, 2) ┌─────────┬───────┐ │ Country ┆ count │ │ --- ┆ --- │ │ str ┆ u64 │ ╞═════════╪═══════╡ │ US ┆ 13608 │ │ CN ┆ 2734 │ │ CA ┆ 1468 │ │ JP ┆ 1237 │ │ KR ┆ 992 │ │ … ┆ … │ │ TT ┆ 3 │ │ ZA ┆ 3 │ │ LU ┆ 2 │ │ MC ┆ 2 │ │ AD ┆ 1 │ └─────────┴───────┘ There are only 73 countries, the majority of outlets concentrated into the US - just over half the total. China (CN) is in aecond place followed by Canada(CA). If we were to use the country we would be stuck with a single blob over the USA - not very useful unless we want to concentrate on the countries outside the USA. Let's see what *City* shows:: >> (df2 ['City'] .value_counts(sort=True) ) shape: (5_471, 2) ┌──────────────┬───────┐ │ City ┆ count │ │ --- ┆ --- │ │ str ┆ u64 │ ╞══════════════╪═══════╡ │ 上海市 ┆ 542 │ │ Seoul ┆ 242 │ │ 北京市 ┆ 234 │ │ New York ┆ 232 │ │ London ┆ 216 │ │ … ┆ … │ │ Rock Springs ┆ 1 │ │ Sheridian ┆ 1 │ │ Johannesburg ┆ 1 │ │ Menlyn ┆ 1 │ │ Midrand ┆ 1 │ └──────────────┴───────┘ There are 5471 separate cities which would probably crowd out many locations, the countries were too few but states appeared to be the best solution we have - this will divide the US block into 50 states, which will be probably manageable. Aggregate on State/Province --------------------------- Using polars, we are going to aggregate State/Province and provide the mean Latitude and Longitude, after the aggregation it should be safe to use altair. Let's also show the number of stores per point, use *len()*, note where this is positioned within *agg()*:: >> q = (q0.group_by("State/Province") .agg( pl.len().alias("count"), pl.col("Longitude").mean(), pl.col("Latitude").mean(), ) .sort("count", descending=True)) >> df = q.collect() >> df shape: (338, 4) ┌────────────────┬───────┬─────────────┬───────────┐ │ State/Province ┆ count ┆ Longitude ┆ Latitude │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ u64 ┆ f64 ┆ f64 │ ╞════════════════╪═══════╪═════════════╪═══════════╡ │ CA ┆ 2821 ┆ -119.505207 ┆ 35.510578 │ │ TX ┆ 1042 ┆ -97.100269 ┆ 31.092726 │ │ ENG ┆ 787 ┆ -1.012211 ┆ 52.163215 │ │ WA ┆ 757 ┆ -121.84465 ┆ 47.416618 │ │ 11 ┆ 706 ┆ 121.629702 ┆ 37.964255 │ │ … ┆ … ┆ … ┆ … │ │ CAJ ┆ 1 ┆ -78.51 ┆ -7.15 │ │ YAR ┆ 1 ┆ 39.87 ┆ 57.63 │ │ TTT ┆ 1 ┆ 121.15 ┆ 22.75 │ │ IR ┆ 1 ┆ 35.87 ┆ 32.58 │ │ POS ┆ 1 ┆ -61.53 ┆ 10.66 │ └────────────────┴───────┴─────────────┴───────────┘ This produces 338 points, one for each State/Province, now first create a world map, available in altair, make this the background and superimpose the points made with the df dataframe. .. altair-plot:: import polars as pl q0 = ( pl.scan_csv('https://raw.githubusercontent.com/sunny2309/datasets/refs/heads/master/starbucks_store_locations.csv') ) q = (q0.group_by("State/Province") .agg( pl.len().alias("count"), pl.col("Longitude").mean(), pl.col("Latitude").mean(), ) .sort("count", descending=True)) df = q.collect() import altair as alt from altair.datasets import data source = alt.topo_feature(data.world_110m.url, 'countries') background = alt.Chart(source).mark_geoshape( fill='lightgray', stroke='white' ).properties( width=500, height=300 ).project('naturalEarth1') points = alt.Chart(df).mark_circle( color="tomato" ).encode( longitude="Longitude:Q", latitude="Latitude:Q", tooltip = ["State/Province", "count"] ) background + points This will show tomato coloured points over the world map. Move your mouse pointer over the points and the State/Province is shown with their number. Most of the points seem to be located in the correct place. Quite a few have strayed into the Atlantic, presumably pulled over by strays. This might be solved by adding *Country* to our group_by command:: df2.group_by(['State/Province','Country']).agg(pl.len()).sort(['len'],descending=True) shape: (545, 3) State/Province Country len str str u64 "CA" "US" 2821 "TX" "US" 1042 "ENG" "GB" 787 "WA" "US" 757 "FL" "US" 694 … … … "44" "TH" 1 "70" "TH" 1 "20" "TR" 1 "CHA" "TT" 1 "CYQ" "TW" 1 This will increase the point number, so increase the map size. .. altair-plot:: q = (q0.group_by(("State/Province", "Country")) .agg( pl.len().alias("count"), pl.col("Longitude").mean(), pl.col("Latitude").mean(), )) df = q.collect() # alt.data_transformers.enable("vegafusion") source = alt.topo_feature(data.world_110m.url, 'countries') background = alt.Chart(source).mark_geoshape( fill='lightgray', stroke='white' ).properties( width=600, height=360 ).project('naturalEarth1') points = alt.Chart(df).mark_circle( color="steelblue" ).encode( longitude="Longitude:Q", latitude="Latitude:Q", tooltip = ["State/Province", "Country", "count"] ) background + points We added *Country* to the tooltip and changed the point colour. Most of the obvious outlyers have disappeared. All the *State/Province* with numbers seem to be in east Asia. If we add a *len()* we can find the count of each point by adding *len* to the tooltip. Some parts of the map are a continuous colour, to see what the situation is we need to increase the map size country by country or reduce the number of points.- As we have seen *State/Province* requires a major overhaul if used by itself. London Starbucks ================ If we wish to choose a city with an Altair map then London might fit the bill, Altair has a map of greater London within its dataset collection showing the London boroughs:: q = ( pl.scan_csv('https://raw.githubusercontent.com/sunny2309/datasets/refs/heads/master/starbucks_store_locations.csv' )) q = q0.filter( (pl.col('City') == 'London') & (pl.col('Country') == 'GB')) q.collect() shape: (195, 13) ┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐ │ Brand ┆ Store ┆ Store ┆ Ownership ┆ … ┆ Phone ┆ Timezone ┆ Longitude ┆ Latitude │ │ --- ┆ Number ┆ Name ┆ Type ┆ ┆ Number ┆ --- ┆ --- ┆ --- │ │ str ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ str ┆ f64 ┆ f64 │ │ ┆ str ┆ str ┆ str ┆ ┆ str ┆ ┆ ┆ │ ╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡ │ Starbucks ┆ 12929-137 ┆ Clapham - ┆ Company ┆ … ┆ 207924467 ┆ GMT+0:00 ┆ -0.17 ┆ 51.46 │ │ ┆ 795 ┆ St. ┆ Owned ┆ ┆ 0 ┆ Europe/Lo ┆ ┆ │ │ ┆ ┆ John's ┆ ┆ ┆ ┆ ndon ┆ ┆ │ │ ┆ ┆ Road ┆ ┆ ┆ ┆ ┆ ┆ │ │ Starbucks ┆ 7013-1500 ┆ Brixton ┆ Company ┆ … ┆ 207274797 ┆ GMT+0:00 ┆ -0.11 ┆ 51.46 │ │ ┆ 64 ┆ Station ┆ Owned ┆ ┆ 9 ┆ Europe/Lo ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ ndon ┆ ┆ │ │ Starbucks ┆ 14945-154 ┆ Stansted ┆ Franchise ┆ … ┆ 01279 ┆ GMT+0:00 ┆ 0.25 ┆ 51.88 │ │ ┆ 366 ┆ Airport ┆ ┆ ┆ 680361 ┆ Europe/Lo ┆ ┆ │ │ ┆ ┆ DT ┆ ┆ ┆ ┆ ndon ┆ ┆ │ │ Starbucks ┆ 12949-140 ┆ Juxon ┆ Company ┆ … ┆ 207248035 ┆ GMT+0:00 ┆ -0.1 ┆ 51.51 │ │ ┆ 506 ┆ House - ┆ Owned ┆ ┆ 9 ┆ Europe/Lo ┆ ┆ │ │ ┆ ┆ St. Pauls ┆ ┆ ┆ ┆ ndon ┆ ┆ │ │ ┆ ┆ Church… ┆ ┆ ┆ ┆ ┆ ┆ │ │ Starbucks ┆ 12600-982 ┆ Pentonvil ┆ Company ┆ … ┆ 207812107 ┆ GMT+0:00 ┆ -0.12 ┆ 51.53 │ │ ┆ 09 ┆ le Road ┆ Owned ┆ ┆ 3 ┆ Europe/Lo ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ ndon ┆ ┆ │ │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │ │ Starbucks ┆ 12364-774 ┆ Oxford ┆ Company ┆ … ┆ 207491946 ┆ GMT+0:00 ┆ -0.15 ┆ 51.51 │ │ ┆ 60 ┆ Street - ┆ Owned ┆ ┆ 3 ┆ Europe/Lo ┆ ┆ │ │ ┆ ┆ Selfridge ┆ ┆ ┆ ┆ ndon ┆ ┆ │ │ ┆ ┆ s ┆ ┆ ┆ ┆ ┆ ┆ │ │ Starbucks ┆ 1304-1438 ┆ Conduit ┆ Company ┆ … ┆ 207493975 ┆ GMT+0:00 ┆ -0.14 ┆ 51.51 │ │ ┆ 65 ┆ Street ┆ Owned ┆ ┆ 4 ┆ Europe/Lo ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ ndon ┆ ┆ │ │ Starbucks ┆ 12397-856 ┆ Chiswick ┆ Company ┆ … ┆ 208994159 ┆ GMT+0:00 ┆ -0.26 ┆ 51.49 │ │ ┆ 93 ┆ - High ┆ Owned ┆ ┆ 8 ┆ Europe/Lo ┆ ┆ │ │ ┆ ┆ Road/Fish ┆ ┆ ┆ ┆ ndon ┆ ┆ │ │ ┆ ┆ er's … ┆ ┆ ┆ ┆ ┆ ┆ │ │ Starbucks ┆ 12101-119 ┆ Covent ┆ Company ┆ … ┆ 207836623 ┆ GMT+0:00 ┆ -0.12 ┆ 51.51 │ │ ┆ 03 ┆ Garden - ┆ Owned ┆ ┆ 1 ┆ Europe/Lo ┆ ┆ │ │ ┆ ┆ Russell ┆ ┆ ┆ ┆ ndon ┆ ┆ │ │ ┆ ┆ St ┆ ┆ ┆ ┆ ┆ ┆ │ │ Starbucks ┆ 47901-260 ┆ London ┆ Company ┆ … ┆ 207407736 ┆ GMT+0:00 ┆ -0.09 ┆ 51.51 │ │ ┆ 029 ┆ Bridge-Sh ┆ Owned ┆ ┆ 4 ┆ Europe/Lo ┆ ┆ │ │ ┆ ┆ ard ┆ ┆ ┆ ┆ ndon ┆ ┆ │ │ ┆ ┆ Arcade ┆ ┆ ┆ ┆ ┆ ┆ │ └───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘ Note that we added `('Country') == 'GB'` to ensure that no other London is included. ` We should be able to show 195 points on a London map. .. altair-plot:: import polars as pl import altair as alt from altair.datasets import data q0 = ( pl.scan_csv('https://raw.githubusercontent.com/sunny2309/datasets/refs/heads/master/starbucks_store_locations.csv' )) q = q0.filter( (pl.col('City') == 'London') & (pl.col('Country') == 'GB')) df = q.collect() boroughs = alt.topo_feature(data.london_boroughs.url, 'boroughs') boroughs = alt.Chart(boroughs, width=700, height=500).mark_geoshape( stroke='white', strokeWidth=2 ).encode( color=alt.value('#ccc'), ) boroughs Now add the London Starbucks. .. altair-plot:: London = alt.Chart(df).mark_circle( color="tomato" ).encode( longitude="Longitude:Q", latitude="Latitude:Q", tooltip = ["State/Province", "City", "Country", "Store Name"] ) boroughs + London Adding the London Starbucks locations to the London map we find that the majority of locations are at the bottom of the map, one location is at the top about a third of the way towards the centre - which shows as Victoria Station - one at Stanstead airport - well outside the borough map - one at Borehamwood - in the home counties, but maybe within the boroughs map. First Victoria Station needs new coordinates and Stanstead airport can be dropped. Victoria Station's existing coordinates put it well away from other coordinates, so update using a *when, then, otherwise* method. Victoria Station is at Longitude -0.1441 and Latitude 51.4952, a far cry from the original:: df = df.with_columns( pl.when(pl.col("Longitude") == -2.87).then(-0.1441).otherwise(pl.col("Longitude")).alias("Longitude") ) df = df.with_columns( pl.when(pl.col("Latitude") == 55.74).then(51.4952).otherwise(pl.col("Latitude")).alias("Latitude") ) Dropping Stansted Airport was more difficult because the *Store Name* showed up as ``Stansted Airport DT`` on the tooltip - trying to use the *Store Name* failed with "Stansted Airport DT" and "Stansted Airport DT " - going back to the coordinates using Longitude 0.25 showed that the *Store Name* was :: "Stansted Airport DT" df.filter(pl.col("Longitude") == 0.25) shape: (1, 13) Brand Store Number Store Name Ownership Type Street Address City State/Province Country Postcode Phone Number Timezone Longitude Latitude str str str str str str str str str str str f64 f64 "Starbucks" "14945-154366" "Stansted "Franchise" "A120 Thermal Lane, London Stan…"London" "ENG" "GB" "CM24 1PY" "01279 680361" "GMT+0:00 Europe/London" 0.25 51.88 Airport DT" We can drop Stansted using Longitude. .. altair-plot:: df = df.with_columns( pl.when(pl.col("Longitude") == -2.87).then(-0.1441).otherwise(pl.col("Longitude")).alias("Longitude") ) df = df.with_columns( pl.when(pl.col("Latitude") == 55.74).then(51.4952).otherwise(pl.col("Latitude")).alias("Latitude") ) df = df.filter(pl.col("Longitude") != 0.25) London = alt.Chart(df).mark_circle( color="tomato" ).encode( longitude="Longitude:Q", latitude="Latitude:Q", tooltip = ["State/Province", "City", "Country", "Store Name"] ) boroughs + London This now showed the locations in the general outline of the London Boroughs, if necessary we could add the borough names to the map. Noticeable is the fact that the coordinates have been massaged, there is no way that the stores appear in straight rows, certainly not in London. To add the borough names we can use the Altair dataset *london_centroids* which gives the name and the centroid coordinates for each London borough. *london_centroids* is a small database, it can be queried directly in altair:: data.london_centroids().info() RangeIndex: 33 entries, 0 to 32 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 name 33 non-null string 1 cx 33 non-null Float64 2 cy 33 non-null Float64 dtypes: Float64(2), string(1) memory usage: 990.0 bytes Note how the borough names are positioned. .. altair-plot:: centroids = data.london_centroids.url labels = alt.Chart(centroids).mark_text().encode( longitude='cx:Q', latitude='cy:Q', text='bLabel:N', size=alt.value(8), opacity=alt.value(0.8) ).transform_calculate( "bLabel", "indexof (datum.name,' ') > 0 ? substring(datum.name,0,indexof(datum.name, ' ')) : datum.name" ) boroughs + labels This looks well enough, now all together. .. altair-plot:: boroughs + labels + London We spotted Victoria Station required new coordinates, one in 195 rows, which over 25600 rows is probably amounting to about 130 poorly made coordinates, plus a column *State/Province* poorly attributed or not clearly controlled. Unfortunately we cannot use the *Store Number* to help, as seen on the London Starbucks, the numbers are probably allocated sequentially on a date basis, certainly not according to *Country*, *State/Province* or *City*. Empty City ========== When we inirially looked at the data within the column *City* there were 14 missing data entries in *City*. Maybe we will be lucky and be able to find the *City* from the empty *City* - anyway worth a look:: >> q = q0.filter(pl.col("City").is_null()) >> q.collect() shape: (14, 13) ┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐ │ Brand ┆ Store ┆ Store ┆ Ownership ┆ … ┆ Phone ┆ Timezone ┆ Longitude ┆ Latitude │ │ --- ┆ Number ┆ Name ┆ Type ┆ ┆ Number ┆ --- ┆ --- ┆ --- │ │ str ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ str ┆ f64 ┆ f64 │ │ ┆ str ┆ str ┆ str ┆ ┆ str ┆ ┆ ┆ │ ╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡ │ Starbucks ┆ 31657-104 ┆ سان ┆ Licensed ┆ … ┆ 201208002 ┆ GMT+2:00 ┆ 29.96 ┆ 31.24 │ │ ┆ 436 ┆ ستيفانو ┆ ┆ ┆ 87 ┆ Africa/Ca ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ iro ┆ ┆ │ │ Starbucks ┆ 32152-109 ┆ النايل ┆ Licensed ┆ … ┆ 201208003 ┆ GMT+2:00 ┆ 31.23 ┆ 30.07 │ │ ┆ 504 ┆ سيتى ┆ ┆ ┆ 07 ┆ Africa/Ca ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ iro ┆ ┆ │ │ Starbucks ┆ 32314-115 ┆ أسكندرية ┆ Licensed ┆ … ┆ 201850222 ┆ GMT+2:00 ┆ 31.03 ┆ 30.06 │ │ ┆ 172 ┆ الصحراوى ┆ ┆ ┆ 14 ┆ Africa/Ca ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ iro ┆ ┆ │ │ Starbucks ┆ 31479-105 ┆ مكرم عبيد ┆ Licensed ┆ … ┆ 201208003 ┆ GMT+2:00 ┆ 31.34 ┆ 30.09 │ │ ┆ 246 ┆ ┆ ┆ ┆ 32 ┆ Africa/Ca ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ iro ┆ ┆ │ │ Starbucks ┆ 31756-107 ┆ سيتى ┆ Licensed ┆ … ┆ 201208003 ┆ GMT+2:00 ┆ 31.33 ┆ 30.06 │ │ ┆ 161 ┆ ستارز 1 ┆ ┆ ┆ 50 ┆ Africa/Ca ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ iro ┆ ┆ │ │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │ │ Starbucks ┆ 31646-106 ┆ مطار ┆ Licensed ┆ … ┆ 201208003 ┆ GMT+2:00 ┆ 31.41 ┆ 30.11 │ │ ┆ 547 ┆ القاهرة ┆ ┆ ┆ 35 ┆ Africa/Ca ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ iro ┆ ┆ │ │ Starbucks ┆ 31755-107 ┆ سنافير - ┆ Licensed ┆ … ┆ 201208003 ┆ GMT+2:00 ┆ 34.33 ┆ 27.91 │ │ ┆ 182 ┆ نعمه بيه ┆ ┆ ┆ 27 ┆ Africa/Ca ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ iro ┆ ┆ │ │ Starbucks ┆ 32389-107 ┆ المركاتو ┆ Licensed ┆ … ┆ 201850222 ┆ GMT+2:00 ┆ 34.33 ┆ 27.92 │ │ ┆ 342 ┆ مول2 ┆ ┆ ┆ 17 ┆ Africa/Ca ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ iro ┆ ┆ │ │ Starbucks ┆ 32490-111 ┆ خان لاجونا ┆ Licensed ┆ … ┆ 201898885 ┆ GMT+2:00 ┆ 34.43 ┆ 28.04 │ │ ┆ 349 ┆ ┆ ┆ ┆ 47 ┆ Africa/Ca ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ iro ┆ ┆ │ │ Starbucks ┆ 31429-102 ┆ ابراج ┆ Licensed ┆ … ┆ 966257190 ┆ GMT+03:00 ┆ 39.83 ┆ 21.42 │ │ ┆ 231 ┆ البيت 1 ┆ ┆ ┆ 12 ┆ Asia/Riya ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ dh ┆ ┆ │ └───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘ Interesting the *Store Name* all had a similar script and the *Timezone* was Cairo. Cairo is 31.23 longitude and 30.03 latitude - not so far from some of our empty *City*. First check what the *Country* is:: >> q = q0.filter(pl.col("City").is_null()).select(["Country"]) >> pl.Config.set_tbl_rows(20) >> q.collect() shape: (14, 1) ┌─────────┐ │ Country │ │ --- │ │ str │ ╞═════════╡ │ EG │ │ EG │ │ EG │ │ EG │ │ EG │ │ EG │ │ EG │ │ EG │ │ EG │ │ EG │ │ EG │ │ EG │ │ EG │ │ SA │ └─────────┘ One store was from Saudi Arabia (SA), the rest were Egypt (EG). Let's see how it plots. .. altair-plot:: import polars as pl import altair as alt q0 = ( pl.scan_csv('https://raw.githubusercontent.com/sunny2309/datasets/refs/heads/master/starbucks_store_locations.csv') ) q = q0.filter( ( (pl.col('Country') == 'EG'))) df = q.collect() url = "https://raw.githubusercontent.com/samateja/D3topoJson/refs/heads/master/egypt.json" source = alt.topo_feature(url, "egypt") eg = alt.Chart(source, width=700, height=500).mark_geoshape().encode() star = alt.Chart(df).mark_circle( color="darkorange" ).encode( longitude="Longitude:Q", latitude="Latitude:Q", tooltip = ["Store Number", "State/Province", "City", "Country", "Store Name", "Street Address"] ) eg + star Not too bad, there is one stray Cairo plot on the north coast next to the Alexandria entry. Two appear to be at Sharm El-Sheikh, the rest are concentrated at Cairo. The two at Sharm El-Sheikh both are null for the *City*. As my Arabic is non-existant we'll leave as is, maybe the one labelled *Cairo* on the Mediteranean is on the Cairo-Alexander Road about 28km from the Cairo centre. Rhode Island Starbucks ====================== Check what is potting with the US states - let's choose Rhode Island it's small by American standards:: >>import polars as pl >> q0 = ( pl.scan_csv('https://raw.githubusercontent.com/sunny2309/datasets/refs/heads/master/starbucks_store_locations.csv') ) >> q = q0.filter( (pl.col('State/Province') == 'RI') ) >> q.collect() shape: (29, 13) ┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐ │ Brand ┆ Store ┆ Store ┆ Ownership ┆ … ┆ Phone ┆ Timezone ┆ Longitude ┆ Latitude │ │ --- ┆ Number ┆ Name ┆ Type ┆ ┆ Number ┆ --- ┆ --- ┆ --- │ │ str ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ str ┆ f64 ┆ f64 │ │ ┆ str ┆ str ┆ str ┆ ┆ str ┆ ┆ ┆ │ ╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡ │ Starbucks ┆ 15755-164 ┆ Megamall ┆ Licensed ┆ … ┆ null ┆ GMT+00000 ┆ 104.06 ┆ 1.13 │ │ ┆ 624 ┆ Batam ┆ ┆ ┆ ┆ 0 Asia/Ma ┆ ┆ │ │ ┆ ┆ Center ┆ ┆ ┆ ┆ kassar ┆ ┆ │ │ Starbucks ┆ 15756-164 ┆ Hang ┆ Licensed ┆ … ┆ null ┆ GMT+07:00 ┆ 106.93 ┆ -6.22 │ │ ┆ 625 ┆ Nadim ┆ ┆ ┆ ┆ Asia/Jaka ┆ ┆ │ │ ┆ ┆ Intl ┆ ┆ ┆ ┆ rta ┆ ┆ │ │ ┆ ┆ Airport - ┆ ┆ ┆ ┆ ┆ ┆ │ │ ┆ ┆ BTH ┆ ┆ ┆ ┆ ┆ ┆ │ │ Starbucks ┆ 7272-763 ┆ The Barri ┆ Company ┆ … ┆ (401) ┆ GMT-05:00 ┆ -71.3 ┆ 41.74 │ │ ┆ ┆ ngton ┆ Owned ┆ ┆ 247-4687 ┆ America/N ┆ ┆ │ │ ┆ ┆ Shopping ┆ ┆ ┆ ┆ ew_York ┆ ┆ │ │ ┆ ┆ Center ┆ ┆ ┆ ┆ ┆ ┆ │ │ Starbucks ┆ 7355-1046 ┆ Garden ┆ Company ┆ … ┆ (401) ┆ GMT-05:00 ┆ -71.46 ┆ 41.76 │ │ ┆ ┆ City-Cran ┆ Owned ┆ ┆ 464-8235 ┆ America/N ┆ ┆ │ │ ┆ ┆ ston ┆ ┆ ┆ ┆ ew_York ┆ ┆ │ │ Starbucks ┆ 7678-3784 ┆ East Gree ┆ Company ┆ … ┆ 401-885-1 ┆ GMT-05:00 ┆ -71.45 ┆ 41.66 │ │ ┆ 0 ┆ nwich/Mai ┆ Owned ┆ ┆ 291 ┆ America/N ┆ ┆ │ │ ┆ ┆ n St ┆ ┆ ┆ ┆ ew_York ┆ ┆ │ │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │ │ Starbucks ┆ 15660-156 ┆ Macy's ┆ Licensed ┆ … ┆ 401-468-3 ┆ GMT-05:00 ┆ -71.48 ┆ 41.72 │ │ ┆ 764 ┆ Warwick ┆ ┆ ┆ 778 ┆ America/N ┆ ┆ │ │ ┆ ┆ Mall #39 ┆ ┆ ┆ ┆ ew_York ┆ ┆ │ │ Starbucks ┆ 7430-942 ┆ Warwick-B ┆ Company ┆ … ┆ 401-823-9 ┆ GMT-05:00 ┆ -71.5 ┆ 41.7 │ │ ┆ ┆ aldhill ┆ Owned ┆ ┆ 645 ┆ America/N ┆ ┆ │ │ ┆ ┆ Rd. ┆ ┆ ┆ ┆ ew_York ┆ ┆ │ │ Starbucks ┆ 76938-129 ┆ Target ┆ Licensed ┆ … ┆ (401) ┆ GMT-05:00 ┆ -71.48 ┆ 41.72 │ │ ┆ 827 ┆ Warwick ┆ ┆ ┆ 244-1972 ┆ America/N ┆ ┆ │ │ ┆ ┆ North ┆ ┆ ┆ ┆ ew_York ┆ ┆ │ │ ┆ ┆ T-2430 ┆ ┆ ┆ ┆ ┆ ┆ │ │ Starbucks ┆ 78039-112 ┆ PVD Main ┆ Licensed ┆ … ┆ 401-732-5 ┆ GMT-05:00 ┆ -71.44 ┆ 41.73 │ │ ┆ 125 ┆ Food ┆ ┆ ┆ 040 ┆ America/N ┆ ┆ │ │ ┆ ┆ Court ┆ ┆ ┆ ┆ ew_York ┆ ┆ │ │ Starbucks ┆ 78027-112 ┆ PVD ┆ Licensed ┆ … ┆ 401-732-5 ┆ GMT-05:00 ┆ -71.44 ┆ 41.73 │ │ ┆ 124 ┆ Baggage ┆ ┆ ┆ 140 ┆ America/N ┆ ┆ │ │ ┆ ┆ Claim ┆ ┆ ┆ ┆ ew_York ┆ ┆ │ └───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘ We have 29 outlets attributed to RI (Rhode Island) in the USA, let's see how they plot on a map, there was no altair map at a large enough scale so we imported a map which happens to be a GeoJson type. Hover over the points. .. altair-plot:: import altair as alt import polars as pl q0 = ( pl.scan_csv('https://raw.githubusercontent.com/sunny2309/datasets/refs/heads/master/starbucks_store_locations.csv') ) q = q0.filter( (pl.col('State/Province') == 'RI') ) df = q.collect() url_geojson = 'https://raw.githubusercontent.com/khalid-alali/geojson/refs/heads/master/ri_rhode_island_zip_codes_geo.min.json' data_geojson_remote = alt.Data(url=url_geojson, format=alt.DataFormat(property='features',type='json')) # chart object map = alt.Chart(data_geojson_remote, width=700, height=500).mark_geoshape( stroke='white', strokeWidth=2).encode( color=alt.value('#ccc'), ).project( type='identity', reflectY=True) Rhode = alt.Chart(df).mark_circle( color="tomato" ).encode( longitude="Longitude:Q", latitude="Latitude:Q", tooltip = ["Store Number", "Store Name", "State/Province", "City", "Country", "Store Name","Longitude", "Latitude"] ) map + Rhode You should see a blank map with three points, the two on the right are obviously from somewhere else - the coordinates are nowhere near Rhode Island - they both come from Batam in Indonesia in the Riau Islands - hence the RI State. Because there were do few points it was perfectly alright to use all the columns - hence the large tooltip. The third is at the Rhode Island TF Green Int'l Airport, which should be alright. We can restrict the Rhode Island selection by adding the US as the country. .. altair-plot:: q = q0.filter( (pl.col('State/Province') == 'RI') & (pl.col('Country') == 'US')) df = q.collect() Rhode = alt.Chart(df).mark_circle( color="tomato" ).encode( longitude="Longitude:Q", latitude="Latitude:Q", tooltip = ["Store Number", "Store Name", "State/Province", "City", "Country", "Store Name","Longitude", "Latitude"] ) map + Rhode This appears to be what we want, there are now 27 outlets on our map, there is a group at Providence, the state capital, all the rest are single points. Our baggage reclaim is some way away from the main airport buildings. Otherwise the positions seem alright but could be checked more thoroughly if required. Altair can import external files to generate a map, but the type has to be determined, also if the map has multiple layers the layer or feature has to be declared. Belgium Starbucks ================= Let's see what happens with a country, such as Belgium, mostly because there is a TopoJSON map in Github which loads. Rhode Island was mapped by geoJson and required inverting together with type *identity* to straighten it up. The country code for Belgium is *BE*:: >> import polars as pl q0 = ( pl.scan_csv('https://raw.githubusercontent.com/sunny2309/datasets/refs/heads/master/starbucks_store_locations.csv' )) q = q0.filter( ( (pl.col('Country') == 'BE'))) df = q.collect() df shape: (19, 13) ┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐ │ Brand ┆ Store ┆ Store ┆ Ownership ┆ … ┆ Phone ┆ Timezone ┆ Longitude ┆ Latitude │ │ --- ┆ Number ┆ Name ┆ Type ┆ ┆ Number ┆ --- ┆ --- ┆ --- │ │ str ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ str ┆ f64 ┆ f64 │ │ ┆ str ┆ str ┆ str ┆ ┆ str ┆ ┆ ┆ │ ╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡ │ Starbucks ┆ 48462-260 ┆ Brussels ┆ Licensed ┆ … ┆ null ┆ GMT+1:00 ┆ 4.38 ┆ 50.84 │ │ ┆ 556 ┆ Schuman ┆ ┆ ┆ ┆ Europe/Br ┆ ┆ │ │ ┆ ┆ Metro ┆ ┆ ┆ ┆ ussels ┆ ┆ │ │ ┆ ┆ Station ┆ ┆ ┆ ┆ ┆ ┆ │ │ Starbucks ┆ 26401-240 ┆ Brussels ┆ Licensed ┆ … ┆ null ┆ GMT+1:00 ┆ 4.35 ┆ 50.83 │ │ ┆ 667 ┆ Louise ┆ ┆ ┆ ┆ Europe/Br ┆ ┆ │ │ ┆ ┆ Metro ┆ ┆ ┆ ┆ ussels ┆ ┆ │ │ ┆ ┆ Station ┆ ┆ ┆ ┆ ┆ ┆ │ │ Starbucks ┆ 26399-240 ┆ Brussels ┆ Licensed ┆ … ┆ null ┆ GMT+1:00 ┆ 4.33 ┆ 50.83 │ │ ┆ 668 ┆ - Gare du ┆ ┆ ┆ ┆ Europe/Br ┆ ┆ │ │ ┆ ┆ Midi (M) ┆ ┆ ┆ ┆ ussels ┆ ┆ │ │ Starbucks ┆ 48459-260 ┆ Brussels ┆ Licensed ┆ … ┆ null ┆ GMT+1:00 ┆ 4.41 ┆ 50.84 │ │ ┆ 557 ┆ Metro ┆ ┆ ┆ ┆ Europe/Br ┆ ┆ │ │ ┆ ┆ Station ┆ ┆ ┆ ┆ ussels ┆ ┆ │ │ Starbucks ┆ 18497-185 ┆ Antwerp ┆ Licensed ┆ … ┆ null ┆ GMT+1:00 ┆ 4.4 ┆ 51.22 │ │ ┆ 504 ┆ Groenplaa ┆ ┆ ┆ ┆ Europe/Br ┆ ┆ │ │ ┆ ┆ ts ┆ ┆ ┆ ┆ ussels ┆ ┆ │ │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │ │ Starbucks ┆ 32639-137 ┆ Brussels ┆ Licensed ┆ … ┆ +32 2 719 ┆ GMT+1:00 ┆ 4.49 ┆ 50.9 │ │ ┆ 287 ┆ Airport ┆ ┆ ┆ 77 00 ┆ Europe/Br ┆ ┆ │ │ ┆ ┆ Depart. ┆ ┆ ┆ ┆ ussels ┆ ┆ │ │ ┆ ┆ Hall ┆ ┆ ┆ ┆ ┆ ┆ │ │ Starbucks ┆ 1338-2459 ┆ Brussels ┆ Licensed ┆ … ┆ null ┆ GMT+1:00 ┆ 4.48 ┆ 50.87 │ │ ┆ 53 ┆ Airport ┆ ┆ ┆ ┆ Europe/Br ┆ ┆ │ │ ┆ ┆ Concourse ┆ ┆ ┆ ┆ ussels ┆ ┆ │ │ ┆ ┆ A ┆ ┆ ┆ ┆ ┆ ┆ │ │ Starbucks ┆ 6030-1372 ┆ Brussels ┆ Licensed ┆ … ┆ null ┆ GMT+1:00 ┆ 4.48 ┆ 50.9 │ │ ┆ 90 ┆ Airport ┆ ┆ ┆ ┆ Europe/Br ┆ ┆ │ │ ┆ ┆ Concourse ┆ ┆ ┆ ┆ ussels ┆ ┆ │ │ ┆ ┆ B ┆ ┆ ┆ ┆ ┆ ┆ │ │ Starbucks ┆ 28830-248 ┆ Liege Gui ┆ Licensed ┆ … ┆ null ┆ GMT+1:00 ┆ 5.57 ┆ 50.63 │ │ ┆ 724 ┆ llemins ┆ ┆ ┆ ┆ Europe/Br ┆ ┆ │ │ ┆ ┆ Station ┆ ┆ ┆ ┆ ussels ┆ ┆ │ │ Starbucks ┆ 19372-201 ┆ Gare de ┆ Licensed ┆ … ┆ null ┆ GMT+1:00 ┆ 4.86 ┆ 50.47 │ │ ┆ 645 ┆ Namur ┆ ┆ ┆ ┆ Europe/Br ┆ ┆ │ │ ┆ ┆ ┆ ┆ ┆ ┆ ussels ┆ ┆ │ └───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘ We have only 19 outlets. The source Bart Mesuere kindly showed that the feature we need is *municipalities*, he also stated not to use the properties part but the geo-part is good. .. altair-plot:: import altair as alt url = "https://raw.githubusercontent.com/bmesuere/belgium-topojson/refs/heads/master/belgium.json" source = alt.topo_feature(url, "municipalities") be = alt.Chart(source, width=700, height=500).mark_geoshape().encode() be It seems to load without any great fuss, we are leaving the map as just the country plot without any provincial boundaries. .. altair-plot:: import polars as pl q0 = ( pl.scan_csv('https://raw.githubusercontent.com/sunny2309/datasets/refs/heads/master/starbucks_store_locations.csv' )) q = q0.filter( ( (pl.col('Country') == 'BE'))) df = q.collect() star = alt.Chart(df).mark_circle( color="darkorange" ).encode( longitude="Longitude:Q", latitude="Latitude:Q", tooltip = ["State/Province", "City", "Country", "Store Name"] ) be + star The plot should show a few single outlets at the larger towns and a small crowd at Brussels. No outlyers were apparent. It seems when working with country codes we are less likely to have outlets from other lands sharing the same code. Provided we can find a TopoJSON that plots it is easier working with this format when plotting. That was rather tame - more interesting would be to look at one of the outlyers found on the world map when using *State/Province*.