Streams API - benefits?

Are there any speed benefits to using the Streams API for uploading data?

 

For example, if you wanted to upload a billion rows as quickly as possible.

 

I've written a Python script to upload data via the Streams API, and it doesn't seem much faster than Workbench. I've even tried writing the script in an asynchronous manner, hoping that the CSV files / "parts" would upload concurrently and much faster.
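
For context, here's a simplified sketch of the kind of concurrent part upload I'm attempting (the stream ID, execution ID, token, and CSV chunks below are placeholders, and the endpoint paths are as I understand them from the Streams API reference):

```python
import concurrent.futures

import requests

API = "https://api.domo.com/v1/streams"
TOKEN = "..."       # OAuth access token (placeholder)
STREAM_ID = 123     # placeholder
EXECUTION_ID = 456  # placeholder, returned by POST /v1/streams/{id}/executions
HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "text/csv",
}

def upload_part(part_num, csv_text):
    """PUT one CSV chunk as part `part_num` of the open stream execution."""
    url = f"{API}/{STREAM_ID}/executions/{EXECUTION_ID}/part/{part_num}"
    resp = requests.put(url, data=csv_text.encode("utf-8"), headers=HEADERS)
    resp.raise_for_status()

# One CSV string per part (placeholder data)
csv_chunks = ["col1,col2\n1,2\n", "col1,col2\n3,4\n"]

# Upload the parts concurrently rather than one at a time
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(upload_part, i + 1, chunk)
               for i, chunk in enumerate(csv_chunks)]
    for fut in concurrent.futures.as_completed(futures):
        fut.result()  # surface any upload errors

# Commit the execution so Domo processes the uploaded parts
requests.put(f"{API}/{STREAM_ID}/executions/{EXECUTION_ID}/commit",
             headers={"Authorization": f"Bearer {TOKEN}"}).raise_for_status()
```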

 

After all this, the upload seems to run at about the same speed as Workbench.

 

Can anyone clarify the use cases for this API? Is it faster?

Thanks,

 

Seth

Comments

  • I too am trying to upload rows as quickly as possible. I'm not an expert, but it seems that if the bottleneck is the upload bandwidth of one machine/network, you could use the Streams API to distribute the uploading of parts across multiple machines/networks.  Do you agree?  I think Workbench only operates from one machine.

     

  • After further research and testing, I realized that the "slowness" of the Streams API that I was seeing was really just an error in my script.

     

    Once I correctly executed the asynchronous upload using the Streams API, I was able to upload a CSV that was 6.2 million rows and around 130 columns wide in 3.5 minutes.

  • Nice work. Did you try gzip compression, or find any advantages to using it? Or is that just uncompressed CSV?

  • That was actually uncompressed. If I had gzipped it, I'm sure the speed would have increased, probably to somewhere around 1 minute versus the 3.5 minutes I was seeing. We just tested a similar dataset and successfully uploaded a gzipped 6 million row CSV in 60-90 seconds.

     

    On Domo Workbench, the send portion of the upload (which is really the only part of the upload that the Streams API speeds up) took 36 minutes for that same file. That is a significant increase in speed. However, you still have to split and gzip the file (which Workbench has to do as well), and that takes a significant amount of time. I am fairly certain that Workbench does NOT read asynchronously. If you can find a way to asynchronously split and gzip the CSV (or the results of a SQL query), that would be a huge performance boost as well.
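
    Something along these lines is what I have in mind for the split-and-gzip step, a rough sketch only (the source path, output directory, and rows-per-part are placeholders): read the CSV once and gzip each row chunk in a process pool so the compression happens in parallel.

    ```python
    import concurrent.futures
    import gzip
    import itertools
    import os

    SOURCE_CSV = "big_export.csv"  # placeholder: the large source CSV
    OUT_DIR = "parts"              # placeholder: where the gzipped parts go
    ROWS_PER_PART = 100_000        # roughly 100K rows per part


    def write_gzipped_part(part_num, header, rows):
        """Gzip one chunk of rows (header repeated in each part) to its own file."""
        path = os.path.join(OUT_DIR, f"part_{part_num}.csv.gz")
        with gzip.open(path, "wt", newline="") as gz:
            gz.write(header)
            gz.writelines(rows)
        return path


    def chunks(lines, size):
        """Yield successive lists of `size` lines from an open file."""
        while True:
            block = list(itertools.islice(lines, size))
            if not block:
                return
            yield block


    def main():
        os.makedirs(OUT_DIR, exist_ok=True)
        with open(SOURCE_CSV, newline="") as f, \
                concurrent.futures.ProcessPoolExecutor() as pool:
            header = next(f)  # keep the header line for every part
            futures = [pool.submit(write_gzipped_part, i + 1, header, block)
                       for i, block in enumerate(chunks(f, ROWS_PER_PART))]
            part_files = [fut.result() for fut in futures]
        print(f"Wrote {len(part_files)} gzipped parts to {OUT_DIR}/")


    if __name__ == "__main__":
        main()
    ```

    The read itself is still sequential; the parallel win here is the gzip work, which is usually the expensive part.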

  • Thanks @Medinacus, your comments are really helpful. I frequently upload a (new) 400 million row (and growing) dataset with 200 columns, and I'm always looking for ways to save time. Have you found a part size that works well? I think the latest documentation recommends 20-100 MB (compressed size) per part, but I'm curious if you have any input on that. I'm trying to optimize not only the upload time, but also the time it takes Domo to "process" the data after I commit the upload. (Sorry if this should actually be a new question in the forum.)

  • Glad you find them at least somewhat helpful, @robsmith!

     

    We actually haven't experimented too much with part size. In our tests, our split files were 100K rows each, which turned out to be gzipped files of about 10-12 MB. Maybe too small? However, they still uploaded very quickly, as mentioned above. A rough way to estimate rows per part for a target compressed size is sketched below.

     

    How long has it been taking you to upload your 400M row file? How long does the split-and-gzip step take, and how long does sending the parts take?
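
    If we do experiment with part size, something like this quick estimate is probably where we'd start (illustrative only; the source file, sample size, and target are placeholders): gzip a sample of rows, work out the compressed bytes per row, and scale up toward the 20-100 MB range mentioned above.

    ```python
    import gzip
    import itertools

    SOURCE_CSV = "big_export.csv"  # placeholder source file
    SAMPLE_ROWS = 100_000          # rows used for the estimate
    TARGET_PART_MB = 60            # aim near the middle of the 20-100 MB range

    with open(SOURCE_CSV, newline="") as f:
        next(f)  # skip the header
        sample = list(itertools.islice(f, SAMPLE_ROWS))

    # Compress the sample once to estimate gzipped bytes per row
    compressed_size = len(gzip.compress("".join(sample).encode("utf-8")))
    bytes_per_row = compressed_size / len(sample)

    rows_per_part = int(TARGET_PART_MB * 1024 * 1024 / bytes_per_row)
    print(f"~{bytes_per_row:.1f} compressed bytes/row -> "
          f"about {rows_per_part:,} rows per part for ~{TARGET_PART_MB} MB parts")
    ```

    The ratio from a small sample is only approximate, but it should get the parts into the right ballpark.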

  • I upload the dataset in about 3.5 hours, but one day I'll further distribute the uploader to see if I can improve that.  But after committing the execution, Domo's "processing" stage takes 6+ hours.  I'm hoping that by increasing my part sizes to ~100MB gzipped I can minimize the Domo processing time.

     

    I'm not pulling from a database table that I have to split and gzip. Instead, as I collect the data, I store it in gzipped CSV files intended for Domo.

     

  • Ahh, gotcha.

     

    Thanks for the info. Very helpful as we test and try to optimize our own script.  

     

    Let me know how the 100 MB file sizes work out!

  • Hey - I was wondering how you are uploading actual files to Domo through the Streams API?

     

    I know you can upload file-like objects (StringIO), but can I actually upload a real file? I modified pydomo a bit on my server to allow for streaming uploads (with open(csvfile) as f:), but I am still not able to upload a folder of files without first reading each file fully into memory.
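
    For what it's worth, the pattern I'm trying for looks roughly like the sketch below. It bypasses pydomo and calls the Streams part endpoint directly, because requests will stream an open binary file as the request body instead of loading it into memory. The stream ID, execution ID, token, and directory of pre-gzipped parts are placeholders, and (as I read the docs) gzipped parts are sent with a Content-Encoding: gzip header.

    ```python
    import os

    import requests

    API = "https://api.domo.com/v1/streams"
    TOKEN = "..."       # OAuth access token (placeholder)
    STREAM_ID = 123     # placeholder
    EXECUTION_ID = 456  # placeholder, from an already-created execution
    PART_DIR = "parts"  # placeholder: folder of pre-gzipped CSV parts

    headers = {
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "text/csv",
        "Content-Encoding": "gzip",  # the parts on disk are already gzipped
    }

    for part_num, name in enumerate(sorted(os.listdir(PART_DIR)), start=1):
        path = os.path.join(PART_DIR, name)
        url = f"{API}/{STREAM_ID}/executions/{EXECUTION_ID}/part/{part_num}"
        # Passing the open file object as `data` lets requests stream it from
        # disk rather than reading the whole file into memory first.
        with open(path, "rb") as f:
            requests.put(url, data=f, headers=headers).raise_for_status()
    ```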

This discussion has been closed.